Collections including paper arxiv:2005.14165

- VESPO: Variational Sequence-Level Soft Policy Optimization for Stable Off-Policy LLM Training
  Paper • 2602.10693 • Published • 220
- Reinforced Attention Learning
  Paper • 2602.04884 • Published • 29
- Learning to Reason in 13 Parameters
  Paper • 2602.04118 • Published • 6
- LoRA-XS: Low-Rank Adaptation with Extremely Small Number of Parameters
  Paper • 2405.17604 • Published • 3

- deepseek-ai/DeepSeek-R1
  Text Generation • 685B • Updated • 2.59M • 13.1k
- Qwen/Qwen2.5-Coder-32B-Instruct
  Text Generation • 33B • Updated • 934k • 2k
- google/gemma-2-27b-it
  Text Generation • 27B • Updated • 284k • 561
- Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
  Paper • 2201.11903 • Published • 15

- Language Models are Few-Shot Learners
  Paper • 2005.14165 • Published • 20
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
  Paper • 1810.04805 • Published • 26
- Attention Is All You Need
  Paper • 1706.03762 • Published • 120
- Lookahead Anchoring: Preserving Character Identity in Audio-Driven Human Animation
  Paper • 2510.23581 • Published • 42

- Language Models are Few-Shot Learners
  Paper • 2005.14165 • Published • 20
- Large Language Models Are Human-Level Prompt Engineers
  Paper • 2211.01910 • Published • 1
- Lost in the Middle: How Language Models Use Long Contexts
  Paper • 2307.03172 • Published • 44
- Large Language Models are Zero-Shot Reasoners
  Paper • 2205.11916 • Published • 3

- Reinforcement Pre-Training
  Paper • 2506.08007 • Published • 265
- A Survey on Latent Reasoning
  Paper • 2507.06203 • Published • 94
- Language Models are Few-Shot Learners
  Paper • 2005.14165 • Published • 20
- Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
  Paper • 1910.10683 • Published • 17

- Language Models are Few-Shot Learners
  Paper • 2005.14165 • Published • 20
- Evaluating Large Language Models Trained on Code
  Paper • 2107.03374 • Published • 10
- Training language models to follow instructions with human feedback
  Paper • 2203.02155 • Published • 24
- GPT-4 Technical Report
  Paper • 2303.08774 • Published • 7

- Attention Is All You Need
  Paper • 1706.03762 • Published • 120
- Language Models are Few-Shot Learners
  Paper • 2005.14165 • Published • 20
- LLaMA: Open and Efficient Foundation Language Models
  Paper • 2302.13971 • Published • 22
- Llama 2: Open Foundation and Fine-Tuned Chat Models
  Paper • 2307.09288 • Published • 251

- Neural Machine Translation by Jointly Learning to Align and Translate
  Paper • 1409.0473 • Published • 7
- Attention Is All You Need
  Paper • 1706.03762 • Published • 120
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
  Paper • 1810.04805 • Published • 26
- Hierarchical Reasoning Model
  Paper • 2506.21734 • Published • 50

- DeBERTa: Decoding-enhanced BERT with Disentangled Attention
  Paper • 2006.03654 • Published • 3
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
  Paper • 1810.04805 • Published • 26
- RoBERTa: A Robustly Optimized BERT Pretraining Approach
  Paper • 1907.11692 • Published • 10
- Language Models are Few-Shot Learners
  Paper • 2005.14165 • Published • 20

- DeepSeek-R1 Thoughtology: Let's <think> about LLM Reasoning
  Paper • 2504.07128 • Published • 87
- Byte Latent Transformer: Patches Scale Better Than Tokens
  Paper • 2412.09871 • Published • 108
- BitNet b1.58 2B4T Technical Report
  Paper • 2504.12285 • Published • 83
- FAST: Efficient Action Tokenization for Vision-Language-Action Models
  Paper • 2501.09747 • Published • 29