Collections including paper arxiv:2203.15556

- Attention Is All You Need
  Paper • 1706.03762 • Published • 98
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
  Paper • 1810.04805 • Published • 23
- RoBERTa: A Robustly Optimized BERT Pretraining Approach
  Paper • 1907.11692 • Published • 9
- DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
  Paper • 1910.01108 • Published • 21
---

- Will we run out of data? An analysis of the limits of scaling datasets in Machine Learning
  Paper • 2211.04325 • Published • 1
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
  Paper • 1810.04805 • Published • 23
- On the Opportunities and Risks of Foundation Models
  Paper • 2108.07258 • Published • 1
- Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ NLP Tasks
  Paper • 2204.07705 • Published • 2
---

- Scaling Laws for Neural Language Models
  Paper • 2001.08361 • Published • 9
- Scaling Laws for Autoregressive Generative Modeling
  Paper • 2010.14701 • Published • 1
- Training Compute-Optimal Large Language Models
  Paper • 2203.15556 • Published • 11
- A Survey on Data Selection for Language Models
  Paper • 2402.16827 • Published • 4
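Since this page and several of its collections center on arxiv:2203.15556 (Chinchilla), a quick reference may help: the paper fits pretraining loss as a function of parameter count N and training tokens D, and minimizing that fit under a training budget of roughly C = 6ND FLOPs yields the compute-optimal split. The constants below are the fits reported in the paper; its better-known heuristic of roughly 20 training tokens per parameter comes from its other two estimation approaches.

```latex
% Parametric loss fit from arXiv:2203.15556 ("Approach 3").
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}},
\quad E \approx 1.69,\ A \approx 406.4,\ B \approx 410.7,\
\alpha \approx 0.34,\ \beta \approx 0.28.

% Minimizing L(N, D) subject to C = 6ND gives power-law optima:
N_{\mathrm{opt}} \propto C^{\beta/(\alpha+\beta)}, \qquad
D_{\mathrm{opt}} \propto C^{\alpha/(\alpha+\beta)}.
```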
---

- SciLitLLM: How to Adapt LLMs for Scientific Literature Understanding
  Paper • 2408.15545 • Published • 38
- Controllable Text Generation for Large Language Models: A Survey
  Paper • 2408.12599 • Published • 65
- To Code, or Not To Code? Exploring Impact of Code in Pre-training
  Paper • 2408.10914 • Published • 44
- Automated Design of Agentic Systems
  Paper • 2408.08435 • Published • 40
---

- Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters
  Paper • 2408.03314 • Published • 63
- Training Compute-Optimal Large Language Models
  Paper • 2203.15556 • Published • 11
- Scaling Laws for Precision
  Paper • 2411.04330 • Published • 8
- Transcending Scaling Laws with 0.1% Extra Compute
  Paper • 2210.11399 • Published
---

- Scaling Autoregressive Models for Content-Rich Text-to-Image Generation
  Paper • 2206.10789 • Published • 4
- Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws
  Paper • 2401.00448 • Published • 30
- Training Compute-Optimal Large Language Models
  Paper • 2203.15556 • Published • 11
- Scaling Laws for Neural Language Models
  Paper • 2001.08361 • Published • 9
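Two entries in the collection above, 2203.15556 and 2401.00448, both ask how to split a fixed FLOP budget between model size and training data. Below is a minimal sketch of that trade-off under the parametric fit shown earlier; the budget, grid bounds, and inference-token count are illustrative assumptions, and the nonzero-inference case is only a rough stand-in for the accounting in "Beyond Chinchilla-Optimal", not that paper's exact formulation.

```python
# Sketch of the compute-optimal allocation problem from arXiv:2203.15556,
# with an optional inference charge in the spirit of arXiv:2401.00448.
# Constants are the fits reported in 2203.15556; everything else
# (budget, grid, inference token count) is an illustrative assumption.

E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA = 0.34, 0.28

def loss(n: float, d: float) -> float:
    """Parametric loss estimate for n parameters trained on d tokens."""
    return E + A / n**ALPHA + B / d**BETA

def best_allocation(flop_budget: float, inference_tokens: float = 0.0):
    """Grid-search model size n and return the loss-minimizing (loss, n, d).

    Training costs ~6*n*d FLOPs; each inference token costs ~2*n FLOPs.
    With inference_tokens = 0 this is the plain Chinchilla objective.
    """
    best = None
    for i in range(1, 2000):
        n = 10 ** (6 + 7 * i / 2000)          # 1e6 .. 1e13 params, log-spaced
        train_flops = flop_budget - 2 * n * inference_tokens
        if train_flops <= 0:                   # inference alone exhausts budget
            break
        d = train_flops / (6 * n)              # tokens affordable at this size
        cand = (loss(n, d), n, d)
        best = cand if best is None or cand < best else best
    return best

fit, n, d = best_allocation(1e23)
print(f"train-only optimum: N≈{n:.2e}, D≈{d:.2e}, loss≈{fit:.3f}")
fit, n, d = best_allocation(1e23, inference_tokens=1e12)
print(f"with inference demand: N≈{n:.2e}, D≈{d:.2e}, loss≈{fit:.3f}")
```

Charging 2N FLOPs per generated token makes larger models strictly more expensive to serve, so the inference-aware optimum shifts toward a smaller N trained on more tokens, which is the qualitative conclusion of 2401.00448.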
---

- Qwen2.5 Technical Report
  Paper • 2412.15115 • Published • 376
- Qwen2.5-Coder Technical Report
  Paper • 2409.12186 • Published • 151
- Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement
  Paper • 2409.12122 • Published • 4
- Qwen2.5-VL Technical Report
  Paper • 2502.13923 • Published • 211
---

- Self-Play Preference Optimization for Language Model Alignment
  Paper • 2405.00675 • Published • 27
- FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
  Paper • 2205.14135 • Published • 15
- Attention Is All You Need
  Paper • 1706.03762 • Published • 98
- FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
  Paper • 2307.08691 • Published • 9
---

- SELF: Language-Driven Self-Evolution for Large Language Model
  Paper • 2310.00533 • Published • 2
- GrowLength: Accelerating LLMs Pretraining by Progressively Growing Training Length
  Paper • 2310.00576 • Published • 2
- A Pretrainer's Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity
  Paper • 2305.13169 • Published • 3
- Transformers Can Achieve Length Generalization But Not Robustly
  Paper • 2402.09371 • Published • 15