Attention Is All You Need for KV Cache in Diffusion LLMs • arXiv:2510.14973 • Published Oct 2025
SLA: Beyond Sparsity in Diffusion Transformers via Fine-Tunable Sparse-Linear Attention • arXiv:2509.24006 • Published Sep 28, 2025
TruthRL: Incentivizing Truthful LLMs via Reinforcement Learning • arXiv:2509.25760 • Published Sep 30, 2025
The Dragon Hatchling: The Missing Link between the Transformer and Models of the Brain • arXiv:2509.26507 • Published Sep 30, 2025
Set Block Decoding is a Language Model Inference Accelerator • arXiv:2509.04185 • Published Sep 4, 2025
Softpick: No Attention Sink, No Massive Activations with Rectified Softmax • arXiv:2504.20966 • Published Apr 29, 2025
Tokenization Falling Short: The Curse of Tokenization • arXiv:2406.11687 • Published Jun 17, 2024
TroL: Traversal of Layers for Large Language and Vision Models • arXiv:2406.12246 • Published Jun 18, 2024
PowerInfer-2: Fast Large Language Model Inference on a Smartphone • arXiv:2406.06282 • Published Jun 10, 2024
The Prompt Report: A Systematic Survey of Prompting Techniques • arXiv:2406.06608 • Published Jun 6, 2024
Zipper: A Multi-Tower Decoder Architecture for Fusing Modalities • arXiv:2405.18669 • Published May 29, 2024
Chameleon: Mixed-Modal Early-Fusion Foundation Models • arXiv:2405.09818 • Published May 16, 2024
Better & Faster Large Language Models via Multi-token Prediction • arXiv:2404.19737 • Published Apr 30, 2024
Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing • arXiv:2404.12253 • Published Apr 18, 2024