On the SDEs and Scaling Rules for Adaptive Gradient Algorithms Paper • 2205.10287 • Published May 20, 2022
Generalizing from SIMPLE to HARD Visual Reasoning: Can We Mitigate Modality Imbalance in VLMs? Paper • 2501.02669 • Published Jan 5, 2025 • 1
AdaptMI: Adaptive Skill-based In-context Math Instruction for Small Language Models Paper • 2505.00147 • Published Apr 30, 2025 • 4
Exposing Attention Glitches with Flip-Flop Language Modeling Paper • 2306.00946 • Published Jun 1, 2023 • 2
TinyGSM: achieving >80% on GSM8k with small language models Paper • 2312.09241 • Published Dec 14, 2023 • 40
Understanding Augmentation-based Self-Supervised Representation Learning via RKHS Approximation and Regression Paper • 2306.00788 • Published Jun 1, 2023
Repeat After Me: Transformers are Better than State Space Models at Copying Paper • 2402.01032 • Published Feb 1, 2024 • 24
Task-Specific Skill Localization in Fine-tuned Language Models Paper • 2302.06600 • Published Feb 13, 2023
How Do Transformers Learn Topic Structure: Towards a Mechanistic Understanding Paper • 2303.04245 • Published Mar 7, 2023