Article Building for an Open Future - our new partnership with Google Cloud 9 days ago • 43
Sample-Efficient Language Modeling with Linear Attention and Lightweight Enhancements Paper • 2511.05560 • Published 18 days ago • 1
Pre-training Dataset Samples Collection A collection of pre-training dataset samples of sizes 10M, 100M and 1B tokens. Ideal for use in quick experimentation and ablations. • 19 items • Updated 10 days ago • 13
Article The 1 Billion Token Challenge: Finding the Perfect Pre-training Mix 18 days ago • 41
BabyBabelLM: A Multilingual Benchmark of Developmentally Plausible Training Data Paper • 2510.10159 • Published Oct 11 • 3
Gaperon: A Peppered English-French Generative Language Model Suite Paper • 2510.25771 • Published 23 days ago • 14
gpt-oss-safeguard Collection gpt-oss-safeguard-120b and gpt-oss-safeguard-20b are safety reasoning models built upon gpt-oss • 2 items • Updated 23 days ago • 56
Gaperon Collection Our French-English LLM suite (SFT models are coming soon) • 10 items • Updated 18 days ago • 14
Huxley-Gödel Machine: Human-Level Coding Agent Development by an Approximation of the Optimal Self-Improving Machine Paper • 2510.21614 • Published 28 days ago • 22
Article huggingface_hub v1.0: Five Years of Building the Foundation of Open Machine Learning 26 days ago • 65
SindBERT, the Sailor: Charting the Seas of Turkish NLP Paper • 2510.21364 • Published 28 days ago • 1
The Art of Asking: Multilingual Prompt Optimization for Synthetic Data Paper • 2510.19806 • Published 30 days ago • 1
Mask and You Shall Receive: Optimizing Masked Language Modeling For Pretraining BabyLMs Paper • 2510.20475 • Published 29 days ago • 1