🚀 AutoXLA - Accelerating Large Models on TPU

AutoXLA is an experimental library that automates the distribution, optimization, and quantization of large language models for TPUs using PyTorch/XLA. It extends the Hugging Face Transformers interface with TPU-aware features such as automatic sharding, custom attention kernels, and quantization-aware loading, making large-scale deployment and training both simpler and faster.

With quantization and Splash Attention kernels, AutoXLA achieves up to 4× speedups over standard Flash Attention implementations, significantly improving throughput for both inference and training workloads. Whether you're experimenting with distributed setups (FSDP, 2D, or 3D sharding) or optimizing memory via LanguageModelQuantizer, AutoXLA is built to make scaling LLMs on TPU seamless.

⚠️ Note: This is an experimental repository. Expect rough edges! Please report bugs or unexpected behavior through GitHub issues.

🔗 GitHub Repository: https://github.com/Locutusque/AutoXLA
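As a rough illustration of the workflow, here is a hypothetical sketch of quantization-aware loading. Since AutoXLA extends the Transformers interface, loading plausibly mirrors `from_pretrained`-style calls; every name below except `LanguageModelQuantizer` (mentioned above) is an assumption, so check the GitHub repository for the actual API.

```python
# Hypothetical sketch only -- AutoXLA is experimental and its real API may
# differ. Names other than LanguageModelQuantizer are illustrative guesses.
from autoxla import LanguageModelQuantizer  # import path is an assumption

quantizer = LanguageModelQuantizer(
    "meta-llama/Llama-2-7b-hf",  # any Hugging Face causal LM checkpoint
    bits=8,                      # assumed quantization-precision knob
)

# Assumed entry point: load the quantized model with TPU-aware sharding
# (FSDP here; the README also mentions 2D and 3D sharding) and the
# Splash Attention kernel enabled.
model = quantizer.load(
    sharding="fsdp",
    attention_kernel="splash",
)
```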
These samples were created using reservoir sampling, an algorithm that draws a statistically unbiased random sample from a massive source dataset in a single pass. This means results you get at the 1B token scale are representative of how these datasets behave at 100B+ token scales, letting you iterate quickly without the computational overhead of full-scale runs.
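For reference, this is the classic single-pass form of the technique (Algorithm R): keep the first k items, then replace a random slot with probability k/(i+1) for the i-th item, which leaves every item in the stream equally likely to end up in the sample.

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Uniform random sample of k items from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Replace a random slot with probability k / (i + 1), keeping
            # every item seen so far equally likely to remain sampled.
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

# Example: 5 uniformly sampled elements from a million-item stream.
print(reservoir_sample(range(1_000_000), k=5))
```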
The collection includes:
- finePDFs-1B: High-quality textbook-style educational content
- DCLM-baseline-1B: Filtered, diverse web content
- FineWeb-Edu-1B: Curated educational web resources
We used these exact samples to run 50+ systematic experiments on dataset mixing strategies, ultimately discovering that a 50-30-20 mixture of finePDFs + DCLM-baseline + FineWeb-Edu achieves 90%+ of GPT-2's performance with just 1/10th the training data.
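A minimal sketch of reproducing that 50-30-20 mixture with the Hugging Face `datasets` library is shown below. The repo IDs (`your-org/...`) and the `text` column name are placeholders, not the actual Hub paths for these samples; substitute the real ones.

```python
from datasets import load_dataset, interleave_datasets

# Placeholder repo IDs -- replace with the actual Hub paths for these samples.
pdfs = load_dataset("your-org/finePDFs-1B", split="train", streaming=True)
dclm = load_dataset("your-org/DCLM-baseline-1B", split="train", streaming=True)
edu = load_dataset("your-org/FineWeb-Edu-1B", split="train", streaming=True)

# 50-30-20 mixture: each example is drawn from a source with the given
# probability, matching the mixing ratio described above.
mix = interleave_datasets(
    [pdfs, dclm, edu],
    probabilities=[0.5, 0.3, 0.2],
    seed=42,
    stopping_strategy="all_exhausted",
)

# Peek at a few mixed examples ("text" column name is an assumption).
for example in mix.take(3):
    print(example["text"][:80])
```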
Whether you're researching optimal data mixtures, testing curriculum learning strategies, or just want to quickly prototype a pretraining run, these samples give you a solid foundation to start experimenting immediately.