GLiNER-PII Collection PII detection models developed in collaboration with Wordcab • 5 items • Updated Sep 24 • 21
NeMo Curator - Classifier Models Collection Classifier models that can be used in NeMo Curator for labelling/filtering datasets. • 11 items • Updated about 6 hours ago • 24
Tiny Language Model Datasets Collection Collection of Synthetic Datasets that can be used in pretraining of any the Tiny Language Model • 14 items • Updated Sep 21 • 29
GLiNER2: An Efficient Multi-Task Information Extraction System with Schema-Driven Interface Paper • 2507.18546 • Published Jul 24 • 25
SauerkrautLM-Multilingual-(Reason)-ColBERT Collection SauerkrautLM ColBERT is a suite of Late-Interaction retrieval models built with PyLate’s ColBERT architecture and tuned for seven European languages. • 7 items • Updated Aug 3 • 18
view article Article Introducing Trackio: A Lightweight Experiment Tracking Library from Hugging Face Jul 29 • 198
EuroBERT Collection Scaling Multilingual Encoders for European Languages • 4 items • Updated Mar 10 • 13
OpenReasoning-Nemotron Collection Collection of models for OpenReasoning-Nemotron which are trained on 5M reasoning traces for Math, Code and Science. • 6 items • Updated about 6 hours ago • 44
view article Article FineWeb-C: A Community-Driven Dataset for Educational Quality Annotations in 122 Languages Jul 8 • 32
ModernBERT Collection Bringing BERT into modernity via both architecture changes and scaling • 3 items • Updated Dec 19, 2024 • 151
MMDocIR: Benchmarking Multi-Modal Retrieval for Long Documents Paper • 2501.08828 • Published Jan 15 • 31
Scaling Synthetic Data Creation with 1,000,000,000 Personas Paper • 2406.20094 • Published Jun 28, 2024 • 104
view article Article How to build a custom text classifier without days of human labeling Oct 17, 2024 • 56
Unifying Multimodal Retrieval via Document Screenshot Embedding Paper • 2406.11251 • Published Jun 17, 2024 • 10
view article Article Docmatix - a huge dataset for Document Visual Question Answering Jul 18, 2024 • 78