stereoplegic's collection: Dataset pruning/cleaning/dedup
- AlpaGasus: Training A Better Alpaca with Fewer Data (arXiv:2307.08701)
- The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset (arXiv:2303.03915)
- MADLAD-400: A Multilingual And Document-Level Large Audited Dataset (arXiv:2309.04662)
- SlimPajama-DC: Understanding Data Combinations for LLM Training (arXiv:2309.10818)
- When Less is More: Investigating Data Pruning for Pretraining LLMs at Scale (arXiv:2309.04564)
- CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large Language Models in 167 Languages (arXiv:2309.09400)
- The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only (arXiv:2306.01116)
- Self-Alignment with Instruction Backtranslation (arXiv:2308.06259)
- NLP From Scratch Without Large-Scale Pretraining: A Simple and Efficient Framework (arXiv:2111.04130)
- Magicoder: Source Code Is All You Need (arXiv:2312.02120)
- LLM360: Towards Fully Transparent Open-Source LLMs (arXiv:2312.06550)
- Automated Data Curation for Robust Language Model Fine-Tuning (arXiv:2403.12776)
- Perplexed by Perplexity: Perplexity-Based Data Pruning With Small Reference Models (arXiv:2405.20541)
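Several entries in this collection (Perplexed by Perplexity, CoLoR-Filter, When Less is More) score documents with a small reference model and keep only the best-scoring fraction. As a minimal, self-contained sketch of that idea, here a tiny Laplace-smoothed unigram model stands in for the small reference language model; `train_unigram`, `perplexity`, and `prune_by_perplexity` are hypothetical helpers for illustration, not APIs from any of the listed papers.

```python
import math
from collections import Counter

def train_unigram(reference_tokens):
    """Fit a Laplace-smoothed unigram model on a small reference corpus."""
    counts = Counter(reference_tokens)
    total = sum(counts.values())
    vocab = len(counts)
    # Add-one smoothing so unseen tokens still get nonzero probability.
    return lambda tok: (counts[tok] + 1) / (total + vocab + 1)

def perplexity(prob, tokens):
    """Per-token perplexity of a document under the reference model."""
    nll = -sum(math.log(prob(t)) for t in tokens) / len(tokens)
    return math.exp(nll)

def prune_by_perplexity(docs, prob, keep_fraction=0.5):
    """Keep the keep_fraction of documents with the lowest perplexity."""
    scored = sorted(docs, key=lambda d: perplexity(prob, d.split()))
    return scored[:max(1, int(len(scored) * keep_fraction))]

# Toy example: in-domain text scores lower perplexity than gibberish.
reference = "the cat sat on the mat the dog sat".split()
prob = train_unigram(reference)
docs = ["the cat sat on the mat", "zxqv wbrk jjnn qqpl"]
print(prune_by_perplexity(docs, prob, keep_fraction=0.5))
```

Real pipelines use a small pretrained LM rather than a unigram model, and often keep a middle perplexity band rather than simply the lowest scores, but the select-by-reference-model-score structure is the same.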
- CoLoR-Filter: Conditional Loss Reduction Filtering for Targeted Language Model Pre-training (arXiv:2406.10670)
- DataComp-LM: In search of the next generation of training sets for language models (arXiv:2406.11794)
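The other recurring theme of this collection is deduplication, which corpora such as RefinedWeb and SlimPajama apply before training. As a minimal stdlib-only sketch of the exact-dedup step, the snippet below hashes whitespace/case-normalized text and keeps the first occurrence; `normalize` and `exact_dedup` are hypothetical helper names, and production pipelines add fuzzy (e.g. MinHash-based) dedup on top of this.

```python
import hashlib
import re

def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivially different copies hash alike.
    return re.sub(r"\s+", " ", text.lower()).strip()

def exact_dedup(docs):
    """Keep the first occurrence of each normalized document."""
    seen = set()
    kept = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

corpus = [
    "The quick brown fox.",
    "the quick   brown fox.",  # normalized duplicate, dropped
    "A different document.",
]
print(exact_dedup(corpus))
```

Hashing keeps memory proportional to the number of unique documents rather than total text size, which is why even the exact-dedup pass scales to web corpora.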