@AmanPriyanshu on Hugging Face: "Stratified LLM Subsets: Balanced Training Data at 100K-1M Scale Released…"

Stratified LLM Subsets: Balanced Training Data at 100K-1M Scale

Released three training datasets that use embedding-based k-means clustering to build balanced subsets of large-scale corpora.

Interactive cluster visualization:
https://amanpriyanshu.github.io/Stratified-LLM-Subsets-100K-1M-Scale/

Pre-Training (FineWeb-Edu + Proof-Pile-2)
AmanPriyanshu/stratified-kmeans-diverse-pretraining-100K-1M

Instruction-Following (Tulu-3 + Orca AgentInstruct)
AmanPriyanshu/stratified-kmeans-diverse-instruction-following-100K-1M

Reasoning (Llama-Nemotron with sqrt balancing)
AmanPriyanshu/stratified-kmeans-diverse-reasoning-100K-1M
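
The post names "sqrt balancing" for the reasoning subset without giving a formula. One common reading, sketched here as an assumption rather than the released pipeline, is to give each category a share of the sample budget proportional to the square root of its size, which shrinks the lead of the largest categories. The function name `sqrt_balanced_quotas` and the category names are hypothetical.

```python
import math

def sqrt_balanced_quotas(category_counts, total):
    """Split `total` samples across categories in proportion to sqrt(count).

    Assumption: this is one plausible form of "sqrt balancing"; the exact
    rule used for the released datasets is not stated in the post.
    """
    weights = {c: math.sqrt(n) for c, n in category_counts.items()}
    norm = sum(weights.values())
    # Never ask for more samples than a category actually has.
    return {
        c: min(category_counts[c], int(total * w / norm))
        for c, w in weights.items()
    }

# Hypothetical category sizes, chosen so the arithmetic is easy to follow.
counts = {"math": 10000, "code": 2500, "chat": 100}
print(sqrt_balanced_quotas(counts, total=1600))
# {'math': 1000, 'code': 500, 'chat': 100}
```

With proportional sampling, "math" would take about 79% of the budget in this toy case; with square-root weights it takes 62.5%.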

Methodology: k-means clustering (100 iterations) on Snowflake Arctic-embed-xs embeddings, with cluster centroids selected as representatives. Balancing is applied to imbalanced datasets to reduce category dominance.

Available at 50k, 100k, 250k, 500k, and 1M scales.