The mid-training datasets for KORMo-10B were collected from diverse, publicly available sources.
Long Context Training

English
- princeton-nlp/prolong-data-64K (7.21B tokens, sampling)

Korean
- KORMo-Team/Cosmopedia-ko-synth (0.51B tokens, sampling)
- KORMo-Team/korean-web-collection (1.44B tokens, sampling)
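The per-source token counts above imply a weighted sampling mixture. As a rough illustration only, not the KORMo team's published recipe, the sketch below derives mixing probabilities from those counts and interleaves datasets with the Hugging Face `datasets` library. The toy in-memory stand-ins, the `text`/`source` fields, and the choice of `interleave_datasets` are all assumptions made for the example.

```python
from datasets import Dataset, interleave_datasets

# Token counts quoted above, in billions. Used here only to derive
# illustrative mixing probabilities -- not the official KORMo ratios.
sources = {
    "princeton-nlp/prolong-data-64K": 7.21,
    "KORMo-Team/Cosmopedia-ko-synth": 0.51,
    "KORMo-Team/korean-web-collection": 1.44,
}
total = sum(sources.values())
probabilities = [size / total for size in sources.values()]

# Toy in-memory stand-ins so the sketch runs offline; in practice each
# entry would be the real corpus, e.g. streamed from the Hub (not every
# repo is guaranteed to be directly load_dataset-compatible).
stand_ins = [
    Dataset.from_dict({
        "text": [f"doc-{i}" for i in range(100)],
        "source": [repo] * 100,
    })
    for repo in sources
]

# Draw examples from each source in proportion to its token share.
mixture = interleave_datasets(stand_ins, probabilities=probabilities, seed=42)
print(mixture[:5]["source"])
```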
Reasoning Mid-Training

English
- nvidia/Nemotron-Post-Training-Dataset-v1 (~144.75B tokens, filtering & sampling)
- open-thoughts/OpenThoughts3-1.2M (~5.46B tokens, filtering & sampling)

Korean
- KORMo-Team/NemoPost-ko-translated (~2.83B tokens)
- KORMo-Team/NemoPost-ko-synth (~7.05B tokens)
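The "filtering & sampling" annotations on the English reasoning corpora suggest a filter-then-subsample pass. Below is a minimal sketch of that pattern over a toy stand-in corpus; the length threshold, the sampling rate, and the `text` field are hypothetical, since the actual KORMo criteria are not stated on this page.

```python
import random

from datasets import Dataset

# Hypothetical knobs for illustration only.
MAX_CHARS = 32_000   # filtering: drop over-long examples
SAMPLE_RATE = 0.05   # sampling: keep roughly 5% of survivors
rng = random.Random(0)

# Toy stand-in for a reasoning corpus such as OpenThoughts3-1.2M;
# "text" is an assumed field name, not the real schema.
corpus = Dataset.from_dict(
    {"text": ["short answer", "x" * 40_000, "a medium reasoning trace"] * 1000}
)

subset = (
    corpus
    .filter(lambda ex: 0 < len(ex["text"]) <= MAX_CHARS)  # filtering
    .filter(lambda _ex: rng.random() < SAMPLE_RATE)       # sampling
)
print(len(corpus), "->", len(subset))
```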