Update README.md (#1)
Update README.md (555f2182ed4f571b5757b816f1dbca1144af197d)
README.md CHANGED

@@ -20,12 +20,34 @@ and can be easily loaded by libraries like `transformers`.
Please refer to [`demo.py`](demo.py) for an example of usage.

*Note: This is a pretrained base model only and has not undergone fine-tuning, reinforcement learning (RL), or any other post-training procedures. It is not ready for direct conversation. Users are advised to employ few-shot prompting to guide model outputs, or to fine-tune the model for specific downstream applications.*

## Features

Our data preprocessing and pre-training pipeline is designed for enhanced training efficiency and model quality, achieved through several key innovations:
1. **High-Performance Data Preprocessing:** We built an open-source, Spark-based framework optimized with [Chukonu](https://pacman.cs.tsinghua.edu.cn/~cwg/publication/chukonu-2021/), delivering exceptional efficiency for large-scale deduplication and sorting tasks (a deduplication sketch follows this list).
2. **Dataset Quality Benchmarking:** A quantile benchmarking approach applied to major open-source pretraining datasets (e.g., DCLM Baseline, Fineweb-Edu) reveals their quality distributions via small-scale training runs, informing better data selection (see the quantile-bucketing sketch below).
3. **Multi-Phase Pre-Training:** The training progresses through 5 phases, strategically increasing the ratio of reasoning-intensive and knowledge-intensive samples while selectively repeating high-quality data portions.
4. **Multi-Domain Curriculum Learning:** We keep a stable data mixture across different datasets while ordering samples within each dataset by ascending quality. This curriculum is further leveraged through [accommodated learning rate decay and model averaging](https://arxiv.org/abs/2511.18903) (see the curriculum sketch below).
5. **Architecture for Training Stability:** Optimized for training on 910A clusters (FP16 precision, similar to V100), the Kaiyuan-2B architecture integrates QK norm, sandwich norm, and soft-capping techniques to ensure stable and robust pre-training (see the attention sketch below).
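
The items above are descriptive rather than prescriptive, so a few illustrative sketches follow. For item 1, large-scale exact deduplication can be pictured as a plain Spark job; this is only a minimal sketch, not the Chukonu-optimized framework itself, and the input layout and `text` column name are assumptions.

```python
# Minimal PySpark sketch of exact-hash document deduplication (item 1).
# Illustrative only: the real pipeline is a Chukonu-optimized Spark framework;
# the input path and the "text" column are assumptions for this example.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dedup-sketch").getOrCreate()

docs = spark.read.json("corpus/*.jsonl")  # assumed JSONL corpus with a "text" field

deduped = (
    docs
    .withColumn("fingerprint", F.sha2(F.lower(F.trim(F.col("text"))), 256))  # normalize, then hash
    .dropDuplicates(["fingerprint"])  # keep one document per content fingerprint
    .drop("fingerprint")
)

deduped.write.mode("overwrite").json("corpus_dedup/")
```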
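
For item 2, quantile benchmarking amounts to scoring documents with a quality model, splitting the corpus into score quantiles, and running small-scale training on each bucket to see how quality maps to downstream metrics. The bucketing step might look like the sketch below; the scoring source and the `train_small_model` / `evaluate` helpers in the comment are hypothetical placeholders, not functions from this repository.

```python
# Sketch of quantile bucketing for dataset quality benchmarking (item 2).
# `scores` are assumed to come from some document-quality classifier.
import numpy as np

def quantile_buckets(docs, scores, n_buckets=4):
    """Group docs into ascending-quality buckets split at score quantiles."""
    scores = np.asarray(scores, dtype=float)
    edges = np.quantile(scores, np.linspace(0.0, 1.0, n_buckets + 1))
    buckets = [[] for _ in range(n_buckets)]
    for doc, s in zip(docs, scores):
        # Map the score to its quantile interval, clipped to a valid bucket index.
        i = min(int(np.searchsorted(edges[1:-1], s, side="right")), n_buckets - 1)
        buckets[i].append(doc)
    return buckets

# Typical use (hypothetical helpers):
#   for i, bucket in enumerate(quantile_buckets(docs, scores)):
#       proxy = train_small_model(bucket)   # small-scale training run
#       print(i, evaluate(proxy))           # informs data selection
```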
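
Items 3 and 4 together describe how data is ordered during training: the cross-dataset mixture stays fixed while samples inside each dataset are consumed from lowest to highest quality (the phase-wise re-weighting of item 3 is not modeled here). A toy scheduler under those assumptions could look like this; the `quality` field and the sampling scheme are illustrative, not the actual training code.

```python
# Toy multi-domain curriculum sampler (items 3-4): constant mixture across
# datasets, ascending-quality order within each dataset. Illustrative only.
import random

def curriculum_batches(datasets, mixture, total_steps, batch_size, seed=0):
    """datasets: dict name -> list of samples, each a dict with a 'quality' float.
    mixture: dict name -> sampling probability (should sum to 1)."""
    rng = random.Random(seed)
    # Sort every dataset by ascending quality and keep a per-dataset read cursor.
    ordered = {name: sorted(ds, key=lambda s: s["quality"]) for name, ds in datasets.items()}
    cursors = {name: 0 for name in datasets}
    names, weights = zip(*mixture.items())

    for _ in range(total_steps):
        batch = []
        for _ in range(batch_size):
            name = rng.choices(names, weights=weights, k=1)[0]  # mixture stays constant
            ds = ordered[name]
            batch.append(ds[cursors[name] % len(ds)])  # next sample in ascending-quality order
            cursors[name] += 1
        yield batch
```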
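
Finally, two of the stability techniques named in item 5 are easy to show in isolation: QK norm (normalize queries and keys before the dot product) and attention-logit soft-capping (bound logits with a scaled tanh). The sketch below is a generic PyTorch rendering of those two ideas; the cap value, the use of `torch.nn.RMSNorm` (PyTorch >= 2.4), and the module layout are assumptions, not the actual Kaiyuan-2B implementation, and sandwich norm (normalization before and after each sub-layer) is omitted for brevity.

```python
# Generic self-attention sketch with QK norm and logit soft-capping (item 5).
# Illustrative only; hyperparameters and layout are not the Kaiyuan-2B ones.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class QKNormCappedAttention(nn.Module):
    def __init__(self, dim: int, n_heads: int, logit_cap: float = 50.0):
        super().__init__()
        assert dim % n_heads == 0
        self.n_heads, self.head_dim, self.logit_cap = n_heads, dim // n_heads, logit_cap
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.proj = nn.Linear(dim, dim, bias=False)
        self.q_norm = nn.RMSNorm(self.head_dim)  # QK norm: per-head RMSNorm on queries...
        self.k_norm = nn.RMSNorm(self.head_dim)  # ...and on keys

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape to (batch, heads, tokens, head_dim).
        q = q.view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k = k.view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        v = v.view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        q, k = self.q_norm(q), self.k_norm(k)  # keeps query-key dot products bounded
        logits = q @ k.transpose(-2, -1) / math.sqrt(self.head_dim)
        # Soft-capping squashes logits smoothly into (-cap, cap), which helps
        # avoid overflow and instability when training in FP16.
        logits = self.logit_cap * torch.tanh(logits / self.logit_cap)
        attn = F.softmax(logits, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, t, d)
        return self.proj(out)
```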
## Citation