openhonor committed
Commit f34886c · verified · 1 Parent(s): bed3881

Update README.md (#1)

- Update README.md (555f2182ed4f571b5757b816f1dbca1144af197d)

Files changed (1)
  1. README.md +24 -2
README.md CHANGED
@@ -20,12 +20,34 @@ and can be easily loaded by libraries like `transformers`.
 
  Please use [`demo.py`](demo.py) as an example of use.
 
- _Note: This is a pretrained base model only and has not undergone fine-tuning,
+ *Note: This is a pretrained base model only and has not undergone fine-tuning,
  reinforcement learning (RL), or any other post-training procedures.
  It is not ready for direct conversation.
  Users are recommended to employ few-shot prompting to guide model outputs,
- or to fine-tune the model for specific downstream applications.
+ or to fine-tune the model for specific downstream applications.*
 
+ ## Features
+
+ ![](model_performance_comparison.png)
+
+ Our data preprocessing and pre-training pipeline is designed for enhanced training efficiency and model quality,
+ achieved through several key innovations:
+
+ 1. **High-Performance Data Preprocessing:** We built an open-source,
+ Spark-based framework optimized with [Chukonu](https://pacman.cs.tsinghua.edu.cn/~cwg/publication/chukonu-2021/),
+ delivering exceptional efficiency for large-scale deduplication and sorting tasks.
+
+ 2. **Dataset Quality Benchmarking:** A quantile benchmarking approach applied to major open-source pretraining datasets (e.g., DCLM Baseline, Fineweb-Edu)
+ reveals their quality distributions via small-scale training runs, informing better data selection.
+
+ 3. **Multi-Phase Pre-Training:** The training progresses through 5 phases, strategically increasing the ratio of reasoning-intensive and knowledge-intensive samples
+ while selectively repeating high-quality data portions.
+
+ 4. **Multi-Domain Curriculum Learning:** We keep a stable data mixture across different datasets while ordering samples within each dataset by ascending quality.
+ This curriculum is further leveraged through [accommodated learning rate decay and model averaging](https://arxiv.org/abs/2511.18903).
+
+ 5. **Architecture for Training Stability:** Optimized for training on 910A clusters (FP16 precision, similar to V100),
+ the Kaiyuan-2B architecture integrates QK norm, sandwich norm, and soft-capping techniques to ensure stable and robust pre-training.
 
  ## Citation
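
The note in the diff above recommends few-shot prompting for this base model. A minimal sketch with `transformers` follows; the repository id `openhonor/Kaiyuan-2B` and the prompt are illustrative placeholders not confirmed by this commit, and [`demo.py`](demo.py) remains the official usage example.

```python
# Minimal few-shot prompting sketch for a pretrained base (non-chat) model.
# The model id is a placeholder; substitute the actual repository id.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openhonor/Kaiyuan-2B"  # placeholder, not confirmed by this commit
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)  # trust_remote_code only if the repo ships custom code
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

# Few-shot prompt: demonstrate the pattern, then let the base model continue it.
prompt = (
    "Q: What is the capital of France?\nA: Paris\n"
    "Q: What is the capital of Japan?\nA: Tokyo\n"
    "Q: What is the capital of Canada?\nA:"
)
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=8, do_sample=False)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```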
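Item 1 of the new Features section describes Spark-based deduplication and sorting accelerated by Chukonu. The Chukonu-optimized framework itself is not reproduced here; the sketch below is plain PySpark with a hypothetical input path, illustrating only exact-hash deduplication.

```python
# Exact-hash deduplication sketch in plain PySpark (the Chukonu-accelerated
# framework mentioned in the README is not shown; this is only illustrative).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dedup-sketch").getOrCreate()

# Assumed input layout: JSONL files with a "text" field (hypothetical path).
docs = spark.read.json("s3://bucket/pretrain-corpus/*.jsonl")

# Hash each document's text and keep one representative per hash.
deduped = (
    docs.withColumn("text_hash", F.sha2(F.col("text"), 256))
        .dropDuplicates(["text_hash"])
        .drop("text_hash")
)

deduped.write.mode("overwrite").json("s3://bucket/pretrain-corpus-deduped/")
```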
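Item 2 benchmarks datasets by quality quantiles. Assuming each document already carries a scalar quality score (the scorer is not specified in the README), a minimal sketch of the quantile split that would feed the small-scale training runs:

```python
# Split a scored corpus into quality quantiles for small-scale probe training.
# The "quality" field and the number of buckets are illustrative assumptions.
import numpy as np

def quantile_buckets(docs, num_buckets=4, key="quality"):
    """Return num_buckets lists of docs, ordered from lowest to highest quality."""
    scores = np.array([d[key] for d in docs])
    edges = np.quantile(scores, np.linspace(0, 1, num_buckets + 1))
    buckets = [[] for _ in range(num_buckets)]
    for doc, s in zip(docs, scores):
        # Map each score to its quantile bucket (clamped at the top edge).
        idx = min(np.searchsorted(edges, s, side="right") - 1, num_buckets - 1)
        buckets[idx].append(doc)
    return buckets

if __name__ == "__main__":
    corpus = [{"text": f"doc {i}", "quality": i / 100} for i in range(100)]
    for b, bucket in enumerate(quantile_buckets(corpus)):
        # Each bucket would back an independent small-scale training run.
        print(f"bucket {b}: {len(bucket)} docs")
```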
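Item 3 describes a 5-phase schedule with a rising share of reasoning- and knowledge-intensive data. The phase budgets and mixture weights below are hypothetical placeholders showing how such a schedule could be expressed; the actual ratios are not published in this README.

```python
# Hypothetical 5-phase data-mixture schedule: all numbers are illustrative only;
# the real budgets and ratios used for Kaiyuan-2B are not given in the README.
import random

PHASES = [
    # (token budget in billions, {domain: sampling weight})
    (200, {"web": 0.80, "knowledge": 0.10, "reasoning": 0.10}),
    (200, {"web": 0.70, "knowledge": 0.15, "reasoning": 0.15}),
    (200, {"web": 0.60, "knowledge": 0.20, "reasoning": 0.20}),
    (100, {"web": 0.45, "knowledge": 0.25, "reasoning": 0.30}),
    (50,  {"web": 0.30, "knowledge": 0.30, "reasoning": 0.40}),  # high-quality data may repeat here
]

def sample_domain(weights, rng=random):
    """Pick the domain to draw the next batch from, according to phase weights."""
    domains, probs = zip(*weights.items())
    return rng.choices(domains, weights=probs, k=1)[0]

for phase_idx, (budget_b_tokens, weights) in enumerate(PHASES, start=1):
    picks = [sample_domain(weights) for _ in range(5)]
    print(f"phase {phase_idx}: ~{budget_b_tokens}B tokens, first draws -> {picks}")
```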
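Item 4 keeps the cross-dataset mixture stable while ordering each dataset by ascending quality. A minimal sketch under assumed dataset names, mixture weights, and a `quality` field:

```python
# Curriculum sketch: fixed cross-dataset mixture, ascending within-dataset quality.
# Dataset names, weights, and the "quality" field are illustrative assumptions.
import random

def curriculum_stream(datasets, weights, rng=None):
    """Yield samples so each source is consumed in ascending-quality order
    while the probability of drawing from each source stays constant."""
    rng = rng or random.Random(0)
    # Sort every dataset by its quality score (lowest quality first).
    iters = {name: iter(sorted(docs, key=lambda d: d["quality"]))
             for name, docs in datasets.items()}
    names = list(iters)
    while names:
        name = rng.choices(names, weights=[weights[n] for n in names], k=1)[0]
        try:
            yield name, next(iters[name])
        except StopIteration:
            names.remove(name)  # source exhausted; keep drawing from the rest

if __name__ == "__main__":
    data = {
        "web": [{"quality": random.random()} for _ in range(5)],
        "reasoning": [{"quality": random.random()} for _ in range(5)],
    }
    for src, doc in curriculum_stream(data, {"web": 0.7, "reasoning": 0.3}):
        print(src, round(doc["quality"], 2))
```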
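Item 5 names QK norm, sandwich norm, and soft-capping as stability techniques. The PyTorch sketch below shows generic forms of these three ideas; the dimensions, cap value, and block layout are assumptions and do not reproduce the actual Kaiyuan-2B architecture.

```python
# Generic PyTorch sketch of the three stability techniques named in item 5:
# QK norm, attention-logit soft-capping, and a sandwich-normed sublayer.
# Sizes and the cap value are illustrative, not Kaiyuan-2B's real config.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftCapAttention(nn.Module):
    def __init__(self, dim=256, num_heads=4, logit_cap=50.0):
        super().__init__()
        self.num_heads, self.head_dim, self.cap = num_heads, dim // num_heads, logit_cap
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.out = nn.Linear(dim, dim, bias=False)
        # QK norm: normalize queries and keys per head before the dot product.
        self.q_norm = nn.LayerNorm(self.head_dim)
        self.k_norm = nn.LayerNorm(self.head_dim)

    def forward(self, x):
        b, t, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (y.view(b, t, self.num_heads, self.head_dim).transpose(1, 2) for y in (q, k, v))
        q, k = self.q_norm(q), self.k_norm(k)
        logits = q @ k.transpose(-2, -1) / math.sqrt(self.head_dim)
        # Soft-capping keeps attention logits in (-cap, cap), limiting FP16 overflow risk.
        logits = self.cap * torch.tanh(logits / self.cap)
        attn = F.softmax(logits, dim=-1)
        y = (attn @ v).transpose(1, 2).reshape(b, t, -1)
        return self.out(y)

class SandwichBlock(nn.Module):
    """Sandwich norm: normalize both the sublayer input and its output."""
    def __init__(self, dim=256):
        super().__init__()
        self.pre_norm, self.post_norm = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.attn = SoftCapAttention(dim)

    def forward(self, x):
        return x + self.post_norm(self.attn(self.pre_norm(x)))

if __name__ == "__main__":
    x = torch.randn(2, 16, 256)
    print(SandwichBlock()(x).shape)  # torch.Size([2, 16, 256])
```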