openhonor committed
Commit f34886c · verified · 1 Parent(s): bed3881

Update README.md (#1)

- Update README.md (555f2182ed4f571b5757b816f1dbca1144af197d)

Files changed (1)
  1. README.md +24 -2
README.md CHANGED
@@ -20,12 +20,34 @@ and can be easily loaded by libraries like `transformers`.
 
  Please use [`demo.py`](demo.py) as an example of use.
 
- _Note: This is a pretrained base model only and has not undergone fine-tuning,
+ *Note: This is a pretrained base model only and has not undergone fine-tuning,
  reinforcement learning (RL), or any other post-training procedures.
  It is not ready for direct conversation.
  Users are recommended to employ few-shot prompting to guide model outputs,
- or to fine-tune the model for specific downstream applications.
+ or to fine-tune the model for specific downstream applications.*
 
+ ## Features
+
+ ![](model_performance_comparison.png)
+
+ Our data preprocessing and pre-training pipeline is designed for enhanced training efficiency and model quality,
+ achieved through several key innovations:
+
+ 1. **High-Performance Data Preprocessing:** We built an open-source,
+ Spark-based framework optimized with [Chukonu](https://pacman.cs.tsinghua.edu.cn/~cwg/publication/chukonu-2021/),
+ delivering exceptional efficiency for large-scale deduplication and sorting tasks.
+
+ 2. **Dataset Quality Benchmarking:** A quantile benchmarking approach applied to major open-source pretraining datasets (e.g., DCLM Baseline, Fineweb-Edu)
+ reveals their quality distributions via small-scale training runs, informing better data selection.
+
+ 3. **Multi-Phase Pre-Training:** The training progresses through 5 phases, strategically increasing the ratio of reasoning-intensive and knowledge-intensive samples
+ while selectively repeating high-quality data portions.
+
+ 4. **Multi-Domain Curriculum Learning:** We keep a stable data mixture across different datasets while ordering samples within each dataset by ascending quality.
+ This curriculum is further leveraged through [accommodated learning rate decay and model averaging](https://arxiv.org/abs/2511.18903).
+
+ 5. **Architecture for Training Stability:** Optimized for training on 910A clusters (FP16 precision, similar to V100),
+ the Kaiyuan-2B architecture integrates QK norm, sandwich norm, and soft-capping techniques to ensure stable and robust pre-training.
 
  ## Citation
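
The note in the diff above recommends few-shot prompting for this base model. A minimal sketch with `transformers` follows; the repository id `openhonor/Kaiyuan-2B` and the prompt are illustrative placeholders not confirmed by this commit, and [`demo.py`](demo.py) remains the official usage example.

```python
# Minimal few-shot prompting sketch for a pretrained base (non-chat) model.
# The model id is a placeholder; substitute the actual repository id.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openhonor/Kaiyuan-2B"  # placeholder, not confirmed by this commit
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)  # trust_remote_code only if the repo ships custom code
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

# Few-shot prompt: demonstrate the pattern, then let the base model continue it.
prompt = (
    "Q: What is the capital of France?\nA: Paris\n"
    "Q: What is the capital of Japan?\nA: Tokyo\n"
    "Q: What is the capital of Canada?\nA:"
)
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=8, do_sample=False)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```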
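Item 1 of the new Features section describes Spark-based deduplication and sorting accelerated by Chukonu. The Chukonu-optimized framework itself is not reproduced here; the sketch below is plain PySpark with a hypothetical input path, illustrating only exact-hash deduplication.

```python
# Exact-hash deduplication sketch in plain PySpark (the Chukonu-accelerated
# framework mentioned in the README is not shown; this is only illustrative).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dedup-sketch").getOrCreate()

# Assumed input layout: JSONL files with a "text" field (hypothetical path).
docs = spark.read.json("s3://bucket/pretrain-corpus/*.jsonl")

# Hash each document's text and keep one representative per hash.
deduped = (
    docs.withColumn("text_hash", F.sha2(F.col("text"), 256))
        .dropDuplicates(["text_hash"])
        .drop("text_hash")
)

deduped.write.mode("overwrite").json("s3://bucket/pretrain-corpus-deduped/")
```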
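Item 2 benchmarks datasets by quality quantiles. Assuming each document already carries a scalar quality score (the scorer is not specified in the README), a minimal sketch of the quantile split that would feed the small-scale training runs:

```python
# Split a scored corpus into quality quantiles for small-scale probe training.
# The "quality" field and the number of buckets are illustrative assumptions.
import numpy as np

def quantile_buckets(docs, num_buckets=4, key="quality"):
    """Return num_buckets lists of docs, ordered from lowest to highest quality."""
    scores = np.array([d[key] for d in docs])
    edges = np.quantile(scores, np.linspace(0, 1, num_buckets + 1))
    buckets = [[] for _ in range(num_buckets)]
    for doc, s in zip(docs, scores):
        # Map each score to its quantile bucket (clamped at the top edge).
        idx = min(np.searchsorted(edges, s, side="right") - 1, num_buckets - 1)
        buckets[idx].append(doc)
    return buckets

if __name__ == "__main__":
    corpus = [{"text": f"doc {i}", "quality": i / 100} for i in range(100)]
    for b, bucket in enumerate(quantile_buckets(corpus)):
        # Each bucket would back an independent small-scale training run.
        print(f"bucket {b}: {len(bucket)} docs")
```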
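Item 3 describes a 5-phase schedule with a rising share of reasoning- and knowledge-intensive data. The phase budgets and mixture weights below are hypothetical placeholders showing how such a schedule could be expressed; the actual ratios are not published in this README.

```python
# Hypothetical 5-phase data-mixture schedule: all numbers are illustrative only;
# the real budgets and ratios used for Kaiyuan-2B are not given in the README.
import random

PHASES = [
    # (token budget in billions, {domain: sampling weight})
    (200, {"web": 0.80, "knowledge": 0.10, "reasoning": 0.10}),
    (200, {"web": 0.70, "knowledge": 0.15, "reasoning": 0.15}),
    (200, {"web": 0.60, "knowledge": 0.20, "reasoning": 0.20}),
    (100, {"web": 0.45, "knowledge": 0.25, "reasoning": 0.30}),
    (50,  {"web": 0.30, "knowledge": 0.30, "reasoning": 0.40}),  # high-quality data may repeat here
]

def sample_domain(weights, rng=random):
    """Pick the domain to draw the next batch from, according to phase weights."""
    domains, probs = zip(*weights.items())
    return rng.choices(domains, weights=probs, k=1)[0]

for phase_idx, (budget_b_tokens, weights) in enumerate(PHASES, start=1):
    picks = [sample_domain(weights) for _ in range(5)]
    print(f"phase {phase_idx}: ~{budget_b_tokens}B tokens, first draws -> {picks}")
```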
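Item 4 keeps the cross-dataset mixture stable while ordering each dataset by ascending quality. A minimal sketch under assumed dataset names, mixture weights, and a `quality` field:

```python
# Curriculum sketch: fixed cross-dataset mixture, ascending within-dataset quality.
# Dataset names, weights, and the "quality" field are illustrative assumptions.
import random

def curriculum_stream(datasets, weights, rng=None):
    """Yield samples so each source is consumed in ascending-quality order
    while the probability of drawing from each source stays constant."""
    rng = rng or random.Random(0)
    # Sort every dataset by its quality score (lowest quality first).
    iters = {name: iter(sorted(docs, key=lambda d: d["quality"]))
             for name, docs in datasets.items()}
    names = list(iters)
    while names:
        name = rng.choices(names, weights=[weights[n] for n in names], k=1)[0]
        try:
            yield name, next(iters[name])
        except StopIteration:
            names.remove(name)  # source exhausted; keep drawing from the rest

if __name__ == "__main__":
    data = {
        "web": [{"quality": random.random()} for _ in range(5)],
        "reasoning": [{"quality": random.random()} for _ in range(5)],
    }
    for src, doc in curriculum_stream(data, {"web": 0.7, "reasoning": 0.3}):
        print(src, round(doc["quality"], 2))
```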
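Item 5 names QK norm, sandwich norm, and soft-capping as stability techniques. The PyTorch sketch below shows generic forms of these three ideas; the dimensions, cap value, and block layout are assumptions and do not reproduce the actual Kaiyuan-2B architecture.

```python
# Generic PyTorch sketch of the three stability techniques named in item 5:
# QK norm, attention-logit soft-capping, and a sandwich-normed sublayer.
# Sizes and the cap value are illustrative, not Kaiyuan-2B's real config.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftCapAttention(nn.Module):
    def __init__(self, dim=256, num_heads=4, logit_cap=50.0):
        super().__init__()
        self.num_heads, self.head_dim, self.cap = num_heads, dim // num_heads, logit_cap
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.out = nn.Linear(dim, dim, bias=False)
        # QK norm: normalize queries and keys per head before the dot product.
        self.q_norm = nn.LayerNorm(self.head_dim)
        self.k_norm = nn.LayerNorm(self.head_dim)

    def forward(self, x):
        b, t, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (y.view(b, t, self.num_heads, self.head_dim).transpose(1, 2) for y in (q, k, v))
        q, k = self.q_norm(q), self.k_norm(k)
        logits = q @ k.transpose(-2, -1) / math.sqrt(self.head_dim)
        # Soft-capping keeps attention logits in (-cap, cap), limiting FP16 overflow risk.
        logits = self.cap * torch.tanh(logits / self.cap)
        attn = F.softmax(logits, dim=-1)
        y = (attn @ v).transpose(1, 2).reshape(b, t, -1)
        return self.out(y)

class SandwichBlock(nn.Module):
    """Sandwich norm: normalize both the sublayer input and its output."""
    def __init__(self, dim=256):
        super().__init__()
        self.pre_norm, self.post_norm = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.attn = SoftCapAttention(dim)

    def forward(self, x):
        return x + self.post_norm(self.attn(self.pre_norm(x)))

if __name__ == "__main__":
    x = torch.randn(2, 16, 256)
    print(SandwichBlock()(x).shape)  # torch.Size([2, 16, 256])
```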