flust committed
Commit 00ce2e9 · verified · parent: 5171cfe

Update README.md

Files changed (1)
  1. README.md +2 -2
README.md CHANGED
@@ -37,8 +37,8 @@ Pre-Training<br>
  We ultimately constructed a **23T-token** training corpus, comprising web pages, books, code, papers, and more.
  Besides realistic data, we also incorporate synthetic data with high knowledge density and reasoning density, such as QAs, Textbooks, and Long-COTs, which significantly benefited the downstream task performance. -->
 
- * We designed an innovative **FG-WSD** (Fine-Grained Warmup-Stable-Decay) training scheduler, meticulously refining the conventional WSD approach.
- This scheduler was implemented with a fine-grained, quality-progressive data curriculum, dividing the Stable stage into multiple phases with progressively improved data mixtures. Compared to the vanilla WSD, our method achieved notable performance gains. During the Decay stage, we increased the proportion of math, code, synthetic QA, and synthetic Long-COT data to further enhance reasoning capabilities.
+ * We designed an innovative **FG-WSD (Fine-Grained Warmup-Stable-Decay)** training scheduler, meticulously refining the conventional WSD approach.
+ This scheduler was implemented with a **fine-grained, quality-progressive data curriculum**, dividing the Stable stage into multiple phases with progressively improved data mixtures. Compared to the vanilla WSD, our method achieved notable performance gains. During the Decay stage, we increased the proportion of math, code, synthetic QA, and synthetic Long-COT data to further enhance reasoning capabilities.
  <!-- * For training recipe, we innovatively proposed the **FG-WSD** (Fine-Grained Warmup-Stable-Decay) scheduler, as an improvement upon the WSD (Warm-Stable-Decay).
  In the Stable stage, we divided 19T tokens into multiple fine-grained phases, with later phases using an overall higher-quality data mix. Compared to the vanilla WSD scheduler, FG-WSD achieved promising benefits.
  In the Decay stage, we use 4T tokens, in which the proportion of math, code, synthetic QA, and synthetic Long-COT is increased to enhance the model's reasoning capabilities. -->
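
For readers who want a concrete picture of what such a phased WSD-style schedule could look like, here is a minimal Python sketch. It is **not** the training code behind this README: the warmup budget, learning-rate values, exact phase splits, and mixture weights are illustrative assumptions (only the rough 19T-Stable / 4T-Decay split comes from the text above), and `lr_and_mixture` is a hypothetical helper name.

```python
# Minimal sketch of a WSD-style schedule whose Stable stage is split into
# fine-grained phases, each with its own data mixture. All numbers except
# the rough 19T-Stable / 4T-Decay budgets are assumptions for illustration.
from dataclasses import dataclass

@dataclass
class Phase:
    name: str
    tokens: float   # token budget for this phase, in trillions
    mixture: dict   # sampling weights per data source (sum to 1.0)

# Hypothetical curriculum: later Stable phases use an overall higher-quality
# mix, and the Decay phase upweights math/code/synthetic reasoning data.
PHASES = [
    Phase("warmup",   0.1, {"web": 0.70, "books": 0.15, "code": 0.10, "synthetic": 0.05}),
    Phase("stable_1", 7.0, {"web": 0.65, "books": 0.15, "code": 0.12, "synthetic": 0.08}),
    Phase("stable_2", 6.0, {"web": 0.55, "books": 0.18, "code": 0.15, "synthetic": 0.12}),
    Phase("stable_3", 6.0, {"web": 0.45, "books": 0.20, "code": 0.18, "synthetic": 0.17}),
    Phase("decay",    4.0, {"web": 0.25, "books": 0.15, "code": 0.25, "synthetic": 0.35}),
]

PEAK_LR, FINAL_LR = 3e-4, 3e-5   # assumed values, not from the README

def lr_and_mixture(tokens_seen: float):
    """Return (learning rate, data mixture) for a token count given in trillions."""
    start = 0.0
    for phase in PHASES:
        end = start + phase.tokens
        if tokens_seen < end or phase is PHASES[-1]:
            frac = min((tokens_seen - start) / phase.tokens, 1.0)  # progress within phase
            if phase.name == "warmup":
                lr = PEAK_LR * frac                        # linear warmup to peak
            elif phase.name == "decay":
                lr = PEAK_LR + (FINAL_LR - PEAK_LR) * frac # linear decay to final
            else:
                lr = PEAK_LR                               # stable plateau
            return lr, phase.mixture
        start = end

if __name__ == "__main__":
    for t in (0.05, 3.0, 10.0, 16.0, 21.0):
        lr, mix = lr_and_mixture(t)
        print(f"{t:5.2f}T tokens -> lr={lr:.2e}, mix={mix}")
```

The point of the sketch is the design choice the README describes: the learning-rate shape stays standard WSD, while the Stable stage is carved into quality-progressive phases whose sampling weights shift toward higher-quality and reasoning-heavy data, and the Decay stage further upweights math, code, and synthetic QA/Long-COT data.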