
1. Introduction

Nanbeige4-3B-Base is a 3B-parameter base model in the fourth-generation Nanbeige LLM family. It demonstrates that even a compact model can achieve strong performance through continuous improvements in data quality and training methodology. When supervised fine-tuning (SFT) is performed on the same training data, our model significantly outperforms open-source models of the same size and even surpasses larger models such as Qwen3-8B. To support research and technological advancement in the open-source community, we have open-sourced the Nanbeige4-3B-Base model together with its technical methodology.

2. Model Summary

Training Data

  • We constructed a comprehensive 23T-token training corpus from web texts, books, code, and papers, meticulously filtered through a hybrid strategy of tagging-based scoring and retrieval-based recall (a simplified sketch of this hybrid filter follows below). This foundation was then augmented with knowledge-dense and reasoning-intensive synthetic data, including Q&A pairs, textbooks, and long chain-of-thought (Long-CoT) traces, which significantly improved downstream task performance.
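
The hybrid filter can be pictured roughly as follows: a document is kept either because a quality tagger scores it highly, or because it is recalled as a near neighbour of curated high-quality seed documents. This is only an illustrative sketch; quality_score, embed, and the thresholds are hypothetical stand-ins, not the actual classifier, embedding model, or settings used in the pipeline.

# Illustrative sketch of a hybrid "tagging-based scoring + retrieval-based recall" filter.
# quality_score() and embed() are hypothetical stand-ins; thresholds are arbitrary examples.
import numpy as np

def quality_score(doc: str) -> float:
    # Hypothetical tagger: in practice a trained quality classifier.
    return min(1.0, len(set(doc.split())) / 100)

def embed(text: str) -> np.ndarray:
    # Hypothetical embedding model; here a toy deterministic-per-text vector.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(64)
    return v / np.linalg.norm(v)

def keep(doc: str, seed_vecs: np.ndarray,
         score_thr: float = 0.7, sim_thr: float = 0.8) -> bool:
    """Keep a document if the tagger scores it highly (scoring path)
    or if it is close to a curated high-quality seed (recall path)."""
    if quality_score(doc) >= score_thr:
        return True
    sims = seed_vecs @ embed(doc)           # cosine similarity (unit vectors)
    return float(sims.max()) >= sim_thr

seeds = np.stack([embed(s) for s in ["a curated textbook passage",
                                     "a well-written research abstract"]])
corpus = ["some raw web text ...", "another candidate document ..."]
filtered = [d for d in corpus if keep(d, seeds)]
print(len(filtered))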

Training Recipe

  • We designed an FG-WSD (Fine-Grained Warmup-Stable-Decay) training scheduler that refines the conventional WSD approach. The scheduler is paired with a fine-grained, quality-progressive data curriculum, dividing the Stable stage into multiple phases with progressively improved data mixtures. Compared to vanilla WSD, this method achieved notable performance gains. During the Decay stage, we increased the proportion of math, code, synthetic QA, and synthetic Long-CoT data to further enhance reasoning capabilities. The stage breakdown is shown in the table below, followed by a sketch of the learning-rate curve.
    | Stage                           | Training Tokens | Learning Rate      |
    |---------------------------------|-----------------|--------------------|
    | Warmup Stage                    | 0.1T            | 0 → 4.5e-4         |
    | Diversity-Enriched Stable Stage | 12.4T           | Constant 4.5e-4    |
    | High-Quality Stable Stage       | 6.5T            | Constant 4.5e-4    |
    | Decay and Long-Context Stage    | 4T              | 4.5e-4 → 1.5e-6    |
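
The resulting learning-rate curve can be sketched as a simple piecewise function of the number of training tokens. The token budgets and learning-rate endpoints come from the table above; the linear warmup and linear decay shapes are assumptions for illustration only, and both Stable phases share the same constant rate (only the data mixture changes between them).

# Sketch of the FG-WSD learning-rate curve as a function of tokens seen.
# Token budgets and LR endpoints come from the table above; the linear decay
# shape is an assumption (the card does not specify it).
WARMUP_T = 0.1e12                  # Warmup Stage
STABLE_T = 12.4e12 + 6.5e12        # both Stable phases run at a constant LR
DECAY_T  = 4.0e12                  # Decay and Long-Context Stage
PEAK_LR  = 4.5e-4
FINAL_LR = 1.5e-6

def fg_wsd_lr(tokens_seen: float) -> float:
    """Learning rate after `tokens_seen` training tokens."""
    if tokens_seen < WARMUP_T:                        # linear warmup 0 -> peak
        return PEAK_LR * tokens_seen / WARMUP_T
    if tokens_seen < WARMUP_T + STABLE_T:             # constant across Stable phases
        return PEAK_LR
    frac = min(1.0, (tokens_seen - WARMUP_T - STABLE_T) / DECAY_T)
    return PEAK_LR + frac * (FINAL_LR - PEAK_LR)      # assumed linear decay

for t in (0.05e12, 5e12, 15e12, 20e12, 22.9e12):
    print(f"{t / 1e12:5.2f}T tokens -> lr = {fg_wsd_lr(t):.2e}")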

3. Model Performance

To compare model performance, we fine-tuned our base model and the Qwen-series base models on the same fine-tuning data and evaluated them on downstream tasks. We believe that, for base models, this end-to-end validation better reflects a model's ultimate downstream performance than few-shot evaluation does.

To ensure a fair comparison, we conducted experiments with three distinct datasets: Nemotron-Dataset-v1, Ring-lite-sft-data, and OpenThoughts3. For each dataset, we randomly selected 500k training samples for the SFT experiments (see the sampling sketch below).
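
The per-dataset sampling can be reproduced roughly as follows with the Hugging Face datasets library. The repository IDs and the split name below are placeholders, not the exact identifiers of the three datasets.

# Rough sketch of drawing 500k random SFT samples per dataset.
# The repo IDs are placeholders; substitute the actual Hugging Face identifiers.
from datasets import load_dataset

SFT_SOURCES = [
    "org/nemotron-dataset-v1",   # placeholder ID
    "org/ring-lite-sft-data",    # placeholder ID
    "org/openthoughts3",         # placeholder ID
]

def sample_sft_subset(repo_id: str, n: int = 500_000, seed: int = 42):
    ds = load_dataset(repo_id, split="train")     # split name assumed
    ds = ds.shuffle(seed=seed)                    # fixed seed for reproducibility
    return ds.select(range(min(n, len(ds))))      # random 500k subset

subsets = {repo: sample_sft_subset(repo) for repo in SFT_SOURCES}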

  • Finetuned with Nemotron-Dataset-v1

    | Model             | AIME2024 | AIME2025 | Math-500 | GPQA |
    |-------------------|----------|----------|----------|------|
    | Qwen3-4B-Base     | 24.6     | 25.0     | 90.4     | 44.6 |
    | Qwen3-8B-Base     | 37.9     | 29.6     | 91.1     | 48.9 |
    | Nanbeige4-3B-Base | 52.9     | 40.8     | 93.4     | 53.4 |
  • Finetuned with Ring-lite-sft-data

    | Model             | AIME2024 | AIME2025 | Math-500 | GPQA |
    |-------------------|----------|----------|----------|------|
    | Qwen3-4B-Base     | 40.4     | 31.3     | 93.6     | 51.4 |
    | Qwen3-8B-Base     | 50.0     | 35.8     | 94.4     | 55.1 |
    | Nanbeige4-3B-Base | 56.8     | 45.3     | 95.5     | 57.7 |
  • Finetuned with OpenThoughts3

    | Model             | AIME2024 | AIME2025 | Math-500 | GPQA |
    |-------------------|----------|----------|----------|------|
    | Qwen3-4B-Base     | 52.9     | 42.1     | 93.2     | 49.6 |
    | Qwen3-8B-Base     | 60.4     | 47.1     | 95.0     | 55.3 |
    | Nanbeige4-3B-Base | 62.4     | 49.2     | 94.6     | 56.9 |

The results demonstrate that Nanbeige4-3B-Base significantly outperforms Qwen3-4B-Base, and even surpasses the larger Qwen3-8B-Base, highlighting the greater potential of our base model after fine-tuning. This advantage stems from the optimized training recipe during our Stable stage and the extensive high-quality synthetic data incorporated during the Decay stage.

4. Quickstart

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the tokenizer and model (device_map='auto' places the weights on GPU if available).
tokenizer = AutoTokenizer.from_pretrained(
  'Nanbeige/Nanbeige4-3B-Base',
  use_fast=False,
  trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
  'Nanbeige/Nanbeige4-3B-Base',
  torch_dtype='auto',
  device_map='auto',
  trust_remote_code=True
)

# Nanbeige4-3B-Base is a base (non-chat) model, so it is prompted with plain
# text for completion rather than a chat-formatted message list.
prompt = "中国的首都是"  # "The capital of China is"
input_ids = tokenizer(prompt, return_tensors='pt').input_ids.to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=64)
resp = tokenizer.decode(output_ids[0][len(input_ids[0]):], skip_special_tokens=True)
print(resp)

5. Limitations

While we place great emphasis on model safety during training and strive to ensure that its outputs align with ethical and legal requirements, the model's size and probabilistic nature mean that unexpected outputs cannot be completely avoided. These outputs may include harmful content such as bias or discrimination. Please do not propagate such content. We do not assume any responsibility for the consequences resulting from the dissemination of inappropriate information.

6. Citation

If you find our model useful or use it in your projects, please cite this Hugging Face project.

7. Contact

If you have any questions, please raise an issue or contact us at nanbeige@126.com.
