---
license: apache-2.0
language:
- en
- zh
library_name: transformers
pipeline_tag: text-generation
tags:
- llm
- nanbeige
---
# 1. Introduction
Nanbeige4-3B-Thinking is a 3B-parameter reasoning model from the fourth-generation Nanbeige LLM family.
It demonstrates that continuously improving data quality and training recipes can yield strong reasoning capabilities, even in edge-sized models.
To support research and technological advancement in the open-source community, we have open-sourced the Nanbeige4-3B-Thinking model together with its technical methodology.
# 2. Model Summary
**Pre-Training**
* For training data, after extensive crawling and curation, we designed and employed a data-filtering strategy that combines tagging-based scoring with retrieval-based recall to select high-quality data.
We ultimately constructed a **23T-token** training corpus comprising web pages, books, code, papers, and more.
Besides real-world data, we also incorporated synthetic data with high knowledge and reasoning density, such as QA pairs, textbooks, and Long-CoTs, which significantly benefited downstream task performance.
* For the training recipe, we proposed the **FG-WSD** (Fine-Grained Warmup-Stable-Decay) scheduler as an improvement upon the standard WSD (Warmup-Stable-Decay) scheduler.
In the Stable stage, we divided roughly 19T tokens into multiple fine-grained phases, with later phases using an overall higher-quality data mix; compared to the vanilla WSD scheduler, FG-WSD yielded promising gains.
In the Decay stage, we used 4T tokens and increased the proportion of math, code, synthetic QA, and synthetic Long-CoT data to enhance the model's reasoning capabilities. A minimal sketch of the resulting learning-rate schedule follows the table below.
| Stage | Training Tokens | Learning Rate |
|----------------------------------|-----------------|------------------|
| Warmup Stage                     | 0.1T            | 0 → 4.5e-4       |
| Diversity-Enriched Stable Stage  | 12.4T           | Constant 4.5e-4  |
| High-Quality Stable Stage        | 6.5T            | Constant 4.5e-4  |
| Decay and Long-Context Stage     | 4T              | 4.5e-4 → 1.5e-6  |
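As a rough illustration of the schedule above, the sketch below maps tokens seen to a learning rate. Only the phase boundaries and endpoint rates come from the table; the linear warmup and linear decay shapes are assumptions for illustration, not the exact curve used in training.

```python
# Minimal sketch of the FG-WSD learning-rate curve implied by the table above.
# Assumption: warmup and decay are linear; only the boundaries and endpoints
# (4.5e-4 peak, 1.5e-6 final) are taken from the table.
def fg_wsd_lr(tokens_seen_t: float) -> float:
    """Return the learning rate after `tokens_seen_t` trillion tokens."""
    peak, final = 4.5e-4, 1.5e-6
    warmup_end = 0.1                     # Warmup Stage
    stable_end = 0.1 + 12.4 + 6.5        # both Stable phases hold the peak LR
    total = stable_end + 4.0             # Decay and Long-Context Stage ends at 23T

    if tokens_seen_t < warmup_end:       # linear warmup from 0 to the peak
        return peak * tokens_seen_t / warmup_end
    if tokens_seen_t < stable_end:       # constant plateau; the fine-grained
        return peak                      # phases differ only in data mix
    frac = min((tokens_seen_t - stable_end) / (total - stable_end), 1.0)
    return peak + (final - peak) * frac  # linear decay down to 1.5e-6
```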
**Post-Training**
* During the SFT phase, we constructed over **30 million** high-quality Long-CoT samples for multi-stage curriculum learning.
We combined rule-based and model-based verification to ensure not only that each response is accurate, but also that every training sample is more comprehensive and helpful than the other candidate responses.
Sufficient instruction diversity and response quality enabled the model to excel on a wide range of benchmarks.
* After SFT, we employed the Nanbeige flagship reasoning model as the teacher to distill Nanbeige4-3B-Thinking, further enhancing its performance.
We observed that on-policy distillation provides greater benefits for mathematical reasoning tasks, while off-policy distillation is more effective for general tasks such as human-preference alignment; a simplified illustration of the two modes appears after this list.
* We then applied multi-stage on-policy reinforcement learning: verifiable rewards strengthen reasoning, while a preference reward model improves human alignment.
We used both real-world and synthetic data, carefully filtering the dataset for appropriate difficulty.
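As a simplified illustration of the two distillation modes mentioned above, the snippet below computes a token-level KL loss between teacher and student distributions. Everything here (tensor shapes, loss weighting, how trajectories are produced and served) is an assumption for illustration, not the exact training setup.

```python
import torch
import torch.nn.functional as F

def distill_kl_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor) -> torch.Tensor:
    """Token-level KL(teacher || student) over [seq_len, vocab] logit tensors."""
    student_logp = F.log_softmax(student_logits, dim=-1)
    teacher_logp = F.log_softmax(teacher_logits, dim=-1)
    return F.kl_div(student_logp, teacher_logp, log_target=True, reduction='batchmean')

# On-policy: the student samples its own trajectories, the teacher scores those
# same tokens, and the loss above is applied to the student-generated sequences.
# Off-policy: the trajectories come from the teacher (or a fixed corpus), so the
# student is fitted to tokens it did not choose itself.
```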
# 3. Model Performance
For the performance comparison, we use recent Qwen3-series reasoning LLMs as baselines.
Every model is evaluated under identical inference configurations to ensure fairness, and each benchmark is run at least twice to compute the average score.
| Model | AIME24 | AIME25 | GPQA | Super-GPQA | Science-QA | Writing-Bench | BFCL-V4-Agentic | Arena-hard2 |
|----------------|--------|--------|------|------------|------------|--------------|----------------|-------------|
| Qwen3-8B-Thinking-2504 | 76.0 | 67.3 | 62.0 | 39.1 | 24.8 | 74.8 | 14.4 | 26.4 |
| Qwen3-14B-Thinking-2504 | 79.3 | 70.4 | 64.0 | 46.8 | 23.2 | 77.2 | 17.0 |40.5 |
| Qwen3-4B-Thinking-2507 | 83.3 | 81.3 | 67.2 | 46.7 | 24.4 | 84.3 | 14.3 | 37.7 |
| **Nanbeige4-3B-Thinking-2510** | **87.5** | **81.7** | **77.2** | **51.4** | **26.0** | **85.5** | **17.2** | **42.9** |
These results show that our model achieves superior performance on mainstream benchmarks covering math, science, creative writing, tool use, and human-preference alignment.
# 4. Quickstart
For the chat scenario:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the tokenizer and model (trust_remote_code is required for the custom code shipped with the checkpoint)
tokenizer = AutoTokenizer.from_pretrained(
    'Nanbeige/Nanbeige4-3B-Thinking-2510',
    use_fast=False,
    trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    'Nanbeige/Nanbeige4-3B-Thinking-2510',
    torch_dtype='auto',
    device_map='auto',
    trust_remote_code=True
)

# Build the prompt with the model's chat template
messages = [
    {'role': 'user', 'content': 'Which number is bigger, 9.11 or 9.8?'}
]
prompt = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=False
)

# Generate, then decode only the newly produced tokens (166101 is the model's end-of-turn token id)
input_ids = tokenizer(prompt, add_special_tokens=False, return_tensors='pt').input_ids
output_ids = model.generate(input_ids.to(model.device), eos_token_id=166101)
resp = tokenizer.decode(output_ids[0][len(input_ids[0]):], skip_special_tokens=True)
print(resp)
```
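Reasoning outputs can be long, and the `generate` defaults may truncate them. The call below is a sketch showing how to raise the generation budget and enable sampling; the specific values are illustrative assumptions, not official recommended settings.

```python
# Illustrative generation settings (not official recommendations): raise the
# token budget so long chains of thought are not cut off, and enable sampling.
output_ids = model.generate(
    input_ids.to(model.device),
    eos_token_id=166101,
    max_new_tokens=32768,
    do_sample=True,
    temperature=0.6,
    top_p=0.95,
)
```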
For the tool use scenario:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    'Nanbeige/Nanbeige4-3B-Thinking-2510',
    use_fast=False,
    trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    'Nanbeige/Nanbeige4-3B-Thinking-2510',
    torch_dtype='auto',
    device_map='auto',
    trust_remote_code=True
)

messages = [
    {'role': 'user', 'content': 'Help me check the weather in Beijing now'}
]

# Tool definitions are passed to the chat template; note that 'required' sits at
# the parameters level, alongside 'properties'
tools = [{
    'type': 'function',
    'function': {
        'name': 'SearchWeather',
        'description': 'Find out current weather in a certain place on a certain day.',
        'parameters': {
            'type': 'dict',
            'properties': {
                'location': {
                    'type': 'string',
                    'description': 'A city in China.'
                }
            },
            'required': ['location']
        }
    }
}]

prompt = tokenizer.apply_chat_template(
    messages,
    tools=tools,
    add_generation_prompt=True,
    tokenize=False
)

input_ids = tokenizer(prompt, add_special_tokens=False, return_tensors='pt').input_ids
output_ids = model.generate(input_ids.to(model.device), eos_token_id=166101)
resp = tokenizer.decode(output_ids[0][len(input_ids[0]):], skip_special_tokens=True)
print(resp)
```
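The exact tool-call format the model emits is defined by its chat template. If the decoded text above looks empty or truncated, re-decoding without `skip_special_tokens` shows the raw output, including any special tokens that wrap the tool call; this is a generic inspection step, not part of the official usage.

```python
# Inspect the raw completion, keeping special tokens, to see how the
# tool call is wrapped before writing a parser for it.
raw = tokenizer.decode(output_ids[0][len(input_ids[0]):], skip_special_tokens=False)
print(raw)
```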
# 5. Limitations
While we place great emphasis on safety during training and strive to ensure that the model's outputs align with ethical and legal requirements, its limited size and probabilistic nature mean that unexpected outputs cannot be ruled out entirely. These may include harmful content such as bias or discrimination. Please do not propagate such content. We do not assume any responsibility for the consequences resulting from the dissemination of inappropriate information.
# 6. Citation
If you find our model useful or use it in your projects, please cite this Hugging Face project.
# 7. Contact
If you have any questions, please raise an issue or contact us at nanbeige@126.com.