Update README.md
README.md CHANGED
@@ -86,7 +86,7 @@ September 2025 \- December 2025
Nemotron-3-Nano-30B-A3B-BF16 is a large language model (LLM) trained from scratch by NVIDIA, and designed as a unified model for both reasoning and non-reasoning tasks. It responds to user queries and tasks by first generating a reasoning trace and then concluding with a final response. The model's reasoning capabilities can be configured through a flag in the chat template. If the user prefers the model to provide its final answer without intermediate reasoning traces, it can be configured to do so, albeit with a slight decrease in accuracy for harder prompts that require reasoning. Conversely, allowing the model to generate reasoning traces first generally results in higher-quality final solutions to queries and tasks.
-The model employs a hybrid Mixture-of-Experts (MoE) architecture, consisting of 23 Mamba-2 and MoE layers, along with 6 Attention layers. Each MoE layer includes 128 experts plus 1 shared expert, with
+The model employs a hybrid Mixture-of-Experts (MoE) architecture, consisting of 23 Mamba-2 and MoE layers, along with 6 Attention layers. Each MoE layer includes 128 experts plus 1 shared expert, with 6 experts activated per token. The model has 3.5B active parameters and 30B parameters in total.
The supported languages include: English, German, Spanish, French, Italian, and Japanese. Improved using Qwen.
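The reasoning behavior described in the context lines above is controlled through the chat template. A minimal sketch of how such a toggle is commonly passed via `apply_chat_template`; the flag name `enable_thinking` and the repo id are placeholders, since this excerpt names neither:

```python
# Hedged sketch: switching the reasoning trace on or off through the chat template.
# `enable_thinking` is a placeholder flag name; the excerpt says the toggle lives in
# the chat template but does not name it, and the repo id below is a placeholder too.
from transformers import AutoTokenizer

MODEL_ID = "nvidia/<full-Nemotron-repo-id>"  # substitute the actual Hugging Face repo id
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

messages = [{"role": "user", "content": "What is 17 * 24?"}]

# Reasoning on: the model emits a reasoning trace before its final answer.
with_reasoning = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=False, enable_thinking=True
)

# Reasoning off: final answer only, at some accuracy cost on harder prompts.
without_reasoning = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=False, enable_thinking=False
)
```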
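The added line in the hunk above fixes the expert configuration: 128 routed experts plus 1 shared expert per MoE layer, 6 routed experts active per token, and roughly 3.5B active out of 30B total parameters. A minimal sketch of that routing pattern; the sizes, module names, and softmax-then-top-k router below are assumptions for illustration, not the actual Nemotron implementation:

```python
# Illustrative top-k routing with a shared expert (128 routed experts + 1 shared,
# 6 active per token, as in the paragraph above). Names and sizes are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoELayer(nn.Module):
    def __init__(self, d_model=64, d_ff=128, n_experts=128, top_k=6):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        make_ffn = lambda: nn.Sequential(
            nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model)
        )
        self.experts = nn.ModuleList(make_ffn() for _ in range(n_experts))
        self.shared_expert = make_ffn()  # processes every token, no routing

    def forward(self, x):  # x: (num_tokens, d_model)
        weights, idx = F.softmax(self.router(x), dim=-1).topk(self.top_k, dim=-1)
        weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize the 6 picks
        out = []
        for t in range(x.size(0)):  # naive per-token loop, for clarity not speed
            y = self.shared_expert(x[t])
            for w, e in zip(weights[t], idx[t]):
                y = y + w * self.experts[int(e)](x[t])
            out.append(y)
        return torch.stack(out)

layer = ToyMoELayer()
print(layer(torch.randn(4, 64)).shape)  # torch.Size([4, 64])
```

Because only 6 of the 128 routed experts (plus the always-on shared expert) run for any given token, per-token compute tracks the roughly 3.5B active parameters rather than the full 30B.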
@@ -174,7 +174,7 @@ December 15, 2025 via [Hugging Face](https://huggingface.co/nvidia/NVIDIA-Nemotr
## Model Design
-The model was trained with 25T tokens, with a batch size of 3072, and used the Warmup-Stable-Decay (WSD) learning rate schedule with 8B tokens of learning rate warm up, peak learning rate of 1e-3 and minimum learning rate of 1e-5. There are a total of 52 layers, of which there are 23 of each MoE and Mamba-2 and the remaining 6 layers use grouped query attention (GQA) with 2 groups. Each MoE layer
+The model was trained with 25T tokens, with a batch size of 3072, and used the Warmup-Stable-Decay (WSD) learning rate schedule with 8B tokens of learning rate warm up, peak learning rate of 1e-3 and minimum learning rate of 1e-5. There are a total of 52 layers, of which there are 23 of each MoE and Mamba-2 and the remaining 6 layers use grouped query attention (GQA) with 2 groups. Each MoE layer includes 128 routed experts plus 1 shared expert, with 6 experts activated per token.
## Training Methodology
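The Model Design line above gives the WSD hyperparameters (8B warmup tokens, peak learning rate 1e-3, minimum 1e-5, 25T total training tokens) but not the length or shape of the decay phase. A minimal sketch under those stated numbers, with a linear decay over an assumed final 10% of training as explicit placeholders:

```python
# Hedged sketch of a Warmup-Stable-Decay (WSD) schedule using the numbers in the
# changed line above (8B warmup tokens, peak 1e-3, min 1e-5, 25T total tokens).
# The decay length and its linear shape are NOT given in the README; both are
# placeholder assumptions.

WARMUP_TOKENS = 8e9
TOTAL_TOKENS = 25e12
DECAY_TOKENS = 0.10 * TOTAL_TOKENS  # assumed, not from the model card
PEAK_LR, MIN_LR = 1e-3, 1e-5

def wsd_lr(tokens_seen: float) -> float:
    if tokens_seen < WARMUP_TOKENS:                     # warmup: 0 -> peak
        return PEAK_LR * tokens_seen / WARMUP_TOKENS
    if tokens_seen < TOTAL_TOKENS - DECAY_TOKENS:       # stable: hold at peak
        return PEAK_LR
    frac = (TOTAL_TOKENS - tokens_seen) / DECAY_TOKENS  # decay: peak -> min
    return MIN_LR + (PEAK_LR - MIN_LR) * max(frac, 0.0)

for t in (0, 4e9, 8e9, 1e13, 2.4e13, 25e12):
    print(f"{t:.2e} tokens -> lr {wsd_lr(t):.2e}")
```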