Update README.md
README.md
Stage 3: Reinforcement Learning
* The model underwent multi-environment reinforcement learning using synchronous GRPO (Group Relative Policy Optimization) across math, code, science, instruction following, multi-step tool use, multi-turn conversations, and structured output environments. Conversational quality was further refined through RLHF using a [generative reward model](https://huggingface.co/nvidia/Qwen3-Nemotron-235B-A22B-GenRM). All datasets are disclosed in the *Training, Testing, and Evaluation Datasets* section of this document. The RL environments and datasets are released as part of [NeMo Gym](https://github.com/NVIDIA-NeMo/Gym).
* Software used for reinforcement learning: [NeMo RL](https://github.com/NVIDIA-NeMo/RL), [NeMo Gym](https://github.com/NVIDIA-NeMo/Gym)
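The GRPO variant used here is implemented in NeMo RL; as an illustration of the group-relative idea only, the sketch below shows how GRPO replaces a learned critic with per-group reward normalization. The function name, array shapes, and the toy 0/1 rewards are assumptions for this example, not NeMo RL's API.

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Group-relative advantages, the core idea of GRPO (illustrative sketch).

    For each prompt, a group of G rollouts is sampled; each rollout's
    advantage is its reward normalized by the group's mean and standard
    deviation, so no learned value function (critic) is required.
    rewards: shape (num_prompts, G).
    """
    mean = rewards.mean(axis=-1, keepdims=True)
    std = rewards.std(axis=-1, keepdims=True)
    return (rewards - mean) / (std + eps)

# Toy example: 2 prompts, 4 sampled rollouts each, binary verifier rewards.
rewards = np.array([[1.0, 0.0, 1.0, 0.0],
                    [1.0, 1.0, 1.0, 0.0]])
adv = grpo_advantages(rewards)  # above-group-average rollouts get positive advantage
```

Because the advantage is relative within each group, a rollout is reinforced only when it beats the other samples for the same prompt, which is what makes the method well suited to verifiable environments (math, code, structured output) with sparse rewards.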
The NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 model is the result of the above work.