suhara committed on
Commit eab7377 · verified · 1 Parent(s): fdb60b2

Update README.md

Files changed (1):
  1. README.md +1 -1
README.md CHANGED
@@ -191,7 +191,7 @@ Stage 2: Supervised Fine-Tuning
 
 Stage 3: Reinforcement Learning
 
-* The model underwent multi-environment reinforcement learning using synchronous GRPO (Group Relative Policy Optimization) across math, code, science, instruction following, multi-step tool use, multi-turn conversations, and structured output environments. Conversational quality was further refined through RLHF using a [generative reward model](https://huggingface.co/nvidia/Qwen3-Nemotron-235B-A22B-GenRM). All datasets are disclosed in the *Training, Testing, and Evaluation Datasets* section of this document. The RL environments and datasets are released as part of [NeMo Gym](https://github.com/NVIDIA/NeMo-Gym).
+* The model underwent multi-environment reinforcement learning using synchronous GRPO (Group Relative Policy Optimization) across math, code, science, instruction following, multi-step tool use, multi-turn conversations, and structured output environments. Conversational quality was further refined through RLHF using a [generative reward model](https://huggingface.co/nvidia/Qwen3-Nemotron-235B-A22B-GenRM). All datasets are disclosed in the *Training, Testing, and Evaluation Datasets* section of this document. The RL environments and datasets are released as part of [NeMo Gym](https://github.com/NVIDIA-NeMo/Gym).
 * Software used for reinforcement learning: [NeMo RL](https://github.com/NVIDIA-NeMo/RL), [NeMo Gym](https://github.com/NVIDIA-NeMo/Gym)
 
 NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 model is a result of the above work.
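The hunk above mentions GRPO (Group Relative Policy Optimization). As a rough illustration of the idea, a minimal sketch of the group-relative advantage step follows; the reward values, group size, and epsilon constant are illustrative assumptions, not values from the actual Nemotron training run.

```python
def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its sampling group's mean and std.

    In GRPO, several responses are sampled per prompt; each response's
    advantage is its reward relative to the group, which avoids training
    a separate value function.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical example: 4 sampled responses to one prompt, each scored
# by a reward model; advantages are centered around the group mean.
advantages = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
```

Advantages computed this way sum to zero within each group, so above-average responses are reinforced and below-average ones penalized relative to their peers.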