---
title: Reward Policy Intuition
emoji: ππ
colorFrom: purple
colorTo: red
sdk: docker
pinned: true
license: mit
arxiv: 2601.05242
short_description: 'GRPO vs GDPO: Understanding Multi-Reward Policy Optimization'
---
# GRPO vs GDPO: Why Normalization Order Matters
Interactive visualization demonstrating advantage collapse in multi-reward reinforcement learning, and how GDPO fixes it.
Based on NVIDIA's GDPO paper (arXiv:2601.05242).
## The Problem

When training LLMs with multiple reward signals (correctness, format, style), GRPO normalizes the combined reward. This causes advantage collapse: smaller-scale rewards get washed out by larger-scale ones.
| Method | Normalization | Result |
|---|---|---|
| GRPO | Aggregate → Normalize | Small-scale signals lost |
| GDPO | Normalize → Aggregate | All signals preserved |
## The Solution
GDPO normalizes each reward dimension independently (to mean=0, std=1) before combining them. This ensures every reward contributes proportionally to its weight, regardless of original scale.
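The difference is just the order of two operations. A minimal NumPy sketch of the two orderings (the function names, the weight vector, and the ε guard are illustrative, not the paper's or any library's API):

```python
import numpy as np

def gdpo_advantages(rewards, weights=None):
    """GDPO: z-score each reward dimension across the group, then aggregate.

    rewards: (n_rollouts, n_dims) array of per-rollout reward components.
    """
    rewards = np.asarray(rewards, dtype=float)
    if weights is None:
        weights = np.ones(rewards.shape[1])
    # Per-dimension normalization removes each signal's original scale.
    z = (rewards - rewards.mean(axis=0)) / (rewards.std(axis=0) + 1e-8)
    return z @ weights

def grpo_advantages(rewards, weights=None):
    """GRPO: aggregate into one scalar reward, then z-score the totals."""
    rewards = np.asarray(rewards, dtype=float)
    if weights is None:
        weights = np.ones(rewards.shape[1])
    total = rewards @ weights
    # A single normalization lets large-scale dimensions dominate the total.
    return (total - total.mean()) / (total.std() + 1e-8)
```

With a group like `[[10, 0], [0, 1], [10, 1], [0, 0]]`, the small second reward shifts the GDPO advantage as much as the large first one, while the GRPO advantage is driven almost entirely by the first column.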
## Binary Rewards Widget
Based on the Berkeley Function Calling Leaderboard (BFCL) dataset. Toggle binary rewards for 12 rollouts:
- Correctness: Does the function call execute?
- Style: Are arguments formatted correctly?
- Conciseness: Free of redundant parameters?
See how GRPO assigns identical advantages to [1,0,1] and [0,1,1] (same total), while GDPO differentiates them.
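The collapse is easy to reproduce with a few lines of NumPy. The six-rollout group below is made up for illustration; note that GDPO separates equal-total rollouts only when the reward dimensions have different group statistics, as they do here:

```python
import numpy as np

# Hypothetical group of 6 rollouts with three binary rewards
# [correctness, style, conciseness]; rows 0 and 1 have the same total.
R = np.array([
    [1, 0, 1],
    [0, 1, 1],
    [1, 1, 0],
    [1, 1, 1],
    [0, 0, 1],
    [1, 0, 0],
], dtype=float)

# GRPO: aggregate first, then z-score the totals.
total = R.sum(axis=1)
grpo = (total - total.mean()) / (total.std() + 1e-8)

# GDPO: z-score each reward column, then aggregate.
z = (R - R.mean(axis=0)) / (R.std(axis=0) + 1e-8)
gdpo = z.sum(axis=1)

print(grpo)  # rows 0 and 1 collapse to the same advantage
print(gdpo)  # rows 0 and 1 receive distinct advantages
```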
## Training Convergence
Train a toy Bernoulli policy on 3 binary rewards:
- GDPO: All dimensions converge to p→1 independently
- GRPO: All dimensions collapse to the same trajectory
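A rough sketch of such a toy setup, assuming a REINFORCE-style update on independent Bernoulli logits (the reward scales, learning rate, and group size are illustrative; the widget's actual implementation may differ):

```python
import numpy as np

def train(mode, epochs=150, group=12, lr=0.5, seed=0):
    """Train 3 independent Bernoulli 'behaviours' with a shared group advantage."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(3)                    # logits, p = sigmoid(theta)
    scales = np.array([5.0, 1.0, 1.0])     # dim 0 pays a much larger reward
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-theta))
        acts = (rng.random((group, 3)) < p).astype(float)  # sampled rollouts
        rewards = acts * scales            # reward i fires iff behaviour i is shown
        if mode == "gdpo":
            # Normalize each reward dimension across the group, then aggregate.
            z = (rewards - rewards.mean(0)) / (rewards.std(0) + 1e-8)
            adv = z.sum(1)
        else:
            # GRPO: aggregate into a total, then normalize once.
            total = rewards.sum(1)
            adv = (total - total.mean()) / (total.std() + 1e-8)
        # REINFORCE gradient for Bernoulli log-probs, averaged over the group.
        theta += lr * ((acts - p) * adv[:, None]).mean(0)
    return 1.0 / (1.0 + np.exp(-theta))    # final per-dimension probabilities
```

Because GDPO's per-dimension z-score divides out each reward's scale, every behaviour gets its own learning signal; under GRPO the single z-scored total is dominated by the large-scale dimension, so the smaller dimensions learn far more slowly.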
## Key Visualizations

### Advantage Bar Chart

Side-by-side comparison of GRPO vs GDPO advantages, sorted by GDPO rank. Detects and highlights advantage collapse when multiple rollouts receive identical GRPO advantages.
### Policy Convergence Plot
Shows probability trajectories over 150 training epochs. GDPO learns each reward dimension independently; GRPO can't distinguish which rewards matter.
## When to Use Each
| Use GDPO | Use GRPO |
|---|---|
| Multiple reward scales | Single reward |
| Binary + continuous rewards | Similar scales |
| All signals matter equally | One dominant reward |
## Implementation

It's a one-line change:

- TRL: `apply_gdpo: True`
- VERL: `adv_estimator: 'gdpo'`
## References
- GDPO Paper: NVIDIA, arXiv:2601.05242
- Code: github.com/NVlabs/GDPO
- Dataset: Berkeley Function Calling Leaderboard
Check out marimo at https://github.com/marimo-team/marimo