---
title: Reward Policy Intuition
emoji: 🏃📊
colorFrom: purple
colorTo: red
sdk: docker
pinned: true
license: mit
arxiv: 2601.05242
short_description: 'GRPO vs GDPO: Understanding Multi-Reward Policy Optimization'
---

# GRPO vs GDPO: Why Normalization Order Matters

Interactive visualization demonstrating advantage collapse in multi-reward reinforcement learning, and how GDPO fixes it.

Based on NVIDIA's GDPO paper (arXiv:2601.05242).

## The Problem

When training LLMs with multiple reward signals (correctness, format, style), GRPO aggregates them into a single reward and then normalizes it. This causes *advantage collapse*: smaller-scale rewards get washed out by larger-scale ones.

| Method | Normalization | Result |
|--------|---------------|--------|
| GRPO | Aggregate → Normalize | Small-scale signals lost |
| GDPO | Normalize → Aggregate | All signals preserved |

## The Solution

GDPO normalizes each reward dimension independently (to mean=0, std=1) before combining them. This ensures every reward contributes proportionally to its weight, regardless of original scale.

$$\text{GRPO: } A_j = \frac{\sum_i r_j^{(i)} - \mu}{\sigma} \qquad \text{vs} \qquad \text{GDPO: } A_j = \sum_i \frac{r_j^{(i)} - \mu^{(i)}}{\sigma^{(i)}}$$

## Binary Rewards Widget

Based on the Berkeley Function Calling Leaderboard (BFCL) dataset. Toggle binary rewards for 12 rollouts:

- **Correctness:** Does the function call execute?
- **Style:** Are arguments formatted correctly?
- **Conciseness:** Free of redundant parameters?

See how GRPO assigns identical advantages to [1,0,1] and [0,1,1] (same total), while GDPO differentiates them.
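This is easy to check numerically. Below is a minimal NumPy sketch of both estimators (an illustration, not the Space's actual code): rollouts with the same reward total always receive identical GRPO advantages, while GDPO separates them whenever the per-dimension group statistics differ.

```python
import numpy as np

def grpo_advantages(rewards):
    """GRPO: sum reward dimensions per rollout, then normalize the totals."""
    rewards = np.asarray(rewards, dtype=float)
    totals = rewards.sum(axis=1)
    return (totals - totals.mean()) / (totals.std() + 1e-8)

def gdpo_advantages(rewards):
    """GDPO: z-score each reward dimension across the group, then sum."""
    rewards = np.asarray(rewards, dtype=float)
    z = (rewards - rewards.mean(axis=0)) / (rewards.std(axis=0) + 1e-8)
    return z.sum(axis=1)

# Four rollouts x three binary rewards (correctness, style, conciseness).
group = np.array([
    [1, 0, 1],
    [0, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
])

print(grpo_advantages(group))  # the first three rollouts are indistinguishable
print(gdpo_advantages(group))  # same totals, but different advantages
```

Per-dimension weights, as in the weighted aggregate of the paper, would multiply each column's contribution before summing; the unweighted sums above match the formulas as written.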

## Training Convergence

Train a toy Bernoulli policy on 3 binary rewards:

- GDPO: All dimensions converge to p≈1 independently
- GRPO: All dimensions collapse to the same trajectory
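A loop of this shape can be sketched as follows. Assumptions not taken from the Space's code: REINFORCE on a factorized 3-bit Bernoulli policy, group size 64, and reward r_k = a_k for each bit, so every probability should head toward 1. Note that in this fully symmetric toy GRPO's aggregate advantage also pushes all bits up; the collapse the widget demonstrates appears when the reward dimensions have differing scales or statistics.

```python
import numpy as np

def gdpo_advantages(rewards):
    """Z-score each reward dimension across the group, then sum (GDPO)."""
    z = (rewards - rewards.mean(axis=0)) / (rewards.std(axis=0) + 1e-8)
    return z.sum(axis=1)

def train(adv_fn, epochs=150, group_size=64, lr=0.3, seed=0):
    """REINFORCE on a 3-bit factorized Bernoulli policy; returns p history."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(3)                      # logits, p_k = sigmoid(theta_k)
    history = []
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-theta))
        actions = (rng.random((group_size, 3)) < p).astype(float)
        adv = adv_fn(actions)                # rewards are the action bits themselves
        # Score function for a Bernoulli: d/dtheta log pi(a) = a - p
        theta += lr * ((actions - p) * adv[:, None]).mean(axis=0)
        history.append(p)
    return np.array(history)

probs = train(gdpo_advantages)
print(probs[-1])   # all three probabilities end close to 1
```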

## Key Visualizations

### Advantage Bar Chart

Side-by-side comparison of GRPO vs GDPO advantages, sorted by GDPO rank. Detects and highlights advantage collapse when multiple rollouts receive identical GRPO advantages.

### Policy Convergence Plot

Shows probability trajectories over 150 training epochs. GDPO learns each reward dimension independently; GRPO can't distinguish which rewards matter.

## When to Use Each

| Use GDPO | Use GRPO |
|----------|----------|
| Multiple reward scales | Single reward |
| Binary + continuous rewards | Similar scales |
| All signals matter equally | One dominant reward |

## Implementation

It's a one-line change:

- TRL: `apply_gdpo: True`
- VERL: `adv_estimator: 'gdpo'`

## References

- GDPO paper: arXiv:2601.05242

Check out marimo at https://github.com/marimo-team/marimo