---
title: Reward Policy Intuition
emoji: 🏃📊
colorFrom: purple
colorTo: red
sdk: docker
pinned: true
license: mit
arxiv: 2601.05242
short_description: 'GRPO vs GDPO: Understanding Multi-Reward Policy Optimization'
---

# GRPO vs GDPO: Why Normalization Order Matters

Interactive visualization demonstrating advantage collapse in multi-reward reinforcement learning, and how GDPO fixes it.

Based on NVIDIA's GDPO paper (arXiv:2601.05242).

## The Problem

When training LLMs with multiple reward signals (correctness, format, style), GRPO aggregates them into a single reward and then normalizes it. This causes *advantage collapse*: smaller-scale rewards get washed out by larger-scale ones.

| Method | Normalization | Result |
|--------|---------------|--------|
| GRPO | Aggregate → Normalize | Small-scale signals lost |
| GDPO | Normalize → Aggregate | All signals preserved |

## The Solution

GDPO normalizes each reward dimension independently (to mean=0, std=1) before combining them. This ensures every reward contributes proportionally to its weight, regardless of original scale.

$$\text{GRPO: } A_j = \frac{\sum_i r_j^{(i)} - \mu}{\sigma} \qquad \text{vs} \qquad \text{GDPO: } A_j = \sum_i \frac{r_j^{(i)} - \mu^{(i)}}{\sigma^{(i)}}$$

## Binary Rewards Widget

Based on the Berkeley Function Calling Leaderboard (BFCL) dataset. Toggle binary rewards for 12 rollouts:

- **Correctness:** Does the function call execute?
- **Style:** Are arguments formatted correctly?
- **Conciseness:** Free of redundant parameters?

See how GRPO assigns identical advantages to [1,0,1] and [0,1,1] (same total), while GDPO differentiates them.
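This is easy to check numerically. Below is a minimal NumPy sketch of both estimators (an illustration, not the Space's actual code): rollouts with the same reward total always receive identical GRPO advantages, while GDPO separates them whenever the per-dimension group statistics differ.

```python
import numpy as np

def grpo_advantages(rewards):
    """GRPO: sum reward dimensions per rollout, then normalize the totals."""
    rewards = np.asarray(rewards, dtype=float)
    totals = rewards.sum(axis=1)
    return (totals - totals.mean()) / (totals.std() + 1e-8)

def gdpo_advantages(rewards):
    """GDPO: z-score each reward dimension across the group, then sum."""
    rewards = np.asarray(rewards, dtype=float)
    z = (rewards - rewards.mean(axis=0)) / (rewards.std(axis=0) + 1e-8)
    return z.sum(axis=1)

# Four rollouts x three binary rewards (correctness, style, conciseness).
group = np.array([
    [1, 0, 1],
    [0, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
])

print(grpo_advantages(group))  # the first three rollouts are indistinguishable
print(gdpo_advantages(group))  # same totals, but different advantages
```

Per-dimension weights, as in the weighted aggregate of the paper, would multiply each column's contribution before summing; the unweighted sums above match the formulas as written.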

## Training Convergence

Train a toy Bernoulli policy on 3 binary rewards:

- GDPO: All dimensions converge to p≈1 independently
- GRPO: All dimensions collapse to the same trajectory
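A loop of this shape can be sketched as follows. Assumptions not taken from the Space's code: REINFORCE on a factorized 3-bit Bernoulli policy, group size 64, and reward r_k = a_k for each bit, so every probability should head toward 1. Note that in this fully symmetric toy GRPO's aggregate advantage also pushes all bits up; the collapse the widget demonstrates appears when the reward dimensions have differing scales or statistics.

```python
import numpy as np

def gdpo_advantages(rewards):
    """Z-score each reward dimension across the group, then sum (GDPO)."""
    z = (rewards - rewards.mean(axis=0)) / (rewards.std(axis=0) + 1e-8)
    return z.sum(axis=1)

def train(adv_fn, epochs=150, group_size=64, lr=0.3, seed=0):
    """REINFORCE on a 3-bit factorized Bernoulli policy; returns p history."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(3)                      # logits, p_k = sigmoid(theta_k)
    history = []
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-theta))
        actions = (rng.random((group_size, 3)) < p).astype(float)
        adv = adv_fn(actions)                # rewards are the action bits themselves
        # Score function for a Bernoulli: d/dtheta log pi(a) = a - p
        theta += lr * ((actions - p) * adv[:, None]).mean(axis=0)
        history.append(p)
    return np.array(history)

probs = train(gdpo_advantages)
print(probs[-1])   # all three probabilities end close to 1
```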

## Key Visualizations

### Advantage Bar Chart

Side-by-side comparison of GRPO vs GDPO advantages, sorted by GDPO rank. Detects and highlights advantage collapse when multiple rollouts receive identical GRPO advantages.

### Policy Convergence Plot

Shows probability trajectories over 150 training epochs. GDPO learns each reward dimension independently; GRPO can't distinguish which rewards matter.

## When to Use Each

| Use GDPO | Use GRPO |
|----------|----------|
| Multiple reward scales | Single reward |
| Binary + continuous rewards | Similar scales |
| All signals matter equally | One dominant reward |

## Implementation

It's a one-line change:

- TRL: `apply_gdpo: True`
- VERL: `adv_estimator: 'gdpo'`

## References

- GDPO paper: arXiv:2601.05242

Check out marimo at https://github.com/marimo-team/marimo