Abstract
SAGE is an on-policy reinforcement learning framework that enhances GRPO by injecting self-hints during training to increase outcome diversity under sparse rewards, improving alignment of large language models.
Group Relative Policy Optimization (GRPO) has recently emerged as a practical recipe for aligning large language models with verifiable objectives. However, under sparse terminal rewards, GRPO often stalls because rollouts within a group frequently receive identical rewards, causing relative advantages to collapse and updates to vanish. We propose self-hint aligned GRPO with privileged supervision (SAGE), an on-policy reinforcement learning framework that injects privileged hints during training to reshape the rollout distribution under the same terminal verifier reward. For each prompt x, the model samples a compact hint h (e.g., a plan or decomposition) and then generates a solution τ conditioned on (x, h). Crucially, the task reward R(x, τ) is unchanged; hints only increase within-group outcome diversity under finite sampling, preventing GRPO advantages from collapsing under sparse rewards. At test time, we set h = ∅ and deploy the no-hint policy without any privileged information. Moreover, sampling diverse self-hints serves as an adaptive curriculum that tracks the learner's bottlenecks more effectively than fixed hints from an initial policy or a stronger external model. Experiments over 6 benchmarks with 3 LLMs show that SAGE consistently outperforms GRPO, with average gains of +2.0 on Llama-3.2-3B-Instruct, +1.2 on Qwen2.5-7B-Instruct, and +1.3 on Qwen3-4B-Instruct. The code is available at https://github.com/BaohaoLiao/SAGE.
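To make the failure mode concrete, below is a minimal sketch of GRPO's group-relative advantage under a sparse binary verifier reward; the `group_relative_advantages` helper is illustrative, not the paper's implementation.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """Standardize terminal rewards within one rollout group (GRPO-style)."""
    r = np.asarray(rewards, dtype=np.float64)
    # If every rollout receives the same reward (e.g., all 0 under a sparse
    # verifier), the centered rewards are all zero and the update vanishes.
    return (r - r.mean()) / (r.std() + eps)

# Homogeneous group: no learning signal.
print(group_relative_advantages([0, 0, 0, 0]))   # [0. 0. 0. 0.]

# Hint-conditioned rollouts keep the same verifier reward but diversify
# outcomes within the group, so advantages no longer collapse.
print(group_relative_advantages([1, 0, 0, 1]))   # ~[ 1. -1. -1.  1.]
```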
Community
RL for LLMs often stalls under sparse rewards, especially with GRPO, where whole rollout groups get identical rewards (often all 0) and learning just… dies.
💡 SAGE fixes this with a simple but powerful idea:
👉 Let the model give itself hints during training.
How it works (see the sketch after this list):
- The model samples a compact hint (plan / decomposition) before solving
- Rewards stay unchanged (same verifier, same objective)
- Hints only reshape sampling, preventing advantage collapse
- At test time? No hints at all. Clean deployment.
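Here is a minimal sketch of that train/test asymmetry, assuming a generic `policy.generate` sampling call and an unchanged `verifier` reward; both names are hypothetical, not SAGE's actual API.

```python
def rollout(policy, verifier, prompt: str, train: bool = True):
    # Training: sample a compact self-hint (plan / decomposition) first,
    # then solve conditioned on (prompt, hint).
    # Test time: the hint is empty, so the deployed policy sees no privileged info.
    hint = policy.generate(f"Write a brief plan for: {prompt}") if train else ""
    query = f"{prompt}\nHint: {hint}" if hint else prompt
    solution = policy.generate(query)
    # The terminal reward comes from the same verifier in both regimes.
    return solution, verifier(prompt, solution)
```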
🔥 Why it matters:
- Turns dead-end prompts into useful learning signals
- Acts as an adaptive curriculum driven by the model itself
- Stays fully on-policy (no external teachers required)
📊 Average gains over GRPO across 6 benchmarks & 3 LLMs:
- +2.0 on Llama-3.2-3B
- +1.2 on Qwen2.5-7B
- +1.3 on Qwen3-4B
Sometimes the best teacher is… yourself 😌
Code: https://github.com/BaohaoLiao/SAGE
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- TMS: Trajectory-Mixed Supervision for Reward-Free, On-Policy SFT (2026)
- Beyond Variance: Prompt-Efficient RLVR via Rare-Event Amplification and Bidirectional Pairing (2026)
- Less Noise, More Voice: Reinforcement Learning for Reasoning via Instruction Purification (2026)
- RC-GRPO: Reward-Conditioned Group Relative Policy Optimization for Multi-Turn Tool Calling Agents (2026)
- KEPO: Knowledge-Enhanced Preference Optimization for Reinforcement Learning with Reasoning (2026)
- Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models (2026)
- ADHint: Adaptive Hints with Difficulty Priors for Reinforcement Learning (2025)