---
datasets:
- nvidia/HelpSteer3
- Skywork/Skywork-Reward-Preference-80K-v0.2
- Vezora/Code-Preference-Pairs
- xinlai/Math-Step-DPO-10K
language:
- en
base_model:
- Qwen/Qwen3-14B
library_name: transformers
tags:
- reward_model
- nvidia
- qwen3
license: other
license_name: nvidia-internal-scientific-research-and-development-model-license
license_link: >-
  https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-internal-scientific-research-and-development-model-license/
---
# BR-RM: Branch-and-Rethink Reasoning Reward Model
## Model Overview
**BR-RM (Branch-and-Rethink Reasoning Reward Model)** is a reward model that implements a two-turn reasoning framework for evaluating LLM-generated responses. Unlike traditional reward models that compress all quality dimensions into a single scalar in one shot, BR-RM performs **adaptive branching** to focus on instance-critical dimensions, followed by **branch-conditioned rethinking** for targeted deep analysis.
This model achieves **state-of-the-art average performance** across three major reward modeling benchmarks (RewardBench, RM-Bench, and RMB) by addressing the "judgment diffusion" problem, in which models spread their attention too thinly across evaluation criteria.
### Key Features
- **Adaptive Focus**: Dynamically selects 1-3 critical evaluation dimensions per instance
- **Two-Turn Reasoning**: First turn branches, second turn performs deep conditioned analysis
- **SOTA Performance**: Top results on RewardBench (92.1%), RM-Bench (85.9%), and RMB (74.7%)
- **RLHF Compatible**: Designed to integrate seamlessly with standard RLHF pipelines
### Model Variants
| Model | Parameters | RewardBench | RM-Bench | RMB | Average |
|-------|-----------|-------------|----------|-----|---------|
| **Qwen3-Nemotron-8B-BRRM** | 8B | 91.0 | 85.0 | 71.8 | 82.6 |
| **Qwen3-Nemotron-14B-BRRM** | 14B | 92.1 | 85.9 | 74.7 | 84.2 |
## How It Works
### Two-Turn Framework
**Turn 1: Adaptive Branching**
```
Input: User query + Two candidate responses
Output:
1. Selected critical dimensions (e.g., "Logical Reasoning", "Computational Precision")
2. Initial issue detection for each response
```
**Turn 2: Branch-Conditioned Rethinking**
```
Input: Turn 1 results + Evaluation hierarchy
Output: Final comparative judgment and preference ranking
```
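Because the Turn 1 reply is expected to follow the tagged format requested in the Quick Start prompt below, the selected dimensions can be recovered with simple string parsing. The sample reply and the `parse_focus` helper here are a minimal, hypothetical sketch (not part of the released model's API); they only assume the model reproduces the `[Quality Assessment Focus] ... [End of Quality Assessment Focus]` tags.
```python
import re

# Hypothetical Turn 1 reply that follows the tagged format requested in the Quick Start prompt.
sample_turn1 = """[Quality Assessment Focus]
Computational Precision, Logical Reasoning
[End of Quality Assessment Focus]
[Quality Analysis for Response 1]
- Critical Issues: None identified
[End of Quality Analysis for Response 1]
[Quality Analysis for Response 2]
- Critical Issues: Arithmetic error (2+2 is not 5)
[End of Quality Analysis for Response 2]"""

def parse_focus(turn1_text):
    """Return the 1-3 selected evaluation dimensions from the [Quality Assessment Focus] block."""
    m = re.search(
        r"\[Quality Assessment Focus\](.*?)\[End of Quality Assessment Focus\]",
        turn1_text,
        re.DOTALL,
    )
    return [dim.strip() for dim in m.group(1).split(",")] if m else []

print(parse_focus(sample_turn1))  # ['Computational Precision', 'Logical Reasoning']
```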
## Quick Start
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer
model_name = "nvidia/Qwen3-Nemotron-14B-BRRM" # or nvidia/Qwen3-Nemotron-8B-BRRM
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype=torch.bfloat16,
device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Example usage
context = "What is 2+2?"
response1 = "2+2=4"
response2 = "2+2=5"

# Format Turn 1: Adaptive Branching
turn1_prompt = f"""You are a response quality evaluator. Given the context and two responses, select the most important cognitive abilities and analyze critical issues.
**Context:**
{context}
**Responses:**
[The Begin of Response 1]
{response1}
[The End of Response 1]
[The Begin of Response 2]
{response2}
[The End of Response 2]
**Output Format:**
[Quality Assessment Focus]
Choose 1-3 abilities: Information Accuracy, Computational Precision, Logical Reasoning, Implementation Capability, Safety Awareness, Response Completeness, Instruction Adherence, Communication Clarity.
[End of Quality Assessment Focus]
[Quality Analysis for Response 1]
- Critical Issues: [List specific issues or "None identified"]
[End of Quality Analysis for Response 1]
[Quality Analysis for Response 2]
- Critical Issues: [List specific issues or "None identified"]
[End of Quality Analysis for Response 2]"""

# Generate Turn 1: the model selects 1-3 critical dimensions and flags issues in each response
messages = [{"role": "user", "content": turn1_prompt}]
input_ids = tokenizer.apply_chat_template(
messages,
return_tensors="pt",
add_generation_prompt=True
).to(model.device)
outputs = model.generate(
input_ids,
max_new_tokens=8192,
temperature=1.0,
top_p=0.95,
top_k=20,
do_sample=True,
pad_token_id=tokenizer.eos_token_id
)
turn1_response = tokenizer.decode(outputs[0][input_ids.shape[1]:], skip_special_tokens=False)

# Format Turn 2: Branch-Conditioned Rethinking
turn2_prompt = f"""You are making final comparative judgments using established evaluation priorities.
**Evaluation Hierarchies:**
- **Accuracy-Critical**: Correctness > Process > Presentation
- **Creative/Open-Ended**: User Intent > Content Quality > Creativity
- **Instruction-Following**: Adherence > Content > Clarity
[The Begin of Analysis on Response 1]
[Apply appropriate evaluation hierarchy]
[The End of Analysis on Response 1]
[The Begin of Analysis on Response 2]
[Apply appropriate evaluation hierarchy]
[The End of Analysis on Response 2]
[The Begin of Ranking Score]
\\boxed{{1 or 2}}
[The End of Ranking Score]"""

# Generate Turn 2: the model applies the relevant hierarchy and emits the final ranking in \boxed{...}
messages.append({"role": "assistant", "content": turn1_response})
messages.append({"role": "user", "content": turn2_prompt})
input_ids = tokenizer.apply_chat_template(
messages,
return_tensors="pt",
add_generation_prompt=True
).to(model.device)
outputs = model.generate(
input_ids,
max_new_tokens=8192,
temperature=1.0,
top_p=0.95,
top_k=20,
do_sample=True,
pad_token_id=tokenizer.eos_token_id
)
final_response = tokenizer.decode(outputs[0][input_ids.shape[1]:], skip_special_tokens=False)
```
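The Turn 2 reply ends with the preference inside a `\boxed{...}` block, as requested by `turn2_prompt`. Continuing the example above, a minimal sketch for extracting it (the `parse_ranking` helper is illustrative rather than an official API, and assumes the model follows the requested output format):
```python
import re

def parse_ranking(text):
    """Extract the preferred response index (1 or 2) from the Turn 2 output."""
    m = re.search(r"\\boxed\{\s*([12])\s*\}", text)
    return int(m.group(1)) if m else None

preferred = parse_ranking(final_response)
print(f"Preferred response: {preferred}")  # expected: 1, since "2+2=4" is the correct answer
```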
## Ethical Considerations
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their supporting model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
For more detailed information on ethical considerations for this model, please see the Model Card++ [Explainability](explainability.md), [Bias](bias.md), [Safety and Security](safety.md), and [Privacy](privacy.md) Subcards.
Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).
## Citation
If you find this model useful, please cite the following work:
```bibtex
@misc{jiao2025thinktwicebranchandrethinkreasoning,
title={Think Twice: Branch-and-Rethink Reasoning Reward Model},
author={Yizhu Jiao and Jiaqi Zeng and Julien Veron Vialard and Oleksii Kuchaiev and Jiawei Han and Olivier Delalleau},
year={2025},
eprint={2510.23596},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2510.23596},
}
``` |