Qwen3-32B Bradley-Terry Reward Model

A Bradley-Terry reward model fine-tuned from Qwen/Qwen3-32B for scoring the quality of questions about research papers.

Training Details

  • Base model: Qwen/Qwen3-32B
  • Dataset: Vidushee/BT_Preference_Dataset (28,049 train, 3,090 test pairs)
  • Training: 1 epoch (stopped at step 400 of 877), effective batch size 32 (1 per device x 8 gradient accumulation steps x 4 GPUs)
  • Hardware: 4x NVIDIA H100 80GB GPUs (single node)
  • Framework: HuggingFace Trainer + DeepSpeed ZeRO-3 + CPU optimizer offload
  • Learning rate: 1e-6 with cosine schedule and 3% warmup
  • Max sequence length: 12,288 tokens
  • Evaluation: 90.9% pairwise accuracy at step 400
  • Training approach: Adapted from RLHFlow/RLHF-Reward-Modeling (the pairwise objective is sketched below)
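
For context, the Bradley-Terry objective trains the model to emit a scalar reward per response and maximizes the log-sigmoid of the chosen-minus-rejected reward gap. A minimal sketch of that loss (the actual RLHFlow training loop adds batching, padding, and optimizer details not shown here):

import torch
import torch.nn.functional as F

def bradley_terry_loss(chosen_rewards: torch.Tensor,
                       rejected_rewards: torch.Tensor) -> torch.Tensor:
    # Pairwise objective: maximize log sigmoid(r_chosen - r_rejected),
    # i.e. push the chosen response's scalar reward above the rejected one's.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage: each reward is the single logit the classification head emits per sequence.
loss = bradley_terry_loss(torch.tensor([1.3, 0.2]), torch.tensor([0.4, -0.5]))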

Usage

import re
import torch
from transformers import AutoTokenizer, pipeline

model_path = "Vidushee/Qwen3-32B-BT-RewardModel"
tokenizer = AutoTokenizer.from_pretrained(model_path)

rm_pipe = pipeline(
    "sentiment-analysis",
    model=model_path,
    device=0,
    tokenizer=tokenizer,
    model_kwargs={"torch_dtype": torch.bfloat16, "attn_implementation": "flash_attention_2"},
    truncation=True,
    max_length=12288,
)

pipe_kwargs = {
    "return_all_scores": True,
    "function_to_apply": "none",
    "batch_size": 1,
}

# Format your conversation
chat = [
    {"role": "user", "content": "Your paper context here"},
    {"role": "assistant", "content": "Question to score"},
]

text = tokenizer.apply_chat_template(
    chat, tokenize=False, add_generation_prompt=False, enable_thinking=False
)
# Strip empty think blocks that Qwen3 inserts even with enable_thinking=False
text = re.sub(r"<think>\s*</think>\s*", "", text)
# Strip the trailing newline so the reward head pools from the final <|im_end|> token
text = text.rstrip("\n")

outputs = rm_pipe([text], **pipe_kwargs)
reward = outputs[0][0]["score"]
print(f"Reward: {reward}")

Comparing Two Responses

# Score chosen vs rejected responses
chosen_chat = [
    {"role": "user", "content": "Paper context..."},
    {"role": "assistant", "content": "Good question about the paper"},
]
rejected_chat = [
    {"role": "user", "content": "Paper context..."},
    {"role": "assistant", "content": "Bad question about the paper"},
]

def format_text(messages):
    text = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=False, enable_thinking=False
    )
    text = re.sub(r"<think>\s*</think>\s*", "", text)
    return text.rstrip("\n")

outputs = rm_pipe([format_text(chosen_chat), format_text(rejected_chat)], **pipe_kwargs)
chosen_reward = outputs[0][0]["score"]
rejected_reward = outputs[1][0]["score"]

print(f"Chosen reward: {chosen_reward:.4f}")
print(f"Rejected reward: {rejected_reward:.4f}")
print(f"Chosen is better: {chosen_reward > rejected_reward}")