Qwen3-8B JEE SDPO

A fine-tuned version of Qwen3-8B specialized for solving IIT JEE Advanced problems in Physics, Chemistry, and Mathematics with detailed chain-of-thought reasoning.

This model was trained in two stages:

  1. SFT — Supervised fine-tuning on JEE + competition math problems
  2. SDPO — Self-Distillation Preference Optimization using DPO with LLM-as-judge feedback

Model Details

| Property | Value |
|---|---|
| Base Model | Qwen/Qwen3-8B |
| SFT Base | vipsehgal/qwen3-8b-jee-sft |
| Format | Unquantized safetensors (bfloat16) |
| Size | ~15 GB |
| Architecture | Qwen3ForCausalLM, 36 layers, 32 heads, 4096 hidden |

Training Pipeline

Stage 1: SFT (Supervised Fine-Tuning)

  • Method: QLoRA (4-bit) on Apple M3 Pro using MLX
  • Data: 3,515 examples (3,163 train / 352 validation)
    • JEEBench CoT: 457 JEE Advanced questions with Claude Opus 4.6-generated step-by-step solutions (Physics, Chemistry, Mathematics)
    • NuminaMath-CoT: 2,706 filtered competition math problems (AMC, AIME, Olympiad-level)
  • Hyperparameters: lr=1e-5, 1000 iterations, batch=1, grad_accum=4, 8 LoRA layers, max_seq=512
  • Result: Validation loss 1.478 → 0.582
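The SFT data above is chat-formatted problem/solution pairs. A minimal sketch of what one training record might look like in chat-style JSONL (the exact schema used for this run is an assumption; mlx-lm also accepts plain `{"text": ...}` records):

```python
import json

# Illustrative SFT record in chat JSONL form. The problem/solution
# content below is a made-up example, not from the actual dataset.
record = {
    "messages": [
        {"role": "system", "content": "You are an expert IIT JEE tutor. Solve problems step-by-step using LaTeX notation. Show all work clearly and arrive at the final answer."},
        {"role": "user", "content": "A ball is dropped from 45 m. Find the time to reach the ground. (Take g = 10 m/s^2)"},
        {"role": "assistant", "content": "Using $h = \\frac{1}{2} g t^2$: $45 = 5t^2$, so $t^2 = 9$ and $t = 3$ s."},
    ]
}
line = json.dumps(record)  # one such record per line of train.jsonl / valid.jsonl
```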

Stage 2: SDPO (Self-Distillation Preference Optimization)

  • Method: DPO with LoRA on A100 GPU (Google Colab Pro+) using TRL
  • Process:
    1. Generated rollouts — the SFT model attempted JEE questions with temperature sampling
    2. Judged rollouts — rule-based answer extraction + correctness checking (numerical tolerance ±1%)
    3. Built DPO preference pairs:
      • Chosen: Correct rollout, LLM judge feedback, or gold training solution
      • Rejected: Incorrect rollout from the model
    4. Trained with DPO (beta=0.1) to prefer correct reasoning over incorrect attempts
  • Hyperparameters: lr=5e-6, 2 epochs, batch=2, grad_accum=4, LoRA r=16/alpha=32, max_length=1024, cosine scheduler
  • Optimizer: Paged AdamW 8-bit with gradient checkpointing
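The judging and pair-building steps above can be sketched in a few lines of Python. The function and field names below are illustrative, not the project's actual code; only the ±1% tolerance and the chosen/rejected fallback logic come from the description above:

```python
# Sketch of steps 2-3: judge rollouts, then build DPO preference pairs.
# Names are hypothetical; assumes answers are already extracted as floats.

def is_correct(predicted: float, gold: float, rel_tol: float = 0.01) -> bool:
    """Rule-based correctness check with +/-1% numerical tolerance."""
    if gold == 0:
        return abs(predicted) <= rel_tol
    return abs(predicted - gold) / abs(gold) <= rel_tol

def build_dpo_pairs(prompt, rollouts, gold_answer, gold_solution):
    """Pair each incorrect rollout (rejected) with a correct rollout,
    falling back to the gold training solution when none succeeded."""
    correct = [r for r in rollouts if is_correct(r["answer"], gold_answer)]
    incorrect = [r for r in rollouts if not is_correct(r["answer"], gold_answer)]
    chosen_text = correct[0]["text"] if correct else gold_solution
    return [
        {"prompt": prompt, "chosen": chosen_text, "rejected": r["text"]}
        for r in incorrect
    ]

pairs = build_dpo_pairs(
    "Find x such that 2x = 10.",
    [{"text": "x = 5", "answer": 5.0}, {"text": "x = 4", "answer": 4.0}],
    gold_answer=5.0,
    gold_solution="2x = 10, so x = 5.",
)
# -> one pair: chosen "x = 5", rejected "x = 4"
```

TRL's `DPOTrainer` consumes exactly this kind of prompt/chosen/rejected triple.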

Evaluation Results

Evaluated on 200 held-out JEEBench questions covering Physics, Chemistry, and Mathematics. All models used greedy decoding with a 2,048-token generation limit.

| Subject | Base Qwen3-8B | SFT | SDPO (this model) | Delta vs Base |
|---|---|---|---|---|
| Overall | 78/200 (39.0%) | 90/200 (45.0%) | 69/200 (34.5%) | -4.5% |
| Mathematics | 24/66 (36.4%) | 36/66 (54.5%) | 17/66 (25.8%) | -10.6% |
| Chemistry | 32/70 (45.7%) | 35/70 (50.0%) | 34/70 (48.6%) | +2.9% |
| Physics | 22/64 (34.4%) | 19/64 (29.7%) | 18/64 (28.1%) | -6.3% |

Key takeaways:

  • SDPO regressed overall compared to both the base model and the SFT model
  • Mathematics took the largest hit (-10.6% vs base, -28.7% vs SFT), likely because the DPO preference data (500 prompts, 2 rollouts each) had insufficient coverage of competition math
  • Chemistry held steady and slightly improved over the base (+2.9%)
  • Physics regressed modestly (-6.3% vs base)
  • These results suggest the DPO training needs more diverse preference data and possibly more rollouts per prompt to be effective

Intended Use

This model is designed for:

  • Solving IIT JEE Advanced level problems in Physics, Chemistry, and Mathematics
  • Generating step-by-step solutions with LaTeX notation
  • Educational tutoring for competitive exam preparation

Usage

With Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "vipsehgal/qwen3-8b-jee-sdpo", torch_dtype="auto", device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("vipsehgal/qwen3-8b-jee-sdpo")

messages = [
    {"role": "system", "content": "You are an expert IIT JEE tutor. Solve problems step-by-step using LaTeX notation. Show all work clearly and arrive at the final answer."},
    {"role": "user", "content": "A particle of mass 2 kg is projected vertically upward with velocity 20 m/s. Find the maximum height reached. (Take g = 10 m/s²)"}
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=1024)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

With MLX on Apple Silicon (recommended for Mac users)

Use the quantized version for fast local inference:

```shell
pip install mlx-lm

mlx_lm.generate \
    --model vipsehgal/qwen3-8b-jee-sdpo-mlx-4bit \
    --prompt "Solve: Find the number of real solutions of x^3 - 3x + 1 = 0"
```

See vipsehgal/qwen3-8b-jee-sdpo-mlx-4bit for the 4-bit MLX version (4.3 GB, ~30 tokens/sec on M3 Pro).

System Prompt

You are an expert IIT JEE tutor. Solve problems step-by-step using LaTeX notation. Show all work clearly and arrive at the final answer.

Architecture

| Parameter | Value |
|---|---|
| Model type | Qwen3ForCausalLM |
| Hidden size | 4,096 |
| Layers | 36 |
| Attention heads | 32 (8 KV heads, GQA) |
| FFN intermediate | 12,288 |
| Activation | SiLU |
| Vocab size | 151,936 |
| Max context | 40,960 tokens |
| RoPE theta | 1,000,000 |
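The ~8B parameter count follows from the table above. A back-of-the-envelope tally, assuming a head dimension of 128 and untied input/output embeddings (neither is stated in this card) and ignoring norm parameters:

```python
# Rough parameter count from the architecture table.
# Assumes head_dim = 128 and untied embeddings (assumptions, not from the card).
hidden, layers, ffn, vocab = 4096, 36, 12288, 151936
q_heads, kv_heads, head_dim = 32, 8, 128

embed = vocab * hidden                      # input embedding
lm_head = vocab * hidden                    # output projection (untied)
attn = (hidden * q_heads * head_dim         # Q projection
        + 2 * hidden * kv_heads * head_dim  # K, V (GQA: only 8 KV heads)
        + q_heads * head_dim * hidden)      # attention output projection
mlp = 3 * hidden * ffn                      # gate, up, down (SwiGLU-style)
total = embed + lm_head + layers * (attn + mlp)
print(f"{total / 1e9:.2f} B parameters")    # ≈ 8.19 B
```

At 2 bytes per bf16 parameter this is about 15.3 GiB, consistent with the ~15 GB checkpoint size listed earlier.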

Related Models

| Model | Size | Description |
|---|---|---|
| vipsehgal/qwen3-8b-jee-sft | 16.4 GB | SFT-only model (bf16) |
| vipsehgal/qwen3-8b-jee-sdpo | 15 GB | SFT + SDPO model (bf16) — this model |
| vipsehgal/qwen3-8b-jee-sdpo-mlx-4bit | 4.3 GB | Quantized MLX version for Mac inference |

Limitations

  • Training data is skewed toward mathematics (83%) vs physics/chemistry (17%)
  • DPO training used a subset of 500 prompts with 2 rollouts each — more data could improve results
  • SDPO regressed from SFT, particularly in Mathematics — the preference data needs more coverage
  • May produce incorrect reasoning steps while arriving at correct final answers — always verify solutions
  • Evaluated on 200 questions; full 515-question JEEBench benchmark not yet run

License

Apache 2.0 (following the base Qwen3-8B license)
