# Qwen3-8B JEE SDPO
A fine-tuned version of Qwen3-8B specialized for solving IIT JEE Advanced problems in Physics, Chemistry, and Mathematics with detailed chain-of-thought reasoning.
This model was trained in two stages:
- SFT — Supervised fine-tuning on JEE + competition math problems
- SDPO — Self-Distillation Preference Optimization using DPO with LLM-as-judge feedback
## Model Details
| Property | Value |
|---|---|
| Base Model | Qwen/Qwen3-8B |
| SFT Base | vipsehgal/qwen3-8b-jee-sft |
| Format | Full-precision safetensors (bfloat16) |
| Size | ~15 GB |
| Architecture | Qwen3ForCausalLM, 36 layers, 32 heads, 4096 hidden |
## Training Pipeline

### Stage 1: SFT (Supervised Fine-Tuning)
- Method: QLoRA (4-bit) on Apple M3 Pro using MLX
- Data: 3,515 examples (3,163 train / 352 validation)
  - JEEBench CoT: 457 JEE Advanced questions with Claude Opus 4.6-generated step-by-step solutions (Physics, Chemistry, Mathematics)
  - NuminaMath-CoT: 2,706 filtered competition math problems (AMC, AIME, Olympiad-level)
- Hyperparameters: lr=1e-5, 1000 iterations, batch=1, grad_accum=4, 8 LoRA layers, max_seq=512
- Result: Validation loss 1.478 → 0.582
### Stage 2: SDPO (Self-Distillation Preference Optimization)
- Method: DPO with LoRA on A100 GPU (Google Colab Pro+) using TRL
- Process:
  1. Generated rollouts: the SFT model attempted JEE questions with temperature sampling
  2. Judged rollouts: rule-based answer extraction plus correctness checking (numerical tolerance ±1%)
  3. Built DPO preference pairs:
     - Chosen: a correct rollout, LLM-judge feedback, or the gold training solution
     - Rejected: an incorrect rollout from the model
  4. Trained with DPO (beta=0.1) to prefer correct reasoning over incorrect attempts
- Hyperparameters: lr=5e-6, 2 epochs, batch=2, grad_accum=4, LoRA r=16/alpha=32, max_length=1024, cosine scheduler
- Optimizer: Paged AdamW 8-bit with gradient checkpointing
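The judging and pair-building steps above can be sketched as follows. This is a minimal illustrative reconstruction, not the exact training code: the regex-based `extract_final_answer` helper, the fallback to the gold solution, and the function names are all assumptions.

```python
import re

def extract_final_answer(text: str):
    """Pull the last number from a model rollout (illustrative heuristic)."""
    matches = re.findall(r"[-+]?\d*\.?\d+(?:[eE][-+]?\d+)?", text)
    return float(matches[-1]) if matches else None

def is_correct(rollout: str, gold: float, rel_tol: float = 0.01) -> bool:
    """Rule-based judge: numerical match within +/-1% relative tolerance."""
    pred = extract_final_answer(rollout)
    if pred is None:
        return False
    denom = abs(gold) if gold != 0 else 1.0
    return abs(pred - gold) / denom <= rel_tol

def build_preference_pair(prompt, rollouts, gold_value, gold_solution):
    """Pair one correct response (chosen) with one incorrect rollout (rejected)."""
    correct = [r for r in rollouts if is_correct(r, gold_value)]
    wrong = [r for r in rollouts if not is_correct(r, gold_value)]
    if not wrong:
        return None  # all rollouts correct: nothing to reject
    chosen = correct[0] if correct else gold_solution  # fall back to gold CoT
    return {"prompt": prompt, "chosen": chosen, "rejected": wrong[0]}
```

With 500 prompts and 2 rollouts each, a prompt yields at most one pair, which is consistent with the coverage concerns noted in the evaluation below.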
## Evaluation Results
Evaluated on 200 held-out questions from JEEBench covering Physics, Chemistry, and Mathematics. All models used greedy decoding with max 2,048 tokens.
| Subject | Base Qwen3-8B | SFT | SDPO (this model) | Delta vs Base |
|---|---|---|---|---|
| Overall | 78/200 (39.0%) | 90/200 (45.0%) | 69/200 (34.5%) | -4.5% |
| Mathematics | 24/66 (36.4%) | 36/66 (54.5%) | 17/66 (25.8%) | -10.6% |
| Chemistry | 32/70 (45.7%) | 35/70 (50.0%) | 34/70 (48.6%) | +2.9% |
| Physics | 22/64 (34.4%) | 19/64 (29.7%) | 18/64 (28.1%) | -6.3% |
Key takeaways:
- SDPO regressed overall compared to both the base model and the SFT model
- Mathematics took the largest hit (-10.6% vs base, -28.7% vs SFT), likely because the DPO preference data (500 prompts, 2 rollouts each) had insufficient coverage of competition math
- Chemistry held steady and slightly improved over the base (+2.9%)
- Physics regressed modestly (-6.3% vs base)
- These results suggest the DPO training needs more diverse preference data and possibly more rollouts per prompt to be effective
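The per-subject percentages and deltas in the table can be reproduced from the raw counts (deltas are differences of the rounded percentages, matching the table):

```python
# Correct-answer counts per subject: (base Qwen3-8B, SDPO, questions)
counts = {
    "Mathematics": (24, 17, 66),
    "Chemistry":   (32, 34, 70),
    "Physics":     (22, 18, 64),
}

for subject, (base, sdpo, n) in counts.items():
    base_pct = round(100 * base / n, 1)
    sdpo_pct = round(100 * sdpo / n, 1)
    print(f"{subject}: {sdpo_pct}% ({sdpo_pct - base_pct:+.1f}% vs base)")
# Mathematics: 25.8% (-10.6% vs base)
# Chemistry: 48.6% (+2.9% vs base)
# Physics: 28.1% (-6.3% vs base)
```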
## Intended Use
This model is designed for:
- Solving IIT JEE Advanced level problems in Physics, Chemistry, and Mathematics
- Generating step-by-step solutions with LaTeX notation
- Educational tutoring for competitive exam preparation
## Usage

### With Transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "vipsehgal/qwen3-8b-jee-sdpo", torch_dtype="auto", device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("vipsehgal/qwen3-8b-jee-sdpo")

messages = [
    {"role": "system", "content": "You are an expert IIT JEE tutor. Solve problems step-by-step using LaTeX notation. Show all work clearly and arrive at the final answer."},
    {"role": "user", "content": "A particle of mass 2 kg is projected vertically upward with velocity 20 m/s. Find the maximum height reached. (Take g = 10 m/s²)"},
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=1024)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
### With MLX on Apple Silicon (recommended for Mac users)
Use the quantized version for fast local inference:
```bash
pip install mlx-lm

mlx_lm.generate \
  --model vipsehgal/qwen3-8b-jee-sdpo-mlx-4bit \
  --prompt "Solve: Find the number of real solutions of x^3 - 3x + 1 = 0"
```
See vipsehgal/qwen3-8b-jee-sdpo-mlx-4bit for the 4-bit MLX version (4.3 GB, ~30 tokens/sec on M3 Pro).
## System Prompt

```
You are an expert IIT JEE tutor. Solve problems step-by-step using LaTeX notation. Show all work clearly and arrive at the final answer.
```
## Architecture
| Parameter | Value |
|---|---|
| Model type | Qwen3ForCausalLM |
| Hidden size | 4,096 |
| Layers | 36 |
| Attention heads | 32 (8 KV heads, GQA) |
| FFN intermediate | 12,288 |
| Activation | SiLU |
| Vocab size | 151,936 |
| Max context | 40,960 tokens |
| RoPE theta | 1,000,000 |
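As a rough cross-check of the reported checkpoint sizes, the architecture numbers above imply about 8.2B parameters, assuming 128-dim attention heads and untied input/output embeddings (standard for Qwen3-8B) and ignoring the small norm weights:

```python
hidden, layers, heads, kv_heads, head_dim = 4096, 36, 32, 8, 128
ffn, vocab = 12288, 151936

# Attention: Q and O are hidden x (heads*head_dim); K and V use 8 KV heads (GQA)
attn = 2 * hidden * heads * head_dim + 2 * hidden * kv_heads * head_dim
# SwiGLU MLP: gate, up, and down projections
mlp = 3 * hidden * ffn
per_layer = attn + mlp

# Untied embedding matrix plus LM head (an assumption for this estimate)
embed = 2 * vocab * hidden

total = layers * per_layer + embed
print(f"{total / 1e9:.2f}B params")                              # 8.19B params
print(f"bf16: {2 * total / 1e9:.1f} GB = {2 * total / 2**30:.1f} GiB")  # 16.4 GB = 15.3 GiB
```

At 2 bytes per parameter in bf16 this gives ~16.4 GB (decimal) or ~15.3 GiB (binary), consistent with the 16.4 GB and ~15 GB figures listed for these checkpoints.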
## Related Models
| Model | Size | Description |
|---|---|---|
| vipsehgal/qwen3-8b-jee-sft | 16.4 GB | SFT-only model (bf16) |
| vipsehgal/qwen3-8b-jee-sdpo | 15 GB | SFT + SDPO model (bf16) — this model |
| vipsehgal/qwen3-8b-jee-sdpo-mlx-4bit | 4.3 GB | Quantized MLX version for Mac inference |
## Limitations
- Training data is skewed toward mathematics (~83%) vs. physics/chemistry (~17%)
- DPO training used a subset of 500 prompts with 2 rollouts each; more data could improve results
- SDPO regressed from SFT, particularly in Mathematics — the preference data needs more coverage
- May produce incorrect reasoning steps while arriving at correct final answers — always verify solutions
- Evaluated on 200 questions; full 515-question JEEBench benchmark not yet run
## License
Apache 2.0 (following the base Qwen3-8B license)