# Qwen3-8B JEE SDPO
A fine-tuned version of Qwen3-8B specialized for solving IIT JEE Advanced problems in Physics, Chemistry, and Mathematics with detailed chain-of-thought reasoning.
This model was trained in two stages:
- SFT — Supervised fine-tuning on JEE + competition math problems
- SDPO — Self-Distillation Preference Optimization using DPO with LLM-as-judge feedback
## Model Details
| Property | Value |
|---|---|
| Base Model | Qwen/Qwen3-8B |
| SFT Base | vipsehgal/qwen3-8b-jee-sft |
| Format | Full-precision safetensors (bfloat16) |
| Size | ~15 GB |
| Architecture | Qwen3ForCausalLM, 36 layers, 32 heads, 4096 hidden |
## Training Pipeline

### Stage 1: SFT (Supervised Fine-Tuning)
- Method: QLoRA (4-bit) on Apple M3 Pro using MLX
- Data: 3,515 examples (3,163 train / 352 validation)
  - JEEBench CoT: 457 JEE Advanced questions with Claude Opus 4.6-generated step-by-step solutions (Physics, Chemistry, Mathematics)
  - NuminaMath-CoT: 2,706 filtered competition math problems (AMC, AIME, Olympiad-level)
- Hyperparameters: lr=1e-5, 1000 iterations, batch=1, grad_accum=4, 8 LoRA layers, max_seq=512
- Result: Validation loss 1.478 → 0.582
### Stage 2: SDPO (Self-Distillation Preference Optimization)
- Method: DPO with LoRA on A100 GPU (Google Colab Pro+) using TRL
- Process:
  1. Generated rollouts: the SFT model attempted JEE questions with temperature sampling
  2. Judged rollouts: rule-based answer extraction plus correctness checking (numerical tolerance ±1%)
  3. Built DPO preference pairs:
     - Chosen: a correct rollout, LLM-judge feedback, or the gold training solution
     - Rejected: an incorrect rollout from the model
  4. Trained with DPO (beta=0.1) to prefer correct reasoning over incorrect attempts
- Hyperparameters: lr=5e-6, 2 epochs, batch=2, grad_accum=4, LoRA r=16/alpha=32, max_length=1024, cosine scheduler
- Optimizer: Paged AdamW 8-bit with gradient checkpointing
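The judging and pair-building steps above can be sketched as follows. This is a minimal illustrative reconstruction, not the exact training code: the regex-based `extract_final_answer` helper, the fallback to the gold solution, and the function names are all assumptions.

```python
import re

def extract_final_answer(text: str):
    """Pull the last number from a model rollout (illustrative heuristic)."""
    matches = re.findall(r"[-+]?\d*\.?\d+(?:[eE][-+]?\d+)?", text)
    return float(matches[-1]) if matches else None

def is_correct(rollout: str, gold: float, rel_tol: float = 0.01) -> bool:
    """Rule-based judge: numerical match within +/-1% relative tolerance."""
    pred = extract_final_answer(rollout)
    if pred is None:
        return False
    denom = abs(gold) if gold != 0 else 1.0
    return abs(pred - gold) / denom <= rel_tol

def build_preference_pair(prompt, rollouts, gold_value, gold_solution):
    """Pair one correct response (chosen) with one incorrect rollout (rejected)."""
    correct = [r for r in rollouts if is_correct(r, gold_value)]
    wrong = [r for r in rollouts if not is_correct(r, gold_value)]
    if not wrong:
        return None  # all rollouts correct: nothing to reject
    chosen = correct[0] if correct else gold_solution  # fall back to gold CoT
    return {"prompt": prompt, "chosen": chosen, "rejected": wrong[0]}
```

With 500 prompts and 2 rollouts each, a prompt yields at most one pair, which is consistent with the coverage concerns noted in the evaluation below.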
## Evaluation Results
Evaluated on 200 held-out questions from JEEBench covering Physics, Chemistry, and Mathematics. All models used greedy decoding with max 2,048 tokens.
| Subject | Base Qwen3-8B | SFT | SDPO (this model) | Delta vs Base |
|---|---|---|---|---|
| Overall | 78/200 (39.0%) | 90/200 (45.0%) | 69/200 (34.5%) | -4.5% |
| Mathematics | 24/66 (36.4%) | 36/66 (54.5%) | 17/66 (25.8%) | -10.6% |
| Chemistry | 32/70 (45.7%) | 35/70 (50.0%) | 34/70 (48.6%) | +2.9% |
| Physics | 22/64 (34.4%) | 19/64 (29.7%) | 18/64 (28.1%) | -6.3% |
Key takeaways:
- SDPO regressed overall compared to both the base model and the SFT model
- Mathematics took the largest hit (-10.6% vs base, -28.7% vs SFT), likely because the DPO preference data (500 prompts, 2 rollouts each) had insufficient coverage of competition math
- Chemistry held steady and slightly improved over the base (+2.9%)
- Physics regressed modestly (-6.3% vs base)
- These results suggest the DPO training needs more diverse preference data and possibly more rollouts per prompt to be effective
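The per-subject percentages and deltas in the table can be reproduced from the raw counts (deltas are differences of the rounded percentages, matching the table):

```python
# Correct-answer counts per subject: (base Qwen3-8B, SDPO, questions)
counts = {
    "Mathematics": (24, 17, 66),
    "Chemistry":   (32, 34, 70),
    "Physics":     (22, 18, 64),
}

for subject, (base, sdpo, n) in counts.items():
    base_pct = round(100 * base / n, 1)
    sdpo_pct = round(100 * sdpo / n, 1)
    print(f"{subject}: {sdpo_pct}% ({sdpo_pct - base_pct:+.1f}% vs base)")
# Mathematics: 25.8% (-10.6% vs base)
# Chemistry: 48.6% (+2.9% vs base)
# Physics: 28.1% (-6.3% vs base)
```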
## Intended Use
This model is designed for:
- Solving IIT JEE Advanced level problems in Physics, Chemistry, and Mathematics
- Generating step-by-step solutions with LaTeX notation
- Educational tutoring for competitive exam preparation
## Usage

### With Transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "vipsehgal/qwen3-8b-jee-sdpo", torch_dtype="auto", device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("vipsehgal/qwen3-8b-jee-sdpo")

messages = [
    {"role": "system", "content": "You are an expert IIT JEE tutor. Solve problems step-by-step using LaTeX notation. Show all work clearly and arrive at the final answer."},
    {"role": "user", "content": "A particle of mass 2 kg is projected vertically upward with velocity 20 m/s. Find the maximum height reached. (Take g = 10 m/s²)"},
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=1024)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
### With MLX on Apple Silicon (recommended for Mac users)
Use the quantized version for fast local inference:
```bash
pip install mlx-lm

mlx_lm.generate \
  --model vipsehgal/qwen3-8b-jee-sdpo-mlx-4bit \
  --prompt "Solve: Find the number of real solutions of x^3 - 3x + 1 = 0"
```
See vipsehgal/qwen3-8b-jee-sdpo-mlx-4bit for the 4-bit MLX version (4.3 GB, ~30 tokens/sec on M3 Pro).
## System Prompt

```
You are an expert IIT JEE tutor. Solve problems step-by-step using LaTeX notation. Show all work clearly and arrive at the final answer.
```
## Architecture
| Parameter | Value |
|---|---|
| Model type | Qwen3ForCausalLM |
| Hidden size | 4,096 |
| Layers | 36 |
| Attention heads | 32 (8 KV heads, GQA) |
| FFN intermediate | 12,288 |
| Activation | SiLU |
| Vocab size | 151,936 |
| Max context | 40,960 tokens |
| RoPE theta | 1,000,000 |
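As a rough cross-check of the reported checkpoint sizes, the architecture numbers above imply about 8.2B parameters, assuming 128-dim attention heads and untied input/output embeddings (standard for Qwen3-8B) and ignoring the small norm weights:

```python
hidden, layers, heads, kv_heads, head_dim = 4096, 36, 32, 8, 128
ffn, vocab = 12288, 151936

# Attention: Q and O are hidden x (heads*head_dim); K and V use 8 KV heads (GQA)
attn = 2 * hidden * heads * head_dim + 2 * hidden * kv_heads * head_dim
# SwiGLU MLP: gate, up, and down projections
mlp = 3 * hidden * ffn
per_layer = attn + mlp

# Untied embedding matrix plus LM head (an assumption for this estimate)
embed = 2 * vocab * hidden

total = layers * per_layer + embed
print(f"{total / 1e9:.2f}B params")                              # 8.19B params
print(f"bf16: {2 * total / 1e9:.1f} GB = {2 * total / 2**30:.1f} GiB")  # 16.4 GB = 15.3 GiB
```

At 2 bytes per parameter in bf16 this gives ~16.4 GB (decimal) or ~15.3 GiB (binary), consistent with the 16.4 GB and ~15 GB figures listed for these checkpoints.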
## Related Models
| Model | Size | Description |
|---|---|---|
| vipsehgal/qwen3-8b-jee-sft | 16.4 GB | SFT-only model (bf16) |
| vipsehgal/qwen3-8b-jee-sdpo | 15 GB | SFT + SDPO model (bf16) — this model |
| vipsehgal/qwen3-8b-jee-sdpo-mlx-4bit | 4.3 GB | Quantized MLX version for Mac inference |
## Limitations
- Training data is skewed toward mathematics (~83%) vs. physics/chemistry (~17%)
- DPO training used a subset of 500 prompts with 2 rollouts each; more data could improve results
- SDPO regressed from SFT, particularly in Mathematics — the preference data needs more coverage
- May produce incorrect reasoning steps while arriving at correct final answers — always verify solutions
- Evaluated on 200 questions; full 515-question JEEBench benchmark not yet run
## License
Apache 2.0 (following the base Qwen3-8B license)