Are Large Reasoning Models Good Translation Evaluators? Analysis and Performance Boost

Metrics

Metric/Model	Avg.	En-De SPA (%)	En-De $Acc^*_{eq}$	En-Es SPA (%)	En-Es $Acc^*_{eq}$	Ja-Zh SPA (%)	Ja-Zh $Acc^*_{eq}$
QwQ 32B	68.3	79.8	46.8	76.1	68.0	91.9	46.9
+ ThinMQM	72.2 (+3.9)	83.2 (+3.4)	52.5 (+5.7)	80.7 (+4.6)	69.2 (+1.2)	91.3 (−0.6)	56.1 (+9.2)
R1-Distill-Llama-8B	64.9	71.8	42.9	78.5	68.0	84.7	43.5
+ ThinMQM	70.8 (+5.9)	85.5 (+13.7)	48.6 (+5.7)	81.3 (+2.8)	68.2 (+0.2)	90.5 (+5.8)	51.0 (+7.5)
R1-Distill-Qwen-7B	61.1	67.3	42.9	61.0	68.0	83.8	43.5
+ ThinMQM	69.8 (+8.7)	84.5 (+17.2)	48.5 (+5.6)	77.8 (+16.8)	68.0 (+0.0)	89.0 (+5.2)	51.3 (+7.8)
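
The Avg. column is consistent with an unweighted mean of the six per-language-pair scores in each row, and the parenthesized values with simple differences against the base model. The sketch below recomputes both under that assumption, which the table itself does not state. The dictionary layout and the helper names `average` and `gains` are illustrative and not part of the released evaluation code; only the first two rows are copied in to keep it short.

```python
from statistics import mean

# Scores copied from the table above, in column order:
# En-De SPA, En-De Acc*_eq, En-Es SPA, En-Es Acc*_eq, Ja-Zh SPA, Ja-Zh Acc*_eq
SCORES = {
    "QwQ 32B": [79.8, 46.8, 76.1, 68.0, 91.9, 46.9],
    "QwQ 32B + ThinMQM": [83.2, 52.5, 80.7, 69.2, 91.3, 56.1],
}

# Avg. values as reported in the table, used only for comparison.
REPORTED_AVG = {"QwQ 32B": 68.3, "QwQ 32B + ThinMQM": 72.2}


def average(scores):
    """Unweighted mean over the six columns (assumed definition of Avg.)."""
    return mean(scores)


def gains(base, tuned):
    """Per-column difference between the ThinMQM row and its base row."""
    return [round(t - b, 1) for b, t in zip(base, tuned)]


if __name__ == "__main__":
    for name, scores in SCORES.items():
        print(f"{name}: recomputed {average(scores):.2f}, reported {REPORTED_AVG[name]}")
    print(gains(SCORES["QwQ 32B"], SCORES["QwQ 32B + ThinMQM"]))
```

The recomputed means should agree with the reported Avg. values up to rounding to one decimal place.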

Model & Data Card

Released Models	HF Model	Template	Trained Dataset
rzzhan/ThinMQM-32B	https://huggingface.co/rzzhan/ThinMQM-32B	`thinking`	https://huggingface.co/datasets/rzzhan/ThinMQM-12k/ `thinmqm12k_src`
rzzhan/ThinMQM-8B	https://huggingface.co/rzzhan/ThinMQM-8B	`thinking_ref`	https://huggingface.co/datasets/rzzhan/ThinMQM-12k/ `thinmqm12k_ref`
rzzhan/ThinMQM-7B	https://huggingface.co/rzzhan/ThinMQM-7B	`thinking_ref`	https://huggingface.co/datasets/rzzhan/ThinMQM-12k/ `thinmqm12k_ref`
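
As a rough illustration of how the card above might be used, the following sketch keeps the table in a small registry and loads a checkpoint with Hugging Face `transformers`. It rests on several assumptions that the card does not confirm: that the checkpoints load as standard causal language models, that `accelerate` is installed for `device_map="auto"`, and that the `_ref` suffix marks a template that expects a reference translation. The names `RELEASED`, `needs_reference`, and `load_evaluator` are hypothetical. The prompt templates themselves live in the evaluation code and are not reproduced here.

```python
# Repo ids, template names, and training subsets copied from the table above.
RELEASED = {
    "32B": {"repo": "rzzhan/ThinMQM-32B", "template": "thinking", "subset": "thinmqm12k_src"},
    "8B": {"repo": "rzzhan/ThinMQM-8B", "template": "thinking_ref", "subset": "thinmqm12k_ref"},
    "7B": {"repo": "rzzhan/ThinMQM-7B", "template": "thinking_ref", "subset": "thinmqm12k_ref"},
}


def needs_reference(size: str) -> bool:
    """Assumption: a template ending in `_ref` expects a reference translation."""
    return RELEASED[size]["template"].endswith("_ref")


def load_evaluator(size: str):
    """Load tokenizer and model for one released checkpoint.

    Assumes the checkpoint is a standard causal LM; downloads weights on first use.
    """
    # Imported lazily so the registry above works without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    repo = RELEASED[size]["repo"]
    tokenizer = AutoTokenizer.from_pretrained(repo)
    model = AutoModelForCausalLM.from_pretrained(
        repo,
        torch_dtype="auto",  # use the dtype stored in the checkpoint config
        device_map="auto",   # requires the `accelerate` package
    )
    return tokenizer, model
```

Calling `load_evaluator("7B")` would return a `(tokenizer, model)` pair; the prompt would then need to be built with the template named in the table for that checkpoint.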

📝 Citation

If you find our models, data, or evaluation code useful, please cite our paper:

@article{zhan2025thinmqm,
      title={Are Large Reasoning Models Good Translation Evaluators? Analysis and Performance Boost},
      author={Zhan, Runzhe and Huang, Zhihong and Yang, Xinyi and Chao, Lidia S and Yang, Min and Wong, Derek F},
      year={2025},
      journal={ArXiv preprint},
      volume={2510.20780},
      url={https://arxiv.org/abs/2510.20780},
}