π Remedy-R: Generative Reasoning Models for MT Evaluation
Reasoning-driven, reinforcement-trained metrics for machine translation evaluation
β¨ What is Remedy-R?
Remedy-R is a family of reasoning-based MT evaluation models trained with reinforcement learning via verifiable rewards (RLVR) on pairwise human translation preferences.
Instead of directly regressing a scalar score, Remedy-R:
- Generates step-by-step analyses of accuracy, fluency, and completeness.
- Outputs a final numeric score in [0, 100] that can be parsed and used like a standard metric.
- Is trained with PPO + rule-based rewards that check whether predicted preferences match human rankings and calibrate scores toward human ratings.
- Supports both reference-based and reference-free (QE) evaluation.
On WMT22β24 and MSLC24-style OOD stress tests, Remedy-R:
- Surpasses strong LLM-as-judge methods.
- Matches top-performing scalar SOTA metrics.
- Remains robust under OOD conditions such as source copy, empty translations, wrong language, and mixed-language outputs.
- Enables Test-Time Scaling (TTS) via multiple reasoning passes, improving segment-level meta-evaluation.
- Powers Remedy-R Agent, an evaluateβrevise pipeline that improves translations for diverse base systems.
π Contents
- β¨ What is Remedy-R?
- π Contents
- π¦ Installation
- βοΈ Requirements
- π§ Model Zoo
- π Quickstart
- π Optional: vLLM Online Serving
- π Outputs
- π Citation
π¦ Installation
From PyPI (unavailable for now)
pip install --upgrade pip
pip install remedy-r-mt-eval
This installs the remedy_r package and the CLI entrypoint remedy-r-score (plus related tools).
From source
git clone https://github.com/Smu-Tan/Remedy-R.git
cd Remedy-R
pip install -e .
βοΈ Requirements
Core runtime dependencies (see pyproject.toml for exact versions):
- Python β₯ 3.10 (tested mostly with 3.12)
- PyTorch with GPU support
- vLLM for efficient batched inference
transformers,numpy,pandas,tqdm
You also need:
- At least 1 GPU (16β24 GB) for 7B models
- More memory/GPUs for 14B/32B models or large batch sizes
π§ Model Zoo
Remedy-R models are hosted on HuggingFace under ShaomuTan/:
| Model | Size | Base model | Mode | Link |
|---|---|---|---|---|
| Remedy-R-7B | 7B | Qwen2.5-7B | Ref + QE | π€ HuggingFace |
| Remedy-R-14B | 14B | Qwen2.5-14B | Ref + QE | π€ HuggingFace |
| Remedy-R-32B | 32B | Qwen2.5-32B | Ref + QE | π€ HuggingFace |
You can cache them locally:
HF_HUB_ENABLE_HF_TRANSFER=1 \
huggingface-cli download ShaomuTan/Remedy-R-14B \
--local-dir Models/Remedy-R-14B
Then point --model to either the HF ID or the local path.
π Quickstart
CLI: Local vLLM Inference
The main entrypoint is:
remedy-r-score \
--model "$MODEL_CHECKPOINT" \
--save_metric_name "$METRIC_NAME" \
--output_dir "$DATA_DIR" \
--max-tokens "$MAX_TOKENS" \
--tp_size "$TP_SIZE" \
--dp_size "$DP_SIZE" \
--temperature "$DEC_TEMPERATURE" \
--repetition_penalty "$REPETITION_PENALTY" \
--gpu-memory-utilization "$GPU_MEM_UTIL" \
--max-model-len "$MAX_MODEL_LEN" \
--seed "$SEED" \
--src-file "$SRC_FILE" \
--mt-file "$MT_FILE" \
--lp "$LP" \
Key arguments
--model: HF repo ID or local checkpoint--src-file: Source sentences (one per line)--mt-file: MT outputs (one per line)--ref-file: Reference translations (optional; enables ref-based mode)--lp: Language-pair codes (e.g.,en-de)--output_dir: Output folder--temperature: Generation temperature--tp_size: Tensor parallel size--dp_size: Data parallel size--num-seqs: Max parallel sequences per step--max-tokens: Max generation token numebrs--gpu-memory-utilization: vLLM memory ratio (e.g. 0.9)
You can also call the CLI via Python:
python -m remedy_r.cli.score \
--model ShaomuTan/Remedy-R-7B \
...
Reference-Free / QE Mode
If you donβt have references, just drop --ref-file and add --no-ref:
remedy-r-score \
--model ShaomuTan/Remedy-R-7B \
--src-file ./testcase/en.src \
--mt-file ./testcase/en-de.hyp \
--no-ref \
--src-lang en \
--tgt-lang de \
--save-dir ./testcase \
--cache-dir ./Models
The prompt automatically switches to reference-free quality estimation while keeping the same [0, 100] score scale.
Test-Time Scaling (TTS)
Remedy-R supports Test-Time Scaling by averaging multiple independent evaluation passes with different seeds:
remedy-r-score \
--model ShaomuTan/Remedy-R-14B \
--src-file ./testcase/en.src \
--mt-file ./testcase/en-de.hyp \
--ref-file ./testcase/de.ref \
--src-lang en --tgt-lang de \
--save-dir ./testcase_tts \
--TTS \
--best-of-n 4 \
--seed 42
--TTS: Enable multi-pass evaluation--best-of-n: Number of independent passes (e.g., 2β6)- Scores are averaged; the detailed per-pass scores can be optionally logged.
TTS typically improves segment-level pairwise accuracy and stabilizes scores for difficult segments.
π Optional: vLLM Online Serving
To avoid re-loading the model for every scoring run, you can:
- Start a local vLLM server (OpenAI-compatible):
remedy-r-serve \
--model ShaomuTan/Remedy-R-14B \
--port 8000 \
--max-model-len 4096 \
--gpu-memory-utilization 0.9
- Score via the server:
remedy-r-score \
--src-file ./testcase/en.src \
--mt-file ./testcase/en-de.hyp \
--ref-file ./testcase/de.ref \
--lp en-de \
--save_metric_name Remedy-R-14B \
--save-dir ./testcase_server \
--server-url http://localhost:8000/v1
Internally this reuses the same Remedy-R prompting and scoring logic, but routes generation requests through the running vLLM server instead of instantiating LLM() in every process.
π Outputs
For each language pair SRC-TGT, Remedy-R writes:
results.jsonlsegment_scores.tsvsystem_score.txt
π Citation
If you use Remedy-R or this codebase, please cite:
Arxiv coming soon...
- Downloads last month
- 2