🚀 Remedy-R: Generative Reasoning Models for MT Evaluation

Reasoning-driven, reinforcement-trained metrics for machine translation evaluation


✨ What is Remedy-R?

Remedy-R is a family of reasoning-based MT evaluation models trained via reinforcement learning with verifiable rewards (RLVR) on pairwise human translation preferences.

Instead of directly regressing a scalar score, Remedy-R:

  • Generates step-by-step analyses of accuracy, fluency, and completeness.
  • Outputs a final numeric score in [0, 100] that can be parsed and used like a standard metric (see the parsing sketch after this list).
  • Is trained with PPO + rule-based rewards that check whether predicted preferences match human rankings and calibrate scores toward human ratings.
  • Supports both reference-based and reference-free (QE) evaluation.
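Because the score is embedded in generated text, downstream use only requires extracting the final number. A minimal parsing sketch, assuming the evaluation ends with a phrase like "Final score: NN" (the exact output template may differ; adjust the pattern to the actual Remedy-R output):

import re

def parse_score(generation: str) -> float | None:
    # Take the last number in [0, 100] that follows a mention of "score".
    # NOTE: the "Final score: NN" phrasing is an assumption for illustration,
    # not the guaranteed Remedy-R output format.
    matches = re.findall(r"score[^0-9]{0,20}(\d{1,3}(?:\.\d+)?)", generation, flags=re.IGNORECASE)
    for raw in reversed(matches):
        value = float(raw)
        if 0.0 <= value <= 100.0:
            return value
    return None

example = "... the translation omits one clause. Final score: 78"
print(parse_score(example))  # 78.0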

On WMT22–24 and MSLC24-style OOD stress tests, Remedy-R:

  • Surpasses strong LLM-as-judge methods.
  • Matches top-performing scalar SOTA metrics.
  • Remains robust under OOD conditions such as source copy, empty translations, wrong language, and mixed-language outputs.
  • Enables Test-Time Scaling (TTS) via multiple reasoning passes, improving segment-level meta-evaluation.
  • Powers Remedy-R Agent, an evaluate–revise pipeline that improves translations for diverse base systems.


📦 Installation

From PyPI (not yet available)

pip install --upgrade pip
pip install remedy-r-mt-eval

This installs the remedy_r package and the CLI entrypoint remedy-r-score (plus related tools).

From source

git clone https://github.com/Smu-Tan/Remedy-R.git
cd Remedy-R
pip install -e .

βš™οΈ Requirements

Core runtime dependencies (see pyproject.toml for exact versions):

  • Python ≥ 3.10 (tested mostly with 3.12)
  • PyTorch with GPU support
  • vLLM for efficient batched inference
  • transformers, numpy, pandas, tqdm

You also need:

  • At least 1 GPU (16–24 GB) for 7B models
  • More memory/GPUs for 14B/32B models or large batch sizes

🧠 Model Zoo

Remedy-R models are hosted on HuggingFace under ShaomuTan/:

Model          Size   Base model     Mode       Link
Remedy-R-7B    7B     Qwen2.5-7B     Ref + QE   🤗 HuggingFace
Remedy-R-14B   14B    Qwen2.5-14B    Ref + QE   🤗 HuggingFace
Remedy-R-32B   32B    Qwen2.5-32B    Ref + QE   🤗 HuggingFace

You can cache them locally:

HF_HUB_ENABLE_HF_TRANSFER=1 \
huggingface-cli download ShaomuTan/Remedy-R-14B \
  --local-dir Models/Remedy-R-14B

Then point --model to either the HF ID or the local path.
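
If you prefer to cache models from Python instead of the CLI, huggingface_hub offers an equivalent. A small sketch (the local directory is just an example path):

from huggingface_hub import snapshot_download

# Download the full model repository once; pass the returned path to --model.
local_path = snapshot_download(
    repo_id="ShaomuTan/Remedy-R-14B",
    local_dir="Models/Remedy-R-14B",
)
print(local_path)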


🚀 Quickstart

CLI: Local vLLM Inference

The main entrypoint is:

remedy-r-score \
  --model "$MODEL_CHECKPOINT" \
  --save_metric_name "$METRIC_NAME" \
  --output_dir "$DATA_DIR" \
  --max-tokens "$MAX_TOKENS" \
  --tp_size "$TP_SIZE" \
  --dp_size "$DP_SIZE" \
  --temperature "$DEC_TEMPERATURE" \
  --repetition_penalty "$REPETITION_PENALTY" \
  --gpu-memory-utilization "$GPU_MEM_UTIL" \
  --max-model-len "$MAX_MODEL_LEN" \
  --seed "$SEED" \
  --src-file "$SRC_FILE" \
  --mt-file  "$MT_FILE" \
  --lp "$LP"

Key arguments

  • --model : HF repo ID or local checkpoint
  • --src-file : Source sentences, one per line (see the file-preparation sketch after this list)
  • --mt-file : MT outputs (one per line)
  • --ref-file : Reference translations (optional; enables ref-based mode)
  • --lp : Language-pair codes (e.g., en-de)
  • --output_dir : Output folder
  • --temperature : Generation temperature
  • --tp_size : Tensor parallel size
  • --dp_size : Data parallel size
  • --num-seqs : Max parallel sequences per step
  • --max-tokens : Maximum number of generated tokens
  • --gpu-memory-utilization : vLLM memory ratio (e.g. 0.9)
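
Both --src-file and --mt-file (and --ref-file, if used) are plain-text files with one segment per line, aligned by line number. A tiny preparation sketch (file names are only examples):

# Write aligned source and MT segments, one per line.
sources = ["Hello world.", "How are you?"]
hypotheses = ["Hallo Welt.", "Wie geht es dir?"]

with open("en.src", "w", encoding="utf-8") as f:
    f.write("\n".join(sources) + "\n")

with open("en-de.hyp", "w", encoding="utf-8") as f:
    f.write("\n".join(hypotheses) + "\n")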

You can also call the CLI via Python:

python -m remedy_r.cli.score \
  --model ShaomuTan/Remedy-R-7B \
  ...

Reference-Free / QE Mode

If you don't have references, just drop --ref-file and add --no-ref:

remedy-r-score \
  --model ShaomuTan/Remedy-R-7B \
  --src-file ./testcase/en.src \
  --mt-file ./testcase/en-de.hyp \
  --no-ref \
  --src-lang en \
  --tgt-lang de \
  --save-dir ./testcase \
  --cache-dir ./Models

The prompt automatically switches to reference-free quality estimation while keeping the same [0, 100] score scale.


Test-Time Scaling (TTS)

Remedy-R supports Test-Time Scaling by averaging multiple independent evaluation passes with different seeds:

remedy-r-score \
  --model ShaomuTan/Remedy-R-14B \
  --src-file ./testcase/en.src \
  --mt-file ./testcase/en-de.hyp \
  --ref-file ./testcase/de.ref \
  --src-lang en --tgt-lang de \
  --save-dir ./testcase_tts \
  --TTS \
  --best-of-n 4 \
  --seed 42

  • --TTS : Enable multi-pass evaluation
  • --best-of-n : Number of independent passes (e.g., 2–6)
  • Scores are averaged; per-pass scores can optionally be logged (see the averaging sketch below).

TTS typically improves segment-level pairwise accuracy and stabilizes scores for difficult segments.
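
If you collect the per-pass scores yourself, the TTS aggregate is a plain mean. A sketch with illustrative numbers (the array layout is not the tool's actual log format):

import numpy as np

# One row per pass, one column per segment (illustrative values).
per_pass_scores = np.array([
    [82.0, 64.0, 91.0],  # pass with seed 42
    [79.0, 70.0, 88.0],  # pass with seed 43
    [84.0, 66.0, 90.0],  # pass with seed 44
])

tts_scores = per_pass_scores.mean(axis=0)  # segment-level TTS scores
system_score = tts_scores.mean()           # simple system-level aggregate
print(tts_scores, system_score)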


🌐 Optional: vLLM Online Serving

To avoid re-loading the model for every scoring run, you can:

  1. Start a local vLLM server (OpenAI-compatible):
remedy-r-serve \
  --model ShaomuTan/Remedy-R-14B \
  --port 8000 \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.9

  2. Score via the server:
remedy-r-score \
  --src-file ./testcase/en.src \
  --mt-file ./testcase/en-de.hyp \
  --ref-file ./testcase/de.ref \
  --lp en-de \
  --save_metric_name Remedy-R-14B \
  --save-dir ./testcase_server \
  --server-url http://localhost:8000/v1

Internally this reuses the same Remedy-R prompting and scoring logic, but routes generation requests through the running vLLM server instead of instantiating LLM() in every process.
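
Since the server exposes the OpenAI-compatible API, you can also query it directly from Python, bypassing the CLI. A rough sketch (the prompt below is a placeholder, not the actual Remedy-R evaluation template):

from openai import OpenAI

# vLLM's OpenAI-compatible endpoint; no real API key is needed for a local server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="ShaomuTan/Remedy-R-14B",
    messages=[{
        "role": "user",
        "content": (
            "Evaluate this translation and end with a score in [0, 100].\n"
            "Source (en): Hello world.\nTranslation (de): Hallo Welt."
        ),
    }],
    temperature=0.0,
    max_tokens=1024,
)
print(response.choices[0].message.content)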


📄 Outputs

For each language pair SRC-TGT, Remedy-R writes the following files (see the loading sketch after this list):

  • results.jsonl
  • segment_scores.tsv
  • system_score.txt
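
A short loading sketch for further analysis (the directory layout and column/field names are assumptions; check the actual files in your output folder):

import json
import pandas as pd

# Segment-level scores as a table (assumed tab-separated with a header row).
segments = pd.read_csv("outputs/en-de/segment_scores.tsv", sep="\t")

# Reasoning traces and parsed scores, one JSON object per line.
with open("outputs/en-de/results.jsonl", encoding="utf-8") as f:
    results = [json.loads(line) for line in f]

# Single system-level score.
with open("outputs/en-de/system_score.txt", encoding="utf-8") as f:
    system_score = float(f.read().strip())

print(len(segments), len(results), system_score)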

📚 Citation

If you use Remedy-R or this codebase, please cite:

arXiv preprint coming soon.
