RM-RF: Reward Model for Run-Free Unit Test Evaluation
Abstract
RM-RF is a lightweight reward model that predicts execution outcomes from source and test code alone, offering faster and more cost-effective evaluation than traditional compile-and-run methods.
We present RM-RF, a lightweight reward model for run-free evaluation of automatically generated unit tests. Instead of repeatedly compiling and executing candidate tests, RM-RF predicts three execution-derived signals from source and test code alone: (1) whether the augmented test suite compiles and runs successfully, (2) whether the generated test cases increase code coverage, and (3) whether the generated test cases improve the mutation kill rate. To train and evaluate RM-RF, we assemble a multilingual dataset (Java, Python, Go) of focal files, test files, and candidate test additions labeled by an execution-based pipeline, and we release the dataset together with a methodology for comparative evaluation. We evaluate multiple model families and tuning regimes (zero-shot, full fine-tuning, and PEFT via LoRA), achieving an average F1 of 0.69 across the three targets. Compared to conventional compile-and-run instruments, RM-RF provides substantially lower latency and infrastructure cost while delivering competitive predictive fidelity, enabling fast, scalable feedback for large-scale test generation and RL-based code optimization.
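The paper frames run-free evaluation as predicting three binary, execution-derived labels from code alone; one natural way to realize this is multi-label sequence classification over the concatenated focal and test code. The sketch below illustrates only that framing: the checkpoint name, input template, and sequence length are assumptions, not details taken from the paper.

```python
# Hypothetical sketch: treat RM-RF-style run-free evaluation as 3-way
# multi-label classification (compiles/runs, coverage gain, mutation-kill gain).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

CHECKPOINT = "my-org/rm-rf-reward-model"  # placeholder; no public checkpoint is linked

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSequenceClassification.from_pretrained(
    CHECKPOINT,
    num_labels=3,
    problem_type="multi_label_classification",
)

def predict_signals(focal_code: str, existing_tests: str, candidate_tests: str) -> dict:
    """Predict the three execution-derived signals without compiling or running."""
    # The input template is an assumption; the paper's prompt format is not shown here.
    text = (
        "### Focal file\n" + focal_code +
        "\n### Existing tests\n" + existing_tests +
        "\n### Candidate tests\n" + candidate_tests
    )
    inputs = tokenizer(text, truncation=True, max_length=2048, return_tensors="pt")
    with torch.no_grad():
        probs = torch.sigmoid(model(**inputs).logits)[0]
    return {
        "compiles_and_runs": probs[0].item(),
        "coverage_increases": probs[1].item(),
        "mutation_kill_improves": probs[2].item(),
    }
```

Thresholding the three probabilities (e.g., at 0.5) would yield the binary labels that an execution pipeline otherwise produces by compiling the suite, measuring coverage, and running mutation testing.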
Community
RM-RF: Reward Model for Run-Free Unit Test Evaluation proposes a lightweight reward model that predicts unit test quality without compiling or executing code, inferring three execution-derived signals directly from source and test code:
- whether the augmented test suite would compile and run,
- whether the new tests increase code coverage,
- whether they improve mutation kill rate.
This work is motivated by the high computational cost of traditional compile-and-run validation in automated test generation, especially when large language models (LLMs) are used for code tasks. RM-RF is trained on a multilingual dataset spanning Java, Python, and Go that pairs focal and test files with execution-derived labels; evaluated across model families and tuning regimes (zero-shot, full fine-tuning, and PEFT/LoRA), it reaches an average F1 of about 0.69 on the three targets.
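For the PEFT/LoRA regime mentioned above, attaching low-rank adapters to a base code model for the three-label task might look like the following minimal sketch; the base checkpoint (microsoft/codebert-base) and all hyperparameters are illustrative assumptions rather than the paper's configuration.

```python
# Hypothetical PEFT/LoRA setup for the 3-label run-free prediction task.
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForSequenceClassification

base = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/codebert-base",            # illustrative base encoder, not the paper's
    num_labels=3,
    problem_type="multi_label_classification",
)
lora_cfg = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["query", "value"],    # attention projections in RoBERTa-style models
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()        # only the adapter weights are trained
```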
Key contributions include:
- a scalable “run-free” reward model that reduces latency and infrastructure needs compared to execution-based evaluation,
- a curated dataset and methodology for comparative assessment of unit test quality signals,
- empirical analysis of RM-RF with different model sizes and fine-tuning strategies.
This approach can provide rapid feedback for large-scale test generation and RL-based code optimization, bridging a gap between high-fidelity execution feedback and scalable automated unit test evaluation.
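As a usage sketch for the RL setting, the three predicted signals could be collapsed into a single scalar reward for a test-generating policy; the gating and weights below are illustrative assumptions, not a scheme described in the paper.

```python
# Illustrative only: combine the three run-free signals into one scalar reward
# for policy optimization over a test-generating model. Weights are assumptions.
def run_free_reward(signals: dict,
                    w_run: float = 0.5,
                    w_cov: float = 0.3,
                    w_mut: float = 0.2) -> float:
    # A suite that is predicted not to compile/run gets no credit for the
    # coverage or mutation-kill signals.
    p_run = signals["compiles_and_runs"]
    quality = (w_cov * signals["coverage_increases"]
               + w_mut * signals["mutation_kill_improves"])
    return w_run * p_run + p_run * quality
```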