Enhance model card with metadata, paper link, and usage example (#1)
Co-authored-by: Niels Rogge <nielsr@users.noreply.huggingface.co>
README.md
CHANGED
@@ -1 +1,117 @@
---
license: apache-2.0
pipeline_tag: text-generation
library_name: transformers
datasets:
- Kwai-Klear/RLEP_dataset
- BytedTsinghua-SIA/DAPO-Math-17k
base_model: Qwen/Qwen2.5-Math-7B
---

# RLEP: Reinforcement Learning with Experience Replay for LLM Reasoning

This repository contains the `qwen2.5-math-rlep` model, a key checkpoint from the RLEP training process based on Qwen2.5-Math-7B, as presented in the paper [RLEP: Reinforcement Learning with Experience Replay for LLM Reasoning](https://huggingface.co/papers/2507.07451).

Reinforcement learning (RL) for large language models is an energy-intensive endeavor: training can be unstable, and the policy may gradually drift away from its pretrained weights. **RLEP** -- Reinforcement Learning with Experience rePlay -- is a two-phase framework that first collects verified trajectories and then replays them during subsequent training. At every update step, the policy is optimized on mini-batches that blend newly generated rollouts with these replayed successes. By replaying high-quality examples, RLEP steers the model away from fruitless exploration, focuses learning on promising reasoning paths, and delivers both faster convergence and stronger final performance.
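
As an illustration of the replay mechanism described above, here is a minimal, hypothetical sketch of how a mini-batch could blend freshly generated rollouts with verified trajectories from a replay buffer. Names such as `sample_fn` and `replay_buffer` are illustrative only; the actual RLEP implementation is built on the VERL framework (see the code link below).

```python
import random

def build_minibatch(prompts, sample_fn, replay_buffer,
                    rollouts_per_prompt=8, replays_per_prompt=2):
    """Hypothetical sketch: for each prompt, mix fresh rollouts from the current
    policy with verified successful trajectories collected in the experience phase."""
    batch = []
    for prompt in prompts:
        # Newly generated rollouts from the current policy ...
        fresh = [sample_fn(prompt) for _ in range(rollouts_per_prompt - replays_per_prompt)]
        # ... blended with previously verified successes stored for the same prompt.
        stored = replay_buffer.get(prompt, [])
        replayed = random.sample(stored, k=min(replays_per_prompt, len(stored)))
        batch.append({"prompt": prompt, "rollouts": fresh + replayed})
    return batch
```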

[[Paper](https://huggingface.co/papers/2507.07451)] [[Code](https://github.com/Kwai-Klear/RLEP)] [[Checkpoints](https://huggingface.co/Kwai-Klear/qwen2.5-math-rlep)] [[Dataset](https://huggingface.co/datasets/Kwai-Klear/RLEP_dataset)]

<p align="center">
<img src="https://github.com/Kwai-Klear/RLEP/raw/main/image/rlep_method.png" width="85%" alt="RLEP Method Overview">
</p>

## ✨ Key Highlights

* **Rapid early gains**: On AIME-2024, RLEP reaches the baseline's peak accuracy by step 135 (the baseline needs 380 steps); on AIME-2025, it surpasses the baseline's best score after only 50 steps.
* **Higher final performance**: RLEP ultimately lifts peak accuracy from 38.2% → 39.9% on AIME-2024, from 19.8% → 22.3% on AIME-2025, and from 77.0% → 82.2% on the AMC-2023 benchmark.

<p align="center">
<img src="https://github.com/Kwai-Klear/RLEP/raw/main/image/exp_acc.png" width="85%" alt="RLEP Experimental Accuracy">
</p>

## 🚀 Quick Start (Inference)

This checkpoint is an RL-trained version of Qwen2.5-Math-7B and uses the standard `transformers` text-generation interface, so it can be loaded directly with 🤗 Transformers.

If you also want to run the RLEP training and evaluation pipeline, clone the official repository and install it together with its `vllm` dependencies:

```bash
git clone https://github.com/Kwai-Klear/RLEP.git
cd RLEP
pip3 install -e .[vllm]
```

For plain inference, you can use the model in your Python code as follows. The prompt below is only an illustration; for best results, follow the prompt template used during RLEP/DAPO training as defined in the repository.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# The RLEP checkpoint is a complete fine-tuned model, so it is loaded on its own;
# no separate base or draft model is needed at inference time.
model_id = "Kwai-Klear/qwen2.5-math-rlep"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # or torch.float16, depending on your hardware
    device_map="auto",
)
model.eval()

# Example math prompt; asking for the answer in \boxed{} matches the usual
# convention for Qwen2.5-Math-style models.
question = "If 3x + 5 = 20, what is the value of x?"
prompt = (
    "Solve the following math problem step by step and put your final answer "
    f"within \\boxed{{}}.\n\n{question}"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=512,
        do_sample=True,
        temperature=0.7,
        top_p=0.95,
    )

# Decode only the newly generated tokens.
print(tokenizer.decode(output_ids[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```
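
Since the checkpoint is a regular Hugging Face model, it can also be served with vLLM for faster batched generation. The snippet below is a minimal sketch under that assumption; the prompt and sampling parameters are illustrative, not the exact settings used in the paper's evaluation.

```python
from vllm import LLM, SamplingParams

llm = LLM(model="Kwai-Klear/qwen2.5-math-rlep", dtype="bfloat16")
sampling = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=1024)

prompts = [
    "Solve the following math problem step by step and put your final answer "
    "within \\boxed{}.\n\nWhat is the sum of the first 100 positive integers?"
]
for request_output in llm.generate(prompts, sampling):
    print(request_output.outputs[0].text)
```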

## Evaluation Results

We evaluated the converged RLEP model at 320 training steps and the DAPO-nodyn-bs64 baseline at 400 steps.

| Model           | AIME-2024 | AIME-2025 | AMC-2023 |
|-----------------|-----------|-----------|----------|
| DAPO            | 32.6      | 18.9      | 77.5     |
| DAPO-nodyn-bs64 | 37.4      | 19.4      | 77.3     |
| **RLEP**        | **38.5**  | **21.3**  | **83.0** |
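
The scores above are accuracies on the respective math benchmarks, i.e., the fraction of problems whose final answer matches the reference (see the official repository for the exact evaluation setup). As a rough, simplified illustration of such a check, assuming answers are reported inside `\boxed{}`:

```python
import re

def extract_boxed_answer(text: str):
    """Return the content of the last \\boxed{...} in a generation (simplified: no nested braces)."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None

def accuracy(generations, references):
    """Fraction of problems whose extracted final answer exactly matches the reference."""
    correct = sum(extract_boxed_answer(g) == str(r).strip() for g, r in zip(generations, references))
    return correct / len(references)
```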

## Citation

If you find our paper or code helpful, we would appreciate it if you could cite our work:

```bibtex
@misc{zhang2025rlepreinforcementlearningexperience,
      title={RLEP: Reinforcement Learning with Experience Replay for LLM Reasoning},
      author={Hongzhi Zhang and Jia Fu and Jingyuan Zhang and Kai Fu and Qi Wang and Fuzheng Zhang and Guorui Zhou},
      year={2025},
      eprint={2507.07451},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2507.07451},
}
```

## Acknowledgement

We conducted our experiments with the [VERL](https://github.com/volcengine/verl) framework and the [Qwen2.5-Math-7B](https://huggingface.co/Qwen/Qwen2.5-Math-7B) model, using the dataset and training scripts provided by [DAPO](https://dapo-sia.github.io/).
Many thanks to these open-source projects and the broader community for making such resources available!