Enhance model card with metadata, paper link, and usage example (#1)
Co-authored-by: Niels Rogge <nielsr@users.noreply.huggingface.co>
README.md
CHANGED
@@ -1 +1,117 @@
---
license: apache-2.0
pipeline_tag: text-generation
library_name: transformers
datasets:
- Kwai-Klear/RLEP_dataset
- BytedTsinghua-SIA/DAPO-Math-17k
base_model: Qwen/Qwen2.5-Math-7B
---

# RLEP: Reinforcement Learning with Experience Replay for LLM Reasoning

This repository contains the `qwen2.5-math-rlep` model, a key checkpoint from the RLEP training process based on Qwen2.5-Math-7B, as presented in the paper [RLEP: Reinforcement Learning with Experience Replay for LLM Reasoning](https://huggingface.co/papers/2507.07451).

Reinforcement learning (RL) for large language models is an energy-intensive endeavor: training can be unstable, and the policy may gradually drift away from its pretrained weights. **RLEP** -- Reinforcement Learning with Experience rePlay -- is a two-phase framework that first collects verified trajectories and then replays them during subsequent training. At every update step, the policy is optimized on mini-batches that blend newly generated rollouts with these replayed successes. By replaying high-quality examples, RLEP steers the model away from fruitless exploration, focuses learning on promising reasoning paths, and delivers both faster convergence and stronger final performance.
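
As an illustration of the replay mechanism described above, here is a minimal, hypothetical sketch of how a mini-batch could blend freshly generated rollouts with verified trajectories from a replay buffer. Names such as `sample_fn` and `replay_buffer` are illustrative only; the actual RLEP implementation is built on the VERL framework (see the code link below).

```python
import random

def build_minibatch(prompts, sample_fn, replay_buffer,
                    rollouts_per_prompt=8, replays_per_prompt=2):
    """Hypothetical sketch: for each prompt, mix fresh rollouts from the current
    policy with verified successful trajectories collected in the experience phase."""
    batch = []
    for prompt in prompts:
        # Newly generated rollouts from the current policy ...
        fresh = [sample_fn(prompt) for _ in range(rollouts_per_prompt - replays_per_prompt)]
        # ... blended with previously verified successes stored for the same prompt.
        stored = replay_buffer.get(prompt, [])
        replayed = random.sample(stored, k=min(replays_per_prompt, len(stored)))
        batch.append({"prompt": prompt, "rollouts": fresh + replayed})
    return batch
```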

[[Paper](https://huggingface.co/papers/2507.07451)] [[Code](https://github.com/Kwai-Klear/RLEP)] [[Checkpoints](https://huggingface.co/Kwai-Klear/qwen2.5-math-rlep)] [[Dataset](https://huggingface.co/datasets/Kwai-Klear/RLEP_dataset)]

<p align="center">
<img src="https://github.com/Kwai-Klear/RLEP/raw/main/image/rlep_method.png" width="85%" alt="RLEP Method Overview">
</p>

## ✨ Key Highlights

* **Rapid early gains**: On AIME-2024, RLEP reaches the baseline's peak accuracy by step 135 (the baseline needs 380 steps); on AIME-2025, it surpasses the baseline's best score after only 50 steps.
* **Higher final performance**: RLEP ultimately lifts peak accuracy from 38.2% → 39.9% on AIME-2024, from 19.8% → 22.3% on AIME-2025, and from 77.0% → 82.2% on the AMC-2023 benchmark.

<p align="center">
<img src="https://github.com/Kwai-Klear/RLEP/raw/main/image/exp_acc.png" width="85%" alt="RLEP Experimental Accuracy">
</p>

## 🚀 Quick Start (Inference)

This checkpoint is an RL-trained version of Qwen2.5-Math-7B and uses the standard `transformers` text-generation interface, so it can be loaded directly with 🤗 Transformers.

If you also want to run the RLEP training and evaluation pipeline, clone the official repository and install it together with its `vllm` dependencies:

```bash
git clone https://github.com/Kwai-Klear/RLEP.git
cd RLEP
pip3 install -e .[vllm]
```

For plain inference, you can use the model in your Python code as follows. The prompt below is only an illustration; for best results, follow the prompt template used during RLEP/DAPO training as defined in the repository.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# The RLEP checkpoint is a complete fine-tuned model, so it is loaded on its own;
# no separate base or draft model is needed at inference time.
model_id = "Kwai-Klear/qwen2.5-math-rlep"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # or torch.float16, depending on your hardware
    device_map="auto",
)
model.eval()

# Example math prompt; asking for the answer in \boxed{} matches the usual
# convention for Qwen2.5-Math-style models.
question = "If 3x + 5 = 20, what is the value of x?"
prompt = (
    "Solve the following math problem step by step and put your final answer "
    f"within \\boxed{{}}.\n\n{question}"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=512,
        do_sample=True,
        temperature=0.7,
        top_p=0.95,
    )

# Decode only the newly generated tokens.
print(tokenizer.decode(output_ids[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```
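
Since the checkpoint is a regular Hugging Face model, it can also be served with vLLM for faster batched generation. The snippet below is a minimal sketch under that assumption; the prompt and sampling parameters are illustrative, not the exact settings used in the paper's evaluation.

```python
from vllm import LLM, SamplingParams

llm = LLM(model="Kwai-Klear/qwen2.5-math-rlep", dtype="bfloat16")
sampling = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=1024)

prompts = [
    "Solve the following math problem step by step and put your final answer "
    "within \\boxed{}.\n\nWhat is the sum of the first 100 positive integers?"
]
for request_output in llm.generate(prompts, sampling):
    print(request_output.outputs[0].text)
```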

## Evaluation Results

We evaluated the converged RLEP model at 320 training steps and the DAPO-nodyn-bs64 baseline at 400 steps.

| Model           | AIME-2024 | AIME-2025 | AMC-2023 |
|-----------------|-----------|-----------|----------|
| DAPO            | 32.6      | 18.9      | 77.5     |
| DAPO-nodyn-bs64 | 37.4      | 19.4      | 77.3     |
| **RLEP**        | **38.5**  | **21.3**  | **83.0** |
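
The scores above are accuracies on the respective math benchmarks, i.e., the fraction of problems whose final answer matches the reference (see the official repository for the exact evaluation setup). As a rough, simplified illustration of such a check, assuming answers are reported inside `\boxed{}`:

```python
import re

def extract_boxed_answer(text: str):
    """Return the content of the last \\boxed{...} in a generation (simplified: no nested braces)."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None

def accuracy(generations, references):
    """Fraction of problems whose extracted final answer exactly matches the reference."""
    correct = sum(extract_boxed_answer(g) == str(r).strip() for g, r in zip(generations, references))
    return correct / len(references)
```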

## Citation

If you find our paper or code helpful, we would appreciate it if you could cite our work:

```bibtex
@misc{zhang2025rlepreinforcementlearningexperience,
      title={RLEP: Reinforcement Learning with Experience Replay for LLM Reasoning},
      author={Hongzhi Zhang and Jia Fu and Jingyuan Zhang and Kai Fu and Qi Wang and Fuzheng Zhang and Guorui Zhou},
      year={2025},
      eprint={2507.07451},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2507.07451},
}
```

## Acknowledgement

We conducted our experiments with the [VERL](https://github.com/volcengine/verl) framework and the [Qwen2.5-Math-7B](https://huggingface.co/Qwen/Qwen2.5-Math-7B) model, using the dataset and training scripts provided by [DAPO](https://dapo-sia.github.io/).
Many thanks to these open-source projects and the broader community for making such resources available!