Improve model card: Add pipeline tag, library name, paper link, and abstract
This PR enhances the model card by:
- Adding `pipeline_tag: text-generation` to accurately categorize the model's primary function on the Hugging Face Hub.
- Specifying `library_name: transformers` as the model is compatible with the 🤗 Transformers library (evidenced by `config.json` and `tokenizer_config.json`), which enables the automated "how to use" widget.
- Including a direct link to the research paper: [FAPO: Flawed-Aware Policy Optimization for Efficient and Reliable Reasoning](https://huggingface.co/papers/2510.22543).
- Adding the paper's abstract to provide immediate context and an overview of the model's background and purpose.
Please note: A sample usage code snippet has not been added, as the provided GitHub README did not contain an explicit inference snippet for this specific model, in adherence to the task's instructions.
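
For reference only (no snippet was added to the card itself, per the note above), here is a minimal sketch of how a Transformers-compatible text-generation model like this one is typically loaded and run. The repository id and prompt below are assumptions for illustration, not taken from the model card or the GitHub README:

```python
# Hedged illustration: loading a text-generation model with 🤗 Transformers.
# The repository id is an assumption; substitute the actual FAPO-GenRM-4B repo id on the Hub.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "dyyyyyyyy/FAPO-GenRM-4B"  # assumed repo id, verify on the Hub

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")

# Hypothetical prompt; the actual GenRM prompting format is defined by the paper/repo.
prompt = "Evaluate the following solution step by step:\n..."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```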
@@ -1,8 +1,15 @@
 ---
 license: apache-2.0
+pipeline_tag: text-generation
+library_name: transformers
 ---
 
-Generative Reward Model trained with [FAPO-Critic](https://huggingface.co/datasets/dyyyyyyyy/FAPO-Critic)
+This model is the Generative Reward Model (FAPO-GenRM-4B) described in the paper [FAPO: Flawed-Aware Policy Optimization for Efficient and Reliable Reasoning](https://huggingface.co/papers/2510.22543), and was trained with [FAPO-Critic](https://huggingface.co/datasets/dyyyyyyyy/FAPO-Critic).
+
+---
+
+## Abstract
+Reinforcement learning with verifiable rewards (RLVR) has emerged as a promising paradigm for enhancing the reasoning capabilities of large language models (LLMs). In this context, models explore reasoning trajectories and exploit rollouts with correct answers as positive signals for policy optimization. However, these rollouts might involve flawed patterns such as answer-guessing and jump-in-reasoning. Such flawed-positive rollouts are rewarded identically to fully correct ones, causing policy models to internalize these unreliable reasoning patterns. In this work, we first conduct a systematic study of flawed-positive rollouts in RL and find that they enable rapid capability gains during the early optimization stage, while constraining reasoning capability later by reinforcing unreliable patterns. Building on these insights, we propose Flawed-Aware Policy Optimization (FAPO), which presents a parameter-free reward penalty for flawed-positive rollouts, enabling the policy to leverage them as useful shortcuts in the warm-up stage, securing stable early gains, while gradually shifting optimization toward reliable reasoning in the later refinement stage. To accurately and comprehensively detect flawed-positive rollouts, we introduce a generative reward model (GenRM) with a process-level reward that precisely localizes reasoning errors. Experiments show that FAPO is effective in broad domains, improving outcome correctness, process reliability, and training stability without increasing the token budget.
 
 ---
 
@@ -20,4 +27,4 @@ BibTeX citation:
 journal={arXiv preprint arXiv:2510.22543},
 year={2025}
 }
-```
+```