Improve model card: update pipeline tag, add library name, paper details & content

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +149 -5
README.md CHANGED
@@ -1,12 +1,156 @@
  ---
- license: apache-2.0
  datasets:
  - QiWang98/VideoRFT-Data
  language:
  - en
  metrics:
  - accuracy
- base_model:
- - Qwen/Qwen2.5-VL-7B-Instruct
- pipeline_tag: visual-question-answering
- ---

  ---
+ base_model:
+ - Qwen/Qwen2.5-VL-7B-Instruct
  datasets:
  - QiWang98/VideoRFT-Data
  language:
  - en
+ license: apache-2.0
  metrics:
  - accuracy
+ pipeline_tag: video-text-to-text
+ library_name: transformers
+ ---

# 🎥 $\text{VideoRFT}$: Incentivizing Video Reasoning Capability in MLLMs via Reinforced Fine-Tuning

📑 [Paper: VideoRFT: Incentivizing Video Reasoning Capability in MLLMs via Reinforced Fine-Tuning](https://huggingface.co/papers/2505.12434)
⭐️ [Code: https://github.com/QiWang98/VideoRFT](https://github.com/QiWang98/VideoRFT)
📀 [CoT Dataset: https://huggingface.co/datasets/QiWang98/VideoRFT-Data](https://huggingface.co/datasets/QiWang98/VideoRFT-Data)
📀 [RL Dataset: https://huggingface.co/datasets/QiWang98/VideoRFT-Data](https://huggingface.co/datasets/QiWang98/VideoRFT-Data)
🤗 [Models: https://huggingface.co/QiWang98/VideoRFT](https://huggingface.co/QiWang98/VideoRFT)

## 📰 News
- [2025/09/19] Our paper has been **accepted to NeurIPS 2025** 🎉!
- [2025/06/01] We released our 3B models ([🤗VideoRFT-SFT-3B](https://huggingface.co/QiWang98/VideoRFT-SFT-3B) and [🤗VideoRFT-3B](https://huggingface.co/QiWang98/VideoRFT-3B)) on Hugging Face.
- [2025/05/25] We released our 7B models ([🤗VideoRFT-SFT-7B](https://huggingface.co/QiWang98/VideoRFT-SFT) and [🤗VideoRFT-7B](https://huggingface.co/QiWang98/VideoRFT)) on Hugging Face.
- [2025/05/20] We released our datasets ([📀CoT Dataset](https://huggingface.co/datasets/QiWang98/VideoRFT-Data) and [📀RL Dataset](https://huggingface.co/datasets/QiWang98/VideoRFT-Data)) on Hugging Face.
- [2025/05/18] Our paper was released on [arXiv](https://arxiv.org/abs/2505.12434), and we open-sourced our code on [GitHub](https://github.com/QiWang98/VideoRFT)!

## 🔎 Overview

Reinforcement fine-tuning (RFT) has shown great promise in achieving human-level reasoning capabilities in Large Language Models (LLMs), and has recently been extended to MLLMs. Nevertheless, reasoning about videos, a fundamental aspect of human intelligence, remains a persistent challenge due to the complex logical, temporal, and causal structures inherent in video data. To fill this gap, we propose $\textbf{VideoRFT}$, a novel approach that extends the RFT paradigm to cultivate human-like video reasoning capabilities in MLLMs. $\textbf{VideoRFT}$ follows the standard two-stage scheme in RFT: supervised fine-tuning (SFT) with chain-of-thought (CoT) annotations, followed by reinforcement learning (RL) to improve generalization. A central challenge in achieving this in the video domain lies in the scarcity of large-scale, high-quality video CoT datasets. We address this by building a multi-expert-driven, cognition-inspired CoT curation pipeline. First, we devise a cognition-inspired prompting strategy that elicits preliminary CoTs from a reasoning LLM based solely on rich, structured, and literal representations of video content. Subsequently, these CoTs are revised by an MLLM conditioned on the actual video, ensuring visual consistency and reducing visual hallucinations. This pipeline results in two new datasets: VideoRFT-CoT-102K for SFT and VideoRFT-RL-310K for RL. To further strengthen the RL phase, we introduce a novel semantic-consistency reward that explicitly promotes alignment between textual reasoning and visual evidence. This reward encourages the model to produce coherent, context-aware reasoning outputs grounded in visual input. Extensive experiments show that $\textbf{VideoRFT}$ achieves state-of-the-art performance on six video reasoning benchmarks.

<div align="center">
<img src="https://github.com/QiWang98/VideoRFT/raw/main/images/overview.png" />
</div>

## ✨ Methodology

To overcome the scarcity of video CoTs, we develop a scalable, cognition-inspired pipeline for high-quality video CoT dataset construction.

<div align="center">
<img src="https://github.com/QiWang98/VideoRFT/raw/main/images/pipeline.png" width="95%" />
</div>
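
As a rough illustration of the two curation stages just described, the sketch below shows how a single sample could flow through them. The helper names (`call_reasoning_llm`, `call_video_mllm`), prompts, and data fields are purely hypothetical stand-ins, not the project's actual implementation.

```python
# Hypothetical sketch of the two-stage CoT curation flow (illustrative only).
from dataclasses import dataclass


@dataclass
class VideoSample:
    video_path: str
    question: str
    structured_description: str  # rich, literal textual representation of the video


def call_reasoning_llm(prompt: str) -> str:
    # Placeholder: substitute an actual text-only reasoning LLM call here.
    return "<preliminary chain of thought>"


def call_video_mllm(video_path: str, question: str, draft_cot: str) -> str:
    # Placeholder: substitute a video-capable MLLM that checks the draft
    # against the real frames and rewrites visually unsupported claims.
    return "<visually consistent chain of thought>"


def curate_cot(sample: VideoSample) -> str:
    # Stage 1: draft a CoT from the textual video representation alone.
    draft = call_reasoning_llm(
        f"Video description:\n{sample.structured_description}\n\n"
        f"Question: {sample.question}\n"
        "Reason step by step before giving the answer."
    )
    # Stage 2: revise the draft conditioned on the actual video.
    return call_video_mllm(sample.video_path, sample.question, draft)
```
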

To further strengthen the RL phase, we introduce a novel semantic-consistency reward that explicitly promotes alignment between textual reasoning and visual evidence.

<div align="center">
<img src="https://github.com/QiWang98/VideoRFT/raw/main/images/grpo.png" width="95%" />
</div>
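
To give a concrete flavour of the idea, the snippet below scores a reasoning trace against sampled video frames via cosine similarity in a shared CLIP embedding space. This is only a hedged sketch of a semantic-consistency-style reward: the encoder choice (`openai/clip-vit-base-patch32`), mean-pooling over frames, text truncation, and the mapping to [0, 1] are assumptions for illustration, not the paper's exact formulation.

```python
# Sketch of a semantic-consistency-style reward: cosine similarity between the
# reasoning text and sampled video frames in a shared CLIP space (assumed setup).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def semantic_consistency_reward(reasoning_text: str, frames: list[Image.Image]) -> float:
    inputs = processor(
        text=[reasoning_text], images=frames,
        return_tensors="pt", padding=True, truncation=True,
    )
    with torch.no_grad():
        out = model(**inputs)
    text_emb = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)    # (1, d)
    frame_emb = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True) # (n_frames, d)
    video_emb = frame_emb.mean(dim=0, keepdim=True)                            # pool over frames
    video_emb = video_emb / video_emb.norm(dim=-1, keepdim=True)
    cos = (text_emb * video_emb).sum(dim=-1).item()                            # in [-1, 1]
    return 0.5 * (cos + 1.0)                                                   # map to [0, 1]
```
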

## 📀 Datasets

Based on the above pipeline, we construct two large-scale datasets, i.e., [📀VideoRFT-CoT-102K](https://huggingface.co/datasets/QiWang98/VideoRFT-Data) and [📀VideoRFT-RL-310K](https://huggingface.co/datasets/QiWang98/VideoRFT-Data).
<div align="center">
<img src="https://github.com/QiWang98/VideoRFT/raw/main/images/dataset.png" width="50%" />
</div>
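
For a quick look at the released data, the snippet below loads the repository with 🤗 Datasets. The repository id comes from this card; whether an explicit configuration or split name is required depends on the dataset layout, so treat the call as a sketch and check the dataset card for details.

```python
# Minimal sketch for browsing the VideoRFT data with 🤗 Datasets.
from datasets import load_dataset

# May require an explicit configuration name (e.g. CoT vs. RL); see the dataset card.
data = load_dataset("QiWang98/VideoRFT-Data")

print(data)                  # available splits and their sizes
first_split = next(iter(data))
print(data[first_split][0])  # inspect one example
```
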

## 🛠️ Setup

### Requirements
* `Python >= 3.11`
* `PyTorch >= 2.5.1`
* `transformers == 4.51.3`
* `vLLM == 0.7.3`
* `trl == 0.16.0`

### Installation
```bash
git clone https://github.com/QiWang98/VideoRFT
cd VideoRFT

# Create and activate the environment
conda create -n VideoRFT python=3.11
conda activate VideoRFT
bash setup.sh

# Install decord for improved video processing
cd src/qwen-vl-utils
pip install -e .[decord]
```

## 🚀 Training

### Supervised Fine-Tuning (SFT)
We begin with supervised fine-tuning on the VideoRFT-CoT dataset for one epoch:

```bash
bash ./src/scripts/run_sft_video.sh
```

This step can be skipped by directly using our pretrained SFT models, available at [🤗VideoRFT-SFT-7B](https://huggingface.co/QiWang98/VideoRFT-SFT) or [🤗VideoRFT-SFT-3B](https://huggingface.co/QiWang98/VideoRFT-SFT-3B).

### Reinforcement Learning (RL)

Next, perform reinforcement learning using the VideoRFT-RL dataset:

```bash
bash ./src/scripts/run_grpo_video.sh
```

To enable faster training via vLLM acceleration:

```bash
bash ./src/scripts/run_grpo_vllm_qwen25vl.sh
```

> **Note:** During training, we adopt the following settings for efficiency:

* **VIDEO PIXELS**: 128 × 28 × 28
* **FPS FRAMES**: 16

All frame-related configurations can be adjusted in `src/qwen-vl-utils`.

## 📈 Inference & Evaluation

> During inference, we increase the maximum frame resolution and number of frames to boost performance:

* **VIDEO PIXELS**: 256 × 28 × 28
* **FPS FRAMES**: 32

You can configure these parameters in `src/qwen-vl-utils`.

> We evaluate all models under a unified decoding configuration, following the official Qwen2.5-VL demo; a minimal inference sketch is given after the list below:

* `top_p = 0.001`
* `temperature = 0.01`
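
The sketch below puts these decoding settings together with the frame configuration listed earlier in a minimal single-video example. It follows the generic Qwen2.5-VL + `qwen_vl_utils` recipe; the exact keys accepted by the repository's patched `qwen-vl-utils` (here `max_pixels` and `nframes`), the prompt wording, and `max_new_tokens` are assumptions rather than the repository's evaluation code.

```python
# Minimal, assumed inference sketch for VideoRFT (generic Qwen2.5-VL recipe).
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info  # patched copy lives in src/qwen-vl-utils

model_id = "QiWang98/VideoRFT"  # or QiWang98/VideoRFT-3B
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [{
    "role": "user",
    "content": [
        # Mirrors the inference settings above (256 x 28 x 28 pixels, 32 frames);
        # the parameter names may differ in the repo's patched utils.
        {"type": "video", "video": "file:///path/to/video.mp4",
         "max_pixels": 256 * 28 * 28, "nframes": 32},
        {"type": "text", "text": "What happens after the person opens the door?"},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

# Unified decoding configuration from this card.
output_ids = model.generate(
    **inputs, max_new_tokens=1024, do_sample=True, top_p=0.001, temperature=0.01
)
answer = processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```
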

### Evaluation Procedure

1. Download the preprocessed evaluation JSONs from [🤗 eval](https://huggingface.co/datasets/Video-R1/Video-R1-eval).

2. Download the video data from the official sites of each benchmark and organize them as specified in the JSON files.

3. Run the evaluation across all benchmarks:

```bash
bash ./src/eval_bench.sh
```

## 🙏 Acknowledgements

We gratefully acknowledge the contributions of the open-source community, particularly [DeepSeek-R1](https://github.com/deepseek-ai/DeepSeek-R1), [Open-R1](https://github.com/huggingface/open-r1), and [R1-V](https://github.com/Deep-Agent/R1-V).

## 📚 Citations

If you find this work helpful, please consider citing:

```bibtex
@article{VideoRFT,
  title={VideoRFT: Incentivizing Video Reasoning Capability in MLLMs via Reinforced Fine-Tuning},
  author={Wang, Qi and Yu, Yanrui and Yuan, Ye and Mao, Rui and Zhou, Tianfei},
  journal={arXiv preprint arXiv:2505.12434},
  year={2025}
}
```