LongVT: Incentivizing “Thinking with Long Videos” via Native Tool Calling
Overview
Large multimodal models (LMMs) have shown great potential for video reasoning with textual Chain-of-Thought. However, they remain vulnerable to hallucination, especially when processing long-form videos where evidence is sparse and temporally dispersed. Inspired by how humans comprehend long videos (first skimming globally, then examining relevant clips for details), we introduce LongVT, an end-to-end agentic framework that enables "Thinking with Long Videos" via interleaved Multimodal Chain-of-Tool-Thought. Specifically, we exploit LMMs' inherent temporal grounding ability as a native video cropping tool to zoom in on a specific video clip and resample finer-grained video frames.
This global-to-local reasoning loop continues until answers are grounded in retrieved visual evidence. Given the scarcity of fine-grained question-answering (QA) data for long-video reasoning, we curate and will release a data suite named VideoSIAH to facilitate both training and evaluation. The training dataset consists of 247.9K samples for tool-integrated cold-start supervised fine-tuning, 1.6K samples for agentic reinforcement learning, and 15.4K samples for agentic reinforcement fine-tuning. The evaluation benchmark consists of 1,280 QA pairs curated through a semi-automatic data pipeline with human-in-the-loop validation. With a three-stage training strategy and extensive empirical validation, LongVT consistently outperforms existing strong baselines across four challenging long-video understanding and reasoning benchmarks.
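The global-to-local loop described above can be illustrated with a minimal sketch. This is not the LongVT implementation: the `Step` schema, the `policy` callable (a stand-in for the LMM), the frame budget, and the turn limit are all assumptions made for illustration. The only idea taken from the text is that the model first sees frames sampled across the whole video, may request a crop, and then sees the same frame budget resampled inside the cropped window.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Window = Tuple[float, float]  # (start_sec, end_sec)


@dataclass
class Step:
    """One model turn (hypothetical schema): a final answer or a crop request."""
    answer: Optional[str] = None
    crop: Optional[Window] = None


def sample_timestamps(start: float, end: float, num_frames: int) -> List[float]:
    """Uniformly sample frame timestamps at the bin centers of [start, end]."""
    if end <= start or num_frames <= 0:
        raise ValueError("need end > start and num_frames > 0")
    stride = (end - start) / num_frames
    return [start + (i + 0.5) * stride for i in range(num_frames)]


def global_to_local(
    duration: float,
    policy: Callable[[Window, List[float]], Step],
    num_frames: int = 8,   # assumed frame budget per turn
    max_turns: int = 4,    # assumed cap on tool-calling turns
) -> Tuple[Optional[str], Window]:
    """Skim the full video, then zoom into requested clips until an answer is given."""
    window: Window = (0.0, duration)
    for _ in range(max_turns):
        frames = sample_timestamps(window[0], window[1], num_frames)
        step = policy(window, frames)
        if step.answer is not None:
            return step.answer, window
        if step.crop is None:
            break
        # Clamp the requested clip to the video bounds before resampling.
        start, end = step.crop
        window = (max(0.0, start), min(duration, end))
    return None, window


def toy_policy(window: Window, frames: List[float]) -> Step:
    """Stand-in for the LMM: crop once when the view is coarse, then answer."""
    start, end = window
    if end - start > 60.0:
        return Step(crop=(600.0, 630.0))
    return Step(answer="found")
```

Calling `global_to_local(3600.0, toy_policy)` on a one-hour video returns the toy answer together with the window `(600.0, 630.0)`. With 8 frames per turn, the spacing between sampled frames drops from 450 s in the global pass to 3.75 s inside the cropped clip, which is the "finer-grained" resampling the text refers to.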
Model Card
This model is the supervised fine-tuning (SFT) version of LongVT. It was trained on the dataset at https://huggingface.co/datasets/longvideotool/LongVT-Parquet.
Usage & Evaluation
For detailed instructions on inference and evaluation, please refer to our GitHub repository. We recommend using the scripts and environment provided there to reproduce our results.
Evaluation Results
| Model | Reasoning Prompt | Tool Calling | VideoMME (≈1018s) | VideoMMMU (subtitle) | VideoMMMU (adaptation) | VideoMMMU (comprehension) | LVBench (≈4101s) | VideoSIAH-Eval (≈1688s) | Average Score |
|---|---|---|---|---|---|---|---|---|---|
| Proprietary LMMs | | | | | | | | | |
| GPT-4o | ✗ | ✗ | 77.2† | 66.0† | 62.0† | 55.7† | 30.8† | 17.4 | 51.5 |
| Gemini 1.5 Pro | ✗ | ✗ | 81.3† | 59.0† | 53.3† | 49.3† | 33.1† | - | 55.2 |
| Open-Source (Sparse) | | | | | | | | | |
| Qwen2.5-VL-7B | ✗ | ✗ | 62.6 | 37.3 | 28.0 | 36.7 | 30.7 | 28.1 | 37.2 |
| Video-R1-7B | ✓ | ✗ | 61.0 | 36.3 | 40.7 | 52.3 | 37.2 | 27.9 | 42.6 |
| VideoRFT-7B | ✓ | ✗ | 60.9 | 36.7 | 42.0 | 53.0 | 34.7 | 26.5 | 42.3 |
| Video-Thinker-7B | ✓ | ✗ | 61.0 | 34.3 | 44.7 | 53.0 | 52.2 | 10.4 | 42.6 |
| LongVT-7B-SFT (Ours) | ✓ | ✓ | 12.5 | 37.7 | 46.0 | 58.3 | 36.0 | 26.8 | 36.2 |
| LongVT-7B-RL (Ours) | ✓ | ✓ | 66.1 | 32.7 | 44.7 | 50.0 | 37.8 | 31.0 | 43.7 |
| Open-Source (Dense) | | | | | | | | | |
| Qwen2.5-VL-7B | ✗ | ✗ | 64.3 | 35.7 | 44.3 | 56.7 | 40.9 | 33.8 | 46.0 |
| Video-R1-7B | ✓ | ✗ | 60.5 | 37.3 | 38.7 | 46.3 | 40.1 | 33.1 | 42.7 |
| VideoRFT-7B | ✓ | ✗ | 49.2 | 37.7 | 40.7 | 48.7 | 18.7 | 26.9 | 37.0 |
| Video-Thinker-7B | ✓ | ✗ | 60.8 | 37.7 | 42.7 | 55.3 | 54.3 | 6.6 | 42.9 |
| LongVT-7B-SFT (Ours) | ✓ | ✓ | 64.9 | 32.3 | 42.0 | 49.7 | 41.1 | 34.8 | 44.1 |
| LongVT-7B-RL (Ours) | ✓ | ✓ | 66.1 | 37.7 | 42.3 | 56.3 | 41.4 | 35.9 | 46.6 |
| LongVT-7B-RFT (Ours) | ✓ | ✓ | 67.0 | 35.7 | 43.7 | 56.7 | 41.3 | 42.0 | 47.7 |
Performance comparison with existing video-centric LMMs across long-video understanding and reasoning benchmarks. The best and second-best results among open-source models in each column are marked in bold and underlined, respectively. Numbers prefixed with "≈" denote the average video duration of each benchmark. † indicates results sourced from official reports. Reasoning Prompt indicates whether a standard reasoning-style prompt (✓) or a direct question-answering prompt (✗) is applied; Tool Calling denotes whether native tool calling is enabled (✓) or disabled (✗) in the prompt.
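The Average Score column appears consistent with an unweighted mean over the benchmark columns that have a reported value, rounded to one decimal place. The card does not state this formula, so the sketch below is an assumption checked only against the rows of the table; the function name `average_score` is hypothetical. `Decimal` with half-up rounding is used so that exact ties such as 45.95 round the same way as in the table.

```python
from decimal import Decimal, ROUND_HALF_UP
from typing import List, Optional


def average_score(scores: List[Optional[float]]) -> float:
    """Unweighted mean of the reported scores, rounded half-up to one decimal.

    Entries given as None (shown as "-" in the table) are skipped.
    """
    available = [Decimal(str(s)) for s in scores if s is not None]
    if not available:
        raise ValueError("no scores reported")
    mean = sum(available) / len(available)
    return float(mean.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP))


# Rows copied from the table above, in column order:
# VideoMME, VideoMMMU (three columns), LVBench, VideoSIAH-Eval.
longvt_rft_dense = [67.0, 35.7, 43.7, 56.7, 41.3, 42.0]
gemini_15_pro = [81.3, 59.0, 53.3, 49.3, 33.1, None]  # VideoSIAH-Eval not reported
qwen25_vl_dense = [64.3, 35.7, 44.3, 56.7, 40.9, 33.8]
```

Under this assumption, `average_score(longvt_rft_dense)` gives 47.7, `average_score(gemini_15_pro)` gives 55.2 (a mean over five columns), and `average_score(qwen25_vl_dense)` gives 46.0, matching the table. Because proprietary rows may average over fewer benchmarks, their Average Score is not strictly comparable with rows that report all six.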
Citation
If you find LongVT useful for your research and applications, please cite using this BibTeX:
@misc{yang2025longvtincentivizingthinkinglong,
title={LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling},
author={Zuhao Yang and Sudong Wang and Kaichen Zhang and Keming Wu and Sicong Leng and Yifan Zhang and Chengwei Qin and Shijian Lu and Xingxuan Li and Lidong Bing},
year={2025},
eprint={2511.20785},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2511.20785},
}
Acknowledgements
We gratefully acknowledge the following open-source projects that made this work possible:
- lmms-eval for providing the comprehensive evaluation framework for large multimodal models.
- lmms-engine for the SFT training infrastructure and tools.
- verl for the reinforcement learning training framework.
We thank the developers and contributors of these projects for their excellent work and for making their code publicly available.
Base model for longvideotool/LongVT-SFT: Qwen/Qwen2.5-VL-7B-Instruct