LongVT: Incentivizing “Thinking with Long Videos” via Native Tool Calling

Overview

Large multimodal models (LMMs) have shown great potential for video reasoning with textual Chain-of-Thought. However, they remain vulnerable to hallucination, especially when processing long-form videos where evidence is sparse and temporally dispersed. Inspired by how humans comprehend long videos (first skimming globally, then examining relevant clips for details), we introduce LongVT, an end-to-end agentic framework that enables “Thinking with Long Videos” via interleaved Multimodal Chain-of-Tool-Thought. Specifically, we exploit LMMs' inherent temporal grounding ability as a native video cropping tool to zoom in on a specific video clip and resample finer-grained video frames.

This global-to-local reasoning loop continues until answers are grounded in retrieved visual evidence. Given the scarcity of fine-grained question-answering (QA) data for the long video reasoning task, we curate and will release a data suite named VideoSIAH to facilitate both training and evaluation. The training set comprises 247.9K samples for tool-integrated cold-start supervised fine-tuning, 1.6K samples for agentic reinforcement learning, and 15.4K samples for agentic reinforcement fine-tuning. The evaluation benchmark consists of 1,280 QA pairs curated through a semi-automatic data pipeline with human-in-the-loop validation. With a three-stage training strategy and extensive empirical validation, LongVT consistently outperforms existing strong baselines across four challenging long-video understanding and reasoning benchmarks.
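The global-to-local loop described above can be illustrated with a minimal sketch. It is not the LongVT implementation: `Step`, `sample_timestamps`, `think_with_video`, and the `policy` callback are hypothetical names introduced here, and the model is replaced by a plain Python function. The sketch only shows the control flow: sample frames sparsely over the current window, let the policy either answer or request a crop, then resample the same number of frames over the narrower window so that they are temporally denser.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Window = Tuple[float, float]  # (start_s, end_s) in seconds


@dataclass
class Step:
    """One model turn (hypothetical schema): a final answer or a crop request."""
    answer: Optional[str] = None
    crop: Optional[Window] = None


def sample_timestamps(start_s: float, end_s: float, num_frames: int) -> List[float]:
    """Return timestamps at the midpoints of num_frames equal bins in [start_s, end_s]."""
    if num_frames <= 0 or end_s <= start_s:
        raise ValueError("need num_frames > 0 and end_s > start_s")
    step = (end_s - start_s) / num_frames
    return [start_s + (i + 0.5) * step for i in range(num_frames)]


def think_with_video(
    duration_s: float,
    policy: Callable[[Window, List[float]], Step],
    num_frames: int = 8,
    max_turns: int = 4,
) -> Tuple[Optional[str], Window]:
    """Skim the whole video first, then zoom into requested clips until an answer appears."""
    window: Window = (0.0, duration_s)
    for _ in range(max_turns):
        # Same frame budget over a shorter window means finer temporal resolution.
        frames = sample_timestamps(window[0], window[1], num_frames)
        step = policy(window, frames)
        if step.answer is not None:
            return step.answer, window
        if step.crop is None:
            break
        # Clamp the requested clip to the video bounds before zooming in.
        start = max(0.0, step.crop[0])
        end = min(duration_s, step.crop[1])
        if end <= start:
            break
        window = (start, end)
    return None, window


def toy_policy(window: Window, frames: List[float]) -> Step:
    """Stand-in for the model: crop once on the global view, then answer."""
    if window == (0.0, 600.0):
        return Step(crop=(100.0, 130.0))
    return Step(answer="found")


answer, final_window = think_with_video(600.0, toy_policy)
```

With a 600-second video and 8 frames, the global pass spaces frames 75 seconds apart, while the cropped 30-second window spaces them 3.75 seconds apart. That difference is the "finer-grained" resampling in this toy setting.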

Model Card

This model is the SFT version of LongVT, trained on https://huggingface.co/datasets/longvideotool/LongVT-Parquet.

Usage & Evaluation

For detailed instructions on inference and evaluation, please refer to our GitHub repository. We recommend using the scripts and environment provided there to reproduce our results.

Evaluation Results

Model	Reasoning Prompt	Tool Calling	VideoMME (≈1018s)	VideoMMMU (subtitle)	VideoMMMU (adaptation)	VideoMMMU (comprehension)	LVBench (≈4101s)	VideoSIAH-Eval (≈1688s)	Average Score
Proprietary LMMs
GPT-4o	✗	✗	77.2^†	66.0^†	62.0^†	55.7^†	30.8^†	17.4	51.5
Gemini 1.5 Pro	✗	✗	81.3^†	59.0^†	53.3^†	49.3^†	33.1^†	-	55.2
Open-Source (Sparse)
Qwen2.5-VL-7B	✗	✗	62.6	37.3	28.0	36.7	30.7	28.1	37.2
Video-R1-7B	✓	✗	61.0	36.3	40.7	52.3	37.2	27.9	42.6
VideoRFT-7B	✓	✗	60.9	36.7	42.0	53.0	34.7	26.5	42.3
Video-Thinker-7B	✓	✗	61.0	34.3	44.7	53.0	52.2	10.4	42.6
LongVT-7B-SFT (Ours)	✓	✓	12.5	37.7	46.0	58.3	36.0	26.8	36.2
LongVT-7B-RL (Ours)	✓	✓	66.1	32.7	44.7	50.0	37.8	31.0	43.7
Open-Source (Dense)
Qwen2.5-VL-7B	✗	✗	64.3	35.7	44.3	56.7	40.9	33.8	46.0
Video-R1-7B	✓	✗	60.5	37.3	38.7	46.3	40.1	33.1	42.7
VideoRFT-7B	✓	✗	49.2	37.7	40.7	48.7	18.7	26.9	37.0
Video-Thinker-7B	✓	✗	60.8	37.7	42.7	55.3	54.3	6.6	42.9
LongVT-7B-SFT (Ours)	✓	✓	64.9	32.3	42.0	49.7	41.1	34.8	44.1
LongVT-7B-RL (Ours)	✓	✓	66.1	37.7	42.3	56.3	41.4	35.9	46.6
LongVT-7B-RFT (Ours)	✓	✓	67.0	35.7	43.7	56.7	41.3	42.0	47.7

Performance comparison with existing video-centric LMMs across various long video understanding and reasoning benchmarks. The best and second-best results among open-source models in each column are marked in bold and underlined, respectively. The numbers with "≈" denote the average video duration of each benchmark. ^† indicates results sourced from official reports. Reasoning Prompt indicates whether a standard reasoning-style prompt (✓) or a direct question-answering prompt (✗) is applied; Tool Calling denotes whether native tool calling is enabled (✓) or disabled (✗) in the prompt.
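The Average Score column appears to be the unweighted mean of the benchmark columns reported for each row, with a missing entry ("-") skipped. This is inferred from the numbers in the table rather than stated in the source, so treat the sketch below as a consistency check, not as the authors' evaluation code. The helper name `average_score` is hypothetical.

```python
from typing import List, Optional


def average_score(scores: List[Optional[float]]) -> float:
    """Mean of the reported scores, rounded to one decimal; None marks a missing entry."""
    reported = [s for s in scores if s is not None]
    if not reported:
        raise ValueError("no reported scores")
    return round(sum(reported) / len(reported), 1)


# Rows copied from the table above (six benchmark columns, in table order).
longvt_rft = [67.0, 35.7, 43.7, 56.7, 41.3, 42.0]
gpt_4o = [77.2, 66.0, 62.0, 55.7, 30.8, 17.4]
gemini_15_pro = [81.3, 59.0, 53.3, 49.3, 33.1, None]  # VideoSIAH-Eval not reported

rft_avg = average_score(longvt_rft)
gpt_avg = average_score(gpt_4o)
gemini_avg = average_score(gemini_15_pro)
```

Under this assumption the Gemini 1.5 Pro average covers five benchmarks, while every other row covers six, so the two kinds of row are not directly comparable.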

Citation

If you find LongVT useful for your research and applications, please cite using this BibTeX:

@misc{yang2025longvtincentivizingthinkinglong,
      title={LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling}, 
      author={Zuhao Yang and Sudong Wang and Kaichen Zhang and Keming Wu and Sicong Leng and Yifan Zhang and Chengwei Qin and Shijian Lu and Xingxuan Li and Lidong Bing},
      year={2025},
      eprint={2511.20785},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2511.20785}, 
}

Check out this paper: https://arxiv.org/abs/2511.20785

Acknowledgements

We gratefully acknowledge the following open-source projects that made this work possible:

lmms-eval for providing the comprehensive evaluation framework for large multimodal models.
lmms-engine for the SFT training infrastructure and tools.
verl for the reinforcement learning training framework.

We thank the developers and contributors of these projects for their excellent work and for making their code publicly available.

Model size: 8B params (Safetensors, tensor type BF16)

Base model: Qwen/Qwen2.5-VL-7B-Instruct (longvideotool/LongVT-SFT is a fine-tune of it)