
LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling


LongVT introduces a novel paradigm that natively interleaves multimodal tool-augmented Chain-of-Thought with on-demand clip inspection over hours-long videos, enabling large multimodal models to perform more effective and reliable long-video reasoning.

Overview

Our contributions are threefold:

(1) LongVT: An End-to-End Agentic Framework for “Thinking with Long Videos”
We introduce a novel paradigm that natively interleaves multimodal tool-augmented Chain-of-Thought (CoT) with on-demand clip inspection over hours-long videos, thereby enabling large multimodal models (LMMs) to perform more effective and reliable long-video reasoning.

(2) VideoSIAH: A Fine-Grained Data Suite for Evidence-Sparse Long-Video Reasoning
We construct a scalable data pipeline that produces diverse and high-quality question-answering (QA) data and tool-integrated reasoning traces, and a dedicated benchmark under a video segment-in-a-haystack setting.

(3) LongVT-7B-RFT: A State-of-the-Art Baseline with Invaluable Insights
Through extensive quantitative comparisons, systematic ablations on data recipes, training strategies, and design choices, as well as in-depth analyses of training dynamics, we establish and open-source a powerful baseline model with “thinking with long videos” capabilities.

LongVT Interleaved Multimodal Chain-of-Tool-Thought

Interleaved Multimodal Chain-of-Tool-Thought (iMCoTT). Compared to prior text-based CoT reasoning, iMCoTT in our proposed LongVT natively performs self-reflection by calling the crop_video(start_time, end_time) tool. After a global preview, the model proposes a time window, proactively fetches the corresponding short clip, rethinks based on the new evidence, and decides whether to refine the window or answer directly. This tool-augmented reasoning grounds each step in what is actually seen, rather than blindly rephrasing earlier text as text-only CoT does, which mitigates hallucination and improves both temporal localization and answer correctness.
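The propose, fetch, rethink, answer loop described above can be sketched as a small control-flow skeleton. This is only an illustration under stated assumptions: the crop_video name and its (start_time, end_time) arguments come from the text, while Step, run_imcott, toy_policy, and the max_turns limit are hypothetical names introduced here, and the tool body is a stand-in that returns the time window instead of decoding frames.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Clip = Tuple[float, float]  # (start_time, end_time) in seconds


@dataclass
class Step:
    """One reasoning turn: either request a clip or give a final answer."""
    thought: str
    window: Optional[Clip] = None  # set when the model wants to call crop_video
    answer: Optional[str] = None   # set when the model is ready to answer


def crop_video(start_time: float, end_time: float) -> Clip:
    """Stand-in for the tool (assumption): validates and returns the window.

    A real tool would return the frames of the requested clip.
    """
    if not 0 <= start_time < end_time:
        raise ValueError("expected 0 <= start_time < end_time")
    return (start_time, end_time)


def run_imcott(policy: Callable[[str, List[Clip]], Step],
               question: str,
               max_turns: int = 5) -> Tuple[Optional[str], List[Clip]]:
    """Alternate between thinking and clip inspection until an answer is given."""
    evidence: List[Clip] = []  # clips inspected so far
    for _ in range(max_turns):
        step = policy(question, evidence)
        if step.answer is not None:   # the model decides to answer directly
            return step.answer, evidence
        if step.window is None:       # neither an answer nor a tool call: stop
            break
        evidence.append(crop_video(*step.window))  # fetch the clip, then rethink
    return None, evidence


def toy_policy(question: str, evidence: List[Clip]) -> Step:
    """Hypothetical policy: proposes a window, refines it once, then answers."""
    if not evidence:
        return Step("the preview suggests the event is near the one-hour mark",
                    window=(3600.0, 3630.0))
    if len(evidence) == 1:
        return Step("not visible in that clip, shift the window",
                    window=(3630.0, 3660.0))
    return Step("the second clip shows it", answer="a red umbrella")


answer, clips = run_imcott(toy_policy, "What does the speaker pick up?")
```

In this toy run the policy makes two tool calls before answering, so clips holds the two inspected windows in order. A trained model would take the place of toy_policy.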


Motivation of VideoSIAH

Long-video reasoning presents a fundamentally different challenge from previous video QA settings: LMMs must locate sparse, fine-grained, and causally decisive moments embedded within hours-long content. However, existing LMMs are mostly trained with coarse-grained and clip-level data. This mismatch leaves modern LMMs lacking the supervision needed to learn how temporal hypotheses are formed, verified, or revised—a critical yet underexplored capability for agentic long-video reasoning.

Moreover, most existing video understanding benchmarks only offer multiple-choice QAs, which can be solved without genuine temporal grounding and are vulnerable to dataset leakage or shortcut exploitation. To fill this gap, we introduce VideoSIAH, a large-scale, diverse, and high-quality data suite that comprises both a training dataset capturing the reasoning dynamics required for video segment-in-a-haystack QA and a fine-grained evaluation benchmark, VideoSIAH-Eval, with human-in-the-loop validation for long-video open-ended question-answering.

We conduct a rigorous contamination study on the Qwen-VL series across two probing settings: (1) No Visual, where we feed the text prompt without video frames to test for direct memorization; (2) Rearranged Choices, where we randomize the mapping between option labels and their textual content for multiple-choice questions to detect label memorization. Our experimental results reveal significant vulnerabilities in existing benchmarks and highlight the necessity of our proposed VideoSIAH-Eval.

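| Setting | VideoMME (w/o sub) | VideoMMMU adapt. | VideoMMMU comp. | VideoMMMU perc. | VideoSIAH-Eval |
|---|---|---|---|---|---|
| Qwen2.5-VL-7B-Instruct | | | | | |
| Original | 64.3 | 35.7 | 44.3 | 56.7 | 33.8 |
| No Visual | 40.1 | 25.7 | 38.3 | 39.3 | 12.7 |
| Rearranged Choices | 56.0 | 29.7 | 40.3 | 67.0 | - |
| Qwen3-VL-8B-Instruct | | | | | |
| Original | 69.3 | 40.7 | 60.3 | 71.3 | 46.6 |
| No Visual | 44.1 | 33.7 | 39.3 | 46.7 | 0.00 |
| Rearranged Choices | 69.0 | 36.3 | 47.7 | 69.3 | - |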

Contamination Tests for Qwen-VL Series on Long Video Understanding and Reasoning Benchmarks. The VideoSIAH-Eval column shows "-" entries for Rearranged Choices since our proposed benchmark is fully open-ended QA, where random option-answer mapping is not applicable.
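The two probing settings can be illustrated with a short sketch. It is not the evaluation code behind the table: rearrange_choices and no_visual_prompt are hypothetical helpers, and the prompt layout is an assumption. The first function reassigns option texts to labels at random while tracking where the correct text lands, and the second builds a text-only prompt with no video frames attached.

```python
import random
from typing import Dict, Tuple


def rearrange_choices(options: Dict[str, str], answer_label: str,
                      rng: random.Random) -> Tuple[Dict[str, str], str]:
    """Randomize which option text sits under which label.

    Returns the shuffled options and the label that now holds the correct text.
    """
    labels = sorted(options)                 # e.g. ["A", "B", "C", "D"]
    order = list(range(len(labels)))
    rng.shuffle(order)                       # order[i] = source index for slot i
    shuffled = {labels[i]: options[labels[src]] for i, src in enumerate(order)}
    old_index = labels.index(answer_label)
    new_label = labels[order.index(old_index)]
    return shuffled, new_label


def no_visual_prompt(question: str, options: Dict[str, str]) -> str:
    """Build a text-only prompt (assumed layout); no video frames are included."""
    lines = [question]
    lines += [f"{label}. {text}" for label, text in sorted(options.items())]
    lines.append("Answer with the option letter.")
    return "\n".join(lines)


options = {"A": "a bicycle", "B": "a red umbrella", "C": "a kettle", "D": "a drum"}
shuffled, new_label = rearrange_choices(options, "B", random.Random(0))
prompt = no_visual_prompt("What does the speaker pick up?", shuffled)
```

A model that has memorized "the answer is B" for a benchmark item tends to lose accuracy once the correct text moves to a different label, whereas a model that answers from content should not.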


Data Pipeline

VideoSIAH Data Pipeline

Data Pipeline of VideoSIAH. We construct a semi-automatic data pipeline that integrates several state-of-the-art LMMs to sequentially perform long video segmentation, video clip captioning, segment-in-a-haystack QA generation, cross-modal QA filtering, and iMCoTT generation. Icons with human silhouettes denote human-in-the-loop validation, where annotators inspect a small set of representative failures to refine prompting rules for QA generation, QA filtering, and iMCoTT generation. Note that iMCoTT traces are generated only for the cold-start supervised fine-tuning (SFT) stage, whereas reinforcement learning (RL) operates solely on the filtered QA pairs.


Dataset Statistics

| Split | Source | Purpose | Samples | Total |
|---|---|---|---|---|
| SFT (w/o tool) | LongVideo-Reason CoT | Reasoning-augmented Open-ended QA | 5,238 | 228,835 |
| | Video-R1 CoT | Reasoning-augmented Video QA | 165,575 | |
| | Image-based CoT | Reasoning-augmented Image QA | 58,022 | |
| SFT (w/ tool) | Gemini-distilled iMCoTT | Tool-augmented Open-ended QA | 12,766 | 19,161 |
| | Qwen-distilled iMCoTT | Tool-augmented Temporal Grounding | 6,395 | |
| RL | Gemini-distilled QAs | Open-ended QA over Long Videos | 1,667 | 17,020 (RL + RFT) |
| RFT | Self-distilled iMCoTT | Agentic Behaviors | 15,353 | |

Dataset Statistics of VideoSIAH. Our proposed dataset contains large-scale non-tool SFT data, tool-augmented SFT data, RL QAs, and self-distilled reinforcement fine-tuning (RFT) traces.

Video Category Distribution
Question Category Distribution

Category Distribution of VideoSIAH-Eval. We present the distribution of video types (left) and question types (right), highlighting the diversity of our proposed benchmark.


Quantitative Comparisons

We compare our LongVT models against proprietary LMMs and state-of-the-art open-source video reasoning models across various long video understanding and reasoning benchmarks.

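| Model | Reasoning Prompt | Tool Calling | VideoMME (w/ sub) | VideoMMMU adapt. | VideoMMMU comp. | VideoMMMU perc. | LVBench | VideoSIAH-Eval | Avg |
|---|---|---|---|---|---|---|---|---|---|
| Proprietary LMMs | | | | | | | | | |
| GPT-4o | | | 77.2 | 66.0 | 62.0 | 55.7 | 30.8 | 17.4 | 51.5 |
| Gemini 1.5 Pro | | | 81.3 | 59.0 | 53.3 | 49.3 | 33.1 | - | 55.2 |
| Open-Source (Sparse Sampling) | | | | | | | | | |
| Qwen2.5-VL-7B | | | 62.6 | 37.3 | 28.0 | 36.7 | 30.7 | 28.1 | 37.2 |
| Video-R1-7B | | | 61.0 | 36.3 | 40.7 | 52.3 | 37.2 | 27.9 | 42.6 |
| VideoRFT-7B | | | 60.9 | 36.7 | 42.0 | 53.0 | 34.7 | 26.5 | 42.3 |
| Video-Thinker-7B | | | 61.0 | 34.3 | 44.7 | 53.0 | 52.2 | 10.4 | 42.6 |
| LongVT-7B-SFT (Ours) | | | 12.5 | 37.7 | 46.0 | 58.3 | 36.0 | 26.8 | 36.2 |
| LongVT-7B-RL (Ours) | | | 66.1 | 32.7 | 44.7 | 50.0 | 37.8 | 31.0 | 43.7 |
| Open-Source (Dense Sampling) | | | | | | | | | |
| Qwen2.5-VL-7B | | | 64.3 | 35.7 | 44.3 | 56.7 | 40.9 | 33.8 | 46.0 |
| Video-R1-7B | | | 60.5 | 37.3 | 38.7 | 46.3 | 40.1 | 33.1 | 42.7 |
| VideoRFT-7B | | | 49.2 | 37.7 | 40.7 | 48.7 | 18.7 | 26.9 | 37.0 |
| Video-Thinker-7B | | | 60.8 | 37.7 | 42.7 | 55.3 | 54.3 | 6.6 | 42.9 |
| LongVT-7B-SFT (Ours) | | | 64.9 | 32.3 | 42.0 | 49.7 | 41.1 | 34.8 | 44.1 |
| LongVT-7B-RL (Ours) | | | 66.1 | 37.7 | 42.3 | 56.3 | 41.4 | 35.9 | 46.6 |
| LongVT-7B-RFT (Ours) | | | 67.0 | 35.7 | 43.7 | 56.7 | 41.3 | 42.0 | 47.7 |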

Performance Comparison with Existing Video-Centric LMMs across Various Long Video Understanding and Reasoning Benchmarks. The best and second-best results among open-source models in each column are marked in bold and underlined, respectively.


Ablation Studies

We conduct comprehensive ablation studies to examine the impact of data recipes, training stages, and reward design on model performance.

Data Recipe

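| Setting | VideoMME (w/ sub) | VideoMMMU adapt. | VideoMMMU comp. | VideoMMMU perc. | LVBench | VideoSIAH-Eval | Avg |
|---|---|---|---|---|---|---|---|
| SFT w/o self-curated iMCoTT | 8.4 | 33.6 | 41.6 | 46.0 | 15.1 | 4.1 | 24.8 |
| SFT w/ self-curated iMCoTT | 64.9 | 32.3 | 42.0 | 49.7 | 41.1 | 34.8 | 44.1 |
| RL w/o self-curated QAs | 55.1 | 30.6 | 42.0 | 45.6 | 38.4 | 30.8 | 40.4 |
| RL w/ self-curated QAs | 66.1 | 37.7 | 42.3 | 56.3 | 41.4 | 35.9 | 46.6 |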

Training Stage

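| Setting | VideoMME (w/ sub) | VideoMMMU adapt. | VideoMMMU comp. | VideoMMMU perc. | LVBench | VideoSIAH-Eval | Avg |
|---|---|---|---|---|---|---|---|
| SFT only | 64.9 | 32.3 | 42.0 | 49.7 | 41.1 | 34.8 | 44.1 |
| RL only | 52.7 | 35.3 | 43.0 | 55.1 | 37.1 | 28.2 | 41.9 |
| SFT+RL | 66.1 | 37.7 | 42.3 | 56.3 | 41.4 | 35.9 | 46.6 |
| SFT+RL+RFT | 67.0 | 35.7 | 43.7 | 56.7 | 41.3 | 42.0 | 47.7 |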

Training Dynamics

Training Dynamics and Ablations on Reward Design

(a) shows training dynamics under different accuracy and time rewards, and (b) shows the effect of tool-call reward on tool usage.

Recall encourages coverage; IoU demands precision. Using Recall as the reward function during RL has a drawback: the policy can enlarge the predicted span until it envelops the ground-truth interval, which monotonically raises the Recall-based score while ignoring boundary quality. The plateau in the Recall Accuracy Score curve validates this hypothesized reward hacking. In contrast, IoU explicitly penalizes span inflation via the union term, yielding better-aligned boundaries and more disciplined tool use.
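A small numeric sketch makes the span-inflation argument concrete. It uses the standard definitions of Recall and IoU for one-dimensional time intervals. The exact reward formulas used in training are not given here, so the functions below are assumptions for illustration, and the example spans are made up.

```python
from typing import Tuple

Span = Tuple[float, float]  # (start_time, end_time) in seconds


def overlap(pred: Span, gt: Span) -> float:
    """Length of the intersection of two intervals (0 if they are disjoint)."""
    return max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))


def recall(pred: Span, gt: Span) -> float:
    """Fraction of the ground-truth interval covered by the prediction."""
    return overlap(pred, gt) / (gt[1] - gt[0])


def iou(pred: Span, gt: Span) -> float:
    """Intersection over union; the union grows with an inflated prediction."""
    inter = overlap(pred, gt)
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


gt = (100.0, 120.0)        # 20-second ground-truth segment
tight = (105.0, 125.0)     # slightly misaligned 20-second prediction
inflated = (0.0, 3600.0)   # prediction covering a whole hour

print(recall(tight, gt), iou(tight, gt))        # 0.75 0.6
print(recall(inflated, gt), iou(inflated, gt))  # recall is 1.0, IoU is below 0.01
```

The hour-long span earns a perfect Recall score even though it localizes nothing, while its IoU drops to about 0.006 because the union term scales with the predicted length. The slightly misaligned 20-second span scores lower on Recall but far higher on IoU.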

Is tool reward really necessary? The Qwen2.5-VL-7B baseline collapses to near-zero tool calls after training in both configurations (w/ and w/o tool reward), indicating that the model does not internalize the tool’s function. After performing cold-start SFT to obtain LongVT-7B-SFT, tool-call frequency rises during training under both configurations and accuracy improves in tandem. Hence, the tool reward is not required for basic competence: once SFT grounds the tool’s semantics, the model learns when and how to invoke the tool.


Open-Source Resources
We open-source LongVT to facilitate the community's future development of long-video reasoning with tool calling.
Model Checkpoints
Pre-trained models with SFT, RL, and RFT optimization
Training Datasets
VideoSIAH data suite for long-video reasoning
2025

Authors

Zuhao Yang*, Sudong Wang*, Kaichen Zhang*, Keming Wu, Sicong Leng, Yifan Zhang, Bo Li, Chengwei Qin, Shijian Lu, Xingxuan Li, Lidong Bing

* Main Authors