NOV 27, 2025


Zuhao Yang*,Sudong Wang*,Kaichen Zhang*,Keming Wu,Sicong Leng,Yifan Zhang,Bo Li,Chengwei Qin,Shijian Lu,Xingxuan Li,Lidong Bing

LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling

Paper | Code | Data | Model | Demo

TL;DR

We teach vision-language models to think like humans when watching long videos: instead of trying to memorize everything at once, the model learns to skim first, then zoom in on relevant moments. This is achieved through a native tool-calling mechanism where the model can invoke crop_video(start_time, end_time) to inspect specific clips on demand. The result? A 7B model that outperforms much larger models on long-video understanding tasks.

The Problem: Why Long Videos Are Hard

Imagine watching a 2-hour movie and being asked: "What was the character wearing when they first entered the coffee shop?" Even for humans, this requires a specific strategy: you don't re-watch the entire movie—you mentally locate the relevant scene, then focus on those few seconds.

Current vision-language models don't work this way. They typically:

Sample frames uniformly across the video (missing critical moments)
Process everything at once (overwhelming the context window)
Hallucinate when the answer isn't in the sampled frames

The core issue is that evidence in long videos is sparse and temporally localized. A 2-hour video at 1 FPS gives you 7,200 frames, but the answer to your question might depend on just 3-5 frames buried somewhere in the middle.
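To make the sparsity concrete, here is a back-of-the-envelope sketch of the arithmetic above. The 64-frame sampling budget and the 5-second evidence window are illustrative assumptions, not settings taken from the LongVT experiments.

```python
def frame_count(duration_s: float, fps: float) -> int:
    """Number of frames obtained by decoding a video at a fixed rate."""
    return int(duration_s * fps)


def uniform_stride_s(duration_s: float, budget: int) -> float:
    """Seconds between consecutive frames when `budget` frames are spread uniformly."""
    return duration_s / budget


def max_frames_in_window(window_s: float, stride_s: float) -> int:
    """Upper bound on how many uniformly spaced frames can land inside a window."""
    return int(window_s // stride_s) + 1


duration_s = 2 * 60 * 60                                   # the 2-hour video from the text
total = frame_count(duration_s, fps=1.0)                   # 7,200 frames at 1 FPS
stride = uniform_stride_s(duration_s, budget=64)           # hypothetical 64-frame budget
hits = max_frames_in_window(window_s=5.0, stride_s=stride) # hypothetical 5 s of evidence

print(total, stride, hits)
```

Under these assumptions, sampled frames are 112.5 seconds apart, so a 5-second evidence window contains at most one sampled frame and often none.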

Our Insight: Let Models "Think with Tools"

The key insight behind LongVT is simple: instead of forcing models to process everything, let them actively seek evidence.

We introduce a native tool-calling paradigm where the model can:

Preview the video globally (sparse sampling)
Hypothesize which time window might contain the answer
Zoom in by calling crop_video(start_time, end_time)
Verify or refine based on what it actually sees
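The four steps above can be sketched as a small control loop. This is a minimal illustration, not the LongVT implementation: the `Step` record, the `Policy` and `CropVideo` callables, the `max_calls` cap, and the toy policy are all hypothetical stand-ins for the model and the real crop_video tool.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Step:
    """One model turn: either a tool call (window set) or a final answer."""
    thought: str
    window: Optional[Tuple[float, float]] = None  # (start_time, end_time) in seconds
    answer: Optional[str] = None


# Hypothetical interfaces: the policy maps the history so far to the next Step,
# and crop_video returns an observation for the requested clip.
Policy = Callable[[List[str]], Step]
CropVideo = Callable[[float, float], str]


def think_with_video(question: str, preview: str, policy: Policy,
                     crop_video: CropVideo, max_calls: int = 3) -> Tuple[str, int]:
    """Preview, hypothesize a window, zoom in, then verify or refine."""
    history = [f"question: {question}", f"preview: {preview}"]
    calls = 0
    while True:
        step = policy(history)
        history.append(f"thought: {step.thought}")
        # Stop when the model answers directly or the tool budget is spent.
        if step.window is None or calls >= max_calls:
            return (step.answer or "", calls)
        start_time, end_time = step.window
        clip = crop_video(start_time, end_time)
        history.append(f"clip[{start_time}-{end_time}]: {clip}")
        calls += 1


def toy_policy(history: List[str]) -> Step:
    """Toy stand-in for the model: zoom in once, then answer from the clip."""
    if not any(item.startswith("clip[") for item in history):
        return Step("The coffee shop scene seems to be near 20:00.", window=(1200.0, 1230.0))
    return Step("The clip shows the character clearly.", answer="a red coat")


def toy_crop(start_time: float, end_time: float) -> str:
    """Toy stand-in for crop_video that returns a text description."""
    return "the character enters the coffee shop in a red coat"


answer, calls = think_with_video("What was the character wearing?",
                                 "sparse preview frames", toy_policy, toy_crop)
print(answer, calls)
```

The point of the sketch is that the decision to zoom in lives inside the policy: a question that can be answered from the preview returns an answer on the first turn and never triggers a tool call.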

This mimics the human cognitive process of global-to-local reasoning. The model isn't just passively processing frames—it's actively investigating.

Overview

Our contributions are threefold:

(1) LongVT: An End-to-End Agentic Framework for "Thinking with Long Videos"
We introduce a novel paradigm that natively interleaves multimodal tool-augmented Chain-of-Thought (CoT) with on-demand clip inspection over hours-long videos, thereby enabling large multimodal models (LMMs) to perform more effective and reliable long-video reasoning.

(2) VideoSIAH: A Fine-Grained Data Suite for Evidence-Sparse Long-Video Reasoning
We construct a scalable data pipeline that produces diverse and high-quality question-answering (QA) data and tool-integrated reasoning traces, and a dedicated benchmark under a video segment-in-a-haystack setting.

(3) LongVT-7B-RFT: A State-of-the-Art Baseline with Invaluable Insights
Through extensive quantitative comparisons, systematic ablations on data recipes, training strategies, and design choices, as well as in-depth analyses of training dynamics, we establish and open-source a powerful baseline model with "thinking with long videos" capabilities.

LongVT Interleaved Multimodal Chain-of-Tool-Thought

Interleaved Multimodal Chain-of-Tool-Thought (iMCoTT). Compared to prior text-based CoT reasoning, iMCoTT in our proposed LongVT can natively perform self-reflection by calling the crop_video(start_time, end_time) tool. It proposes a time window after a global preview, proactively fetches the corresponding short clip, rethinks based on the new evidence, and determines whether to refine or answer directly. Such tool-augmented reasoning grounds each step in what is actually seen, rather than blindly rephrasing as text-only CoT does, which mitigates hallucination and improves both temporal localization and answer correctness.

Why "Native" Tool Calling Matters

You might ask: why not just use an external retrieval system? The key difference is agency. In LongVT:

The model decides when to call the tool (not every query needs it)
The model chooses the time window based on its reasoning
The model integrates the retrieved evidence into its thought process

This is fundamentally different from retrieval-augmented generation (RAG), where retrieval happens before reasoning. In LongVT, retrieval is part of reasoning.

Motivation of VideoSIAH

The Data Gap

Having a good method is only half the battle—you need the right data to train it. We found that existing long-video datasets have two major problems:

Coarse-grained annotations: Most datasets have clip-level QAs that don't require precise temporal reasoning. A model can often guess the answer without knowing when something happened.
Benchmark contamination: Many multiple-choice benchmarks can be "solved" without even watching the video! Our contamination study (see table below) shows that some models perform well above chance in the "No Visual" setting.

VideoSIAH: Segment-in-a-Haystack

To address this, we created VideoSIAH—a dataset specifically designed for the "needle in a haystack" scenario where the answer depends on a small segment within a long video.

The name captures the challenge: finding the right video Segment In A Haystack of hours-long content.

Key design principles:

Open-ended questions: No multiple choice to game
Human validation: Annotators verify that questions truly require temporal localization
Tool-integrated traces: Training data includes the reasoning process, not just answers

We conduct a rigorous contamination study on the Qwen-VL series across two probing settings: (1) No Visual, where we feed the text prompt without video frames to test for direct memorization; (2) Rearranged Choices, where we randomize the mapping between option labels and their textual content for multiple-choice questions to detect label memorization. Our experimental results reveal significant vulnerabilities in existing benchmarks and highlight the necessity of our proposed VideoSIAH-Eval.

Setting	VideoMME (w/o sub)	VideoMMMU adapt.	VideoMMMU comp.	VideoMMMU perc.	VideoSIAH-Eval
Qwen2.5-VL-7B-Instruct
Original	64.3	35.7	44.3	56.7	33.8
No Visual	40.1	25.7	38.3	39.3	12.7
Rearranged Choices	56.0	29.7	40.3	67.0	-
Qwen3-VL-8B-Instruct
Original	69.3	40.7	60.3	71.3	46.6
No Visual	44.1	33.7	39.3	46.7	0.00
Rearranged Choices	69.0	36.3	47.7	69.3	-

Contamination Tests for Qwen-VL Series on Long Video Understanding and Reasoning Benchmarks. The VideoSIAH-Eval column shows "-" entries for Rearranged Choices since our proposed benchmark is fully open-ended QA, where random option-answer mapping is not applicable.
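The two probing settings described above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the evaluation code used for the table: the function names, the option dictionary format, and the prompt layout are hypothetical, and option texts are assumed to be unique.

```python
import random
from typing import Dict, Tuple


def rearrange_choices(options: Dict[str, str], answer_label: str,
                      rng: random.Random) -> Tuple[Dict[str, str], str]:
    """Shuffle which text sits under which label; return new options and new answer label."""
    labels = sorted(options)
    texts = [options[label] for label in labels]
    correct_text = options[answer_label]
    rng.shuffle(texts)  # a shuffle may leave some texts under their old labels
    shuffled = dict(zip(labels, texts))
    new_answer = next(label for label, text in shuffled.items() if text == correct_text)
    return shuffled, new_answer


def no_visual_prompt(question: str, options: Dict[str, str]) -> str:
    """Text-only prompt for the No Visual probe: no video frames are attached."""
    lines = [question] + [f"{label}. {text}" for label, text in sorted(options.items())]
    return "\n".join(lines)


options = {"A": "a red coat", "B": "a blue shirt", "C": "a green hat", "D": "a black suit"}
shuffled, new_answer = rearrange_choices(options, "A", random.Random(0))
print(no_visual_prompt("What was the character wearing?", shuffled), new_answer)
```

A model that has memorized "the answer is A" for a benchmark item loses accuracy once the correct text moves to another label, while a model that scores well on the text-only prompt is not relying on the video at all.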

Data Pipeline

Data Pipeline of VideoSIAH. We construct a semi-automatic data pipeline that integrates several state-of-the-art LMMs to sequentially perform long video segmentation, video clip captioning, segment-in-a-haystack QA generation, cross-modal QA filtering, and iMCoTT generation. Icons with human silhouettes denote human-in-the-loop validation, where annotators inspect a small set of representative failures to refine prompting rules for QA generation, QA filtering, and iMCoTT generation. Note that iMCoTT traces are generated only for the cold-start supervised fine-tuning (SFT) stage, whereas reinforcement learning (RL) operates solely on the filtered QA pairs.

Dataset Statistics

Split	Source	Purpose	Samples	Total
SFT (w/o tool)	LongVideo-Reason CoT	Reasoning-augmented Open-ended QA	5,238	228,835
	Video-R1 CoT	Reasoning-augmented Video QA	165,575
	Image-based CoT	Reasoning-augmented Image QA	58,022
SFT (w/ tool)	Gemini-distilled iMCoTT	Tool-augmented Open-ended QA	12,766	19,161
	Qwen-distilled iMCoTT	Tool-augmented Temporal Grounding	6,395
RL	Gemini-distilled QAs	Open-ended QA over Long Videos	1,667	17,020
RFT	Self-distilled iMCoTT	Agentic Behaviors	15,353

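Dataset Statistics of VideoSIAH. Our proposed dataset contains large-scale non-tool SFT data, tool-augmented SFT data, RL QAs, and self-distilled reinforcement fine-tuning (RFT) traces. The Total column sums the Samples within each group; the 17,020 total covers the RL and RFT rows together.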

Category Distribution of VideoSIAH-Eval. We present the distribution of video types (left) and question types (right), highlighting the diversity of our proposed benchmark.

Quantitative Comparisons

We compare our LongVT models against proprietary LMMs and state-of-the-art open-source video reasoning models across various long video understanding and reasoning benchmarks.

Model	Reasoning Prompt	Tool Calling	VideoMME (w/ sub)	VideoMMMU adapt.	VideoMMMU comp.	VideoMMMU perc.	LVBench	VideoSIAH-Eval	Avg
Proprietary LMMs
GPT-4o	✗	✗	77.2	66.0	62.0	55.7	30.8	17.4	51.5
Gemini 1.5 Pro	✗	✗	81.3	59.0	53.3	49.3	33.1	-	55.2
Open-Source (Sparse Sampling)
Qwen2.5-VL-7B	✗	✗	62.6	37.3	28.0	36.7	30.7	28.1	37.2
Video-R1-7B	✓	✗	61.0	36.3	40.7	52.3	37.2	27.9	42.6
VideoRFT-7B	✓	✗	60.9	36.7	42.0	53.0	34.7	26.5	42.3
Video-Thinker-7B	✓	✗	61.0	34.3	44.7	53.0	52.2	10.4	42.6
LongVT-7B-SFT (Ours)	✓	✓	12.5	37.7	46.0	58.3	36.0	26.8	36.2
LongVT-7B-RL (Ours)	✓	✓	66.1	32.7	44.7	50.0	37.8	31.0	43.7
Open-Source (Dense Sampling)
Qwen2.5-VL-7B	✗	✗	64.3	35.7	44.3	56.7	40.9	33.8	46.0
Video-R1-7B	✓	✗	60.5	37.3	38.7	46.3	40.1	33.1	42.7
VideoRFT-7B	✓	✗	49.2	37.7	40.7	48.7	18.7	26.9	37.0
Video-Thinker-7B	✓	✗	60.8	37.7	42.7	55.3	54.3	6.6	42.9
LongVT-7B-SFT (Ours)	✓	✓	64.9	32.3	42.0	49.7	41.1	34.8	44.1
LongVT-7B-RL (Ours)	✓	✓	66.1	37.7	42.3	56.3	41.4	35.9	46.6
LongVT-7B-RFT (Ours)	✓	✓	67.0	35.7	43.7	56.7	41.3	42.0	47.7

Performance Comparison with Existing Video-Centric LMMs across Various Long Video Understanding and Reasoning Benchmarks. The best and second-best results among open-source models in each column are marked in bold and underlined, respectively.
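The Avg column is not defined in the table itself. The reported values appear consistent with an unweighted mean over the six benchmark columns that skips missing entries; the sketch below checks that reading against three rows copied from the table. Treat it as an inference from the numbers, not a stated definition.

```python
from typing import List, Optional


def row_average(scores: List[Optional[float]]) -> float:
    """Unweighted mean over the reported scores, skipping missing ("-") entries."""
    present = [s for s in scores if s is not None]
    return round(sum(present) / len(present), 1)


# Column order: VideoMME, VideoMMMU adapt., comp., perc., LVBench, VideoSIAH-Eval.
gpt_4o = [77.2, 66.0, 62.0, 55.7, 30.8, 17.4]
gemini_15_pro = [81.3, 59.0, 53.3, 49.3, 33.1, None]   # "-" on VideoSIAH-Eval
longvt_7b_rft = [67.0, 35.7, 43.7, 56.7, 41.3, 42.0]

print(row_average(gpt_4o), row_average(gemini_15_pro), row_average(longvt_7b_rft))
```

If this reading is right, the Gemini 1.5 Pro average is taken over five benchmarks rather than six, so it is not directly comparable with rows that include a VideoSIAH-Eval score.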

Ablation Studies

We conduct comprehensive ablation studies to examine the impact of data recipes, training stages, and reward design on model performance.

Data Recipe

Setting	VideoMME (w/ sub)	VideoMMMU adapt.	VideoMMMU comp.	VideoMMMU perc.	LVBench	VideoSIAH-Eval	Avg
SFT w/o self-curated iMCoTT	8.4	33.6	41.6	46.0	15.1	4.1	24.8
SFT w/ self-curated iMCoTT	64.9	32.3	42.0	49.7	41.1	34.8	44.1
RL w/o self-curated QAs	55.1	30.6	42.0	45.6	38.4	30.8	40.4
RL w/ self-curated QAs	66.1	37.7	42.3	56.3	41.4	35.9	46.6

Training Stage

Setting	VideoMME (w/ sub)	VideoMMMU adapt.	VideoMMMU comp.	VideoMMMU perc.	LVBench	VideoSIAH-Eval	Avg
SFT only	64.9	32.3	42.0	49.7	41.1	34.8	44.1
RL only	52.7	35.3	43.0	55.1	37.1	28.2	41.9
SFT+RL	66.1	37.7	42.3	56.3	41.4	35.9	46.6
SFT+RL+RFT	67.0	35.7	43.7	56.7	41.3	42.0	47.7

Training Dynamics

(a) shows training dynamics under different accuracy and time rewards, and (b) shows the effect of tool-call reward on tool usage.

Recall encourages coverage; IoU demands precision. Using Recall as the reward function during RL has a drawback: the policy can enlarge the predicted span to envelop the ground-truth interval, which monotonically raises the Recall-based score while ignoring boundary quality. The plateau in the Recall Accuracy Score curve is consistent with this reward-hacking hypothesis. In contrast, IoU explicitly penalizes span inflation via the union term, yielding better-aligned boundaries and more disciplined tool use.
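The difference between the two signals is easy to see on a toy interval. The sketch below uses the standard definitions of temporal recall and temporal IoU over (start, end) spans in seconds; the exact reward formulation used in training is not reproduced here, and the example intervals are made up for illustration.

```python
from typing import Tuple

Span = Tuple[float, float]  # (start_time, end_time) in seconds


def overlap(pred: Span, gt: Span) -> float:
    """Length of the intersection between two time spans (0 if disjoint)."""
    return max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))


def temporal_recall(pred: Span, gt: Span) -> float:
    """Fraction of the ground-truth span covered by the prediction."""
    return overlap(pred, gt) / (gt[1] - gt[0])


def temporal_iou(pred: Span, gt: Span) -> float:
    """Intersection over union; the union term grows with an inflated prediction."""
    inter = overlap(pred, gt)
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


gt = (100.0, 110.0)        # hypothetical 10-second ground-truth segment
tight = (98.0, 112.0)      # slightly loose but well-aligned prediction
inflated = (0.0, 1000.0)   # span inflated to swallow the ground truth

for name, pred in [("tight", tight), ("inflated", inflated)]:
    print(name, temporal_recall(pred, gt), round(temporal_iou(pred, gt), 3))
```

Both predictions reach a recall of 1.0, so a Recall-based reward cannot tell them apart, whereas IoU drops from roughly 0.71 for the tight span to 0.01 for the inflated one.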

Is the tool reward really necessary? The Qwen2.5-VL-7B baseline collapses to near-zero tool calls after training in both configurations (w/ and w/o tool reward), indicating that the model does not internalize the tool's function. After performing cold-start SFT to obtain LongVT-7B-SFT, tool-call frequency rises during training under both configurations and accuracy improves in tandem. Hence, the tool reward is not required for basic competence: once SFT grounds the tool's semantics, the model learns when and how to invoke the tool.

Key Takeaways & Lessons Learned

Through this project, we gained several insights that might be valuable for future work:

1. Cold-Start SFT is Crucial for Tool Learning

One surprising finding was that you can't just RL your way to tool use. Without the cold-start SFT phase that demonstrates tool semantics, the model never learns to call tools—even with explicit tool rewards. This suggests that tool-using behaviors need to be shown before they can be optimized.

2. The Model Learns When to Use Tools

After SFT, we observed that the model doesn't blindly call tools for every query. It develops an intuition for which questions require zooming in (fine-grained visual details) versus which can be answered from the global preview (general scene understanding). This emergent selectivity wasn't explicitly trained—it arose naturally from the data.

3. IoU > Recall for Temporal Grounding Rewards

When we first tried Recall-based rewards, we saw the model "gaming" the metric by predicting increasingly wide time windows. Switching to IoU forced the model to be precise about boundaries. This is a concrete example of how reward design shapes agent behavior.

4. Existing Benchmarks Have Contamination Issues

Our contamination study revealed that some models perform surprisingly well on long-video benchmarks even without seeing the video. This motivated us to create VideoSIAH-Eval with open-ended questions that can't be gamed through memorization.

What's Next?

LongVT opens up several exciting directions:

Multi-round tool calling: Currently, the model typically calls the tool once. Enabling iterative refinement could help with even more complex queries.
Tool diversity: Beyond crop_video, we could add tools for object tracking, scene search, or even external knowledge retrieval.
Longer videos: While we tested on hour-long videos, the paradigm should scale to even longer content like full TV series or lecture archives.

We're excited to see how the community builds on this work!

Open-Source Resources

We open-source LongVT to facilitate future development of long-video reasoning with tool calling in the community.