

  • Overview

    Our contributions are threefold:

    (1) LongVT: An End-to-End Agentic Framework for “Thinking with Long Videos”
    We introduce a novel paradigm that natively interleaves multimodal tool-augmented Chain-of-Thought (CoT) with on-demand clip inspection over hours-long videos, thereby enabling large multimodal models (LMMs) to perform more effective and reliable long-video reasoning.

    (2) VideoSIAH: A Fine-Grained Data Suite for Evidence-Sparse Long-Video Reasoning
    We construct a scalable data pipeline that produces diverse and high-quality question-answering (QA) data and tool-integrated reasoning traces, and a dedicated benchmark under a video segment-in-a-haystack setting.

    (3) LongVT-7B-RFT: A State-of-the-Art Baseline with Invaluable Insights
    Through extensive quantitative comparisons, systematic ablations on data recipes, training strategies, and design choices, as well as in-depth analyses of training dynamics, we establish and open-source a powerful baseline model with “thinking with long videos” capabilities.

    LongVT Interleaved Multimodal Chain-of-Tool-Thought

Interleaved Multimodal Chain-of-Tool-Thought (iMCoTT). Compared with prior text-based CoT reasoning, iMCoTT in our proposed LongVT natively performs self-reflection by calling the crop_video(start_time, end_time) tool. The model proposes a time window after a global preview, proactively fetches the corresponding short clip, rethinks based on the new evidence, and decides whether to refine further or answer directly. Grounding each step in what is actually seen, rather than blindly rephrasing text as in text-only CoT, mitigates hallucination and improves both temporal localization and answer correctness.
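To make the interaction pattern concrete, here is a minimal sketch of an iMCoTT-style loop. The `crop_video` helper, the `generate` interface, and the `<tool_call>` tag convention are illustrative assumptions, not the released LongVT API:

```python
import json
import re

def crop_video(video_path: str, start_time: float, end_time: float) -> str:
    """Illustrative tool: return a handle to the [start_time, end_time] clip.
    A real implementation would cut and re-encode the frames (e.g., with ffmpeg)."""
    return f"{video_path}?start={start_time:.1f}&end={end_time:.1f}"

def imcott_loop(model, video_path: str, question: str, max_rounds: int = 4) -> str:
    """Interleave text reasoning with on-demand clip inspection until the model answers."""
    context = [f"Video: {video_path}", f"Question: {question}"]
    step = ""
    for _ in range(max_rounds):
        step = model.generate(context)           # reasoning text, a tool call, or an answer
        call = re.search(r"<tool_call>(.*?)</tool_call>", step, re.S)
        if call:                                  # the model asked to inspect a clip
            args = json.loads(call.group(1))      # e.g. {"start_time": 732.0, "end_time": 760.0}
            clip = crop_video(video_path, args["start_time"], args["end_time"])
            context += [step, f"Clip evidence: {clip}"]
        else:                                     # no tool call: treat the step as the final answer
            return step
    return step  # fall back to the last step if the round budget is exhausted
```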


    Motivation of VideoSIAH

    Long-video reasoning presents a fundamentally different challenge from previous video QA settings: LMMs must locate sparse, fine-grained, and causally decisive moments embedded within hours-long content. However, existing LMMs are mostly trained with coarse-grained and clip-level data. This mismatch leaves modern LMMs lacking the supervision needed to learn how temporal hypotheses are formed, verified, or revised—a critical yet underexplored capability for agentic long-video reasoning.

    Moreover, most existing video understanding benchmarks only offer multiple-choice QAs, which can be solved without genuine temporal grounding and are vulnerable to dataset leakage or shortcut exploitation. To fill this gap, we introduce VideoSIAH, a large-scale, diverse, and high-quality data suite that serves collectively as a training dataset capturing the reasoning dynamics required for video segment-in-a-haystack QA, and a fine-grained evaluation benchmark, VideoSIAH-Eval, with human-in-the-loop validation for long-video open-ended question-answering.

    We conduct a rigorous contamination study on the Qwen-VL series across two probing settings: (1) No Visual, where we feed the text prompt without video frames to test for direct memorization; (2) Rearranged Choices, where we randomize the mapping between option labels and their textual content for multiple-choice questions to detect label memorization. Our experimental results reveal significant vulnerabilities in existing benchmarks and highlight the necessity of our proposed VideoSIAH-Eval.
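As an illustration of the Rearranged Choices probe, the snippet below shuffles the mapping between option labels and option texts for a multiple-choice item and remaps the gold answer accordingly. This is a minimal sketch; the field names are hypothetical rather than the benchmarks' actual schema:

```python
import random

def rearrange_choices(question: str, options: dict[str, str], answer: str, seed: int = 0):
    """Randomize which label (A/B/C/D) carries which option text, keeping the gold text fixed.

    A model that memorized "the answer is C" fails once the content behind "C" changes,
    whereas a model that reasons over the option texts is unaffected.
    Assumes option texts are unique.
    """
    rng = random.Random(seed)
    labels = sorted(options)                 # e.g. ["A", "B", "C", "D"]
    texts = [options[label] for label in labels]
    rng.shuffle(texts)                       # new label -> text assignment
    shuffled = dict(zip(labels, texts))
    new_answer = next(l for l, t in shuffled.items() if t == options[answer])
    return question, shuffled, new_answer
```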

| Setting | VideoMME (w/o sub) | VideoMMMU (adapt.) | VideoMMMU (comp.) | VideoMMMU (perc.) | VideoSIAH-Eval |
|---|---|---|---|---|---|
| Qwen2.5-VL-7B-Instruct | | | | | |
| Original | 64.3 | 35.7 | 44.3 | 56.7 | 33.8 |
| No Visual | 40.1 | 25.7 | 38.3 | 39.3 | 12.7 |
| Rearranged Choices | 56.0 | 29.7 | 40.3 | 67.0 | - |
| Qwen3-VL-8B-Instruct | | | | | |
| Original | 69.3 | 40.7 | 60.3 | 71.3 | 46.6 |
| No Visual | 44.1 | 33.7 | 39.3 | 46.7 | 0.00 |
| Rearranged Choices | 69.0 | 36.3 | 47.7 | 69.3 | - |

Contamination Tests for Qwen-VL Series on Long Video Understanding and Reasoning Benchmarks. The VideoSIAH-Eval column shows "-" entries for Rearranged Choices since our proposed benchmark is fully open-ended QA, where random option-answer mapping is not applicable.


    Data Pipeline

    VideoSIAH Data Pipeline

    Data Pipeline of VideoSIAH. We construct a semi-automatic data pipeline that integrates several state-of-the-art LMMs to sequentially perform long video segmentation, video clip captioning, segment-in-a-haystack QA generation, cross-modal QA filtering, and iMCoTT generation. Icons with human silhouettes denote human-in-the-loop validation, where annotators inspect a small set of representative failures to refine prompting rules for QA generation, QA filtering, and iMCoTT generation. Note that iMCoTT traces are generated only for the cold-start supervised fine-tuning (SFT) stage, whereas reinforcement learning (RL) operates solely on the filtered QA pairs.
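Read left to right, the pipeline can be sketched as a chain of stages. The function names below are placeholders for the different LMMs involved, not the actual implementation:

```python
def build_videosiah_example(video, segmenter, captioner, qa_generator, qa_filter, tracer):
    """Sketch of the semi-automatic pipeline: segment -> caption -> QA -> filter -> iMCoTT."""
    segments = segmenter(video)                              # long-video segmentation
    captions = [captioner(seg) for seg in segments]          # per-clip captioning
    qa_pairs = qa_generator(video, segments, captions)       # segment-in-a-haystack QA generation
    kept = [qa for qa in qa_pairs if qa_filter(video, qa)]   # cross-modal QA filtering
    traces = [tracer(video, qa) for qa in kept]              # iMCoTT traces (cold-start SFT only)
    return kept, traces                                      # RL operates only on the filtered QA pairs
```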


    Dataset Statistics

| Split | Source | Purpose | Samples | Total |
|---|---|---|---|---|
| SFT (w/o tool) | LongVideo-Reason CoT | Reasoning-augmented Open-ended QA | 5,238 | 228,835 |
| | Video-R1 CoT | Reasoning-augmented Video QA | 165,575 | |
| | Image-based CoT | Reasoning-augmented Image QA | 58,022 | |
| SFT (w/ tool) | Gemini-distilled iMCoTT | Tool-augmented Open-ended QA | 12,766 | 19,161 |
| | Qwen-distilled iMCoTT | Tool-augmented Temporal Grounding | 6,395 | |
| RL | Gemini-distilled QAs | Open-ended QA over Long Videos | 1,667 | 17,020 |
| RFT | Self-distilled iMCoTT | Agentic Behaviors | 15,353 | |

Dataset Statistics of VideoSIAH. Our proposed dataset contains a large collection of non-tool SFT data, tool-augmented SFT data, RL QA pairs, and self-distilled reinforcement fine-tuning (RFT) traces.

    Video Category Distribution
    Question Category Distribution

    Category Distribution of VideoSIAH-Eval. We present the distribution of video types (left) and question types (right), highlighting the diversity of our proposed benchmark.


    Quantitative Comparisons

    We compare our LongVT models against proprietary LMMs and state-of-the-art open-source video reasoning models across various long video understanding and reasoning benchmarks.

| Model | Reasoning Prompt | Tool Calling | VideoMME (w/ sub) | VideoMMMU (adapt.) | VideoMMMU (comp.) | VideoMMMU (perc.) | LVBench | VideoSIAH-Eval | Avg |
|---|---|---|---|---|---|---|---|---|---|
| Proprietary LMMs | | | | | | | | | |
| GPT-4o | | | 77.2 | 66.0 | 62.0 | 55.7 | 30.8 | 17.4 | 51.5 |
| Gemini 1.5 Pro | | | 81.3 | 59.0 | 53.3 | 49.3 | 33.1 | - | 55.2 |
| Open-Source (Sparse Sampling) | | | | | | | | | |
| Qwen2.5-VL-7B | | | 62.6 | 37.3 | 28.0 | 36.7 | 30.7 | 28.1 | 37.2 |
| Video-R1-7B | | | 61.0 | 36.3 | 40.7 | 52.3 | 37.2 | 27.9 | 42.6 |
| VideoRFT-7B | | | 60.9 | 36.7 | 42.0 | 53.0 | 34.7 | 26.5 | 42.3 |
| Video-Thinker-7B | | | 61.0 | 34.3 | 44.7 | 53.0 | 52.2 | 10.4 | 42.6 |
| LongVT-7B-SFT (Ours) | | | 12.5 | 37.7 | 46.0 | 58.3 | 36.0 | 26.8 | 36.2 |
| LongVT-7B-RL (Ours) | | | 66.1 | 32.7 | 44.7 | 50.0 | 37.8 | 31.0 | 43.7 |
| Open-Source (Dense Sampling) | | | | | | | | | |
| Qwen2.5-VL-7B | | | 64.3 | 35.7 | 44.3 | 56.7 | 40.9 | 33.8 | 46.0 |
| Video-R1-7B | | | 60.5 | 37.3 | 38.7 | 46.3 | 40.1 | 33.1 | 42.7 |
| VideoRFT-7B | | | 49.2 | 37.7 | 40.7 | 48.7 | 18.7 | 26.9 | 37.0 |
| Video-Thinker-7B | | | 60.8 | 37.7 | 42.7 | 55.3 | 54.3 | 6.6 | 42.9 |
| LongVT-7B-SFT (Ours) | | | 64.9 | 32.3 | 42.0 | 49.7 | 41.1 | 34.8 | 44.1 |
| LongVT-7B-RL (Ours) | | | 66.1 | 37.7 | 42.3 | 56.3 | 41.4 | 35.9 | 46.6 |
| LongVT-7B-RFT (Ours) | | | 67.0 | 35.7 | 43.7 | 56.7 | 41.3 | 42.0 | 47.7 |

    Performance Comparison with Existing Video-Centric LMMs across Various Long Video Understanding and Reasoning Benchmarks. The best and second-best results among open-source models in each column are marked in bold and underlined, respectively.


    Ablation Studies

    We conduct comprehensive ablation studies to examine the impact of data recipes, training stages, and reward design on model performance.

    Data Recipe

| Setting | VideoMME (w/ sub) | VideoMMMU (adapt.) | VideoMMMU (comp.) | VideoMMMU (perc.) | LVBench | VideoSIAH-Eval | Avg |
|---|---|---|---|---|---|---|---|
| SFT w/o self-curated iMCoTT | 8.4 | 33.6 | 41.6 | 46.0 | 15.1 | 4.1 | 24.8 |
| SFT w/ self-curated iMCoTT | 64.9 | 32.3 | 42.0 | 49.7 | 41.1 | 34.8 | 44.1 |
| RL w/o self-curated QAs | 55.1 | 30.6 | 42.0 | 45.6 | 38.4 | 30.8 | 40.4 |
| RL w/ self-curated QAs | 66.1 | 37.7 | 42.3 | 56.3 | 41.4 | 35.9 | 46.6 |

    Training Stage

| Setting | VideoMME (w/ sub) | VideoMMMU (adapt.) | VideoMMMU (comp.) | VideoMMMU (perc.) | LVBench | VideoSIAH-Eval | Avg |
|---|---|---|---|---|---|---|---|
| SFT only | 64.9 | 32.3 | 42.0 | 49.7 | 41.1 | 34.8 | 44.1 |
| RL only | 52.7 | 35.3 | 43.0 | 55.1 | 37.1 | 28.2 | 41.9 |
| SFT+RL | 66.1 | 37.7 | 42.3 | 56.3 | 41.4 | 35.9 | 46.6 |
| SFT+RL+RFT | 67.0 | 35.7 | 43.7 | 56.7 | 41.3 | 42.0 | 47.7 |

    Training Dynamics

    Training Dynamics and Ablations on Reward Design

    (a) shows training dynamics under different accuracy and time rewards, and (b) shows the effect of tool-call reward on tool usage.

    Recall encourages coverage; IoU demands precision. Using Recall as the reward function during RL presents a drawback: the policy can enlarge the predicted span to envelop the ground-truth interval, which monotonically raises the Recall-based score while ignoring boundary quality. This plateau in the curve of Recall Accuracy Score validates our hypothesized reward hacking. In contrast, IoU explicitly penalizes span inflation via the union term, yielding better-aligned boundaries and more disciplined tool use.
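To see why span inflation games a Recall-style reward but not IoU, consider this small numeric sketch. The reward functions below are generic interval Recall/IoU written out for illustration, not necessarily the exact reward used in training:

```python
def recall(pred, gt):
    """Fraction of the ground-truth interval covered by the prediction."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    return inter / (gt[1] - gt[0])

def iou(pred, gt):
    """Intersection over union of the two intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union

gt = (730.0, 760.0)        # 30 s ground-truth segment
tight = (728.0, 763.0)     # well-aligned prediction
inflated = (600.0, 900.0)  # the policy "hedges" by predicting a huge span

print(recall(tight, gt), iou(tight, gt))        # 1.00 recall, ~0.86 IoU
print(recall(inflated, gt), iou(inflated, gt))  # still 1.00 recall, but only 0.10 IoU
```

Enlarging the span keeps the Recall-based score saturated while the IoU score collapses, which is exactly the pressure toward tighter boundaries described above.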

    Is tool reward really necessary? The Qwen2.5-VL-7B baseline collapses to near-zero tool calls after training in both configurations (w/ and w/o tool reward), indicating that the model does not internalize the tool’s function. After performing cold-start SFT to obtain LongVT-7B-SFT, tool-call frequency rises during training under both configurations and accuracy improves in tandem. Hence, the tool reward is not required for basic competence: once SFT grounds the tool’s semantics, the model learns when and how to invoke the tool.


Open-Source Resources

We open-source LongVT to facilitate future development of long-video reasoning with tool calling in the community.

• Model Checkpoints: pre-trained models with SFT, RL, and RFT optimization
• Training Datasets: the VideoSIAH data suite for long-video reasoning
  • Video-MMMU Overview
    Video-MMMU: A comprehensive benchmark for evaluating knowledge acquisition from educational videos across multiple disciplines
Website | Paper | Dataset

    Video-MMMU asks a fundamental question: If a model ‘goes to class,’ can the model learn from the lecture and apply what it learned to MMMU-style exam problems?

    🎯 Motivation

    Our original goal was to build a video reasoning benchmark, motivated by the observation that the most demanding forms of reasoning arise in academic settings—for example, MMMU-style university exam questions.

    Online lectures create an ideal environment for evaluating video reasoning. They effectively convey knowledge and naturally test a model’s ability to learn from video. These videos have three key attributes:

    1. High information density (heavy OCR/ASR signals)
    2. Advanced knowledge requirements (college-level knowledge)
    3. Temporal structure (concepts unfolding over time)

    These properties make reasoning from lecture video notably harder. This leads to our core question:

    When a model watches an online lecture, can it learn like a student—understand the content, acquire the knowledge, and then solve related problems?

    Therefore, we introduce Video-MMMU, a video reasoning benchmark that evaluates knowledge acquisition from video.

    🏆 Video-MMMU Leaderboard

| Model | Overall | Δknowledge | Perception | Comprehension | Adaptation |
|---|---|---|---|---|---|
| GPT-5-thinking | 84.6 | | | | |
| Gemini-2.5-Pro | 83.6 | | | | |
| OpenAI O3 | 83.3 | | | | |
| Claude-3.5-Sonnet | 65.78 | 🟢 +11.4 | 72.00 | 69.67 | 55.67 |
| Kimi-VL-A3B-Thinking-2506 | 65.22 | 🟢 +3.5 | 75.00 | 66.33 | 54.33 |
| GPT-4o | 61.22 | 🟢 +15.6 | 66.00 | 62.00 | 55.67 |
| Qwen-2.5-VL-72B | 60.22 | 🟢 +9.7 | 69.33 | 61.00 | 50.33 |

    See full leaderboard with 20+ models in our paper and website

    📚 Overview

    We introduce Video-MMMU, a multi-modal, multi-disciplinary, multi-track benchmark designed to evaluate how effectively large multimodal models (LMMs) acquire knowledge from educational videos.

    1) Video: Knowledge Source

    Traditional VideoQA benchmarks focus on scene understanding. Video-MMMU treats video as a source of knowledge, evaluating whether LMMs can actually learn from instructional content.

    Dataset Composition:

    • 300 college-level, lecture-style videos
    • 30 subjects across 6 disciplines: Art, Business, Science, Medicine, Humanities, and Engineering
    • High-quality educational content from university-level courses

    2) QA Design: Three Stages of Knowledge Acquisition

    Each video is paired with three questions, designed to reflect a progression in knowledge acquisition:

    • 🔍 Perception – Identifying relevant surface information
    • 🧠 Comprehension – Understanding underlying concepts or strategies
    • 🎯 Adaptation – Applying learned knowledge to new scenarios
    Knowledge Acquisition Categories
    Figure 2: Examples for each knowledge acquisition category across different disciplines. Perception (ASR/OCR-based), Comprehension (concept/strategy understanding), and Adaptation (application to new scenarios).
    Benchmark Structure
    Figure 3: Video-MMMU benchmark structure showing the progression from video content to three-tier evaluation framework.

    3) In-Context Knowledge Acquisition: Learning Like Humans

    Humans consistently learn from the world around them. For models to operate effectively in real-world environments, the same principle should apply: they must be able to learn from the world, because unlike humans, they cannot be endlessly re-trained after deployment.

    In this sense, videos provide a natural proxy for the world. For a model, the video becomes its world. The ability to learn from video therefore becomes more than a technical benchmark—it is a measure of true, dynamic intelligence. It marks the shift from simply solving a task to demonstrating the ability to learn how to solve the task.

    4) Metric: From Absolute Accuracy to Learning Efficiency (Δknowledge)

    A core innovation in Video-MMMU is its shift from measuring only final performance to measuring learning.

    Δknowledge Formula

    Δknowledge = (Acc_after_video - Acc_before_video) / (100% - Acc_before_video) × 100%
    

    Evaluation Process

    1. Initial Test: The model attempts to answer a question without seeing the video.

    2. Re-Test after video viewing: We provide the corresponding lecture video. The model is asked the same question again.

    3. Performance Gain: If the model succeeds after watching, it demonstrates successful knowledge acquisition from video.
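Putting the formula and this protocol together, here is a hedged sketch of the metric computation; the helper name and the example accuracies are illustrative, not reported before/after scores:

```python
def delta_knowledge(acc_before: float, acc_after: float) -> float:
    """Normalized gain: how much of the remaining headroom the model closes after watching."""
    return (acc_after - acc_before) / (100.0 - acc_before) * 100.0

# Illustrative numbers: improving from 40.0% before the video to 49.36% after
# closes 15.6% of the available headroom.
print(round(delta_knowledge(40.0, 49.36), 1))  # 15.6
```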

    This setup mirrors a human’s natural educational process:

    Don't know → Learn by watching → Apply the knowledge
    

    🔍 Key Insights

    Performance Analysis
    Figure 4: Comprehensive analysis showing progressive performance decline and the human-model gap in knowledge acquisition from videos.

    Progressive Performance Decline

    Model performance decreases as cognitive demands increase. While models perform relatively better on Perception tasks, accuracy drops on Comprehension and declines further on Adaptation.

    Knowledge Acquisition Challenge

    The Δknowledge metric reveals a significant human–model gap:

    • Humans: Substantial improvement (Δknowledge ≈ 33.1%)
    • Top Models: Smaller gains (GPT-4o: 15.6%, Claude-3.5-Sonnet: 11.4%)

    This highlights a current limitation: LMMs still struggle to learn from videos in the way humans do.

    📊 Case Studies

    Failure Case: Method Adaptation Error

    Failure Case Study
    Figure 5: Example of method adaptation failure - the model failed to adapt the method from video to solve the Adaptation question.

    Success Case: Learning from Video

    Success Case Study
    Figure 6: Example of successful learning from video - transforming an initial wrong answer into a correct one after watching the educational content.

    🚀 Research Impact

    Paradigm Shift

    Video-MMMU represents a paradigm shift from traditional video understanding to knowledge acquisition evaluation:

    • From Scene Understanding to Learning - Moving beyond visual comprehension to knowledge acquisition
    • From Static Evaluation to Dynamic Learning - Measuring improvement rather than just final performance
    • From Task Solving to Learning Capability - Evaluating the ability to learn new skills

    Implications for AI Development

    1. Real-World Deployment - Models must learn continuously after deployment
    2. Educational AI - Critical for AI tutoring and educational applications
    3. Knowledge Transfer - Understanding how models generalize learned concepts
    4. Human-AI Alignment - Bridging the gap in learning capabilities

    📈 Future Directions

    Benchmark Extensions

    • Multimodal Knowledge Sources - Incorporating diverse educational formats
    • Long-term Learning - Evaluating knowledge retention over time
    • Interactive Learning - Adding feedback loops and iterative improvement

    Model Development

    • Learning-Optimized Architectures - Designing models specifically for knowledge acquisition
    • Memory Integration - Better mechanisms for knowledge storage and retrieval
    • Transfer Learning - Improving cross-domain knowledge application

    🎯 Getting Started

    1. Download the Video-MMMU dataset from Hugging Face
    2. Set up the evaluation environment using our GitHub repository
    3. Run baseline evaluations on your models
    4. Analyze Δknowledge metrics to understand learning capabilities
    5. Compare results with our comprehensive leaderboard

    Video-MMMU challenges the current state of multimodal AI by shifting focus from static performance to dynamic learning capability - a critical step toward truly intelligent and adaptive AI systems.

  • The development of video large multimodal models (LMMs) has been hindered by the difficulty of curating large amounts of high-quality raw data from the web. To address this, we consider an alternative approach, creating a high-quality synthetic dataset specifically for video instruction-following, namely LLaVA-Video-178K. This dataset includes key tasks such as detailed captioning, open-ended question-answering (QA), and multiple-choice QA. By training on this proposed dataset, in combination with existing visual instruction tuning data, we introduce LLaVA-Video, a new video LMM. Our experiments demonstrate that LLaVA-Video achieves strong performance across various video benchmarks, highlighting the effectiveness of our dataset. We plan to release the dataset, its generation pipeline, and the model checkpoints.

    Video Instruction-Following Data Synthesis

A high-quality dataset for video instruction tuning is crucial for developing effective video-language models. We identify a key factor in building such datasets: ensuring richness and diversity in both video content and its language annotations. We conduct a comprehensive survey of existing video benchmarks, covering various public video captioning and question-answering datasets, and identify ten unique video sources that contribute to over 40 video-language benchmarks. From each source, we select videos that exhibit significant temporal dynamics. To maintain diversity in the annotations, we establish a pipeline capable of generating detailed captions for videos of any length. Additionally, we define 16 types of questions that guide GPT-4o in creating question-answer pairs to assess the perceptual and reasoning skills of video-language models.

    Video Sources

    Video Sources for LLaVA-Video
    Video sources in the proposed LLaVA-Video-178K. The relationship between 10 video sources we have utilized and other existing video-language datasets.

We observed that although different video-language datasets focus on various video understanding tasks, most are sourced from ten main video sources, which together offer a wide range of video data from different websites, viewpoints, and domains. The relationship between these ten selected video datasets and the others is shown in the figure above. We select dynamic videos from these sources; the selection logic is detailed in the paper.

    Automated Generation for Video Detail Description

    LLaVA-Video Data Creation
The video detail description creation pipeline. A three-level captioning pipeline is used, with each level generated via a recurrent approach. Note that t is the index of a time interval at its own level, and T is the index of the last time interval.

For the selected videos, we use GPT-4o to systematically describe their content. We start by sampling video frames at one frame per second (fps). However, due to GPT-4o's input size constraints, we cannot feed all sampled frames at once. Instead, we describe the videos sequentially, as illustrated in the pipeline figure above, creating descriptions at three distinct levels.
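The recurrent idea can be sketched as follows; the segment length, the `describe` helper, and the conditioning scheme are illustrative assumptions rather than the exact pipeline:

```python
def recurrent_captions(frames, describe, seg_len=10):
    """Level-1 sketch: caption each segment conditioned on a running summary of earlier ones.

    `frames` is a list of 1-fps frames; `describe(frames, context)` stands in for a GPT-4o call.
    Higher levels would repeat the same recurrence over level-1 captions instead of raw frames.
    """
    captions, summary = [], ""
    for t in range(0, len(frames), seg_len):
        segment = frames[t:t + seg_len]
        caption = describe(segment, context=summary)   # describe the current interval given history
        captions.append(caption)
        summary = (summary + " " + caption).strip()    # carry forward a running summary
    return captions
```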

    Automated Generation for Video Question Answering

    In addition to detailed video descriptions, our dataset includes a variety of question-answer pairs designed for complex interactions. This setup improves the video understanding model’s ability to handle real-life queries. We refer to public video question-answering benchmarks to organize these questions into 16 specific categories, as shown in Figure 3. Given a detailed video description, we use GPT-4o to generate at most one question-answer pair for each type of question. Please refer to the paper for more details of the question types and the generation process.

    Dataset Statistics

    We carefully select from our collected data sources to form a balanced and comprehensive collection, resulting in a total of 178K videos and 1.3M instruction-following samples. This includes 178K captions, 960K open-ended QAs, and 196K multiple-choice QAs.

    Dataset Comparison

    We provide a comparison of high-quality instruction-following video-language datasets, with a focus on synthetic data created with strong AI models, as shown in Table 1.

A broad collection of dynamic videos. In terms of video sources, although LLaVA-Hound contains the largest number of videos, 44% of its video data are sourced from WebVid, where most videos are static. ShareGPT4Video draws 30% of its videos from Pexels, Pixabay, and Mixkit, which are aesthetically pleasing but also mostly static. Additionally, the majority of its videos come from Panda-70M, which consists of short clips cut from longer videos and thus tends toward simpler plots. In contrast, we carefully select video sources that offer dynamic, untrimmed videos with complex plots, which are crucial for developing a powerful video understanding model.

High frames per second. Regarding frame sampling in language annotations, the proposed dataset uses 1 FPS, while other datasets use much lower effective frame rates. LLaVA-Hound uniformly samples 10 frames from videos of any length, giving an average of 0.008 FPS, which may miss fine details. ShareGPT4Video picks key frames using CLIP based on frame uniqueness; this can also miss subtle changes in the video, because CLIP embeddings do not capture fine-grained dynamics well. Our method samples at 1 FPS without key-frame selection algorithms, ensuring that detailed temporal information can be expressed in annotations with high coverage.

Diverse tasks. The proposed dataset covers three common task types, including captioning, free-form QA, and closed-form QA, while existing datasets cover only a subset. Meanwhile, both the quality and the number of samples in our dataset are higher.

  • LLaVA-OneVision
    LLaVA-OneVision: A unified model for single-image, multi-image, and video understanding

    Overview

We present LLaVA-OneVision, a family of open large multimodal models (LMMs) developed by consolidating our insights into data, models, and visual representations in the LLaVA-NeXT blog series. LLaVA-OneVision is the first single model that can simultaneously push the performance boundaries of open LMMs in three important computer vision scenarios: single-image, multi-image, and video.

    Key Features

    Unified Architecture

    LLaVA-OneVision is designed to have a similar maximum visual token count across different scenarios, enabling flexible extension to multiple visual signal types while maintaining consistent performance.
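One way to read this design goal is as a back-of-the-envelope token budget. All numbers and the pooling factor below are illustrative assumptions, not the model's actual configuration:

```python
TOKENS_PER_CROP = 729      # illustrative: tokens produced by the vision encoder per crop/frame

def visual_token_budget(scenario: str, n_items: int) -> int:
    """Rough comparison: each scenario lands near a similar maximum visual token count."""
    if scenario == "single-image":       # one image split into a grid of high-resolution crops
        return n_items * TOKENS_PER_CROP           # n_items = number of crops
    if scenario == "multi-image":        # several images, encoded at base resolution
        return n_items * TOKENS_PER_CROP           # n_items = number of images
    if scenario == "video":              # many frames, spatially pooled to fewer tokens each
        return n_items * (TOKENS_PER_CROP // 4)    # n_items = number of sampled frames
    raise ValueError(f"unknown scenario: {scenario}")

# e.g. 9 crops, 9 images, or ~36 pooled frames all stay in the same ballpark of visual tokens.
```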

    Model Sizes

    • 0.5B parameters - Lightweight deployment
    • 7B parameters - Balanced performance
    • 72B parameters - State-of-the-art capabilities

    Emerging Capabilities

    The design of LLaVA-OneVision enables strong transfer learning across different modalities and scenarios, yielding impressive emerging capabilities:

    1. Cross-Scenario Understanding

    Seamlessly process and understand content across single images, multiple images, and videos within a unified framework.

    2. Advanced Visual Analysis

    • Diagram and table interpretation - Understanding complex visual structures
    • Multi-screenshot interaction - Analyzing relationships across multiple screens
    • Set-of-mark object referencing - Precise object identification and tracking

    3. Video Capabilities

    • Image-to-video generation understanding - Comprehending temporal transitions
    • Video analysis and comparison - Deep understanding of video content
    • Multi-camera video interpretation - Processing footage from multiple viewpoints
    • Detailed video subject description - Rich, contextual video narration

    Strong Transfer Learning

    Importantly, the design of LLaVA-OneVision allows strong transfer learning across different modalities/scenarios. In particular, strong video understanding and cross-scenario capabilities are demonstrated through task transfer from images to videos, showcasing the model’s ability to generalize learned representations across visual domains.

    Development Roadmap

    LLaVA-OneVision represents a significant milestone in our iterative improvements through the LLaVA-NeXT series, focusing on:

    • Enhanced reasoning capabilities
    • Improved OCR performance
    • Expanded world knowledge
    • Advanced multimodal understanding
  • LongVA Visual Needle-in-a-Haystack Heatmap
    LongVA's performance on Visual Needle-In-A-Haystack benchmark showing accurate retrieval across long video sequences

    Overview

Gemini has amazed the world with its capability to understand hour-long videos. However, we still lack an open-source alternative with similar capabilities. Our latest research presents an innovative path toward long-video LMMs, shifting the focus from reducing visual tokens per frame to leveraging the long-context capabilities of language models.

    Here, we present our state-of-the-art video model, Long Video Assistant (LongVA), and our novel benchmark, Visual Needle-In-A-Haystack (V-NIAH).

    Key Innovations

    🔄 Long Context Transfer

    We discovered and verified that the long context capability of language models can be directly transferred to the video domain in modality-aligned multi-modal models. On V-NIAH, LongVA is the only open-source model capable of accurately retrieving visual information from inputs with:

    • 2000+ frames
    • 200K+ visual tokens

    🎯 UniRes: Unified Visual Encoding

We proposed UniRes, a unified visual encoding scheme that encodes both images and videos; under UniRes, a video is encoded in the same way as a sequence of image crops.
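As a rough illustration of the idea (a minimal sketch; `vision_encoder`, the tensor shapes, and the flattening convention are assumptions, not the released implementation), a video is flattened into per-frame tokens exactly as an image would be flattened into per-crop tokens:

```python
import torch

def encode_as_crops(vision_encoder, crops: torch.Tensor) -> torch.Tensor:
    """Encode a batch of image crops or video frames into one flat visual-token sequence.

    crops: (N, 3, H, W) — N crops of a single image, or N sampled frames of a video.
    Returns: (N * tokens_per_crop, hidden_dim), ready to interleave with text tokens.
    """
    tokens = vision_encoder(crops)    # assumed shape: (N, tokens_per_crop, hidden_dim)
    return tokens.flatten(0, 1)       # treat video frames exactly like image crops

# Under this view, the only difference between an image and a video is where the
# (N, 3, H, W) tensor came from: a spatial grid of crops vs. temporally sampled frames.
```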

    Key Benefits:

    • Leverages the Long Context Transfer property
    • Enables superior zero-shot performance in video tasks
    • No video-specific training data required

    Performance Highlights

    🏆 State-of-the-Art Results

LongVA achieves state-of-the-art performance among 7B models on the comprehensive Video-MME benchmark.

    Key Performance Features:

    • Performance increases with denser sampling of video frames
    • Superior zero-shot capabilities on video understanding tasks
    • Comprehensive ablation studies validating improvement sources

    📊 V-NIAH Benchmark

    Our novel Visual Needle-In-A-Haystack (V-NIAH) benchmark provides:

    • Rigorous evaluation of long-context visual understanding
    • Testing retrieval accuracy across extended video sequences
    • Open-source evaluation framework for the community

    Technical Architecture

    Multi-Modal Alignment

    LongVA demonstrates that language models’ inherent long-context capabilities can be effectively transferred to visual domains through proper modality alignment.

    Scalable Design

    The architecture scales efficiently with:

    • Increased frame sampling rates
    • Extended sequence lengths
    • Larger visual token counts

    Research Impact

    Open-Source Alternative

    LongVA provides the first viable open-source alternative to proprietary long-video understanding systems, enabling:

    • Academic research advancement
    • Commercial application development
    • Community-driven improvements

    Methodology Innovation

    The long context transfer approach opens new research directions in:

    • Cross-modal capability transfer
    • Efficient video processing
    • Unified multi-modal architectures

    Future Directions

    LongVA establishes a foundation for:

    1. Extended Context Models - Pushing beyond current frame limits
    2. Multi-Modal Transfer Learning - Applying insights to other modalities
    3. Efficient Video Processing - Optimizing computational requirements
    4. Benchmark Development - Creating more comprehensive evaluation metrics
    LongVA Resources
    Complete resources for LongVA including source code, evaluation benchmark, and pre-trained models