
Tags #research

  • Video-MMMU Overview
    Video-MMMU: A comprehensive benchmark for evaluating knowledge acquisition from educational videos across multiple disciplines
    Website · Paper · Dataset

    Video-MMMU asks a fundamental question: If a model ‘goes to class,’ can the model learn from the lecture and apply what it learned to MMMU-style exam problems?

    🎯 Motivation

    Our original goal was to build a video reasoning benchmark, motivated by the observation that the most demanding forms of reasoning arise in academic settings—for example, MMMU-style university exam questions.

    Online lectures create an ideal environment for evaluating video reasoning. They effectively convey knowledge and naturally test a model’s ability to learn from video. These videos have three key attributes:

    1. High information density (heavy OCR/ASR signals)
    2. Advanced knowledge requirements (college-level knowledge)
    3. Temporal structure (concepts unfolding over time)

    These properties make reasoning from lecture video notably harder. This leads to our core question:

    When a model watches an online lecture, can it learn like a student—understand the content, acquire the knowledge, and then solve related problems?

    Therefore, we introduce Video-MMMU, a video reasoning benchmark that evaluates knowledge acquisition from video.

    🏆 Video-MMMU Leaderboard

    | Model | Overall | Δknowledge | Perception | Comprehension | Adaptation |
    | --- | --- | --- | --- | --- | --- |
    | GPT-5-thinking | 84.6 | | | | |
    | Gemini-2.5-Pro | 83.6 | | | | |
    | OpenAI O3 | 83.3 | | | | |
    | Claude-3.5-Sonnet | 65.78 | 🟢 +11.4 | 72.00 | 69.67 | 55.67 |
    | Kimi-VL-A3B-Thinking-2506 | 65.22 | 🟢 +3.5 | 75.00 | 66.33 | 54.33 |
    | GPT-4o | 61.22 | 🟢 +15.6 | 66.00 | 62.00 | 55.67 |
    | Qwen-2.5-VL-72B | 60.22 | 🟢 +9.7 | 69.33 | 61.00 | 50.33 |

    See the full leaderboard with 20+ models in our paper and on our website.

    📚 Overview

    We introduce Video-MMMU, a multi-modal, multi-disciplinary, multi-track benchmark designed to evaluate how effectively large multimodal models (LMMs) acquire knowledge from educational videos.

    1) Video: Knowledge Source

    Traditional VideoQA benchmarks focus on scene understanding. Video-MMMU treats video as a source of knowledge, evaluating whether LMMs can actually learn from instructional content.

    Dataset Composition:

    • 300 college-level, lecture-style videos
    • 30 subjects across 6 disciplines: Art, Business, Science, Medicine, Humanities, and Engineering
    • High-quality educational content from university-level courses

    2) QA Design: Three Stages of Knowledge Acquisition

    Each video is paired with three questions, designed to reflect a progression in knowledge acquisition:

    • 🔍 Perception – Identifying relevant surface information
    • 🧠 Comprehension – Understanding underlying concepts or strategies
    • 🎯 Adaptation – Applying learned knowledge to new scenarios
    Figure 2: Examples for each knowledge acquisition category across different disciplines: Perception (ASR/OCR-based), Comprehension (concept/strategy understanding), and Adaptation (application to new scenarios).
    Figure 3: Video-MMMU benchmark structure, showing the progression from video content to the three-track evaluation framework.

    3) In-Context Knowledge Acquisition: Learning Like Humans

    Humans learn continuously from the world around them. For models to operate effectively in real-world environments, the same principle should apply: because, unlike humans, they cannot be endlessly re-trained after deployment, they must be able to keep learning from the world itself.

    In this sense, videos provide a natural proxy for the world. For a model, the video becomes its world. The ability to learn from video therefore becomes more than a technical benchmark—it is a measure of true, dynamic intelligence. It marks the shift from simply solving a task to demonstrating the ability to learn how to solve the task.

    4) Metric: From Absolute Accuracy to Learning Efficiency (Δknowledge)

    A core innovation in Video-MMMU is its shift from measuring only final performance to measuring learning.

    Δknowledge Formula

    Δknowledge = (Acc_after_video - Acc_before_video) / (100% - Acc_before_video) × 100%
    
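    As a quick illustration, Δknowledge is simply the accuracy gain after watching the video, normalized by the headroom the model had before watching. A minimal sketch in Python (not the official scoring code):

    ```python
    def delta_knowledge(acc_before: float, acc_after: float) -> float:
        """Normalized knowledge gain in percent; accuracies are given in percent (0-100)."""
        if acc_before >= 100.0:
            return 0.0  # no headroom left to gain
        return (acc_after - acc_before) / (100.0 - acc_before) * 100.0

    # Example: improving from 40% to 55% accuracy closes a quarter of the remaining gap.
    print(delta_knowledge(40.0, 55.0))  # 25.0
    ```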

    Evaluation Process

    1. Initial Test: The model attempts to answer a question without seeing the video.

    2. Re-Test after video viewing: We provide the corresponding lecture video. The model is asked the same question again.

    3. Performance Gain: If the model succeeds after watching, it demonstrates successful knowledge acquisition from video.

    This setup mirrors a human’s natural educational process:

    Don't know → Learn by watching → Apply the knowledge
    
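    The loop below sketches this two-pass protocol end to end. The `model.answer(...)` interface and the example field names are hypothetical placeholders rather than the actual Video-MMMU API, and it reuses `delta_knowledge` from the sketch above.

    ```python
    def evaluate(model, examples):
        """Two-pass evaluation: answer each question without, then with, its lecture video."""
        n = len(examples)
        correct_before = correct_after = 0
        for ex in examples:
            # Initial Test: question only, no video.
            pred = model.answer(ex["question"], video=None)
            correct_before += int(pred == ex["answer"])
            # Re-Test: same question, now with the corresponding lecture video.
            pred = model.answer(ex["question"], video=ex["video_path"])
            correct_after += int(pred == ex["answer"])
        acc_before = 100.0 * correct_before / n
        acc_after = 100.0 * correct_after / n
        return acc_before, acc_after, delta_knowledge(acc_before, acc_after)
    ```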

    🔍 Key Insights

    Figure 4: Comprehensive analysis showing the progressive performance decline and the human-model gap in knowledge acquisition from videos.

    Progressive Performance Decline

    Model performance decreases as cognitive demands increase. While models perform relatively better on Perception tasks, accuracy drops on Comprehension and declines further on Adaptation.

    Knowledge Acquisition Challenge

    The Δknowledge metric reveals a significant human–model gap:

    • Humans: Substantial improvement (Δknowledge ≈ 33.1%)
    • Top Models: Smaller gains (GPT-4o: 15.6%, Claude-3.5-Sonnet: 11.4%)

    This highlights a current limitation: LMMs still struggle to learn from videos in the way humans do.

    📊 Case Studies

    Failure Case: Method Adaptation Error

    Figure 5: Example of a method adaptation failure: the model failed to adapt the method from the video to solve the Adaptation question.

    Success Case: Learning from Video

    Figure 6: Example of successful learning from video: an initially wrong answer becomes correct after watching the educational content.

    🚀 Research Impact

    Paradigm Shift

    Video-MMMU represents a paradigm shift from traditional video understanding to knowledge acquisition evaluation:

    • From Scene Understanding to Learning - Moving beyond visual comprehension to knowledge acquisition
    • From Static Evaluation to Dynamic Learning - Measuring improvement rather than just final performance
    • From Task Solving to Learning Capability - Evaluating the ability to learn new skills

    Implications for AI Development

    1. Real-World Deployment - Models must learn continuously after deployment
    2. Educational AI - Critical for AI tutoring and educational applications
    3. Knowledge Transfer - Understanding how models generalize learned concepts
    4. Human-AI Alignment - Bridging the gap in learning capabilities

    📈 Future Directions

    Benchmark Extensions

    • Multimodal Knowledge Sources - Incorporating diverse educational formats
    • Long-term Learning - Evaluating knowledge retention over time
    • Interactive Learning - Adding feedback loops and iterative improvement

    Model Development

    • Learning-Optimized Architectures - Designing models specifically for knowledge acquisition
    • Memory Integration - Better mechanisms for knowledge storage and retrieval
    • Transfer Learning - Improving cross-domain knowledge application

    🎯 Getting Started

    1. Download the Video-MMMU dataset from Hugging Face
    2. Set up the evaluation environment using our GitHub repository
    3. Run baseline evaluations on your models
    4. Analyze Δknowledge metrics to understand learning capabilities
    5. Compare results with our comprehensive leaderboard
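    As a minimal starting point for steps 1 and 4, the snippet below loads the dataset with the Hugging Face `datasets` library. The repository id and split name are assumptions; check the website and GitHub repository for the exact values.

    ```python
    from datasets import load_dataset

    # Assumed repository id and split name; verify against the official Hugging Face page.
    ds = load_dataset("lmms-lab/VideoMMMU", split="test")
    print(len(ds))
    print(ds[0].keys())  # inspect the question / options / video fields before evaluating
    ```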

    Video-MMMU challenges the current state of multimodal AI by shifting focus from static performance to dynamic learning capability - a critical step toward truly intelligent and adaptive AI systems.

  • Multimodal-SAE
    Multimodal-SAE: First demonstration of SAE-based feature interpretation in Large Multimodal Models

    Overview

    For the first time in the multimodal domain, we demonstrate that features learned by Sparse Autoencoders (SAEs) in a smaller Large Multimodal Model (LMM) can be effectively interpreted by a larger LMM. Our work introduces the use of SAEs to analyze the open-semantic features of LMMs, providing a breakthrough solution for feature interpretation across various model scales.

    Inspiration and Motivation

    This research is inspired by Anthropic’s remarkable work on applying SAEs to interpret features in large-scale language models. In multimodal models, we discovered intriguing features that:

    • Correlate with diverse semantics across visual and textual modalities
    • Can be leveraged to steer model behavior for precise control
    • Enable deeper understanding of LMM functionality and decision-making

    Technical Approach

    SAE Training Pipeline

    The Sparse Autoencoder (SAE) is trained using a targeted approach:

    1. Integration Strategy - SAE integrated into a specific layer of the model
    2. Frozen Architecture - All other model components remain frozen during training
    3. Training Data - Utilizes LLaVA-NeXT dataset for comprehensive multimodal coverage
    4. Feature Learning - Learns sparse, interpretable representations of multimodal features
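    A minimal PyTorch sketch of this recipe is shown below. The layer choice, hidden-size expansion factor, and L1 coefficient are illustrative assumptions, not the exact configuration used in the paper.

    ```python
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SparseAutoencoder(nn.Module):
        """SAE trained on activations cached from one layer of the frozen LMM."""
        def __init__(self, d_model: int, d_hidden: int):
            super().__init__()
            self.encoder = nn.Linear(d_model, d_hidden)
            self.decoder = nn.Linear(d_hidden, d_model)

        def forward(self, x):
            f = F.relu(self.encoder(x))   # sparse feature activations
            x_hat = self.decoder(f)       # reconstruction of the original activations
            return x_hat, f

    def sae_loss(x, x_hat, f, l1_coeff: float = 1e-3):
        # Reconstruction error plus an L1 sparsity penalty on the features.
        return F.mse_loss(x_hat, x) + l1_coeff * f.abs().mean()

    # Schematic training step on a stand-in batch of cached activations
    # (in the real pipeline these come from LLaVA-NeXT data passed through the frozen LMM).
    sae = SparseAutoencoder(d_model=1024, d_hidden=8 * 1024)
    opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
    acts = torch.randn(64, 1024)
    x_hat, f = sae(acts)
    loss = sae_loss(acts, x_hat, f)
    opt.zero_grad()
    loss.backward()
    opt.step()
    ```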

    Auto-Explanation Pipeline

    Our novel auto-explanation pipeline analyzes visual features through:

    • Activation Region Analysis - Identifies where features activate in visual inputs
    • Semantic Correlation - Maps features to interpretable semantic concepts
    • Cross-Modal Understanding - Leverages larger LMMs for feature interpretation
    • Automated Processing - Scalable interpretation without manual annotation
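    The activation-region step can be sketched as follows: for a single SAE feature, locate the image patches where it fires most strongly; those regions are then shown to a larger LMM, which is asked what they have in common. The grid size and shapes here are illustrative assumptions.

    ```python
    import numpy as np

    def top_activating_patches(feature_acts: np.ndarray, grid: tuple, k: int = 5):
        """feature_acts: activations of one SAE feature over a (rows x cols) patch grid."""
        idx = np.argsort(feature_acts)[::-1][:k]
        rows, cols = idx // grid[1], idx % grid[1]
        return list(zip(rows.tolist(), cols.tolist(), feature_acts[idx].tolist()))

    acts = np.random.rand(24 * 24)  # stand-in for one feature's activations on a 24x24 grid
    print(top_activating_patches(acts, grid=(24, 24)))
    # The highlighted regions would then be cropped and passed to the larger LMM,
    # which produces a natural-language explanation of the feature.
    ```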

    Feature Steering and Control

    Demonstration of feature steering: these learned features can be used to control model behavior and generate desired outputs.

    Behavioral Control Capabilities

    The learned features enable precise model steering by:

    • Selective Feature Activation - Amplifying specific semantic features
    • Behavioral Modification - Directing model attention and responses
    • Interpretable Control - Understanding why specific outputs are generated
    • Fine-Grained Manipulation - Precise control over model behavior
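    In practice, steering can be as simple as adding a scaled copy of a feature's decoder direction to the residual-stream activations at the hooked layer. The sketch below illustrates the idea on dummy tensors; the feature index, scale, and dimensions are assumptions for illustration, not the paper's exact settings.

    ```python
    import torch

    d_model, d_hidden = 1024, 8 * 1024
    decoder = torch.randn(d_hidden, d_model)   # stand-in for the trained SAE decoder weights
    feature_id, scale = 123, 8.0               # which feature to amplify, and how strongly

    def steer(hidden_states: torch.Tensor) -> torch.Tensor:
        direction = decoder[feature_id]
        direction = direction / direction.norm()    # unit-norm feature direction
        return hidden_states + scale * direction    # broadcasts over batch and sequence dims

    h = torch.randn(1, 32, d_model)   # dummy activations: (batch, seq_len, d_model)
    h_steered = steer(h)              # would be applied via a forward hook during generation
    ```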

    Key Contributions

    🔬 First Multimodal SAE Implementation

    Pioneering application of SAE methodology to multimodal models, opening new research directions in mechanistic interpretability.

    🎯 Cross-Scale Feature Interpretation

    Demonstration that smaller LMMs can learn features interpretable by larger models, enabling scalable analysis approaches.

    🎮 Model Steering Capabilities

    Practical application of learned features for controllable model behavior and output generation.

    🔄 Auto-Explanation Pipeline

    Automated methodology for interpreting visual features without requiring manual semantic labeling.

    Research Impact

    Mechanistic Interpretability Advancement

    This work represents a significant advancement in understanding how multimodal models process and integrate information across modalities.

    Practical Applications

    • Model Debugging - Understanding failure modes and biases
    • Controllable Generation - Steering model outputs for specific applications
    • Safety and Alignment - Better control over model behavior
    • Feature Analysis - Deep understanding of learned representations

    Future Directions

    Our methodology opens new research avenues in:

    1. Cross-Modal Feature Analysis - Understanding feature interactions across modalities
    2. Scalable Interpretability - Extending to larger and more complex models
    3. Real-Time Steering - Dynamic control during inference
    4. Safety Applications - Preventing harmful or biased outputs

    Technical Details

    Architecture Integration

    The SAE is carefully integrated to:

    • Preserve Model Performance - Minimal impact on original capabilities
    • Capture Rich Features - Learn meaningful sparse representations
    • Enable Interpretation - Facilitate analysis by larger models
    • Support Steering - Allow runtime behavioral modification

    Evaluation Methodology

    Our approach is validated through:

    • Feature Interpretability - Qualitative analysis of learned features
    • Steering Effectiveness - Quantitative measurement of behavioral control
    • Cross-Model Validation - Testing interpretation across different model sizes
    • Semantic Consistency - Verifying feature stability and meaning

    Conclusion

    Multimodal-SAE represents a breakthrough in multimodal mechanistic interpretability, providing the first successful demonstration of SAE-based feature interpretation in the multimodal domain. Our work enables:

    • Deeper Understanding of how LMMs process multimodal information
    • Practical Control over model behavior through feature steering
    • Scalable Interpretation methods for increasingly complex models
    • Foundation Research for future advances in multimodal AI safety and control

    This research establishes a new paradigm for understanding and controlling Large Multimodal Models, with significant implications for AI safety, controllability, and interpretability research.

  • The development of video large multimodal models (LMMs) has been hindered by the difficulty of curating large amounts of high-quality raw data from the web. To address this, we consider an alternative approach, creating a high-quality synthetic dataset specifically for video instruction-following, namely LLaVA-Video-178K. This dataset includes key tasks such as detailed captioning, open-ended question-answering (QA), and multiple-choice QA. By training on this proposed dataset, in combination with existing visual instruction tuning data, we introduce LLaVA-Video, a new video LMM. Our experiments demonstrate that LLaVA-Video achieves strong performance across various video benchmarks, highlighting the effectiveness of our dataset. We plan to release the dataset, its generation pipeline, and the model checkpoints.

    Video Instruction-Following Data Synthesis

    A high-quality dataset for video instruction tuning is crucial for developing effective video-language models. We identify a key factor in building such datasets: ensuring richness and diversity in both video content and its language annotations. We perform a comprehensive survey of existing video benchmarks, covering various public video captioning and question-answering datasets, and identify ten unique video sources that contribute to over 40 video-language benchmarks. From each source, we select videos that exhibit significant temporal dynamics. To maintain diversity in the annotations, we establish a pipeline capable of generating detailed captions for videos of any length. Additionally, we define 16 types of questions that guide GPT-4o in creating question-answer pairs to assess the perceptual and reasoning skills of video-language models.

    Video Sources

    We observed that although different video-language datasets focus on various video understanding tasks, most are drawn from ten main video sources, which offer a wide range of video data across different websites, viewpoints, and domains. The relationship between these ten selected video datasets and others is shown in the figure below. We select dynamic videos from these sources; the video selection logic is detailed in the paper.

    Automated Generation for Video Detail Description

    For the selected videos, we use GPT-4o to systematically describe their content. We start by sampling video frames at one frame per second (fps). However, due to the input size constraints of GPT-4o, we cannot use all sampled frames in a single call. Instead, we describe the videos sequentially, as shown in the figure below, creating descriptions at three distinct levels, detailed below.
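    The 1 fps sampling step might look like the following OpenCV sketch; the exact pipeline and per-call frame limits are described in the paper, so treat this as an assumption-laden illustration.

    ```python
    import cv2

    def sample_frames_1fps(video_path: str):
        """Return roughly one frame per second of video."""
        cap = cv2.VideoCapture(video_path)
        fps = cap.get(cv2.CAP_PROP_FPS) or 30.0   # fall back if metadata is missing
        frames, i = [], 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            if i % int(round(fps)) == 0:          # keep the first frame of each second
                frames.append(frame)
            i += 1
        cap.release()
        return frames
    ```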

    Automated Generation for Video Question Answering

    In addition to detailed video descriptions, our dataset includes a variety of question-answer pairs designed for complex interactions. This setup improves the video understanding model’s ability to handle real-life queries. We refer to public video question-answering benchmarks to organize these questions into 16 specific categories, as shown in Figure 3. Given a detailed video description, we use GPT-4o to generate at most one question-answer pair for each type of question. Please refer to the paper for more details of the question types and the generation process.
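    A hedged sketch of that generation step is shown below: given a detailed caption and one of the 16 question types, GPT-4o is asked for at most one question-answer pair. The prompt wording is illustrative, not the paper's actual prompt.

    ```python
    from openai import OpenAI

    client = OpenAI()

    def generate_qa(detailed_caption: str, question_type: str) -> str:
        prompt = (
            f"Based on the following video description, write at most one question-answer "
            f"pair of type '{question_type}'. If the description does not support this "
            f"question type, reply 'None'.\n\nDescription:\n{detailed_caption}"
        )
        resp = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content
    ```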

    Dataset Statistics

    We carefully select from our collected data sources to form a balanced and comprehensive collection, resulting in a total of 178K videos and 1.3M instruction-following samples. This includes 178K captions, 960K open-ended QAs, and 196K multiple-choice QAs.

    Dataset Comparison

    We provide a comparison of high-quality instruction-following video-language datasets, with a focus on synthetic data created with strong AI models, as shown in Table 1.

    A broad collection of dynamic videos. In terms of video sources, although LLaVA-Hound contains the largest number of videos, 44% of its video data are sourced from WebVid, where most videos are static. ShareGPT4Video includes 30% of its videos from Pexels, Pixabay, and Mixkit, which are aesthetically pleasing but also mostly static. Additionally, the majority of its videos come from Panda-70M, which are short clips cut from longer videos, suggesting simpler plots. In contrast, we carefully select video sources that offer dynamic, untrimmed videos with complex plots, which are crucial for developing a powerful video understanding model.

    High frames per second. Regarding frame sampling in language annotations, the proposed dataset considers 1 FPS, while other datasets consider much lower FPS. LLaVA-Hound uniformly samples 10 frames from videos of any length. The average FPS is 0.008, which may miss some fine details. ShareGPT4Video picks key frames using CLIP based on frame uniqueness. This method might also miss subtle changes in the video because CLIP embeddings do not capture fine-grained dynamics well. Our method samples FPS=1 without using key frame selection algorithms, ensuring that detailed temporal information can be expressed in annotations with high coverage.

    Diverse tasks. The proposed dataset covers three common task types, including captioning, free-form QA, and closed-form QA, while existing datasets only consider a subset. Meanwhile, the quality and number of samples in our dataset are higher.

  • LLaVA-OneVision
    LLaVA-OneVision: A unified model for single-image, multi-image, and video understanding

    Overview

    We present LLaVA-OneVision, a family of open large multimodal models (LMMs) developed by consolidating our insights into data, models, and visual representations in the LLaVA-NeXT blog series. LLaVA-OneVision is the first single model that can simultaneously push the performance boundaries of open LMMs in three important computer vision scenarios: single-image, multi-image, and video scenarios.

    Key Features

    Unified Architecture

    LLaVA-OneVision is designed to have a similar maximum visual token count across different scenarios, enabling flexible extension to multiple visual signal types while maintaining consistent performance.

    Model Sizes

    • 0.5B parameters - Lightweight deployment
    • 7B parameters - Balanced performance
    • 72B parameters - State-of-the-art capabilities

    Emerging Capabilities

    The design of LLaVA-OneVision enables strong transfer learning across different modalities and scenarios, yielding impressive emerging capabilities:

    1. Cross-Scenario Understanding

    Seamlessly process and understand content across single images, multiple images, and videos within a unified framework.

    2. Advanced Visual Analysis

    • Diagram and table interpretation - Understanding complex visual structures
    • Multi-screenshot interaction - Analyzing relationships across multiple screens
    • Set-of-mark object referencing - Precise object identification and tracking

    3. Video Capabilities

    • Image-to-video generation understanding - Comprehending temporal transitions
    • Video analysis and comparison - Deep understanding of video content
    • Multi-camera video interpretation - Processing footage from multiple viewpoints
    • Detailed video subject description - Rich, contextual video narration

    Strong Transfer Learning

    Importantly, the design of LLaVA-OneVision allows strong transfer learning across different modalities/scenarios. In particular, strong video understanding and cross-scenario capabilities are demonstrated through task transfer from images to videos, showcasing the model’s ability to generalize learned representations across visual domains.

    Development Roadmap

    LLaVA-OneVision represents a significant milestone in our iterative improvements through the LLaVA-NeXT series, focusing on:

    • Enhanced reasoning capabilities
    • Improved OCR performance
    • Expanded world knowledge
    • Advanced multimodal understanding
  • LMMs-Eval
    LMMs-Eval: A comprehensive evaluation framework for Large Multimodal Models

    In today’s world, we’re on an exciting journey toward creating Artificial General Intelligence (AGI), much like the enthusiasm of the 1960s moon landing. This journey is powered by advanced large language models (LLMs) and large multimodal models (LMMs), which are complex systems capable of understanding, learning, and performing a wide variety of human tasks.

    To gauge how advanced these models are, we use a variety of evaluation benchmarks. These benchmarks are tools that help us understand the capabilities of these models, showing us how close we are to achieving AGI.

    However, finding and using these benchmarks is a big challenge. The necessary benchmarks and datasets are spread out and hidden in various places like Google Drive, Dropbox, and different school and research lab websites. It feels like we’re on a treasure hunt, but the maps are scattered everywhere.

    In the field of language models, a valuable precedent has been set by lm-evaluation-harness. It offers integrated data and model interfaces that enable rapid evaluation of language models, serves as the backend framework for the open-llm-leaderboard, and has gradually become part of the underlying ecosystem of the foundation-model era.

    We humbly absorbed the exquisite and efficient design of lm-evaluation-harness and introduce lmms-eval, an evaluation framework meticulously crafted for consistent and efficient evaluation of LMMs.

  • LongVA
    LongVA's performance on the Visual Needle-In-A-Haystack benchmark, showing accurate retrieval across long video sequences

    Overview

    Gemini has amazed the world with its capability to understand hour-long videos. However, we still lack an open-source alternative with similar capabilities. Our latest research presents an innovative path toward long-video LMMs, shifting the focus from reducing the number of visual tokens per frame to leveraging the long-context capabilities of language models.

    Here, we present our state-of-the-art video model, Long Video Assistant (LongVA), and our novel benchmark, Visual Needle-In-A-Haystack (V-NIAH).

    Key Innovations

    🔄 Long Context Transfer

    We discovered and verified that the long context capability of language models can be directly transferred to the video domain in modality-aligned multi-modal models. On V-NIAH, LongVA is the only open-source model capable of accurately retrieving visual information from inputs with:

    • 2000+ frames
    • 200K+ visual tokens

    🎯 UniRes: Unified Visual Encoding

    We propose UniRes, a unified visual encoding scheme that encodes both images and videos. In UniRes, a video is encoded in the same way as a sequence of image crops.

    Key Benefits:

    • Leverages the Long Context Transfer property
    • Enables superior zero-shot performance in video tasks
    • No video-specific training data required
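    The core UniRes idea can be illustrated with a toy encoder: every frame is split into the same fixed-size patches an image crop would be, embedded, and concatenated into one long token sequence. The patch embedder and dimensions below are stand-ins, not LongVA's actual components.

    ```python
    import torch
    import torch.nn as nn

    patch_embed = nn.Linear(3 * 14 * 14, 1024)   # stand-in for a ViT-style patch embedder

    def encode_frames(frames: torch.Tensor) -> torch.Tensor:
        """frames: (num_frames, 3, 336, 336) -> (num_frames * 576, 1024) visual tokens."""
        n = frames.shape[0]
        patches = frames.unfold(2, 14, 14).unfold(3, 14, 14)          # (n, 3, 24, 24, 14, 14)
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(n, 24 * 24, -1)
        tokens = patch_embed(patches)                                 # (n, 576, 1024)
        return tokens.reshape(-1, 1024)                               # one long token sequence

    video = torch.randn(8, 3, 336, 336)   # 8 frames as a stand-in video
    print(encode_frames(video).shape)     # torch.Size([4608, 1024])
    ```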

    Performance Highlights

    🏆 State-of-the-Art Results

    LongVA achieves state-of-the-art performance among 7B models on the comprehensive Video-MME benchmark.

    Key Performance Features:

    • Performance increases with denser sampling of video frames
    • Superior zero-shot capabilities on video understanding tasks
    • Comprehensive ablation studies validating improvement sources

    📊 V-NIAH Benchmark

    Our novel Visual Needle-In-A-Haystack (V-NIAH) benchmark provides:

    • Rigorous evaluation of long-context visual understanding
    • Testing retrieval accuracy across extended video sequences
    • Open-source evaluation framework for the community

    Technical Architecture

    Multi-Modal Alignment

    LongVA demonstrates that language models’ inherent long-context capabilities can be effectively transferred to visual domains through proper modality alignment.

    Scalable Design

    The architecture scales efficiently with:

    • Increased frame sampling rates
    • Extended sequence lengths
    • Larger visual token counts

    Research Impact

    Open-Source Alternative

    LongVA provides the first viable open-source alternative to proprietary long-video understanding systems, enabling:

    • Academic research advancement
    • Commercial application development
    • Community-driven improvements

    Methodology Innovation

    The long context transfer approach opens new research directions in:

    • Cross-modal capability transfer
    • Efficient video processing
    • Unified multi-modal architectures

    Future Directions

    LongVA establishes a foundation for:

    1. Extended Context Models - Pushing beyond current frame limits
    2. Multi-Modal Transfer Learning - Applying insights to other modalities
    3. Efficient Video Processing - Optimizing computational requirements
    4. Benchmark Development - Creating more comprehensive evaluation metrics
    LongVA Resources: complete resources for LongVA, including source code, the evaluation benchmark, and pre-trained models.