LMMs-Lab


  • Video-MMMU

    Fig. 1

    Video-MMMU asks a fundamental question: If a model ‘goes to class,’ can the model learn from the lecture and apply what it learned to MMMU-style exam problems?

    Motivation

    Our original goal was to build a video reasoning benchmark, motivated by the observation that the most demanding forms of reasoning arise in academic settings—for example, MMMU-style university exam questions.

    Online lectures create an ideal environment for evaluating video reasoning. They effectively convey knowledge and naturally test a model’s ability to learn from video. These videos have three key attributes:

    1. High information density (heavy OCR/ASR signals),
    2. Advanced knowledge requirements (college-level knowledge),
    3. Temporal structure (concepts unfolding over time).

    These properties make reasoning from lecture video notably harder. This leads to our core question:
    When a model watches an online lecture, can it learn like a student—understand the content, acquire the knowledge, and then solve related problems?

    Therefore, we introduce Video-MMMU, a video reasoning benchmark that evaluates knowledge acquisition from video.


    🎓 Video-MMMU Leaderboard

    Δknowledge (defined in the Metric section below) is the normalized accuracy gain on Adaptation-track questions after watching the video; 🟢/🔴 mark positive/negative gains. Overall is the average accuracy over the three tracks.

    | Model | Overall | Δknowledge | Perception | Comprehension | Adaptation |
    | --- | --- | --- | --- | --- | --- |
    | GPT-5-thinking | 84.6 | — | — | — | — |
    | Gemini-2.5-Pro | 83.6 | — | — | — | — |
    | OpenAI O3 | 83.3 | — | — | — | — |
    | Claude-3.5-Sonnet | 65.78 | 🟢 +11.4 | 72.00 | 69.67 | 55.67 |
    | Kimi-VL-A3B-Thinking-2506 | 65.22 | 🟢 +3.5 | 75.00 | 66.33 | 54.33 |
    | GPT-4o | 61.22 | 🟢 +15.6 | 66.00 | 62.00 | 55.67 |
    | Qwen-2.5-VL-72B | 60.22 | 🟢 +9.7 | 69.33 | 61.00 | 50.33 |
    | GLM-4V-PLUS-0111 | 57.56 | 🔴 -1.7 | 77.33 | 53.33 | 42.00 |
    | Gemini 1.5 Pro | 53.89 | 🟢 +8.7 | 59.00 | 53.33 | 49.33 |
    | Aria | 50.78 | 🟢 +3.2 | 65.67 | 46.67 | 40.00 |
    | Gemini 1.5 Flash | 49.78 | 🔴 -3.3 | 57.33 | 49.00 | 43.00 |
    | LLaVA-Video-72B | 49.67 | 🟢 +7.1 | 59.67 | 46.00 | 43.33 |
    | LLaVA-OneVision-72B | 48.33 | 🟢 +6.6 | 59.67 | 42.33 | 43.00 |
    | Qwen-2.5-VL-7B | 47.44 | 🟢 +2.2 | 58.33 | 44.33 | 39.67 |
    | VideoLLaMA3-7B | 47.00 | 🔴 -0.5 | 60.33 | 46.00 | 34.67 |
    | InternVideo2.5-Chat-8B | 43.00 | 🟢 +3.0 | 54.67 | 41.67 | 32.67 |
    | mPLUG-Owl3-7B | 42.00 | 🟢 +7.5 | 49.33 | 38.67 | 38.00 |
    | MAmmoTH-VL-8B | 41.78 | 🟢 +1.5 | 51.67 | 40.00 | 33.67 |
    | VideoChat-Flash-7B@448 | 41.67 | 🔴 -1.3 | 51.67 | 40.67 | 32.67 |
    | InternVL2-8B | 37.44 | 🔴 -8.5 | 47.33 | 33.33 | 31.67 |
    | LLaVA-Video-7B | 36.11 | 🔴 -5.3 | 41.67 | 33.33 | 33.33 |
    | VILA1.5-40B | 34.00 | 🟢 +9.4 | 38.67 | 30.67 | 32.67 |
    | LLaVA-OneVision-7B | 33.89 | 🔴 -5.6 | 40.00 | 31.00 | 30.67 |
    | Llama-3.2-11B | 30.00 | — | 35.67 | 32.33 | 22.00 |
    | LongVA-7B | 23.98 | 🔴 -7.0 | 24.00 | 24.33 | 23.67 |
    | VILA1.5-8B | 20.89 | 🟢 +5.9 | 20.33 | 17.33 | 25.00 |

    Overview

    We introduce Video-MMMU, a multi-modal, multi-disciplinary, multi-track benchmark designed to evaluate how effectively large multimodal models (LMMs) acquire knowledge from educational videos.

    1) Video: Knowledge Source

    Traditional VideoQA benchmarks focus on scene understanding. Video-MMMU instead treats video as a source of knowledge, evaluating whether LMMs can actually learn from instructional content. Video-MMMU includes 300 college-level, lecture-style videos spanning 30 subjects across 6 disciplines: Art, Business, Science, Medicine, Humanities, and Engineering.

    2) QA Design: Three Stages of Knowledge Acquisition

    Each video is paired with three questions, designed to reflect a progression in knowledge acquisition:

    • Perception – Identifying relevant surface information
    • Comprehension – Understanding underlying concepts or strategies
    • Adaptation – Applying learned knowledge to new scenarios
    Fig. 2

    Fig. 2 illustrates examples for each category:

    • Perception: ASR-based (Art, top-left); OCR-based (Business, bottom-left)
    • Comprehension: Concept understanding (Humanities, top-center); Strategy comprehension (Science, bottom-center)
    • Adaptation: Case analysis (Medicine, top-right); Strategy adaptation (Engineering, bottom-right)
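
    To make the three-question pairing concrete, a single benchmark item can be pictured roughly as in the sketch below. This is purely illustrative; the field names and question texts are our own assumptions, not the released data schema.

    # Illustrative structure of one Video-MMMU item (field names are assumed,
    # not the official schema): one lecture video paired with three questions,
    # one per stage of knowledge acquisition.
    video_item = {
        "video_id": "engineering_0042",        # hypothetical identifier
        "discipline": "Engineering",
        "subject": "Electrical Engineering",
        "questions": {
            "perception": {
                "question": "What value does the lecturer write on the board for R1?",
                "type": "OCR-based",
            },
            "comprehension": {
                "question": "Why does the lecturer apply Kirchhoff's voltage law at this step?",
                "type": "Strategy comprehension",
            },
            "adaptation": {
                "question": "Apply the lecture's method to a new circuit with different component values.",
                "type": "Strategy adaptation",
            },
        },
    }
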
    Fig. 3

    3) In-Context Knowledge Acquisition from Video: Can Models Learn Like Humans?

    Humans consistently learn from the world around them. For models to operate effectively in real-world environments, the same principle should apply: they must be able to learn from the world, because unlike humans, they cannot be endlessly re-trained after deployment. In this sense, videos provide a natural proxy for the world. For a model, the video becomes its world. The ability to learn from video therefore becomes more than a technical benchmark—it is a measure of true, dynamic intelligence. It marks the shift from simply solving a task to demonstrating the ability to learn how to solve the task.

    4) Metric: From Absolute Accuracy to Learning Efficiency (Δknowledge)

    Following point 3, a core innovation in Video-MMMU is its shift—from measuring only final performance to measuring learning.

    A model may initially fail an MMMU-style exam question, but we then provide a lecture video from which a human learner could acquire the knowledge needed to solve it. Video-MMMU tests how much LMMs improve after watching such videos, and introduces Δknowledge to quantify this learning gain. Δknowledge is defined as the normalized performance gain on the Adaptation-track questions:

    $$
    \Delta_{\text{knowledge}} = \frac{\text{Acc}_{\text{after\_video}} - \text{Acc}_{\text{before\_video}}}{100\% - \text{Acc}_{\text{before\_video}}} \times 100\%
    $$

    Evaluation of Δknowledge:

    1. Initial Test:
       The model attempts to answer the question *without* seeing the video.

    2. Re-Test after video viewing:
       We provide the corresponding lecture video and ask the model the same question again.

    3. Performance Gain:
       If the model succeeds after watching, it demonstrates successful knowledge acquisition from the video.

    This setup mirrors a human’s natural educational process:

    Don’t know → Learn by watching → Apply the knowledge
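
    As a reference for the formula above, the short sketch below computes Δknowledge from before/after accuracies on the Adaptation track (a minimal illustration; the function and example numbers are ours, not the released evaluation code).

    def delta_knowledge(acc_before: float, acc_after: float) -> float:
        """Normalized performance gain on Adaptation-track questions.

        Both accuracies are percentages in [0, 100]. The gain is normalized by
        the headroom left before watching the video, so going from 80% to 90%
        earns the same credit (+50) as going from 40% to 70%.
        """
        headroom = 100.0 - acc_before
        if headroom == 0.0:
            return 0.0  # already perfect before the video; no room to improve
        return (acc_after - acc_before) / headroom * 100.0

    # Illustrative example: a model improves from 40% to 55% after watching.
    print(delta_knowledge(acc_before=40.0, acc_after=55.0))  # 25.0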

    Key Insights

    Fig. 4
    • Progressive Performance Decline. Model performance decreases as cognitive demands increase. While models perform relatively better on Perception tasks, accuracy drops on Comprehension and declines further on Adaptation.
    • Knowledge Acquisition from Videos is Challenging. The Δknowledge metric reveals a significant human–model gap. Humans show substantial improvement (e.g., Δknowledge ≈ 33.1%), whereas top-performing models show smaller gains (e.g., GPT-4o: 15.6%, Claude-3.5-Sonnet: 11.4%). This highlights a current limitation: LMMs still struggle to learn from videos in the way humans do.

    Evaluation

    Please refer to our Code@Github for full evaluation instructions.


    Case Study

    We provide two case studies. Fig. 5 shows a method-adaptation error, in which the model fails to adapt the method demonstrated in the video to solve the Adaptation question.

    Fig. 5

    Fig. 6 shows a successful case of learning from the video, turning an initially wrong answer into a correct one.

    Fig. 6

    Authors

    🖋 Kairui Hu, Penghao Wu, Fanyi Pu, Wang Xiao, Xiang Yue, Yuanhan Zhang, Bo Li, and Ziwei Liu


    Citation

    @article{hu2025videommmu,
        title={Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos},
        author={Kairui Hu and Penghao Wu and Fanyi Pu and Wang Xiao and Yuanhan Zhang and Xiang Yue and Bo Li and Ziwei Liu},
        journal={arXiv preprint arXiv:2501.13826},
        year={2025},
        url={https://arxiv.org/abs/2501.13826}
    }
  • LLaVA-OneVision

    Overview

    We present LLaVA-OneVision, a family of open large multimodal models (LMMs) developed by consolidating our insights into data, models, and visual representations in the LLaVA-NeXT blog series. LLaVA-OneVision is the first single model that can simultaneously push the performance boundaries of open LMMs in three important computer vision scenarios: single-image, multi-image, and video.

    Key Features

    Unified Architecture

    LLaVA-OneVision is designed to have a similar maximum visual token count across different scenarios, enabling flexible extension to multiple visual signal types while maintaining consistent performance.
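
    The sketch below illustrates the idea of a shared visual-token budget; the tile, frame, and pooling numbers are rough assumptions for illustration, not the exact released configuration.

    # Rough illustration of balancing the maximum visual token count across
    # scenarios (numbers are assumptions, not the official configuration).
    BASE_TOKENS = 729  # tokens per 384x384 view from the vision encoder (assumed)

    def single_image_tokens(num_anyres_tiles: int = 9) -> int:
        # AnyRes: one base view plus high-resolution tiles of the same image
        return BASE_TOKENS * (1 + num_anyres_tiles)

    def multi_image_tokens(num_images: int = 10) -> int:
        # one base view per image, no extra tiling
        return BASE_TOKENS * num_images

    def video_tokens(num_frames: int = 32, tokens_per_frame: int = 196) -> int:
        # frames are pooled more aggressively so many frames fit the same budget
        return num_frames * tokens_per_frame

    # ~7290 vs ~7290 vs ~6272: each scenario lands near the same ceiling,
    # which is what lets capabilities learned on images transfer to video.
    print(single_image_tokens(), multi_image_tokens(), video_tokens())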

    Model Sizes

    • 0.5B parameters - Lightweight deployment
    • 7B parameters - Balanced performance
    • 72B parameters - State-of-the-art capabilities

    Emerging Capabilities

    The design of LLaVA-OneVision enables strong transfer learning across different modalities and scenarios, yielding impressive emerging capabilities:

    1. Cross-Scenario Understanding

    Seamlessly process and understand content across single images, multiple images, and videos within a unified framework.

    2. Advanced Visual Analysis

    • Diagram and table interpretation - Understanding complex visual structures
    • Multi-screenshot interaction - Analyzing relationships across multiple screens
    • Set-of-mark object referencing - Precise object identification and tracking

    3. Video Capabilities

    • Image-to-video generation understanding - Comprehending temporal transitions
    • Video analysis and comparison - Deep understanding of video content
    • Multi-camera video interpretation - Processing footage from multiple viewpoints
    • Detailed video subject description - Rich, contextual video narration

    Strong Transfer Learning

    In particular, strong video understanding and cross-scenario capabilities emerge through task transfer from images to videos, demonstrating the model's ability to generalize learned representations across visual domains.

    Open-Source Resources

    We open-source LLaVA-OneVision to facilitate future development of LMMs in the community:

    🚀 Training Code

    Cook a SOTA model with our released training code and reproduction scripts

    🤗 Model Checkpoints

    Access pre-trained model checkpoints in all three sizes (0.5B, 7B, 72B)
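
    As a quick way to try a checkpoint, the sketch below loads a small model through the Hugging Face transformers port. The hub id is an assumed community conversion, and the original lmms-lab checkpoints are typically run through the LLaVA-NeXT codebase instead; treat this as a starting point rather than the official recipe.

    # Minimal sketch: single-image inference with an (assumed) transformers port.
    import torch
    from PIL import Image
    from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

    model_id = "llava-hf/llava-onevision-qwen2-0.5b-ov-hf"  # assumed hub id
    model = LlavaOnevisionForConditionalGeneration.from_pretrained(
        model_id, torch_dtype=torch.float16, device_map="auto"
    )
    processor = AutoProcessor.from_pretrained(model_id)

    # Build a single-image chat prompt with the processor's chat template.
    conversation = [
        {"role": "user", "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this image in one sentence."},
        ]},
    ]
    prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

    image = Image.open("example.jpg")  # placeholder path: use any local image
    inputs = processor(images=image, text=prompt, return_tensors="pt").to(
        model.device, torch.float16
    )

    output = model.generate(**inputs, max_new_tokens=64)
    print(processor.decode(output[0], skip_special_tokens=True))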

    📊 Training Datasets

    Explore comprehensive training datasets for Single-Image and OneVision stages

    🔥 Live Demo

    Try LLaVA-OneVision directly in your browser

    Development Roadmap

    LLaVA-OneVision represents a significant milestone in our iterative improvements through the LLaVA-NeXT series, focusing on:

    • Enhanced reasoning capabilities
    • Improved OCR performance
    • Expanded world knowledge
    • Advanced multimodal understanding

    Citation

    If you find LLaVA-OneVision useful for your research, please cite:

    @article{li2024llava-onevision,
      title={LLaVA-OneVision: Easy Visual Task Transfer},
      author={Li, Bo and Zhang, Yuanhan and Guo, Dong and Zhang, Renrui and Li, Feng and Zhang, Hao and Zhang, Kaichen and Li, Yanwei and Liu, Ziwei and Li, Chunyuan},
      journal={arXiv preprint arXiv:2408.03326},
      year={2024}
    }

    Acknowledgments

    This work is a collaboration between researchers from ByteDance, NTU, CUHK, and HKUST, building upon the strong foundation of the LLaVA project series.