LMMs-Lab
Tags #benchmarks

  • Fig. 1

    Video-MMMU asks a fundamental question: If a model ‘goes to class,’ can the model learn from the lecture and apply what it learned to MMMU-style exam problems?

    Motivation

    Our original goal was to build a video reasoning benchmark, motivated by the observation that the most demanding forms of reasoning arise in academic settings—for example, MMMU-style university exam questions.

    Online lectures create an ideal environment for evaluating video reasoning. They effectively convey knowledge and naturally test a model’s ability to learn from video. These videos have three key attributes:

    1. High information density (heavy OCR/ASR signals),
    2. Advanced knowledge requirements (college-level knowledge),
    3. Temporal structure (concepts unfolding over time).

    These properties make reasoning over lecture videos considerably harder, which leads to our core question:
    When a model watches an online lecture, can it learn like a student—understand the content, acquire the knowledge, and then solve related problems?

    Therefore, we introduce Video-MMMU, a video reasoning benchmark that evaluates knowledge acquisition from video.


    🎓 Video-MMMU Leaderboard

    | Model | Overall | Δknowledge | Perception | Comprehension | Adaptation |
    |---|---|---|---|---|---|
    | GPT-5-thinking | 84.6 | — | — | — | — |
    | Gemini-2.5-Pro | 83.6 | — | — | — | — |
    | OpenAI O3 | 83.3 | — | — | — | — |
    | Claude-3.5-Sonnet | 65.78 | 🟢 +11.4 | 72.00 | 69.67 | 55.67 |
    | Kimi-VL-A3B-Thinking-2506 | 65.22 | 🟢 +3.5 | 75.00 | 66.33 | 54.33 |
    | GPT-4o | 61.22 | 🟢 +15.6 | 66.00 | 62.00 | 55.67 |
    | Qwen-2.5-VL-72B | 60.22 | 🟢 +9.7 | 69.33 | 61.00 | 50.33 |
    | GLM-4V-PLUS-0111 | 57.56 | 🔴 -1.7 | 77.33 | 53.33 | 42.00 |
    | Gemini 1.5 Pro | 53.89 | 🟢 +8.7 | 59.00 | 53.33 | 49.33 |
    | Aria | 50.78 | 🟢 +3.2 | 65.67 | 46.67 | 40.00 |
    | Gemini 1.5 Flash | 49.78 | 🔴 -3.3 | 57.33 | 49.00 | 43.00 |
    | LLaVA-Video-72B | 49.67 | 🟢 +7.1 | 59.67 | 46.00 | 43.33 |
    | LLaVA-OneVision-72B | 48.33 | 🟢 +6.6 | 59.67 | 42.33 | 43.00 |
    | Qwen-2.5-VL-7B | 47.44 | 🟢 +2.2 | 58.33 | 44.33 | 39.67 |
    | VideoLLaMA3-7B | 47.00 | 🔴 -0.5 | 60.33 | 46.00 | 34.67 |
    | InternVideo2.5-Chat-8B | 43.00 | 🟢 +3.0 | 54.67 | 41.67 | 32.67 |
    | mPLUG-Owl3-7B | 42.00 | 🟢 +7.5 | 49.33 | 38.67 | 38.00 |
    | MAmmoTH-VL-8B | 41.78 | 🟢 +1.5 | 51.67 | 40.00 | 33.67 |
    | VideoChat-Flash-7B@448 | 41.67 | 🔴 -1.3 | 51.67 | 40.67 | 32.67 |
    | InternVL2-8B | 37.44 | 🔴 -8.5 | 47.33 | 33.33 | 31.67 |
    | LLaVA-Video-7B | 36.11 | 🔴 -5.3 | 41.67 | 33.33 | 33.33 |
    | VILA1.5-40B | 34.00 | 🟢 +9.4 | 38.67 | 30.67 | 32.67 |
    | LLaVA-OneVision-7B | 33.89 | 🔴 -5.6 | 40.00 | 31.00 | 30.67 |
    | Llama-3.2-11B | 30.00 | ➖ — | 35.67 | 32.33 | 22.00 |
    | LongVA-7B | 23.98 | 🔴 -7.0 | 24.00 | 24.33 | 23.67 |
    | VILA1.5-8B | 20.89 | 🟢 +5.9 | 20.33 | 17.33 | 25.00 |

    Overview

    We introduce Video-MMMU, a multi-modal, multi-disciplinary, multi-track benchmark designed to evaluate how effectively large multimodal models (LMMs) acquire knowledge from educational videos.

    1) Video: Knowledge Source

    Traditional VideoQA benchmarks focus on scene understanding. Video-MMMU treats video as a source of knowledge, evaluating whether LMMs can actually learn from instructional content. Video-MMMU includes 300 college-level, lecture-style videos across 30 subjects in 6 disciplines: Art, Business, Science, Medicine, Humanities, and Engineering.

    2) QA Design: Three Stages of Knowledge Acquisition

    Each video is paired with three questions, designed to reflect a progression in knowledge acquisition:

    • Perception – Identifying relevant surface information
    • Comprehension – Understanding underlying concepts or strategies
    • Adaptation – Applying learned knowledge to new scenarios
    Fig. 2

    Fig. 2 illustrates examples for each category:

    • Perception: ASR-based (Art, top-left); OCR-based (Business, bottom-left)
    • Comprehension: Concept understanding (Humanities, top-center); Strategy comprehension (Science, bottom-center)
    • Adaptation: Case analysis (Medicine, top-right); Strategy adaptation (Engineering, bottom-right)
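
    To make the three-track design concrete, here is a purely illustrative sketch of how a single Video-MMMU item can be pictured: one lecture video paired with one question per track. The field names and question text below are invented for clarity and are not the released dataset schema.

    ```python
    # Hypothetical sketch of one Video-MMMU item: one lecture video plus three
    # questions mirroring the Perception -> Comprehension -> Adaptation
    # progression. Field names and contents are invented for illustration only.
    example_item = {
        "video_id": "engineering_lecture_000",   # one of the 300 lecture videos
        "discipline": "Engineering",             # one of the 6 disciplines
        "subject": "Computer Science",           # one of the 30 subjects
        "questions": {
            "perception": "What equation does the lecturer write on the slide?",
            "comprehension": "Which strategy does the lecture use to bound the running time?",
            "adaptation": "Apply the method taught in the lecture to this new problem: ...",
        },
    }
    ```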
    Fig. 3

    3) In-Context Knowledge Acquisition from Video: Can Models Learn Like Humans?

    Humans learn continuously from the world around them. For models to operate effectively in real-world environments, the same principle should apply: they must be able to learn from the world, because, unlike humans, they cannot be endlessly re-trained after deployment. In this sense, videos provide a natural proxy for the world; for a model, the video becomes its world. The ability to learn from video is therefore more than a technical benchmark: it is a measure of true, dynamic intelligence, marking the shift from simply solving a task to learning how to solve it.

    4) Metric: From Absolute Accuracy to Learning Efficiency (Δknowledge)

    Building on point 3, a core innovation of Video-MMMU is its shift from measuring only final performance to measuring learning itself.

    A model may initially fail an MMMU-style exam question, but we then provide a lecture video from which a human learner could acquire the knowledge needed to solve it. Video-MMMU tests how much LMMs improve after watching these videos and introduces Δknowledge to quantify this learning gain. Δknowledge is defined as the normalized performance gain on the Adaptation-track questions:

    $$
    \Delta_{\text{knowledge}} = \frac{\text{Acc}_{\text{after\_video}} - \text{Acc}_{\text{before\_video}}}{100\% - \text{Acc}_{\text{before\_video}}} \times 100\%
    $$

    Evaluation of Δknowledge:

    1. Initial test: the model first attempts to answer the question *without* seeing the video.
    2. Re-test after video viewing: we provide the corresponding lecture video and ask the model the same question again.
    3. Performance gain: if the model succeeds after watching, it demonstrates successful knowledge acquisition from the video.

    This setup mirrors a human’s natural educational process:

    Don’t know → Learn by watching → Apply the knowledge
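
    To make the metric concrete, below is a minimal sketch of the Δknowledge computation, assuming accuracies are reported as percentages on the Adaptation track. The function name is ours, not part of the official evaluation code; see the repository for the actual pipeline.

    ```python
    # Minimal sketch of the Δknowledge computation (hypothetical helper name;
    # the official lmms-eval implementation may differ in detail).

    def delta_knowledge(acc_before: float, acc_after: float) -> float:
        """Normalized performance gain on the Adaptation track.

        Both accuracies are percentages in [0, 100]. The gain is normalized by
        the headroom left before watching the video, so a model that starts
        near-perfect is not penalized for having little room to improve.
        """
        if acc_before >= 100.0:
            return 0.0  # no headroom left; treat the gain as zero
        return (acc_after - acc_before) / (100.0 - acc_before) * 100.0


    # Worked example: 40% -> 55% on the Adaptation questions.
    # Headroom = 100 - 40 = 60, raw gain = 15, so Δknowledge = 15 / 60 * 100 = 25%.
    print(delta_knowledge(40.0, 55.0))  # 25.0
    ```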

    Key Insights

    Fig. 4
    • Progressive Performance Decline. Model performance decreases as cognitive demands increase. While models perform relatively better on Perception tasks, accuracy drops on Comprehension and declines further on Adaptation.
    • Knowledge Acquisition from Videos is Challenging. The Δknowledge metric reveals a significant human–model gap. Humans show substantial improvement (e.g., Δknowledge ≈ 33.1%), whereas top-performing models show smaller gains (e.g., GPT-4o: 15.6%, Claude-3.5-Sonnet: 11.4%). This highlights a current limitation: LMMs still struggle to learn from videos in the way humans do.

    Evaluation

    Please refer to our Code@Github for full evaluation instructions.


    Case Study

    We provide two case studies. Fig. 5 demonstrates a method-adaptation error, in which the model failed to adapt the method presented in the video to solve the Adaptation question.

    Fig. 5

    Fig. 6 shows a successful case of learning from the video, turning an initially wrong answer into a correct one.

    Fig. 6

    Authors

    🖋 Kairui Hu, Penghao Wu, Fanyi Pu, Wang Xiao, Xiang Yue, Yuanhan Zhang, Bo Li, and Ziwei Liu


    Citation

    @article{hu2025videommmu,
        title={Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos},
        author={Kairui Hu and Penghao Wu and Fanyi Pu and Wang Xiao and Yuanhan Zhang and Xiang Yue and Bo Li and Ziwei Liu},
        journal={arXiv preprint arXiv:2501.13826},
        year={2025},
        url={https://arxiv.org/abs/2501.13826}
    }
  • Banner

    In today’s world, we’re on an exciting journey toward creating Artificial General Intelligence (AGI), much like the enthusiasm of the 1960s moon landing. This journey is powered by advanced large language models (LLMs) and large multimodal models (LMMs), which are complex systems capable of understanding, learning, and performing a wide variety of human tasks.

    To gauge how advanced these models are, we use a variety of evaluation benchmarks. These benchmarks are tools that help us understand the capabilities of these models, showing us how close we are to achieving AGI.

    However, finding and using these benchmarks is a big challenge. The necessary benchmarks and datasets are spread out and hidden in various places like Google Drive, Dropbox, and different school and research lab websites. It feels like we’re on a treasure hunt, but the maps are scattered everywhere.

    In the field of language models, a valuable precedent has been set by lm-evaluation-harness. It offers integrated data and model interfaces that enable rapid evaluation of language models, serves as the backend framework for the open-llm-leaderboard, and has gradually become the underlying ecosystem of the foundation-model era.

    Drawing on the elegant and efficient design of lm-evaluation-harness, we introduce lmms-eval, an evaluation framework meticulously crafted for consistent and efficient evaluation of LMMs.