
Video-MMMU

Fig. 1

Video-MMMU asks a fundamental question: If a model ‘goes to class,’ can it learn from the lecture and apply what it learned to MMMU-style exam problems?

Motivation

Our original goal was to build a video reasoning benchmark, motivated by the observation that the most demanding forms of reasoning arise in academic settings—for example, MMMU-style university exam questions.

Online lectures create an ideal environment for evaluating video reasoning. They effectively convey knowledge and naturally test a model’s ability to learn from video. These videos have three key attributes:

  1. High information density (heavy OCR/ASR signals),
  2. Advanced knowledge requirements (college-level knowledge),
  3. Temporal structure (concepts unfolding over time).

These properties make reasoning over lecture videos notably harder than conventional scene-understanding VideoQA. This leads to our core question:
When a model watches an online lecture, can it learn like a student—understand the content, acquire the knowledge, and then solve related problems?

Therefore, we introduce Video-MMMU, a video reasoning benchmark that evaluates knowledge acquisition from video.


🎓 Video-MMMU Leaderboard

Model | Overall (%) | Δknowledge (%) | Perception (%) | Comprehension (%) | Adaptation (%)
GPT-5-thinking | 84.6 | — | — | — | —
Gemini-2.5-Pro | 83.6 | — | — | — | —
OpenAI O3 | 83.3 | — | — | — | —
Claude-3.5-Sonnet | 65.78 | 🟢 +11.4 | 72.00 | 69.67 | 55.67
Kimi-VL-A3B-Thinking-2506 | 65.22 | 🟢 +3.5 | 75.00 | 66.33 | 54.33
GPT-4o | 61.22 | 🟢 +15.6 | 66.00 | 62.00 | 55.67
Qwen-2.5-VL-72B | 60.22 | 🟢 +9.7 | 69.33 | 61.00 | 50.33
GLM-4V-PLUS-0111 | 57.56 | 🔴 -1.7 | 77.33 | 53.33 | 42.00
Gemini 1.5 Pro | 53.89 | 🟢 +8.7 | 59.00 | 53.33 | 49.33
Aria | 50.78 | 🟢 +3.2 | 65.67 | 46.67 | 40.00
Gemini 1.5 Flash | 49.78 | 🔴 -3.3 | 57.33 | 49.00 | 43.00
LLaVA-Video-72B | 49.67 | 🟢 +7.1 | 59.67 | 46.00 | 43.33
LLaVA-OneVision-72B | 48.33 | 🟢 +6.6 | 59.67 | 42.33 | 43.00
Qwen-2.5-VL-7B | 47.44 | 🟢 +2.2 | 58.33 | 44.33 | 39.67
VideoLLaMA3-7B | 47.00 | 🔴 -0.5 | 60.33 | 46.00 | 34.67
InternVideo2.5-Chat-8B | 43.00 | 🟢 +3.0 | 54.67 | 41.67 | 32.67
mPLUG-Owl3-7B | 42.00 | 🟢 +7.5 | 49.33 | 38.67 | 38.00
MAmmoTH-VL-8B | 41.78 | 🟢 +1.5 | 51.67 | 40.00 | 33.67
VideoChat-Flash-7B@448 | 41.67 | 🔴 -1.3 | 51.67 | 40.67 | 32.67
InternVL2-8B | 37.44 | 🔴 -8.5 | 47.33 | 33.33 | 31.67
LLaVA-Video-7B | 36.11 | 🔴 -5.3 | 41.67 | 33.33 | 33.33
VILA1.5-40B | 34.00 | 🟢 +9.4 | 38.67 | 30.67 | 32.67
LLaVA-OneVision-7B | 33.89 | 🔴 -5.6 | 40.00 | 31.00 | 30.67
Llama-3.2-11B | 30.00 | ➖ — | 35.67 | 32.33 | 22.00
LongVA-7B | 23.98 | 🔴 -7.0 | 24.00 | 24.33 | 23.67
VILA1.5-8B | 20.89 | 🟢 +5.9 | 20.33 | 17.33 | 25.00

Overview

We introduce Video-MMMU, a multi-modal, multi-disciplinary, multi-track benchmark designed to evaluate how effectively large multimodal models (LMMs) acquire knowledge from educational videos.

1) Video: Knowledge Source

Traditional VideoQA benchmarks focus on scene understanding. Video-MMMU instead treats video as a source of knowledge, evaluating whether LMMs can actually learn from instructional content. Video-MMMU includes 300 college-level, lecture-style videos across 30 subjects in 6 disciplines: Art, Business, Science, Medicine, Humanities, and Engineering.

2) QA Design: Three Stages of Knowledge Acquisition

Each video is paired with three questions, designed to reflect a progression in knowledge acquisition:

  • Perception – Identifying relevant surface information
  • Comprehension – Understanding underlying concepts or strategies
  • Adaptation – Applying learned knowledge to new scenarios
Fig. 2

Fig. 2 illustrates examples for each category:

  • Perception: ASR-based (Art, top-left); OCR-based (Business, bottom-left)
  • Comprehension: Concept understanding (Humanities, top-center); Strategy comprehension (Science, bottom-center)
  • Adaptation: Case analysis (Medicine, top-right); Strategy adaptation (Engineering, bottom-right)
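
For concreteness, a single video with its three paired questions might be organized roughly as sketched below. This is an illustrative sketch only; the field names are assumptions, not the benchmark's released schema.

```python
# Hypothetical layout of one Video-MMMU entry (field names are illustrative).
example_entry = {
    "video_id": "engineering_0042",   # one of the 300 lecture-style videos
    "discipline": "Engineering",      # one of 6 disciplines spanning 30 subjects
    "questions": {
        "perception":    {"question": "...", "answer": "..."},  # identify surface information (OCR/ASR)
        "comprehension": {"question": "...", "answer": "..."},  # understand the underlying concept or strategy
        "adaptation":    {"question": "...", "answer": "..."},  # apply the learned knowledge to a new scenario
    },
}
```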
Fig. 3

3) In-Context Knowledge Acquisition from Video: Can Models Learn Like Humans?

Humans consistently learn from the world around them. For models to operate effectively in real-world environments, the same principle should apply: they must be able to learn from the world, because unlike humans, they cannot be endlessly re-trained after deployment. In this sense, videos provide a natural proxy for the world. For a model, the video becomes its world. The ability to learn from video therefore becomes more than a technical benchmark—it is a measure of true, dynamic intelligence. It marks the shift from simply solving a task to demonstrating the ability to learn how to solve the task.

4) Metric: From Absolute Accuracy to Learning Efficiency (Δknowledge)

Following point 3, a core innovation of Video-MMMU is its shift from measuring only final performance to measuring learning itself.

A model may initially fail to solve an MMMU-style exam question, but we then provide a lecture video from which a human learner could learn to solve that question. Video-MMMU tests how much LMMs improve after watching the video, and introduces Δknowledge to quantify this learning gain. Δknowledge is defined as the normalized performance gain on the Adaptation-track questions:

\Delta_{\text{knowledge}} = \frac{\text{Acc}_{\text{after\_video}} - \text{Acc}_{\text{before\_video}}}{100\% - \text{Acc}_{\text{before\_video}}} \times 100\%
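
Equivalently, as a minimal computation sketch (accuracies given in percent):

```python
def delta_knowledge(acc_before: float, acc_after: float) -> float:
    """Normalized Adaptation-track gain (%) from watching the video.

    acc_before: Adaptation accuracy (%) without seeing the video.
    acc_after:  Adaptation accuracy (%) after watching the video.
    """
    return (acc_after - acc_before) / (100.0 - acc_before) * 100.0

# Example: improving from 40% to 55% recovers a quarter of the remaining headroom.
print(delta_knowledge(40.0, 55.0))  # 25.0
```

Normalizing by the remaining headroom (100% − Acc_before_video) makes gains comparable across models that start from different baselines.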

Evaluation of Δknowledge:

1. Initial Test:
   The model attempts to answer a question *without* seeing the video.

2. Re-Test after video viewing:
   We provide the corresponding lecture video and ask the model the same question again.

3. Performance Gain:
   If the model succeeds after watching, it demonstrates successful knowledge acquisition from video.

This setup mirrors a human’s natural educational process:

Don’t know → Learn by watching → Apply the knowledge
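
A minimal sketch of this two-pass protocol is shown below. The model.answer interface and the field names are assumptions for illustration, not the benchmark's actual harness (see the Evaluation section for the released code); delta_knowledge is the helper defined above.

```python
def evaluate_delta_knowledge(model, adaptation_questions):
    """Run the before/after-video protocol on the Adaptation track.

    adaptation_questions: list of dicts with 'question', 'answer', and
    'video_path' keys (assumed field names).
    """
    correct_before = correct_after = 0
    for q in adaptation_questions:
        # Pass 1: the question alone, without the lecture video.
        pred = model.answer(question=q["question"], video=None)
        correct_before += int(pred == q["answer"])
        # Pass 2: the same question, now with the paired lecture video in context.
        pred = model.answer(question=q["question"], video=q["video_path"])
        correct_after += int(pred == q["answer"])
    n = len(adaptation_questions)
    acc_before = 100.0 * correct_before / n
    acc_after = 100.0 * correct_after / n
    return delta_knowledge(acc_before, acc_after)
```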

Key Insights

Fig. 4
  • Progressive Performance Decline. Model performance decreases as cognitive demands increase. While models perform relatively better on Perception tasks, accuracy drops on Comprehension and declines further on Adaptation.
  • Knowledge Acquisition from Videos is Challenging. The Δknowledge metric reveals a significant human–model gap. Humans show substantial improvement (e.g., Δknowledge ≈ 33.1%), whereas top-performing models show smaller gains (e.g., GPT-4o: 15.6%, Claude-3.5-Sonnet: 11.4%). This highlights a current limitation: LMMs still struggle to learn from videos in the way humans do.

Evaluation

Please refer to our Code@Github for full evaluation instructions.


Case Study

We provide two case studies. Fig. 5 demonstrates a method-adaptation error, in which the model fails to adapt the method taught in the video to solve the Adaptation question.

Fig. 5

Fig. 6 shows a successful case of learning from the video, turning an initially wrong answer into a correct one.

Fig. 6

Authors

🖋 Kairui Hu, Penghao Wu, Fanyi Pu, Wang Xiao, Xiang Yue, Yuanhan Zhang, Bo Li, and Ziwei Liu


Citation

@article{hu2025videommmu,
    title={Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos},
    author={Kairui Hu and Penghao Wu and Fanyi Pu and Wang Xiao and Yuanhan Zhang and Xiang Yue and Bo Li and Ziwei Liu},
    journal={arXiv preprint arXiv:2501.13826},
    year={2025},
    url={https://arxiv.org/abs/2501.13826}
}