LMMs-Lab
Tags #benchmarks

  • Fig. 1

    Video-MMMU asks a fundamental question: If a model ‘goes to class,’ can the model learn from the lecture and apply what it learned to MMMU-style exam problems?

    Motivation

    Our original goal was to build a video reasoning benchmark, motivated by the observation that the most demanding forms of reasoning arise in academic settings—for example, MMMU-style university exam questions.

    Online lectures create an ideal environment for evaluating video reasoning. They effectively convey knowledge and naturally test a model’s ability to learn from video. These videos have three key attributes:

    1. High information density (heavy OCR/ASR signals),
    2. Advanced knowledge requirements (college-level knowledge),
    3. Temporal structure (concepts unfolding over time).

    These properties make reasoning over lecture videos considerably harder, which leads to our core question:
    When a model watches an online lecture, can it learn like a student—understand the content, acquire the knowledge, and then solve related problems?

    Therefore, we introduce Video-MMMU, a video reasoning benchmark that evaluates knowledge acquisition from video.


    🎓 Video-MMMU Leaderboard

    | Model | Overall | Δknowledge | Perception | Comprehension | Adaptation |
    |---|---|---|---|---|---|
    | GPT-5-thinking | 84.6 | — | — | — | — |
    | Gemini-2.5-Pro | 83.6 | — | — | — | — |
    | OpenAI O3 | 83.3 | — | — | — | — |
    | Claude-3.5-Sonnet | 65.78 | 🟢 +11.4 | 72.00 | 69.67 | 55.67 |
    | Kimi-VL-A3B-Thinking-2506 | 65.22 | 🟢 +3.5 | 75.00 | 66.33 | 54.33 |
    | GPT-4o | 61.22 | 🟢 +15.6 | 66.00 | 62.00 | 55.67 |
    | Qwen-2.5-VL-72B | 60.22 | 🟢 +9.7 | 69.33 | 61.00 | 50.33 |
    | GLM-4V-PLUS-0111 | 57.56 | 🔴 -1.7 | 77.33 | 53.33 | 42.00 |
    | Gemini 1.5 Pro | 53.89 | 🟢 +8.7 | 59.00 | 53.33 | 49.33 |
    | Aria | 50.78 | 🟢 +3.2 | 65.67 | 46.67 | 40.00 |
    | Gemini 1.5 Flash | 49.78 | 🔴 -3.3 | 57.33 | 49.00 | 43.00 |
    | LLaVA-Video-72B | 49.67 | 🟢 +7.1 | 59.67 | 46.00 | 43.33 |
    | LLaVA-OneVision-72B | 48.33 | 🟢 +6.6 | 59.67 | 42.33 | 43.00 |
    | Qwen-2.5-VL-7B | 47.44 | 🟢 +2.2 | 58.33 | 44.33 | 39.67 |
    | VideoLLaMA3-7B | 47.00 | 🔴 -0.5 | 60.33 | 46.00 | 34.67 |
    | InternVideo2.5-Chat-8B | 43.00 | 🟢 +3.0 | 54.67 | 41.67 | 32.67 |
    | mPLUG-Owl3-7B | 42.00 | 🟢 +7.5 | 49.33 | 38.67 | 38.00 |
    | MAmmoTH-VL-8B | 41.78 | 🟢 +1.5 | 51.67 | 40.00 | 33.67 |
    | VideoChat-Flash-7B@448 | 41.67 | 🔴 -1.3 | 51.67 | 40.67 | 32.67 |
    | InternVL2-8B | 37.44 | 🔴 -8.5 | 47.33 | 33.33 | 31.67 |
    | LLaVA-Video-7B | 36.11 | 🔴 -5.3 | 41.67 | 33.33 | 33.33 |
    | VILA1.5-40B | 34.00 | 🟢 +9.4 | 38.67 | 30.67 | 32.67 |
    | LLaVA-OneVision-7B | 33.89 | 🔴 -5.6 | 40.00 | 31.00 | 30.67 |
    | Llama-3.2-11B | 30.00 | ➖ — | 35.67 | 32.33 | 22.00 |
    | LongVA-7B | 23.98 | 🔴 -7.0 | 24.00 | 24.33 | 23.67 |
    | VILA1.5-8B | 20.89 | 🟢 +5.9 | 20.33 | 17.33 | 25.00 |

    Overview

    We introduce Video-MMMU, a multi-modal, multi-disciplinary, multi-track benchmark designed to evaluate how effectively large multimodal models (LMMs) acquire knowledge from educational videos.

    1) Video: Knowledge Source

    Traditional VideoQA benchmarks focus on scene understanding. Video-MMMU treats video as a source of knowledge, evaluating whether LMMs can actually learn from instructional content. Video-MMMU includes 300 college-level, lecture-style videos across 30 subjects in 6 disciplines: Art, Business, Science, Medicine, Humanities, and Engineering.

    2) QA Design: Three Stages of Knowledge Acquisition

    Each video is paired with three questions, designed to reflect a progression in knowledge acquisition:

    • Perception – Identifying relevant surface information
    • Comprehension – Understanding underlying concepts or strategies
    • Adaptation – Applying learned knowledge to new scenarios
    Fig. 2

    Fig. 2 illustrates examples for each category:

    • Perception: ASR-based (Art, top-left); OCR-based (Business, bottom-left)
    • Comprehension: Concept understanding (Humanities, top-center); Strategy comprehension (Science, bottom-center)
    • Adaptation: Case analysis (Medicine, top-right); Strategy adaptation (Engineering, bottom-right)
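
    To make the three-track design concrete, here is a purely illustrative sketch of how a single Video-MMMU item can be pictured: one lecture video paired with one question per track. The field names and question text below are invented for clarity and are not the released dataset schema.

    ```python
    # Hypothetical sketch of one Video-MMMU item: one lecture video plus three
    # questions mirroring the Perception -> Comprehension -> Adaptation
    # progression. Field names and contents are invented for illustration only.
    example_item = {
        "video_id": "engineering_lecture_000",   # one of the 300 lecture videos
        "discipline": "Engineering",             # one of the 6 disciplines
        "subject": "Computer Science",           # one of the 30 subjects
        "questions": {
            "perception": "What equation does the lecturer write on the slide?",
            "comprehension": "Which strategy does the lecture use to bound the running time?",
            "adaptation": "Apply the method taught in the lecture to this new problem: ...",
        },
    }
    ```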
    Fig. 3

    3) In-Context Knowledge Acquisition from Video: Can Models Learn Like Humans?

    Humans learn continuously from the world around them. For models to operate effectively in real-world environments, the same principle should apply: they must be able to learn from the world, because, unlike humans, they cannot be endlessly re-trained after deployment. In this sense, videos provide a natural proxy for the world; for a model, the video becomes its world. The ability to learn from video is therefore more than a technical benchmark: it is a measure of true, dynamic intelligence, marking the shift from simply solving a task to learning how to solve it.

    4) Metric: From Absolute Accuracy to Learning Efficiency (Δknowledge)

    Building on point 3, a core innovation of Video-MMMU is its shift from measuring only final performance to measuring learning itself.

    A model may initially fail an MMMU-style exam question, but we then provide a lecture video from which a human learner could acquire the knowledge needed to solve it. Video-MMMU tests how much LMMs improve after watching these videos and introduces Δknowledge to quantify this learning gain. Δknowledge is defined as the normalized performance gain on the Adaptation-track questions:

    $$
    \Delta_{\text{knowledge}} = \frac{\text{Acc}_{\text{after\_video}} - \text{Acc}_{\text{before\_video}}}{100\% - \text{Acc}_{\text{before\_video}}} \times 100\%
    $$

    Evaluation of Δknowledge:

    1. Initial test: the model first attempts to answer the question *without* seeing the video.
    2. Re-test after video viewing: we provide the corresponding lecture video and ask the model the same question again.
    3. Performance gain: if the model succeeds after watching, it demonstrates successful knowledge acquisition from the video.

    This setup mirrors a human’s natural educational process:

    Don’t know → Learn by watching → Apply the knowledge
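
    To make the metric concrete, below is a minimal sketch of the Δknowledge computation, assuming accuracies are reported as percentages on the Adaptation track. The function name is ours, not part of the official evaluation code; see the repository for the actual pipeline.

    ```python
    # Minimal sketch of the Δknowledge computation (hypothetical helper name;
    # the official lmms-eval implementation may differ in detail).

    def delta_knowledge(acc_before: float, acc_after: float) -> float:
        """Normalized performance gain on the Adaptation track.

        Both accuracies are percentages in [0, 100]. The gain is normalized by
        the headroom left before watching the video, so a model that starts
        near-perfect is not penalized for having little room to improve.
        """
        if acc_before >= 100.0:
            return 0.0  # no headroom left; treat the gain as zero
        return (acc_after - acc_before) / (100.0 - acc_before) * 100.0


    # Worked example: 40% -> 55% on the Adaptation questions.
    # Headroom = 100 - 40 = 60, raw gain = 15, so Δknowledge = 15 / 60 * 100 = 25%.
    print(delta_knowledge(40.0, 55.0))  # 25.0
    ```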

    Key Insights

    Fig. 4
    • Progressive Performance Decline. Model performance decreases as cognitive demands increase. While models perform relatively better on Perception tasks, accuracy drops on Comprehension and declines further on Adaptation.
    • Knowledge Acquisition from Videos is Challenging. The Δknowledge metric reveals a significant human–model gap. Humans show substantial improvement (e.g., Δknowledge ≈ 33.1%), whereas top-performing models show smaller gains (e.g., GPT-4o: 15.6%, Claude-3.5-Sonnet: 11.4%). This highlights a current limitation: LMMs still struggle to learn from videos in the way humans do.

    Evaluation

    Please refer to our Code@Github for full evaluation instructions.


    Case Study

    We provide two case studies. Fig. 5 demonstrates a method-adaptation error, in which the model failed to adapt the method presented in the video to solve the Adaptation question.

    Fig. 5

    Fig. 6 shows a successful case of learning from the video, turning an initially wrong answer into a correct one.

    Fig. 6

    Authors

    🖋 Kairui Hu, Penghao Wu, Fanyi Pu, Wang Xiao, Xiang Yue, Yuanhan Zhang, Bo Li, and Ziwei Liu


    Citation

    @article{hu2025videommmu,
        title={Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos},
        author={Kairui Hu and Penghao Wu and Fanyi Pu and Wang Xiao and Yuanhan Zhang and Xiang Yue and Bo Li and Ziwei Liu},
        journal={arXiv preprint arXiv:2501.13826},
        year={2025},
        url={https://arxiv.org/abs/2501.13826}
    }
  • Banner

    In today’s world, we’re on an exciting journey toward creating Artificial General Intelligence (AGI), much like the enthusiasm of the 1960s moon landing. This journey is powered by advanced large language models (LLMs) and large multimodal models (LMMs), which are complex systems capable of understanding, learning, and performing a wide variety of human tasks.

    To gauge how advanced these models are, we use a variety of evaluation benchmarks. These benchmarks are tools that help us understand the capabilities of these models, showing us how close we are to achieving AGI.

    However, finding and using these benchmarks is a big challenge. The necessary benchmarks and datasets are spread out and hidden in various places like Google Drive, Dropbox, and different school and research lab websites. It feels like we’re on a treasure hunt, but the maps are scattered everywhere.

    In the field of language models, a valuable precedent has been set by lm-evaluation-harness. It offers integrated data and model interfaces that enable rapid evaluation of language models, serves as the backend framework for the open-llm-leaderboard, and has gradually become the underlying ecosystem of the foundation-model era.

    Drawing on the elegant and efficient design of lm-evaluation-harness, we introduce lmms-eval, an evaluation framework meticulously crafted for consistent and efficient evaluation of LMMs.