
Tags: #benchmarks

  • Video-MMMU Overview
    Video-MMMU: A comprehensive benchmark for evaluating knowledge acquisition from educational videos across multiple disciplines
    Website · Paper · Dataset

    Video-MMMU asks a fundamental question: If a model ‘goes to class,’ can it learn from the lecture and apply what it learned to MMMU-style exam problems?

    🎯 Motivation

    Our original goal was to build a video reasoning benchmark, motivated by the observation that the most demanding forms of reasoning arise in academic settings—for example, MMMU-style university exam questions.

    Online lectures create an ideal environment for evaluating video reasoning. They effectively convey knowledge and naturally test a model’s ability to learn from video. These videos have three key attributes:

    1. High information density (heavy OCR/ASR signals)
    2. Advanced knowledge requirements (college-level knowledge)
    3. Temporal structure (concepts unfolding over time)

    These properties make reasoning over lecture videos notably harder. This leads to our core question:

    When a model watches an online lecture, can it learn like a student—understand the content, acquire the knowledge, and then solve related problems?

    Therefore, we introduce Video-MMMU, a video reasoning benchmark that evaluates knowledge acquisition from video.

    🏆 Video-MMMU Leaderboard

    Model                       | Overall | Δknowledge | Perception | Comprehension | Adaptation
    GPT-5-thinking              | 84.6    | –          | –          | –             | –
    Gemini-2.5-Pro              | 83.6    | –          | –          | –             | –
    OpenAI O3                   | 83.3    | –          | –          | –             | –
    Claude-3.5-Sonnet           | 65.78   | 🟢 +11.4   | 72.00      | 69.67         | 55.67
    Kimi-VL-A3B-Thinking-2506   | 65.22   | 🟢 +3.5    | 75.00      | 66.33         | 54.33
    GPT-4o                      | 61.22   | 🟢 +15.6   | 66.00      | 62.00         | 55.67
    Qwen-2.5-VL-72B             | 60.22   | 🟢 +9.7    | 69.33      | 61.00         | 50.33

    See the full leaderboard with 20+ models in our paper and on our website.

    📚 Overview

    We introduce Video-MMMU, a multi-modal, multi-disciplinary, multi-track benchmark designed to evaluate how effectively large multimodal models (LMMs) acquire knowledge from educational videos.

    1) Video: Knowledge Source

    Traditional VideoQA benchmarks focus on scene understanding. Video-MMMU treats video as a source of knowledge, evaluating whether LMMs can actually learn from instructional content.

    Dataset Composition:

    • 300 college-level, lecture-style videos
    • 30 subjects across 6 disciplines: Art, Business, Science, Medicine, Humanities, and Engineering
    • High-quality educational content from university-level courses

    2) QA Design: Three Stages of Knowledge Acquisition

    Each video is paired with three questions (900 in total across the benchmark), designed to reflect a progression in knowledge acquisition:

    • 🔍 Perception – Identifying relevant surface information
    • 🧠 Comprehension – Understanding underlying concepts or strategies
    • 🎯 Adaptation – Applying learned knowledge to new scenarios
    Figure 2: Examples for each knowledge acquisition category across different disciplines: Perception (ASR/OCR-based), Comprehension (concept/strategy understanding), and Adaptation (application to new scenarios).

    Figure 3: Video-MMMU benchmark structure, showing the progression from video content to the three-track evaluation framework.

    3) In-Context Knowledge Acquisition: Learning Like Humans

    Humans learn continuously from the world around them. For models to operate effectively in real-world environments, the same principle should apply: unlike humans, they cannot be endlessly re-trained after deployment, so they must be able to learn from the world in context.

    In this sense, videos provide a natural proxy for the world. For a model, the video becomes its world. The ability to learn from video therefore becomes more than a technical benchmark—it is a measure of true, dynamic intelligence. It marks the shift from simply solving a task to demonstrating the ability to learn how to solve the task.

    4) Metric: From Absolute Accuracy to Learning Efficiency (Δknowledge)

    A core innovation in Video-MMMU is its shift from measuring only final performance to measuring learning.

    Δknowledge Formula

    Δknowledge = (Acc_after_video - Acc_before_video) / (100% - Acc_before_video) × 100%
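
    For a concrete, hypothetical illustration: a model that scores 40% before watching and 55% after has closed a quarter of its remaining headroom, since (55 − 40) / (100 − 40) × 100% = 25%. A minimal Python sketch of the same computation (the function name and numbers are ours, purely for illustration):

    def delta_knowledge(acc_before, acc_after):
        # Normalized gain: raw improvement divided by the headroom
        # the model still had before watching the video (both in %).
        return (acc_after - acc_before) / (100.0 - acc_before) * 100.0

    print(delta_knowledge(40.0, 55.0))  # -> 25.0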
    

    Evaluation Process

    1. Initial Test: The model attempts to answer a question without seeing the video.

    2. Re-Test after video viewing: We provide the corresponding lecture video. The model is asked the same question again.

    3. Performance Gain: Improvement on the re-test indicates successful knowledge acquisition from the video, which the Δknowledge metric quantifies.

    This setup mirrors a human’s natural educational process:

    Don't know → Learn by watching → Apply the knowledge
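
    The same protocol can be written as a simple loop over question–video pairs. The sketch below is not the official evaluation code; it assumes a hypothetical answer_question(question, video=None) model interface and only illustrates the before/after comparison behind Δknowledge:

    def evaluate_knowledge_acquisition(model, qa_pairs):
        # qa_pairs: list of (question, correct_answer, lecture_video) tuples.
        correct_before = 0
        correct_after = 0
        for question, answer, video in qa_pairs:
            # 1. Initial test: the question is asked without the video.
            if model.answer_question(question) == answer:
                correct_before += 1
            # 2. Re-test: the corresponding lecture video is provided.
            if model.answer_question(question, video=video) == answer:
                correct_after += 1
        acc_before = 100.0 * correct_before / len(qa_pairs)
        acc_after = 100.0 * correct_after / len(qa_pairs)
        # 3. Performance gain, normalized as Δknowledge (see the formula above).
        return (acc_after - acc_before) / (100.0 - acc_before) * 100.0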
    

    🔍 Key Insights

    Figure 4: Comprehensive analysis showing the progressive performance decline and the human-model gap in knowledge acquisition from videos.

    Progressive Performance Decline

    Model performance decreases as cognitive demands increase. While models perform relatively better on Perception tasks, accuracy drops on Comprehension and declines further on Adaptation.

    Knowledge Acquisition Challenge

    The Δknowledge metric reveals a significant human–model gap:

    • Humans: Substantial improvement (Δknowledge ≈ 33.1%)
    • Top Models: Smaller gains (GPT-4o: 15.6%, Claude-3.5-Sonnet: 11.4%)

    This highlights a current limitation: LMMs still struggle to learn from videos in the way humans do.

    📊 Case Studies

    Failure Case: Method Adaptation Error

    Figure 5: Example of a method adaptation failure, where the model fails to adapt the method from the video to solve the Adaptation question.

    Success Case: Learning from Video

    Figure 6: Example of successful learning from video, where an initially wrong answer becomes a correct one after watching the educational content.

    🚀 Research Impact

    Paradigm Shift

    Video-MMMU represents a paradigm shift from traditional video understanding to knowledge acquisition evaluation:

    • From Scene Understanding to Learning - Moving beyond visual comprehension to knowledge acquisition
    • From Static Evaluation to Dynamic Learning - Measuring improvement rather than just final performance
    • From Task Solving to Learning Capability - Evaluating the ability to learn new skills

    Implications for AI Development

    1. Real-World Deployment - Models must learn continuously after deployment
    2. Educational AI - Critical for AI tutoring and educational applications
    3. Knowledge Transfer - Understanding how models generalize learned concepts
    4. Human-AI Alignment - Bridging the gap in learning capabilities

    📈 Future Directions

    Benchmark Extensions

    • Multimodal Knowledge Sources - Incorporating diverse educational formats
    • Long-term Learning - Evaluating knowledge retention over time
    • Interactive Learning - Adding feedback loops and iterative improvement

    Model Development

    • Learning-Optimized Architectures - Designing models specifically for knowledge acquisition
    • Memory Integration - Better mechanisms for knowledge storage and retrieval
    • Transfer Learning - Improving cross-domain knowledge application

    🎯 Getting Started

    1. Download the Video-MMMU dataset from Hugging Face (see the loading sketch after this list)
    2. Set up the evaluation environment using our GitHub repository
    3. Run baseline evaluations on your models
    4. Analyze Δknowledge metrics to understand learning capabilities
    5. Compare results with our comprehensive leaderboard
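
    For step 1, a minimal sketch using the Hugging Face datasets library (the repository id and split name below are assumptions; check the dataset page for the exact identifiers and configurations):

    from datasets import load_dataset

    # Assumed repository id and split; verify on the Video-MMMU Hugging Face page.
    dataset = load_dataset("lmms-lab/VideoMMMU", split="test")
    print(dataset[0])  # inspect a single question record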

    Video-MMMU challenges the current state of multimodal AI by shifting focus from static performance to dynamic learning capability - a critical step toward truly intelligent and adaptive AI systems.

  • LMMs-Eval Overview
    LMMs-Eval: A comprehensive evaluation framework for Large Multimodal Models

    Why LMMs-Eval?

    We’re on an exciting journey toward creating Artificial General Intelligence (AGI), with an enthusiasm much like that of the 1960s moon landing. This journey is powered by advanced large language models (LLMs) and large multimodal models (LMMs), complex systems capable of understanding, learning, and performing a wide variety of human tasks.

    To gauge how advanced these models are, we use a variety of evaluation benchmarks. These benchmarks are tools that help us understand the capabilities of these models, showing us how close we are to achieving AGI. However, finding and using these benchmarks is a big challenge. The necessary benchmarks and datasets are spread out and hidden in various places like Google Drive, Dropbox, and different school and research lab websites. It feels like we’re on a treasure hunt, but the maps are scattered everywhere.

    In the field of language models, a valuable precedent has been set by lm-evaluation-harness. It offers integrated data and model interfaces that enable rapid evaluation of language models, serves as the backend framework for the open-llm-leaderboard, and has gradually become part of the underlying ecosystem of the foundation-model era.

    We humbly absorbed the exquisite and efficient design of lm-evaluation-harness and introduce lmms-eval, an evaluation framework meticulously crafted for consistent and efficient evaluation of LMMs. For more details, please refer to our paper.

    Key Features

    • Multi-modality support: Text, image, video, and audio evaluations
    • 100+ supported tasks across different modalities
    • 30+ supported models including vision-language and audio models
    • Response caching and accelerated inference options (vLLM, SGLang, tensor parallelism)
    • OpenAI-compatible API support for diverse model architectures
    • Reproducible results with version-controlled environments using uv

    Supported Models

    LMMs-Eval supports a wide range of models including:

    • LLaVA series: LLaVA-1.5, LLaVA-OneVision, LLaVA-OneVision 1.5
    • Qwen series: Qwen2-VL, Qwen2.5-VL
    • Commercial APIs: GPT-4o, GPT-4o Audio Preview, Gemini 1.5 Pro
    • Audio models: Aero-1-Audio, Gemini Audio
    • Other open models: InternVL-2, VILA, LongVA, LLaMA-3.2-Vision

    Supported Benchmarks

    Vision

    MME, COCO, VQAv2, TextVQA, GQA, MMVP, ChartQA, DocVQA, OCRVQA, LLaVA-Bench, MMMU, MathVista

    Video

    EgoSchema, PerceptionTest, VideoMME, MVBench, LongVideoBench, TemporalBench, VideoMathQA

    Audio

    AIR-Bench, Clotho-AQA, LibriSpeech, VoiceBench, WenetSpeech

    Reasoning

    CSBench, SciBench, MedQA, SuperGPQA, PhyX

    Installation

    Installation with uv (Recommended)

    curl -LsSf https://astral.sh/uv/install.sh | sh
    git clone https://github.com/EvolvingLMMs-Lab/lmms-eval
    cd lmms-eval
    uv pip install -e ".[all]"

    Usage

    Basic Evaluation

    # Evaluate LLaVA-OneVision on multiple benchmarks
    accelerate launch --num_processes=8 -m lmms_eval \
      --model=llava_onevision \
      --model_args=pretrained=lmms-lab/llava-onevision-qwen2-7b-ov \
      --tasks=mmmu_val,mmbench_en,mathvista_testmini \
      --batch_size=1
     
    # See all options
    python -m lmms_eval --help

    Latest Release (v0.5)

    The October 2025 release features:

    • Comprehensive audio evaluation expansion
    • Response caching capabilities
    • 5 new models, including GPT-4o Audio Preview, Gemma-3, LongViLA-R1, and LLaVA-OneVision 1.5
    • 50+ new benchmark variants
    • Enhanced reproducibility tools

    Community

    With 3.4k+ stars, 460+ forks, and 157+ contributors, LMMs-Eval has become the standard evaluation framework for multimodal models in the research community.