Tags #tools

    LMMs-Eval: A comprehensive evaluation framework for Large Multimodal Models

    Why LMMs-Eval?

    We’re on an exciting journey toward creating Artificial General Intelligence (AGI), pursued with an enthusiasm reminiscent of the 1960s moon landing. This journey is powered by advanced large language models (LLMs) and large multimodal models (LMMs), complex systems capable of understanding, learning, and performing a wide variety of human tasks.

    To gauge how advanced these models are, we use a variety of evaluation benchmarks. These benchmarks are tools that help us understand the capabilities of these models, showing us how close we are to achieving AGI. However, finding and using these benchmarks is a big challenge. The necessary benchmarks and datasets are spread out and hidden in various places like Google Drive, Dropbox, and different school and research lab websites. It feels like we’re on a treasure hunt, but the maps are scattered everywhere.

    In the field of language models, lm-evaluation-harness has set a valuable precedent. It offers integrated data and model interfaces, enables rapid evaluation of language models, serves as the backend for the open-llm-leaderboard, and has gradually become part of the underlying ecosystem of the foundation-model era.

    Building on the efficient design of lm-evaluation-harness, we introduce lmms-eval, an evaluation framework built for consistent and efficient evaluation of LMMs. For more details, please refer to our paper.

    Key Features

    • Multi-modality support: Text, image, video, and audio evaluations
    • 100+ supported tasks across different modalities
    • 30+ supported models including vision-language and audio models
    • Response caching and accelerated inference options (vLLM, SGLang, tensor parallelism)
    • OpenAI-compatible API support for diverse model architectures
    • Reproducible results with version-controlled environments using uv

    Supported Models

    LMMs-Eval supports a wide range of models including:

    • LLaVA series: LLaVA-1.5, LLaVA-OneVision, LLaVA-OneVision 1.5
    • Qwen series: Qwen2-VL, Qwen2.5-VL
    • Commercial APIs: GPT-4o, GPT-4o Audio Preview, Gemini 1.5 Pro
    • Audio models: Aero-1-Audio, Gemini Audio
    • Other open models: InternVL-2, VILA, LongVA, LLaMA-3.2-Vision

    Supported Benchmarks

    Vision

    MME, COCO, VQAv2, TextVQA, GQA, MMVP, ChartQA, DocVQA, OCRVQA, LLaVA-Bench, MMMU, MathVista

    Video

    EgoSchema, PerceptionTest, VideoMME, MVBench, LongVideoBench, TemporalBench, VideoMathQA

    Audio

    AIR-Bench, Clotho-AQA, LibriSpeech, VoiceBench, WenetSpeech

    Reasoning

    CSBench, SciBench, MedQA, SuperGPQA, PhyX

    Installation

    Installation with uv (Recommended)

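    # Install uv
    curl -LsSf https://astral.sh/uv/install.sh | sh
    # Clone the repository and enter it
    git clone https://github.com/EvolvingLMMs-Lab/lmms-eval
    cd lmms-eval
    # Install lmms-eval in editable mode with all optional extras
    uv pip install -e ".[all]"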
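
    Before or after running the steps above, it can help to check that the required tools are reachable. The following is a minimal sketch, not part of lmms-eval: `have_cmd` is a hypothetical helper, and the script assumes a POSIX shell with `python` as the interpreter name. It only reports what it finds and changes nothing.

```shell
#!/bin/sh
# Preflight sketch: confirm the tools used by the install commands are on PATH.
# have_cmd is a hypothetical helper, not part of lmms-eval or uv.

# Succeeds (exit status 0) when the named command can be found on PATH.
have_cmd() {
  command -v "$1" >/dev/null 2>&1
}

for tool in curl git uv; do
  if have_cmd "$tool"; then
    echo "found:   $tool"
  else
    echo "missing: $tool"
  fi
done

# After installation, the --help invocation from the usage section doubles
# as a smoke test that the package can be run as a module.
if have_cmd python && python -m lmms_eval --help >/dev/null 2>&1; then
  echo "lmms_eval is installed"
else
  echo "lmms_eval is not importable yet"
fi
```

    If the last line reports that the package is not importable, the active Python environment is probably not the one uv installed into.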

    Usage

    Basic Evaluation

    # Evaluate LLaVA-OneVision on multiple benchmarks
    accelerate launch --num_processes=8 -m lmms_eval \
      --model=llava_onevision \
      --model_args=pretrained=lmms-lab/llava-onevision-qwen2-7b-ov \
      --tasks=mmmu_val,mmbench_en,mathvista_testmini \
      --batch_size=1
     
    # See all options
    python -m lmms_eval --help
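
    When the same evaluation is repeated with different models or task sets, it can be convenient to assemble the command from variables and review it before launching. The sketch below is a dry run under stated assumptions: it uses only the flags shown in the example above, and `join_tasks` and the variable names are illustrative, not part of lmms-eval.

```shell
#!/bin/sh
# Dry-run sketch: build the evaluation command from variables and print it.
# Only flags from the example above are used; all names here are illustrative.

# Join all arguments with commas, the format --tasks takes in the example.
# The function body runs in a subshell so the IFS change does not leak out.
join_tasks() (
  IFS=,
  printf '%s\n' "$*"
)

MODEL="llava_onevision"
PRETRAINED="lmms-lab/llava-onevision-qwen2-7b-ov"
NUM_PROCESSES=8
TASKS=$(join_tasks mmmu_val mmbench_en mathvista_testmini)

CMD="accelerate launch --num_processes=$NUM_PROCESSES -m lmms_eval --model=$MODEL --model_args=pretrained=$PRETRAINED --tasks=$TASKS --batch_size=1"

# Print the command for review instead of executing it.
echo "$CMD"
```

    Printing the command instead of executing it keeps the sketch free of side effects; once the output looks right, it can be pasted into a terminal.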

    Latest Release (v0.5)

    The October 2025 release features:

    • Comprehensive audio evaluation expansion
    • Response caching capabilities
    • 5 new models, including GPT-4o Audio Preview, Gemma-3, LongViLA-R1, and LLaVA-OneVision 1.5
    • 50+ new benchmark variants
    • Enhanced reproducibility tools

    Community

    With 3.4k+ stars, 460+ forks, and 157+ contributors, LMMs-Eval has become a widely used evaluation framework for multimodal models in the research community.