
LMMs-Eval


Reality Check on the Evaluation of Large Multimodal Models

LMMs-Eval: A comprehensive evaluation framework for Large Multimodal Models

Why LMMs-Eval?

We’re on an exciting journey toward creating Artificial General Intelligence (AGI), pursued with an enthusiasm reminiscent of the 1960s moon landing. This journey is powered by advanced large language models (LLMs) and large multimodal models (LMMs): complex systems capable of understanding, learning, and performing a wide variety of human tasks.

To gauge how advanced these models are, we use a variety of evaluation benchmarks. These benchmarks help us understand the capabilities of these models and show how close we are to achieving AGI. However, finding and using them is a real challenge: the necessary benchmarks and datasets are scattered across Google Drive, Dropbox, and various university and research-lab websites. It feels like we’re on a treasure hunt, but the maps are scattered everywhere.

In the field of language models, a valuable precedent has been set by lm-evaluation-harness. It offers integrated data and model interfaces that enable rapid evaluation of language models, serves as the backend framework for the open-llm-leaderboard, and has gradually become part of the underlying ecosystem of the foundation-model era.

We humbly absorbed the exquisite and efficient design of lm-evaluation-harness and introduce lmms-eval, an evaluation framework meticulously crafted for the consistent and efficient evaluation of LMMs. For more details, please refer to our paper.

Key Features

  • Multi-modality support: Text, image, video, and audio evaluations
  • 100+ supported tasks across different modalities
  • 30+ supported models including vision-language and audio models
  • Response caching and accelerated inference options (vLLM, SGLang, tensor parallelism)
  • OpenAI-compatible API support for diverse model architectures (see the example after this list)
  • Reproducible results with version-controlled environments using uv
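
As an illustration of the OpenAI-compatible API support, here is a minimal sketch of evaluating a model served behind such an endpoint. The model name openai_compatible, the model_args key, and the OPENAI_API_BASE variable are assumptions about the interface rather than documented values; confirm the exact names with python -m lmms_eval --help and the model registry.

# Sketch: evaluate a model behind an OpenAI-compatible endpoint (names below are assumptions)
export OPENAI_API_KEY=your-key
export OPENAI_API_BASE=http://localhost:8000/v1

python -m lmms_eval \
  --model=openai_compatible \
  --model_args=model_version=gpt-4o \
  --tasks=mmmu_val \
  --batch_size=1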

Supported Models

LMMs-Eval supports a wide range of models including:

  • LLaVA series: LLaVA-1.5, LLaVA-OneVision, LLaVA-OneVision 1.5
  • Qwen series: Qwen2-VL, Qwen2.5-VL (see the example after this list)
  • Commercial APIs: GPT-4o, GPT-4o Audio Preview, Gemini 1.5 Pro
  • Audio models: Aero-1-Audio, Gemini Audio
  • Other open models: InternVL-2, VILA, LongVA, LLaMA-3.2-Vision
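
Any of the open models above can be evaluated with the same command pattern shown in the Usage section below. For instance, a Qwen2.5-VL checkpoint might be run as follows; the registry name qwen2_5_vl and the Hugging Face checkpoint ID are assumptions to verify against the supported model list.

# Sketch: evaluate a Qwen2.5-VL checkpoint (registry name and checkpoint ID are assumptions)
accelerate launch --num_processes=8 -m lmms_eval \
  --model=qwen2_5_vl \
  --model_args=pretrained=Qwen/Qwen2.5-VL-7B-Instruct \
  --tasks=mmmu_val \
  --batch_size=1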

Supported Benchmarks

Vision

MME, COCO, VQAv2, TextVQA, GQA, MMVP, ChartQA, DocVQA, OCRVQA, LLaVA-Bench, MMMU, MathVista

Video

EgoSchema, PerceptionTest, VideoMME, MVBench, LongVideoBench, TemporalBench, VideoMathQA

Audio

AIR-Bench, Clotho-AQA, LibriSpeech, VoiceBench, WenetSpeech

Reasoning

CSBench, SciBench, MedQA, SuperGPQA, PhyX
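
On the command line, these benchmarks are addressed by lowercase task identifiers (for example, mmmu_val) rather than the display names above. Listing the registered tasks is the easiest way to find them; the --tasks list form below follows the lm-evaluation-harness convention and should be confirmed against --help.

# List all registered task names (assumed to follow the lm-evaluation-harness convention)
python -m lmms_eval --tasks list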

Installation

Installation with uv (Recommended)

curl -LsSf https://astral.sh/uv/install.sh | sh
git clone https://github.com/EvolvingLMMs-Lab/lmms-eval
cd lmms-eval
uv pip install -e ".[all]"
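
Note that uv pip install targets a virtual environment, so create one with uv venv before the install step above if none is active. Once installed, running commands through uv keeps every run inside the same pinned environment, which supports the reproducibility goal mentioned in the feature list; the lines below are a sketch assuming a uv-managed checkout.

# Optional: create the virtual environment that uv pip install targets
uv venv

# Run the CLI through uv so the pinned environment is reused
uv run python -m lmms_eval --help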

Usage

Basic Evaluation

# Evaluate LLaVA-OneVision on multiple benchmarks
accelerate launch --num_processes=8 -m lmms_eval \
  --model=llava_onevision \
  --model_args=pretrained=lmms-lab/llava-onevision-qwen2-7b-ov \
  --tasks=mmmu_val,mmbench_en,mathvista_testmini \
  --batch_size=1
 
# See all options
python -m lmms_eval --help
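
To keep aggregate results and per-sample outputs for later inspection, the run can write logs to disk. The output_path and log_samples flags below mirror common lmms-eval examples but should be treated as assumptions to confirm with --help.

# Sketch: save results and per-sample logs (flag names to be confirmed with --help)
accelerate launch --num_processes=8 -m lmms_eval \
  --model=llava_onevision \
  --model_args=pretrained=lmms-lab/llava-onevision-qwen2-7b-ov \
  --tasks=mmmu_val \
  --batch_size=1 \
  --output_path=./logs/ \
  --log_samples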

Latest Release (v0.5)

The October 2025 release features:

  • Comprehensive audio evaluation expansion
  • Response caching capabilities
  • 5 new models, including GPT-4o Audio Preview, Gemma-3, LongViLA-R1, and LLaVA-OneVision 1.5
  • 50+ new benchmark variants
  • Enhanced reproducibility tools
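
To confirm that the installed package matches this release, a standard package query is enough; lmms_eval is assumed here to be the installed distribution name.

# Check the installed version (standard uv/pip command; distribution name assumed)
uv pip show lmms_eval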

Community

With 3.4k+ stars, 460+ forks, and 157+ contributors, LMMs-Eval has become one of the standard evaluation frameworks for multimodal models in the research community.


Authors

Kaichen Zhang*, Bo Li*, Peiyuan Zhang*, Fanyi Pu, Joshua Adrian Cahyono, Kairui Hu, Shuai Liu, Yuanhan Zhang, Jingkang Yang, Chunyuan Li, Ziwei Liu

* Main Authors