LMMs-Lab

  • Thumbnail

    🔗 Code | Paper | Model | Data

    MMSearch-R1 is the first end-to-end RL-based solution designed to equip LMMs with the capability to perform search on demand in real-world internet environments. It outperforms same-sized RAG baselines and approaches the performance of larger models while requiring significantly fewer search calls.

Figure 1: MMSearch-R1 learns to recognize the boundaries of its knowledge and perform on-demand search, significantly reducing the number of searches required while outperforming RAG-based models on knowledge-intensive and info-seeking VQA tasks.

    1. Introduction

    Scaling up vision-language paired data has become a widely adopted paradigm for Large Multimodal Models (LMMs) to acquire grounded knowledge of the visual world. Although this static training strategy has proven effective, it remains limited in capturing complex and evolving real-world knowledge. In particular, state-of-the-art LMMs continue to struggle with long-tail facts, newly emerging information, and domain-specific content that is often restricted by privacy or copyright constraints. As a result, their performance remains suboptimal on knowledge-intensive and information-seeking visual question answering tasks, frequently generating hallucinated outputs when confronted with inputs beyond their training distribution, such as unfamiliar visual content or previously unseen textual information. This limitation raises important concerns regarding their factual reliability in real-world applications.

Integrating search capabilities into LMMs offers a promising solution to the above limitations. However, existing approaches such as Retrieval-Augmented Generation (RAG) and prompt-based agents remain suboptimal. RAG methods rely on a fixed retrieve-then-generate pipeline grounded in static corpora, often leading to over-retrieval, high computational cost, and the unrealistic assumption that all necessary information is already available. This rigid setup fails to reflect the dynamic and unpredictable nature of real-world scenarios. In contrast, prompt-based agents can access real-time search engines, but their parameters are not optimized through learning, preventing them from truly acquiring effective search behaviors or adapting to open-world environments.

    To address these limitations, we aim to train LMMs that can interact with real-world environments and acquire three essential search-related capabilities: (1) when to search, (2) what to search for, and (3) how to reason over search results to answer user queries. Building on these goals, we introduce MMSearch-R1, the first end-to-end reinforcement learning framework designed to empower LMMs with on-demand search capabilities in open, internet-based environments. Our efforts are summarized as follows:

    • Dataset Construction We propose an automated approach to construct a multimodal search VQA dataset by estimating the model’s familiarity with each question. This enables the generation of search-required and search-free samples, further complemented by manually annotated test data covering diverse knowledge types and difficulty levels.
    • Multimodal Search Tool Integration We develop a real-world search pipeline combining an image search tool and a text search tool, enabling LMMs to retrieve relevant visual and textual information for unfamiliar inputs.
    • Wiser Search via Reinforcement Learning We introduce a GRPO-based RL framework that trains LMMs to decide when, what, and how to search. Our method achieves superior performance over RAG-based baselines while reducing search calls by over 30%.
    • Open-Sourced Dataset and Framework We will release our model, dataset and training framework to support future research in search-augmented multimodal reasoning.

    2. Method

    2.1. Building Iterative Multimodal Search-Integrated RL Framework

Figure 2: Illustration of training in MMSearch-R1. Top: The GRPO training pipeline integrated with multimodal search tools. Bottom: A detailed view of the rollout process and search tool execution.

We build on veRL and adopt standard GRPO as our base RL algorithm, with modifications that allow search interactions with the real-world environment during the rollout process, as illustrated in Figure 2 and described below.

    • Multimodal Search Tools We equip the model with two types of search tools to interact with real-world internet content. The first is an image search tool, which takes the input image and returns the top-5 visually similar webpages, each represented by a thumbnail and a title. This enables the model to identify unfamiliar visual entities in the image. The second is a text search pipeline, where the model formulates a query based on the user question, retrieves relevant webpages, and processes their content into concise summaries. This allows the model to acquire textual knowledge needed to answer the question accurately.
    • Rollout with Multi-turn Multimodal Search The rollout process is designed to be multi-turn and iterative. At each step, the model receives new information, such as the original question or retrieved search results, and performs reasoning based on the accumulated context. It then selects an action from a predefined action space, which includes invoking search tools or answering the question. This process continues until the model generates a final answer or reaches the maximum number of allowed turns. To support this interaction, we define and utilize a set of special tokens to structure the model’s outputs and the environment’s feedback.
• Reward Modeling Our reward consists of two components: an accuracy score with a search penalty and a format score. For the accuracy score, we evaluate model performance using exact string match against the ground truth, assigning a score of 1 for correct answers and 0 otherwise. For correct responses, a penalty factor (between 0 and 1) is applied if any search was used, encouraging the model to rely on internal knowledge and invoke search only when necessary. This design promotes efficient, on-demand search behavior. The format score verifies whether the model follows the required output structure, ensuring compatibility with the environment interface.

$$ \texttt{reward} = (1 - \alpha)\cdot \texttt{Acc\_Score}\cdot \texttt{Search\_Penalty} + \alpha\cdot \texttt{Format\_Score} $$
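A minimal sketch of how this reward could be computed is shown below; the function name, the exact-match judge, and the illustrative values of the search penalty and α are assumptions rather than the exact released implementation.

```python
def compute_reward(pred_answer: str, gt_answer: str, used_search: bool,
                   format_ok: bool, search_penalty: float = 0.9, alpha: float = 0.1) -> float:
    """Outcome reward with an on-demand search penalty (values are illustrative).

    acc_score: exact string match against the ground truth (1 or 0).
    search_penalty: applied only when a correct answer used any search call,
    discouraging unnecessary tool use. format_score: 1 if the rollout follows
    the required special-token structure.
    """
    acc_score = 1.0 if pred_answer.strip().lower() == gt_answer.strip().lower() else 0.0
    penalty = search_penalty if (used_search and acc_score == 1.0) else 1.0
    format_score = 1.0 if format_ok else 0.0
    return (1 - alpha) * acc_score * penalty + alpha * format_score
```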

    2.2. Curating Search-balanced VQA Datasets

Figure 3: Illustration of the FVQA data construction process: (a) an automated pipeline for collecting VQA samples that require visual knowledge; (b) the knowledge taxonomy; (c) the overall pipeline showing the composition and origin of FVQA from automated and manually curated sources.

To effectively train models for on-demand search using simple outcome-based reinforcement learning, we require a search-balanced dataset that includes both search-required and search-free questions. This balance allows the model to learn when to rely on internal knowledge and when to invoke external search. We propose three key criteria for such datasets: (1) coverage of both search-required and search-free questions; (2) concise and verifiable answers; (3) diversity in knowledge and difficulty. Following these criteria, we construct a multimodal search VQA dataset, FactualVQA (FVQA), using a combination of automated pipelines and manual annotation.

• VQA Collection We first gather a pool of candidate VQA samples requiring either visual or textual knowledge. For visual knowledge, we develop an automated pipeline that collects images related to head and tail visual concepts in the MetaCLIP vocabulary from the internet. Based on these images, we use GPT-4o to generate corresponding questions that assess the model’s recognition capabilities. For textual knowledge, we sample questions from the InfoSeek training set. We annotate the knowledge type for each question using GPT-4o and maintain a balanced distribution across categories.
• Search Balancing To distinguish between search-required and search-free questions, we use a preliminary model equipped with search capabilities to classify the collected VQA samples (see the sketch after this list). Based on this classification, we construct a search-balanced training set of 5,000 examples, named FVQA-train, which includes approximately 3,400 search-required and 1,600 search-free questions.
    • Human Annotation Human annotators are involved throughout the data curation process to ensure diversity, authenticity, and label quality—especially for the test set of FVQA.
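The search-balancing step above could be implemented roughly as follows; `answer_without_search`, `answer_with_search`, and `is_correct` are hypothetical wrappers around the preliminary search-capable model and the exact-match judge, so this is a sketch of the labeling logic rather than the released pipeline.

```python
def label_search_need(samples, answer_without_search, answer_with_search, is_correct):
    """Split VQA samples by whether a preliminary model needs search to answer them.

    answer_without_search / answer_with_search / is_correct are assumed callables
    wrapping the preliminary search-capable model and the exact-match judge.
    """
    search_free, search_required = [], []
    for s in samples:
        if is_correct(answer_without_search(s["image"], s["question"]), s["answer"]):
            search_free.append(s)        # internal knowledge suffices
        elif is_correct(answer_with_search(s["image"], s["question"]), s["answer"]):
            search_required.append(s)    # solvable only with external search
        # samples unanswerable even with search could be dropped or re-annotated
    return search_required, search_free
```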

    3. Experimental Findings

    We evaluated MMSearch-R1 against both closed-source models (GPT-4o and Gemini 2.5 Pro) and open-source models from the Qwen2.5-VL series on knowledge-intensive and information-seeking VQA tasks (FVQA-test, InfoSeek, MMSearch, SimpleVQA, and LiveVQA). All baseline models are tasked with solving VQA problems in two different workflows. (1) Direct Answer: Models are prompted to directly generate a short and precise answer without accessing external information. (2) Answer under RAG Workflow: In this workflow, models are required to perform exactly two search operations using our multimodal search tools for each VQA example, first performing an image search and then a text search. Specifically, given an input image and question, the model is provided with the image search results and the original question in the first round and is prompted to generate a text query to assist in answering. In the second round, the retrieved results based on the text query are fed into the model, and the model is asked to produce the final answer. Under a fixed budget of search steps, the RAG workflow typically exposes the model to more external information compared to the on-demand search strategy.
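The fixed two-round RAG workflow described above can be sketched as follows; `image_search`, `text_search`, and `model.chat` are hypothetical wrappers around the multimodal search tools and the baseline LMM, and the prompts are simplified.

```python
def rag_workflow(model, image, question, image_search, text_search):
    """Fixed two-round RAG baseline: image search first, then text search.

    image_search, text_search, and model.chat are assumed wrappers around the
    multimodal search tools and the baseline LMM; prompts are simplified.
    """
    # Round 1: show the image-search results and ask the model for a text query.
    image_results = image_search(image)  # top-5 visually similar pages (thumbnail + title)
    text_query = model.chat(
        image=image, context=image_results,
        prompt=f"{question}\nWrite a text search query that would help answer this question.")

    # Round 2: feed the retrieved summaries back and ask for the final answer.
    text_results = text_search(text_query)  # retrieved webpages processed into summaries
    answer = model.chat(
        image=image, context=[image_results, text_results],
        prompt=f"{question}\nAnswer concisely using the retrieved information.")
    return answer
```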

    Table 1: Performance of MMSearch-R1 across benchmarks. "Acc (%)" denotes the accuracy evaluated by LLM-as-Judge, while "SR (%)" represents the search ratio, defined as the percentage of total search calls made relative to the maximum allowed search steps for each method.
    • Finding 1: RL training enables models to better recognize the boundaries of their knowledge and perform on-demand search more effectively. As shown in Table 1, MMSearch-R1-7B outperforms same-sized RAG-based models by an average of 3% in accuracy while reducing the average search rate by 32.9%, across both in-domain and out-of-domain test sets. This demonstrates that our RL-trained model achieves higher correctness with fewer search calls, indicating more efficient and selective use of external information.
    Figure 4: (a). Performance comparison between the Base model and the RL-trained model under the RAG workflow. (b). Answer behavior breakdown of Base (inner circle) and RL (outer circle) models in InfoSeek and SimpleVQA.
• Finding 2: RL training enhances the model’s ability to generate effective text queries and summarize retrieved information. To evaluate the abilities of query generation and information summarization, we follow a fixed RAG setup in which both image and text search are executed for every question. This isolates the model’s ability to interact with retrieved information. As shown in Figure 4(a), MMSearch-R1-7B consistently outperforms the base model on both in-domain and out-of-domain tasks.
    • Finding 3: RL improves the model’s ability to utilize its internal knowledge. As shown in Figure 4(b), there is a clear upward trend in the Correct without Search proportion from the base model to the RL-trained model. These gains indicate that the RL-trained model can answer substantially more questions correctly without invoking the search tool, demonstrating improved recall and reasoning based on its internal knowledge.
    Figure 5: (a). Performance improvements of SFT and RL over Base across five VQA datasets. (b). Training dynamics of reward and search ratio for different strategies.
• Finding 4: RL achieves greater performance improvements and exhibits higher data efficiency compared to SFT. We distill GPT-4o’s behavior on our collected VQA samples to construct SFT data, and fine-tune Qwen2.5-VL-7B on it. This serves as a supervised learning baseline for comparison against our reinforcement learning-trained model. As shown in Figure 5(a), the model trained with RL consistently outperforms the one trained with SFT across all tasks, despite being trained on only about half as much data.
• Finding 5: Training with balanced data and a search penalty in the reward effectively guides the model to perform on-demand search. Figure 5(b) illustrates the training dynamics of reward and search ratio during reinforcement learning. Removing either the search penalty or data balancing leads to distinct trade-offs. Although both ablated variants achieve slightly higher rewards, they do so at the cost of overusing the search tool, with search ratios rapidly converging to nearly 100%.

    4. Conclusion

    MMSearch-R1 learns to recognize knowledge gaps, selectively invoke image or text search, and reason over retrieved content. It outperforms same-sized RAG baselines and approaches the performance of larger models while requiring significantly fewer search calls. Our framework, dataset, and findings offer practical insights into training LMMs with real-world interaction capabilities and lay the groundwork for building multimodal agents that are both adaptive and interactive. We look forward to the next major advancement in multimodal intelligence emerging as models increasingly engage with and explore the real world through more tools, further evolving their reasoning and adaptive capabilities.

    Authors

    *equal contribution

    Citation

    @article{wu2025searchr1,
      title={Search-R1: A Multimodal Search-Augmented Reinforcement Learning Framework for LMMs},
      author={Wu, Jinming and Deng, Zihao and Li, Wei and Liu, Yiding and You, Bo and Li, Bo and Ma, Zejun},
      url={https://github.com/EvolvingLMMs-Lab/multimodal-search-r1},
      year={2025}
    }
• Thumbnail

    Code@Github

    1. Introduction

SOTA large multimodal model (LMM) architectures, such as Qwen2.5-VL, typically build on a powerful large language model (LLM) (e.g., Qwen2.5) integrated with an external Native Resolution Vision Transformer (NaViT). However, this approach presents challenges in high-resolution real-world scenarios, as such inputs are converted into enormous numbers of visual tokens, many of which are irrelevant to the downstream task. By comparison, when processing high-resolution real-world scenes, the human visual system employs task-driven visual search strategies to ground and scrutinize critical regions of interest. Motivated by this biological mechanism, we attempt to equip LMMs with similar visual search capabilities by leveraging visual grounding to focus on key image regions.

    However, empowering LMMs with such grounding-based visual reasoning capabilities is non-trivial, primarily due to the scarcity and high cost of obtaining grounding annotations for standard visual-question-answering (VQA) datasets, which are required for constructing multi-turn grounding-based conversation data for supervised fine-tuning (SFT). In this paper, we highlight that accurate grounding behavior can emerge within a reinforcement learning (RL) paradigm, even when training supervision is provided solely through a binary reward function derived from the correctness of the final answer.

To this end, we introduce Multi-turn Grounding-based Policy Optimization (MGPO), a reinforcement learning (RL) algorithm that enables LMMs to iteratively focus on key image regions by automatically cropping sub-images based on model-predicted grounding coordinates within a multi-turn conversation framework. Given a high-resolution image and a question, the model first predicts the coordinates of key regions relevant to the query. An image cropping function is then triggered to extract and return the corresponding sub-image. In subsequent turns, the model can integrate previous in-context conversations (including both the original image and cropped sub-images) to solve the question.

Figure 1: Examples of models trained with multi-turn grounding-based RL on high-resolution real-world tasks. The model first identifies key regions, which are then automatically cropped and returned as sub-images. Notably, despite receiving only a binary reward derived from the correctness of the final answer, the model gradually develops robust grounding capability throughout the RL process. The conversation in the figure shows only the key parts; the full conversation is provided in Figure 9.

    In summary, MGPO mainly offers the following advantages:

    • Top-down and Interpretable Visual Reasoning. MGPO equips LMMs with a top-down, question-driven visual search mechanism for high-resolution scenarios and provides interpretable outputs that indicate which image regions are attended to throughout the reasoning process.
• Overcomes Maximum Pixel Constraints. MGPO overcomes the maximum pixel limitation of LMMs. As shown in the first example of Figure 1, even when resizing a high-resolution image within pixel limits results in a blurred input, the model can still identify relevant coordinates and crop clear sub-images from the original input for further analysis.
• Without Additional Grounding Annotations. MGPO can be post-trained directly on standard VQA datasets without the need for extra grounding annotations, and experimental results demonstrate substantial improvements in intermediate grounding performance compared to GRPO.

Ultimately, we use MGPO to post-train Qwen2.5-VL-7B on visual question answering data with short answers, and the model achieves strong intermediate grounding performance without requiring grounding annotations (examples shown in Figure 1). Compared to GRPO, MGPO yields a 5.4% improvement on the in-distribution MME-Realworld benchmark and a 5.2% gain on the challenging out-of-distribution V* Bench. Notably, with only 21K post-training samples, our model surpasses OpenAI’s o1 and GPT-4o models on the OOD V* Bench.

    2. Multi-turn Grounding-Based RL

Figure 2 illustrates a comparison of different post-training paradigms for LMMs. In MGPO, the model operates over K sequential interactions, dynamically grounding and reasoning by conditioning on the full history of visual and textual context at each step.

Figure 2: Comparison of different post-training paradigms for LMMs. Our MGPO automatically crops and returns a sub-image to the model based on its predicted grounding coordinates, enabling the model to iteratively focus on key regions and effectively solve high-resolution visual tasks.

Multi-turn Template without Cold Start. In practice, we observe that LMMs struggle to autonomously generate grounding coordinates during the rollout process, which hinders effective multi-turn RL. To address this, we design a fixed two-turn dialogue template, as shown in Figure 3, to explicitly activate the model’s grounding and reasoning abilities.

Figure 3: Fixed multi-turn grounding template, which eliminates the cold-start SFT process.
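A rough sketch of this fixed two-turn interaction is given below; the prompt wording and the `model.generate` / `crop_by_bbox` interfaces are illustrative assumptions, not the exact template shown in Figure 3.

```python
def two_turn_grounding_rollout(model, crop_by_bbox, image, question):
    """Sketch of the fixed two-turn interaction; prompt wording and the
    model.generate / crop_by_bbox interfaces are illustrative assumptions."""
    # Turn 1: ask for the bounding box of the key region relevant to the question.
    turn1_prompt = (f"{question}\nFirst output the bounding box of the image region "
                    "most relevant to the question in JSON format.")
    bbox = model.generate(images=[image], prompt=turn1_prompt)

    # Environment step: crop the predicted region from the original image.
    sub_image = crop_by_bbox(image, bbox)

    # Turn 2: answer with both the original image and the cropped sub-image in context.
    turn2_prompt = ("Here is the cropped key region. Answer the question and put "
                    "the answer letter within \\boxed{}.")
    return model.generate(images=[image, sub_image], prompt=turn2_prompt,
                          history=[(turn1_prompt, bbox)])
```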

Grounding Key Visual Areas. Within the two-turn MGPO framework, the extraction of sub-images is performed with respect to the original high-resolution image. Since the grounding coordinates predicted by Qwen2.5-VL are inherently dependent on the resolution of the input image, it is necessary to normalize the predicted coordinates by the input image dimensions and subsequently map them back to the coordinate space of the original image. This normalization procedure is particularly crucial when the original image resolution exceeds the maximum pixel limit of the LMM, as it enables the model to access higher-fidelity sub-images for processing. An illustration of this process is provided in Figure 4.

Figure 4: An illustration of cropping a sub-image based on grounding coordinates.
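A minimal sketch of this normalization-and-cropping step, assuming boxes are predicted in the coordinate space of the (possibly downsized) model input:

```python
from PIL import Image

def crop_from_original(original: Image.Image, resized: Image.Image, bbox_resized):
    """Map a box predicted on the (possibly downsized) model input back to the
    original high-resolution image and crop the corresponding sub-image.

    bbox_resized = (x1, y1, x2, y2) in the coordinate space of `resized`.
    """
    sx = original.width / resized.width
    sy = original.height / resized.height
    x1, y1, x2, y2 = bbox_resized
    # Rescale to original-resolution coordinates and clamp to the image bounds.
    box = (max(0, int(x1 * sx)), max(0, int(y1 * sy)),
           min(original.width, int(x2 * sx)), min(original.height, int(y2 * sy)))
    return original.crop(box)
```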

    3. Experiments

    3.1 Datasets & Metrics

To evaluate the effectiveness of our approach, experiments are conducted on two established datasets: MME-Realworld and V* Bench. Both datasets are specifically designed to evaluate the capabilities of LMMs in analyzing high-resolution images and capturing fine-grained visual information.

    MME-Realworld. The MME-Realworld dataset comprises a diverse array of tasks, which are systematically categorized into perception and reasoning domains. For in-distribution evaluation, the lite subset of MME-Realworld, consisting of 1,919 samples, is reserved as the test set, while the remaining 21,690 samples are utilized for training.

V* Bench. V* Bench serves as an out-of-distribution benchmark focused on detailed visual grounding on high-resolution images. This vision-centric benchmark requires LMMs to accurately localize and interpret specific visual information, and it has also been adopted by OpenAI to assess the visual reasoning capabilities of their latest o3 and o4-mini models. This benchmark contains 191 test samples.

    All datasets employ the multiple-choice question format, and model performance is consistently measured by accuracy on both the in-distribution (MME-Realworld) and out-of-distribution (V* Bench) test sets. Figure 5 illustrates the distribution of image resolutions across different datasets.

    Figure 5: Distribution of image resolutions (width × height) across different datasets.

    3.2 Experimental Setup

We employ the verl framework to enable distributed training across multiple machines and GPUs, and utilize vLLM to accelerate inference during the rollout phase. For reinforcement learning, we adopt the naive GRPO algorithm as the RL baseline, where a post-prompt is added: “{question}\nOutput the coordinates of the key image area relevant to the problem in JSON format. And put the answer letter (A, B, C, D, or E) within \boxed{}.” Both GRPO and our proposed MGPO leverage a binary accuracy reward function, assigning a reward of 1 if the final multiple-choice answer is correct and 0 otherwise.

    All experiments are conducted using the Qwen2.5-VL-7B model. To prevent out-of-memory errors, the maximum number of input image pixels is limited to 1,003,520 (1280 × 28 × 28), corresponding to a maximum of 1280 visual tokens per image. Images exceeding this pixel threshold are resized to comply with this constraint.
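A sketch of the resizing rule implied by this constraint is shown below; rounding to multiples of 28 mirrors Qwen2.5-VL's patching, but is an assumption of this sketch rather than the exact preprocessing code.

```python
import math

MAX_PIXELS = 1280 * 28 * 28  # 1,003,520 pixels, i.e. at most 1,280 visual tokens

def resize_to_pixel_budget(width: int, height: int, max_pixels: int = MAX_PIXELS):
    """Downscale (width, height) so the area fits within max_pixels while keeping
    the aspect ratio. Rounding down to multiples of 28 mirrors Qwen2.5-VL's
    patching, but is an assumption of this sketch, not the exact preprocessing."""
    if width * height <= max_pixels:
        return width, height
    scale = math.sqrt(max_pixels / (width * height))
    new_w = max(28, int(width * scale // 28) * 28)
    new_h = max(28, int(height * scale // 28) * 28)
    return new_w, new_h
```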

    3.3 Main Results

Table 1 presents the performance comparison of different post-training paradigms on Qwen2.5-VL-7B, including SFT, GRPO, and our MGPO. All three post-training methods substantially improve the model’s performance on high-resolution visual tasks, as measured on both the OOD V* Bench and the ID MME-Realworld benchmarks.

    Notably, we observe that GRPO does not yield significant improvements over SFT, which contrasts with conclusions drawn from prior work on multi-modal mathematical tasks. We hypothesize that, for high-resolution vision-centric tasks, the primary challenge lies in enabling the model to perceive fine-grained image details, rather than performing complex, lengthy reasoning.

    In contrast, our MGPO algorithm achieves remarkable gains, outperforming both SFT and GRPO. Specifically, MGPO delivers a substantial 5.2% absolute improvement over the GRPO baseline on the V* Bench (OOD) benchmark, and a 5.4% gain in overall MME-Realworld (ID) performance. These results demonstrate the effectiveness of multi-turn grounding and iterative sub-image cropping in addressing the challenges of high-resolution visual understanding.

    Additionally, we compare our results with OpenAI’s o1 and GPT-4o models. To ensure a fair comparison, we report only the OOD V* Bench results. Notably, our MGPO post-trained model surpasses both o1 and GPT-4o, despite being based on a 7B model and trained with a small-scale dataset of 21k samples.

    Table 1: Performance comparison of different post-training paradigms for LMMs. V* Bench serves as an out-of-distribution evaluation, while MME-Realworld serves as an in-distribution evaluation. Abbreviations: OCR—Optical Character Recognition in the wild; RS—Remote Sensing; DT—Diagram and Table; MO—Video Monitoring; AD—Autonomous Driving.

    Figure 6 illustrates the comparative performance trajectories of MGPO and GRPO on the V* Bench throughout the RL training process. As training progresses, MGPO consistently surpasses GRPO, highlighting its superior capacity to address high-resolution scenarios that remain unresolved by GRPO.

Figure 6: Performance comparison between MGPO and GRPO on V* Bench.

Effect of LMM Maximum Input Image Resolution. Table 2 compares the impact of varying maximum input image resolutions for LMMs. We observe that MGPO yields greater performance improvements on V* Bench when the maximum input pixel limit is lower. This is because, when high-resolution images are aggressively resized, many tasks become more challenging to solve directly. However, MGPO can first identify key regions and crop clearer sub-images from the original image, thereby facilitating more effective task completion.

    Table 2: Performance comparison of various post-training paradigms for LMMs under different maximum input image resolutions.

    4. Grounding-based RL without Grounding Annotations

In this section, we highlight the insight that it is feasible to train powerful grounding-based RL models even without grounding annotations. This insight broadens the applicability of grounding-based RL paradigms, as obtaining high-quality grounding annotations is often expensive and labor-intensive.

    4.1 Emergent Grounding Ability During RL Training

To assess whether models can develop accurate grounding capabilities in the absence of grounding supervision, we analyze the proportion of rollouts that generate valid grounding coordinates during RL training (e.g., ensuring coordinates fall within the input image boundaries). Figure 7 illustrates the comparison between GRPO and MGPO. For GRPO, the ratio of valid grounding coordinates remains low and exhibits minimal improvement throughout training, indicating that the model struggles to ground the correct image regions. In contrast, MGPO demonstrates a clear upward trajectory, with the proportion of valid grounding coordinates steadily increasing as training progresses.
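A minimal sketch of the validity check used to compute this ratio (boundary and well-formedness checks only; parsing of the raw model output is omitted):

```python
def is_valid_bbox(bbox, image_width: int, image_height: int) -> bool:
    """Check that a predicted box is well-formed and lies within the input image
    boundaries; parsing of the raw model output is omitted."""
    try:
        x1, y1, x2, y2 = map(float, bbox)
    except (TypeError, ValueError):
        return False
    return 0 <= x1 < x2 <= image_width and 0 <= y1 < y2 <= image_height

def valid_grounding_ratio(rollout_bboxes, image_sizes) -> float:
    valid = sum(is_valid_bbox(b, w, h) for b, (w, h) in zip(rollout_bboxes, image_sizes))
    return valid / max(1, len(rollout_bboxes))
```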

    Figure 7: The ratio of valid grounding coordinates during RL rollouts.

Additionally, we evaluate whether the grounded sub-images from the test set can be directly used to answer the question using Qwen2.5-VL-7B. As presented in Table 3, the comparative results across different methods demonstrate the superior grounding accuracy achieved by MGPO. In the second stage of MGPO, the model is provided with either the cropped sub-image or the original image, without any auxiliary reward for generating valid sub-image coordinates. Notably, the model autonomously increases the proportion of valid grounding coordinates, suggesting that it is capable of learning to localize key regions and utilize sub-images to improve question answering performance.

Table 3: Ratio of grounded sub-images that can directly answer the question using Qwen2.5-VL-7B on V* Bench.

    4.2 Further Experiments on Image Counting Tasks

To further substantiate this insight, we conduct additional experiments on the image counting task, leveraging the fact that the counting data provides both grounding annotations (in point format) and the corresponding count as the final answer. Specifically, we randomly sample 3,000 instances from the Pixmo-Points dataset for post-training. Pixmo-Count is used as the in-distribution (ID) evaluation benchmark, while FSC-147 serves as the out-of-distribution (OOD) benchmark.

During GRPO post-training, the model is prompted to first ground (point to) each object in the image and then provide the total count. We compare two reward functions: (1) a binary accuracy reward based solely on the correctness of the final count, and (2) the same reward with an additional point reward. The point reward is computed by matching the model’s predicted point list with the ground-truth point list using the Hungarian algorithm, such that a higher matched ratio results in a higher reward.
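The point reward could be computed roughly as follows with the Hungarian algorithm from SciPy; the distance threshold for counting a match is an assumption of this sketch.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def point_match_reward(pred_points, gt_points, dist_threshold: float = 0.05):
    """Hungarian-matching point reward: the reward is the fraction of ground-truth
    points matched by a prediction within dist_threshold (normalized coordinates);
    the threshold value is an assumption of this sketch."""
    if len(gt_points) == 0:
        return float(len(pred_points) == 0)
    if len(pred_points) == 0:
        return 0.0
    pred = np.asarray(pred_points, dtype=float)   # shape (P, 2)
    gt = np.asarray(gt_points, dtype=float)       # shape (G, 2)
    cost = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)  # (P, G) distances
    rows, cols = linear_sum_assignment(cost)      # optimal one-to-one matching
    matched = (cost[rows, cols] <= dist_threshold).sum()
    return float(matched) / len(gt_points)
```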

    The results, summarized in Table 4, indicate that introducing the additional point reward does not yield significant performance improvements. We further visualize the outputs of the GRPO model trained solely with the accuracy reward (see Figure 8), and observe that the model is capable of accurately localizing object points even without explicit grounding supervision. These results support our conclusion that explicit grounding annotations are not necessary for effective RL-based learning, as the model inherently learns to perform precise grounding as a prerequisite for solving the counting task.

Table 4: Performance comparison on the image counting task. The additional point reward does not lead to significant performance improvements.

Figure 8: Visualization of point predictions from the GRPO model trained with only the accuracy reward.

    5. Limitation

All experiments with MGPO are conducted using a fixed two-turn template, rather than allowing the model to autonomously decide when to perform image cropping based on the input question, as illustrated by the latest OpenAI models such as o3 and o4-mini. This limitation stems from our observation that Qwen2.5-VL, when directly subjected to RL post-training, struggles to generate grounding coordinates without explicit prompt guidance.

Nevertheless, we believe that our trained models can be leveraged to generate high-quality chain-of-thought (CoT) data for subsequent SFT. Adopting a multi-stage training strategy that combines SFT and RL, as in DeepSeek-R1, may ultimately enable the model to autonomously decide when and how to perform grounding. We leave this direction for future work.

    Authors

    Citation

    If you find our work to be useful for your research, please consider citing.

    @article{huang2025highres,
      title={High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning},
      author={Huang, Xinyu and Dong, Yuhao and Li, Wei and Wu, Jinming and Deng, Zihao and Li, Bo and Ma, Zejun},
      url={https://github.com/EvolvingLMMs-Lab/MGPO},
      year={2025}
    }

    Appendix

    Figure 9: A full conversation example of MGPO post-trained model on high-resolution image tasks.
• Thumbnail

    What is Aero Audio?

    Github | Playground | Models | Evaluation Results | Cookbook

    Aero-1-Audio is a compact audio model adept at various audio tasks, including speech recognition, audio understanding, and following audio instructions. It is part of the Aero-1 series, the first generation of lightweight multimodal models developed by LMMs-Lab, with future expansions planned across additional modalities.

1. Built upon the Qwen2.5-1.5B language model, Aero delivers strong performance across multiple audio benchmarks while remaining parameter-efficient, even compared with larger advanced models such as Whisper, Qwen2-Audio, and Phi-4-Multimodal, or commercial services like ElevenLabs/Scribe.

2. Aero is trained within one day on 16 H100 GPUs using just 50k hours of audio data. Our results suggest that audio model training can be sample-efficient when the data is high quality and well filtered.

3. Aero can accurately perform ASR and audio understanding on continuous audio inputs up to 15 minutes in length, a scenario that we find remains challenging for other models.

    ASR & Audio Understanding Performance

We evaluate our model’s performance across multiple dimensions and benchmarks. Let’s first take a look at its overall performance compared with other models.

    ASR-Understanding-Compare.png ASR-Detail.png

    Our model achieves a balance between performance and parameter efficiency. We evaluate it across multiple ASR and audio understanding benchmarks. On ASR tasks, our model attains the lowest WER scores on datasets such as AMI, LibriSpeech, and SPGISpeech. It also demonstrates strong audio understanding capabilities on various comprehension benchmarks. As illustrated in the plotted graph, our model falls within the highlighted triangular region that represents an optimal trade-off between parameter efficiency and performance.

    Data Distribution

    We present the contributions of our data mixture here. Our SFT data mixture includes over 20 publicly available datasets, and comparisons with other models highlight the data’s lightweight nature.

    Data-distribution.png training-time.png

    *The hours of some training datasets are estimated and may not be fully accurate
    One of the key strengths of our training recipe lies in the quality and quantity of our data. Our training dataset consists of approximately 5 billion tokens, corresponding to around 50,000 hours of audio. Compared to models such as Qwen-Omni and Phi-4, our dataset is over 100 times smaller, yet our model achieves competitive performance. All data is sourced from publicly available open-source datasets, highlighting the sample efficiency of our training approach. A detailed breakdown of our data distribution is provided below, along with comparisons to other models.

    What’s insightful

    In this release, our primary focus is on developing an audio model capable of handling multiple audio tasks. The following examples showcase its core abilities across tasks such as audio understanding and speech recognition. Most notably, we highlight the model’s capability to perform long-form ASR, as demonstrated in the example below.

    Long ASR

    A common approach for current long-form ASR tasks is to split the audio into smaller, processable chunks and perform ASR on each segment individually. However, with the advancement of large language models (LLMs), long-context understanding has become increasingly important. We argue that a model’s ability to process long audio sequences continuously is essential for effective audio understanding and should be considered a critical capability. To demonstrate this, we set up a simple use case using examples from an NVIDIA conference and calculate the WER with respect to the auto-generated YouTube subtitles.

    Long-ASR-eval.png

    The image above presents a heatmap comparison of different models performing ASR tasks on a video with varying audio input lengths. As shown in the heatmap, Qwen-Omni and Phi-4 exhibit instability across different lengths and do not consistently produce the desired output.

    Note: The ground truth is derived from the auto-generated subtitles downloaded from YouTube. Therefore, the WER does not necessarily imply that our model achieves perfect results, but rather demonstrates that our model is comparable to the YouTube ASR pipeline.

    Model’s Output

    Qwen Omni (12 minutes chunk)

    When processing the audio in 12-minute chunks, Qwen-Omni failed to recognize the full speech content and was only able to capture portions of the audio.

    Qwen Omni (12 minutes chunk)
    that’s like what’s going on why does itfocused on um ai and parallel parallelizable workloads but it’s still general to an extent it’s not as use case specific as something like grock with a queue that’s really designed to you know spit out tokens as fast as possible and that like is a goldilocks zone where it’s flexible enough to handle different workloads but not um but still much faster than um a traditional cpu and that google is one of the only companies that has a scaled internal custom silicon effort

    Phi-4-Multimodal (full chunk)

    When processing the full audio without splitting, the Phi-4-Multimodal model began to ignore the instructions and instead generated an overall summary of the audio.

    Phi-4-Multimodal (full chunk)
    The conversation covered Nvidia’s focus on inference over training, the partnership with GM, the release of GUT-N1 for humanoid robotics, and the impact of China’s AI initiatives on global chip demand.

    Aero (full chunk)

    Aero Audio is able to generate the complete ASR output and accurately identify the full transcript.

    Aero (full chunk)
    Welcome to the brainstorm episode eighty two frank downing joining us recap of nvidia’s gtc conference that is the gpu technology conference frank what happened what were the big takeaways i on my side i saw a gm and in video partnering but we can circle back to that what was … right nice timing good timing all right we’ll see everyone next week see everyone thank you

    Results on LibriSpeech Unchunked

In its original release, LibriSpeech split its audio files into smaller chunks, and the overall Word Error Rate (WER) was calculated on these segmented samples. However, as we observed, it is straightforward to concatenate the chunks back into their original form, thereby creating a simple long-form audio speech recognition benchmark. We evaluated various models on this benchmark and found that their performance generally declined compared to their results on shorter samples. Among the models tested, our model achieved the best performance, showing the smallest drop in accuracy relative to the chunked version.

| Model | LS.Clean | LS.Other | LS.Clean (Long) | LS.Other (Long) | Avg Diff |
|---|---|---|---|---|---|
| Phi-4 | 1.68 | 3.83 | 11.51 | 24.72 | 30.72 |
| Qwen2-Audio-Instruct | 3.59 | 7.46 | 93.01 | 93.63 | 175.59 |
| Qwen2.5-Omni | 1.80 | 3.40 | 13.03 | 13.29 | 21.12 |
| Aero-1-Audio | 1.49 | 3.17 | 5.31 | 11.71 | 12.36 |

    We present the evaluation of various models on the unchunked LibriSpeech dataset. The average result is calculated by averaging the WER score differences across the same splits. All models show some degradation when handling longer audio, whereas our model exhibits the least amount of performance drop.

    Evaluation Result

We present the full evaluation results and scores below.

    ASR Benchmarks

WER (%) on automatic speech recognition benchmarks:

| Model | Parameters | AMI | Earnings22 | LibriSpeech Clean | LibriSpeech Other | SPGISpeech | TedLium | Average |
|---|---|---|---|---|---|---|---|---|
| ElevenLabs/Scribe | N/A | 14.43 | 12.14 | 1.79 | 3.31 | 3.30 | 3.17 | 6.36 |
| REV.AI/Fusion | N/A | 10.93 | 12.09 | 2.88 | 6.23 | 4.05 | 2.80 | 6.50 |
| OpenAI/Whisper-large-v3 | 1.5B | 15.95 | 11.29 | 2.01 | 3.91 | 2.94 | 3.86 | 6.66 |
| Assembly.AI/AssemblyBest | N/A | 15.64 | 13.54 | 1.74 | 3.11 | 1.81 | 3.43 | 6.55 |
| Alibaba/Qwen2.5-Omni | 7B | 12.41 | 12.74 | 1.80 | 3.40 | 2.35 | 3.11 | 5.97 |
| Microsoft/Phi-4-Multimodal | 4B+1.6B | 11.45 | 10.50 | 1.67 | 3.82 | 3.11 | 2.89 | 5.57 |
| LMMs-Lab/Aero-1-Audio | 1.5B | 10.53 | 13.79 | 1.49 | 3.17 | 1.97 | 2.87 | 5.64 |

    We evaluate our model on AMI, Earnings22, LibriSpeech, SPGISpeech, and TedLium. Our model achieves the second-best WER score compared to other models, while maintaining a small and efficient size.

    Audio Understanding Result

We then test our model’s understanding capability across three dimensions: Audio Analysis and Understanding, Speech Instruction, and Audio Scene Understanding.

(AIR-Chat and MMAU cover Audio Analysis and Understanding; OpenHermes and Alpaca Audio cover Speech Instruction; AIR-Foundation covers Audio Scene Understanding.)

| Model | Parameters | AIR-Chat Speech | AIR-Chat Sound | AIR-Chat Music | AIR-Chat Mix | AIR-Chat Avg | MMAU testmini | OpenHermes | Alpaca Audio | AIR-Foundation Speech | AIR-Foundation Sound | AIR-Foundation Music | Average |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Alibaba/Qwen2-Audio-Instruct | 7B | 7.2 | 7.0 | 6.8 | 6.8 | 6.9 | 49.2 | 46.8 | 49.2 | 62.9 | 55.4 | 56.8 | 56.7 |
| Alibaba/Qwen2.5-Omni | 7B | 6.8 | 5.7 | 4.8 | 5.4 | 5.7 | 65.6 | 57.2 | 57.4 | 67.2 | 76.3 | 63.0 | 64.4 |
| Microsoft/Phi-4-Multimodal | 4B+1.6B | 7.5 | 7.0 | 6.7 | 6.8 | 7.0 | 65.0 | 57.8 | 62.6 | 48.3 | 40.6 | 35.5 | 52.8 |
| Tencent/Ola | 7B | 7.3 | 6.4 | 5.9 | 6.0 | 6.4 | 70.3 | 62.6 | 62.8 | 58.8 | 70.4 | 53.1 | 63.2 |
| Tencent/Vita 1.5 | 7B | 4.8 | 5.5 | 4.9 | 2.9 | 4.5 | 35.5 | 9.6 | 7.0 | 31.5 | 24.1 | 25.5 | 28.6 |
| InspirAI/Mini-Omni2 | 0.5B | 3.6 | 3.5 | 2.6 | 3.1 | 3.2 | - | - | - | - | - | - | - |
| LMMs-Lab/Aero-1-Audio | 1.5B | 5.7 | 5.3 | 4.7 | 5.8 | 5.4 | 59.4 | 40.0 | 45.4 | 48.0 | 57.6 | 44.2 | 50.5 |

    We conducted evaluations on AIR-Bench-Chat and MMAU for audio analysis and understanding. Our model achieved an average score of 5.35, outperforming Mini-Omni2 and Vita. For Audio Instruction Following, we evaluated on OpenHermes and Alpaca-Audio, following the same pipeline as AudioBench. Our model demonstrates a strong ability to understand instructions in speech and provide correct responses. Additionally, when evaluated on AIR-Bench-Foundation for Audio Scene Understanding, our model outperformed Phi-4-Multimodal in the sound and music dimensions. Overall, the average score of our model indicates strong performance relative to other models with larger parameter sizes.

    Training Techniques

    Dynamic Batch Size

    We implemented a dynamic batching strategy based on the estimated token length to control the batch size per device. In many cases, using a fixed batch size requires setting it conservatively small to avoid out-of-memory (OOM) errors on longer samples, which leads to underutilization of computing resources. To address this, we group samples into batches such that the total token length stays within a predefined threshold, thereby minimizing computational waste and improving efficiency.
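A minimal sketch of this dynamic batching strategy is shown below; the `est_tokens` field and the greedy grouping policy are illustrative assumptions rather than the exact implementation.

```python
def pack_into_batches(samples, max_tokens_per_batch: int):
    """Greedy dynamic batching: group samples so the summed estimated token length
    stays under a budget. sample["est_tokens"] is an assumed pre-computed field
    holding the audio + text token estimate."""
    batches, current, current_tokens = [], [], 0
    for sample in sorted(samples, key=lambda s: s["est_tokens"], reverse=True):
        if current and current_tokens + sample["est_tokens"] > max_tokens_per_batch:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(sample)
        current_tokens += sample["est_tokens"]
    if current:
        batches.append(current)
    return batches
```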

    Sequence Packing

    To further optimize dynamic batching, we implemented sequence packing for both the audio encoder and the language model, enabling larger batch sizes and faster training. This operation was then fused with the Liger kernel to achieve even higher throughput and lower memory usage. With a fixed packing length of 4096 to regulate the dynamic batch size, the average Model FLOP Utilization (MFU) was limited to 0.03. However, with sequence packing enabled, the average MFU increased to approximately 0.34, demonstrating a significant improvement in training efficiency.
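A simplified sketch of sequence packing with cumulative sequence boundaries (`cu_seqlens`), which variable-length attention kernels typically consume; the exact packing logic and the Liger-kernel fusion used in training are not shown.

```python
import torch

def pack_sequences(token_id_lists, pack_length: int = 32768, pad_id: int = 0):
    """Concatenate variable-length sequences into fixed-length packed buffers and
    record cumulative boundaries (cu_seqlens) so attention can stay within each
    original sequence. Sequences longer than pack_length are assumed to have been
    filtered out upstream."""
    packs, buf, cu_seqlens = [], [], [0]
    for ids in token_id_lists:
        if len(buf) + len(ids) > pack_length:
            buf += [pad_id] * (pack_length - len(buf))  # pad the tail of this pack
            packs.append((torch.tensor(buf), torch.tensor(cu_seqlens)))
            buf, cu_seqlens = [], [0]
        buf += ids
        cu_seqlens.append(len(buf))
    if buf:
        buf += [pad_id] * (pack_length - len(buf))
        packs.append((torch.tensor(buf), torch.tensor(cu_seqlens)))
    return packs
```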

| Packing Length | Sequence Packing | Num GPUs | Avg MFU | ZeRO | OOM |
|---|---|---|---|---|---|
| 4096 | FALSE | 64 | 0.03 | 2 | No |
| 32768 | FALSE | 64 | N/A | 2 | Yes |
| 32768 | TRUE | 32 | 0.34 | 2 | No |

We tested different settings to demonstrate the efficiency of our implementation.

    Contributor List

    alphabetical order

    *main contributors

    Citation

    @article{li2025aero,
      title={Aero: Audio-enhanced large language models},
      author={Li, Bo and Chen Change Loy and Pu Fanyi and Yang Jingkang and Zhang Kaichen and Hu Kairui and Thang Luu Minh and Trung Nguyen Quang and Cong Pham Ba and Liu Shuai and Wang Yezhen and Liu Ziwei},
      url={https://www.lmms-lab.com/posts/aero_audio/},
      year={2025}
    }
  • teaser

    We introduce EgoLife, a project to develop an egocentric life assistant that accompanies and enhances personal efficiency through AI-powered wearable glasses 👓. To lay the foundation for this assistant, we conducted a comprehensive data collection study where six participants lived together for one week, continuously recording their daily activities—including discussions 💬, shopping 🛍️, cooking 🍳, socializing 👥, and entertainment 🎮 - using AI glasses for multimodal egocentric video capture, along with synchronized third-person-view video references. This effort resulted in the EgoLife Dataset 📖, a comprehensive 300-hour egocentric, interpersonal, multiview, and multimodal daily life dataset with intensive annotation. Leveraging this dataset, we introduce EgoLifeQA❓, a suite of 3K long-context, life-oriented question-answering tasks designed to provide meaningful assistance in daily life by addressing practical questions such as recalling past relevant events, monitoring health habits, and offering personalized recommendations.

To address the key technical challenges of 1) developing robust visual-audio models for egocentric data, 2) enabling identity recognition, and 3) facilitating long-context question answering over extensive temporal information, we introduce EgoButler 🫡, an integrated system comprising EgoGPT 🧠 and EgoRAG 🔍. EgoGPT is a vision-language model trained on egocentric datasets, achieving state-of-the-art performance on egocentric video understanding. EgoRAG is a retrieval-based component that supports answering ultra-long-context questions. Our experimental studies verify their working mechanisms and reveal critical factors and bottlenecks, guiding future improvements. By releasing our datasets, models, and benchmarks, we aim to stimulate further research in egocentric AI assistants.

  • Banner

    Humans acquire knowledge through three cognitive stages: perceiving information, comprehending knowledge, and adapting knowledge to solve novel problems. Videos serve as an effective medium for this learning process, facilitating a progression through these cognitive stages. However, existing video benchmarks fail to systematically evaluate the knowledge acquisition capabilities in Large Multimodal Models (LMMs).

    To address this gap, we introduce Video-MMMU, a multi-modal, multi-disciplinary benchmark designed to assess LMMs’ ability to acquire and utilize knowledge from videos. Video-MMMU features a curated collection of 300 expert-level videos and 900 human-annotated questions across six disciplines, evaluating knowledge acquisition through stage-aligned question-answer pairs: Perception, Comprehension, and Adaptation. A proposed knowledge gain metric, Δ_knowledge, quantifies improvement in performance after video viewing. Evaluation of LMMs reveals a steep decline in performance as cognitive demands increase and highlights a significant gap between human and model knowledge acquisition, underscoring the need for methods to enhance LMMs’ capability to learn and adapt from videos.

  • Banner

For the first time in the multimodal domain, we demonstrate that features learned by Sparse Autoencoders (SAEs) in a smaller Large Multimodal Model (LMM) can be effectively interpreted by a larger LMM. Our work introduces the use of SAEs to analyze the open-semantic features of LMMs, providing a solution for feature interpretation across model scales.

This research is inspired by Anthropic’s remarkable work on applying SAEs to interpret features in large-scale language models. In multimodal models, we discovered intriguing features that correlate with diverse semantics and can be leveraged to steer model behavior, enabling more precise control and understanding of LMM functionality.

    The Sparse Autoencoder (SAE) is trained on LLaVA-NeXT data by integrating it into a specific layer of the model, with all other components frozen. The features learned by the SAE are subsequently interpreted through the proposed auto-explanation pipeline, which analyzes the visual features based on their activation regions.

    Steer

These features can then be used to steer the model’s behavior toward desired outputs. You can check our papers for more details.

  • The development of video large multimodal models (LMMs) has been hindered by the difficulty of curating large amounts of high-quality raw data from the web. To address this, we consider an alternative approach, creating a high-quality synthetic dataset specifically for video instruction-following, namely LLaVA-Video-178K. This dataset includes key tasks such as detailed captioning, open-ended question-answering (QA), and multiple-choice QA. By training on this proposed dataset, in combination with existing visual instruction tuning data, we introduce LLaVA-Video, a new video LMM. Our experiments demonstrate that LLaVA-Video achieves strong performance across various video benchmarks, highlighting the effectiveness of our dataset. We plan to release the dataset, its generation pipeline, and the model checkpoints.

    Video Instruction-Following Data Synthesis

A high-quality dataset for video instruction-tuning is crucial for developing effective video-language models. We identify a key factor in building such datasets: ensuring richness and diversity in both video content and its language annotations. We perform a comprehensive survey of existing video benchmarks, covering various public video captioning and question-answering datasets, and identify ten unique video sources that contribute to over 40 video-language benchmarks. From each source, we select videos that exhibit significant temporal dynamics. To maintain diversity in the annotations, we establish a pipeline capable of generating detailed captions for videos of any length. Additionally, we define 16 types of questions that guide GPT-4o in creating question-answer pairs to assess the perceptual and reasoning skills of video-language models.

    Video Sources

We noticed that although different video-language datasets focus on various video understanding tasks, most are sourced from ten main video sources, which offer a wide range of video data from different websites, viewpoints, and domains. The relationship between these ten selected video datasets and others is shown in the figure below. We select dynamic videos from these sources; the video selection logic is detailed in the paper.

    Automated Generation for Video Detail Description

    For selected videos, we use GPT-4o to systematically describe their content. We start by sampling video frames at one frame per second (fps). However, due to the input size constraints of GPT-4o, we cannot use all sampled frames. Instead, we describe the videos sequentially, as shown in figure below. We create descriptions at three distinct levels, detailed below.
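A minimal sketch of the 1 fps frame-sampling step, using OpenCV as an assumed decoding backend; the sequential GPT-4o captioning calls themselves are omitted.

```python
import cv2  # assumed decoding backend; decord or ffmpeg would work equally well

def sample_frames_at_1fps(video_path: str):
    """Sample roughly one frame per second from a video before sequentially
    describing it; the GPT-4o captioning calls are not shown."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0  # fall back if FPS metadata is missing
    step = max(1, int(round(fps)))
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:  # keep roughly one frame per second
            frames.append(frame)
        idx += 1
    cap.release()
    return frames
```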

    Automated Generation for Video Question Answering

    In addition to detailed video descriptions, our dataset includes a variety of question-answer pairs designed for complex interactions. This setup improves the video understanding model’s ability to handle real-life queries. We refer to public video question-answering benchmarks to organize these questions into 16 specific categories, as shown in Figure 3. Given a detailed video description, we use GPT-4o to generate at most one question-answer pair for each type of question. Please refer to the paper for more details of the question types and the generation process.

    Dataset Statistics

    We carefully select from our collected data sources to form a balanced and comprehensive collection, resulting in a total of 178K videos and 1.3M instruction-following samples. This includes 178K captions, 960K open-ended QAs, and 196K multiple-choice QAs.

    Dataset Comparison

    We provide a comparison of high-quality instruction-following video-language datasets, with a focus on synthetic data created with strong AI models, as shown in Table 1.

A broad collection of dynamic videos. In terms of video sources, although LLaVA-Hound contains the largest number of videos, 44% of its video data are sourced from WebVid, where most videos are static. ShareGPT4Video includes 30% of its videos from Pexels, Pixabay, and Mixkit, which are aesthetically pleasing but also mostly static. Additionally, the majority of its videos come from Panda-70M, which consists of short clips cut from longer videos, suggesting simpler plots. In contrast, we carefully select video sources that offer dynamic, untrimmed videos with complex plots, which are crucial for developing a powerful video understanding model.

High frames per second. Regarding frame sampling in language annotations, the proposed dataset uses 1 FPS, while other datasets use much lower FPS. LLaVA-Hound uniformly samples 10 frames from videos of any length, for an average of 0.008 FPS, which may miss fine details. ShareGPT4Video picks key frames using CLIP based on frame uniqueness; this method can also miss subtle changes in the video because CLIP embeddings do not capture fine-grained dynamics well. Our method samples at FPS=1 without key-frame selection algorithms, ensuring that detailed temporal information can be expressed in annotations with high coverage.

Diverse tasks. The proposed dataset covers three common task types, including captioning, free-form QA, and closed-form QA, while existing datasets cover only a subset. Meanwhile, the quality and number of samples in our dataset are higher.

  • LLaVA-OneVision

We present LLaVA-OneVision, a family of open large multimodal models (LMMs) developed by consolidating our insights into data, models, and visual representations in the LLaVA-NeXT blog series. Our experimental results demonstrate that LLaVA-OneVision is the first single model that can simultaneously push the performance boundaries of open LMMs in three important computer vision scenarios: single-image, multi-image, and video scenarios. Importantly, the design of LLaVA-OneVision allows strong transfer learning across different modalities and scenarios, yielding new emerging capabilities. In particular, strong video understanding and cross-scenario capabilities are demonstrated through task transfer from images to videos.

We open-source LLaVA-OneVision to facilitate the future development of LMMs in the community.

    Training Code: Cook a SOTA model with our released training code

    🤗 Checkpoints: Access pre-trained model checkpoints (0.5B, 7B, 72B)

    🤗 LLaVA-OneVision Data: Explore training datasets for Single-Image and OneVision stages

  • Banner

    In today’s world, we’re on an exciting journey toward creating Artificial General Intelligence (AGI), much like the enthusiasm of the 1960s moon landing. This journey is powered by advanced large language models (LLMs) and large multimodal models (LMMs), which are complex systems capable of understanding, learning, and performing a wide variety of human tasks.

    To gauge how advanced these models are, we use a variety of evaluation benchmarks. These benchmarks are tools that help us understand the capabilities of these models, showing us how close we are to achieving AGI.

    However, finding and using these benchmarks is a big challenge. The necessary benchmarks and datasets are spread out and hidden in various places like Google Drive, Dropbox, and different school and research lab websites. It feels like we’re on a treasure hunt, but the maps are scattered everywhere.

In the field of language models, a valuable precedent has been set by lm-evaluation-harness. It offers integrated data and model interfaces that enable rapid evaluation of language models, serves as the backend framework for the open-llm-leaderboard, and has gradually become the underlying ecosystem of the era of foundation models.

We humbly absorbed the exquisite and efficient design of lm-evaluation-harness and introduce lmms-eval, an evaluation framework meticulously crafted for consistent and efficient evaluation of LMMs.

  • Banner

Gemini has amazed the world with its capability to understand hour-long videos. However, we still lack an open-source alternative with similar capabilities. Our latest research presents an innovative solution toward long-video LMMs, shifting the focus from reducing visual tokens per frame to leveraging the long-context capabilities of language models. Here, we present our SoTA video model, Long Video Assistant (LongVA), and our novel benchmark, Visual Needle-In-A-Haystack (V-NIAH).

    Long Context Transfer We discovered and verified that the long context capability of language models can be directly transferred to the video domain in modality-aligned multi-modal models. On V-NIAH, LongVA is the only open-source model capable of accurately retrieving visual information from inputs with 2000 frames or more than 200K visual tokens.

    UniRes We proposed UniRes, a unified visual encoding scheme that encodes both images and videos. In UniRes, a video is encoded the same as multiple image crops in a sequence. Leveraging the Long Context Transfer property and UniRes, LongVA exhibits superior zero-shot performance in video tasks without any video-specific training data.

SoTA Performance LongVA achieves state-of-the-art performance on the comprehensive Video-MME benchmark among 7B models. Its performance increases with denser sampling of video frames. We also conduct careful experiments to ablate where its improvements come from.