LMMs-Lab

  • LLaVA-Critic-R1

    Figure 1: LLaVA-Critic-R1 is trained on top of the base model Qwen-2.5-VL-7B; LLaVA-Critic-R1+ is developed by applying the same RL critic training procedure to a stronger reasoning VLM, ThinkLite-VL-7B. Left: performance comparison of LLaVA-Critic-R1 with other base and reasoning VLMs on multiple visual reasoning, visual understanding, and visual reward benchmarks. LLaVA-Critic-R1 not only significantly outperforms other models in critic performance, but also demonstrates stronger policy capabilities. Right: performance improvement from critic training and test-time self-critic scaling on five common visual reasoning and visual understanding benchmarks. Critic training alone significantly improves the base model's performance; building on this, leveraging the dual policy and critic capabilities of LLaVA-Critic-R1 for a "Best-of-128" self-critic scaling procedure at test time yields a further substantial boost.

    Breaking the Critic-Policy Divide

    In vision-language modeling, critic models are typically trained to evaluate outputs—assigning scalar scores or pairwise preferences—rather than to generate responses. This separation from policy models, which produce the responses, is so entrenched that critics are rarely considered for direct policy use.

    LLaVA-Critic-R1 challenges this convention. We propose to reorganize preference-labeled critic datasets into verifiable training signals and perform reinforcement learning directly on a base generative model, producing a multimodal critic trained to optimize preference judgments while retaining full generation ability.
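
    To make this concrete, here is a minimal sketch of how one preference-labeled record could be recast into a verifiable RL sample: the ground-truth preference becomes an exactly checkable answer, and the reward is binary. The function names and prompt template below are illustrative assumptions, not the released pipeline.

    def build_critic_sample(image, question, response_a, response_b, preferred):
        """Recast a pairwise preference record as a prompt whose answer can be
        checked exactly. `preferred` is the human label, "A" or "B"."""
        prompt = (
            "You are given a question about an image and two candidate answers.\n"
            f"Question: {question}\n"
            f"Response A: {response_a}\n"
            f"Response B: {response_b}\n"
            "Reason step by step, then end with your verdict: A or B."
        )
        return {"image": image, "prompt": prompt, "answer": preferred}

    def verifiable_reward(model_output: str, label: str) -> float:
        """Binary, rule-checkable reward: 1.0 iff the final verdict matches."""
        verdict = model_output.strip().rstrip(".").split()[-1].upper()
        return 1.0 if verdict == label else 0.0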

    Surprising Dual Excellence

    LLaVA-Critic-R1 emerges not only as a top-performing critic but also as a competitive policy model—matching or surpassing specialized reasoning VLMs trained with in-domain data across 26 visual reasoning and understanding benchmarks, with an average gain of +5.7% over its base model (Qwen-2.5-VL-7B).

    Extending this approach to existing strong reasoning VLMs yields LLaVA-Critic-R1+, which further advances policy performance without sacrificing critic quality, achieving a state-of-the-art 71.9 on MMMU at the 7B scale.

    Self-Critique at Test Time

    The enhanced critic ability pays off directly at inference time. Applying self-critique at test time, the model samples multiple candidate responses and uses its own judgment to select among them, yielding an average +13.8% improvement on five representative reasoning tasks without any additional training. This demonstrates the power of unified critic-policy models for building self-improving systems.
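
    As an illustration, a Best-of-N self-critique loop might look like the sketch below, where generate() and judge() are hypothetical helpers on the same unified model and judge() returns which of two responses it prefers; the exact selection scheme used for Best-of-128 may differ.

    def best_of_n(model, image, question, n=128):
        """Test-time self-critique: sample n candidates, then let the model's
        own critic ability pick a winner via a pairwise knockout."""
        best = model.generate(image, question)
        for _ in range(n - 1):
            challenger = model.generate(image, question)
            # judge() returns "A" (keep current best) or "B" (take challenger)
            if model.judge(image, question, best, challenger) == "B":
                best = challenger
        return best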

    Technical Innovation

    Our approach centers on three key innovations:

    Data Reorganization: We transform preference-labeled critic datasets into verifiable training signals suitable for reinforcement learning.

    GRPO Training: We apply Group Relative Policy Optimization directly on generative models, enabling them to learn from critic data while maintaining generation capabilities.
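
    For intuition, the group-relative advantage at the heart of GRPO fits in a few lines; this is the standard formulation, shown as a sketch rather than our exact training code. A group of responses is sampled per prompt, each response is scored (e.g., by the binary verifiable reward above), and every response is credited relative to its own group's mean, so no learned value model is needed.

    import statistics

    def grpo_advantages(group_rewards):
        """Group-relative advantage: normalize each sampled response's reward
        by the mean and standard deviation of its own group."""
        mu = statistics.mean(group_rewards)
        sigma = statistics.pstdev(group_rewards) or 1.0  # guard zero variance
        return [(r - mu) / sigma for r in group_rewards]

    # e.g., 8 sampled verdicts for one critic prompt, scored 0/1 by the reward
    print(grpo_advantages([1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0]))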

    Unified Architecture: We maintain a single model for both critic and policy functions, eliminating the traditional separation between evaluation and generation.

    Model Performance

    LLaVA-Critic-R1 demonstrates strong performance across diverse benchmarks:

    • Visual Reasoning: Competitive performance with specialized models on complex reasoning tasks
    • Critic Evaluation: Top-tier preference judgment and scalar scoring capabilities
    • Generation Quality: Maintained fluency and coherence with strong instruction following

    The model comes in two variants:

    • LLaVA-Critic-R1: Base model trained from Qwen-2.5-VL-7B
    • LLaVA-Critic-R1+: Extended approach applied to strong reasoning VLMs

    Implications for the Field

    Our results reveal that RL training on critic data can produce a unified model excelling at both evaluation and generation, offering a simple path toward scalable, self-improving multimodal systems. This work demonstrates that the traditional separation between critics and policies is not necessary—a single model can excel at both tasks simultaneously.

    Resources

    🚀 Code Repository

    Access training code and implementation details

    🤗 Model Collection

    Download pre-trained model checkpoints

    📝 Paper

    Read the full technical paper on arXiv

    Citation

    @article{llava-critic-r1-2025,
      title={LLaVA-Critic-R1: Unified Critic and Policy Model Through Reinforcement Learning},
      author={Wang, Xiyao and Li, Chunyuan and Yang, Jianwei and Zhang, Kai and Liu, Bo and Xiong, Tianyi and Huang, Furong},
      journal={arXiv preprint arXiv:2509.00676},
      year={2025}
    }

    Acknowledgments

    This work represents a collaborative effort in advancing the capabilities of multimodal models through innovative training approaches, building upon the strong foundation of the LLaVA project series.

  • LLaVA-OneVision

    Overview

    We present LLaVA-OneVision, a family of open large multimodal models (LMMs) developed by consolidating our insights into data, models, and visual representations in the LLaVA-NeXT blog series. LLaVA-OneVision is the first single model that can simultaneously push the performance boundaries of open LMMs in three important computer vision scenarios: single-image, multi-image, and video.

    Key Features

    Unified Architecture

    LLaVA-OneVision is designed to have a similar maximum visual token count across different scenarios, enabling flexible extension to multiple visual signal types while maintaining consistent performance.
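
    The idea can be illustrated with back-of-the-envelope numbers (assumed here for illustration; the released configuration may differ): spend the budget on high-resolution crops for a single image, on one view per image for multi-image input, and on many pooled frames for video.

    TOKENS_PER_VIEW = 729  # assumed: one 384px view -> 27x27 patch tokens

    def max_visual_tokens(scenario: str) -> int:
        if scenario == "single-image":
            return (1 + 9) * TOKENS_PER_VIEW  # base view + 9 crops = 7290
        if scenario == "multi-image":
            return 10 * TOKENS_PER_VIEW       # ~10 images, one view each = 7290
        if scenario == "video":
            return 32 * 196                   # 32 frames pooled to 196 = 6272
        raise ValueError(f"unknown scenario: {scenario}")

    Because all three caps land in the same ballpark, sequence length and attention cost stay comparable no matter which input type the model sees.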

    Model Sizes

    • 0.5B parameters - Lightweight deployment
    • 7B parameters - Balanced performance
    • 72B parameters - State-of-the-art capabilities

    Emerging Capabilities

    The design of LLaVA-OneVision enables strong transfer learning across different modalities and scenarios, yielding impressive emerging capabilities:

    1. Cross-Scenario Understanding

    Seamlessly process and understand content across single images, multiple images, and videos within a unified framework.

    2. Advanced Visual Analysis

    • Diagram and table interpretation - Understanding complex visual structures
    • Multi-screenshot interaction - Analyzing relationships across multiple screens
    • Set-of-mark object referencing - Precise object identification and tracking

    3. Video Capabilities

    • Image-to-video generation understanding - Comprehending temporal transitions
    • Video analysis and comparison - Deep understanding of video content
    • Multi-camera video interpretation - Processing footage from multiple viewpoints
    • Detailed video subject description - Rich, contextual video narration

    Strong Transfer Learning

    In particular, strong video understanding and cross-scenario capabilities emerge through task transfer from images to videos, demonstrating the model's ability to generalize learned representations across visual domains.

    Open-Source Resources

    We open-source LLaVA-OneVision to facilitate future development of LMMs in the community:

    🚀 Training Code

    Cook a SOTA model with our released training code and reproduction scripts

    🤗 Model Checkpoints

    Access pre-trained model checkpoints in all three sizes (0.5B, 7B, 72B)

    📊 Training Datasets

    Explore comprehensive training datasets for Single-Image and OneVision stages

    🔥 Live Demo

    Try LLaVA-OneVision directly in your browser

    Development Roadmap

    LLaVA-OneVision represents a significant milestone in our iterative improvements through the LLaVA-NeXT series, focusing on:

    • Enhanced reasoning capabilities
    • Improved OCR performance
    • Expanded world knowledge
    • Advanced multimodal understanding

    Citation

    If you find LLaVA-OneVision useful for your research, please cite:

    @article{li2024llava-onevision,
      title={LLaVA-OneVision: Easy Visual Task Transfer},
      author={Li, Bo and Zhang, Yuanhan and Guo, Dong and Zhang, Renrui and Li, Feng and Zhang, Hao and Zhang, Kaichen and Li, Yanwei and Liu, Ziwei and Li, Chunyuan},
      journal={arXiv preprint arXiv:2408.03326},
      year={2024}
    }

    Acknowledgments

    This work is a collaboration between researchers from ByteDance, NTU, CUHK, and HKUST, building upon the strong foundation of the LLaVA project series.