Figure 1: LLaVA-Critic-R1 is trained on top of the base model Qwen-2.5-VL-7B. Building upon a stronger reasoning VLM, ThinkLite-VL-7B, we further develop LLaVA-Critic-R1+ by applying the same RL critic training procedure. Left: Performance comparison of LLaVA-Critic-R1 with other base and reasoning VLMs on multiple visual reasoning, visual understanding, and visual reward benchmarks. LLaVA-Critic-R1 not only significantly outperforms other models in critic performance, but also demonstrates stronger policy capabilities. Right: Performance improvement of critic training and test-time self-critic scaling on five common visual reasoning and visual understanding benchmarks. Critic training alone significantly improves the base model’s performance. Building upon this, leveraging the dual policy and critic capabilities of LLaVA-Critic-R1 for a “Best-of-128” self-critic scaling procedure at test time leads to a further substantial boost in performance.
Breaking the Critic-Policy Divide
In vision-language modeling, critic models are typically trained to evaluate outputs—assigning scalar scores or pairwise preferences—rather than to generate responses. This separation from policy models, which produce the responses, is so entrenched that critics are rarely considered for direct policy use.
LLaVA-Critic-R1 challenges this convention. We propose to reorganize preference-labeled critic datasets into verifiable training signals and perform reinforcement learning directly on a base generative model, producing a multimodal critic trained to optimize preference judgments while retaining full generation ability.
Surprising Dual Excellence
LLaVA-Critic-R1 emerges not only as a top-performing critic but also as a competitive policy model—matching or surpassing specialized reasoning VLMs trained with in-domain data across 26 visual reasoning and understanding benchmarks, with an average gain of +5.7% over its base model (Qwen-2.5-VL-7B).
Extending this approach to existing strong reasoning VLMs yields LLaVA-Critic-R1+, which further advances policy performance without sacrificing critic quality, achieving a state-of-the-art 71.9 on MMMU at the 7B scale.
Self-Critique at Test Time
The enhanced critic ability benefits inference significantly. Applying self-critique at test time yields an average +13.8% improvement on five representative reasoning tasks without additional training. This demonstrates the power of unified critic-policy models for creating self-improving systems.
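As a rough illustration (not the exact implementation), the procedure can be sketched as Best-of-N sampling with the same model serving as its own judge; the `model.generate` interface, critic prompt, and score parsing below are assumptions:

```python
# Best-of-N self-critique at test time with a single unified critic-policy model.

def best_of_n(model, image, question, n=128):
    # Policy role: sample n candidate responses.
    candidates = [model.generate(image, question, temperature=1.0) for _ in range(n)]

    # Critic role: the same model scores each candidate.
    scores = []
    for answer in candidates:
        critic_prompt = (
            "Rate the following answer to the question on a scale of 1 to 10.\n"
            f"Question: {question}\nAnswer: {answer}\nScore:"
        )
        scores.append(float(model.generate(image, critic_prompt, temperature=0.0)))

    # Return the candidate the critic scores highest.
    return candidates[scores.index(max(scores))]
```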
Technical Innovation
Our approach centers on three key innovations:
Data Reorganization: We transform preference-labeled critic datasets into verifiable training signals suitable for reinforcement learning; a minimal sketch of this conversion appears after this list.
GRPO Training: We apply Group Relative Policy Optimization directly on generative models, enabling them to learn from critic data while maintaining generation capabilities.
Unified Architecture: We maintain a single model for both critic and policy functions, eliminating the traditional separation between evaluation and generation.
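To make the first two points concrete, here is a minimal sketch, under an assumed data layout and verdict format, of how a pairwise preference example becomes a verifiable 0/1 reward for GRPO-style training:

```python
# A preference-labeled critic example converted into a verifiable reward.
# Field names and the "Better response: X" verdict format are assumptions.

example = {
    "question": "Describe the trend shown in the chart.",
    "response_A": "Sales rise steadily from 2019 to 2023.",
    "response_B": "The chart shows quarterly revenue for a single year.",
    "preferred": "A",  # human preference label from the critic dataset
}

def critic_reward(rollout: str, preferred: str) -> float:
    """1.0 if the critic rollout's final verdict matches the labeled preference."""
    verdict = rollout.strip().rstrip(".").split()[-1].upper()
    return 1.0 if verdict == preferred.upper() else 0.0

print(critic_reward("...reasoning... Better response: A", example["preferred"]))  # 1.0
```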
Model Performance
LLaVA-Critic-R1 demonstrates strong performance across diverse benchmarks:
Visual Reasoning: Competitive performance with specialized models on complex reasoning tasks
Critic Evaluation: Top-tier preference judgment and scalar scoring capabilities
Generation Quality: Maintained fluency and coherence with strong instruction following
The model comes in two variants:
LLaVA-Critic-R1: Base model trained from Qwen-2.5-VL-7B
LLaVA-Critic-R1+: Extended approach applied to strong reasoning VLMs
Implications for the Field
Our results reveal that RL training on critic data can produce a unified model excelling at both evaluation and generation, offering a simple path toward scalable, self-improving multimodal systems. This work demonstrates that the traditional separation between critics and policies is not necessary—a single model can excel at both tasks simultaneously.
```bibtex
@article{llava-critic-r1-2025,
  title={LLaVA-Critic-R1: Unified Critic and Policy Model Through Reinforcement Learning},
  author={Wang, Xiyao and Li, Chunyuan and Yang, Jianwei and Zhang, Kai and Liu, Bo and Xiong, Tianyi and Huang, Furong},
  journal={arXiv preprint arXiv:2509.00676},
  year={2025}
}
```
Acknowledgments
This work represents a collaborative effort in advancing the capabilities of multimodal models through innovative training approaches, building upon the strong foundation of the LLaVA project series.
Our previous work, MMSearch-R1, represents a paradigm shift in multimodal AI as the first framework to employ end-to-end reinforcement learning for autonomous tool invocation in large multimodal models (LMMs). By enabling models to independently determine when and how to leverage external search tools, MMSearch-R1 achieves both high efficiency and state-of-the-art performance on open-world tasks, marking a significant advance in practical AI deployment.
What began as a specialized tool-calling model has since evolved into a general-purpose reasoning engine that seamlessly integrates knowledge retrieval with cognitive processing. This evolution offers critical insights into the future of autonomous AI systems: the most capable agents will not only be able to think deeply, but also actively seek and utilize relevant information as needed.
Reasoning-improved Search
Despite MMSearch-R1’s strong performance, we observed limitations in its ability to adapt to complex, dynamic information needs. To address these constraints, we propose a reasoning-first agent paradigm that emphasizes the following core capabilities:
Intelligent search: The model reasons about its knowledge gaps to make decisions about when and how to invoke search tools
Query generation: Deep task understanding enables context-aware query formulation that evolves with the problem
Knowledge integration: External information is systematically incorporated through reasoning processes, not merely retrieved and appended
Performance: The approach delivers fundamental advances in multimodal reasoning, not just incremental improvements
Training Recipe
Prior work in multimodal reasoning has demonstrated that training with verifiable rewards can significantly enhance a model’s capabilities in understanding and solving complex STEM problems.
In our initial experiments, we evaluated numerous multimodal STEM datasets. We discovered that many existing datasets suffer from various limitations: some lack sufficient difficulty for advanced models, while others contain noisy annotations, incomplete visual-text alignments, or unverifiable ground truth answers. These issues can produce unreliable reward signals that destabilize reinforcement learning training.
To address these challenges, we curated a comprehensive high-quality training set consisting of: MMPR[1], MMK12[2], MMR1[3], Multi-subject-RLVR[4], ScienceQA.
To ensure data quality for effective multimodal RL training, we implemented a rigorous filtering pipeline:
Multimodal Verification: Every problem undergoes automatic verification to ensure visual and textual components are properly aligned and complete. We filter datasets to include only problems where both modalities contribute meaningfully to the solution process.
Answer Verifiability: Each problem must have verifiable ground truth answers with clear reasoning paths. For mathematical problems, we verify symbolic and numerical answers (a sketch of such a check appears after this list); for scientific problems, we ensure explanations align with established principles.
Complexity Filtering: Problems must require genuine multimodal reasoning rather than being solvable through text or vision alone. We exclude problems where one modality is merely decorative.
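As a rough illustration of the answer-verifiability criterion for mathematical problems (our production pipeline may use a different verifier), symbolic and numerical equivalence can be checked like this:

```python
# Check whether a candidate answer matches the ground truth symbolically or numerically.
import sympy as sp

def answers_match(ground_truth: str, candidate: str, tol: float = 1e-6) -> bool:
    try:
        gt, cand = sp.sympify(ground_truth), sp.sympify(candidate)
        # Symbolic equivalence, with a numerical tolerance as a fallback.
        return sp.simplify(gt - cand) == 0 or abs(float(gt) - float(cand)) < tol
    except (sp.SympifyError, TypeError, ValueError):
        # Non-symbolic answers fall back to normalized string comparison.
        return ground_truth.strip().lower() == candidate.strip().lower()

assert answers_match("1/2", "0.5")
```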
After filtering, we obtained 80K high-quality multimodal STEM problems for RL training.
Our RL training stage follows DAPO[5] with the following modifications:
No Entropy Loss: We eliminate entropy loss entirely, as its inclusion frequently causes training instability characterized by exponential entropy growth and subsequent collapse.
No KL Loss: Following DAPO, we remove KL loss to allow the model to diverge from the original SFT policy’s trust region. This also eliminates reference policy log probability computation, accelerating training.
Overlong Filtering: We mask loss for truncated sequences to preserve long-context reasoning capabilities.
Learning Rate Schedule: We implement a sigmoid-based decay schedule. The sigmoid schedule provides smooth S-shaped transitions that stabilize early training and asymptotically approach the target rate without discontinuities. We keep the base learning rate at 2e-6 and use 60 warmup steps with sigmoid curve progression. The decay is a sigmoid function that reduces the learning rate to 90% of the base rate (final LR ≈ 1.8e-6); a minimal sketch of this schedule appears after this list.
Improved Exploration: We set the clip high ratio to 0.3 in the GRPO/PPO surrogate loss to encourage exploration and stabilize entropy dynamics.
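As a rough sketch of the schedule described above (the steepness constant and total step count are assumptions, not values from our configuration):

```python
# Sigmoid warmup-plus-decay schedule matching the stated hyperparameters:
# base LR 2e-6, 60 warmup steps, decay to 90% of base (~1.8e-6).
import math

def sigmoid_lr(step: int, base_lr: float = 2e-6, warmup_steps: int = 60,
               total_steps: int = 1000, final_ratio: float = 0.9) -> float:
    if step < warmup_steps:
        # S-shaped warmup from near zero up to base_lr.
        return base_lr / (1.0 + math.exp(-12.0 * (step / warmup_steps - 0.5)))
    # S-shaped decay from base_lr toward final_ratio * base_lr.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    decay = 1.0 / (1.0 + math.exp(12.0 * (progress - 0.5)))  # goes 1 -> 0
    return base_lr * (final_ratio + (1.0 - final_ratio) * decay)

# Example: learning rate sampled every 100 steps, ending near 1.8e-6.
print([round(sigmoid_lr(s), 8) for s in range(0, 1001, 100)])
```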
Our reward function employs a two-stage hierarchical approach that combines mathematical verification with LLM-based evaluation. We first apply a static mathematical verifier to assess answer correctness for questions with deterministic solutions. When the verifier returns zero, indicating either an incorrect answer or an inability to verify, we employ an LLM-as-judge for a secondary assessment. This handles questions that require semantic evaluation or that admit multiple valid representations (e.g., “teal blue” vs. “blue”); the LLM judges based on the given image, question, ground-truth answer, and model prediction.
This design prioritizes computational verification for efficiency while leveraging LLM evaluation for complex semantic cases.
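A minimal sketch of this two-stage reward is shown below; the `math_verify` and `llm_judge` callables are hypothetical interfaces standing in for our actual verifier and judge:

```python
# Two-stage hierarchical reward: rule-based verification first, LLM judge second.

def two_stage_reward(question, image, ground_truth, prediction,
                     math_verify, llm_judge) -> float:
    """Return 1.0 for a correct answer, 0.0 otherwise."""
    # Stage 1: static mathematical verifier for deterministic answers.
    if math_verify(ground_truth, prediction):
        return 1.0
    # Stage 2: LLM-as-judge for semantic cases or alternative representations
    # (e.g., "teal blue" vs. "blue"), conditioned on the image and question.
    return 1.0 if llm_judge(image, question, ground_truth, prediction) else 0.0
```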
Result
Based on this foundation, we build a strong STEM-focused reasoning model that surpasses other open models.
| Models | MMK12 | MathVerse (testmini) | MathVision (testmini) | MathVista (testmini) | MMMU (val) |
|---|---|---|---|---|---|
| Qwen2.5-VL-7B | 34.4 | 46.2 | 24.0 | 66.6 | 49.8 |
| OpenVL-Thinker | 31.0 | 45.2 | 24.0 | 70.2 | 52.3 |
| R1-OneVision | 30.6 | 44.1 | 24.0 | 64.1 | 49.2 |
| MM-Eureka-7B | 27.0 | 50.3 | 26.9 | 73.0 | 50.7 |
| General STEM | 46.2 | 51.4 | 28.4 | 73.6 | 57.3 |
| General STEM -> Search (Two Stage) | 43.0 | 51.9 | 28.0 | 72.4 | 57.9 |
With this reasoning foundation, we can go further to improve the model’s search abilities. We first implemented a two-stage training process to seamlessly integrate search capabilities. This approach ensures that search becomes a natural extension of the model’s reasoning process rather than a separate module.
Compared with our original MMSearch baseline built on Qwen-2.5-VL-7B (referred to here as Instruct -> Search), the figure shows that the model achieves clear improvements. The reasoning-first approach enables more intelligent search decisions, better query formulation, and more effective utilization of retrieved information.
Accuracy across four multimodal benchmarks (Infoseek, MMSearch, FVQA, and SimpleVQA). The Reasoning to Search paradigm consistently outperforms or matches Instruct -> Search, especially on Infoseek and MMSearch, demonstrating the benefit of reasoning-first strategies in complex information retrieval tasks.
One of the most intriguing findings emerged during our evaluation of STEM tasks (e.g., MMMU, MathVision) using Search prompts. We observed a counterintuitive phenomenon: excessive searching actually led to decreased performance. Specifically, models employing Search prompts tended to over-rely on external searches, frequently initiating queries for information that could have been inferred through reasoning or was already available internally.
Accuracy comparison across five challenging reasoning datasets. Results indicate that while integrating search generally helps, excessive or unguided searching can lower performance. This underscores the need for precise reasoning-guided search prompting to achieve optimal results in complex multimodal reasoning tasks.
These performance drops highlight a critical insight: without effective reasoning capabilities to guide their search strategies, models tend to default to inefficient search behaviors. This not only results in unnecessary computational overhead but can also introduce irrelevant information, ultimately degrading the quality of answer generation.
| Search Ratio (%) | MM-K12 | MathVerse (testmini) | MathVision (testmini) | MathVista (testmini) | MMMU (val) |
|---|---|---|---|---|---|
| Reason -> Search (Search Prompt) | 16.8 | 22.9 | 9.5 | 12.5 | 24.7 |
Reason to Act for General Search Model
To achieve a robust balance between reasoning and search performance across general-domain tasks, we choose to integrate the training into one stage for both capabilities. Our goal is to build a model that not only retrieves relevant information efficiently but also demonstrates advanced reasoning over searched information.
Training Recipe
We unify the training process by adopting a ReACT-style prompt template, inspired by the ReAct framework, which allows the model to interleave reasoning and action (search) steps within a single trajectory. This template is a slight refinement of the standard Search prompt, and full implementation details are provided in the Appendix.
The table below summarizes the lineage and training data for each model variant, clarifying the distinctions in model initialization and supervision strategies. For comprehensive information on hyperparameters and training dynamics, please refer to the Appendix.
Result
We evaluated both our two-stage and unified (one-stage) models across a broad suite of benchmarks and consistently observed performance improvements as model capacity increased.
The General STEM model showed that enhancing reasoning capabilities alone can lead to significant gains. In contrast, the General Search model revealed the multiplicative benefits of integrating reasoning with targeted search strategies. Notably, these improvements were not simply incremental - they represent fundamental advances in how models address complex, multimodal problems.
| Models | MMK12 | MathVerse (testmini) | MathVision (testmini) | MathVista (testmini) | MMMU (val) | AI2D | ChartQA | MME | RealworldQA | OCRBench | DocVQA | MMBench | MMStar | MiaBench |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-VL-7B | 34.4 | 46.2 | 24.0 | 66.6 | 49.8 | 93.3 | 94.4 | 630.4/1685.2 | 68.5 | 85.2 | 94.6 | 82.9 | 62.6 | 81.7 |
| General STEM | 46.2 | 51.4 | 28.4 | 73.6 | 57.3 | 94.4 | 91.4 | 700.7/1662.1 | 67.5 | 83.7 | 92.1 | 83.8 | 65.5 | 76.0 |
| Reason -> Search | 43.2 | 51.7 | 25.0 | 71.8 | 57.9 | 94.0 | 93.6 | 652.5/1688.3 | 67.5 | 81.7 | 93.5 | 83.2 | 63.1 | 47.6 |
| General Search | 43.6 | 52.0 | 27.3 | 74.7 | 56.1 | 94.6 | 94.0 | 718.9/1775.3 | 65.5 | 77.8 | 89.4 | 84.0 | 60.4 | 44.4 |
| Models | Infoseek | MMSearch | FVQA | SimpleVQA |
|---|---|---|---|---|
| Qwen2.5-VL-7B | 20.1 | 12.8 | 20.3 | 38.4 |
| MMSearch | 55.1 | 53.8 | 58.4 | 57.4 |
| Reasoning -> Search | 58.5 | 57.1 | 57.9 | 57.7 |
| General Search | 52.0 | 54.9 | 52.8 | 57.0 |
Our results reveal that MMSearch-R1 achieves the highest accuracy across all benchmarks, significantly outperforming standard General Search configurations. The key differentiator is search utilization: MMSearch-R1 demonstrates search ratios up to 61.6% on Infoseek, compared to 28.5% for General Search.
We found a strong positive correlation (Pearson r = 0.911) between search ratio and model performance, indicating that increased search engagement directly improves accuracy. However, this relationship has limits—excessive or undirected search introduces computational costs and answer noise that can degrade reliability.
Additional experiments with reduced STEM data, increased search data ratios, and shortened warmup periods (60 vs 45 steps) confirmed that better performance requires strategic search integration. Models perform best when search is invoked selectively through explicit reasoning about information needs, balancing enhanced knowledge access against computational efficiency.
These findings demonstrate that the key to multimodal model performance lies not in maximizing search frequency, but in developing sophisticated reasoning mechanisms that determine when external information retrieval adds value to complex query resolution.
Case Study
We present the following cases to demonstrate the versatile abilities of our final model.
Case: MME
In this example from the MME benchmark, the model is required to answer a question about a statue located in the National Gallery of Art in Washington, D.C. The process begins with the model analyzing the query image to determine what additional information is needed. It then performs searches for visually similar images, systematically evaluates the retrieved results, and conducts follow-up searches from different perspectives to verify its findings. This iterative search-and-reasoning approach allows the model to gather comprehensive evidence before arriving at a well-supported conclusion.
Case: Writing Email to a Public Figure
In this case, the model is tasked with composing an email to Abdullah Shahid Sial, a public figure. To accomplish this effectively, the model must gather comprehensive information about him through internet searches, including his social media presence (Twitter), official website, professional background, and other publicly available information sources.
Reasoning Template
{question}Please reason step by step. Output the thinking process within <think> </think> tags and final answer within <answer> </answer> tags.
Search Template
Answer the user's question based on the provided image. Examine the image carefully and identify any recognizable entities, such as faces, objects, locations, events, logos, or text. Determine whether you have sufficient knowledge to confidently recognize the main visual element and answer the user's question. If so, first explain your reasoning, then provide a clear and direct answer.\nIf you are unable to confidently identify the visual element, stop and invoke the image search tool by appending the string <search><img></search> at the end of your response. This will trigger a Google Lens search using the original image to retrieve relevant information that can help you confirm the visual content.\nOnce you have sufficient visual understanding, combine it with the user's question and assess whether you can confidently answer. If so, answer the question directly using your own knowledge. If not, invoke the text search tool by generating a concise and specific query, and output it in the format <text_search>your query here</text_search> at the end of your response. Carefully craft your query to accurately retrieve the information needed to help answer the question. The text search tool will then use Google Search to return relevant information based on your query.\nYou must include your reasoning inside <reason>...</reason> before taking any action, whether it is calling the image search tool, generating a text search query, or providing a final answer. The reasoning may involve analysis of the original image and question, interpretation of search results, or logical steps leading to the final answer.\nAll search results will be placed inside <information> and </information> and returned to you. When you are ready to answer the question, wrap your final answer between <answer> and </answer>, without detailed illustrations. For example: <answer>Titanic</answer>.\nHere is the image and the question:\n<image>{question}
ReACT Template
# System Message You are a helpful assistant. You should strictly follow reason-to-act thinking process to answer user provided question. Namely, you should first analyze the question & observation (e.g., user provided image or search results) and then inform the following action. The thinking process should be within <reason> and </reason> tags. The actions you can choose are:<answer>xxxxx</answer>: which returns the answer within <answer> and </answer> tags, and finishes the task.<search>image</search>: which searches user provided image on Google and returns image-related visual entity/concept/knowledge for further reason-to-act. The search results are placed between <observation> and </observation> tags.<search>text query</search>: which generates a text query and sent to Google and returns some snippets containing the answer for further reason-to-act. The search results are placed between <observation> and </observation> tags. Note that sometimes the snippets do not contain the answer, and some alternative search might be needed. Your output format should be one of the following three formats: <reason> YOUR THINKING PROCESS </reason> <answer> YOUR ANSWER AFTER GETTING ENOUGH INFORMATION </answer> or <reason> YOUR THINKING PROCESS </reason> <search> IMAGE </search> or <reason> YOUR THINKING PROCESS </reason> <search> YOUR GENERATED TEXT QUERY FOR HELPING YOU FIND INFORMATION ON GOOGLE TO ANSWER USER QUESTION </search> Only output the final answer (in words, numbers or phrase) inside the <answer></answer> tags, without any explanations or extra information. If this is a yes-or-no question, you should only answer yes or no.
The SAE project is inspired by a wealth of sparse autoencoder (SAE) work from Anthropic, OpenAI, Google, and the open-source community; SAEs have become a powerful and widely used tool in explainable AI.
This project aims to provide a simple and flexible interface that allows users to inject SAE modules into their models at any layer with minimal effort. We adopt the elegant design of Hugging Face's PEFT and treat SAE training as a form of parameter-efficient tuning: as long as the target is an nn.Module, an SAE can be integrated and trained with only a few lines of code.
Design Philosophy
The code design takes inspiration from PEFT, as we believe SAE shares many structural similarities with PEFT-based methods. By inheriting from a BaseTuner class, we enable seamless SAE integration into existing models.
With this design, injecting an SAE module is as simple as:
```python
import torch
import torch.nn as nn
from peft import inject_adapter_in_model
from sae import TopKSaeConfig, get_peft_sae_model, PeftSaeModel


class DummyModel(nn.Module):
    def __init__(self):
        super(DummyModel, self).__init__()
        self.linear = nn.Linear(10, 10)

    def forward(self, x):
        return self.linear(x)


model = DummyModel()
config = TopKSaeConfig(k=1, num_latents=5, target_modules=["linear"])

# Inject the adapter into the model
model = inject_adapter_in_model(config, model)

# Check if the adapter was injected correctly
result = model(torch.randn(1, 512, 10))
```
You can also obtain a PEFT-wrapped model using the magic function from the PEFT library. The rest of your workflow remains the same:
```python
# Get the PEFT model
peft_model = get_peft_sae_model(model, config)
result = peft_model(torch.randn(1, 512, 10))
```
To ensure consistency in data formatting, we recommend first processing your data and storing it in Parquet format. This standardization simplifies interface development and data preparation.
You are free to customize the preprocessing logic and define keys for different modalities. However, the final output should be compatible with chat templates and our preprocessing pipeline.
An example preprocessing script is available at: examples/data_process/llava_ov_clevr.py
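As a rough illustration only (the column names below are hypothetical; see the example script above for the actual schema), preprocessed samples can be written to Parquet with pandas:

```python
# Write chat-template-compatible samples to Parquet.
import pandas as pd

samples = [
    {
        "id": "sample-0",
        "image_path": "images/000000.png",  # key for the vision modality
        "conversations": [                   # conversation turns for the chat template
            {"role": "user", "content": "<image>\nHow many cubes are there?"},
            {"role": "assistant", "content": "3"},
        ],
    },
]

pd.DataFrame(samples).to_parquet("train.parquet")
```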
If you find this repository useful, please consider checking out our previous paper on applying Sparse Autoencoders (SAE) to Large Multimodal Models, accepted at ICCV 2025.
You can cite our work as follows:
```bibtex
@misc{zhang2024largemultimodalmodelsinterpret,
  title={Large Multi-modal Models Can Interpret Features in Large Multi-modal Models},
  author={Kaichen Zhang and Yifei Shen and Bo Li and Ziwei Liu},
  year={2024},
  eprint={2411.14982},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2411.14982}
}
```
MMSearch-R1 is the first end-to-end RL-based solution designed to equip LMMs with the capability to perform search on demand in real-world internet environments. It outperforms same-sized RAG baselines and approaches the performance of larger models while requiring significantly fewer search calls.
Figure 1: MMSearch-R1 learns to recognize the boundaries of its knowledge and perform on-demand search, significantly reducing the number of searches required while outperforming RAG-based models on knowledge-intensive and info-seeking VQA tasks.
1. Introduction
Scaling up vision-language paired data has become a widely adopted paradigm for Large Multimodal Models (LMMs) to acquire grounded knowledge of the visual world. Although this static training strategy has proven effective, it remains limited in capturing complex and evolving real-world knowledge. In particular, state-of-the-art LMMs continue to struggle with long-tail facts, newly emerging information, and domain-specific content that is often restricted by privacy or copyright constraints. As a result, their performance remains suboptimal on knowledge-intensive and information-seeking visual question answering tasks, frequently generating hallucinated outputs when confronted with inputs beyond their training distribution, such as unfamiliar visual content or previously unseen textual information. This limitation raises important concerns regarding their factual reliability in real-world applications.
Integrating search capabilities into LMMs offers a promising solution to the above limitations. However, existing approaches such as Retrieval-Augmented Generation (RAG) and prompt-based agents remain suboptimal. RAG methods rely on a fixed retrieve-then-generate pipeline grounded in static corpora, often leading to over-retrieval, high computational cost, and the unrealistic assumption that all necessary information is already available. This rigid setup fails to reflect the dynamic and unpredictable nature of real-world scenarios. In contrast, prompt-based agents can access real-time search engines, but their parameters are not optimized through learning, preventing them from truly acquiring effective search behaviors or adapting to open-world environments.
To address these limitations, we aim to train LMMs that can interact with real-world environments and acquire three essential search-related capabilities: (1) when to search, (2) what to search for, and (3) how to reason over search results to answer user queries. Building on these goals, we introduce MMSearch-R1, the first end-to-end reinforcement learning framework designed to empower LMMs with on-demand search capabilities in open, internet-based environments. Our efforts are summarized as follows:
Dataset Construction We propose an automated approach to construct a multimodal search VQA dataset by estimating the model’s familiarity with each question. This enables the generation of search-required and search-free samples, further complemented by manually annotated test data covering diverse knowledge types and difficulty levels.
Multimodal Search Tool Integration We develop a real-world search pipeline combining an image search tool and a text search tool, enabling LMMs to retrieve relevant visual and textual information for unfamiliar inputs.
Wiser Search via Reinforcement Learning We introduce a GRPO-based RL framework that trains LMMs to decide when, what, and how to search. Our method achieves superior performance over RAG-based baselines while reducing search calls by over 30%.
Open-Sourced Dataset and Framework We will release our model, dataset and training framework to support future research in search-augmented multimodal reasoning.
2. Method
2.1. Building Iterative Multimodal Search-Integrated RL Framework
Figure 2: Illustration of training in MMSearch-R1. Top: The GRPO training pipeline integrated with multimodal search tools. Bottom: A detailed view of the rollout process and search tool execution.
We build on veRL and adopt standard GRPO as our base RL algorithm, with modifications that allow search interactions with the real-world environment during the rollout process, as illustrated in Figure 2 and described below.
Multimodal Search Tools We equip the model with two types of search tools to interact with real-world internet content. The first is an image search tool, which takes the input image and returns the top-5 visually similar webpages, each represented by a thumbnail and a title. This enables the model to identify unfamiliar visual entities in the image. The second is a text search pipeline, where the model formulates a query based on the user question, retrieves relevant webpages, and processes their content into concise summaries. This allows the model to acquire textual knowledge needed to answer the question accurately.
Rollout with Multi-turn Multimodal Search The rollout process is designed to be multi-turn and iterative. At each step, the model receives new information, such as the original question or retrieved search results, and performs reasoning based on the accumulated context. It then selects an action from a predefined action space, which includes invoking search tools or answering the question. This process continues until the model generates a final answer or reaches the maximum number of allowed turns. To support this interaction, we define and utilize a set of special tokens to structure the model’s outputs and the environment’s feedback.
Reward Modeling Our reward consists of two components: an accuracy score with search penalty and a format score. For accuracy score, we evaluate model performance using exact string match against the ground truth, assigning a score of 1 for correct answers and 0 otherwise. For correct responses, a penalty factor (between 0 and 1) is applied if any search was used, encouraging the model to rely on internal knowledge and invoke search only when necessary. This design promotes efficient, on-demand search behavior. The format score verifies whether the model follows the required output structure, ensuring compatibility with the environment interface.
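A minimal sketch of this reward is given below; the exact penalty value and the way the two components are combined are assumptions rather than our released implementation:

```python
# Accuracy-with-search-penalty plus format score, as described above.

def mmsearch_reward(prediction: str, ground_truth: str, used_search: bool,
                    format_ok: bool, search_penalty: float = 0.9,
                    format_weight: float = 0.1) -> float:
    # Accuracy: exact string match against the ground truth (1 or 0).
    acc = 1.0 if prediction.strip().lower() == ground_truth.strip().lower() else 0.0
    # Penalize correct answers that relied on search to encourage on-demand use.
    if acc == 1.0 and used_search:
        acc *= search_penalty
    # Format score: did the output follow the required structure (special tokens)?
    fmt = 1.0 if format_ok else 0.0
    return acc + format_weight * fmt
```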
Figure 3: Illustration of data construction process of FVQA dataset: (a). An automated pipeline for visual knowledge-required VQA samples collection; (b). Knowledge taxonomy; (c). Overall pipeline showing the composition and origin of FVQA from various automatic and manually curated sources.
To effectively train models for on-demand search using simple outcome-based reinforcement learning, we require a search-balanced dataset that includes both search-required and search-free questions. This balance allows the model to learn when to rely on internal knowledge and when to invoke external search. We propose three key criteria for such datasets: (1) coverage of both search-required and search-free questions; (2) concise and verifiable answers; (3) diversity in knowledge and difficulty. Following these criteria, we construct a multimodal search VQA dataset, FactualVQA (FVQA), using a combination of automated pipelines and manual annotation.
VQA Collection We first gather a pool of candidate VQA samples requiring either visual or textual knowledge. For visual knowledge, we develop an automated pipeline that collects images related to head and tail visual concepts in the MetaCLIP vocabulary from the internet. Based on these images, we use GPT-4o to generate corresponding questions that assess the model's recognition capabilities. For textual knowledge, we sample questions from the InfoSeek training set. We annotate the knowledge type for each question using GPT-4o and maintain a balanced distribution across categories.
Search Balancing To distinguish between search-required and search-free questions, we use a preliminary model equipped with search capabilities to classify the collected VQA samples. Based on this classification, we construct a search-balanced training set of 5,000 examples, named FVQA-train, which includes approximately 3,400 search-required and 1,600 search-free questions.
Human Annotation Human annotators are involved throughout the data curation process to ensure diversity, authenticity, and label quality—especially for the test set of FVQA.
3. Experimental Findings
We evaluated MMSearch-R1 against both closed-source models (GPT-4o and Gemini 2.5 Pro) and open-source models from the Qwen2.5-VL series on knowledge-intensive and information-seeking VQA tasks (FVQA-test, InfoSeek, MMSearch, SimpleVQA, and LiveVQA). All baseline models are tasked with solving VQA problems in two different workflows. (1) Direct Answer: Models are prompted to directly generate a short and precise answer without accessing external information. (2) Answer under RAG Workflow: In this workflow, models are required to perform exactly two search operations using our multimodal search tools for each VQA example, first performing an image search and then a text search. Specifically, given an input image and question, the model is provided with the image search results and the original question in the first round and is prompted to generate a text query to assist in answering. In the second round, the retrieved results based on the text query are fed into the model, and the model is asked to produce the final answer. Under a fixed budget of search steps, the RAG workflow typically exposes the model to more external information compared to the on-demand search strategy.
Table 1: Performance of MMSearch-R1 across benchmarks. "Acc (%)" denotes the accuracy evaluated by LLM-as-Judge, while "SR (%)" represents the search ratio, defined as the percentage of total search calls made relative to the maximum allowed search steps for each method.
Finding 1: RL training enables models to better recognize the boundaries of their knowledge and perform on-demand search more effectively. As shown in Table 1, MMSearch-R1-7B outperforms same-sized RAG-based models by an average of 3% in accuracy while reducing the average search rate by 32.9%, across both in-domain and out-of-domain test sets. This demonstrates that our RL-trained model achieves higher correctness with fewer search calls, indicating more efficient and selective use of external information.
Figure 4: (a). Performance comparison between the Base model and the RL-trained model under the RAG workflow. (b). Answer behavior breakdown of Base (inner circle) and RL (outer circle) models in InfoSeek and SimpleVQA.
Finding 2: RL training enhances the model’s ability to generate effective text queries and summarize retrieved information. To evaluate the abilities of query generation and information summarization, we follow a fixed RAG setup where both image and text search are executed for every question. This isolates the model’s ability to interact with retrieved information. As shown in Figure 4(a), MMSearch-R1-7B consistently outperforms the base model on both in-domain and out-of-domain tasks.
Finding 3: RL improves the model’s ability to utilize its internal knowledge. As shown in Figure 4(b), there is a clear upward trend in the Correct without Search proportion from the base model to the RL-trained model. These gains indicate that the RL-trained model can answer substantially more questions correctly without invoking the search tool, demonstrating improved recall and reasoning based on its internal knowledge.
Figure 5: (a). Performance improvements of SFT and RL over Base across five VQA datasets. (b). Training dynamics of reward and search ratio for different strategies.
Finding 4: RL achieves greater performance improvements and exhibits higher data efficiency compared to supervised SFT. We distill GPT-4o’s behavior on our collected VQA samples to construct SFT data, and fine-tune Qwen2.5-VL-7B on it. This serves as a supervised learning baseline for comparison against our reinforcement learning-trained model. As shown in Figure 5(a), the results show that the model trained with RL consistently outperforms the one trained with SFT across all tasks, despite being trained on only about half as much data.
Finding 5: Training with balanced data and a search penalty in the reward effectively guide the model to perform on-demand search. Figure 5(b) illustrates the training dynamics of reward and search ratio during reinforcement learning. Removing either the search penalty or data balancing leads to distinct trade-offs. Although both ablated variants achieve slightly higher rewards, they do so at the cost of overusing the search tool, with search ratios rapidly converging to nearly 100%.
4. Conclusion
MMSearch-R1 learns to recognize knowledge gaps, selectively invoke image or text search, and reason over retrieved content. It outperforms same-sized RAG baselines and approaches the performance of larger models while requiring significantly fewer search calls. Our framework, dataset, and findings offer practical insights into training LMMs with real-world interaction capabilities and lay the groundwork for building multimodal agents that are both adaptive and interactive. We look forward to the next major advancement in multimodal intelligence emerging as models increasingly engage with and explore the real world through more tools, further evolving their reasoning and adaptive capabilities.
```bibtex
@article{wu2025searchr1,
  title={Search-R1: A Multimodal Search-Augmented Reinforcement Learning Framework for LMMs},
  author={Wu, Jinming and Deng, Zihao and Li, Wei and Liu, Yiding and You, Bo and Li, Bo and Ma, Zejun},
  url={https://github.com/EvolvingLMMs-Lab/multimodal-search-r1},
  year={2025}
}
```
SOTA large multimodal model (LMM) architectures, such as Qwen2.5-VL, typically build on a powerful large language model (LLM) (e.g., Qwen2.5) integrated with an external Native Resolution Vision Transformer (NaViT). This approach presents challenges in high-resolution real-world scenarios, as such inputs are converted into enormous numbers of visual tokens, many of which are irrelevant to the downstream task. By comparison, when processing high-resolution real-world scenes, the human visual system employs task-driven visual search strategies to ground and scrutinize critical regions of interest. Motivated by this biological mechanism, we attempt to equip LMMs with similar visual search capabilities by leveraging visual grounding to focus on key image regions.
However, empowering LMMs with such grounding-based visual reasoning capabilities is non-trivial, primarily due to the scarcity and high cost of obtaining grounding annotations for standard visual-question-answering (VQA) datasets, which are required for constructing multi-turn grounding-based conversation data for supervised fine-tuning (SFT). In this paper, we highlight that accurate grounding behavior can emerge within a reinforcement learning (RL) paradigm, even when training supervision is provided solely through a binary reward function derived from the correctness of the final answer.
To this end, we introduce Multi-turn Grounding-based Policy Optimization (MGPO), a reinforcement learning (RL) algorithm that enables LMMs to iteratively focus on key image regions by automatically cropping sub-images based on model-predicted grounding coordinates within a multi-turn conversation framework. Given a high-resolution image and a question, the model first predicts the coordinates of key regions relevant to the query. An image cropping function is then triggered to extract and return the corresponding sub-image. In subsequent turns, the model can integrate previous in-context conversations (including both the original image and cropped sub-images) to solve the question.
Figure 1: Examples of models trained with multi-turn grounding-based RL on high-resolution real-world tasks. The model first identifies key regions, which are then automatically cropped and returned as sub-images. Notably, despite only a binary reward function derived from the correctness of the final answer, the model gradually develops robust grounding capability throughout the RL process. The conversation in the figure shows only the key parts; the full conversation is provided in Figure 9.
In summary, MGPO mainly offers the following advantages:
Top-down and Interpretable Visual Reasoning. MGPO equips LMMs with a top-down, question-driven visual search mechanism for high-resolution scenarios and provides interpretable outputs that indicate which image regions are attended to throughout the reasoning process.
Overcomes Maximum Pixel Constraints. MGPO overcomes the maximum pixel limitation of LMMs. As shown in the first example of Figure 1, even when resizing a high-resolution image within pixel limits results in a blurred input, the model can still identify relevant coordinates and crop clear sub-images from the original input for further analysis.
Without Additional Grounding Annotations. MGPO can be post-trained directly on standard VQA datasets without the need for extra grounding annotations, and experimental results demonstrate substantial improvements in intermediate grounding performance compared to GRPO.
Ultimately, we use MGPO to post-train Qwen2.5-VL-7B on visual question answering data with short answers, yet the model achieves strong intermediate grounding performance without requiring grounding annotations (examples shown in Figure 1). Compared to GRPO, MGPO yields a 5.4% improvement on the in-distribution MME-Realworld benchmark and a 5.2% gain on the challenging out-of-distribution V* Bench. Notably, with only 21K post-training samples, our model surpasses OpenAI’s o1 and GPT-4o models on the OOD V* Bench.
2. Multi-turn Grounding-Based RL
Figure 2 illustrates a comparison of different post-training paradigms for LMMs. In MGPO, the model operates over K sequential interactions, dynamically grounding and reasoning by conditioning on the full history of visual and textual context at each step.
Figure 2: Comparison of different post-training paradigms for LMMs. Our MGPO automatically crops and returns sub-image to the model based on its predicted grounding coordinates, enabling the model to iteratively focus on key regions and effectively solve high-resolution visual tasks.
Multi-turn Template without Cold Start. In practice, we observe that LMMs struggle to autonomously generate grounding coordinates during the rollout process, which hinders effective multi-turn RL. To address this, we design a fixed two-turn dialogue template, as shown in Figure 3, to explicitly activate the model’s grounding and reasoning abilities.
Grounding Key Visual Areas. Within the two-turn MGPO framework, the extraction of sub-images is performed with respect to the original high-resolution image. Since the grounding coordinates predicted by Qwen2.5-VL are inherently dependent on the resolution of the input image, it is necessary to normalize the predicted coordinates by the input image dimensions and subsequently map them back to the coordinate space of the original image. This normalization procedure is particularly crucial when the original image resolution exceeds the maximum pixel limit of the LMM, as it enables the model to access higher-fidelity sub-images for processing. An illustration of this process is provided in Figure 4.
Figure 4: An illustration of cropping a sub-image based on grounding coordinates.
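A simplified sketch of this coordinate mapping and cropping step (the actual implementation in MGPO may differ in details such as rounding and clamping):

```python
# Map grounding coordinates predicted on the resized model input back to the
# original high-resolution image, then crop the corresponding sub-image.
from PIL import Image

def crop_from_prediction(original: Image.Image, input_w: int, input_h: int,
                         box: tuple) -> Image.Image:
    """`box` is (x1, y1, x2, y2) in the coordinate space of the resized input."""
    orig_w, orig_h = original.size
    x1, y1, x2, y2 = box
    # Normalize by the resized input dimensions, then rescale to the original.
    scale_x, scale_y = orig_w / input_w, orig_h / input_h
    return original.crop((x1 * scale_x, y1 * scale_y, x2 * scale_x, y2 * scale_y))
```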
3. Experiments
3.1 Datasets & Metrics
To evaluate the effectiveness of our approach, experiments are conducted on two established datasets: MME-Realworld and V* Bench. Both datasets are specifically designed to evaluate the capabilities of LMMs in analyzing high-resolution images and capturing fine-grained visual information.
MME-Realworld. The MME-Realworld dataset comprises a diverse array of tasks, which are systematically categorized into perception and reasoning domains. For in-distribution evaluation, the lite subset of MME-Realworld, consisting of 1,919 samples, is reserved as the test set, while the remaining 21,690 samples are utilized for training.
V* Bench. V* Bench serves as an out-of-distribution benchmark that focuses on detailed visual grounding in high-resolution images. This vision-centric benchmark requires LMMs to accurately localize and interpret specific visual information, and it has also been adopted by OpenAI to assess the visual reasoning capabilities of their latest o3 and o4-mini models. The benchmark contains 191 test samples.
All datasets employ the multiple-choice question format, and model performance is consistently measured by accuracy on both the in-distribution (MME-Realworld) and out-of-distribution (V* Bench) test sets. Figure 5 illustrates the distribution of image resolutions across different datasets.
Figure 5: Distribution of image resolutions (width × height) across different datasets.
3.2 Experimental Setup
We employ the verl framework to enable distributed training across multiple machines and GPUs, and utilize vLLM to accelerate inference during the rollout phase. For reinforcement learning, we adopt the naive GRPO algorithm as the RL baseline, where a post-prompt is added: “{question}\nOutput the coordinates of the key image area relevant to the problem in JSON format. And put the answer letter (A, B, C, D, or E) within \boxed{}.” Both GRPO and our proposed MGPO leverage a binary accuracy reward function, assigning a reward of 1 if the final multiple-choice answer is correct and 0 otherwise.
All experiments are conducted using the Qwen2.5-VL-7B model. To prevent out-of-memory errors, the maximum number of input image pixels is limited to 1,003,520 (1280 × 28 × 28), corresponding to a maximum of 1280 visual tokens per image. Images exceeding this pixel threshold are resized to comply with this constraint.
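The resize rule implied by this cap can be sketched as follows (the actual Qwen2.5-VL processor applies additional rounding to patch multiples, which we omit here):

```python
# Scale an image down so that width * height stays under the pixel cap.
import math

MAX_PIXELS = 1280 * 28 * 28  # 1,003,520 pixels, i.e., at most 1280 visual tokens

def fit_under_pixel_limit(width: int, height: int):
    if width * height <= MAX_PIXELS:
        return width, height
    scale = math.sqrt(MAX_PIXELS / (width * height))
    return int(width * scale), int(height * scale)

print(fit_under_pixel_limit(4000, 3000))  # a 12 MP image is scaled to ~(1156, 867)
```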
3.3 Main Results
Table 1 presents the performance comparison of different post-training paradigms on Qwen2.5-VL-7B, including SFT, GRPO, and our MGPO. All three post-training methods substantially improve the model’s performance on high-resolution visual tasks, as measured by both the OOD V* Bench and the ID MME-Realworld benchmarks.
Notably, we observe that GRPO does not yield significant improvements over SFT, which contrasts with conclusions drawn from prior work on multi-modal mathematical tasks. We hypothesize that, for high-resolution vision-centric tasks, the primary challenge lies in enabling the model to perceive fine-grained image details, rather than performing complex, lengthy reasoning.
In contrast, our MGPO algorithm achieves remarkable gains, outperforming both SFT and GRPO. Specifically, MGPO delivers a substantial 5.2% absolute improvement over the GRPO baseline on the V* Bench (OOD) benchmark, and a 5.4% gain in overall MME-Realworld (ID) performance. These results demonstrate the effectiveness of multi-turn grounding and iterative sub-image cropping in addressing the challenges of high-resolution visual understanding.
Additionally, we compare our results with OpenAI’s o1 and GPT-4o models. To ensure a fair comparison, we report only the OOD V* Bench results. Notably, our MGPO post-trained model surpasses both o1 and GPT-4o, despite being based on a 7B model and trained with a small-scale dataset of 21k samples.
Table 1: Performance comparison of different post-training paradigms for LMMs. V* Bench serves as an out-of-distribution evaluation, while MME-Realworld serves as an in-distribution evaluation. Abbreviations: OCR—Optical Character Recognition in the wild; RS—Remote Sensing; DT—Diagram and Table; MO—Video Monitoring; AD—Autonomous Driving.
Figure 6 illustrates the comparative performance trajectories of MGPO and GRPO on the V* Bench throughout the RL training process. As training progresses, MGPO consistently surpasses GRPO, highlighting its superior capacity to address high-resolution scenarios that remain unresolved by GRPO.
Figure 6: Performance comparison of V* Bench between MGPO and GRPO.
Effect of LMM Maximum Input Image Resolution. Table 2 compares the impact of varying maximum input image resolutions for LMMs. We observe that MGPO yields greater performance improvements on V* Bench when the maximum input pixel limit is lower. This is because, when high-resolution images are aggressively resized, many tasks become harder to solve directly; MGPO, however, can first identify key regions and crop clearer sub-images from the original image, thereby facilitating more effective task completion.
Table 2: Performance comparison of various post-training paradigms for LMMs under different maximum input image resolutions.
4. Grounding-based RL without Grounding Annotations
In this section, we highlight the insight that it is feasible to train powerful grounding-based RL models even without grounding annotations. This insight broadens the applicability of grounding-based RL paradigms, as obtaining high-quality grounding annotations is often expensive and labor-intensive.
4.1 Emergent Grounding Ability During RL Training
To assess whether models can develop accurate grounding capabilities in the absence of grounding supervision, we analyze the proportion of rollouts that generate valid grounding coordinates during RL training (e.g., coordinates that fall within the input image boundaries). Figure 7 illustrates the comparison between GRPO and MGPO. For GRPO, the ratio of valid grounding coordinates remains low and exhibits minimal improvement throughout training, indicating that the model struggles to ground the correct image regions. In contrast, MGPO demonstrates a clear upward trajectory, with the proportion of valid grounding coordinates steadily increasing as training progresses.
Figure 7: The ratio of valid grounding coordinates during RL rollouts.
Additionally, we evaluate whether the grounding sub-images from the test set can be directly used to answer the question using Qwen2.5-VL-7B. As presented in Table 3, the comparative results across different methods demonstrate the superior accuracy of grounding achieved by MGPO. In the second stage of MGPO, the model is provided with either the cropped subimage or the original image, without any auxiliary reward for generating valid sub-image coordinates. Notably, the model autonomously increases the proportion of valid grounding coordinates, suggesting that it is capable of learning to localize key regions and utilize subimages to improve question answering performance.
Table 3: Ratio of grounding subimages that can directly answer the question using Qwen2.5-VL-7B on the V* Bench.
4.2 Further Experiments on Image Counting Tasks
To further substantiate this insight, we conduct additional experiments on the image counting task, leveraging the fact that the counting data provides both grounding annotations (in point format) and the corresponding count as the final answer. Specifically, we randomly sample 3,000 instances from the Pixmo-Points dataset for post-training. Pixmo-Count is used as the in-distribution (ID) evaluation benchmark, while FSC-147 serves as the out-of-distribution (OOD) benchmark.
During GRPO post-training, the model is prompted to first ground (point to) each object in the image and subsequently provide the total count. We compare two reward functions: (1) a binary accuracy reward based solely on the correctness of the final count, and (2) the same reward with an additional point reward. The point reward is computed by matching the model’s predicted point list with the ground-truth point list using the Hungarian algorithm, such that a higher match ratio results in a higher reward.
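A sketch of this point reward using Hungarian matching is shown below; the distance threshold is an assumption, since the text only specifies that a higher match ratio yields a higher reward:

```python
# Point reward: Hungarian matching between predicted and ground-truth points.
import numpy as np
from scipy.optimize import linear_sum_assignment

def point_reward(pred_points, gt_points, threshold: float = 20.0) -> float:
    """pred_points and gt_points are (N, 2) and (M, 2) arrays of pixel coordinates."""
    pred, gt = np.asarray(pred_points, float), np.asarray(gt_points, float)
    if len(pred) == 0 or len(gt) == 0:
        return 0.0
    # Pairwise Euclidean distances, then minimum-cost one-to-one matching.
    cost = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)
    rows, cols = linear_sum_assignment(cost)
    matched = (cost[rows, cols] < threshold).sum()
    return float(matched) / max(len(pred), len(gt))
```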
The results, summarized in Table 4, indicate that introducing the additional point reward does not yield significant performance improvements. We further visualize the outputs of the GRPO model trained solely with the accuracy reward (see Figure 8), and observe that the model is capable of accurately localizing object points even without explicit grounding supervision. These results support our conclusion that explicit grounding annotations are not necessary for effective RL-based learning, as the model inherently learns to perform precise grounding as a prerequisite for solving the counting task.
Table 4: Performance comparison on the image counting task. The additional point reward does not lead to significant performance improvements.
Figure 8: Visualization of point predictions from the GRPO model trained with only accuracy reward.
5. Limitation
All MGPO experiments are conducted using a fixed two-turn template, rather than allowing the model to autonomously decide when to perform image cropping based on the input question, as illustrated by the latest OpenAI models such as o3 and o4-mini. This limitation stems from our observation that Qwen2.5-VL, when directly subjected to RL post-training, struggles to generate grounding coordinates without explicit prompt guidance.
Nevertheless, we believe that our trained models can be leveraged to generate high-quality chain-of-thought (CoT) data for subsequent SFT. Adopting a multi-stage training strategy that combines SFT and RL, as in DeepSeek-R1, may ultimately enable the model to autonomously decide when and how to perform grounding. We leave this direction for future work.
If you find our work to be useful for your research, please consider citing.
```bibtex
@article{huang2025highres,
  title={High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning},
  author={Huang, Xinyu and Dong, Yuhao and Li, Wei and Wu, Jinming and Deng, Zihao and Li, Bo and Ma, Zejun},
  url={https://github.com/EvolvingLMMs-Lab/MGPO},
  year={2025}
}
```
Appendix
Figure 9: A full conversation example of MGPO post-trained model on high-resolution image tasks.
Aero-1-Audio is a compact audio model adept at various audio tasks, including speech recognition, audio understanding, and following audio instructions. It is part of the Aero-1 series, the first generation of lightweight multimodal models developed by LMMs-Lab, with future expansions planned across additional modalities.
Built upon the Qwen-2.5-1.5B language model, Aero delivers strong performance across multiple audio benchmarks while remaining parameter-efficient, even compared with larger advanced models such as Whisper, Qwen-2-Audio, and Phi-4-Multimodal, or commercial services like ElevenLabs Scribe.
Aero is trained in one day on 16 H100 GPUs using just 50k hours of audio data. This suggests that audio model training can be sample-efficient when the data is high quality and carefully filtered.
Aero can accurately perform ASR and audio understanding on continuous audio inputs up to 15 minutes in length, a scenario that we find still challenges other models.
ASR & Audio Understanding Performance
We evaluate our model’s performance across multiple dimensions and benchmarks. Let’s first take a look at its overall performance compared with other models.
Our model achieves a balance between performance and parameter efficiency. We evaluate it across multiple ASR and audio understanding benchmarks. On ASR tasks, our model attains the lowest WER scores on datasets such as AMI, LibriSpeech, and SPGISpeech. It also demonstrates strong audio understanding capabilities on various comprehension benchmarks. As illustrated in the plotted graph, our model falls within the highlighted triangular region that represents an optimal trade-off between parameter efficiency and performance.
Data Distribution
We present the contributions of our data mixture here. Our SFT data mixture includes over 20 publicly available datasets, and comparisons with other models highlight the data’s lightweight nature.
*The hours of some training datasets are estimated and may not be fully accurate
One of the key strengths of our training recipe lies in the quality and quantity of our data. Our training dataset consists of approximately 5 billion tokens, corresponding to around 50,000 hours of audio. Compared to models such as Qwen-Omni and Phi-4, our dataset is over 100 times smaller, yet our model achieves competitive performance. All data is sourced from publicly available open-source datasets, highlighting the sample efficiency of our training approach. A detailed breakdown of our data distribution is provided below, along with comparisons to other models.
What’s insightful
In this release, our primary focus is on developing an audio model capable of handling multiple audio tasks. The following examples showcase its core abilities across tasks such as audio understanding and speech recognition. Most notably, we highlight the model’s capability to perform long-form ASR, as demonstrated in the example below.
Long ASR
A common approach for current long-form ASR tasks is to split the audio into smaller, processable chunks and perform ASR on each segment individually. However, with the advancement of large language models (LLMs), long-context understanding has become increasingly important. We argue that a model’s ability to process long audio sequences continuously is essential for effective audio understanding and should be considered a critical capability. To demonstrate this, we set up a simple use case using examples from an NVIDIA conference and calculate the WER with respect to the auto-generated YouTube subtitles.
The image above presents a heatmap comparison of different models performing ASR tasks on a video with varying audio input lengths. As shown in the heatmap, Qwen-Omni and Phi-4 exhibit instability across different lengths and do not consistently produce the desired output.
Note: The ground truth is derived from the auto-generated subtitles downloaded from YouTube. Therefore, the WER does not necessarily imply that our model achieves perfect results, but rather demonstrates that our model is comparable to the YouTube ASR pipeline.
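For reference, a WER of this kind can be computed with the open-source jiwer package; the sketch below is illustrative only, with hypothetical file names for the YouTube auto-captions and the model transcript.

```python
# Rough sketch: WER of a model transcript against YouTube auto-generated subtitles.
import re
import jiwer

def normalize(text: str) -> str:
    # Lowercase and strip punctuation so formatting differences do not inflate WER.
    return re.sub(r"[^a-z0-9' ]+", " ", text.lower()).strip()

reference = normalize(open("youtube_auto_subtitles.txt").read())   # hypothetical path
hypothesis = normalize(open("model_transcript.txt").read())        # hypothetical path
print(f"WER: {jiwer.wer(reference, hypothesis):.3f}")
```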
Model’s Output
Qwen Omni (12 minutes chunk)
When processing the audio in 12-minute chunks, Qwen-Omni failed to recognize the full speech content and was only able to capture portions of the audio.
that’s like what’s going on why does itfocused on um ai and parallel parallelizable workloads but it’s still general to an extent it’s not as use case specific as something like grock with a queue that’s really designed to you know spit out tokens as fast as possible and that like is a goldilocks zone where it’s flexible enough to handle different workloads but not um but still much faster than um a traditional cpu and that google is one of the only companies that has a scaled internal custom silicon effort
Phi-4-Multimodal (full chunk)
When processing the full audio without splitting, the Phi-4-Multimodal model began to ignore the instructions and instead generated an overall summary of the audio.
The conversation covered Nvidia’s focus on inference over training, the partnership with GM, the release of GUT-N1 for humanoid robotics, and the impact of China’s AI initiatives on global chip demand.
Aero (full chunk)
Aero Audio is able to generate the complete ASR output and accurately identify the full transcript.
Welcome to the brainstorm episode eighty two frank downing joining us recap of nvidia’s gtc conference that is the gpu technology conference frank what happened what were the big takeaways i on my side i saw a gm and in video partnering but we can circle back to that what was
…
right nice timing good timing all right we’ll see everyone next week see everyone thank you
Results on LibriSpeech Unchunked
In its original release, LibriSpeech splits long recordings into smaller chunks, and the overall Word Error Rate (WER) is calculated over these segmented samples. However, it is straightforward to concatenate the chunks back into their original form, thereby creating a simple long-form speech recognition benchmark. We evaluated various models on this benchmark and found that their performance generally declined compared to their results on the shorter samples. Among the models tested, our model achieved the best performance, showing the smallest drop in accuracy relative to the chunked version.
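As a rough illustration (not the exact script we used, and assuming the standard LibriSpeech layout where each chapter directory holds utterances named <speaker>-<chapter>-<index>.flac), the long-form audio can be rebuilt by concatenating each chapter's chunks in order:

```python
# Illustrative sketch: rebuild a long-form recording from a LibriSpeech chapter
# by concatenating its utterance chunks in order. Assumes soundfile and numpy.
import glob
import numpy as np
import soundfile as sf

def concat_chapter(chapter_dir: str, out_path: str) -> None:
    # Utterances are named <speaker>-<chapter>-<index>.flac; sorting restores order.
    files = sorted(glob.glob(f"{chapter_dir}/*.flac"))
    chunks, sr = [], None
    for f in files:
        data, sr = sf.read(f)
        chunks.append(data)
    sf.write(out_path, np.concatenate(chunks), sr)
```

The per-utterance transcripts can be joined in the same order to form the long-form reference text.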
| Model | LS.Clean | LS.Other | LS.Clean (Long) | LS.Other (Long) | Total WER Diff |
| --- | --- | --- | --- | --- | --- |
| Phi-4 | 1.68 | 3.83 | 11.51 | 24.72 | 30.72 |
| Qwen2-Audio-Instruct | 3.59 | 7.46 | 93.01 | 93.63 | 175.59 |
| Qwen2.5-Omni | 1.80 | 3.40 | 13.03 | 13.29 | 21.12 |
| Aero-1-Audio | 1.49 | 3.17 | 5.31 | 11.71 | 12.36 |
We present the evaluation of various models on the unchunked LibriSpeech dataset. The last column reports the total WER difference, obtained by summing the WER increases on the clean and other splits when moving from the chunked to the long-form version. All models show some degradation when handling longer audio, whereas our model exhibits the smallest performance drop.
Evaluation Results
We now present the full evaluation results with the corresponding scores.
We evaluate our model on AMI, Earnings22, LibriSpeech, SPGISpeech, and TedLium. Our model achieves the second-best WER score compared to other models, while maintaining a small and efficient size.
Audio Understanding Results
We then test our model's understanding ability across three dimensions: Audio Analysis and Understanding, Speech Instruction, and Audio Scene Understanding.
We conducted evaluations on AIR-Bench-Chat and MMAU for audio analysis and understanding. Our model achieved an average score of 5.35, outperforming Mini-Omni2 and Vita. For Audio Instruction Following, we evaluated on OpenHermes and Alpaca-Audio, following the same pipeline as AudioBench. Our model demonstrates a strong ability to understand instructions in speech and provide correct responses. Additionally, when evaluated on AIR-Bench-Foundation for Audio Scene Understanding, our model outperformed Phi-4-Multimodal in the sound and music dimensions. Overall, the average score of our model indicates strong performance relative to other models with larger parameter sizes.
Training Techniques
Dynamic Batch Size
We implemented a dynamic batching strategy based on the estimated token length to control the batch size per device. In many cases, using a fixed batch size requires setting it conservatively small to avoid out-of-memory (OOM) errors on longer samples, which leads to underutilization of computing resources. To address this, we group samples into batches such that the total token length stays within a predefined threshold, thereby minimizing computational waste and improving efficiency.
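A minimal sketch of this strategy is shown below; the helper and field names are illustrative rather than taken from our codebase.

```python
# Minimal sketch of length-aware dynamic batching (illustrative helper names).
from typing import Dict, Iterable, Iterator, List

def dynamic_batches(samples: Iterable[Dict], max_tokens: int = 32768) -> Iterator[List[Dict]]:
    """Group samples so the total estimated token length stays within max_tokens."""
    batch: List[Dict] = []
    total = 0
    for sample in samples:
        n = sample["est_num_tokens"]          # precomputed estimate (audio + text tokens)
        if batch and total + n > max_tokens:  # close the batch before it would overflow
            yield batch
            batch, total = [], 0
        batch.append(sample)
        total += n
    if batch:
        yield batch
```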
Sequence Packing
To further optimize dynamic batching, we implemented sequence packing for both the audio encoder and the language model, enabling larger batch sizes and faster training. This operation was then fused with the Liger kernel to achieve even higher throughput and lower memory usage. With a fixed packing length of 4096 to regulate the dynamic batch size, the average Model FLOP Utilization (MFU) was limited to 0.03. However, with sequence packing enabled, the average MFU increased to approximately 0.34, demonstrating a significant improvement in training efficiency.
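The sketch below illustrates only the packing step, under simplified assumptions (it omits the audio encoder, the attention masking, and the Liger-kernel fusion): short token sequences are concatenated into fixed-length rows, and per-sample boundaries are recorded so attention can later be restricted to each sample.

```python
# Illustrative sketch of sequence packing: concatenate short token sequences into
# packed rows of length pack_len and record per-sample boundaries for masking.
from typing import List, Tuple
import torch

def pack_sequences(token_seqs: List[List[int]], pack_len: int = 32768,
                   pad_id: int = 0) -> Tuple[torch.Tensor, List[List[int]]]:
    packed, boundaries = [], []
    cur, cuts = [], [0]
    for seq in token_seqs:
        seq = seq[:pack_len]                       # truncate overly long samples for simplicity
        if cur and len(cur) + len(seq) > pack_len:
            packed.append(cur + [pad_id] * (pack_len - len(cur)))
            boundaries.append(cuts)
            cur, cuts = [], [0]
        cur.extend(seq)
        cuts.append(len(cur))
    if cur:
        packed.append(cur + [pad_id] * (pack_len - len(cur)))
        boundaries.append(cuts)
    # boundaries[i] holds the cut points of row i, used to build a block-diagonal attention mask
    return torch.tensor(packed, dtype=torch.long), boundaries
```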
| Packing Length | Sequence Packing | Num GPUs | Avg MFU | ZeRO | OOM |
| --- | --- | --- | --- | --- | --- |
| 4096 | False | 64 | 0.03 | 2 | No |
| 32768 | False | 64 | N/A | 2 | Yes |
| 32768 | True | 32 | 0.34 | 2 | No |
We tested our implementation under different settings to demonstrate its efficiency.
@article{li2025aero,
  title  = {Aero: Audio-enhanced large language models},
  author = {Li, Bo and Chen Change Loy and Pu Fanyi and Yang Jingkang and Zhang Kaichen and Hu Kairui and Thang Luu Minh and Trung Nguyen Quang and Cong Pham Ba and Liu Shuai and Wang Yezhen and Liu Ziwei},
  url    = {https://www.lmms-lab.com/posts/aero_audio/},
  year   = {2025}
}
We introduce EgoLife, a project to develop an egocentric life assistant that accompanies and enhances personal efficiency through AI-powered wearable glasses 👓. To lay the foundation for this assistant, we conducted a comprehensive data collection study where six participants lived together for one week, continuously recording their daily activities—including discussions 💬, shopping 🛍️, cooking 🍳, socializing 👥, and entertainment 🎮 - using AI glasses for multimodal egocentric video capture, along with synchronized third-person-view video references. This effort resulted in the EgoLife Dataset 📖, a comprehensive 300-hour egocentric, interpersonal, multiview, and multimodal daily life dataset with intensive annotation. Leveraging this dataset, we introduce EgoLifeQA❓, a suite of 3K long-context, life-oriented question-answering tasks designed to provide meaningful assistance in daily life by addressing practical questions such as recalling past relevant events, monitoring health habits, and offering personalized recommendations.
To address the key technical challenges of 1) developing robust visual-audio models for egocentric data, 2) enabling identity recognition, and 3) facilitating long-context question answering over extensive temporal information, we introduce EgoButler 🫡, an integrated system comprising EgoGPT 🧠 and EgoRAG 🔍. EgoGPT is a vision-language model trained on egocentric datasets, achieving state-of-the-art performance on egocentric video understanding. EgoRAG is a retrieval-based component that supports answering ultra-long-context questions. Our experimental studies verify their working mechanisms and reveal critical factors and bottlenecks, guiding future improvements. By releasing our datasets, models, and benchmarks, we aim to stimulate further research in egocentric AI assistants.
Video-MMMU asks a fundamental question:
If a model ‘goes to class,’ can the model learn from the lecture and apply what it learned to MMMU-style exam problems?
Motivation
Our original goal was to build a video reasoning benchmark, motivated by the observation that the most demanding forms of reasoning arise in academic settings—for example, MMMU-style university exam questions.
Online lectures create an ideal environment for evaluating video reasoning. They effectively convey knowledge and naturally test a model's ability to learn from video, thanks to key attributes such as their temporal structure (concepts unfolding over time).
These properties make reasoning from lecture video notably harder. This leads to our core question: When a model watches an online lecture, can it learn like a student—understand the content, acquire the knowledge, and then solve related problems?
Therefore, we introduce Video-MMMU, a video reasoning benchmark that evaluates knowledge acquisition from video.
We introduce Video-MMMU, a multi-modal, multi-disciplinary, multi-track benchmark designed to evaluate how effectively large multimodal models (LMMs) acquire knowledge from educational videos.
1) Video: Knowledge Source
Traditional VideoQA benchmarks focus on scene understanding. Video-MMMU treats video as a source of knowledge, evaluating whether LMMs can actually learn from instructional content. Video-MMMU includes 300 college-level, lecture-style videos across 30 subjects in 6 disciplines: Art, Business, Science, Medicine, Humanities, and Engineering.
2) QA Design: Three Stages of Knowledge Acquisition
Each video is paired with three questions, designed to reflect a progression in knowledge acquisition:
Perception – Identifying relevant surface information
Comprehension – Understanding underlying concepts or strategies
Adaptation – Applying learned knowledge to new scenarios
Figure: Adaptation examples include case analysis (Medicine) and strategy adaptation (Engineering).
3) In-Context Knowledge Acquisition from Video: Can Models Learn Like Humans?
Humans consistently learn from the world around them. For models to operate effectively in real-world environments, the same principle should apply: they must be able to learn from the world, because unlike humans, they cannot be endlessly re-trained after deployment. In this sense, videos provide a natural proxy for the world. For a model, the video becomes its world. The ability to learn from video therefore becomes more than a technical benchmark—it is a measure of true, dynamic intelligence. It marks the shift from simply solving a task to demonstrating the ability to learn how to solve the task.
4) Metric: From Absolute Accuracy to Learning Efficiency (Δknowledge)
Following point 3, a core innovation in Video-MMMU is its shift—from measuring only final performance to measuring learning.
A model may initially fail to solve an MMMU-style exam question, but we give the model a video where a human learner could learn to solve the question by watching the video. Video-MMMU tests how well LMMs improve their performance after watching the videos. Video-MMMU introduces Δknowledge to quantify the model’s learning gain from the videos. Δknowledge is defined as the normalized performance gain on the Adaptation track questions:
1. Initial Test: The model attempts to answer a question *without* seeing the video.
2. Re-Test after video viewing: We provide the corresponding lecture video. The model is asked the same question again.
3. Performance Gain: If the model succeeds after watching, it demonstrates successful knowledge acquisition from video.
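Written out (a paraphrase of the paper's definition, where Acc denotes accuracy on the Adaptation-track questions before and after watching the video):

$$
\Delta_{\text{knowledge}} = \frac{\mathrm{Acc}_{\text{after}} - \mathrm{Acc}_{\text{before}}}{100\% - \mathrm{Acc}_{\text{before}}} \times 100\%
$$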
This setup mirrors a human’s natural educational process:
Don’t know → Learn by watching → Apply the knowledge
Key Insights
Progressive Performance Decline. Model performance decreases as cognitive demands increase. While models perform relatively better on Perception tasks, accuracy drops on Comprehension and declines further on Adaptation.
Knowledge Acquisition from Videos is Challenging. The Δknowledge metric reveals a significant human–model gap. Humans show substantial improvement (e.g., Δknowledge ≈ 33.1%), whereas top-performing models show smaller gains (e.g., GPT-4o: 15.6%, Claude-3.5-Sonnet: 11.4%). This highlights a current limitation: LMMs still struggle to learn from videos in the way humans do.
Evaluation
Please refer to our Code@Github for full evaluation instructions.
Case Study
We provide two case studies. Fig. 5 demonstrates a method adaptation error, in which the model failed to adapt the method from video to solve the Adaptation question.
Fig. 6 shows a successful case of learning from video, turning an initially wrong answer into a correct one.
@article{hu2025videommmu,
  title   = {Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos},
  author  = {Kairui Hu and Penghao Wu and Fanyi Pu and Wang Xiao and Yuanhan Zhang and Xiang Yue and Bo Li and Ziwei Liu},
  journal = {arXiv preprint arXiv:2501.13826},
  year    = {2025},
  url     = {https://arxiv.org/abs/2501.13826}
}
For the first time in the multimodal domain, we demonstrate that features learned by Sparse Autoencoders (SAEs) in a smaller Large Multimodal Model (LMM) can be effectively interpreted by a larger LMM. Our work introduces the use of SAEs to analyze the open-semantic features of LMMs, providing a solution for feature interpretation across model scales.
This research is inspired by Anthropic's remarkable work on applying SAEs to interpret features in large-scale language models. In multimodal models, we discovered intriguing features that correlate with diverse semantics and can be leveraged to steer model behavior, enabling more precise control and understanding of LMM functionality.
The Sparse Autoencoder (SAE) is trained on LLaVA-NeXT data by integrating it into a specific layer of the model, with all other components frozen. The features learned by the SAE are subsequently interpreted through the proposed auto-explanation pipeline, which analyzes the visual features based on their activation regions.
These features can then be used to steer the model's behavior toward desired outputs. You can check our paper for more details.
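As a rough illustration of the setup (simplified, not the released code): the SAE reads the activations of the chosen layer, is trained with a reconstruction plus L1-sparsity objective, and steering adds a scaled decoder direction of one interpretable feature back into those activations.

```python
# Simplified sketch of an SAE on one LMM layer and feature-based steering.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, h: torch.Tensor):
        f = torch.relu(self.encoder(h))      # sparse, non-negative feature activations
        return self.decoder(f), f            # reconstruction and features

def sae_loss(recon, h, f, l1_coef: float = 1e-3):
    # Reconstruction error plus an L1 sparsity penalty on the feature activations.
    return torch.mean((recon - h) ** 2) + l1_coef * f.abs().mean()

def steer(h: torch.Tensor, sae: SparseAutoencoder, feature_id: int, strength: float = 5.0):
    """Boost one interpretable feature by adding its decoder direction to the activations."""
    direction = sae.decoder.weight[:, feature_id]   # shape: (d_model,)
    return h + strength * direction
```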
The development of video large multimodal models (LMMs) has been hindered by the difficulty of curating large amounts of high-quality raw data from the web. To address this, we consider an alternative approach, creating a high-quality synthetic dataset specifically for video instruction-following, namely LLaVA-Video-178K. This dataset includes key tasks such as detailed captioning, open-ended question-answering (QA), and multiple-choice QA. By training on this proposed dataset, in combination with existing visual instruction tuning data, we introduce LLaVA-Video, a new video LMM. Our experiments demonstrate that LLaVA-Video achieves strong performance across various video benchmarks, highlighting the effectiveness of our dataset. We plan to release the dataset, its generation pipeline, and the model checkpoints.
Video Instruction-Following Data Synthesis
A high-quality dataset for video instruction-tuning is crucial for developing effective video-language models. We identify a key factor in building such datasets: ensuring richness and diversity in both video content and its language annotations. We perform a comprehensive survey of existing video benchmarks, covering various public video captioning and question-answering datasets, and identify ten unique video sources that contribute to over 40 video-language benchmarks. From each source, we select videos that exhibit significant temporal dynamics. To maintain diversity in the annotations, we establish a pipeline capable of generating detailed captions for videos of any length. Additionally, we define 16 types of questions that guide GPT-4o in creating question-answer pairs to assess the perceptual and reasoning skills of the video-language models.
Video Sources
We noticed that although different video-language datasets focus on various video understanding tasks, most are sourced from ten main video sources, which offer a wide range of video data from different websites, viewpoints, and domains. The relationship between these ten selected video datasets and others is shown in the figure below. We select dynamic videos from these sources; the video selection logic is detailed in the paper.
Automated Generation for Video Detail Description
For the selected videos, we use GPT-4o to systematically describe their content. We start by sampling video frames at one frame per second (fps). However, due to the input size constraints of GPT-4o, we cannot use all sampled frames at once. Instead, we describe the videos sequentially, as shown in the figure below. We create descriptions at three distinct levels, detailed below.
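As a minimal illustration (using OpenCV rather than our exact annotation pipeline), sampling at 1 fps can look like this:

```python
# Minimal sketch: sample roughly one frame per second from a video with OpenCV.
import cv2

def sample_frames_1fps(video_path: str):
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0   # fall back if metadata is missing
    step = max(1, int(round(fps)))
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:                    # keep roughly one frame per second
            frames.append(frame)
        idx += 1
    cap.release()
    return frames
```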
Automated Generation for Video Question Answering
In addition to detailed video descriptions, our dataset includes a variety of question-answer pairs designed for complex interactions. This setup improves the video understanding model’s ability to handle real-life queries. We refer to public video question-answering benchmarks to organize these questions into 16 specific categories, as shown in Figure 3. Given a detailed video description, we use GPT-4o to generate at most one question-answer pair for each type of question. Please refer to the paper for more details of the question types and the generation process.
Dataset Statistics
We carefully select from our collected data sources to form a balanced and comprehensive collection, resulting in a total of 178K videos and 1.3M instruction-following samples. This includes 178K captions, 960K open-ended QAs, and 196K multiple-choice QAs.
Dataset Comparison
We provide a comparison of high-quality instruction-following video-language datasets, with a focus on synthetic data created with strong AI models, as shown in Table 1.
A broad collection of dynamic videos. In terms of video sources, although LLaVA-Hound contains the largest number of videos, 44% of its video data are sourced from WebVid, where most videos are static. ShareGPT4Video includes 30% of its videos from Pexels, Pixabay, and Mixkit, which are aesthetically pleasing but also mostly static. Additionally, the majority of its videos come from Panda-70M, which consists of short clips cut from longer videos, suggesting simpler plots. In contrast, we carefully select video sources that offer dynamic, untrimmed videos with complex plots, which are crucial for developing a powerful video understanding model.
High frames per second. Regarding frame sampling in language annotations, the proposed dataset considers 1 FPS, while other datasets consider much lower FPS. LLaVA-Hound uniformly samples 10 frames from videos of any length. The average FPS is 0.008, which may miss some fine details. ShareGPT4Video picks key frames using CLIP based on frame uniqueness. This method might also miss subtle changes in the video because CLIP embeddings do not capture fine-grained dynamics well. Our method samples FPS=1 without using key frame selection algorithms, ensuring that detailed temporal information can be expressed in annotations with high coverage.
Diverse tasks. The proposed dataset covers three common task types, including captioning, free-form QA, and closed-form QA, while existing datasets only consider a subset. Meanwhile, the quality and number of samples in our dataset are higher.