LMMs-Lab

  • Thumbnail

    🔗 Code | Paper | Model | Data

    MMSearch-R1 is the first end-to-end RL-based solution designed to equip LMMs with the capability to perform search on demand in real-world internet environments. It outperforms same-sized RAG baselines and approaches the performance of larger models while requiring significantly fewer search calls.

Figure 1: MMSearch-R1 learns to recognize the boundaries of its knowledge and perform on-demand search, significantly reducing the number of searches required while outperforming RAG-based models on knowledge-intensive and info-seeking VQA tasks.

    1. Introduction

    Scaling up vision-language paired data has become a widely adopted paradigm for Large Multimodal Models (LMMs) to acquire grounded knowledge of the visual world. Although this static training strategy has proven effective, it remains limited in capturing complex and evolving real-world knowledge. In particular, state-of-the-art LMMs continue to struggle with long-tail facts, newly emerging information, and domain-specific content that is often restricted by privacy or copyright constraints. As a result, their performance remains suboptimal on knowledge-intensive and information-seeking visual question answering tasks, frequently generating hallucinated outputs when confronted with inputs beyond their training distribution, such as unfamiliar visual content or previously unseen textual information. This limitation raises important concerns regarding their factual reliability in real-world applications.

Integrating search capabilities into LMMs offers a promising solution to the above limitations. However, existing approaches such as Retrieval-Augmented Generation (RAG) and prompt-based agents remain suboptimal. RAG methods rely on a fixed retrieve-then-generate pipeline grounded in static corpora, often leading to over-retrieval, high computational cost, and the unrealistic assumption that all necessary information is already available. This rigid setup fails to reflect the dynamic and unpredictable nature of real-world scenarios. In contrast, prompt-based agents can access real-time search engines, but their parameters are not optimized through learning, preventing them from truly acquiring effective search behaviors or adapting to open-world environments.

    To address these limitations, we aim to train LMMs that can interact with real-world environments and acquire three essential search-related capabilities: (1) when to search, (2) what to search for, and (3) how to reason over search results to answer user queries. Building on these goals, we introduce MMSearch-R1, the first end-to-end reinforcement learning framework designed to empower LMMs with on-demand search capabilities in open, internet-based environments. Our efforts are summarized as follows:

    • Dataset Construction We propose an automated approach to construct a multimodal search VQA dataset by estimating the model’s familiarity with each question. This enables the generation of search-required and search-free samples, further complemented by manually annotated test data covering diverse knowledge types and difficulty levels.
    • Multimodal Search Tool Integration We develop a real-world search pipeline combining an image search tool and a text search tool, enabling LMMs to retrieve relevant visual and textual information for unfamiliar inputs.
    • Wiser Search via Reinforcement Learning We introduce a GRPO-based RL framework that trains LMMs to decide when, what, and how to search. Our method achieves superior performance over RAG-based baselines while reducing search calls by over 30%.
    • Open-Sourced Dataset and Framework We will release our model, dataset and training framework to support future research in search-augmented multimodal reasoning.

    2. Method

    2.1. Building Iterative Multimodal Search-Integrated RL Framework

Figure 2: Illustration of training in MMSearch-R1. Top: The GRPO training pipeline integrated with multimodal search tools. Bottom: A detailed view of the rollout process and search tool execution.

We build on veRL and adopt standard GRPO as our base RL algorithm, with modifications that allow search interactions with the real-world environment during the rollout process, as illustrated in Figure 2 and described below.

    • Multimodal Search Tools We equip the model with two types of search tools to interact with real-world internet content. The first is an image search tool, which takes the input image and returns the top-5 visually similar webpages, each represented by a thumbnail and a title. This enables the model to identify unfamiliar visual entities in the image. The second is a text search pipeline, where the model formulates a query based on the user question, retrieves relevant webpages, and processes their content into concise summaries. This allows the model to acquire textual knowledge needed to answer the question accurately.
    • Rollout with Multi-turn Multimodal Search The rollout process is designed to be multi-turn and iterative. At each step, the model receives new information, such as the original question or retrieved search results, and performs reasoning based on the accumulated context. It then selects an action from a predefined action space, which includes invoking search tools or answering the question. This process continues until the model generates a final answer or reaches the maximum number of allowed turns. To support this interaction, we define and utilize a set of special tokens to structure the model’s outputs and the environment’s feedback.
• Reward Modeling Our reward consists of two components: an accuracy score with a search penalty and a format score. For the accuracy score, we evaluate model performance using exact string match against the ground truth, assigning a score of 1 for correct answers and 0 otherwise. For correct responses, a penalty factor (between 0 and 1) is applied if any search was used, encouraging the model to rely on internal knowledge and invoke search only when necessary. This design promotes efficient, on-demand search behavior. The format score verifies whether the model follows the required output structure, ensuring compatibility with the environment interface.

$$ \texttt{reward} = (1 - \alpha)\cdot \texttt{Acc\_Score}\cdot \texttt{Search\_Penalty} + \alpha\cdot \texttt{Format\_Score} $$
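A minimal sketch of how this reward could be computed is shown below; the function name, the exact-match judge, and the illustrative values of the search penalty and α are assumptions rather than the exact released implementation.

```python
def compute_reward(pred_answer: str, gt_answer: str, used_search: bool,
                   format_ok: bool, search_penalty: float = 0.9, alpha: float = 0.1) -> float:
    """Outcome reward with an on-demand search penalty (values are illustrative).

    acc_score: exact string match against the ground truth (1 or 0).
    search_penalty: applied only when a correct answer used any search call,
    discouraging unnecessary tool use. format_score: 1 if the rollout follows
    the required special-token structure.
    """
    acc_score = 1.0 if pred_answer.strip().lower() == gt_answer.strip().lower() else 0.0
    penalty = search_penalty if (used_search and acc_score == 1.0) else 1.0
    format_score = 1.0 if format_ok else 0.0
    return (1 - alpha) * acc_score * penalty + alpha * format_score
```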

    2.2. Curating Search-balanced VQA Datasets

Figure 3: Illustration of the FVQA data construction process: (a) an automated pipeline for collecting VQA samples that require visual knowledge; (b) the knowledge taxonomy; (c) the overall pipeline showing the composition and origin of FVQA from automated and manually curated sources.

To effectively train models for on-demand search using simple outcome-based reinforcement learning, we require a search-balanced dataset that includes both search-required and search-free questions. This balance allows the model to learn when to rely on internal knowledge and when to invoke external search. We propose three key criteria for such datasets: (1) coverage of both search-required and search-free questions; (2) concise and verifiable answers; (3) diversity in knowledge and difficulty. Following these criteria, we construct a multimodal search VQA dataset, FactualVQA (FVQA), using a combination of automated pipelines and manual annotation.

• VQA Collection We first gather a pool of candidate VQA samples requiring either visual or textual knowledge. For visual knowledge, we develop an automated pipeline that collects images related to head and tail visual concepts in the MetaCLIP vocabulary from the internet. Based on these images, we use GPT-4o to generate corresponding questions that assess the model’s recognition capabilities. For textual knowledge, we sample questions from the InfoSeek training set. We annotate the knowledge type for each question using GPT-4o and maintain a balanced distribution across categories.
• Search Balancing To distinguish between search-required and search-free questions, we use a preliminary model equipped with search capabilities to classify the collected VQA samples (see the sketch after this list). Based on this classification, we construct a search-balanced training set of 5,000 examples, named FVQA-train, which includes approximately 3,400 search-required and 1,600 search-free questions.
    • Human Annotation Human annotators are involved throughout the data curation process to ensure diversity, authenticity, and label quality—especially for the test set of FVQA.
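The search-balancing step above could be implemented roughly as follows; `answer_without_search`, `answer_with_search`, and `is_correct` are hypothetical wrappers around the preliminary search-capable model and the exact-match judge, so this is a sketch of the labeling logic rather than the released pipeline.

```python
def label_search_need(samples, answer_without_search, answer_with_search, is_correct):
    """Split VQA samples by whether a preliminary model needs search to answer them.

    answer_without_search / answer_with_search / is_correct are assumed callables
    wrapping the preliminary search-capable model and the exact-match judge.
    """
    search_free, search_required = [], []
    for s in samples:
        if is_correct(answer_without_search(s["image"], s["question"]), s["answer"]):
            search_free.append(s)        # internal knowledge suffices
        elif is_correct(answer_with_search(s["image"], s["question"]), s["answer"]):
            search_required.append(s)    # solvable only with external search
        # samples unanswerable even with search could be dropped or re-annotated
    return search_required, search_free
```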

    3. Experimental Findings

    We evaluated MMSearch-R1 against both closed-source models (GPT-4o and Gemini 2.5 Pro) and open-source models from the Qwen2.5-VL series on knowledge-intensive and information-seeking VQA tasks (FVQA-test, InfoSeek, MMSearch, SimpleVQA, and LiveVQA). All baseline models are tasked with solving VQA problems in two different workflows. (1) Direct Answer: Models are prompted to directly generate a short and precise answer without accessing external information. (2) Answer under RAG Workflow: In this workflow, models are required to perform exactly two search operations using our multimodal search tools for each VQA example, first performing an image search and then a text search. Specifically, given an input image and question, the model is provided with the image search results and the original question in the first round and is prompted to generate a text query to assist in answering. In the second round, the retrieved results based on the text query are fed into the model, and the model is asked to produce the final answer. Under a fixed budget of search steps, the RAG workflow typically exposes the model to more external information compared to the on-demand search strategy.
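The fixed two-round RAG workflow described above can be sketched as follows; `image_search`, `text_search`, and `model.chat` are hypothetical wrappers around the multimodal search tools and the baseline LMM, and the prompts are simplified.

```python
def rag_workflow(model, image, question, image_search, text_search):
    """Fixed two-round RAG baseline: image search first, then text search.

    image_search, text_search, and model.chat are assumed wrappers around the
    multimodal search tools and the baseline LMM; prompts are simplified.
    """
    # Round 1: show the image-search results and ask the model for a text query.
    image_results = image_search(image)  # top-5 visually similar pages (thumbnail + title)
    text_query = model.chat(
        image=image, context=image_results,
        prompt=f"{question}\nWrite a text search query that would help answer this question.")

    # Round 2: feed the retrieved summaries back and ask for the final answer.
    text_results = text_search(text_query)  # retrieved webpages processed into summaries
    answer = model.chat(
        image=image, context=[image_results, text_results],
        prompt=f"{question}\nAnswer concisely using the retrieved information.")
    return answer
```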

    Table 1: Performance of MMSearch-R1 across benchmarks. "Acc (%)" denotes the accuracy evaluated by LLM-as-Judge, while "SR (%)" represents the search ratio, defined as the percentage of total search calls made relative to the maximum allowed search steps for each method.
    • Finding 1: RL training enables models to better recognize the boundaries of their knowledge and perform on-demand search more effectively. As shown in Table 1, MMSearch-R1-7B outperforms same-sized RAG-based models by an average of 3% in accuracy while reducing the average search rate by 32.9%, across both in-domain and out-of-domain test sets. This demonstrates that our RL-trained model achieves higher correctness with fewer search calls, indicating more efficient and selective use of external information.
    Figure 4: (a). Performance comparison between the Base model and the RL-trained model under the RAG workflow. (b). Answer behavior breakdown of Base (inner circle) and RL (outer circle) models in InfoSeek and SimpleVQA.
• Finding 2: RL training enhances the model’s ability to generate effective text queries and summarize retrieved information. To evaluate the abilities of query generation and information summarization, we follow a fixed RAG setup in which both image and text search are executed for every question. This isolates the model’s ability to interact with retrieved information. As shown in Figure 4(a), MMSearch-R1-7B consistently outperforms the base model on both in-domain and out-of-domain tasks.
    • Finding 3: RL improves the model’s ability to utilize its internal knowledge. As shown in Figure 4(b), there is a clear upward trend in the Correct without Search proportion from the base model to the RL-trained model. These gains indicate that the RL-trained model can answer substantially more questions correctly without invoking the search tool, demonstrating improved recall and reasoning based on its internal knowledge.
    Figure 5: (a). Performance improvements of SFT and RL over Base across five VQA datasets. (b). Training dynamics of reward and search ratio for different strategies.
• Finding 4: RL achieves greater performance improvements and exhibits higher data efficiency compared to SFT. We distill GPT-4o’s behavior on our collected VQA samples to construct SFT data, and fine-tune Qwen2.5-VL-7B on it. This serves as a supervised learning baseline for comparison against our reinforcement learning-trained model. As shown in Figure 5(a), the model trained with RL consistently outperforms the one trained with SFT across all tasks, despite being trained on only about half as much data.
• Finding 5: Training with balanced data and a search penalty in the reward effectively guides the model to perform on-demand search. Figure 5(b) illustrates the training dynamics of reward and search ratio during reinforcement learning. Removing either the search penalty or data balancing leads to distinct trade-offs. Although both ablated variants achieve slightly higher rewards, they do so at the cost of overusing the search tool, with search ratios rapidly converging to nearly 100%.

    4. Conclusion

    MMSearch-R1 learns to recognize knowledge gaps, selectively invoke image or text search, and reason over retrieved content. It outperforms same-sized RAG baselines and approaches the performance of larger models while requiring significantly fewer search calls. Our framework, dataset, and findings offer practical insights into training LMMs with real-world interaction capabilities and lay the groundwork for building multimodal agents that are both adaptive and interactive. We look forward to the next major advancement in multimodal intelligence emerging as models increasingly engage with and explore the real world through more tools, further evolving their reasoning and adaptive capabilities.

    Authors

    *equal contribution

    Citation

    @article{wu2025searchr1,
      title={Search-R1: A Multimodal Search-Augmented Reinforcement Learning Framework for LMMs},
      author={Wu, Jinming and Deng, Zihao and Li, Wei and Liu, Yiding and You, Bo and Li, Bo and Ma, Zejun},
      url={https://github.com/EvolvingLMMs-Lab/multimodal-search-r1},
      year={2025}
    }
• Thumbnail

    Code@Github

    1. Introduction

SOTA large multimodal model (LMM) architectures, such as Qwen2.5-VL, typically build on a powerful large language model (LLM) (e.g., Qwen2.5) integrated with an external Native Resolution Vision Transformer (NaViT). However, this approach presents challenges in high-resolution real-world scenarios, as such inputs are converted into enormous numbers of visual tokens, many of which are irrelevant to the downstream task. By comparison, when processing high-resolution real-world scenes, the human visual system employs task-driven visual search strategies to ground and scrutinize critical regions of interest. Motivated by this biological mechanism, we attempt to equip LMMs with similar visual search capabilities by leveraging visual grounding to focus on key image regions.

    However, empowering LMMs with such grounding-based visual reasoning capabilities is non-trivial, primarily due to the scarcity and high cost of obtaining grounding annotations for standard visual-question-answering (VQA) datasets, which are required for constructing multi-turn grounding-based conversation data for supervised fine-tuning (SFT). In this paper, we highlight that accurate grounding behavior can emerge within a reinforcement learning (RL) paradigm, even when training supervision is provided solely through a binary reward function derived from the correctness of the final answer.

To this end, we introduce Multi-turn Grounding-based Policy Optimization (MGPO), a reinforcement learning (RL) algorithm that enables LMMs to iteratively focus on key image regions by automatically cropping sub-images based on model-predicted grounding coordinates within a multi-turn conversation framework. Given a high-resolution image and a question, the model first predicts the coordinates of key regions relevant to the query. An image cropping function is then triggered to extract and return the corresponding sub-image. In subsequent turns, the model can integrate previous in-context conversations (including both the original image and cropped sub-images) to solve the question.

Figure 1: Examples of models trained with multi-turn grounding-based RL on high-resolution real-world tasks. The model first identifies key regions, which are then automatically cropped and returned as sub-images. Notably, despite receiving only a binary reward derived from the correctness of the final answer, the model gradually develops robust grounding capability throughout the RL process. The conversation in the figure shows only the key parts; the full conversation is provided in Figure 9.

    In summary, MGPO mainly offers the following advantages:

    • Top-down and Interpretable Visual Reasoning. MGPO equips LMMs with a top-down, question-driven visual search mechanism for high-resolution scenarios and provides interpretable outputs that indicate which image regions are attended to throughout the reasoning process.
• Overcomes Maximum Pixel Constraints. MGPO overcomes the maximum pixel limitation of LMMs. As shown in the first example of Figure 1, even when resizing a high-resolution image within pixel limits results in a blurred input, the model can still identify relevant coordinates and crop clear sub-images from the original input for further analysis.
• Without Additional Grounding Annotations. MGPO can be post-trained directly on standard VQA datasets without the need for extra grounding annotations, and experimental results demonstrate substantial improvements in intermediate grounding performance compared to GRPO.

Ultimately, we use MGPO to post-train Qwen2.5-VL-7B on visual question answering data with short answers, and the model achieves strong intermediate grounding performance without requiring grounding annotations (examples shown in Figure 1). Compared to GRPO, MGPO yields a 5.4% improvement on the in-distribution MME-Realworld benchmark and a 5.2% gain on the challenging out-of-distribution V* Bench. Notably, with only 21K post-training samples, our model surpasses OpenAI’s o1 and GPT-4o models on the OOD V* Bench.

    2. Multi-turn Grounding-Based RL

Figure 2 illustrates a comparison of different post-training paradigms for LMMs. In MGPO, the model operates over K sequential interactions, dynamically grounding and reasoning by conditioning on the full history of visual and textual context at each step.

Figure 2: Comparison of different post-training paradigms for LMMs. Our MGPO automatically crops and returns a sub-image to the model based on its predicted grounding coordinates, enabling the model to iteratively focus on key regions and effectively solve high-resolution visual tasks.

Multi-turn Template without Cold Start. In practice, we observe that LMMs struggle to autonomously generate grounding coordinates during the rollout process, which hinders effective multi-turn RL. To address this, we design a fixed two-turn dialogue template, as shown in Figure 3, to explicitly activate the model’s grounding and reasoning abilities.

Figure 3: Fixed multi-turn grounding template, which eliminates the cold-start SFT process.
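A rough sketch of this fixed two-turn interaction is given below; the prompt wording and the `model.generate` / `crop_by_bbox` interfaces are illustrative assumptions, not the exact template shown in Figure 3.

```python
def two_turn_grounding_rollout(model, crop_by_bbox, image, question):
    """Sketch of the fixed two-turn interaction; prompt wording and the
    model.generate / crop_by_bbox interfaces are illustrative assumptions."""
    # Turn 1: ask for the bounding box of the key region relevant to the question.
    turn1_prompt = (f"{question}\nFirst output the bounding box of the image region "
                    "most relevant to the question in JSON format.")
    bbox = model.generate(images=[image], prompt=turn1_prompt)

    # Environment step: crop the predicted region from the original image.
    sub_image = crop_by_bbox(image, bbox)

    # Turn 2: answer with both the original image and the cropped sub-image in context.
    turn2_prompt = ("Here is the cropped key region. Answer the question and put "
                    "the answer letter within \\boxed{}.")
    return model.generate(images=[image, sub_image], prompt=turn2_prompt,
                          history=[(turn1_prompt, bbox)])
```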

Grounding Key Visual Areas. Within the two-turn MGPO framework, the extraction of sub-images is performed with respect to the original high-resolution image. Since the grounding coordinates predicted by Qwen2.5-VL are inherently dependent on the resolution of the input image, it is necessary to normalize the predicted coordinates by the input image dimensions and subsequently map them back to the coordinate space of the original image. This normalization procedure is particularly crucial when the original image resolution exceeds the maximum pixel limit of the LMM, as it enables the model to access higher-fidelity sub-images for processing. An illustration of this process is provided in Figure 4.

Figure 4: An illustration of cropping a sub-image based on grounding coordinates.
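A minimal sketch of this normalization-and-cropping step, assuming boxes are predicted in the coordinate space of the (possibly downsized) model input:

```python
from PIL import Image

def crop_from_original(original: Image.Image, resized: Image.Image, bbox_resized):
    """Map a box predicted on the (possibly downsized) model input back to the
    original high-resolution image and crop the corresponding sub-image.

    bbox_resized = (x1, y1, x2, y2) in the coordinate space of `resized`.
    """
    sx = original.width / resized.width
    sy = original.height / resized.height
    x1, y1, x2, y2 = bbox_resized
    # Rescale to original-resolution coordinates and clamp to the image bounds.
    box = (max(0, int(x1 * sx)), max(0, int(y1 * sy)),
           min(original.width, int(x2 * sx)), min(original.height, int(y2 * sy)))
    return original.crop(box)
```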

    3. Experiments

    3.1 Datasets & Metrics

To evaluate the effectiveness of our approach, experiments are conducted on two established datasets: MME-Realworld and V* Bench. Both datasets are specifically designed to evaluate the capabilities of LMMs in analyzing high-resolution images and capturing fine-grained visual information.

    MME-Realworld. The MME-Realworld dataset comprises a diverse array of tasks, which are systematically categorized into perception and reasoning domains. For in-distribution evaluation, the lite subset of MME-Realworld, consisting of 1,919 samples, is reserved as the test set, while the remaining 21,690 samples are utilized for training.

V* Bench. V* Bench serves as an out-of-distribution benchmark focused on detailed visual grounding on high-resolution images. This vision-centric benchmark requires LMMs to accurately localize and interpret specific visual information, and it has also been adopted by OpenAI to assess the visual reasoning capabilities of their latest o3 and o4-mini models. This benchmark contains 191 test samples.

    All datasets employ the multiple-choice question format, and model performance is consistently measured by accuracy on both the in-distribution (MME-Realworld) and out-of-distribution (V* Bench) test sets. Figure 5 illustrates the distribution of image resolutions across different datasets.

    Figure 5: Distribution of image resolutions (width × height) across different datasets.

    3.2 Experimental Setup

We employ the verl framework to enable distributed training across multiple machines and GPUs, and utilize vLLM to accelerate inference during the rollout phase. For reinforcement learning, we adopt the naive GRPO algorithm as the RL baseline, where a post-prompt is added: “{question}\nOutput the coordinates of the key image area relevant to the problem in JSON format. And put the answer letter (A, B, C, D, or E) within \boxed{}.” Both GRPO and our proposed MGPO leverage a binary accuracy reward function, assigning a reward of 1 if the final multiple-choice answer is correct and 0 otherwise.

    All experiments are conducted using the Qwen2.5-VL-7B model. To prevent out-of-memory errors, the maximum number of input image pixels is limited to 1,003,520 (1280 × 28 × 28), corresponding to a maximum of 1280 visual tokens per image. Images exceeding this pixel threshold are resized to comply with this constraint.
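A sketch of the resizing rule implied by this constraint is shown below; rounding to multiples of 28 mirrors Qwen2.5-VL's patching, but is an assumption of this sketch rather than the exact preprocessing code.

```python
import math

MAX_PIXELS = 1280 * 28 * 28  # 1,003,520 pixels, i.e. at most 1,280 visual tokens

def resize_to_pixel_budget(width: int, height: int, max_pixels: int = MAX_PIXELS):
    """Downscale (width, height) so the area fits within max_pixels while keeping
    the aspect ratio. Rounding down to multiples of 28 mirrors Qwen2.5-VL's
    patching, but is an assumption of this sketch, not the exact preprocessing."""
    if width * height <= max_pixels:
        return width, height
    scale = math.sqrt(max_pixels / (width * height))
    new_w = max(28, int(width * scale // 28) * 28)
    new_h = max(28, int(height * scale // 28) * 28)
    return new_w, new_h
```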

    3.3 Main Results

Table 1 presents the performance comparison of different post-training paradigms on Qwen2.5-VL-7B, including SFT, GRPO, and our MGPO. All three post-training methods substantially improve the model’s performance on high-resolution visual tasks, as measured on both the OOD V* Bench and the ID MME-Realworld benchmarks.

    Notably, we observe that GRPO does not yield significant improvements over SFT, which contrasts with conclusions drawn from prior work on multi-modal mathematical tasks. We hypothesize that, for high-resolution vision-centric tasks, the primary challenge lies in enabling the model to perceive fine-grained image details, rather than performing complex, lengthy reasoning.

    In contrast, our MGPO algorithm achieves remarkable gains, outperforming both SFT and GRPO. Specifically, MGPO delivers a substantial 5.2% absolute improvement over the GRPO baseline on the V* Bench (OOD) benchmark, and a 5.4% gain in overall MME-Realworld (ID) performance. These results demonstrate the effectiveness of multi-turn grounding and iterative sub-image cropping in addressing the challenges of high-resolution visual understanding.

    Additionally, we compare our results with OpenAI’s o1 and GPT-4o models. To ensure a fair comparison, we report only the OOD V* Bench results. Notably, our MGPO post-trained model surpasses both o1 and GPT-4o, despite being based on a 7B model and trained with a small-scale dataset of 21k samples.

    Table 1: Performance comparison of different post-training paradigms for LMMs. V* Bench serves as an out-of-distribution evaluation, while MME-Realworld serves as an in-distribution evaluation. Abbreviations: OCR—Optical Character Recognition in the wild; RS—Remote Sensing; DT—Diagram and Table; MO—Video Monitoring; AD—Autonomous Driving.

    Figure 6 illustrates the comparative performance trajectories of MGPO and GRPO on the V* Bench throughout the RL training process. As training progresses, MGPO consistently surpasses GRPO, highlighting its superior capacity to address high-resolution scenarios that remain unresolved by GRPO.

Figure 6: Performance comparison between MGPO and GRPO on V* Bench.

Effect of LMM Maximum Input Image Resolution. Table 2 compares the impact of varying maximum input image resolutions for LMMs. We observe that MGPO yields greater performance improvements on V* Bench when the maximum input pixel limit is lower. This is because, when high-resolution images are aggressively resized, many tasks become more challenging to solve directly. However, MGPO can first identify key regions and crop clearer sub-images from the original image, thereby facilitating more effective task completion.

    Table 2: Performance comparison of various post-training paradigms for LMMs under different maximum input image resolutions.

    4. Grounding-based RL without Grounding Annotations

In this section, we highlight the insight that it is feasible to train powerful grounding-based RL models even without grounding annotations. This insight broadens the applicability of grounding-based RL paradigms, as obtaining high-quality grounding annotations is often expensive and labor-intensive.

    4.1 Emergent Grounding Ability During RL Training

To assess whether models can develop accurate grounding capabilities in the absence of grounding supervision, we analyze the proportion of rollouts that generate valid grounding coordinates during RL training (e.g., ensuring coordinates fall within the input image boundaries). Figure 7 illustrates the comparison between GRPO and MGPO. For GRPO, the ratio of valid grounding coordinates remains low and exhibits minimal improvement throughout training, indicating that the model struggles to ground the correct image regions. In contrast, MGPO demonstrates a clear upward trajectory, with the proportion of valid grounding coordinates steadily increasing as training progresses.
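A minimal sketch of the validity check used to compute this ratio (boundary and well-formedness checks only; parsing of the raw model output is omitted):

```python
def is_valid_bbox(bbox, image_width: int, image_height: int) -> bool:
    """Check that a predicted box is well-formed and lies within the input image
    boundaries; parsing of the raw model output is omitted."""
    try:
        x1, y1, x2, y2 = map(float, bbox)
    except (TypeError, ValueError):
        return False
    return 0 <= x1 < x2 <= image_width and 0 <= y1 < y2 <= image_height

def valid_grounding_ratio(rollout_bboxes, image_sizes) -> float:
    valid = sum(is_valid_bbox(b, w, h) for b, (w, h) in zip(rollout_bboxes, image_sizes))
    return valid / max(1, len(rollout_bboxes))
```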

    Figure 7: The ratio of valid grounding coordinates during RL rollouts.

Additionally, we evaluate whether the grounded sub-images from the test set can be directly used to answer the question using Qwen2.5-VL-7B. As presented in Table 3, the comparative results across different methods demonstrate the superior grounding accuracy achieved by MGPO. In the second stage of MGPO, the model is provided with either the cropped sub-image or the original image, without any auxiliary reward for generating valid sub-image coordinates. Notably, the model autonomously increases the proportion of valid grounding coordinates, suggesting that it is capable of learning to localize key regions and utilize sub-images to improve question answering performance.

Table 3: Ratio of grounded sub-images that can directly answer the question using Qwen2.5-VL-7B on V* Bench.

    4.2 Further Experiments on Image Counting Tasks

To further substantiate this insight, we conduct additional experiments on the image counting task, leveraging the fact that the counting data provides both grounding annotations (in point format) and the corresponding count as the final answer. Specifically, we randomly sample 3,000 instances from the Pixmo-Points dataset for post-training. Pixmo-Count is used as the in-distribution (ID) evaluation benchmark, while FSC-147 serves as the out-of-distribution (OOD) benchmark.

During GRPO post-training, the model is prompted to first ground (point to) each object in the image and then provide the total count. We compare two reward functions: (1) a binary accuracy reward based solely on the correctness of the final count, and (2) the same reward with an additional point reward. The point reward is computed by matching the model’s predicted point list with the ground-truth point list using the Hungarian algorithm, such that a higher matched ratio results in a higher reward.
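The point reward could be computed roughly as follows with the Hungarian algorithm from SciPy; the distance threshold for counting a match is an assumption of this sketch.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def point_match_reward(pred_points, gt_points, dist_threshold: float = 0.05):
    """Hungarian-matching point reward: the reward is the fraction of ground-truth
    points matched by a prediction within dist_threshold (normalized coordinates);
    the threshold value is an assumption of this sketch."""
    if len(gt_points) == 0:
        return float(len(pred_points) == 0)
    if len(pred_points) == 0:
        return 0.0
    pred = np.asarray(pred_points, dtype=float)   # shape (P, 2)
    gt = np.asarray(gt_points, dtype=float)       # shape (G, 2)
    cost = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)  # (P, G) distances
    rows, cols = linear_sum_assignment(cost)      # optimal one-to-one matching
    matched = (cost[rows, cols] <= dist_threshold).sum()
    return float(matched) / len(gt_points)
```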

    The results, summarized in Table 4, indicate that introducing the additional point reward does not yield significant performance improvements. We further visualize the outputs of the GRPO model trained solely with the accuracy reward (see Figure 8), and observe that the model is capable of accurately localizing object points even without explicit grounding supervision. These results support our conclusion that explicit grounding annotations are not necessary for effective RL-based learning, as the model inherently learns to perform precise grounding as a prerequisite for solving the counting task.

Table 4: Performance comparison on the image counting task. The additional point reward does not lead to significant performance improvements.

Figure 8: Visualization of point predictions from the GRPO model trained with only the accuracy reward.

    5. Limitation

All experiments with MGPO are conducted using a fixed two-turn template, rather than allowing the model to autonomously decide when to perform image cropping based on the input question, as illustrated by the latest OpenAI models such as o3 and o4-mini. This limitation stems from our observation that Qwen2.5-VL, when directly subjected to RL post-training, struggles to generate grounding coordinates without explicit prompt guidance.

Nevertheless, we believe that our trained models can be leveraged to generate high-quality chain-of-thought (CoT) data for subsequent SFT. Adopting a multi-stage training strategy that combines SFT and RL, as in DeepSeek-R1, may ultimately enable the model to autonomously decide when and how to perform grounding. We leave this direction for future work.

    Authors

    Citation

    If you find our work to be useful for your research, please consider citing.

    @article{huang2025highres,
      title={High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning},
      author={Huang, Xinyu and Dong, Yuhao and Li, Wei and Wu, Jinming and Deng, Zihao and Li, Bo and Ma, Zejun},
      url={https://github.com/EvolvingLMMs-Lab/MGPO},
      year={2025}
    }

    Appendix

    Figure 9: A full conversation example of MGPO post-trained model on high-resolution image tasks.
• Thumbnail

    What is Aero Audio?

    Github | Playground | Models | Evaluation Results | Cookbook

    Aero-1-Audio is a compact audio model adept at various audio tasks, including speech recognition, audio understanding, and following audio instructions. It is part of the Aero-1 series, the first generation of lightweight multimodal models developed by LMMs-Lab, with future expansions planned across additional modalities.

1. Built upon the Qwen2.5-1.5B language model, Aero delivers strong performance across multiple audio benchmarks while remaining parameter-efficient, even compared with larger advanced models such as Whisper, Qwen2-Audio, and Phi-4-Multimodal, or commercial services like ElevenLabs/Scribe.

2. Aero is trained within one day on 16 H100 GPUs using just 50k hours of audio data. Our results suggest that audio model training can be sample-efficient when the data is high quality and well filtered.

3. Aero can accurately perform ASR and audio understanding on continuous audio inputs up to 15 minutes in length, a scenario that we find remains challenging for other models.

    ASR & Audio Understanding Performance

We evaluate our model’s performance across multiple dimensions and benchmarks. Let’s first take a look at its overall performance compared with other models.

    ASR-Understanding-Compare.png ASR-Detail.png

    Our model achieves a balance between performance and parameter efficiency. We evaluate it across multiple ASR and audio understanding benchmarks. On ASR tasks, our model attains the lowest WER scores on datasets such as AMI, LibriSpeech, and SPGISpeech. It also demonstrates strong audio understanding capabilities on various comprehension benchmarks. As illustrated in the plotted graph, our model falls within the highlighted triangular region that represents an optimal trade-off between parameter efficiency and performance.

    Data Distribution

    We present the contributions of our data mixture here. Our SFT data mixture includes over 20 publicly available datasets, and comparisons with other models highlight the data’s lightweight nature.

    Data-distribution.png training-time.png

    *The hours of some training datasets are estimated and may not be fully accurate
    One of the key strengths of our training recipe lies in the quality and quantity of our data. Our training dataset consists of approximately 5 billion tokens, corresponding to around 50,000 hours of audio. Compared to models such as Qwen-Omni and Phi-4, our dataset is over 100 times smaller, yet our model achieves competitive performance. All data is sourced from publicly available open-source datasets, highlighting the sample efficiency of our training approach. A detailed breakdown of our data distribution is provided below, along with comparisons to other models.

    What’s insightful

    In this release, our primary focus is on developing an audio model capable of handling multiple audio tasks. The following examples showcase its core abilities across tasks such as audio understanding and speech recognition. Most notably, we highlight the model’s capability to perform long-form ASR, as demonstrated in the example below.

    Long ASR

    A common approach for current long-form ASR tasks is to split the audio into smaller, processable chunks and perform ASR on each segment individually. However, with the advancement of large language models (LLMs), long-context understanding has become increasingly important. We argue that a model’s ability to process long audio sequences continuously is essential for effective audio understanding and should be considered a critical capability. To demonstrate this, we set up a simple use case using examples from an NVIDIA conference and calculate the WER with respect to the auto-generated YouTube subtitles.

    Long-ASR-eval.png

    The image above presents a heatmap comparison of different models performing ASR tasks on a video with varying audio input lengths. As shown in the heatmap, Qwen-Omni and Phi-4 exhibit instability across different lengths and do not consistently produce the desired output.

    Note: The ground truth is derived from the auto-generated subtitles downloaded from YouTube. Therefore, the WER does not necessarily imply that our model achieves perfect results, but rather demonstrates that our model is comparable to the YouTube ASR pipeline.

    Model’s Output

    Qwen Omni (12 minutes chunk)

    When processing the audio in 12-minute chunks, Qwen-Omni failed to recognize the full speech content and was only able to capture portions of the audio.

    Qwen Omni (12 minutes chunk)
    that’s like what’s going on why does itfocused on um ai and parallel parallelizable workloads but it’s still general to an extent it’s not as use case specific as something like grock with a queue that’s really designed to you know spit out tokens as fast as possible and that like is a goldilocks zone where it’s flexible enough to handle different workloads but not um but still much faster than um a traditional cpu and that google is one of the only companies that has a scaled internal custom silicon effort

    Phi-4-Multimodal (full chunk)

    When processing the full audio without splitting, the Phi-4-Multimodal model began to ignore the instructions and instead generated an overall summary of the audio.

    Phi-4-Multimodal (full chunk)
    The conversation covered Nvidia’s focus on inference over training, the partnership with GM, the release of GUT-N1 for humanoid robotics, and the impact of China’s AI initiatives on global chip demand.

    Aero (full chunk)

    Aero Audio is able to generate the complete ASR output and accurately identify the full transcript.

    Aero (full chunk)
    Welcome to the brainstorm episode eighty two frank downing joining us recap of nvidia’s gtc conference that is the gpu technology conference frank what happened what were the big takeaways i on my side i saw a gm and in video partnering but we can circle back to that what was … right nice timing good timing all right we’ll see everyone next week see everyone thank you

    Results on LibriSpeech Unchunked

In its original release, LibriSpeech split its audio files into smaller chunks, and the overall Word Error Rate (WER) was calculated on these segmented samples. However, as we observed, it is straightforward to concatenate the chunks back into their original form, thereby creating a simple long-form audio speech recognition benchmark. We evaluated various models on this benchmark and found that their performance generally declined compared to their results on shorter samples. Among the models tested, our model achieved the best performance, showing the smallest drop in accuracy relative to the chunked version.

| Model | LS.Clean | LS.Other | LS.Clean (Long) | LS.Other (Long) | Avg Diff |
|---|---|---|---|---|---|
| Phi-4 | 1.68 | 3.83 | 11.51 | 24.72 | 30.72 |
| Qwen2-Audio-Instruct | 3.59 | 7.46 | 93.01 | 93.63 | 175.59 |
| Qwen2.5-Omni | 1.80 | 3.40 | 13.03 | 13.29 | 21.12 |
| Aero-1-Audio | 1.49 | 3.17 | 5.31 | 11.71 | 12.36 |

    We present the evaluation of various models on the unchunked LibriSpeech dataset. The average result is calculated by averaging the WER score differences across the same splits. All models show some degradation when handling longer audio, whereas our model exhibits the least amount of performance drop.

    Evaluation Result

We present the full evaluation results and scores below.

    ASR Benchmarks

WER (%) on automatic speech recognition benchmarks:

| Model | Parameters | AMI | Earnings22 | LibriSpeech Clean | LibriSpeech Other | SPGISpeech | TedLium | Average |
|---|---|---|---|---|---|---|---|---|
| ElevenLabs/Scribe | N/A | 14.43 | 12.14 | 1.79 | 3.31 | 3.30 | 3.17 | 6.36 |
| REV.AI/Fusion | N/A | 10.93 | 12.09 | 2.88 | 6.23 | 4.05 | 2.80 | 6.50 |
| OpenAI/Whisper-large-v3 | 1.5B | 15.95 | 11.29 | 2.01 | 3.91 | 2.94 | 3.86 | 6.66 |
| Assembly.AI/AssemblyBest | N/A | 15.64 | 13.54 | 1.74 | 3.11 | 1.81 | 3.43 | 6.55 |
| Alibaba/Qwen2.5-Omni | 7B | 12.41 | 12.74 | 1.80 | 3.40 | 2.35 | 3.11 | 5.97 |
| Microsoft/Phi-4-Multimodal | 4B+1.6B | 11.45 | 10.50 | 1.67 | 3.82 | 3.11 | 2.89 | 5.57 |
| LMMs-Lab/Aero-1-Audio | 1.5B | 10.53 | 13.79 | 1.49 | 3.17 | 1.97 | 2.87 | 5.64 |

    We evaluate our model on AMI, Earnings22, LibriSpeech, SPGISpeech, and TedLium. Our model achieves the second-best WER score compared to other models, while maintaining a small and efficient size.

    Audio Understanding Result

We then test our model’s understanding capability across three dimensions: Audio Analysis and Understanding, Speech Instruction, and Audio Scene Understanding.

(AIR-Chat and MMAU cover Audio Analysis and Understanding; OpenHermes and Alpaca Audio cover Speech Instruction; AIR-Foundation covers Audio Scene Understanding.)

| Model | Parameters | AIR-Chat Speech | AIR-Chat Sound | AIR-Chat Music | AIR-Chat Mix | AIR-Chat Avg | MMAU testmini | OpenHermes | Alpaca Audio | AIR-Foundation Speech | AIR-Foundation Sound | AIR-Foundation Music | Average |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Alibaba/Qwen2-Audio-Instruct | 7B | 7.2 | 7.0 | 6.8 | 6.8 | 6.9 | 49.2 | 46.8 | 49.2 | 62.9 | 55.4 | 56.8 | 56.7 |
| Alibaba/Qwen2.5-Omni | 7B | 6.8 | 5.7 | 4.8 | 5.4 | 5.7 | 65.6 | 57.2 | 57.4 | 67.2 | 76.3 | 63.0 | 64.4 |
| Microsoft/Phi-4-Multimodal | 4B+1.6B | 7.5 | 7.0 | 6.7 | 6.8 | 7.0 | 65.0 | 57.8 | 62.6 | 48.3 | 40.6 | 35.5 | 52.8 |
| Tencent/Ola | 7B | 7.3 | 6.4 | 5.9 | 6.0 | 6.4 | 70.3 | 62.6 | 62.8 | 58.8 | 70.4 | 53.1 | 63.2 |
| Tencent/Vita 1.5 | 7B | 4.8 | 5.5 | 4.9 | 2.9 | 4.5 | 35.5 | 9.6 | 7.0 | 31.5 | 24.1 | 25.5 | 28.6 |
| InspirAI/Mini-Omni2 | 0.5B | 3.6 | 3.5 | 2.6 | 3.1 | 3.2 | - | - | - | - | - | - | - |
| LMMs-Lab/Aero-1-Audio | 1.5B | 5.7 | 5.3 | 4.7 | 5.8 | 5.4 | 59.4 | 40.0 | 45.4 | 48.0 | 57.6 | 44.2 | 50.5 |

    We conducted evaluations on AIR-Bench-Chat and MMAU for audio analysis and understanding. Our model achieved an average score of 5.35, outperforming Mini-Omni2 and Vita. For Audio Instruction Following, we evaluated on OpenHermes and Alpaca-Audio, following the same pipeline as AudioBench. Our model demonstrates a strong ability to understand instructions in speech and provide correct responses. Additionally, when evaluated on AIR-Bench-Foundation for Audio Scene Understanding, our model outperformed Phi-4-Multimodal in the sound and music dimensions. Overall, the average score of our model indicates strong performance relative to other models with larger parameter sizes.

    Training Techniques

    Dynamic Batch Size

    We implemented a dynamic batching strategy based on the estimated token length to control the batch size per device. In many cases, using a fixed batch size requires setting it conservatively small to avoid out-of-memory (OOM) errors on longer samples, which leads to underutilization of computing resources. To address this, we group samples into batches such that the total token length stays within a predefined threshold, thereby minimizing computational waste and improving efficiency.
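A minimal sketch of this dynamic batching strategy is shown below; the `est_tokens` field and the greedy grouping policy are illustrative assumptions rather than the exact implementation.

```python
def pack_into_batches(samples, max_tokens_per_batch: int):
    """Greedy dynamic batching: group samples so the summed estimated token length
    stays under a budget. sample["est_tokens"] is an assumed pre-computed field
    holding the audio + text token estimate."""
    batches, current, current_tokens = [], [], 0
    for sample in sorted(samples, key=lambda s: s["est_tokens"], reverse=True):
        if current and current_tokens + sample["est_tokens"] > max_tokens_per_batch:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(sample)
        current_tokens += sample["est_tokens"]
    if current:
        batches.append(current)
    return batches
```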

    Sequence Packing

    To further optimize dynamic batching, we implemented sequence packing for both the audio encoder and the language model, enabling larger batch sizes and faster training. This operation was then fused with the Liger kernel to achieve even higher throughput and lower memory usage. With a fixed packing length of 4096 to regulate the dynamic batch size, the average Model FLOP Utilization (MFU) was limited to 0.03. However, with sequence packing enabled, the average MFU increased to approximately 0.34, demonstrating a significant improvement in training efficiency.
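A simplified sketch of sequence packing with cumulative sequence boundaries (`cu_seqlens`), which variable-length attention kernels typically consume; the exact packing logic and the Liger-kernel fusion used in training are not shown.

```python
import torch

def pack_sequences(token_id_lists, pack_length: int = 32768, pad_id: int = 0):
    """Concatenate variable-length sequences into fixed-length packed buffers and
    record cumulative boundaries (cu_seqlens) so attention can stay within each
    original sequence. Sequences longer than pack_length are assumed to have been
    filtered out upstream."""
    packs, buf, cu_seqlens = [], [], [0]
    for ids in token_id_lists:
        if len(buf) + len(ids) > pack_length:
            buf += [pad_id] * (pack_length - len(buf))  # pad the tail of this pack
            packs.append((torch.tensor(buf), torch.tensor(cu_seqlens)))
            buf, cu_seqlens = [], [0]
        buf += ids
        cu_seqlens.append(len(buf))
    if buf:
        buf += [pad_id] * (pack_length - len(buf))
        packs.append((torch.tensor(buf), torch.tensor(cu_seqlens)))
    return packs
```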

| Packing Length | Sequence Packing | Num GPUs | Avg MFU | ZeRO | OOM |
|---|---|---|---|---|---|
| 4096 | FALSE | 64 | 0.03 | 2 | No |
| 32768 | FALSE | 64 | N/A | 2 | Yes |
| 32768 | TRUE | 32 | 0.34 | 2 | No |

We tested different settings to demonstrate the efficiency of our implementation.

    Contributor List

    alphabetical order

    *main contributors

    Citation

    @article{li2025aero,
      title={Aero: Audio-enhanced large language models},
      author={Li, Bo and Chen Change Loy and Pu Fanyi and Yang Jingkang and Zhang Kaichen and Hu Kairui and Thang Luu Minh and Trung Nguyen Quang and Cong Pham Ba and Liu Shuai and Wang Yezhen and Liu Ziwei},
      url={https://www.lmms-lab.com/posts/aero_audio/},
      year={2025}
    }
  • teaser

    We introduce EgoLife, a project to develop an egocentric life assistant that accompanies and enhances personal efficiency through AI-powered wearable glasses 👓. To lay the foundation for this assistant, we conducted a comprehensive data collection study where six participants lived together for one week, continuously recording their daily activities—including discussions 💬, shopping 🛍️, cooking 🍳, socializing 👥, and entertainment 🎮 - using AI glasses for multimodal egocentric video capture, along with synchronized third-person-view video references. This effort resulted in the EgoLife Dataset 📖, a comprehensive 300-hour egocentric, interpersonal, multiview, and multimodal daily life dataset with intensive annotation. Leveraging this dataset, we introduce EgoLifeQA❓, a suite of 3K long-context, life-oriented question-answering tasks designed to provide meaningful assistance in daily life by addressing practical questions such as recalling past relevant events, monitoring health habits, and offering personalized recommendations.

To address the key technical challenges of 1) developing robust visual-audio models for egocentric data, 2) enabling identity recognition, and 3) facilitating long-context question answering over extensive temporal information, we introduce EgoButler 🫡, an integrated system comprising EgoGPT 🧠 and EgoRAG 🔍. EgoGPT is a vision-language model trained on egocentric datasets, achieving state-of-the-art performance on egocentric video understanding. EgoRAG is a retrieval-based component that supports answering ultra-long-context questions. Our experimental studies verify their working mechanisms and reveal critical factors and bottlenecks, guiding future improvements. By releasing our datasets, models, and benchmarks, we aim to stimulate further research in egocentric AI assistants.

  • Banner

    Humans acquire knowledge through three cognitive stages: perceiving information, comprehending knowledge, and adapting knowledge to solve novel problems. Videos serve as an effective medium for this learning process, facilitating a progression through these cognitive stages. However, existing video benchmarks fail to systematically evaluate the knowledge acquisition capabilities in Large Multimodal Models (LMMs).

    To address this gap, we introduce Video-MMMU, a multi-modal, multi-disciplinary benchmark designed to assess LMMs’ ability to acquire and utilize knowledge from videos. Video-MMMU features a curated collection of 300 expert-level videos and 900 human-annotated questions across six disciplines, evaluating knowledge acquisition through stage-aligned question-answer pairs: Perception, Comprehension, and Adaptation. A proposed knowledge gain metric, Δ_knowledge, quantifies improvement in performance after video viewing. Evaluation of LMMs reveals a steep decline in performance as cognitive demands increase and highlights a significant gap between human and model knowledge acquisition, underscoring the need for methods to enhance LMMs’ capability to learn and adapt from videos.

  • Banner

For the first time in the multimodal domain, we demonstrate that features learned by Sparse Autoencoders (SAEs) in a smaller Large Multimodal Model (LMM) can be effectively interpreted by a larger LMM. Our work introduces the use of SAEs to analyze the open-semantic features of LMMs, providing a solution for feature interpretation across model scales.

This research is inspired by Anthropic’s remarkable work on applying SAEs to interpret features in large-scale language models. In multimodal models, we discovered intriguing features that correlate with diverse semantics and can be leveraged to steer model behavior, enabling more precise control and understanding of LMM functionality.

    The Sparse Autoencoder (SAE) is trained on LLaVA-NeXT data by integrating it into a specific layer of the model, with all other components frozen. The features learned by the SAE are subsequently interpreted through the proposed auto-explanation pipeline, which analyzes the visual features based on their activation regions.

    Steer

These features can then be used to steer the model’s behavior toward desired outputs. You can check our papers for more details.

  • The development of video large multimodal models (LMMs) has been hindered by the difficulty of curating large amounts of high-quality raw data from the web. To address this, we consider an alternative approach, creating a high-quality synthetic dataset specifically for video instruction-following, namely LLaVA-Video-178K. This dataset includes key tasks such as detailed captioning, open-ended question-answering (QA), and multiple-choice QA. By training on this proposed dataset, in combination with existing visual instruction tuning data, we introduce LLaVA-Video, a new video LMM. Our experiments demonstrate that LLaVA-Video achieves strong performance across various video benchmarks, highlighting the effectiveness of our dataset. We plan to release the dataset, its generation pipeline, and the model checkpoints.

    Video Instruction-Following Data Synthesis

A high-quality dataset for video instruction-tuning is crucial for developing effective video-language models. We identify a key factor in building such datasets: ensuring richness and diversity in both video content and its language annotations. We perform a comprehensive survey of existing video benchmarks, covering various public video captioning and question-answering datasets, and identify ten unique video sources that contribute to over 40 video-language benchmarks. From each source, we select videos that exhibit significant temporal dynamics. To maintain diversity in the annotations, we establish a pipeline capable of generating detailed captions for videos of any length. Additionally, we define 16 types of questions that guide GPT-4o in creating question-answer pairs to assess the perceptual and reasoning skills of video-language models.

    Video Sources

We noticed that although different video-language datasets focus on various video understanding tasks, most are sourced from ten main video sources, which offer a wide range of video data from different websites, viewpoints, and domains. The relationship between these ten selected video datasets and others is shown in the figure below. We select dynamic videos from these sources; the video selection logic is detailed in the paper.

    Automated Generation for Video Detail Description

    For selected videos, we use GPT-4o to systematically describe their content. We start by sampling video frames at one frame per second (fps). However, due to the input size constraints of GPT-4o, we cannot use all sampled frames. Instead, we describe the videos sequentially, as shown in figure below. We create descriptions at three distinct levels, detailed below.
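A minimal sketch of the 1 fps frame-sampling step, using OpenCV as an assumed decoding backend; the sequential GPT-4o captioning calls themselves are omitted.

```python
import cv2  # assumed decoding backend; decord or ffmpeg would work equally well

def sample_frames_at_1fps(video_path: str):
    """Sample roughly one frame per second from a video before sequentially
    describing it; the GPT-4o captioning calls are not shown."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0  # fall back if FPS metadata is missing
    step = max(1, int(round(fps)))
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:  # keep roughly one frame per second
            frames.append(frame)
        idx += 1
    cap.release()
    return frames
```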

    Automated Generation for Video Question Answering

    In addition to detailed video descriptions, our dataset includes a variety of question-answer pairs designed for complex interactions. This setup improves the video understanding model’s ability to handle real-life queries. We refer to public video question-answering benchmarks to organize these questions into 16 specific categories, as shown in Figure 3. Given a detailed video description, we use GPT-4o to generate at most one question-answer pair for each type of question. Please refer to the paper for more details of the question types and the generation process.

    Dataset Statistics

    We carefully select from our collected data sources to form a balanced and comprehensive collection, resulting in a total of 178K videos and 1.3M instruction-following samples. This includes 178K captions, 960K open-ended QAs, and 196K multiple-choice QAs.

    Dataset Comparison

    We provide a comparison of high-quality instruction-following video-language datasets, with a focus on synthetic data created with strong AI models, as shown in Table 1.

A broad collection of dynamic videos. In terms of video sources, although LLaVA-Hound contains the largest number of videos, 44% of its video data are sourced from WebVid, where most videos are static. ShareGPT4Video includes 30% of its videos from Pexels, Pixabay, and Mixkit, which are aesthetically pleasing but also mostly static. Additionally, the majority of its videos come from Panda-70M, which consists of short clips cut from longer videos, suggesting simpler plots. In contrast, we carefully select video sources that offer dynamic, untrimmed videos with complex plots, which are crucial for developing a powerful video understanding model.

High frames per second. Regarding frame sampling in language annotations, the proposed dataset uses 1 FPS, while other datasets use much lower FPS. LLaVA-Hound uniformly samples 10 frames from videos of any length, for an average of 0.008 FPS, which may miss fine details. ShareGPT4Video picks key frames using CLIP based on frame uniqueness; this method can also miss subtle changes in the video because CLIP embeddings do not capture fine-grained dynamics well. Our method samples at FPS=1 without key-frame selection algorithms, ensuring that detailed temporal information can be expressed in annotations with high coverage.

Diverse tasks. The proposed dataset covers three common task types, including captioning, free-form QA, and closed-form QA, while existing datasets cover only a subset. Meanwhile, the quality and number of samples in our dataset are higher.

  • LLaVA-OneVision

We present LLaVA-OneVision, a family of open large multimodal models (LMMs) developed by consolidating our insights into data, models, and visual representations in the LLaVA-NeXT blog series. Our experimental results demonstrate that LLaVA-OneVision is the first single model that can simultaneously push the performance boundaries of open LMMs in three important computer vision scenarios: single-image, multi-image, and video scenarios. Importantly, the design of LLaVA-OneVision allows strong transfer learning across different modalities and scenarios, yielding new emerging capabilities. In particular, strong video understanding and cross-scenario capabilities are demonstrated through task transfer from images to videos.

We open-source LLaVA-OneVision to facilitate the future development of LMMs in the community.

    Training Code: Cook a SOTA model with our released training code

    🤗 Checkpoints: Access pre-trained model checkpoints (0.5B, 7B, 72B)

    🤗 LLaVA-OneVision Data: Explore training datasets for Single-Image and OneVision stages

  • Banner

    In today’s world, we’re on an exciting journey toward creating Artificial General Intelligence (AGI), much like the enthusiasm of the 1960s moon landing. This journey is powered by advanced large language models (LLMs) and large multimodal models (LMMs), which are complex systems capable of understanding, learning, and performing a wide variety of human tasks.

    To gauge how advanced these models are, we use a variety of evaluation benchmarks. These benchmarks are tools that help us understand the capabilities of these models, showing us how close we are to achieving AGI.

    However, finding and using these benchmarks is a big challenge. The necessary benchmarks and datasets are spread out and hidden in various places like Google Drive, Dropbox, and different school and research lab websites. It feels like we’re on a treasure hunt, but the maps are scattered everywhere.

In the field of language models, a valuable precedent has been set by lm-evaluation-harness. It offers integrated data and model interfaces that enable rapid evaluation of language models, serves as the backend framework for the open-llm-leaderboard, and has gradually become the underlying ecosystem of the era of foundation models.

We humbly absorbed the exquisite and efficient design of lm-evaluation-harness and introduce lmms-eval, an evaluation framework meticulously crafted for consistent and efficient evaluation of LMMs.

  • Banner

Gemini has amazed the world with its capability to understand hour-long videos. However, we still lack an open-source alternative with similar capabilities. Our latest research presents an innovative solution toward long-video LMMs, shifting the focus from reducing visual tokens per frame to leveraging the long-context capabilities of language models. Here, we present our SoTA video model, Long Video Assistant (LongVA), and our novel benchmark, Visual Needle-In-A-Haystack (V-NIAH).

    Long Context Transfer We discovered and verified that the long context capability of language models can be directly transferred to the video domain in modality-aligned multi-modal models. On V-NIAH, LongVA is the only open-source model capable of accurately retrieving visual information from inputs with 2000 frames or more than 200K visual tokens.

    UniRes We proposed UniRes, a unified visual encoding scheme that encodes both images and videos. In UniRes, a video is encoded the same as multiple image crops in a sequence. Leveraging the Long Context Transfer property and UniRes, LongVA exhibits superior zero-shot performance in video tasks without any video-specific training data.

SoTA Performance LongVA achieves state-of-the-art performance on the comprehensive Video-MME benchmark among 7B models. Its performance increases with denser sampling of video frames. We also conduct careful experiments to ablate where its improvements come from.