
MMSearch-R1: Bridging the gap between internal knowledge and external search

    MMSearch-R1 is the first end-to-end RL-based solution designed to equip LMMs with the capability to perform search on demand in real-world internet environments. It outperforms same-sized RAG baselines and approaches the performance of larger models while requiring significantly fewer search calls.

    Figure 1: MMSearch-R1 learns to recognize the boundaries of its knowledge and perform on-demand search, significantly reducing the number of searches required while outperforming RAG-based models on knowledge-intensive and info-seeking VQA tasks.

    1. Introduction

    Scaling up vision-language paired data has become a widely adopted paradigm for Large Multimodal Models (LMMs) to acquire grounded knowledge of the visual world. Although this static training strategy has proven effective, it remains limited in capturing complex and evolving real-world knowledge. In particular, state-of-the-art LMMs continue to struggle with:

    • Long-tail facts and newly emerging information
    • Domain-specific content restricted by privacy or copyright constraints
    • Knowledge-intensive and information-seeking visual question answering tasks

As a result, their performance remains suboptimal, and they frequently generate hallucinated outputs when confronted with inputs beyond their training distribution.

    Current Limitations

Existing approaches such as Retrieval-Augmented Generation (RAG) and prompt-based agents fall short:

    • RAG methods rely on fixed retrieve-then-generate pipelines, leading to over-retrieval and high computational costs
    • Prompt-based agents can access real-time search engines but lack parameter optimization through learning

    Our Solution: MMSearch-R1

To address these limitations, we introduce MMSearch-R1, which trains LMMs to acquire three essential search-related capabilities (see the sketch after this list):

    1. When to search - Recognizing knowledge boundaries
    2. What to search for - Formulating effective queries
3. How to reason - Reasoning over search results to answer user queries
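
In code, these three capabilities amount to an agent loop that either answers directly or consults a tool first. Below is a minimal sketch under our own naming; model.act, image_search, and text_search are hypothetical helpers, not MMSearch-R1's released API:

```python
# Hypothetical sketch of an on-demand search loop; all names are illustrative.

def answer_on_demand(question, image, model, image_search, text_search,
                     max_turns=4):
    """Let the model decide, turn by turn, whether to answer or search."""
    context = [{"role": "user", "question": question, "image": image}]
    for _ in range(max_turns):
        action = model.act(context)          # model emits its next step
        if action.kind == "answer":          # (1) when to search: not needed
            return action.text
        if action.kind == "image_search":    # (2) what: the unfamiliar image
            results = image_search(image)
        else:                                # (2) what: a model-written query
            results = text_search(action.query)
        # (3) how to reason: fold retrieved evidence back into the context
        context.append({"role": "tool", "content": results})
    return model.act(context, force_answer=True).text
```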

    Key Contributions

    • 🏗️ Dataset Construction - Automated approach to construct multimodal search VQA dataset
    • 🔧 Multimodal Search Tool Integration - Real-world search pipeline with image and text tools
    • 🧠 Wiser Search via Reinforcement Learning - GRPO-based RL framework for optimal search decisions
    • 🌐 Open-Sourced Framework - Complete model, dataset, and training framework release

    2. Method

2.1. Building an Iterative Multimodal Search-Integrated RL Framework

    Figure 2: Illustration of training in MMSearch-R1. Top: The GRPO training pipeline integrated with multimodal search tools. Bottom: A detailed view of the rollout process and search tool execution.

We build on veRL and adopt standard GRPO (Group Relative Policy Optimization) as our base RL algorithm, with modifications that allow search interactions during the rollout process.
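
For readers unfamiliar with GRPO, its core idea is to score each rollout relative to the other rollouts sampled for the same prompt, removing the need for a learned value model. Here is a minimal sketch of that group-relative advantage; the PPO-style clipped loss, KL regularization, and veRL plumbing are omitted, and in search-integrated rollouts the tokens returned by the search tool would typically be masked out of the policy loss:

```python
import numpy as np

def grpo_advantages(group_rewards, eps=1e-6):
    """Normalize each rollout's reward against its own group's statistics.

    Each prompt is rolled out G times (here, possibly with interleaved
    search calls); rollouts scoring above their group's mean get positive
    advantage, those below get negative advantage.
    """
    r = np.asarray(group_rewards, dtype=np.float32)
    return (r - r.mean()) / (r.std() + eps)

# Example: 4 rollouts of one prompt, scored with the reward described below.
print(grpo_advantages([0.91, 0.0, 1.0, 0.91]))
```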

    Multimodal Search Tools

    Our framework equips models with two types of search tools:

    1. Image Search Tool

  • Takes the input image and returns the top-5 visually similar webpages
  • Each result includes a thumbnail and a title
      • Enables identification of unfamiliar visual entities
    2. Text Search Pipeline

      • Model formulates queries based on user questions
      • Retrieves relevant webpages and processes content
      • Provides concise summaries for accurate answering
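
To make the tool contract concrete, here is an illustrative Python interface; the type and field names are our assumptions drawn from the description above, not the released code:

```python
from dataclasses import dataclass

@dataclass
class ImageSearchHit:
    """One of the top-5 results: a visually similar webpage."""
    title: str
    url: str
    thumbnail_url: str

def image_search(image_bytes: bytes, k: int = 5) -> list[ImageSearchHit]:
    """Return the k most visually similar webpages for the input image."""
    ...

def text_search(query: str) -> str:
    """Retrieve webpages relevant to a model-written query and return a
    concise summary of their content for the model to reason over."""
    ...
```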

    Reward Modeling

    Our reward system consists of two components:

    reward = (1 - α) × Acc_Score × Search_Penalty + α × Format_Score
    
    • Accuracy Score - Exact string match against ground truth (1 for correct, 0 otherwise)
    • Search Penalty - Applied to correct responses that used search, encouraging internal knowledge use
    • Format Score - Ensures model follows required output structure
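
A direct transcription of this reward into Python; α and the penalty factor are hyperparameters, and the concrete values here are placeholders rather than the paper's settings:

```python
def compute_reward(correct: bool, used_search: bool, format_ok: bool,
                   alpha: float = 0.1, penalty: float = 0.9) -> float:
    """reward = (1 - alpha) * Acc * SearchPenalty + alpha * Format."""
    acc = 1.0 if correct else 0.0
    # Discount correct answers that relied on search, nudging the model to
    # answer from internal knowledge when that suffices.
    search_penalty = penalty if (correct and used_search) else 1.0
    fmt = 1.0 if format_ok else 0.0
    return (1.0 - alpha) * acc * search_penalty + alpha * fmt
```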

    2.2. Curating Search-balanced VQA Datasets

Figure 3: Illustration of the FVQA data construction process: (a) automated pipeline for collecting VQA samples that require visual knowledge; (b) knowledge taxonomy; (c) overall pipeline showing the composition and sources of FVQA.

    We construct FactualVQA (FVQA), a search-balanced dataset following three key criteria:

1. Coverage of Both Search-Required and Search-Free Questions
    2. Concise and Verifiable Answers
    3. Diversity in Knowledge and Difficulty

    Data Construction Pipeline

• VQA Collection - Gather candidate questions requiring visual or textual knowledge
• Search Balancing - Use a preliminary model to classify each sample's search requirement (a sketch of this step follows the list)
• Human Annotation - Ensure diversity, authenticity, and label quality
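
One plausible reading of the search-balancing step, sketched below: probe a preliminary model that has no search access, and treat questions it can already answer as search-free. model.answer and judge are hypothetical helpers, and the pass threshold is illustrative:

```python
def label_search_requirement(sample, model, judge, n_attempts=4):
    """Label a VQA sample 'search-free' if the preliminary model ever
    answers it correctly without search, else 'search-required'."""
    hits = sum(
        judge(model.answer(sample.question, sample.image), sample.answer)
        for _ in range(n_attempts)
    )
    return "search-free" if hits > 0 else "search-required"
```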

    3. Experimental Findings

    We evaluated MMSearch-R1 against both closed-source models (GPT-4o, Gemini 2.5 Pro) and open-source models (Qwen2.5-VL series) on knowledge-intensive VQA tasks.

    Table 1: Performance of MMSearch-R1 across benchmarks. 'Acc (%)' denotes accuracy evaluated by LLM-as-Judge, while 'SR (%)' represents the search ratio.
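
For concreteness, here is a sketch of how the two reported metrics could be computed; the judge prompt and record fields are illustrative assumptions, not the actual evaluation code:

```python
JUDGE_PROMPT = ("Question: {q}\nGold answer: {gold}\nModel answer: {pred}\n"
                "Reply 'yes' if the model answer is semantically correct, "
                "otherwise reply 'no'.")

def accuracy_llm_judge(records, judge_llm):
    """Acc (%): fraction of answers the judge LLM marks correct."""
    correct = sum(
        judge_llm(JUDGE_PROMPT.format(q=r.question, gold=r.gold, pred=r.pred))
        .strip().lower().startswith("yes")
        for r in records
    )
    return 100.0 * correct / len(records)

def search_ratio(records):
    """SR (%): fraction of questions for which the model invoked search."""
    return 100.0 * sum(r.used_search for r in records) / len(records)
```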

    Key Findings

    Finding 1: Enhanced Knowledge Boundary Recognition

MMSearch-R1-7B outperforms same-sized RAG-based models by an average of 3% in accuracy while reducing the average search ratio by 32.9%.

Figure 4: (a) Performance comparison between the Base and RL-trained models under a RAG workflow. (b) Answer behavior breakdown of the Base (inner circle) and RL (outer circle) models.

    Finding 2: Improved Query Generation and Summarization

RL training enhances the model's ability to generate effective text queries and to summarize retrieved information under a fixed RAG setup.

    Finding 3: Better Internal Knowledge Utilization

A clear upward trend in the proportion of "Correct without Search" responses demonstrates improved recall and reasoning over the model's internal knowledge.

    Figure 5: (a) Performance improvements of SFT and RL over Base across five VQA datasets. (b) Training dynamics of reward and search ratio for different strategies.

    Finding 4: RL vs. Supervised Learning

    RL consistently outperforms SFT across all tasks despite being trained on only about half as much data, demonstrating superior data efficiency.

    Finding 5: Balanced Training Effectiveness

    Training with balanced data and search penalty effectively guides the model to perform on-demand search without overusing the search tool.

    4. Conclusion

    MMSearch-R1 represents a significant advancement in multimodal AI, learning to:

    • Recognize knowledge gaps and boundaries
    • Selectively invoke image or text search
    • Reason effectively over retrieved content

    Our framework outperforms same-sized RAG baselines and approaches larger model performance while requiring significantly fewer search calls. This work lays the groundwork for building multimodal agents that are both adaptive and interactive, paving the way for the next major advancement in multimodal intelligence.