
MMSearch-R1 is the first end-to-end RL-based solution designed to equip LMMs with the capability to perform search on demand in real-world internet environments. It outperforms same-sized RAG baselines and approaches the performance of larger models while requiring significantly fewer search calls.

1. Introduction
Scaling up vision-language paired data has become a widely adopted paradigm for Large Multimodal Models (LMMs) to acquire grounded knowledge of the visual world. Although this static training strategy has proven effective, it remains limited in capturing complex and evolving real-world knowledge. In particular, state-of-the-art LMMs continue to struggle with:
- Long-tail facts and newly emerging information
- Domain-specific content restricted by privacy or copyright constraints
- Knowledge-intensive and information-seeking visual question answering tasks
As a result, these models often underperform and generate hallucinated outputs when confronted with inputs beyond their training distribution.
Current Limitations
Existing approaches such as Retrieval-Augmented Generation (RAG) and prompt-based agents remain suboptimal:
- RAG methods rely on fixed retrieve-then-generate pipelines, leading to over-retrieval and high computational costs
- Prompt-based agents can access real-time search engines but lack parameter optimization through learning
Our Solution: MMSearch-R1
To address these limitations, we introduce MMSearch-R1, training LMMs to acquire three essential search-related capabilities:
- When to search - Recognizing knowledge boundaries
- What to search for - Formulating effective queries
- How to reason - Drawing on search results to answer user queries
Key Contributions
- 🏗️ Dataset Construction - Automated approach to construct multimodal search VQA dataset
- 🔧 Multimodal Search Tool Integration - Real-world search pipeline with image and text tools
- 🧠 Wiser Search via Reinforcement Learning - GRPO-based RL framework for optimal search decisions
- 🌐 Open-Sourced Framework - Complete model, dataset, and training framework release
2. Method
2.1. Building Iterative Multimodal Search-Integrated RL Framework

We build on veRL and adopt standard GRPO as our base RL algorithm, with modifications that allow search interactions during the rollout process.
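A minimal sketch of how such a search-integrated rollout might be structured, assuming injected callables (`generate`, `image_search`, `text_search`) and illustrative action tags; this is not the actual veRL or MMSearch-R1 code:

```python
# Hypothetical sketch of a search-integrated rollout loop, not the actual
# veRL/MMSearch-R1 implementation. `generate`, `image_search`, and
# `text_search` are callables assumed to be supplied by the training stack.

def extract_query(text: str, open_tag: str, close_tag: str) -> str:
    """Pull the query string out of a model-emitted search action."""
    return text.split(open_tag, 1)[1].split(close_tag, 1)[0].strip()

def rollout(generate, image_search, text_search, question, image, max_turns=4):
    """Roll out one trajectory, letting the model decide when (and what) to search."""
    context = [{"role": "user", "content": question, "image": image}]
    response = ""
    for _ in range(max_turns):
        response = generate(context)                      # reasoning + one action
        context.append({"role": "assistant", "content": response})
        if "<image_search>" in response:                  # illustrative image-search tag
            context.append({"role": "tool", "content": image_search(image)})
        elif "<text_search>" in response:                 # illustrative text-search tag
            query = extract_query(response, "<text_search>", "</text_search>")
            context.append({"role": "tool", "content": text_search(query)})
        else:                                             # final answer: stop searching
            break
    return response, context
```

Each tool response is appended to the context so the next generation turn can reason over it, and the rollout ends as soon as the model emits a final answer instead of a search action.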
Multimodal Search Tools
Our framework equips models with two types of search tools (sketched in code after this list):
- Image Search Tool
  - Takes the input image and returns the top-5 visually similar webpages
  - Each result includes a thumbnail and title
  - Enables identification of unfamiliar visual entities
- Text Search Pipeline
  - The model formulates queries based on the user question
  - Retrieves relevant webpages and processes their content
  - Provides concise summaries for accurate answering
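For illustration, the two tools could be wrapped behind a common result type like the following; the backend methods (`reverse_image_lookup`, `web_search`) and field names are assumptions, not the released search pipeline:

```python
# Illustrative tool wrappers; names, backend methods, and return shapes are
# assumptions for exposition, not the released search pipeline.

from dataclasses import dataclass
from typing import Optional

@dataclass
class SearchResult:
    title: str
    url: str
    thumbnail_url: Optional[str] = None   # populated by image search
    summary: Optional[str] = None         # populated by text search

def image_search(image_bytes: bytes, backend, top_k: int = 5) -> list[SearchResult]:
    """Return the top-k visually similar webpages, each with a thumbnail and title."""
    hits = backend.reverse_image_lookup(image_bytes, top_k=top_k)   # assumed backend API
    return [SearchResult(h["title"], h["url"], thumbnail_url=h["thumbnail"]) for h in hits]

def text_search(query: str, backend, summarize, top_k: int = 5) -> list[SearchResult]:
    """Retrieve webpages for a model-formulated query and return concise summaries."""
    pages = backend.web_search(query, top_k=top_k)                  # assumed backend API
    return [SearchResult(p["title"], p["url"], summary=summarize(p["content"]))
            for p in pages]
```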
Reward Modeling
Our reward combines a search-penalized accuracy score with a format score (see the code sketch after this list):
reward = (1 - α) × Acc_Score × Search_Penalty + α × Format_Score
- Accuracy Score - Exact string match against ground truth (1 for correct, 0 otherwise)
- Search Penalty - Applied to correct responses that used search, encouraging internal knowledge use
- Format Score - Ensures model follows required output structure
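Reading the formula literally, a minimal reward function might look like this; the values α = 0.1 and a 0.9 search-penalty factor are illustrative placeholders, not the paper's exact hyperparameters:

```python
# Minimal reading of the reward formula above. The penalty factor (0.9) and
# alpha (0.1) are illustrative placeholders, not the paper's hyperparameters.

def compute_reward(prediction: str, ground_truth: str, used_search: bool,
                   format_ok: bool, alpha: float = 0.1,
                   search_penalty: float = 0.9) -> float:
    """reward = (1 - alpha) * Acc_Score * Search_Penalty + alpha * Format_Score"""
    acc_score = 1.0 if prediction.strip().lower() == ground_truth.strip().lower() else 0.0
    # The penalty multiplies the accuracy term, so it only affects correct
    # answers that were obtained via search.
    penalty = search_penalty if used_search else 1.0
    format_score = 1.0 if format_ok else 0.0
    return (1 - alpha) * acc_score * penalty + alpha * format_score
```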
2.2. Curating Search-balanced VQA Datasets

We construct FactualVQA (FVQA), a search-balanced dataset built around three key criteria:
- Coverage of Both Search-Required/Free Questions
- Concise and Verifiable Answers
- Diversity in Knowledge and Difficulty
Data Construction Pipeline
- VQA Collection - Gather candidates requiring visual or textual knowledge
- Search Balancing - Use a preliminary model to classify whether each question requires search (see the sketch after this list)
- Human Annotation - Ensure diversity, authenticity, and label quality
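One hedged sketch of the search-balancing step: probe a preliminary model that has no search access and label each item by how often it answers correctly from internal knowledge (the sample count and threshold are assumptions):

```python
# Hypothetical search-balancing probe; the sample count and pass threshold
# are assumptions, not the paper's exact procedure.

def label_search_requirement(answer_fn, question, image, ground_truth,
                             n_samples: int = 8, pass_threshold: float = 0.5) -> str:
    """Label a VQA item as search-free or search-required by sampling answers
    from a preliminary model that has no access to search tools."""
    correct = sum(
        1 for _ in range(n_samples)
        if answer_fn(question, image).strip().lower() == ground_truth.strip().lower()
    )
    # If the model usually answers correctly from internal knowledge alone,
    # the question does not require search.
    return "search_free" if correct / n_samples >= pass_threshold else "search_required"
```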
3. Experimental Findings
We evaluate MMSearch-R1 against both closed-source models (GPT-4o, Gemini 2.5 Pro) and open-source models (the Qwen2.5-VL series) on knowledge-intensive VQA tasks.

Key Findings
Finding 1: Enhanced Knowledge Boundary Recognition
MMSearch-R1-7B outperforms same-sized RAG-based models by an average of 3% in accuracy while reducing the average search rate by 32.9%.

Finding 2: Improved Query Generation and Summarization
RL training enhances the model’s ability to generate effective text queries and to summarize retrieved information under a fixed RAG setup.
Finding 3: Better Internal Knowledge Utilization
A clear upward trend in the proportion of "Correct without Search" responses demonstrates improved recall and reasoning over the model’s internal knowledge.

Finding 4: RL vs. Supervised Learning
RL consistently outperforms SFT across all tasks despite being trained on only about half as much data, demonstrating superior data efficiency.
Finding 5: Balanced Training Effectiveness
Training with balanced data and search penalty effectively guides the model to perform on-demand search without overusing the search tool.
4. Conclusion
MMSearch-R1 represents a significant advancement in multimodal AI, learning to:
- Recognize knowledge gaps and boundaries
- Selectively invoke image or text search
- Reason effectively over retrieved content
Our framework outperforms same-sized RAG baselines and approaches larger model performance while requiring significantly fewer search calls. This work lays the groundwork for building multimodal agents that are both adaptive and interactive, paving the way for the next major advancement in multimodal intelligence.