Figure 1: LLaVA-Critic-R1 is trained on top of the base model Qwen-2.5-VL-7B. Building upon a stronger reasoning VLM, ThinkLite-VL-7B, we further develop LLaVA-Critic-R1+ by applying the same RL critic training procedure. **Left**: Performance comparison of LLaVA-Critic-R1 with other base and reasoning VLMs on multiple visual reasoning, visual understanding, and visual reward benchmarks. LLaVA-Critic-R1 not only significantly outperforms other models in critic performance, but also demonstrates stronger policy capabilities. **Right**: Performance improvement of critic training and test-time self-critic scaling on five common visual reasoning and visual understanding benchmarks. Critic training alone significantly improves the base model's performance. Building upon this, leveraging the dual policy and critic capabilities of LLaVA-Critic-R1 for a 'Best-of-128' self-critic scaling procedure at test time leads to a further substantial boost in performance.
Breaking the Critic-Policy Divide
In vision-language modeling, critic models are typically trained to evaluate outputs—assigning scalar scores or pairwise preferences—rather than to generate responses. This separation from policy models, which produce the responses, is so entrenched that critics are rarely considered for direct policy use.
LLaVA-Critic-R1 challenges this convention. We propose to reorganize preference-labeled critic datasets into verifiable training signals and perform reinforcement learning directly on a base generative model, producing a multimodal critic trained to optimize preference judgments while retaining full generation ability.
Surprising Dual Excellence
LLaVA-Critic-R1 emerges not only as a top-performing critic but also as a competitive policy model—matching or surpassing specialized reasoning VLMs trained with in-domain data across 26 visual reasoning and understanding benchmarks, with an average gain of +5.7% over its base model (Qwen-2.5-VL-7B).
Extending this approach to existing strong reasoning VLMs yields LLaVA-Critic-R1+, which further advances policy performance without sacrificing critic quality, achieving a state-of-the-art 71.9 on MMMU at the 7B scale.
Self-Critique at Test Time
The enhanced critic ability benefits inference significantly. Applying self-critique at test time yields an average +13.8% improvement on five representative reasoning tasks without additional training. This demonstrates the power of unified critic-policy models for creating self-improving systems.
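To make the self-critique procedure concrete, below is a minimal sketch of Best-of-N selection with a single unified model; the `generate()` and `judge()` interfaces and the scalar-scoring selection are assumptions for illustration, not necessarily the exact protocol used for LLaVA-Critic-R1.

```python
# Minimal sketch of Best-of-N self-critique at test time with one unified model.
# `generate()` and `judge()` are assumed interfaces; scoring-based selection is
# one plausible protocol (pairwise comparison is another).

def best_of_n_self_critique(model, image, question, n=128):
    # Policy role: sample N candidate responses for the same query.
    candidates = [model.generate(image, question, temperature=1.0) for _ in range(n)]

    # Critic role: the same model scores each of its own candidates.
    scores = [model.judge(image, question, cand) for cand in candidates]

    # Return the highest-scoring candidate as the final answer.
    best = max(range(n), key=lambda i: scores[i])
    return candidates[best]
```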
Technical Innovation
Our approach centers on three key innovations:
Data Reorganization: We transform preference-labeled critic datasets into verifiable training signals suitable for reinforcement learning (a minimal sketch follows this list).
GRPO Training: We apply Group Relative Policy Optimization directly on generative models, enabling them to learn from critic data while maintaining generation capabilities.
Unified Architecture: We maintain a single model for both critic and policy functions, eliminating the traditional separation between evaluation and generation.
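As an illustration of the data reorganization step above, the following sketch converts a pairwise preference record into a verifiable prompt/label pair and checks the model's judgment with a rule-based reward; the field names and prompt wording are hypothetical, not the actual data schema.

```python
# Sketch of turning a preference-labeled critic example into a verifiable RL signal.

def build_critic_prompt(example):
    """Ask the model to pick the preferred response; the human preference label
    becomes a ground truth that a rule-based checker can verify."""
    prompt = (
        f"Question: {example['question']}\n"
        f"Response A: {example['response_a']}\n"
        f"Response B: {example['response_b']}\n"
        "Which response is better? Answer 'A' or 'B' inside <answer></answer> tags."
    )
    return prompt, example["preferred"]  # ground truth: 'A' or 'B'

def verifiable_reward(model_output, ground_truth):
    # Binary reward: 1 if the model's judgment matches the human preference.
    predicted = model_output.split("<answer>")[-1].split("</answer>")[0].strip()
    return 1.0 if predicted == ground_truth else 0.0
```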
Model Performance
LLaVA-Critic-R1 demonstrates strong performance across diverse benchmarks:
Visual Reasoning: Competitive performance with specialized models on complex reasoning tasks
Critic Evaluation: Top-tier preference judgment and scalar scoring capabilities
Generation Quality: Maintained fluency and coherence with strong instruction following
The model comes in two variants:
LLaVA-Critic-R1: Base model trained from Qwen-2.5-VL-7B
LLaVA-Critic-R1+: Extended approach applied to strong reasoning VLMs
Implications for the Field
Our results reveal that RL training on critic data can produce a unified model excelling at both evaluation and generation, offering a simple path toward scalable, self-improving multimodal systems. This work demonstrates that the traditional separation between critics and policies is not necessary—a single model can excel at both tasks simultaneously.
Project Resources
Access code, models, and research paper for LLaVA-Critic-R1
Our previous work, MMSearch-R1, represents a paradigm shift in multimodal AI as the first framework to employ end-to-end reinforcement learning for autonomous tool invocation in large multimodal models (LMMs). By enabling models to independently determine when and how to leverage external search tools, MMSearch-R1 achieves both high efficiency and state-of-the-art performance on open-world tasks, marking a significant advance in practical AI deployment.
What began as a specialized tool-calling model has since evolved into a general-purpose reasoning engine that seamlessly integrates knowledge retrieval with cognitive processing. This evolution offers critical insights into the future of autonomous AI systems: the most capable agents will not only be able to think deeply, but also actively seek and utilize relevant information as needed.
Reasoning-improved Search
Despite MMSearch-R1’s strong performance, we observed limitations in its ability to adapt to complex, dynamic information needs. To address these constraints, we propose a reasoning-first agent paradigm that emphasizes the following core capabilities:
Intelligent search: The model reasons about its knowledge gaps to make decisions about when and how to invoke search tools
Query generation: Deep task understanding enables context-aware query formulation that evolves with the problem
Knowledge integration: External information is systematically incorporated through reasoning processes, not merely retrieved and appended
Performance: The approach delivers fundamental advances in multimodal reasoning, not just incremental improvements
Training Recipe
Prior work in multimodal reasoning has demonstrated that training with verifiable rewards can significantly enhance a model’s capabilities in understanding and solving complex STEM problems.
In our initial experiments, we evaluated numerous multimodal STEM datasets. We discovered that many existing datasets suffer from various limitations: some lack sufficient difficulty for advanced models, while others contain noisy annotations, incomplete visual-text alignments, or unverifiable ground truth answers. These issues can produce unreliable reward signals that destabilize reinforcement learning training.
To address these challenges, we curated a comprehensive high-quality training set consisting of: MMPR[1], MMK12[2], MMR1[3], Multi-subject-RLVR[4], ScienceQA.
To ensure data quality for effective multimodal RL training, we implemented a rigorous filtering pipeline:
Multimodal Verification: Every problem undergoes automatic verification to ensure visual and textual components are properly aligned and complete. We filter datasets to include only problems where both modalities contribute meaningfully to the solution process.
Answer Verifiability: Each problem must have verifiable ground truth answers with clear reasoning paths. For mathematical problems, we verify symbolic and numerical answers; for scientific problems, we ensure explanations align with established principles.
Complexity Filtering: Problems must require genuine multimodal reasoning rather than being solvable through text or vision alone. We exclude problems where one modality is merely decorative.
After filtering, we obtained 80K high-quality multimodal STEM problems for RL training.
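A minimal skeleton of how such a filtering pipeline could be wired together is shown below; the three check functions are hypothetical placeholders for the verification steps described above, which in practice may combine rule-based checks, model-based judging, and manual review.

```python
# Skeleton of the data-filtering pipeline; the checks are illustrative stand-ins.

def is_multimodally_aligned(ex):
    # Placeholder: both modalities must be present and complete.
    return ex.get("image") is not None and bool(ex.get("question"))

def has_verifiable_answer(ex):
    # Placeholder: ground truth must exist and be checkable automatically.
    ans = ex.get("answer")
    return ans is not None and len(str(ans)) > 0

def requires_both_modalities(ex):
    # Placeholder for the "not solvable from text or vision alone" check,
    # which in practice would need a model-based probe.
    return ex.get("needs_image", True) and ex.get("needs_text", True)

def filter_dataset(raw_examples):
    return [ex for ex in raw_examples
            if is_multimodally_aligned(ex)
            and has_verifiable_answer(ex)
            and requires_both_modalities(ex)]
```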
Our RL training stage follows DAPO[5] with the following modifications:
No Entropy Loss: We eliminate entropy loss entirely, as its inclusion frequently causes training instability characterized by exponential entropy growth and subsequent collapse.
No KL Loss: Following DAPO, we remove KL loss to allow the model to diverge from the original SFT policy’s trust region. This also eliminates reference policy log probability computation, accelerating training.
Overlong Filtering: We mask loss for truncated sequences to preserve long-context reasoning capabilities.
Learning Rate Schedule: We implement a sigmoid-based decay schedule (a short sketch follows this list). The sigmoid schedule provides smooth S-shaped transitions that stabilize early training and asymptotically approach target rates without discontinuities. We keep the base learning rate at 2e−6 and use 60 warmup steps that follow a sigmoid curve. The decay is likewise a sigmoid function, reducing the rate to 90% of the base value (final LR ≈1.8e−6).
Improved Exploration: We set the clip high ratio to 0.3 in the GRPO/PPO surrogate loss to encourage exploration and stabilize entropy dynamics.
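The learning-rate schedule above can be sketched as follows; only the base rate (2e-6), the 60 warmup steps, and the 90% decay target come from our setup, while the steepness constant and exact parameterization are assumptions for illustration.

```python
import math

# Sketch of the sigmoid-based LR schedule: S-shaped warmup over 60 steps up to
# the base LR (2e-6), then a smooth sigmoid decay to 90% of the base rate
# (~1.8e-6), with no discontinuities.

BASE_LR, FINAL_LR, WARMUP_STEPS = 2e-6, 1.8e-6, 60

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lr_at(step, total_steps, steepness=10.0):
    if step < WARMUP_STEPS:
        # S-shaped warmup from ~0 toward the base rate.
        p = step / WARMUP_STEPS
        return BASE_LR * sigmoid(steepness * (p - 0.5))
    # S-shaped decay from the base rate toward the final rate.
    p = (step - WARMUP_STEPS) / max(1, total_steps - WARMUP_STEPS)
    return FINAL_LR + (BASE_LR - FINAL_LR) * (1.0 - sigmoid(steepness * (p - 0.5)))
```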
Our reward function employs a two-stage hierarchical approach combining mathematical verification with LLM-based evaluation. We first apply a static mathematical verifier to assess answer correctness for questions with deterministic solutions. When the verifier returns zero, indicating either an incorrect answer or an inability to verify, we employ an LLM-as-judge for a secondary assessment to handle questions requiring semantic evaluation or those with multiple valid representations (e.g., "teal blue" vs. "blue"); the LLM judges based on the given image, question, ground-truth answer, and model prediction.
This design prioritizes computational verification for efficiency while leveraging LLM evaluation for complex semantic cases.
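A minimal sketch of this two-stage reward is shown below, assuming a rule-based math verifier and an LLM-judge callable; the helper signatures are illustrative rather than our exact implementation.

```python
# Sketch of the two-stage hierarchical reward: deterministic verification first,
# LLM-as-judge fallback for semantic or multi-representation answers.

def compute_reward(image, question, ground_truth, prediction,
                   math_verifier, llm_judge):
    # Stage 1: fast rule-based check for questions with deterministic answers;
    # truthy return means the answer is verified correct.
    if math_verifier(ground_truth, prediction):
        return 1.0

    # Stage 2: the verifier returned zero (wrong or unverifiable), so fall back
    # to an LLM judge that sees the image, question, answer, and prediction,
    # handling semantically equivalent answers (e.g., "teal blue" vs. "blue").
    verdict = llm_judge(image=image, question=question,
                        answer=ground_truth, prediction=prediction)
    return 1.0 if verdict == "correct" else 0.0
```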
Result
Building on this foundation, we train a strong STEM-focused reasoning model that surpasses other open models.
| Models | MMK12 | MathVerse (testmini) | MathVision (testmini) | MathVista (testmini) | MMMU (val) |
| --- | --- | --- | --- | --- | --- |
| Qwen2.5-VL-7B | 34.4 | 46.2 | 24.0 | 66.6 | 49.8 |
| OpenVL-Thinker | 31.0 | 45.2 | 24.0 | 70.2 | 52.3 |
| R1-OneVision | 30.6 | 44.1 | 24.0 | 64.1 | 49.2 |
| MM-Eureka-7B | 27.0 | 50.3 | 26.9 | 73.0 | 50.7 |
| General STEM | 46.2 | 51.4 | 28.4 | 73.6 | 57.3 |
| General STEM -> Search (Two Stage) | 43.0 | 51.9 | 28.0 | 72.4 | 57.9 |
With this reasoning foundation, we can go further to improve the model’s search abilities. We first implemented a two-stage training process to seamlessly integrate search capabilities. This approach ensures that search becomes a natural extension of the model’s reasoning process rather than a separate module.
As shown in the figure, compared with our original MMSearch-R1 baseline built on Qwen-2.5-VL-7B (referred to as Instruct -> Search in this context), the model achieves clear improvements. The reasoning-first approach enables more intelligent search decisions, better query formulation, and more effective utilization of retrieved information.
Accuracy across four multimodal benchmarks (Infoseek, MMSearch, FVQA, and SimpleVQA). The Reasoning to Search paradigm consistently outperforms or matches Instruct -> Search, especially on Infoseek and MMSearch, demonstrating the benefit of reasoning-first strategies in complex information retrieval tasks.
One of the most intriguing findings emerged during our evaluation of STEM tasks (e.g., MMMU, MathVision) using Search prompts. We observed a counterintuitive phenomenon: excessive searching actually led to decreased performance. Specifically, models employing Search prompts tended to over-rely on external searches, frequently initiating queries for information that could have been inferred through reasoning or was already available internally.
Accuracy comparison across five challenging reasoning datasets. Results indicate that while integrating search generally helps, excessive or unguided searching can lower performance. This underscores the need for precise reasoning-guided search prompting to achieve optimal results in complex multimodal reasoning tasks.
These performance drops highlight a critical insight: without effective reasoning capabilities to guide their search strategies, models tend to default to inefficient search behaviors. This not only results in unnecessary computational overhead but can also introduce irrelevant information, ultimately degrading the quality of answer generation.
| Search Ratio (%) | MMK12 | MathVerse (testmini) | MathVision (testmini) | MathVista (testmini) | MMMU (val) |
| --- | --- | --- | --- | --- | --- |
| Reason -> Search (Search Prompt) | 16.8 | 22.9 | 9.5 | 12.5 | 24.7 |
Reason to Act for General Search Model
To achieve a robust balance between reasoning and search performance across general-domain tasks, we integrate the training of both capabilities into a single stage. Our goal is to build a model that not only retrieves relevant information efficiently but also demonstrates advanced reasoning over searched information.
Training Recipe
We unify the training process by adopting a ReACT-style prompt template, inspired by [REACT PAPER], which allows the model to interleave reasoning and action (search) steps within a single trajectory. This template is a slight refinement of the standard Search prompt, and full implementation details are provided in the Appendix.
The table below summarizes the lineage and training data for each model variant, clarifying the distinctions in model initialization and supervision strategies. For comprehensive information on hyperparameters and training dynamics, please refer to the Appendix.
Result
We evaluated both our two-stage and unified (one-stage) models across a broad suite of benchmarks and consistently observed performance improvements as model capacity increased.
The General STEM model showed that enhancing reasoning capabilities alone can lead to significant gains. In contrast, the General Search model revealed the multiplicative benefits of integrating reasoning with targeted search strategies. Notably, these improvements were not simply incremental - they represent fundamental advances in how models address complex, multimodal problems.
| Models | MMK12 | MathVerse (testmini) | MathVision (testmini) | MathVista (testmini) | MMMU (val) | AI2D | ChartQA | MME | RealworldQA | OCRBench | DocVQA | MMBench | MMStar | MiaBench |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen2.5-VL-7B | 34.4 | 46.2 | 24.0 | 66.6 | 49.8 | 93.3 | 94.4 | 630.4/1685.2 | 68.5 | 85.2 | 94.6 | 82.9 | 62.6 | 81.7 |
| General STEM | 46.2 | 51.4 | 28.4 | 73.6 | 57.3 | 94.4 | 91.4 | 700.7/1662.1 | 67.5 | 83.7 | 92.1 | 83.8 | 65.5 | 76.0 |
| Reason -> Search | 43.2 | 51.7 | 25.0 | 71.8 | 57.9 | 94.0 | 93.6 | 652.5/1688.3 | 67.5 | 81.7 | 93.5 | 83.2 | 63.1 | 47.6 |
| General Search | 43.6 | 52.0 | 27.3 | 74.7 | 56.1 | 94.6 | 94.0 | 718.9/1775.3 | 65.5 | 77.8 | 89.4 | 84.0 | 60.4 | 44.4 |
| Models | Infoseek | MMSearch | FVQA | SimpleVQA |
| --- | --- | --- | --- | --- |
| Qwen2.5-VL-7B | 20.1 | 12.8 | 20.3 | 38.4 |
| MMSearch | 55.1 | 53.8 | 58.4 | 57.4 |
| Reasoning -> Search | 58.5 | 57.1 | 57.9 | 57.7 |
| General Search | 52.0 | 54.9 | 52.8 | 57.0 |
Our results reveal that MMSearch-R1 achieves the highest accuracy across all benchmarks, significantly outperforming standard General Search configurations. The key differentiator is search utilization: MMSearch-R1 demonstrates search ratios up to 61.6% on Infoseek, compared to 28.5% for General Search.
We found a strong positive correlation (Pearson r = 0.911) between search ratio and model performance, indicating that increased search engagement directly improves accuracy. However, this relationship has limits—excessive or undirected search introduces computational costs and answer noise that can degrade reliability.
Additional experiments with reduced STEM data, increased search data ratios, and shortened warmup periods (60 vs 45 steps) confirmed that better performance requires strategic search integration. Models perform best when search is invoked selectively through explicit reasoning about information needs, balancing enhanced knowledge access against computational efficiency.
These findings demonstrate that the key to multimodal model performance lies not in maximizing search frequency, but in developing sophisticated reasoning mechanisms that determine when external information retrieval adds value to complex query resolution.
Case Study
We present the following cases to demonstrate the versatile abilities of our final model.
Case: MME
In this example from the MME benchmark, the model is required to answer a question about a statue located in the National Gallery of Art in Washington, D.C. The process begins with the model analyzing the query image to determine what additional information is needed. It then performs searches for visually similar images, systematically evaluates the retrieved results, and conducts follow-up searches from different perspectives to verify its findings. This iterative search-and-reasoning approach allows the model to gather comprehensive evidence before arriving at a well-supported conclusion.
Example from the MME benchmark showing the model's iterative search-and-reasoning approach to identify a statue in the National Gallery of Art.
Case: Writing Email to a Public Figure
In this case, the model is tasked with composing an email to Abdullah Shahid Sial, a public figure. To accomplish this effectively, the model must gather comprehensive information about him through internet searches, including his social media presence (Twitter), official website, professional background, and other publicly available information sources.
Case study showing the model's research process when tasked with writing an email to Abdullah Shahid Sial, demonstrating comprehensive information gathering capabilities.
Reasoning Template
{question}Please reason step by step. Output the thinking process within <think> </think> tags and final answer within <answer> </answer> tags.
Search Template
Answer the user's question based on the provided image. Examine the image carefully and identify any recognizable entities, such as faces, objects, locations, events, logos, or text. Determine whether you have sufficient knowledge to confidently recognize the main visual element and answer the user's question. If so, first explain your reasoning, then provide a clear and direct answer.
If you are unable to confidently identify the visual element, stop and invoke the image search tool by appending the string <search><img></search> at the end of your response. This will trigger a Google Lens search using the original image to retrieve relevant information that can help you confirm the visual content.
Once you have sufficient visual understanding, combine it with the user's question and assess whether you can confidently answer. If so, answer the question directly using your own knowledge. If not, invoke the text search tool by generating a concise and specific query, and output it in the format <text_search>your query here</text_search> at the end of your response. Carefully craft your query to accurately retrieve the information needed to help answer the question. The text search tool will then use Google Search to return relevant information based on your query.
You must include your reasoning inside <reason>...</reason> before taking any action, whether it is calling the image search tool, generating a text search query, or providing a final answer. The reasoning may involve analysis of the original image and question, interpretation of search results, or logical steps leading to the final answer.
All search results will be placed inside <information> and </information> and returned to you. When you are ready to answer the question, wrap your final answer between <answer> and </answer>, without detailed illustrations. For example: <answer>Titanic</answer>.
Here is the image and the question:
<image>{question}
ReACT Template
# System Message
You are a helpful assistant. You should strictly follow reason-to-act thinking process to answer user provided question. Namely, you should first analyze the question & observation (e.g., user provided image or search results) and then inform the following action. The thinking process should be within <reason> and </reason> tags. The actions you can choose are:
<answer>xxxxx</answer>: which returns the answer within <answer> and </answer> tags, and finishes the task.
<search>image</search>: which searches user provided image on Google and returns image-related visual entity/concept/knowledge for further reason-to-act. The search results are placed between <observation> and </observation> tags.
<search>text query</search>: which generates a text query and sent to Google and returns some snippets containing the answer for further reason-to-act. The search results are placed between <observation> and </observation> tags. Note that sometimes the snippets do not contain the answer, and some alternative search might be needed.
Your output format should be one of the following three formats:
<reason> YOUR THINKING PROCESS </reason><answer> YOUR ANSWER AFTER GETTING ENOUGH INFORMATION </answer>
or
<reason> YOUR THINKING PROCESS </reason><search> IMAGE </search>
or
<reason> YOUR THINKING PROCESS </reason><search> YOUR GENERATED TEXT QUERY FOR HELPING YOU FIND INFORMATION ON GOOGLE TO ANSWER USER QUESTION </search>
Only output the final answer (in words, numbers or phrase) inside the <answer></answer> tags, without any explanations or extra information. If this is a yes-or-no question, you should only answer yes or no.
MMSearch-R1: Bridging the gap between internal knowledge and external search
MMSearch-R1 is the first end-to-end RL-based solution designed to equip LMMs with the capability to perform search on demand in real-world internet environments. It outperforms same-sized RAG baselines and approaches the performance of larger models while requiring significantly fewer search calls.
Figure 1: MMSearch-R1 learns to recognize the boundaries of its knowledge and perform on-demand search, significantly reducing the number of searches required while outperforming RAG-based models on knowledge-intensive and info-seeking VQA tasks.
1. Introduction
Scaling up vision-language paired data has become a widely adopted paradigm for Large Multimodal Models (LMMs) to acquire grounded knowledge of the visual world. Although this static training strategy has proven effective, it remains limited in capturing complex and evolving real-world knowledge. In particular, state-of-the-art LMMs continue to struggle with:
Long-tail facts and newly emerging information
Domain-specific content restricted by privacy or copyright constraints
Knowledge-intensive and information-seeking visual question answering tasks
As a result, their performance remains suboptimal, frequently generating hallucinated outputs when confronted with inputs beyond their training distribution.
Current Limitations
Existing approaches such as Retrieval-Augmented Generation (RAG) and prompt-based agents remain suboptimal:
RAG methods rely on fixed retrieve-then-generate pipelines, leading to over-retrieval and high computational costs
Prompt-based agents can access real-time search engines but lack parameter optimization through learning
Our Solution: MMSearch-R1
To address these limitations, we introduce MMSearch-R1, training LMMs to acquire three essential search-related capabilities:
When to search - Recognizing knowledge boundaries
What to search for - Formulating effective queries
How to reason over search results to answer user queries
Key Contributions
🏗️ Dataset Construction - Automated approach to construct multimodal search VQA dataset
🔧 Multimodal Search Tool Integration - Real-world search pipeline with image and text tools
🧠 Wiser Search via Reinforcement Learning - GRPO-based RL framework for optimal search decisions
🌐 Open-Sourced Framework - Complete model, dataset, and training framework release
2. Method
2.1. Building Iterative Multimodal Search-Integrated RL Framework
Figure 2: Illustration of training in MMSearch-R1. Top: The GRPO training pipeline integrated with multimodal search tools. Bottom: A detailed view of the rollout process and search tool execution.
We build on veRL and adopt standard GRPO as our base RL algorithm, with modifications that allow search interactions during the rollout process.
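The following is a simplified sketch of a rollout loop with search interactions in the spirit of this modification; `model.generate`, `image_search`, and `text_search` are assumed interfaces rather than the actual veRL integration, and the tag handling follows the templates in the Appendix.

```python
# Simplified sketch of a multi-turn rollout with search-tool calls.

def rollout_with_search(model, prompt, image, image_search, text_search, max_turns=4):
    context = prompt
    for _ in range(max_turns):
        output = model.generate(context, image=image)
        context += output
        if "<answer>" in output:                      # final answer produced
            break
        if "<search><img></search>" in output:        # image search request
            results = image_search(image)
        elif "<text_search>" in output:               # text search request
            query = output.split("<text_search>")[-1].split("</text_search>")[0]
            results = text_search(query)
        else:
            break                                     # no action, stop rollout
        # Retrieved content is returned to the model inside <information> tags;
        # during training these tool-provided tokens would typically be masked
        # out of the policy loss (an assumption, following common practice).
        context += f"<information>{results}</information>"
    return context
```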
Multimodal Search Tools
Our framework equips models with two types of search tools:
Image Search Tool
Takes input image and returns top-5 visually similar webpages
Each result includes thumbnail and title
Enables identification of unfamiliar visual entities
Reward Design
The reward combines three components (a brief sketch follows this list):
Accuracy Score - Exact string match against ground truth (1 for correct, 0 otherwise)
Search Penalty - Applied to correct responses that used search, encouraging internal knowledge use
Format Score - Ensures model follows required output structure
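A minimal sketch of how these three components could be combined is shown below; the penalty value and weighting are assumptions for illustration, not the exact values used in training.

```python
# Sketch of a reward combining accuracy, a search penalty, and a format score.

def mmsearch_reward(prediction, ground_truth, used_search, format_ok,
                    search_penalty=0.1, format_weight=0.1):
    # Accuracy: exact string match against the ground truth.
    accuracy = 1.0 if prediction.strip().lower() == ground_truth.strip().lower() else 0.0
    # Penalize correct answers that relied on search, encouraging the model
    # to use internal knowledge when it suffices.
    if accuracy == 1.0 and used_search:
        accuracy -= search_penalty
    # Format score: reward adherence to the required output structure.
    return accuracy + format_weight * (1.0 if format_ok else 0.0)
```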
2.2. Curating Search-balanced VQA Datasets
Figure 3: Illustration of data construction process of FVQA dataset: (a) Automated pipeline for visual knowledge-required VQA samples collection; (b) Knowledge taxonomy; (c) Overall pipeline showing composition and origin of FVQA from various sources.
We construct FactualVQA (FVQA), a search-balanced dataset following three key criteria:
Coverage of Both Search-Required/Free Questions
Concise and Verifiable Answers
Diversity in Knowledge and Difficulty
Data Construction Pipeline
VQA Collection - Gather candidates requiring visual or textual knowledge
Search Balancing - Use preliminary model to classify search requirements
Human Annotation - Ensure diversity, authenticity, and label quality
3. Experimental Findings
We evaluated MMSearch-R1 against both closed-source models (GPT-4o, Gemini 2.5 Pro) and open-source models (Qwen2.5-VL series) on knowledge-intensive VQA tasks.
Table 1: Performance of MMSearch-R1 across benchmarks. 'Acc (%)' denotes accuracy evaluated by LLM-as-Judge, while 'SR (%)' represents the search ratio.
MMSearch-R1-7B outperforms same-sized RAG-based models by an average of 3% in accuracy while reducing the average search rate by 32.9%.
Figure 4: (a) Performance comparison between Base model and RL-trained model under RAG workflow. (b) Answer behavior breakdown of Base (inner circle) and RL (outer circle) models.
Finding 2: Improved Query Generation and Summarization
RL training enhances the model’s ability to generate effective text queries and summarize retrieved information under a fixed RAG setup.
Finding 3: Better Internal Knowledge Utilization
Clear upward trend in Correct without Search proportion demonstrates improved recall and reasoning based on internal knowledge.
Figure 5: (a) Performance improvements of SFT and RL over Base across five VQA datasets. (b) Training dynamics of reward and search ratio for different strategies.
Finding 4: RL vs. Supervised Learning
RL consistently outperforms SFT across all tasks despite being trained on only about half as much data, demonstrating superior data efficiency.
Finding 5: Balanced Training Effectiveness
Training with balanced data and search penalty effectively guides the model to perform on-demand search without overusing the search tool.
4. Conclusion
MMSearch-R1 represents a significant advancement in multimodal AI, learning to:
Recognize knowledge gaps and boundaries
Selectively invoke image or text search
Reason effectively over retrieved content
Our framework outperforms same-sized RAG baselines and approaches larger model performance while requiring significantly fewer search calls. This work lays the groundwork for building multimodal agents that are both adaptive and interactive, paving the way for the next major advancement in multimodal intelligence.
Project Resources
Complete implementation, research paper, models, and datasets for MMSearch-R1
SOTA large multimodal model (LMM) architectures, such as Qwen2.5-VL, typically build on a powerful large language model (LLM) (e.g., Qwen2.5) integrated with an external Native Resolution Vision Transformer (NaViT). However, this approach presents challenges in high-resolution real-world scenarios, as such inputs are converted into enormous numbers of visual tokens, many of which are irrelevant to the downstream task. By comparison, when processing high-resolution real-world scenes, the human visual system employs task-driven visual search strategies to ground and scrutinize critical regions of interest. Motivated by this biological mechanism, we attempt to equip LMMs with similar visual search capabilities by leveraging visual grounding to focus on key image regions.
However, empowering LMMs with such grounding-based visual reasoning capabilities is non-trivial, primarily due to the scarcity and high cost of obtaining grounding annotations for standard visual-question-answering (VQA) datasets, which are required for constructing multi-turn grounding-based conversation data for supervised fine-tuning (SFT). In this paper, we highlight that accurate grounding behavior can emerge within a reinforcement learning (RL) paradigm, even when training supervision is provided solely through a binary reward function derived from the correctness of the final answer.
To this end, we introduce Multi-turn Grounding-based Policy Optimization (MGPO), a reinforcement learning (RL) algorithm that enables LMMs to iteratively focus on key image regions by automatically cropping sub-images based on model-predicted grounding coordinates within a multi-turn conversation framework. Given a high-resolution image and a question, the model first predicts the coordinates of key regions relevant to the query. An image cropping function is then triggered to extract and return the corresponding sub-image. In subsequent turns, the model can integrate the previous in-context conversation (including both the original image and the cropped sub-image) to solve the question.
Figure 1: Examples of models trained with multi-turn grounding-based RL on high-resolution real-world tasks. The model first identifies key regions, which are then automatically cropped and returned as sub-images. Notably, despite being supervised only by a binary reward function derived from the correctness of the final answer, the model gradually develops robust grounding capability throughout the RL process.
In summary, MGPO mainly offers the following advantages:
Top-down and Interpretable Visual Reasoning. MGPO equips LMMs with a top-down, question-driven visual search mechanism for high-resolution scenarios and provides interpretable outputs that indicate which image regions are attended to throughout the reasoning process.
Overcomes Maximum Pixel Constraints. MGPO overcomes the maximum pixel limitation of LMMs. As shown in the first example of Figure 1, even when resizing a high-resolution image within pixel limits results in a blurred input, the model can still identify relevant coordinates and crop clear sub-images from the original input for further analysis.
Without Additional Grounding Annotations. MGPO can be post-trained directly on standard VQA datasets without the need for extra grounding annotations, and experimental results demonstrate substantial improvements in intermediate grounding performance compared to GRPO.
Ultimately, we use MGPO to post-train Qwen2.5-VL-7B on visual question answering data with short answers, yet the model achieves strong intermediate grounding performance without requiring grounding annotations (examples shown in Figure 1). Compared to GRPO, MGPO yields a 5.4% improvement on the in-distribution MME-Realworld benchmark and a 5.2% gain on the challenging out-of-distribution V* Bench. Notably, with only 21K post-training samples, our model surpasses OpenAI’s o1 and GPT-4o on the OOD V* Bench.
2. Multi-turn Grounding-Based RL
Figure 2 illustrates a comparison of different post-training paradigms for LMMs. In our MGPO, the model operates over K sequential interactions, dynamically grounding and reasoning by conditioning on the full history of visual and textual context at each step.
Figure 2: Comparison of different post-training paradigms for LMMs. Our MGPO automatically crops and returns sub-image to the model based on its predicted grounding coordinates, enabling the model to iteratively focus on key regions and effectively solve high-resolution visual tasks.
Multi-turn Template without Cold Start. In practice, we observe that LMMs struggle to autonomously generate grounding coordinates during the rollout process, which hinders effective multi-turn RL. To address this, we design a fixed two-turn dialogue template, as shown in Figure 3, to explicitly activate the model’s grounding and reasoning abilities.
Figure 3: Our two-turn dialogue template design to explicitly activate the model's grounding and reasoning abilities.
Multi-turn Grounding-Based RL Process. The MGPO training process consists of the following key steps (a simplified sketch follows this list):
Initial Grounding: Given a high-resolution image and question, the model predicts bounding box coordinates for key regions
Image Cropping: Based on predicted coordinates, relevant sub-images are automatically cropped from the original image
Multi-turn Reasoning: The model integrates both original and cropped images in subsequent conversation turns
Reward Learning: Binary rewards are provided based on final answer correctness, enabling the emergence of grounding behavior through RL
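A minimal sketch of an MGPO-style two-turn rollout is given below, assuming a PIL-like `crop()` on the original image and a hypothetical `model.generate()` interface; the fixed two-turn template of Figure 3 drives the real interaction, and the parsers here are illustrative.

```python
import re

def parse_bbox(text):
    # Hypothetical parser: take the first four numbers as [x1, y1, x2, y2].
    nums = re.findall(r"-?\d+\.?\d*", text)
    return tuple(float(n) for n in nums[:4])

def extract_answer(text):
    # Hypothetical parser for <answer>...</answer> tags.
    m = re.search(r"<answer>(.*?)</answer>", text, re.S)
    return m.group(1).strip() if m else text.strip()

def mgpo_rollout(model, original_image, resized_image, question, ground_truth):
    # Turn 1: the model sees the (possibly downscaled) image and predicts the
    # coordinates of the region relevant to the question.
    turn1 = model.generate(image=resized_image, prompt=question)
    x1, y1, x2, y2 = parse_bbox(turn1)

    # Crop the predicted region from the full-resolution original, recovering
    # detail that may have been lost when resizing within the pixel limit.
    sub_image = original_image.crop((x1, y1, x2, y2))

    # Turn 2: the model answers conditioned on the conversation history,
    # including the original image, its turn-1 output, and the cropped region.
    turn2 = model.generate(image=[resized_image, sub_image],
                           prompt=question, history=turn1)

    # Binary reward from final-answer correctness is the only supervision.
    return 1.0 if extract_answer(turn2) == ground_truth else 0.0
```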
Figure 4: The Multi-turn Grounding-based Policy Optimization (MGPO) algorithm workflow.
3. Experimental Results
We evaluate MGPO on multiple high-resolution visual reasoning benchmarks and demonstrate significant improvements over baseline methods.
3.1 Main Results
Table 1: Performance comparison on high-resolution visual reasoning benchmarks. MGPO achieves superior performance across multiple datasets.
Our experimental results show that MGPO yields substantial improvements:
5.4% improvement on MME-Realworld benchmark compared to GRPO
5.2% gain on challenging out-of-distribution V* Bench
Surpasses OpenAI’s o1 and GPT-4o models on OOD V* Bench with only 21K post-training samples
3.2 Ablation Studies
Table 2: Ablation study showing the contribution of different components in MGPO.
3.3 Grounding Performance Analysis
Figure 5: Analysis of grounding performance showing emergence of accurate grounding behavior through RL training.
4. Additional Analysis
4.1 Point Counting Task
Table 4: Performance comparison on the point counting task. An additional point reward does not lead to significant performance improvements.
4.2 Visualization Results
Figure 8: Visualization of point predictions from the GRPO model trained with only accuracy reward.
5. Limitation
All experiments of MGPO are conducted using a fixed two-turn template, rather than allowing the model to autonomously decide when to perform image cropping based on the input question, as illustrated in the latest OpenAI models such as o3 and o4-mini. This limitation stems from our observation that Qwen2.5-VL, when directly subjected to RL post-training, struggles to generate grounding coordinates without explicit prompt guidance.
Nevertheless, we believe that our trained models can be leveraged to generate high-quality chain-of-thought (CoT) data for subsequent SFT. Adopting a multi-stage training strategy that combines SFT and RL, as in DeepSeek-R1, may ultimately enable the model to autonomously decide when and how to perform grounding. We leave this direction for future work.
Appendix
Figure 9: A full conversation example of MGPO post-trained model on high-resolution image tasks.