

  • Overview

    LLaVA-OneVision-1.5-RL adds an RL post-training stage that uses 67K curated examples, selected with discrepancy-based criteria, to elicit explicit chain-of-thought reasoning, achieving significant performance gains on STEM, coding, and reasoning benchmarks while maintaining visual understanding capabilities.

    Our contributions are threefold:

    (1) Discrepancy-Driven Data Curation. We identify tasks where a performance gap exists between Pass@N and Pass@1 metrics, targeting “latent capability” rather than knowledge injection.

    (2) Rule-Based Reward System. We employ domain-specific verification rules rather than learned preference models, enabling precise feedback across STEM, grounding, spatial reasoning, counting, coding, OCR, and diagram tasks.

    (3) Two-Stage Curriculum Training. We design a training curriculum that first stabilizes concise task performance with answer-only RL, then unlocks deeper reasoning through chain-of-thought RL.

    RL Training Data Distribution
    Distribution of task categories in the RL training data (67K total instances)

    RL Data Strategy

    Discrepancy-Driven Selection

    We identify tasks where a performance gap exists between Pass@N and Pass@1 metrics. This approach targets “latent capability” rather than knowledge injection, ensuring the model learns to better utilize its existing knowledge.
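
    As a rough illustration (our sketch, not the released pipeline), the discrepancy criterion can be read as: sample N responses per task, score them, and keep tasks where Pass@N is high but Pass@1 is low. The function names and gap threshold below are assumptions.

    ```python
    # Sketch of discrepancy-driven selection: keep tasks the model *can* solve
    # (Pass@N high) but solves unreliably (Pass@1 low), i.e. latent capability.

    def pass_at_1(verdicts: list[bool]) -> float:
        # Expected accuracy of a single sampled response.
        return sum(verdicts) / len(verdicts)

    def pass_at_n(verdicts: list[bool]) -> float:
        # 1.0 if any of the N sampled responses is correct, else 0.0.
        return float(any(verdicts))

    def select_discrepant_tasks(rollout_correct: dict[str, list[bool]],
                                min_gap: float = 0.3) -> list[str]:
        """rollout_correct[task_id] holds N per-rollout correctness verdicts."""
        return [task_id for task_id, verdicts in rollout_correct.items()
                if pass_at_n(verdicts) - pass_at_1(verdicts) >= min_gap]
    ```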

    Reward-Based Sampling

    Multiple candidate responses are filtered by average reward scores to exclude trivial and unsolvable cases, focusing on medium-difficulty instances that provide optimal learning signals.
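
    The filtering step can be pictured as keeping a middle band of average reward; the cutoffs below are assumed values for illustration, not the project's exact thresholds.

    ```python
    # Sketch of reward-based sampling with assumed cutoffs: drop instances that
    # are trivially solved (average reward near 1) or effectively unsolvable
    # (average reward near 0), keeping medium-difficulty cases whose rollouts
    # yield informative learning signals.

    def keep_medium_difficulty(avg_rewards: dict[str, float],
                               low: float = 0.1, high: float = 0.9) -> list[str]:
        return [task_id for task_id, r in avg_rewards.items() if low < r < high]
    ```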


    Reward System Architecture

    We employ a rule-based paradigm with domain-specific verification rules rather than learned preference models:

    | Category | Source | Reward Design |
    |---|---|---|
    | STEM | ViRL39K | Choice accuracy & math expression equivalence |
    | Grounding | Ref-L4, VigoRL-SA | IoU between predicted/reference boxes; choice accuracy |
    | Spatial | VigoRL-SAT | Choice accuracy |
    | Counting | PixmoCount | Numeric token equivalence |
    | Coding | WebCode2M, UniSVG | Token/tag overlap; SVG rendering similarity [0,1] |
    | OCR | InfoVQA | Text similarity |
    | Diagram | AI2D | Choice accuracy |
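
    The dispatch logic can be sketched as below; the per-category verifiers are heavily simplified stand-ins (exact string match instead of full math-expression equivalence, difflib instead of the actual text-similarity metric), so treat this as an outline rather than the project's verifier.

    ```python
    # Illustrative rule-based reward dispatcher with simplified verifiers.
    import difflib

    def box_iou(a, b):
        """IoU of two (x1, y1, x2, y2) boxes."""
        x1, y1 = max(a[0], b[0]), max(a[1], b[1])
        x2, y2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
        union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
        return inter / (union + 1e-9)

    def rule_based_reward(category: str, prediction, reference) -> float:
        if category in {"stem", "spatial", "diagram"}:   # choice accuracy
            return float(str(prediction).strip().lower() == str(reference).strip().lower())
        if category == "counting":                       # numeric token equivalence
            try:
                return float(float(prediction) == float(reference))
            except (TypeError, ValueError):
                return 0.0
        if category == "grounding":                      # IoU between boxes
            return box_iou(prediction, reference)
        if category == "ocr":                            # crude text similarity
            return difflib.SequenceMatcher(None, str(prediction), str(reference)).ratio()
        raise ValueError(f"unhandled category: {category}")
    ```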

    Two-Stage Training Procedure

    Training uses Group Relative Policy Optimization (GRPO) within the AReaL asynchronous framework:

    Stage 1: Answer-only RL

    Normal split training with instruction “Put ONLY your final answer within <answer></answer>.” This stage stabilizes concise task performance.

    Stage 2: Chain-of-Thought RL

    Long-reasoning data with instruction “Think and solve… within <think></think>…” This stage unlocks deeper reasoning capabilities. A small proportion of normal-set examples are interspersed to prevent forgetting perception skills.
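
    A minimal sketch of how the two stages could be wired up follows; the replay ratio and helper names are assumptions, while the two instructions are quoted (the second abbreviated) from above.

    ```python
    # Two-stage curriculum sketch (assumed mixing ratio, abbreviated prompts).
    import random

    ANSWER_ONLY_INSTR = "Put ONLY your final answer within <answer></answer>."
    COT_INSTR = "Think and solve... within <think></think>..."  # abbreviated

    def build_stage_batch(stage, normal_set, long_reasoning_set, replay_ratio=0.1):
        """Stage 1: answer-only RL on the normal split.
        Stage 2: chain-of-thought RL on long-reasoning data, with a small share
        of normal-set examples replayed to avoid forgetting perception skills."""
        if stage == 1:
            return [(ex, ANSWER_ONLY_INSTR) for ex in normal_set]
        batch = [(ex, COT_INSTR) for ex in long_reasoning_set]
        n_replay = min(len(normal_set), int(replay_ratio * len(batch)))
        batch += [(ex, ANSWER_ONLY_INSTR) for ex in random.sample(normal_set, n_replay)]
        random.shuffle(batch)
        return batch
    ```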


    Performance Results

    Core Capability Enhancement

    General VQA Benchmarks (Average +1.0):

    | Benchmark | Base | +RL |
    |---|---|---|
    | MMStar | 67.7 | 68.2 |
    | MMBench (EN) | 84.1 | 85.7 |
    | MMBench (CN) | 81.0 | 84.2 |
    | MME-RealWorld (EN) | 61.7 | 63.4 |
    | CV-Bench | 80.7 | 82.9 |
    | RealWorldQA | 68.1 | 68.4 |

    Reasoning Tasks (Average +6.0):

    | Benchmark | Base | +RL | Δ |
    |---|---|---|---|
    | MathVista Mini | 69.6 | 72.3 | +2.7 |
    | WeMath | 61.5 | 69.4 | +7.9 |
    | MathVision | 25.6 | 34.4 | +8.8 |
    | MMMU Validation | 55.4 | 58.8 | +3.4 |
    | MMMU-Pro | 25.2 | 35.7 | +10.5 |

    OCR & Chart (Average +0.0):

    | Benchmark | Base | +RL |
    |---|---|---|
    | ChartQA | 86.5 | 87.4 |
    | DocVQA | 95.0 | 91.9 |
    | InfoVQA | 78.4 | 76.6 |

    Extended Capability Analysis

    Extended Performance Comparison
    Performance comparison of LLaVA-OV-1.5 and corresponding RL version on Spatial Reasoning & Grounding and Coding tasks

    Spatial & Grounding: RL “fast mode” significantly enhances fine-grained perception on SAT and Ref-L4 benchmarks.

    Coding: “Thinking” mode achieves highest scores on Design2Code and UniSVG, demonstrating chain-of-thought benefits for structural code generation.


    Development Roadmap

    This release represents Stage 3 in a multi-phase project:

    | Stage | Focus | Data Scale |
    |---|---|---|
    | Stage 1 & 1.5 | Pre-training & Mid-training | 85M multimodal samples |
    | Stage 2 | Visual instruction tuning (SFT) | 22M instruction-following samples |
    | Stage 3 (Current) | RL post-training with GRPO | 67K curated samples |

    Acknowledgements

    We thank the following projects and frameworks:

    • AReaL: Lightning-Fast RL for LLM Reasoning and Agents
    • sglang: Fast serving framework for LLMs and vision language models
    • lmms-eval: Standardized evaluation framework
    • LLaVA: Large Language-and-Vision Assistant
    • LLaVA-NeXT: Next-generation multi-modal assistant

    Open-Source Resources
    Complete LLaVA-OneVision-1.5-RL resources for the community:
    • Model Checkpoints: Pre-trained models with RL optimization
    • Training Datasets: Curated RL training data
    • Base Model: LLaVA-OneVision-1.5 foundation
  • Overview

    Our contributions are threefold:

    (1) LongVT: An End-to-End Agentic Framework for “Thinking with Long Videos”
    We introduce a novel paradigm that natively interleaves multimodal tool-augmented Chain-of-Thought (CoT) with on-demand clip inspection over hours-long videos, thereby enabling large multimodal models (LMMs) to perform more effective and reliable long-video reasoning.

    (2) VideoSIAH: A Fine-Grained Data Suite for Evidence-Sparse Long-Video Reasoning
    We construct a scalable data pipeline that produces diverse and high-quality question-answering (QA) data and tool-integrated reasoning traces, and a dedicated benchmark under a video segment-in-a-haystack setting.

    (3) LongVT-7B-RFT: A State-of-the-Art Baseline with Invaluable Insights
    Through extensive quantitative comparisons, systematic ablations on data recipes, training strategies, and design choices, as well as in-depth analyses of training dynamics, we establish and open-source a powerful baseline model with “thinking with long videos” capabilities.

    LongVT Interleaved Multimodal Chain-of-Tool-Thought

    Interleaved Multimodal Chain-of-Tool-Thought (iMCoTT). Compared to prior text-based CoT reasoning, iMCoTT in our proposed LongVT can natively perform self-reflection by calling the crop_video(start_time, end_time) tool. It proposes a time window after a global preview, proactively fetches the corresponding short clip, rethinks based on the new evidence, and decides whether to refine the window or answer directly. Such tool-augmented reasoning grounds each step in what is actually seen, rather than blindly rephrasing text as in text-only CoT, which mitigates hallucination and improves both temporal localization and answer correctness.
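
    For intuition, a stripped-down version of this loop might look as follows; `model.generate` and `crop_video` are stand-ins for LongVT's actual interfaces, not its released API.

    ```python
    # Stripped-down iMCoTT-style loop: preview -> propose window -> fetch clip ->
    # rethink -> refine or answer. All interfaces here are illustrative stand-ins.
    import re

    def crop_video(video_path, start_time, end_time):
        """Stand-in: return frames for the requested [start_time, end_time] clip."""
        raise NotImplementedError

    def answer_with_clip_inspection(model, video_path, question, preview_frames,
                                    max_tool_calls=4):
        context = [("video_preview", preview_frames), ("question", question)]
        for _ in range(max_tool_calls):
            output = model.generate(context)            # interleaved <think>/tool text
            call = re.search(r"crop_video\(([\d.]+),\s*([\d.]+)\)", output)
            if call is None:                            # no tool call -> final answer
                return output
            start, end = map(float, call.groups())
            clip = crop_video(video_path, start, end)   # fetch evidence on demand
            context.append(("tool_result", clip))       # rethink with new evidence
        return model.generate(context + [("note", "answer with current evidence")])
    ```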


    Motivation of VideoSIAH

    Long-video reasoning presents a fundamentally different challenge from previous video QA settings: LMMs must locate sparse, fine-grained, and causally decisive moments embedded within hours-long content. However, existing LMMs are mostly trained with coarse-grained and clip-level data. This mismatch leaves modern LMMs lacking the supervision needed to learn how temporal hypotheses are formed, verified, or revised—a critical yet underexplored capability for agentic long-video reasoning.

    Moreover, most existing video understanding benchmarks only offer multiple-choice QAs, which can be solved without genuine temporal grounding and are vulnerable to dataset leakage or shortcut exploitation. To fill this gap, we introduce VideoSIAH, a large-scale, diverse, and high-quality data suite that serves collectively as a training dataset capturing the reasoning dynamics required for video segment-in-a-haystack QA, and a fine-grained evaluation benchmark, VideoSIAH-Eval, with human-in-the-loop validation for long-video open-ended question-answering.

    We conduct a rigorous contamination study on the Qwen-VL series across two probing settings: (1) No Visual, where we feed the text prompt without video frames to test for direct memorization; (2) Rearranged Choices, where we randomize the mapping between option labels and their textual content for multiple-choice questions to detect label memorization. Our experimental results reveal significant vulnerabilities in existing benchmarks and highlight the necessity of our proposed VideoSIAH-Eval.

    | Setting | VideoMME (w/o sub) | VideoMMMU adapt. | VideoMMMU comp. | VideoMMMU perc. | VideoSIAH-Eval |
    |---|---|---|---|---|---|
    | Qwen2.5-VL-7B-Instruct | | | | | |
    | Original | 64.3 | 35.7 | 44.3 | 56.7 | 33.8 |
    | No Visual | 40.1 | 25.7 | 38.3 | 39.3 | 12.7 |
    | Rearranged Choices | 56.0 | 29.7 | 40.3 | 67.0 | - |
    | Qwen3-VL-8B-Instruct | | | | | |
    | Original | 69.3 | 40.7 | 60.3 | 71.3 | 46.6 |
    | No Visual | 44.1 | 33.7 | 39.3 | 46.7 | 0.00 |
    | Rearranged Choices | 69.0 | 36.3 | 47.7 | 69.3 | - |

    Contamination Tests for Qwen-VL Series on Long Video Understanding and Reasoning Benchmarks. The VideoSIAH-Eval column shows “-” entries for Rearranged Choices since our proposed benchmark is fully open-ended QA, where random option-answer mapping is not applicable.
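
    The two probes are straightforward to reproduce in spirit; the sketch below assumes a generic `model.answer` interface and an `evaluate` helper, neither of which is part of the released code.

    ```python
    # Sketch of the two contamination probes (illustrative interfaces only).
    import random

    def no_visual_probe(model, dataset, evaluate):
        """Feed the text prompt without any frames to test direct memorization."""
        preds = [model.answer(frames=None, question=ex["question"],
                              options=ex["options"]) for ex in dataset]
        return evaluate(preds, dataset)

    def rearranged_choices_probe(model, dataset, evaluate, seed=0):
        """Shuffle the mapping between option labels and option text to detect
        label memorization on multiple-choice questions."""
        rng = random.Random(seed)
        preds, remapped = [], []
        for ex in dataset:
            texts = list(ex["options"].values())
            rng.shuffle(texts)
            options = dict(zip(ex["options"].keys(), texts))  # labels -> shuffled text
            remapped.append({**ex, "options": options})
            preds.append(model.answer(frames=ex["frames"], question=ex["question"],
                                      options=options))
        return evaluate(preds, remapped)
    ```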


    Data Pipeline

    VideoSIAH Data Pipeline

    Data Pipeline of VideoSIAH. We construct a semi-automatic data pipeline that integrates several state-of-the-art LMMs to sequentially perform long video segmentation, video clip captioning, segment-in-a-haystack QA generation, cross-modal QA filtering, and iMCoTT generation. Icons with human silhouettes denote human-in-the-loop validation, where annotators inspect a small set of representative failures to refine prompting rules for QA generation, QA filtering, and iMCoTT generation. Note that iMCoTT traces are generated only for the cold-start supervised fine-tuning (SFT) stage, whereas reinforcement learning (RL) operates solely on the filtered QA pairs.


    Dataset Statistics

    | Split | Source | Purpose | Samples | Total |
    |---|---|---|---|---|
    | SFT (w/o tool) | LongVideo-Reason CoT | Reasoning-augmented Open-ended QA | 5,238 | 228,835 |
    | | Video-R1 CoT | Reasoning-augmented Video QA | 165,575 | |
    | | Image-based CoT | Reasoning-augmented Image QA | 58,022 | |
    | SFT (w/ tool) | Gemini-distilled iMCoTT | Tool-augmented Open-ended QA | 12,766 | 19,161 |
    | | Qwen-distilled iMCoTT | Tool-augmented Temporal Grounding | 6,395 | |
    | RL | Gemini-distilled QAs | Open-ended QA over Long Videos | 1,667 | 17,020 |
    | RFT | Self-distilled iMCoTT | Agentic Behaviors | 15,353 | |

    Dataset Statistics of VideoSIAH. Our proposed dataset contains large-scale non-tool SFT data, tool-augmented SFT data, RL QAs, and self-distilled reinforcement fine-tuning (RFT) traces.

    Video Category Distribution
    Question Category Distribution

    Category Distribution of VideoSIAH-Eval. We present the distribution of video types (left) and question types (right), highlighting the diversity of our proposed benchmark.


    Quantitative Comparisons

    We compare our LongVT models against proprietary LMMs and state-of-the-art open-source video reasoning models across various long video understanding and reasoning benchmarks.

    | Model | VideoMME (w/ sub) | VideoMMMU adapt. | VideoMMMU comp. | VideoMMMU perc. | LVBench | VideoSIAH-Eval | Avg |
    |---|---|---|---|---|---|---|---|
    | Proprietary LMMs | | | | | | | |
    | GPT-4o | 77.2 | 66.0 | 62.0 | 55.7 | 30.8 | 17.4 | 51.5 |
    | Gemini 1.5 Pro | 81.3 | 59.0 | 53.3 | 49.3 | 33.1 | - | 55.2 |
    | Open-Source (Sparse Sampling) | | | | | | | |
    | Qwen2.5-VL-7B | 62.6 | 37.3 | 28.0 | 36.7 | 30.7 | 28.1 | 37.2 |
    | Video-R1-7B | 61.0 | 36.3 | 40.7 | 52.3 | 37.2 | 27.9 | 42.6 |
    | VideoRFT-7B | 60.9 | 36.7 | 42.0 | 53.0 | 34.7 | 26.5 | 42.3 |
    | Video-Thinker-7B | 61.0 | 34.3 | 44.7 | 53.0 | 52.2 | 10.4 | 42.6 |
    | LongVT-7B-SFT (Ours) | 12.5 | 37.7 | 46.0 | 58.3 | 36.0 | 26.8 | 36.2 |
    | LongVT-7B-RL (Ours) | 66.1 | 32.7 | 44.7 | 50.0 | 37.8 | 31.0 | 43.7 |
    | Open-Source (Dense Sampling) | | | | | | | |
    | Qwen2.5-VL-7B | 64.3 | 35.7 | 44.3 | 56.7 | 40.9 | 33.8 | 46.0 |
    | Video-R1-7B | 60.5 | 37.3 | 38.7 | 46.3 | 40.1 | 33.1 | 42.7 |
    | VideoRFT-7B | 49.2 | 37.7 | 40.7 | 48.7 | 18.7 | 26.9 | 37.0 |
    | Video-Thinker-7B | 60.8 | 37.7 | 42.7 | 55.3 | 54.3 | 6.6 | 42.9 |
    | LongVT-7B-SFT (Ours) | 64.9 | 32.3 | 42.0 | 49.7 | 41.1 | 34.8 | 44.1 |
    | LongVT-7B-RL (Ours) | 66.1 | 37.7 | 42.3 | 56.3 | 41.4 | 35.9 | 46.6 |
    | LongVT-7B-RFT (Ours) | 67.0 | 35.7 | 43.7 | 56.7 | 41.3 | 42.0 | 47.7 |

    Performance Comparison with Existing Video-Centric LMMs across Various Long Video Understanding and Reasoning Benchmarks. The best and second-best results among open-source models in each column are marked in bold and underlined, respectively.


    Ablation Studies

    We conduct comprehensive ablation studies to examine the impact of data recipes, training stages, and reward design on model performance.

    Data Recipe

    | Setting | VideoMME (w/ sub) | VideoMMMU adapt. | VideoMMMU comp. | VideoMMMU perc. | LVBench | VideoSIAH-Eval | Avg |
    |---|---|---|---|---|---|---|---|
    | SFT w/o self-curated iMCoTT | 8.4 | 33.6 | 41.6 | 46.0 | 15.1 | 4.1 | 24.8 |
    | SFT w/ self-curated iMCoTT | 64.9 | 32.3 | 42.0 | 49.7 | 41.1 | 34.8 | 44.1 |
    | RL w/o self-curated QAs | 55.1 | 30.6 | 42.0 | 45.6 | 38.4 | 30.8 | 40.4 |
    | RL w/ self-curated QAs | 66.1 | 37.7 | 42.3 | 56.3 | 41.4 | 35.9 | 46.6 |

    Training Stage

    | Setting | VideoMME (w/ sub) | VideoMMMU adapt. | VideoMMMU comp. | VideoMMMU perc. | LVBench | VideoSIAH-Eval | Avg |
    |---|---|---|---|---|---|---|---|
    | SFT only | 64.9 | 32.3 | 42.0 | 49.7 | 41.1 | 34.8 | 44.1 |
    | RL only | 52.7 | 35.3 | 43.0 | 55.1 | 37.1 | 28.2 | 41.9 |
    | SFT+RL | 66.1 | 37.7 | 42.3 | 56.3 | 41.4 | 35.9 | 46.6 |
    | SFT+RL+RFT | 67.0 | 35.7 | 43.7 | 56.7 | 41.3 | 42.0 | 47.7 |

    Training Dynamics

    Training Dynamics and Ablations on Reward Design

    (a) shows training dynamics under different accuracy and time rewards, and (b) shows the effect of tool-call reward on tool usage.

    Recall encourages coverage; IoU demands precision. Using Recall as the reward function during RL presents a drawback: the policy can enlarge the predicted span to envelop the ground-truth interval, which monotonically raises the Recall-based score while ignoring boundary quality. This plateau in the curve of Recall Accuracy Score validates our hypothesized reward hacking. In contrast, IoU explicitly penalizes span inflation via the union term, yielding better-aligned boundaries and more disciplined tool use.
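
    A small numerical example makes the reward-hacking argument concrete (our illustration): an inflated span that swallows the ground-truth interval keeps recall at 1.0 but collapses IoU.

    ```python
    # Recall vs. IoU for a predicted time span (seconds). Inflating the span
    # preserves recall but is penalized by the union term in IoU.

    def temporal_recall(pred, gt):
        inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
        return inter / (gt[1] - gt[0])

    def temporal_iou(pred, gt):
        inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
        union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
        return inter / union

    gt = (120.0, 150.0)       # ground-truth segment
    tight = (118.0, 152.0)    # well-aligned prediction
    inflated = (0.0, 600.0)   # span inflation that envelops the ground truth
    print(temporal_recall(tight, gt), round(temporal_iou(tight, gt), 2))  # 1.0 0.88
    print(temporal_recall(inflated, gt), temporal_iou(inflated, gt))      # 1.0 0.05
    ```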

    Is tool reward really necessary? The Qwen2.5-VL-7B baseline collapses to near-zero tool calls after training in both configurations (w/ and w/o tool reward), indicating that the model does not internalize the tool’s function. After performing cold-start SFT to obtain LongVT-7B-SFT, tool-call frequency rises during training under both configurations and accuracy improves in tandem. Hence, the tool reward is not required for basic competence: once SFT grounds the tool’s semantics, the model learns when and how to invoke the tool.


    Open-Source Resources
    We open-source LongVT to facilitate future development of long-video reasoning with tool calling in the community:
    • Model Checkpoints: Pre-trained models with SFT, RL, and RFT optimization
    • Training Datasets: VideoSIAH data suite for long-video reasoning
  • Overview

    Our contributions are threefold:

    (1) High-quality multimodal reasoning data curation.
    We provide the first systematic study on constructing SFT and RL datasets for multimodal reasoning, showing that both source diversity and answer diversity are crucial for building reliable supervision signals.

    (2) A strong and reproducible SFT recipe.
    We introduce a robust SFT pipeline with step-by-step validation, careful teacher-model selection, and cross-domain data integration, enabling the construction of a high-quality cold-start reasoning dataset.

    (3) An advanced RL training recipe.
    Through an extensive comparison of GSPO, GRPO, and DAPO, we identify the most stable and scalable RL strategy and build a reliable RL pipeline that significantly strengthens multimodal reasoning performance.

    OpenMMReasoner Performance Comparison

    Performance Comparison with State-of-the-Art Large Multimodal Reasoning Models across Various Benchmarks. Our proposed OpenMMReasoner consistently outperforms competing methods, highlighting its effectiveness in complex reasoning tasks.


    OpenMMReasoner-Data

    OpenMMReasoner-Data presents two training recipes covering both the SFT and RL phases. The pipeline begins by collecting diverse data sources and selecting teacher models to generate new answer traces. During the RL phase, we explore different algorithm choices and filtering strategies, leading to our final optimized recipe.

    OpenMMReasoner Pipeline
    Data Distribution

    Experimental Results on Visual Reasoning Benchmarks

    We evaluate our approach on a suite of public visual reasoning benchmarks. Extensive evaluations demonstrate that our training recipe not only surpasses strong baselines but also highlights the critical role of data quality and training design in shaping multimodal reasoning performance. Notably, our method achieves an 11.6% improvement over the Qwen2.5-VL-7B-Instruct baseline across nine multimodal reasoning benchmarks, establishing a solid empirical foundation for future large-scale multimodal reasoning research.

    Main Experimental Results

    Analysis and Insights for SFT

    Our Analysis and Insights for SFT are as follows:

    (1) Answer diversity enhances reasoning.
    Increasing the diversity of generated answers consistently improves the model’s overall reasoning performance, even when using the same question sources, suggesting that exposure to varied solutions strengthens understanding.

    (2) Teacher model selection is crucial.
    Distilling from a strong teacher model substantially boosts the model’s reasoning ability while maintaining high data efficiency. Careful selection of the teacher model directly affects the quality of the distilled dataset and the final model performance.

    (3) Over-filtering reduces diversity and performance.
    The best results are achieved without excessive filtering, indicating that maintaining greater answer diversity encourages more robust reasoning abilities.

    (4) Cross-domain knowledge improves generalization.
    Incorporating diverse data from multiple domains consistently enhances the model’s overall reasoning capabilities across tasks.

    Teacher Model Analysis
    Answer Diversity Analysis
    Cross-domain Analysis

    Analysis and Insights for RL

    Our Analysis and Insights for RL are as follows:

    (1) GSPO outperforms other algorithms.
    GSPO demonstrates superior stability and faster convergence compared to alternative methods in multimodal RL training.

    (2) Token efficiency is crucial.
    While increasing reasoning steps at test time can improve performance, excessive tokens reduce efficiency. Our results show that a smaller reasoning budget can achieve comparable or even better accuracy.

    (3) Reasoning ability transfers across domains.
    Gains in reasoning during training consistently translate into stronger performance across multiple domains.

    RL Experimental Results
    RL Training Curves
    Validation Curves
    Rollout Number Experiment Curves

    Open-Source Resources
    We open-source OpenMMReasoner to facilitate future development of multimodal reasoning in the community
  • Our previous work, MMSearch-R1, represents a paradigm shift in multimodal AI as the first framework to employ end-to-end reinforcement learning for autonomous tool invocation in large multimodal models (LMMs). By enabling models to independently determine when and how to leverage external search tools, MMSearch-R1 achieves both high efficiency and state-of-the-art performance on open-world tasks, marking a significant advance in practical AI deployment.

    What began as a specialized tool-calling model has since evolved into a general-purpose reasoning engine that seamlessly integrates knowledge retrieval with cognitive processing. This evolution offers critical insights into the future of autonomous AI systems: the most capable agents will not only be able to think deeply, but also actively seek and utilize relevant information as needed.

    Reasoning-improved Search

    Despite MMSearch-R1’s strong performance, we observed limitations in its ability to adapt to complex, dynamic information needs. To address these constraints, we propose a reasoning-first agent paradigm that emphasizes the following core capabilities:

    1. Intelligent search: The model reasons about its knowledge gaps to make decisions about when and how to invoke search tools
    2. Query generation: Deep task understanding enables context-aware query formulation that evolves with the problem
    3. Knowledge integration: External information is systematically incorporated through reasoning processes, not merely retrieved and appended
    4. Performance: The approach delivers fundamental advances in multimodal reasoning, not just incremental improvements

    Training Recipe

    Prior work in multimodal reasoning has demonstrated that training with verifiable rewards can significantly enhance a model’s capabilities in understanding and solving complex STEM problems. In our initial experiments, we evaluated numerous multimodal STEM datasets. We discovered that many existing datasets suffer from various limitations: some lack sufficient difficulty for advanced models, while others contain noisy annotations, incomplete visual-text alignments, or unverifiable ground truth answers. These issues can produce unreliable reward signals that destabilize reinforcement learning training. To address these challenges, we curated a comprehensive high-quality training set consisting of: MMPR[1], MMK12[2], MMR1[3], Multi-subject-RLVR[4], ScienceQA. To ensure data quality for effective multimodal RL training, we implemented a rigorous filtering pipeline:

    1. Multimodal Verification: Every problem undergoes automatic verification to ensure visual and textual components are properly aligned and complete. We filter datasets to include only problems where both modalities contribute meaningfully to the solution process.

    2. Answer Verifiability: Each problem must have verifiable ground truth answers with clear reasoning paths. For mathematical problems, we verify symbolic and numerical answers; for scientific problems, we ensure explanations align with established principles.

    3. Complexity Filtering: Problems must require genuine multimodal reasoning rather than being solvable through text or vision alone. We exclude problems where one modality is merely decorative.

    After filtering, we obtained 80K high-quality multimodal STEM problems for RL training.

    Our RL training stage follows DAPO[5] with the following modifications:

    • No Entropy Loss: We eliminate entropy loss entirely, as its inclusion frequently causes training instability characterized by exponential entropy growth and subsequent collapse.
    • No KL Loss: Following DAPO, we remove KL loss to allow the model to diverge from the original SFT policy’s trust region. This also eliminates reference policy log probability computation, accelerating training.
    • Overlong Filtering: We mask loss for truncated sequences to preserve long-context reasoning capabilities.
    • Learning Rate Schedule: We implement a sigmoid-based decay schedule. The sigmoid schedule provides smooth S-shaped transitions that stabilize early training and asymptotically approach target rates without discontinuities. We keep the base learning rate at 2e-6 and use 60 warmup steps with a sigmoid curve progression; the decay is a sigmoid function that reduces the rate to 90% of the base rate (final LR ≈ 1.8e-6). A sketch of this schedule is shown after this list.
    • Improved Exploration: We set the clip high ratio to 0.3 in the GRPO/PPO surrogate loss to encourage exploration and stabilize entropy dynamics.
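
    The schedule can be sketched as below; the base rate (2e-6), the 60 warmup steps, and the 90% floor come from the description above, while the sigmoid steepness and midpoint constants are our assumptions.

    ```python
    # Sigmoid-based learning-rate schedule sketch: sigmoid ramp over 60 warmup
    # steps to 2e-6, then a smooth sigmoid decay toward 90% of the base rate.
    # Steepness/midpoint constants are assumed, not released values.
    import math

    BASE_LR, WARMUP_STEPS, FINAL_FRACTION = 2e-6, 60, 0.9

    def sigmoid_lr(step, total_steps, midpoint=0.5, steepness=10.0):
        if step < WARMUP_STEPS:
            x = step / WARMUP_STEPS
            return BASE_LR / (1.0 + math.exp(-steepness * (x - midpoint)))
        progress = (step - WARMUP_STEPS) / max(1, total_steps - WARMUP_STEPS)
        decay = 1.0 / (1.0 + math.exp(-steepness * (progress - midpoint)))
        return BASE_LR * (1.0 - (1.0 - FINAL_FRACTION) * decay)  # -> ~1.8e-6
    ```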

    Our reward function employs a two-stage hierarchical approach combining mathematical verification with LLM-based evaluation. We first apply a static mathematical verifier to assess answer correctness for questions with deterministic solutions. When the verifier returns zero, indicating either an incorrect answer or an inability to verify, we employ an LLM-as-judge for a secondary assessment that handles questions requiring semantic evaluation or those with multiple valid representations (e.g., “teal blue” vs. “blue”); the judge evaluates based on the given image, question, ground-truth answer, and model prediction.

    This design prioritizes computational verification for efficiency while leveraging LLM evaluation for complex semantic cases.
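
    In pseudocode, the reward routing reads roughly as follows; `math_verify` and `llm_judge` are stand-ins for the symbolic/numeric checker and the LLM-as-judge call, not our released implementations.

    ```python
    # Two-stage hierarchical reward sketch: cheap deterministic verification
    # first, LLM-as-judge only when the verifier cannot confirm the answer.

    def hierarchical_reward(image, question, gold_answer, prediction,
                            math_verify, llm_judge) -> float:
        score = math_verify(prediction, gold_answer)  # 1.0 if symbolically/numerically equal
        if score > 0:
            return score
        # Verifier returned zero: the answer is wrong or not machine-verifiable
        # (e.g. "teal blue" vs. "blue"); fall back to semantic judgment.
        return float(llm_judge(image=image, question=question,
                               reference=gold_answer, prediction=prediction))
    ```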

    Result

    Based on this foundation, we can build a very strong STEM-focused reasoning model that surpasses other open models.

    | Models | MMK12 | MathVerse (testmini) | MathVision (testmini) | MathVista (testmini) | MMMU (val) |
    |---|---|---|---|---|---|
    | Qwen2.5-VL-7B | 34.4 | 46.2 | 24.0 | 66.6 | 49.8 |
    | OpenVL-Thinker | 31.0 | 45.2 | 24.0 | 70.2 | 52.3 |
    | R1-OneVision | 30.6 | 44.1 | 24.0 | 64.1 | 49.2 |
    | MM-Eureka-7B | 27.0 | 50.3 | 26.9 | 73.0 | 50.7 |
    | General STEM | 46.2 | 51.4 | 28.4 | 73.6 | 57.3 |
    | General STEM -> Search (Two Stage) | 43.0 | 51.9 | 28.0 | 72.4 | 57.9 |

    With this reasoning foundation, we can go further to improve the model’s search abilities. We first implemented a two-stage training process to seamlessly integrate search capabilities. This approach ensures that search becomes a natural extension of the model’s reasoning process rather than a separate module.

    As shown in the figure, compared with our original MMSearch baseline built on Qwen2.5-VL-7B (referred to as Instruct → Search in this context), the model achieves clear improvements. The reasoning-first approach enabled more intelligent search decisions, better query formulation, and more effective utilization of retrieved information.

    Accuracy across four multimodal benchmarks
    Accuracy across four multimodal benchmarks (Infoseek, MMSearch, FVQA, and SimpleVQA). The Reasoning to Search paradigm consistently outperforms or matches Instruct -> Search, especially on Infoseek and MMSearch, demonstrating the benefit of reasoning-first strategies in complex information retrieval tasks.

    One of the most intriguing findings emerged during our evaluation of STEM tasks (e.g., MMMU, MathVision) using Search prompts. We observed a counterintuitive phenomenon: excessive searching actually led to decreased performance. Specifically, models employing Search prompts tended to over-rely on external searches, frequently initiating queries for information that could have been inferred through reasoning or was already available internally.

    Accuracy comparison across five challenging reasoning datasets
    Accuracy comparison across five challenging reasoning datasets. Results indicate that while integrating search generally helps, excessive or unguided searching can lower performance. This underscores the need for precise reasoning-guided search prompting to achieve optimal results in complex multimodal reasoning tasks.

    These performance drops highlight a critical insight: without effective reasoning capabilities to guide their search strategies, models tend to default to inefficient search behaviors. This not only results in unnecessary computational overhead but can also introduce irrelevant information, ultimately degrading the quality of answer generation.

    | Search Ratio | MM-K12 | MathVerse (testmini) | MathVision (testmini) | MathVista (testmini) | MMMU (val) |
    |---|---|---|---|---|---|
    | Reason -> Search (Search Prompt) | 16.8 | 22.9 | 9.5 | 12.5 | 24.7 |

    Reason to Act for General Search Model

    To achieve a robust balance between reasoning and search performance across general-domain tasks, we choose to integrate the training into one stage for both capabilities. Our goal is to build a model that not only retrieves relevant information efficiently but also demonstrates advanced reasoning over searched information.

    Training Recipe

    We unify the training process by adopting a ReACT-style prompt template, inspired by [REACT PAPER], which allows the model to interleave reasoning and action (search) steps within a single trajectory. This template is a slight refinement of the standard Search prompt, and full implementation details are provided in the Appendix.

    The table below summarizes the lineage and training data for each model variant, clarifying the distinctions in model initialization and supervision strategies. For comprehensive information on hyperparameters and training dynamics, please refer to the Appendix.

    Result

    We evaluated both our two-stage and unified (one-stage) models across a broad suite of benchmarks and consistently observed performance improvements as model capacity increased.

    The General STEM model showed that enhancing reasoning capabilities alone can lead to significant gains. In contrast, the General Search model revealed the multiplicative benefits of integrating reasoning with targeted search strategies. Notably, these improvements were not simply incremental - they represent fundamental advances in how models address complex, multimodal problems.

    | Models | MMK12 | MathVerse (testmini) | MathVision (testmini) | MathVista (testmini) | MMMU (val) | AI2D | ChartQA | MME | RealworldQA | OCRBench | DocVQA | MMBench | MMStar | MiaBench |
    |---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
    | Qwen2.5-VL-7B | 34.4 | 46.2 | 24.0 | 66.6 | 49.8 | 93.3 | 94.4 | 630.4/1685.2 | 68.5 | 85.2 | 94.6 | 82.9 | 62.6 | 81.7 |
    | General STEM | 46.2 | 51.4 | 28.4 | 73.6 | 57.3 | 94.4 | 91.4 | 700.7/1662.1 | 67.5 | 83.7 | 92.1 | 83.8 | 65.5 | 76.0 |
    | Reason -> Search | 43.2 | 51.7 | 25.0 | 71.8 | 57.9 | 94.0 | 93.6 | 652.5/1688.3 | 67.5 | 81.7 | 93.5 | 83.2 | 63.1 | 47.6 |
    | General Search | 43.6 | 52.0 | 27.3 | 74.7 | 56.1 | 94.6 | 94.0 | 718.9/1775.3 | 65.5 | 77.8 | 89.4 | 84.0 | 60.4 | 44.4 |

    | Models | Infoseek | MMSearch | FVQA | SimpleVQA |
    |---|---|---|---|---|
    | Qwen2.5-VL-7B | 20.1 | 12.8 | 20.3 | 38.4 |
    | MMSearch | 55.1 | 53.8 | 58.4 | 57.4 |
    | Reasoning -> Search | 58.5 | 57.1 | 57.9 | 57.7 |
    | General Search | 52.0 | 54.9 | 52.8 | 57.0 |

    Our results reveal that MMSearchR1 achieves the highest accuracy across all benchmarks, significantly outperforming standard General Search configurations. The key differentiator is search utilization: MMSearchR1 demonstrates search ratios up to 61.6% on Infoseek, compared to 28.5% for General Search.

    MMSearchR1 performance comparison
    Accuracy and search-ratio comparison between MMSearchR1 and the General Search configuration across benchmarks, showing search ratios up to 61.6% on Infoseek versus 28.5% for General Search.

    We found a strong positive correlation (Pearson r = 0.911) between search ratio and model performance, indicating that increased search engagement directly improves accuracy. However, this relationship has limits—excessive or undirected search introduces computational costs and answer noise that can degrade reliability. Additional experiments with reduced STEM data, increased search data ratios, and shortened warmup periods (60 vs 45 steps) confirmed that better performance requires strategic search integration. Models perform best when search is invoked selectively through explicit reasoning about information needs, balancing enhanced knowledge access against computational efficiency. These findings demonstrate that the key to multimodal model performance lies not in maximizing search frequency, but in developing sophisticated reasoning mechanisms that determine when external information retrieval adds value to complex query resolution.

    Case Study

    We present the following cases to demonstrate the versatile abilities of our final model.

    Case: MME

    In this example from the MME benchmark, the model is required to answer a question about a statue located in the National Gallery of Art in Washington, D.C. The process begins with the model analyzing the query image to determine what additional information is needed. It then performs searches for visually similar images, systematically evaluates the retrieved results, and conducts follow-up searches from different perspectives to verify its findings. This iterative search-and-reasoning approach allows the model to gather comprehensive evidence before arriving at a well-supported conclusion.

    MME benchmark case study
    Example from the MME benchmark showing the model's iterative search-and-reasoning approach to identify a statue in the National Gallery of Art.

    Case: Writing Email to a Public Figure

    In this case, the model is tasked with composing an email to Abdullah Shahid Sial, a public figure. To accomplish this effectively, the model must gather comprehensive information about him through internet searches, including his social media presence (Twitter), official website, professional background, and other publicly available information sources.

    Email composition case study
    Case study showing the model's research process when tasked with writing an email to Abdullah Shahid Sial, demonstrating comprehensive information gathering capabilities.

    Reference

    [1] https://huggingface.co/datasets/OpenGVLab/MMPR-v1.2

    [2] https://huggingface.co/datasets/FanqingM/MMK12

    [3] https://huggingface.co/datasets/MMR1/MMR1-Math-RL-Data-v0

    [4] https://huggingface.co/datasets/virtuoussy/Multi-subject-RLVR

    Appendix

    Reasoning Template

    {question}
    Please reason step by step. Output the thinking process within <think> </think> tags and final answer within <answer> </answer> tags.

    Search Template

    Answer the user's question based on the provided image. Examine the image carefully and identify any recognizable entities, such as faces, objects, locations, events, logos, or text. Determine whether you have sufficient knowledge to confidently recognize the main visual element and answer the user's question. If so, first explain your reasoning, then provide a clear and direct answer.
    If you are unable to confidently identify the visual element, stop and invoke the image search tool by appending the string <search><img></search> at the end of your response. This will trigger a Google Lens search using the original image to retrieve relevant information that can help you confirm the visual content.
    Once you have sufficient visual understanding, combine it with the user's question and assess whether you can confidently answer. If so, answer the question directly using your own knowledge. If not, invoke the text search tool by generating a concise and specific query, and output it in the format <text_search>your query here</text_search> at the end of your response. Carefully craft your query to accurately retrieve the information needed to help answer the question. The text search tool will then use Google Search to return relevant information based on your query.
    You must include your reasoning inside <reason>...</reason> before taking any action, whether it is calling the image search tool, generating a text search query, or providing a final answer. The reasoning may involve analysis of the original image and question, interpretation of search results, or logical steps leading to the final answer.
    All search results will be placed inside <information> and </information> and returned to you. When you are ready to answer the question, wrap your final answer between <answer> and </answer>, without detailed illustrations. For example: <answer>Titanic</answer>.
    Here is the image and the question:
    <image>
    {question}

    ReACT Template

    # System Message
    You are a helpful assistant. You should strictly follow reason-to-act thinking process to answer user provided question. Namely, you should first analyze the question & observation (e.g., user provided image or search results) and then inform the following action. The thinking process should be within <reason> and </reason> tags. The actions you can choose are:
    <answer>xxxxx</answer>:  which returns the answer within <answer> and </answer> tags, and finishes the task.
    <search>image</search>: which searches user provided image on Google and returns image-related visual entity/concept/knowledge for further reason-to-act. The search results are placed between <observation> and </observation> tags.
    <search>text query</search>:  which generates a text query and sent to Google and returns some snippets containing the answer for further reason-to-act. The search results are placed between <observation> and </observation> tags. Note that sometimes the snippets do not contain the answer, and some alternative search might be needed.
     
    Your output format should be one of the following three formats:
    <reason> YOUR THINKING PROCESS </reason>
    <answer> YOUR ANSWER AFTER GETTING ENOUGH INFORMATION </answer>
    or
    <reason> YOUR THINKING PROCESS </reason>
    <search> IMAGE </search>
    or
    <reason> YOUR THINKING PROCESS </reason>
    <search> YOUR GENERATED TEXT QUERY FOR HELPING YOU FIND INFORMATION ON GOOGLE TO ANSWER USER QUESTION </search>
     
    Only output the final answer (in words, numbers or phrase) inside the <answer></answer> tags, without any explanations or extra information. If this is a yes-or-no question, you should only answer yes or no.
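
    A minimal controller for this template might parse the model output as follows; `generate`, `image_search`, and `text_search` are placeholders for the model call, the Google Lens lookup, and the Google text search, not the project's actual interfaces.

    ```python
    # Minimal controller sketch for the ReACT-style template above.
    import re

    def react_rollout(generate, image_search, text_search, image, question,
                      max_turns=5):
        transcript = f"<image>\n{question}"
        for _ in range(max_turns):
            output = generate(transcript)                       # <reason> + one action
            answer = re.search(r"<answer>(.*?)</answer>", output, re.S)
            if answer:
                return answer.group(1).strip()                  # task finished
            search = re.search(r"<search>(.*?)</search>", output, re.S)
            if search is None:
                break                                           # malformed action
            query = search.group(1).strip()
            obs = image_search(image) if query.lower() == "image" else text_search(query)
            transcript += output + f"\n<observation>{obs}</observation>\n"
        return None
    ```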