

  • Overview

    Our contributions are threefold:

    (1) High-quality multimodal reasoning data curation.
    We provide the first systematic study on constructing SFT and RL datasets for multimodal reasoning, showing that both source diversity and answer diversity are crucial for building reliable supervision signals.

    (2) A strong and reproducible SFT recipe.
    We introduce a robust SFT pipeline with step-by-step validation, careful teacher-model selection, and cross-domain data integration, enabling the construction of a high-quality cold-start reasoning dataset.

    (3) An advanced RL training recipe.
    Through an extensive comparison of GSPO, GRPO, and DAPO, we identify the most stable and scalable RL strategy and build a reliable RL pipeline that significantly strengthens multimodal reasoning performance.

    OpenMMReasoner Performance Comparison

    Performance Comparison with State-of-the-Art Large Multimodal Reasoning Models across Various Benchmarks. Our proposed OpenMMReasoner consistently outperforms competing methods, highlighting its effectiveness in complex reasoning tasks.


    OpenMMReasoner-Data

    OpenMMReasoner-Data comprises two training recipes covering both the SFT and RL phases. The pipeline begins by collecting diverse data sources and selecting teacher models to generate new answer traces. During the RL phase, we explore different algorithm choices and filtering strategies, leading to our final optimized recipe.
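    As a rough illustration of the answer-trace generation step, the sketch below samples several traces per question from a teacher model and keeps only those whose final answer matches the ground truth. The `query_teacher` callable, the `\boxed{}` answer convention, and the sampling parameters are illustrative assumptions, not the actual OpenMMReasoner implementation.

    ```python
    # Hypothetical sketch: distill verified reasoning traces from a teacher model.
    # `query_teacher` stands in for whatever API serves the chosen teacher LMM.
    import re
    from typing import Callable, Optional

    def extract_final_answer(trace: str) -> Optional[str]:
        """Pull the last \\boxed{...} answer out of a reasoning trace, if any."""
        matches = re.findall(r"\\boxed\{([^}]*)\}", trace)
        return matches[-1].strip() if matches else None

    def distill_traces(samples, query_teacher: Callable[[str, float], str],
                       traces_per_question: int = 4, temperature: float = 0.8):
        """Sample several teacher traces per (question, gold_answer) pair and
        keep only those whose final answer matches the gold answer."""
        kept = []
        for question, gold in samples:
            for _ in range(traces_per_question):
                trace = query_teacher(question, temperature)
                pred = extract_final_answer(trace)
                if pred is not None and pred == str(gold).strip():
                    kept.append({"question": question, "trace": trace, "answer": gold})
        return kept
    ```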

    OpenMMReasoner Pipeline
    Data Distribution

    Experimental Results on Visual Reasoning Benchmarks

    We evaluate our approach on a suite of public visual reasoning benchmarks. Extensive evaluation shows that our training recipe not only surpasses strong baselines but also highlights the critical role of data quality and training design in shaping multimodal reasoning performance. Notably, our method achieves an 11.6% improvement over the Qwen2.5-VL-7B-Instruct baseline across nine multimodal reasoning benchmarks, establishing a solid empirical foundation for future large-scale multimodal reasoning research.

    Main Experimental Results

    Analysis and Insights for SFT

    Our analysis and insights for SFT are as follows:

    (1) Answer diversity enhances reasoning.
    Increasing the diversity of generated answers consistently improves the model’s overall reasoning performance, even when using the same question sources, suggesting that exposure to varied solutions strengthens understanding.

    (2) Teacher model selection is crucial.
    Distilling from a strong teacher model substantially boosts the model's reasoning ability while maintaining high data efficiency. Careful selection of the teacher model directly affects the quality of the distilled dataset and the final model performance.

    (3) Over-filtering reduces diversity and performance.
    The best results are achieved without excessive filtering, indicating that preserving greater answer diversity encourages more robust reasoning abilities (a minimal sketch of such a light filter follows this list).

    (4) Cross-domain knowledge improves generalization.
    Incorporating diverse data from multiple domains consistently enhances the model’s overall reasoning capabilities across tasks.
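    As a rough illustration of a light, diversity-preserving filter, the sketch below keeps several distinct correct traces per question instead of collapsing each question to a single "best" solution. The record format and the length-based diversity proxy are illustrative assumptions, not the filtering criteria used in OpenMMReasoner.

    ```python
    # Hypothetical sketch: a light, correctness-only filter that retains several
    # correct traces per question rather than deduplicating down to one.
    from collections import defaultdict

    def light_filter(records, max_traces_per_question: int = 4):
        """`records` are dicts with 'question', 'trace', and 'is_correct' keys.
        Keep up to `max_traces_per_question` correct traces per question,
        spreading the kept traces across lengths as a cheap diversity proxy."""
        by_question = defaultdict(list)
        for record in records:
            if record["is_correct"]:
                by_question[record["question"]].append(record)

        kept = []
        for traces in by_question.values():
            # Sort by trace length so the retained subset spans short and long solutions.
            traces.sort(key=lambda r: len(r["trace"]))
            step = max(1, len(traces) // max_traces_per_question)
            kept.extend(traces[::step][:max_traces_per_question])
        return kept
    ```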

    Teacher Model Analysis
    Answer Diversity Analysis
    Cross-domain Analysis

    Analysis and Insights for RL

    Our analysis and insights for RL are as follows:

    (1) GSPO outperforms other algorithms.
    GSPO demonstrates superior stability and faster convergence than alternative methods in multimodal RL training (a minimal sketch of the group-relative objective these algorithms share follows this list).

    (2) Token efficiency is crucial.
    While increasing reasoning steps at test time can improve performance, excessive tokens reduce efficiency. Our results show that a smaller reasoning budget can achieve comparable or even better accuracy.

    (3) Reasoning ability transfers across domains.
    Gains in reasoning during training consistently translate into stronger performance across multiple domains.
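    The sketch below illustrates the group-relative machinery that GRPO and GSPO share: rewards are normalized within each group of rollouts for the same prompt, and GSPO (as described in its paper) replaces GRPO's per-token importance ratios with a single length-normalized sequence-level ratio. The clipping range and toy numbers are illustrative; the exact losses and reward definitions used in OpenMMReasoner's RL recipe are not reproduced here.

    ```python
    # Hypothetical sketch of the group-relative objective behind GRPO/GSPO.
    import numpy as np

    def group_relative_advantages(rewards: np.ndarray) -> np.ndarray:
        """Normalize rewards within a group of rollouts for the same prompt."""
        return (rewards - rewards.mean()) / (rewards.std() + 1e-6)

    def gspo_sequence_ratios(logp_new, logp_old, lengths) -> np.ndarray:
        """Length-normalized sequence-level importance ratios (GSPO-style);
        GRPO would instead use one ratio per generated token."""
        return np.exp((logp_new - logp_old) / lengths)

    def clipped_objective(ratios, advantages, eps: float = 0.2) -> float:
        """PPO-style clipped surrogate, averaged over the group."""
        clipped = np.clip(ratios, 1 - eps, 1 + eps)
        return float(np.mean(np.minimum(ratios * advantages, clipped * advantages)))

    # Toy usage: four rollouts for one prompt, rewarded 1 if the answer is correct.
    rewards = np.array([1.0, 0.0, 1.0, 0.0])
    advantages = group_relative_advantages(rewards)
    ratios = gspo_sequence_ratios(np.array([-55.0, -80.0, -60.0, -70.0]),   # summed log-probs, new policy
                                  np.array([-56.0, -79.0, -61.0, -72.0]),   # summed log-probs, old policy
                                  lengths=np.array([120, 200, 150, 180]))
    loss = -clipped_objective(ratios, advantages)
    ```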

    RL Experimental Results
    RL Training Curves
    Validation Curves
    Rollout Number Experiment Curves

    Open-Source Resources
    We open-source OpenMMReasoner to facilitate future development of multimodal reasoning in the community.
  • Multimodal-SAE Banner
    Multimodal-SAE: First demonstration of SAE-based feature interpretation in Large Multimodal Models

    Overview

    For the first time in the multimodal domain, we demonstrate that features learned by Sparse Autoencoders (SAEs) in a smaller Large Multimodal Model (LMM) can be effectively interpreted by a larger LMM. Our work introduces the use of SAEs to analyze the open-semantic features of LMMs, providing a breakthrough solution for feature interpretation across various model scales.

    Inspiration and Motivation

    This research is inspired by Anthropic’s remarkable work on applying SAEs to interpret features in large-scale language models. In multimodal models, we discovered intriguing features that:

    • Correlate with diverse semantics across visual and textual modalities
    • Can be leveraged to steer model behavior for precise control
    • Enable deeper understanding of LMM functionality and decision-making

    Technical Approach

    SAE Training Pipeline

    The Sparse Autoencoder (SAE) is trained using a targeted approach:

    1. Integration Strategy - SAE integrated into a specific layer of the model
    2. Frozen Architecture - All other model components remain frozen during training
    3. Training Data - Utilizes LLaVA-NeXT dataset for comprehensive multimodal coverage
    4. Feature Learning - Learns sparse, interpretable representations of multimodal features
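    A minimal sketch of what SAE training on frozen hidden states can look like is given below. The toy dimensions, dictionary width, and L1 coefficient are illustrative assumptions, not the values used for Multimodal-SAE; in practice the input dimension matches the hooked LMM layer and the activations come from LLaVA-NeXT samples.

    ```python
    # Hypothetical sketch: a sparse autoencoder trained on frozen hidden states.
    import torch
    import torch.nn as nn

    class SparseAutoencoder(nn.Module):
        def __init__(self, d_model: int, d_hidden: int):
            super().__init__()
            self.encoder = nn.Linear(d_model, d_hidden)
            self.decoder = nn.Linear(d_hidden, d_model)

        def forward(self, x: torch.Tensor):
            features = torch.relu(self.encoder(x))   # sparse feature activations
            recon = self.decoder(features)           # reconstruction of the hidden state
            return recon, features

    def sae_loss(recon, x, features, l1_coeff: float = 1e-3):
        """Reconstruction error plus an L1 penalty that encourages sparsity."""
        return ((recon - x) ** 2).mean() + l1_coeff * features.abs().mean()

    # Toy training step on activations captured from a frozen LMM layer.
    sae = SparseAutoencoder(d_model=1024, d_hidden=8192)
    optimizer = torch.optim.Adam(sae.parameters(), lr=1e-4)
    hidden_states = torch.randn(8, 1024)             # stand-in for captured activations
    recon, features = sae(hidden_states)
    loss = sae_loss(recon, hidden_states, features)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    ```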

    Auto-Explanation Pipeline

    Our novel auto-explanation pipeline analyzes visual features through:

    • Activation Region Analysis - Identifies where features activate in visual inputs
    • Semantic Correlation - Maps features to interpretable semantic concepts
    • Cross-Modal Understanding - Leverages larger LMMs for feature interpretation
    • Automated Processing - Scalable interpretation without manual annotation
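    The sketch below shows one way such an auto-explanation loop can be organized: gather the inputs and token positions where a feature activates most strongly, then ask a larger LMM to describe what they have in common. The `explain_with_large_lmm` callable and the record format are hypothetical stand-ins, not the pipeline's actual interface.

    ```python
    # Hypothetical sketch of the auto-explanation loop for a single SAE feature.
    import heapq

    def top_activating_examples(feature_id, activation_records, k: int = 8):
        """`activation_records` are (image_id, token_index, activations) tuples,
        where `activations` maps feature ids to activation values."""
        scored = [(acts.get(feature_id, 0.0), image_id, token_index)
                  for image_id, token_index, acts in activation_records]
        return heapq.nlargest(k, scored)

    def auto_explain(feature_id, activation_records, explain_with_large_lmm):
        """Ask a larger LMM to name the concept shared by the top-activating regions."""
        top = top_activating_examples(feature_id, activation_records)
        regions = [(image_id, token_index) for _, image_id, token_index in top]
        prompt = ("These image regions all strongly activate the same feature. "
                  "Describe, in one short phrase, what they have in common.")
        return explain_with_large_lmm(prompt, regions)
    ```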

    Feature Steering and Control

    Feature Steering Demonstration
    Demonstration of feature steering: These learned features can be used to control model behavior and generate desired outputs

    Behavioral Control Capabilities

    The learned features enable precise model steering by:

    • Selective Feature Activation - Amplifying specific semantic features
    • Behavioral Modification - Directing model attention and responses
    • Interpretable Control - Understanding why specific outputs are generated
    • Fine-Grained Manipulation - Precise control over model behavior
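    The sketch below shows one common way to implement this kind of steering: add a scaled copy of a chosen feature's decoder direction to the hidden states at the hooked layer during generation. The feature id, scale, and hook placement are illustrative assumptions rather than the settings used in Multimodal-SAE.

    ```python
    # Hypothetical sketch: steer generation by adding a scaled SAE decoder direction
    # to the hooked layer's hidden states via a standard PyTorch forward hook.
    import torch

    def make_steering_hook(decoder_weight: torch.Tensor, feature_id: int, scale: float):
        direction = decoder_weight[:, feature_id]    # (d_model,) column for this feature

        def hook(module, inputs, output):
            hidden = output[0] if isinstance(output, tuple) else output
            hidden = hidden + scale * direction.to(hidden.device, hidden.dtype)
            return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

        return hook

    # Usage (assumes `model` is a loaded LMM and `layer` is the hooked decoder block):
    # handle = layer.register_forward_hook(make_steering_hook(sae.decoder.weight, 123, 8.0))
    # ...generate with `model` while the hook is active...
    # handle.remove()
    ```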

    Key Contributions

    🔬 First Multimodal SAE Implementation

    Pioneering application of SAE methodology to multimodal models, opening new research directions in mechanistic interpretability.

    🎯 Cross-Scale Feature Interpretation

    Demonstration that smaller LMMs can learn features interpretable by larger models, enabling scalable analysis approaches.

    🎮 Model Steering Capabilities

    Practical application of learned features for controllable model behavior and output generation.

    🔄 Auto-Explanation Pipeline

    Automated methodology for interpreting visual features without requiring manual semantic labeling.

    Research Impact

    Mechanistic Interpretability Advancement

    This work represents a significant advancement in understanding how multimodal models process and integrate information across modalities.

    Practical Applications

    • Model Debugging - Understanding failure modes and biases
    • Controllable Generation - Steering model outputs for specific applications
    • Safety and Alignment - Better control over model behavior
    • Feature Analysis - Deep understanding of learned representations

    Future Directions

    Our methodology opens new research avenues in:

    1. Cross-Modal Feature Analysis - Understanding feature interactions across modalities
    2. Scalable Interpretability - Extending to larger and more complex models
    3. Real-Time Steering - Dynamic control during inference
    4. Safety Applications - Preventing harmful or biased outputs

    Technical Details

    Architecture Integration

    The SAE is carefully integrated to:

    • Preserve Model Performance - Minimal impact on original capabilities
    • Capture Rich Features - Learn meaningful sparse representations
    • Enable Interpretation - Facilitate analysis by larger models
    • Support Steering - Allow runtime behavioral modification

    Evaluation Methodology

    Our approach is validated through:

    • Feature Interpretability - Qualitative analysis of learned features
    • Steering Effectiveness - Quantitative measurement of behavioral control
    • Cross-Model Validation - Testing interpretation across different model sizes
    • Semantic Consistency - Verifying feature stability and meaning

    Conclusion

    Multimodal-SAE represents a breakthrough in multimodal mechanistic interpretability, providing the first successful demonstration of SAE-based feature interpretation in the multimodal domain. Our work enables:

    • Deeper Understanding of how LMMs process multimodal information
    • Practical Control over model behavior through feature steering
    • Scalable Interpretation methods for increasingly complex models
    • Foundation Research for future advances in multimodal AI safety and control

    This research establishes a new paradigm for understanding and controlling Large Multimodal Models, with significant implications for AI safety, controllability, and interpretability research.