LLaVA-OneVision-1.5-RL: Unlocking Multimodal Reasoning via Lightweight Reinforcement Learning

Applying reinforcement learning post-training to enhance reasoning capabilities in multimodal models with significant improvements on STEM, coding, and reasoning tasks.

Overview

LLaVA-OneVision-1.5-RL adds an RL post-training stage that uses 67K curated examples, selected with a discrepancy-based strategy, to elicit explicit chain-of-thought reasoning. It achieves significant gains on STEM, coding, and reasoning benchmarks while preserving general visual understanding.

Our contributions are threefold:

(1) Discrepancy-Driven Data Curation. We identify tasks with a large gap between Pass@N and Pass@1 performance, targeting “latent capability” rather than knowledge injection.

(2) Rule-Based Reward System. We employ domain-specific verification rules rather than learned preference models, enabling precise feedback across STEM, grounding, spatial reasoning, counting, coding, OCR, and diagram tasks.

(3) Two-Stage Curriculum Training. We design a training curriculum that first stabilizes concise task performance with answer-only RL, then unlocks deeper reasoning through chain-of-thought RL.

RL Training Data Distribution
Distribution of task categories in the RL training data (67K total instances)

RL Data Strategy

Discrepancy-Driven Selection

We identify tasks where a large gap exists between the model's Pass@N and Pass@1 performance. This approach targets “latent capability” rather than knowledge injection, ensuring the model learns to better utilize its existing knowledge.
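As a rough illustration, this selection can be framed as keeping tasks whose Pass@N noticeably exceeds Pass@1. The sketch below assumes verified rollout results are already available; the function names and the 0.15 gap threshold are illustrative, not values from the released pipeline.

```python
import numpy as np

def pass_rates(results: np.ndarray) -> tuple[float, float]:
    """results: bool array of shape (num_questions, N) indicating whether each
    of N sampled responses per question was verified correct."""
    pass_at_1 = results[:, 0].mean()        # first (or greedy) sample only
    pass_at_n = results.any(axis=1).mean()  # correct in at least one of N samples
    return float(pass_at_1), float(pass_at_n)

def select_latent_capability_tasks(task_results: dict[str, np.ndarray],
                                   min_gap: float = 0.15) -> list[str]:
    """Keep tasks where Pass@N clearly exceeds Pass@1: the model already solves
    them occasionally, so RL can amplify that latent skill instead of trying to
    inject new knowledge."""
    selected = []
    for task, results in task_results.items():
        p1, pn = pass_rates(results)
        if pn - p1 >= min_gap:
            selected.append(task)
    return selected
```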

Reward-Based Sampling

Multiple candidate responses are filtered by average reward scores to exclude trivial and unsolvable cases, focusing on medium-difficulty instances that provide optimal learning signals.
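Concretely, this filtering can score several candidate responses per instance with the task's reward rule and keep only instances whose mean reward falls in a middle band. The thresholds below (0.1 and 0.9) are illustrative assumptions, not the paper's exact values.

```python
def filter_by_difficulty(instances, rollout_rewards, low=0.1, high=0.9):
    """instances: list of training examples; rollout_rewards: per-instance list
    of reward scores for sampled candidate responses. Mean reward near 0 means
    unsolvable, near 1 means trivial; both give weak learning signal."""
    kept = []
    for inst, rewards in zip(instances, rollout_rewards):
        mean_r = sum(rewards) / len(rewards)
        if low < mean_r < high:
            kept.append(inst)
    return kept
```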


Reward System Architecture

We employ a rule-based paradigm with domain-specific verification rules rather than learned preference models:

| Category | Source | Reward Design |
|---|---|---|
| STEM | ViRL39K | Choice accuracy & math expression equivalence |
| Grounding | Ref-L4, VigoRL-SA | IoU between predicted/reference boxes; choice accuracy |
| Spatial | VigoRL-SAT | Choice accuracy |
| Counting | PixmoCount | Numeric token equivalence |
| Coding | WebCode2M, UniSVG | Token/tag overlap; SVG rendering similarity [0,1] |
| OCR | InfoVQA | Text similarity |
| Diagram | AI2D | Choice accuracy |
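To make the rule-based design concrete, the sketch below shows how a per-category reward dispatcher might look. The function names and matching heuristics (regex option extraction, difflib text similarity) are illustrative assumptions; the actual system also includes math-expression equivalence and SVG rendering checks that are omitted here.

```python
import difflib
import re

def choice_reward(pred: str, gold: str) -> float:
    """1.0 if the extracted option letter matches the reference, else 0.0."""
    m = re.search(r"\b([A-E])\b", pred.strip().upper())
    return 1.0 if m and m.group(1) == gold.strip().upper() else 0.0

def iou_reward(pred_box, gold_box) -> float:
    """Intersection-over-union of [x1, y1, x2, y2] boxes for grounding tasks."""
    area = lambda b: max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    ix1, iy1 = max(pred_box[0], gold_box[0]), max(pred_box[1], gold_box[1])
    ix2, iy2 = min(pred_box[2], gold_box[2]), min(pred_box[3], gold_box[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = area(pred_box) + area(gold_box) - inter
    return inter / union if union > 0 else 0.0

def counting_reward(pred: str, gold: str) -> float:
    """Numeric token equivalence: compare the last number in the prediction."""
    nums = re.findall(r"-?\d+(?:\.\d+)?", pred)
    return 1.0 if nums and float(nums[-1]) == float(gold) else 0.0

def text_similarity_reward(pred: str, gold: str) -> float:
    """Continuous [0, 1] similarity for OCR-style free-form answers."""
    return difflib.SequenceMatcher(None, pred.strip(), gold.strip()).ratio()

REWARD_FNS = {
    "stem": choice_reward,        # math-expression equivalence omitted here
    "grounding": iou_reward,
    "spatial": choice_reward,
    "counting": counting_reward,
    "ocr": text_similarity_reward,
    "diagram": choice_reward,
}

def compute_reward(category: str, pred, gold) -> float:
    """Route a model response to its category's verification rule."""
    return REWARD_FNS[category](pred, gold)
```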

Two-Stage Training Procedure

Training uses Group Relative Policy Optimization (GRPO) within the AReaL asynchronous framework:

Stage 1: Answer-only RL

Training on the normal data split with the instruction “Put ONLY your final answer within <answer></answer>.” This stage stabilizes performance on concise, answer-only tasks.

Stage 2: Chain-of-Thought RL

Training on long-reasoning data with the instruction “Think and solve… within <think></think>…” This stage unlocks deeper reasoning capabilities; a small proportion of normal-set examples is interspersed to prevent forgetting of perception skills.
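GRPO needs no learned value model: advantages are computed relative to the group of rollouts sampled for the same prompt, each scored by the rule-based rewards above. A minimal sketch of that group-relative normalization follows; the function name and group size are illustrative, not taken from the AReaL codebase.

```python
import numpy as np

def grpo_advantages(group_rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Group-relative advantages in the GRPO style: normalize each rollout's
    reward by the mean and standard deviation of its group, so rollouts that
    beat their siblings get positive advantage and weaker ones get negative."""
    mean, std = group_rewards.mean(), group_rewards.std()
    return (group_rewards - mean) / (std + eps)

# Example: 8 rollouts of one prompt scored by the rule-based reward system.
rewards = np.array([1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0])
print(grpo_advantages(rewards))  # positive for correct rollouts, negative otherwise
```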


Performance Results

Core Capability Enhancement

General VQA Benchmarks (Average +1.0):

| Benchmark | Base | +RL |
|---|---|---|
| MMStar | 67.7 | 68.2 |
| MMBench (EN) | 84.1 | 85.7 |
| MMBench (CN) | 81.0 | 84.2 |
| MME-RealWorld (EN) | 61.7 | 63.4 |
| CV-Bench | 80.7 | 82.9 |
| RealWorldQA | 68.1 | 68.4 |

Reasoning Tasks (Average +6.0):

| Benchmark | Base | +RL | Δ |
|---|---|---|---|
| MathVista Mini | 69.6 | 72.3 | +2.7 |
| WeMath | 61.5 | 69.4 | +7.9 |
| MathVision | 25.6 | 34.4 | +8.8 |
| MMMU Validation | 55.4 | 58.8 | +3.4 |
| MMMU-Pro | 25.2 | 35.7 | +10.5 |

OCR & Chart (Average +0.0):

| Benchmark | Base | +RL |
|---|---|---|
| ChartQA | 86.5 | 87.4 |
| DocVQA | 95.0 | 91.9 |
| InfoVQA | 78.4 | 76.6 |

Extended Capability Analysis

Extended Performance Comparison
Performance comparison of LLaVA-OV-1.5 and corresponding RL version on Spatial Reasoning & Grounding and Coding tasks

Spatial & Grounding: RL “fast mode” significantly enhances fine-grained perception on SAT and Ref-L4 benchmarks.

Coding: “Thinking” mode achieves the highest scores on Design2Code and UniSVG, demonstrating the benefit of chain-of-thought for structured code generation.


Development Roadmap

This release represents Stage 3 in a multi-phase project:

| Stage | Focus | Data Scale |
|---|---|---|
| Stage 1 & 1.5 | Pre-training & Mid-training | 85M multimodal samples |
| Stage 2 | Visual instruction tuning (SFT) | 22M instruction-following samples |
| Stage 3 (Current) | RL post-training with GRPO | 67K curated samples |

Acknowledgements

We thank the following projects and frameworks:

  • AReaL: Lightning-Fast RL for LLM Reasoning and Agents
  • sglang: Fast serving framework for LLMs and vision language models
  • lmms-eval: Standardized evaluation framework
  • LLaVA: Large Language-and-Vision Assistant
  • LLaVA-NeXT: Next-generation multi-modal assistant

Open-Source Resources

Complete LLaVA-OneVision-1.5-RL resources for the community:

  • Model Checkpoints: pre-trained models with RL optimization
  • Training Datasets: curated RL training data
  • Base Model: the LLaVA-OneVision-1.5 foundation

Authors

Didi Zhu*, Zhiyu Qu*, Zerui Chen, Polydefkis Gkagkos, Xiang An, Bo Li

* Main Authors

Acknowledgement

Project led by Changrui Chen and Jiankang Deng. Built upon contributions from the LLaVA-OneVision community.