LLaVA-OneVision-1.5-RL: Unlocking Multimodal Reasoning via Lightweight Reinforcement Learning

Dec 15, 2025 · models

Project led by Changrui Chen and Jiankang Deng


Overview

LLaVA-OneVision-1.5-RL adds an RL post-training stage that uses 67K curated examples, selected via a discrepancy-based strategy, to elicit explicit chain-of-thought reasoning. It delivers significant gains on STEM, coding, and reasoning benchmarks while preserving general visual understanding.

Our contributions are threefold:

(1) Discrepancy-Driven Data Curation. We identify tasks with a large gap between Pass@N and Pass@1, targeting “latent capability” rather than knowledge injection.

(2) Rule-Based Reward System. We employ domain-specific verification rules rather than learned preference models, enabling precise feedback across STEM, grounding, spatial reasoning, counting, coding, OCR, and diagram tasks.

(3) Two-Stage Curriculum Training. We design a training curriculum that first stabilizes concise task performance with answer-only RL, then unlocks deeper reasoning through chain-of-thought RL.

(a) Total RL corpus, 67.0K: STEM 58.5% · Grounding 22.6% · Spatial 6.4% · Coding 6.0% · Counting 4.2% · OCR & Diagram 2.3%
(b) Stage 1, answer-only, 19.9K: Grounding 75.0% · Counting 14.1% · OCR & Diagram 10.9%
(c) Stage 2, chain-of-thought, 49.2K: STEM 79.0% · Spatial 8.5% · Coding 8.1% · Grounding 3.0% · OCR & Diagram 0.8% · Counting 0.6%

Distribution of task categories in the RL training data. (a) Total RL corpus (67K instances). (b) Stage 1: Answer-only training. (c) Stage 2: Chain-of-thought training.


RL Data Strategy

Discrepancy-Driven Selection

If a model can solve a task given enough attempts (high Pass@N) but rarely gets it right on the first try (low Pass@1), it already has the latent capability — it just needs to learn to use it reliably. We select tasks with this gap for RL training, filtering out tasks that are too easy (high Pass@1, nothing to learn) or too hard (low Pass@N, beyond current capability).
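As a minimal sketch of this selection rule, assuming each task has N sampled attempts scored as 0/1 (the thresholds here are illustrative, not the exact values used):

python
def select_tasks(attempts_per_task, min_gap=0.2, max_pass1=0.6):
    """attempts_per_task: {task_id: list of 0/1 correctness over N sampled attempts}."""
    selected = []
    for task_id, attempts in attempts_per_task.items():
        pass_at_1 = sum(attempts) / len(attempts)   # expected first-try accuracy
        pass_at_n = float(any(attempts))            # solvable within N attempts
        too_easy = pass_at_1 > max_pass1            # high Pass@1: nothing left to learn
        too_hard = pass_at_n == 0.0                 # low Pass@N: beyond current capability
        if not too_easy and not too_hard and (pass_at_n - pass_at_1) >= min_gap:
            selected.append(task_id)                # latent-capability gap: keep for RL
    return selected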

Task selection by capability gap:
  • Selected (latent capability gap): Pass@1 ≈ 32%, Pass@N ≈ 78%
  • Too easy: Pass@1 ≈ 85%, Pass@N ≈ 91%
  • Too hard: Pass@1 ≈ 5%, Pass@N ≈ 8%

Reward-Based Sampling

Multiple candidate responses are filtered by average reward scores to exclude trivial and unsolvable cases, focusing on medium-difficulty instances that provide optimal learning signals.
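A minimal sketch of this filter, assuming each instance carries the reward scores of its sampled candidate responses (the cutoffs are illustrative):

python
def filter_by_mean_reward(instances, low=0.1, high=0.9):
    """Keep medium-difficulty instances whose mean candidate reward lies strictly inside (low, high)."""
    kept = []
    for inst in instances:
        mean_reward = sum(inst["rewards"]) / len(inst["rewards"])
        if low < mean_reward < high:   # drop trivial (≈1.0) and unsolvable (≈0.0) cases
            kept.append(inst)
    return kept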

Reward-based sampling filter (mean candidate reward on a 0.0–1.0 scale):
  • Excluded (unsolvable): reward ≈ 0.0
  • Selected (optimal learning signal): medium difficulty
  • Excluded (trivial): reward ≈ 1.0

Reward System Architecture

Our RL setup employs a rule-based reward paradigm, where rewards are derived directly from task outcomes rather than learned preference models. Since different answer types require distinct verification strategies, we design answer-type-specific scoring rules via the reward/ module.
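A hypothetical sketch of how such answer-type-specific rules could be dispatched; the reward functions referenced here are the ones defined below, but the registry itself is illustrative rather than the repository's actual wiring:

python
# Illustrative dispatcher: each sample's category selects its reward rule from the reward/ module.
REWARD_FNS = {
    "stem": math_reward_fn,                 # reward/math.py
    "grounding": bbox_reward_fn,            # reward/bbox.py
    "spatial": multiplechoice_reward_fn,    # reward/multiple_choice.py
    "counting": number_reward_fn,           # reward/number.py
    "html": html_reward_fn,                 # reward/htmlcode.py
    "svg": svg_reward_fn,                   # reward/svgcode.py
    "ocr": ocr_reward_fn,                   # reward/ocr.py
}

def compute_reward(sample, completion):
    return REWARD_FNS[sample["category"]](completion, sample["answer"])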

Each task category below lists its data source and reward design, followed by the corresponding scoring logic.

STEM (source: ViRL39K): choice accuracy and math expression equivalence.
reward/math.py
python
def math_reward_fn(completions, answer):
    # extract_answer and is_equal are project-level helpers
    model_answer = extract_answer(completions, format_strict=True)
    gold_parsed = parse(answer, extraction_config=[
        StringExtractionConfig(), LatexExtractionConfig(), ExprExtractionConfig(),
    ])
    answer_parsed = parse(model_answer, extraction_config=[
        StringExtractionConfig(), LatexExtractionConfig(), ExprExtractionConfig(),
    ])
    correct = verify(answer_parsed, gold_parsed)
    if not correct:
        correct = is_equal(answer, model_answer)  # SymPy fallback
    return 1.0 if correct else 0.0
Grounding (sources: Ref-L4, VigoRL-SA): IoU between predicted and reference boxes; choice accuracy.
reward/bbox.py
python
def bbox_reward_fn(completions, answer):
    # predicted / gt: [x1, y1, x2, y2] boxes parsed from completions / answer (parsing elided here)
    xA = max(predicted[0], gt[0]); yA = max(predicted[1], gt[1])
    xB = min(predicted[2], gt[2]); yB = min(predicted[3], gt[3])
    inter = max(0, xB - xA) * max(0, yB - yA)
    areaA = (predicted[2] - predicted[0]) * (predicted[3] - predicted[1])
    areaB = (gt[2] - gt[0]) * (gt[3] - gt[1])
    iou = inter / float(areaA + areaB - inter)
    return iou  # ∈ [0, 1]

IoU = Intersection / (Area₁ + Area₂ − Intersection)

Spatial (source: VigoRL-SAT): choice accuracy.
reward/multiple_choice.py
python
def multiplechoice_reward_fn(completions, answer):
    predicted = extract_boxed_content(completions)[-1]
    predicted = predicted.strip().strip('.()')
    predicted = predicted[0].upper() if predicted else ""
    return 1 if predicted == answer.upper() else 0
Counting (source: PixmoCount): numeric token equivalence.
reward/number.py
python
def number_reward_fn(completions, answer):
    answer_str = extract_boxed_content(completions)[-1]
    match = re.findall(r"([0-9.]+)", answer_str)   # last numeric token in the boxed answer
    count = match[-1] if match else ""
    return float(count.strip() == answer.strip())
Coding (sources: WebCode2M, UniSVG): token/tag overlap; SVG rendering similarity in [0, 1].
reward/htmlcode.py
python
def html_reward_fn(completions, answer):
    gen, ref = completions, answer                  # generated vs reference HTML
    token_score = calculate_token_overlap(gen, ref)
    structure_score = calculate_tag_structure_similarity(gen, ref)
    reward = 0.6 * token_score + 0.4 * structure_score
    return max(0.0, min(1.0, reward))
reward/svgcode.py
python
def svg_reward_fn(completions, answer):
    gen, ref = completions, answer                  # generated vs reference SVG source
    # gen_png / ref_png: rasterized renders of gen and ref (rendering step elided here)
    token_score = calculate_token_overlap(gen, ref)
    structure_score = calculate_structure_similarity(gen, ref)
    image_score = calculate_image_similarity(gen_png, ref_png)  # SSIM
    reward = 0.5 * image_score + 0.25 * (token_score + structure_score)
    return reward  # ∈ [0, 1]

HTML: 0.6 × TokenJaccard + 0.4 × TagJaccard  |  SVG: 0.5 × SSIM + 0.25 × (Token + Tag)

OCR (source: InfoVQA): text similarity.
reward/ocr.py
python
def ocr_reward_fn(completions, answer):
    # gt: reference string from answer; det: text parsed from the completion (parsing elided here)
    dist = levenshtein_distance(gt, det)
    length = max(len(gt), len(det))
    reward = 1 - dist / length             # normalized edit similarity
    return reward if reward >= 0.5 else 0  # values below 0.5 are zeroed

Similarity = 1 − Levenshtein / max(len₁, len₂); values below 0.5 are set to 0.

Diagram (source: AI2D): choice accuracy.
Format Reward (cross-cutting): requires exactly one <think> block, at least one \boxed{}, and boxed content ≤ 20% of total length.
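A minimal sketch of this format check, following the three stated constraints (the regexes and the 0/1 scoring are assumptions of this sketch):

python
import re

def format_reward_fn(completion: str) -> float:
    """1.0 only if: exactly one <think> block, at least one \\boxed{}, boxed content <= 20% of length."""
    think_blocks = re.findall(r"<think>.*?</think>", completion, flags=re.DOTALL)
    boxed = re.findall(r"\\boxed\{([^{}]*)\}", completion)  # nested braces ignored, kept simple here
    if len(think_blocks) != 1 or not boxed:
        return 0.0
    boxed_len = sum(len(b) for b in boxed)
    return 1.0 if boxed_len <= 0.2 * len(completion) else 0.0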

Two-Stage Training Procedure

We use Group Relative Policy Optimization (GRPO) within the asynchronous AReaL framework. The two training stages and their configurations are summarized below.

Stage 1: Answer-only RL

Stabilizes task performance with concise answers (19.9K samples, ./data/stage1-normal).

Model base: LLaVA-OneVision-1.5-8B-Instruct
Data: ./data/stage1-normal (19.9K)
Prompt template: "Put ONLY your final answer within <answer></answer>."
Stage 2: Chain-of-Thought RL ✨

Unlocks deeper reasoning via explicit thinking prompts (49.2K samples, ./data/stage2-long).

Model init: Stage 1 checkpoint
Data: ./data/stage2-long (49.2K)
Prompt template: "Think and solve the following question step by step. Please put your thinking and analysis procedure within <think></think>. Put ONLY your final answer within <answer></answer>."

Extended Capability Analysis

Figure panels: Spatial Reasoning & Grounding (SAT test, SAT val, TreeBench, Ref-L4 IoU, Ref-L4 acc, RefCOCO IoU, RefCOCO acc) and Coding (WebCode short, Design2Code, UniSVG). Bars compare LLaVA-OV-1.5 8B, LLaVA-OV-1.5 RL (Thinking), LLaVA-OV-1.5 RL (Fast), and Qwen 2.5-VL; y-axis: score / accuracy (%).

Figure 5. Performance comparison of LLaVA-OV-1.5 and its RL variants on Spatial Reasoning & Grounding and Coding tasks.

Spatial & Grounding: RL “fast mode” significantly enhances fine-grained perception on SAT and Ref-L4 benchmarks.

Coding: “Thinking” mode achieves the highest scores on Design2Code and UniSVG, demonstrating chain-of-thought benefits for structural code generation.


Performance Results

Table 1. Performance comparison across vision-language models on various benchmarks, grouped by task type. All scores are accuracy percentages unless otherwise specified.

Benchmark | LLaVA-OV-1.5 8B | LLaVA-OV-1.5 RL 8B (thinking) | LLaVA-OV-1.5 RL 8B (fast)

General VQA
MMStar | 67.7 | 68.2 (↑0.5) | 68.3 (↑0.6)
MMBench (en) | 84.1 | 85.7 (↑1.6) | 85.7 (↑1.6)
MMBench (cn) | 81.0 | 84.2 (↑3.2) | 81.5 (↑0.5)
MME-RealWorld (en) | 61.7 | 63.4 (↑1.7) | 63.3 (↑1.6)
MME-RealWorld (cn) | 56.1 | 56.1 (↑0.0) | 56.3 (↑0.2)
SeedBench (image) | 77.3 | 76.7 | 77.6 (↑0.3)
CV-Bench | 80.7 | 82.9 (↑2.2) | 81.1 (↑0.4)
SEED-Bench-2-Plus | 69.2 | 69.5 (↑0.3) | 69.2 (↑0.0)
RealWorldQA | 68.1 | 68.4 (↑0.3) | 70.6 (↑2.5)
Avg. | 71.8 | 72.8 (↑1.0) | 72.6 (↑0.8)

Reasoning
MathVista (mini) | 69.6 | 72.3 (↑2.7) | 71.8 (↑2.2)
WeMath | 61.5 | 69.4 (↑7.9) | 60.8
MathVision | 25.6 | 34.4 (↑8.8) | 26.2 (↑0.6)
MMMU (val) | 55.4 | 58.8 (↑3.4) | 54.9
MMMU-Pro (standard) | 37.4 | 39.9 (↑2.5) | 38.0 (↑0.6)
MMMU-Pro (vision) | 25.2 | 35.7 (↑10.5) | 29.0 (↑3.8)
Avg. | 45.8 | 51.8 (↑6.0) | 46.8 (↑1.0)

OCR & Chart
ChartQA | 86.5 | 87.4 (↑0.9) | 87.0 (↑0.5)
CharXiv (DQ) | 70.9 | 68.4 | 71.2 (↑0.3)
DocVQA | 95.0 | 91.9 | 95.0 (↑0.0)
OCRBench | 82.9 | 81.7 | 82.3
AI2D (w/ M) | 84.2 | 83.7 | 84.3 (↑0.1)
AI2D (w/o M) | 94.1 | 93.7 | 93.9
InfoVQA | 78.4 | 76.6 | 78.7 (↑0.3)
Avg. | 84.6 | 83.3 | 84.6 (↑0.0)

Others
PixmoCount | 62.2 | 65.7 (↑3.5) | 71.1 (↑8.9)
CountBench | 88.2 | 86.8 | 88.6 (↑0.4)
VL-RewardBench | 47.7 | 44.0 | 49.7 (↑2.0)
V* | 78.0 | 79.1 (↑1.1) | 78.0 (↑0.0)
Avg. | 69.0 | 66.0 | 71.6 (↑2.6)

GRPO Algorithm & AReaL Async Framework

GRPO

GRPO eliminates the critic by sampling G = 16 completions per prompt and using group-normalized rewards as the baseline.

Objective: J(θ) = 𝔼[ min( r_t · Â_t , clip(r_t, 1 − ε, 1 + ε′) · Â_t ) · w_t ]
Ratio: r_t = π_θ(y_t | y_<t, x) / π_prox(y_t | y_<t, x)
Behavior weight: w_t = exp( log π_prox − log π_behave ), capped at 5.0
Advantage: Â(x, y_i) = ( r_i − μ_group ) / ( σ_group + ε )
Reward shaping: r′ = ( r_task − 0.5 ) × 10.0
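A minimal sketch of the reward shaping and group-normalized advantage above, assuming a group of G = 16 scalar task rewards for one prompt:

python
import torch

def group_advantages(task_rewards: torch.Tensor, bias: float = -0.5, scale: float = 10.0, eps: float = 1e-6):
    """task_rewards: shape [G], task rewards in [0, 1] for one prompt's sampled completions."""
    shaped = (task_rewards + bias) * scale                    # r' = (r_task - 0.5) * 10.0
    return (shaped - shaped.mean()) / (shaped.std() + eps)    # Â_i = (r_i - μ_group) / (σ_group + ε)

Because the normalization subtracts the group mean and divides by the group standard deviation, the shift and scale in the shaping step leave the resulting advantages essentially unchanged; the shaping mainly keeps raw rewards on a convenient scale.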
GRPO loss implementation (utils/functional.py):
ppo_actor_loss_fn
python
# utils/functional.py — ppo_actor_loss_fn
ratio = torch.exp(logprobs - proximal_logprobs)
clipped_ratio = torch.clamp(ratio, 1.0 - eps_clip, 1.0 + eps_clip_higher)
pg_loss = torch.max(-advantages * ratio, -advantages * clipped_ratio)

# Behavior importance weight (off-policy correction)
behav_kl = proximal_logprobs - old_logprobs
behav_imp_weight = torch.clamp(behav_kl.exp(), max=behav_imp_weight_cap)
pg_loss = pg_loss * behav_imp_weight

AReaL Async Training

AReaL decouples rollout from gradient computation, achieving a 2.77× training-throughput speedup by eliminating GPU idle time.

Diagram: the rollout engine (SGLang / vLLM on 4 GPUs) streams batches with log-probabilities to the FSDP actor (4 GPUs) for gradient updates; overlapping rollout and training steps yields the 2.77× speedup.
Decoupled PPO: three policies are involved: π_behave (rollout), π_prox (recomputed), and π_θ (current). The clipped ratio uses π_θ / π_prox.
Staleness control: max_head_offpolicyness η = 4; rollout samples may lag the current policy by at most 4 gradient steps.
Behavior weight: w = exp( log π_prox − log π_behave ), capped at 5.0 to prevent gradient explosion.
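A hedged sketch of the staleness gate implied by η = 4 (the function and variable names are assumptions; AReaL's actual check lives inside its rollout controller):

python
MAX_HEAD_OFFPOLICYNESS = 4  # η

def is_fresh_enough(sample_policy_version: int, current_policy_version: int) -> bool:
    """Admit a rollout sample only if it lags the current policy by at most η gradient steps."""
    return (current_policy_version - sample_policy_version) <= MAX_HEAD_OFFPOLICYNESS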
AReaL training loop (trains/grpo.py):
async_training_loop
python
# trains/grpo.py — main training loop
for global_step in range(start_step, max_steps):
    batch = rollout.prepare_batch(train_dataloader, workflow=workflow)  # async rollout samples
    batch["prox_logp"] = actor.compute_logp(batch)  # recompute log-probs under the proximal policy
    actor.compute_advantages(batch)                 # group-normalized advantages
    actor.ppo_update(batch)                         # clipped policy-gradient update

    rollout.pause()                                 # sync updated weights to the rollout engine
    actor.update_weights(weight_update_meta)
    rollout.set_version(global_step + 1)            # version tag used for staleness control
    rollout.resume()

Training Configuration

eps_clip: 0.2 / 0.28 (asymmetric)
kl_ctl: 0.0 (disabled)
reward_scaling: 10.0
reward_bias: −0.5
group_size: 16
max_new_tokens: 4096
temperature: 1.0
learning_rate: 2e-6
epochs: 30
batch_size: 32
offpolicyness (η): 4
behav_weight_cap: 5.0
dtype: bfloat16
allocation: d4p1t1 + d4p1t1

Acknowledgements

We thank the following projects and frameworks:

  • AReaL: Lightning-Fast RL for LLM Reasoning and Agents
  • sglang: Fast serving framework for LLMs and vision language models
  • lmms-eval: Standardized evaluation framework
  • LLaVA: Large Language-and-Vision Assistant
  • LLaVA-NeXT: Next-generation multi-modal assistant

Citation

citation.bib
bibtex
@inproceedings{LLaVA-OneVision-1.5,
  title={LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training},
  author={An, Xiang and Xie, Yin and Yang, Kaicheng and Zhang, Wenkang and Zhao, Xiuwei and Cheng, Zheng and Wang, Yirui and Xu, Songcen and Chen, Changrui and Wu, Chunsheng and Tan, Huajie and Li, Chunyuan and Yang, Jing and Yu, Jie and Wang, Xiyao and Qin, Bin and Wang, Yumeng and Yan, Zizhen and Feng, Ziyong and Liu, Ziwei and Li, Bo and Deng, Jiankang},
  booktitle={arXiv},
  year={2025}
}

@inproceedings{xie2025region,
  title={Region-based Cluster Discrimination for Visual Representation Learning},
  author={Xie, Yin and Yang, Kaicheng and An, Xiang and Wu, Kun and Zhao, Yongle and Deng, Weimo and Ran, Zimin and Wang, Yumeng and Feng, Ziyong and Miles, Roy and Elezi, Ismail and Deng, Jiankang},
  booktitle={ICCV},
  year={2025}
}

@article{lillava,
  title={LLaVA-OneVision: Easy Visual Task Transfer},
  author={Li, Bo and Zhang, Yuanhan and Guo, Dong and Zhang, Renrui and Li, Feng and Zhang, Hao and Zhang, Kaichen and Zhang, Peiyuan and Li, Yanwei and Liu, Ziwei and Li, Chunyuan},
  journal={Transactions on Machine Learning Research},
  year={2024}
}