LLaVA-OneVision-1.5-RL: Unlocking Multimodal Reasoning via Lightweight Reinforcement Learning
Project led by Changrui Chen and Jiankang Deng
Unlocking multimodal reasoning via lightweight reinforcement learning!
Overview
LLaVA-OneVision-1.5-RL adds an RL post-training stage that uses 67K curated examples, selected with a discrepancy-based criterion, to elicit explicit chain-of-thought reasoning. It achieves significant gains on STEM, coding, and reasoning benchmarks while preserving general visual understanding.
Our contributions are threefold:
(1) Discrepancy-Driven Data Curation. We identify tasks where a large gap exists between Pass@N and Pass@1, targeting “latent capability” rather than knowledge injection.
(2) Rule-Based Reward System. We employ domain-specific verification rules rather than learned preference models, enabling precise feedback across STEM, grounding, spatial reasoning, counting, coding, OCR, and diagram tasks.
(3) Two-Stage Curriculum Training. We design a training curriculum that first stabilizes concise task performance with answer-only RL, then unlocks deeper reasoning through chain-of-thought RL.
Distribution of task categories in the RL training data. (a) Total RL corpus (67K instances). (b) Stage 1: Answer-only training. (c) Stage 2: Chain-of-thought training.
RL Data Strategy
Discrepancy-Driven Selection
If a model can solve a task given enough attempts (high Pass@N) but rarely gets it right on the first try (low Pass@1), it already has the latent capability — it just needs to learn to use it reliably. We select tasks with this gap for RL training, filtering out tasks that are too easy (high Pass@1, nothing to learn) or too hard (low Pass@N, beyond current capability).
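As a rough sketch of this selection rule (the threshold values below are illustrative placeholders, not the values used to build the 67K corpus):

```python
def select_rl_tasks(task_stats, pass1_max=0.7, passN_min=0.3, min_gap=0.2):
    """Keep tasks with a large Pass@N vs. Pass@1 discrepancy.

    task_stats maps task_id -> {"pass@1": float, "pass@N": float};
    the thresholds are illustrative, not the released configuration.
    """
    selected = []
    for task_id, stats in task_stats.items():
        p1, pn = stats["pass@1"], stats["pass@N"]
        if p1 >= pass1_max:      # too easy: already reliable on the first try
            continue
        if pn <= passN_min:      # too hard: beyond current capability
            continue
        if pn - p1 >= min_gap:   # latent capability: solvable, but not reliably
            selected.append(task_id)
    return selected
```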
Reward-Based Sampling
Multiple candidate responses are filtered by average reward scores to exclude trivial and unsolvable cases, focusing on medium-difficulty instances that provide optimal learning signals.
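A minimal sketch of this filter, assuming a per-instance reward function and a hypothetical `sample_responses` helper (the sampling budget and thresholds are placeholders):

```python
def filter_by_mean_reward(instances, reward_fn, sample_responses,
                          n_samples=8, low=0.1, high=0.9):
    """Keep medium-difficulty instances based on the average reward of sampled responses.

    reward_fn(instance, response) -> float in [0, 1]; sample_responses is a
    hypothetical generation helper. All numeric values are placeholders.
    """
    kept = []
    for instance in instances:
        responses = sample_responses(instance, n=n_samples)
        mean_reward = sum(reward_fn(instance, r) for r in responses) / n_samples
        if low < mean_reward < high:  # drop trivial and unsolvable cases
            kept.append(instance)
    return kept
```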
Reward System Architecture
Our RL setup employs a rule-based reward paradigm, where rewards are derived directly from task outcomes rather than learned preference models. Since different answer types require distinct verification strategies, we design answer-type-specific scoring rules via the reward/ module.
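As a hedged illustration of what such rules look like (the released implementations live in the reward/ module), the grounding and OCR rewards listed in the table below can be expressed roughly as:

```python
def levenshtein(a: str, b: str) -> int:
    """Standard edit distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[len(b)]

def grounding_reward(pred_box, ref_box):
    """IoU = Intersection / (Area1 + Area2 - Intersection), boxes as (x1, y1, x2, y2)."""
    ix1, iy1 = max(pred_box[0], ref_box[0]), max(pred_box[1], ref_box[1])
    ix2, iy2 = min(pred_box[2], ref_box[2]), min(pred_box[3], ref_box[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area1 = (pred_box[2] - pred_box[0]) * (pred_box[3] - pred_box[1])
    area2 = (ref_box[2] - ref_box[0]) * (ref_box[3] - ref_box[1])
    union = area1 + area2 - inter
    return inter / union if union > 0 else 0.0

def ocr_reward(pred_text, ref_text):
    """Similarity = 1 - Levenshtein / max(len1, len2); scores below 0.5 are zeroed
    (one reading of "clipped at 0.5")."""
    sim = 1.0 - levenshtein(pred_text, ref_text) / max(len(pred_text), len(ref_text), 1)
    return sim if sim >= 0.5 else 0.0
```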
| Category | Source | Reward Design Details |
|---|---|---|
| STEM | ViRL39K | Choice accuracy; math expression equivalence |
| Grounding | Ref-L4, VigoRL-SA | IoU between predicted and reference boxes; choice accuracy. IoU = Intersection / (Area₁ + Area₂ − Intersection) |
| Spatial | VigoRL-SAT | Choice accuracy |
| Counting | PixmoCount | Numeric token equivalence |
| Coding | WebCode2M, UniSVG | Token/tag overlap and SVG rendering similarity in [0, 1]. HTML: 0.6 × TokenJaccard + 0.4 × TagJaccard; SVG: 0.5 × SSIM + 0.25 × (Token + Tag) |
| OCR | InfoVQA | Text similarity. Similarity = 1 − (Levenshtein / max(len₁, len₂)), clipped at 0.5 |
| Diagram | AI2D | Choice accuracy |
A format check additionally requires a <think> block, at least one \boxed{}, and boxed content no longer than 20% of the total response length.
Two-Stage Training Procedure
Training Pipeline: We use Group Relative Policy Optimization (GRPO) within the asynchronous AReaL framework.
Answer-only RL
Stabilizes task performance with concise answers (19.9K samples, ./data/stage1-normal).
Prompt: "Put ONLY your final answer within <answer></answer>."
Chain-of-Thought RL ✨
Unlocks deeper reasoning via explicit thinking prompts (49.2K samples, ./data/stage2-long).
Prompt: "Think and solve the following question step by step. Please put your thinking and analysis procedure within <think></think>. Put ONLY your final answer within <answer></answer>."
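For illustration, a sketch of how responses to these prompts might be parsed and format-checked (the regexes and the reading of the 20% constraint are assumptions, not the released code):

```python
import re

def extract_answer(response: str):
    """Return the content of the <answer>...</answer> block, or None if absent."""
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    return match.group(1).strip() if match else None

def passes_cot_format(response: str) -> bool:
    """Chain-of-thought format check (sketch): require a <think> block, at least one
    \\boxed{}, and boxed content totaling no more than 20% of the response length."""
    has_think = re.search(r"<think>.*?</think>", response, re.DOTALL) is not None
    boxed = re.findall(r"\\boxed\{([^{}]*)\}", response)
    if not has_think or not boxed:
        return False
    return sum(len(b) for b in boxed) <= 0.2 * len(response)
```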
Extended Capability Analysis
Figure 5: Performance comparison of LLaVA-OV-1.5 and its RL version on Spatial Reasoning & Grounding (IoU and accuracy on val/test splits) and Coding tasks.
- Spatial & Grounding: the RL “fast” mode significantly enhances fine-grained perception on the SAT and Ref-L4 benchmarks.
- Coding: the “thinking” mode achieves the highest scores on Design2Code and UniSVG, demonstrating the benefit of chain-of-thought for structural code generation.
Performance Results
| Task | Benchmark | LLaVA-OV-1.5 8B | LLaVA-OV-1.5 RL 8B (thinking) | LLaVA-OV-1.5 RL 8B (fast) |
|---|---|---|---|---|
| General VQA | MMStar | 67.7 | 68.2 ↑0.5 | 68.3 ↑0.6 |
| | MMBench-EN | 84.1 | 85.7 ↑1.6 | 85.7 ↑1.6 |
| | MMBench-CN | 81.0 | 84.2 ↑3.2 | 81.5 ↑0.5 |
| | MME-RealWorld-EN | 61.7 | 63.4 ↑1.7 | 63.3 ↑1.6 |
| | MME-RealWorld-CN | 56.1 | 56.1 ↑0.0 | 56.3 ↑0.2 |
| | SeedBench (image) | 77.3 | 76.7 | 77.6 ↑0.3 |
| | CV-Bench | 80.7 | 82.9 ↑2.2 | 81.1 ↑0.4 |
| | SEED-Bench-2-Plus | 69.2 | 69.5 ↑0.3 | 69.2 ↑0.0 |
| | RealWorldQA | 68.1 | 68.4 ↑0.3 | 70.6 ↑2.5 |
| | Avg. | 71.8 | 72.8 ↑1.0 | 72.6 ↑0.8 |
| Reasoning | MathVista (mini) | 69.6 | 72.3 ↑2.7 | 71.8 ↑2.2 |
| | WeMath | 61.5 | 69.4 ↑7.9 | 60.8 |
| | MathVision | 25.6 | 34.4 ↑8.8 | 26.2 ↑0.6 |
| | MMMU (val) | 55.4 | 58.8 ↑3.4 | 54.9 |
| | MMMU-Pro (standard) | 37.4 | 39.9 ↑2.5 | 38.0 ↑0.6 |
| | MMMU-Pro (vision) | 25.2 | 35.7 ↑10.5 | 29.0 ↑3.8 |
| | Avg. | 45.8 | 51.8 ↑6.0 | 46.8 ↑1.0 |
| OCR & Chart | ChartQA | 86.5 | 87.4 ↑0.9 | 87.0 ↑0.5 |
| | CharXiv (DQ) | 70.9 | 68.4 | 71.2 ↑0.3 |
| | DocVQA | 95.0 | 91.9 | 95.0 ↑0.0 |
| | OCRBench | 82.9 | 81.7 | 82.3 |
| | AI2D (w/ mask) | 84.2 | 83.7 | 84.3 ↑0.1 |
| | AI2D (w/o mask) | 94.1 | 93.7 | 93.9 |
| | InfoVQA | 78.4 | 76.6 | 78.7 ↑0.3 |
| | Avg. | 84.6 | 83.3 | 84.6 ↑0.0 |
| Others | PixmoCount | 62.2 | 65.7 ↑3.5 | 71.1 ↑8.9 |
| | CountBench | 88.2 | 86.8 | 88.6 ↑0.4 |
| | VL-RewardBench | 47.7 | 44.0 | 49.7 ↑2.0 |
| | V* | 78.0 | 79.1 ↑1.1 | 78.0 ↑0.0 |
| | Avg. | 69.0 | 66.0 | 71.6 ↑2.6 |
GRPO Algorithm & AReaL Async Framework
GRPO
GRPO eliminates the critic by sampling G = 16 completions per prompt and using group-normalized rewards as the baseline.
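A minimal sketch of the group-normalized baseline, assuming scalar rewards have already been computed for each of the G completions (the common GRPO formulation also divides by the group standard deviation):

```python
import torch

def group_normalized_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (num_prompts, G) tensor, G = 16 completions per prompt.
    Each completion's advantage is its reward centered by the group mean and
    scaled by the group standard deviation."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)
```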
GRPO loss implementation (utils/functional.py):

```python
# utils/functional.py — ppo_actor_loss_fn
# Clipped policy-gradient loss over group-normalized advantages.
ratio = torch.exp(logprobs - proximal_logprobs)
clipped_ratio = torch.clamp(ratio, 1.0 - eps_clip, 1.0 + eps_clip_higher)
pg_loss = torch.max(-advantages * ratio, -advantages * clipped_ratio)

# Behavior importance weight (off-policy correction for stale rollout samples)
behav_kl = proximal_logprobs - old_logprobs
behav_imp_weight = torch.clamp(behav_kl.exp(), max=behav_imp_weight_cap)
pg_loss = pg_loss * behav_imp_weight
```

AReaL Async Training
AReaL decouples rollout from gradient computation, delivering 2.77× higher throughput by eliminating GPU idle time. With max_head_offpolicyness η = 4, rollout samples are restricted to at most 4 gradient steps behind the current policy; a sketch of this staleness check follows.
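A hedged sketch of that constraint (the function and argument names are assumptions, not AReaL's API):

```python
def accept_rollout_sample(sample_policy_version: int, current_policy_version: int,
                          max_head_offpolicyness: int = 4) -> bool:
    """Accept a rollout sample only if the policy that generated it is at most
    max_head_offpolicyness gradient steps behind the current policy."""
    return current_policy_version - sample_policy_version <= max_head_offpolicyness
```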
AReaL training loop (trains/grpo.py):

```python
# trains/grpo.py — main training loop
for global_step in range(start_step, max_steps):
    # Collect a rollout batch (generated asynchronously, possibly by slightly stale policies)
    batch = rollout.prepare_batch(train_dataloader, workflow=workflow)
    batch["prox_logp"] = actor.compute_logp(batch)
    actor.compute_advantages(batch)
    actor.ppo_update(batch)
    # Pause generation, push updated weights, bump the policy version, then resume
    rollout.pause()
    actor.update_weights(weight_update_meta)
    rollout.set_version(global_step + 1)
    rollout.resume()
```

Training Configuration
Acknowledgements
We thank the following projects and frameworks:
- AReaL: Lightning-Fast RL for LLM Reasoning and Agents
- sglang: Fast serving framework for LLMs and vision language models
- lmms-eval: Standardized evaluation framework
- LLaVA: Large Language-and-Vision Assistant
- LLaVA-NeXT: Next-generation multi-modal assistant
Open-Source Resources
Complete LLaVA-OneVision-1.5-RL resources for the community.
- Code & Paper
- Model Checkpoints
Citation
```bibtex
@inproceedings{LLaVA-OneVision-1.5,
title={LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training},
author={An, Xiang and Xie, Yin and Yang, Kaicheng and Zhang, Wenkang and Zhao, Xiuwei and Cheng, Zheng and Wang, Yirui and Xu, Songcen and Chen, Changrui and Wu, Chunsheng and Tan, Huajie and Li, Chunyuan and Yang, Jing and Yu, Jie and Wang, Xiyao and Qin, Bin and Wang, Yumeng and Yan, Zizhen and Feng, Ziyong and Liu, Ziwei and Li, Bo and Deng, Jiankang},
booktitle={arXiv},
year={2025}
}
@inproceedings{xie2025region,
title={Region-based Cluster Discrimination for Visual Representation Learning},
author={Xie, Yin and Yang, Kaicheng and An, Xiang and Wu, Kun and Zhao, Yongle and Deng, Weimo and Ran, Zimin and Wang, Yumeng and Feng, Ziyong and Miles, Roy and Elezi, Ismail and Deng, Jiankang},
booktitle={ICCV},
year={2025}
}
@article{lillava,
title={LLaVA-OneVision: Easy Visual Task Transfer},
author={Li, Bo and Zhang, Yuanhan and Guo, Dong and Zhang, Renrui and Li, Feng and Zhang, Hao and Zhang, Kaichen and Zhang, Peiyuan and Li, Yanwei and Liu, Ziwei and Li, Chunyuan},
journal={Transactions on Machine Learning Research},
year={2024}
}
```