LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training
High performance, low cost, and strong reproducibility.
LLaVA established a low-cost open-source path for connecting vision encoders with large language models, and later versions steadily expanded toward OCR, charts, documents, multi-image reasoning, and video. LLaVA-OneVision consolidates that line into a unified interface across images, charts, documents, multi-image inputs, and video.
The remaining gap in open multimodal systems is often not architecture, but recipe transparency. Strong models such as Qwen2.5-VL show excellent results, yet full data composition, cleaning, sampling, and training schedules are rarely disclosed end-to-end. Our focus is to close that reproducibility gap rather than only release weights.
| Benchmark | LLaVA-OV-1.5 8B | Qwen2.5-VL 7B | LLaVA-OV-1.5 4B | LLaVA-OV-1.5 3B | Qwen2.5-VL 3B |
|---|---|---|---|---|---|
| MMStar | 67.7 | 62.5 | 64.9 | 59.1 | 55.9 |
| MMBench (EN) | 84.1 | 83.4 | 84.2 | 81.0 | 78.0 |
| MMBench (CN) | 81.0 | 81.6 | 76.9 | 73.0 | 74.6 |
| MME-RealWorld (EN) | 62.3 | 57.3 | 49.6 | 57.9 | 51.6 |
| MME-RealWorld (CN) | 56.1 | 51.5 | 61.6 | 23.4 | 45.4 |
| SeedBench (image) | 77.3 | 77.5 | 76.6 | 71.3 | 74.8 |
| CV-Bench | 80.8 | 80.0 | 77.2 | 73.8 | 71.5 |
| ScienceQA | 95.0 | 88.8 | 93.6 | 91.2 | 83.3 |
| SEED-Bench-2-Plus | 69.2 | 70.9 | 68.9 | 67.6 | 68.6 |
| RealWorldQA | 68.1 | 68.5 | 67.8 | 66.8 | 60.0 |
| Avg. | 74.2 | 72.2 | 72.1 | 66.5 | 66.4 |
| MathVista (mini) | 69.6 | 68.6 | 67.9 | 64.7 | 60.2 |
| WeMath | 33.6 | 33.3 | 24.9 | 22.6 | 18.4 |
| MathVision | 25.6 | 22.4 | 24.2 | 19.9 | 21.3 |
| MMMU (val) | 55.4 | 51.3 | 52.7 | 45.5 | 46.4 |
| MMMU-Pro (standard) | 37.4 | 36.3 | 35.3 | 29.5 | 31.1 |
| MMMU-Pro (vision) | 25.2 | 32.8 | 25.4 | 20.3 | 21.3 |
| Avg. | 41.1 | 40.8 | 38.4 | 33.7 | 33.1 |
| ChartQA | 86.5 | 84.1 | 87.1 | 84.4 | 83.4 |
| CharXiv (DQ) | 74.1 | 69.8 | 63.8 | 61.8 | 58.2 |
| DocVQA | 95.0 | 94.9 | 94.4 | 93.4 | 92.7 |
| OCRBench | 82.9 | 84.2 | 80.0 | 80.5 | 79.2 |
| AI2D (w/ M) | 84.2 | 82.6 | 83.6 | 82.3 | 78.6 |
| AI2D (w/o M) | 94.1 | 93.4 | 93.3 | 91.9 | 90.7 |
| InfoVQA | 78.4 | 81.7 | 76.1 | 71.2 | 75.6 |
| Avg. | 85.0 | 84.4 | 82.6 | 81.0 | 79.8 |
| PixmoCount | 62.2 | 63.3 | 52.2 | 57.0 | 50.9 |
| CountBench | 88.2 | 86.4 | 79.8 | 49.5 | 72.5 |
| VL-RewardBench | 46.7 | 49.7 | 48.2 | 42.5 | 42.1 |
| V* | 78.0 | 77.0 | 74.9 | 67.5 | 69.6 |
| Avg. | 68.8 | 69.1 | 63.8 | 54.1 | 58.8 |
We release a fully open, concept-balanced 85M pretraining set and a curated 22M instruction set on top of the LLaVA-OneVision framework. The recipe keeps a compact three-stage pipeline, combines offline data packing with Megatron-LM and a distributed optimizer, and trains an 8B model from pretraining to SFT in roughly 168 wall-clock hours.
LLaVA-OneVision-1.5 extends the series with RICE-ViT for native-resolution perception, stronger chart/document understanding, and a quality-first data recipe. Beyond checkpoints, we also release the data, packing pipeline, configs, logs, and evaluation commands needed for low-cost reproduction (see the technical report).
Pretraining Dataset (85M) and Concept Balancing
The 85M pretraining corpus mixes eight sources, covering roughly 20M Chinese and 65M English image–text pairs. To reduce long-tail sparsity, we use concept balancing with a shared image–concept embedding space and inverse-frequency resampling, then add high-quality bilingual caption augmentation. The result is better data coverage, stronger rare-concept recall, and more stable multimodal transfer.
Figure: Balanced mid-training consistently outperforms random sampling (score improvement = balanced − random).
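The inverse-frequency resampling step described above can be sketched in a few lines. This is a minimal illustration, not the released pipeline: it assumes each sample has already been assigned a list of concepts (for example by nearest-neighbor lookup in the shared image-concept embedding space), and the names `inverse_frequency_weights`, `resample`, and `smoothing` are hypothetical.

```python
import random
from collections import Counter


def inverse_frequency_weights(sample_concepts, smoothing=1.0):
    """Return one sampling weight per sample, normalized to sum to 1.

    sample_concepts: list of lists, the concepts assigned to each sample.
    A sample's raw weight is the mean of 1 / (count + smoothing) over its
    concepts, so samples that carry rare concepts are drawn more often.
    """
    counts = Counter(c for concepts in sample_concepts for c in concepts)
    raw = []
    for concepts in sample_concepts:
        if not concepts:
            raw.append(0.0)  # samples without any concept are never drawn
            continue
        inv = sum(1.0 / (counts[c] + smoothing) for c in concepts)
        raw.append(inv / len(concepts))
    total = sum(raw)
    if total == 0:
        raise ValueError("no sample has any concept assigned")
    return [w / total for w in raw]


def resample(samples, weights, k, seed=0):
    """Draw k samples with replacement according to the given weights."""
    rng = random.Random(seed)
    return rng.choices(samples, weights=weights, k=k)


# Toy corpus: three "dog" samples and one rare "axolotl" sample.
sample_concepts = [["dog"], ["dog"], ["dog"], ["axolotl"]]
weights = inverse_frequency_weights(sample_concepts, smoothing=0.0)
balanced = resample(["d0", "d1", "d2", "a0"], weights, k=1000)
```

In the toy corpus the single rare sample receives half of the total sampling mass, so each concept is drawn about equally often even though the raw counts are 3 to 1.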
Instruction Dataset (22M)
The 22M instruction dataset spans captioning, charts and tables, code and math, grounding, OCR, science, and general VQA. Through aggregation, format standardization, rewriting, bilingual conversion, and safety filtering, we keep category balance and reduce template homogeneity. Adding FineVision further improves downstream performance.
Method
1) Visual Encoder Pretraining
To improve OCR, document understanding, and region-level perception, LLaVA-OneVision-1.5 adopts our in-house MVT v1.5 (RICE-ViT) backbone.
Unlike CLIP/SigLIP-style encoders, which mainly emphasize global alignment, RICE-ViT is designed to preserve stronger local semantics through Region Cluster Discrimination:
- it is trained on 450M images and 2.4B candidate regions
- it explicitly models local entities, text blocks, and surrounding context through region–cluster discrimination with region-aware attention
- it uses 2D rotary position encoding (2D RoPE) to support native multi-resolution inputs
Rather than stacking several specialized losses, we use a unified clustering–discrimination objective that improves semantic understanding, OCR, and localization together. We then connect the visual backbone to the language model with a lightweight projector and full-parameter joint training, avoiding redundant adapter layers while keeping the pipeline simple.
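To make the 2D RoPE bullet above concrete, here is a small pure-Python sketch of one common formulation: the first half of the feature channels is rotated by the patch row index and the second half by the column index. Whether RICE-ViT splits channels exactly this way is an assumption; the function names and the `base` value are illustrative only.

```python
import math


def rope_1d(vec, position, base=10000.0):
    """Rotate each pair (vec[i], vec[i + half]) by position * base**(-i / half)."""
    half = len(vec) // 2
    out = [0.0] * len(vec)
    for i in range(half):
        angle = position * base ** (-i / half)
        c, s = math.cos(angle), math.sin(angle)
        a, b = vec[i], vec[i + half]
        out[i] = a * c - b * s
        out[i + half] = a * s + b * c
    return out


def rope_2d(vec, row, col):
    """Encode the row index in the first half of vec and the column index in the second."""
    if len(vec) % 4 != 0:
        raise ValueError("feature dimension must be divisible by 4")
    half = len(vec) // 2
    return rope_1d(vec[:half], row) + rope_1d(vec[half:], col)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


q = [0.5, -1.0, 2.0, 0.25, 1.5, -0.75, 0.1, 0.9]
k = [1.0, 0.3, -0.6, 0.8, -0.2, 0.4, 1.1, -1.3]

# Same relative offset (rows: -2, cols: +4) at two different absolute locations.
score_a = dot(rope_2d(q, 2, 5), rope_2d(k, 4, 1))
score_b = dot(rope_2d(q, 12, 25), rope_2d(k, 14, 21))
```

The two attention scores agree because the rotation makes the query-key product depend only on the relative row and column offsets. Nothing in the encoding refers to a fixed grid size, which is the property that lets it serve inputs of different resolutions.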
Columns InfoVQA through OCR Avg measure OCR & document understanding; columns AI2D through Other Avg measure general vision understanding.

| Method | Vision Tower | InfoVQA | DocVQA | ChartQA | TextVQA | OCRBench | OCRBenchV2 | LiveXivVQA | OCR Avg | AI2D | MMB (EN) | MME (Cog) | MME (Per) | POPE | RealworldQA | MMStar | Other Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CLIP | ViT-L-14-336px | 38.9 | 75.2 | 66.5 | 62.5 | 52.5 | 23.0 | 47.4 | 52.3 | 73.2 | 74.6 | 48.0 | 75.6 | 88.8 | 63.7 | 49.0 | 67.6 |
| MLCD | ViT-L-14-336px | 43.5 | 76.5 | 67.8 | 61.7 | 53.1 | 24.0 | 48.4 | 53.6 | 77.0 | 76.4 | 54.1 | 79.9 | 88.7 | 61.1 | 51.0 | 69.7 |
| AIMv2 | ViT-L-14-336px | 35.4 | 77.2 | 72.7 | 65.9 | 57.2 | 23.9 | 47.3 | 54.2 | 75.4 | 78.6 | 48.3 | 75.0 | 88.4 | 62.2 | 50.2 | 68.3 |
| RICE-ViT | ViT-L-14-336px | 45.2 | 79.2 | 72.3 | 65.9 | 57.5 | 24.1 | 48.9 | 56.2 | 77.9 | 76.6 | 54.6 | 80.7 | 88.5 | 63.1 | 51.8 | 70.5 |
| DFN5B | ViT-H-14-378px | 38.6 | 70.9 | 64.4 | 59.4 | 47.3 | 21.9 | 46.2 | 49.8 | 73.5 | 73.4 | 45.8 | 76.9 | 88.6 | 59.9 | 49.1 | 66.7 |
| SigLIP | ViT-SO400M-14-384px | 41.4 | 76.7 | 69.3 | 64.7 | 55.4 | 24.0 | 48.4 | 54.3 | 76.2 | 77.0 | 46.1 | 79.9 | 88.8 | 63.7 | 47.3 | 68.4 |
| SigLIPv2 | ViT-SO400M-14-384px | 43.7 | 79.1 | 70.2 | 66.2 | 58.7 | 25.4 | 48.6 | 56.0 | 77.0 | 77.1 | 46.6 | 80.4 | 89.3 | 63.4 | 52.8 | 69.5 |
| RICE-ViT | ViT-L-14-378px | 48.1 | 82.6 | 75.1 | 66.2 | 58.8 | 25.8 | 49.5 | 58.0 | 76.5 | 77.6 | 54.1 | 79.0 | 89.1 | 62.9 | 51.2 | 70.1 |
| SigLIPv2 | ViT-SO400M-16-560px | 50.2 | 86.2 | 77.4 | 70.2 | 62.7 | 26.5 | 52.9 | 60.9 | 77.0 | 76.5 | 53.5 | 79.9 | 89.3 | 68.2 | 53.1 | 71.1 |
| RICE-ViT | ViT-L-14-560px | 53.2 | 87.4 | 78.1 | 69.0 | 60.7 | 26.1 | 53.0 | 61.1 | 76.9 | 78.6 | 56.3 | 79.3 | 88.9 | 65.1 | 50.5 | 70.8 |
| Qwen-ViT from Qwen2.5-VL 7B | ViT-H-14-560px | 55.9 | 85.8 | 78.8 | 73.7 | 66.2 | 26.8 | 53.4 | 62.9 | 78.8 | 78.4 | 62.0 | 80.8 | 88.6 | 64.2 | 55.0 | 72.5 |
| RICE-ViT from OV-1.5 3B | ViT-L-14-560px | 53.7 | 87.1 | 81.9 | 73.8 | 73.3 | 30.4 | 53.6 | 64.8 | 80.3 | 79.6 | 58.6 | 82.2 | 89.0 | 67.3 | 56.6 | 73.4 |
2) Three-Stage Learning Pipeline
- Stage 1, language-image alignment: train the projector on LLaVA-1.5 558K to map visual features into the LLM token space.
- Stage 2, mid-stage pretraining: use full-parameter training on the concept-balanced 85M corpus to inject broad visual semantics and knowledge.
- Stage 3, visual instruction alignment: continue full-parameter training on the 22M instruction set plus extra visual instruction data such as FineVision.
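The three stages can be summarized as a small configuration table. The sketch below is illustrative only: the module-group names (`vision_encoder`, `projector`, `llm`) are hypothetical labels, and it assumes that everything except the projector is frozen in the first stage, which the text implies but does not state outright.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    data: str
    trainable: tuple  # module groups that receive gradient updates


ALL_MODULES = ("vision_encoder", "projector", "llm")

PIPELINE = (
    Stage("language-image alignment", "LLaVA-1.5 558K", ("projector",)),
    Stage("mid-stage pretraining", "85M concept-balanced corpus", ALL_MODULES),
    Stage("visual instruction alignment", "22M instruction set + FineVision", ALL_MODULES),
)


def frozen_modules(stage):
    """Module groups that stay fixed during the given stage."""
    return tuple(m for m in ALL_MODULES if m not in stage.trainable)


for stage in PIPELINE:
    print(f"{stage.name}: train={stage.trainable}, frozen={frozen_modules(stage)}")
```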
3) Offline Parallel Data Packing
To reduce padding waste and improve token utilization, we use offline parallel packing:
- hash-bucket clustering by sample length or length ranges to cut global sorting/scanning costs
- multithreaded concatenation of multiple short samples into fixed-length sequences close to the target length during data prep
The pipeline is deterministic, reproducible, and avoids the runtime overhead of online dynamic packing. On the 85M set, it reaches up to ~11× effective padding compression.
Figure: From uneven raw samples to dense fixed-length packed sequences.
Instead of packing inside the dataloader with a small rolling buffer, we first compute token lengths for the whole shard, store the packing state offline, and then run staged bin-packing over global statistics so every sequence is filled as close as possible to the target budget before training even begins.
Online packing inside the dataloader has several limits:
- the dataloader only sees a small rolling buffer, so it cannot search globally for better complements
- small buffers leave many near-fit candidates invisible, which turns into leftover slack in each max-length box
- growing the online buffer improves fit quality, but raises host-RAM pressure and preprocessing latency
- because packing happens inside training, every step still pays search, merge, and scheduling overhead
Offline packing over the whole shard avoids them:
- the whole shard is indexed by token length first, so exact and near-exact complements are easy to retrieve
- long samples are treated as seeds, then short and medium samples are packed around them to close the remaining gap
- the algorithm can switch between diversity-first filling and greedy completion when a bucket starts to deadlock
- once the best arrangement is found, it is serialized once and reused directly during training with no extra packing cost
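The seed-and-fill idea can be illustrated with a short greedy sketch. This is not the released packing code: it covers only the length index and the greedy completion path, it omits the diversity-first mode and multithreading, and `pack_offline` and `utilization` are hypothetical names.

```python
import bisect
from collections import defaultdict


def pack_offline(lengths, target_len):
    """Group sample indices into bins whose token totals never exceed target_len."""
    by_length = defaultdict(list)  # token length -> indices of unpacked samples
    for idx, n in enumerate(lengths):
        if n > target_len:
            raise ValueError(f"sample {idx} is longer than target_len")
        by_length[n].append(idx)
    available = sorted(by_length)  # distinct lengths still unpacked, ascending

    def take(max_len):
        """Pop the longest unpacked sample with length <= max_len, or return None."""
        pos = bisect.bisect_right(available, max_len) - 1
        if pos < 0:
            return None
        n = available[pos]
        idx = by_length[n].pop()
        if not by_length[n]:
            available.pop(pos)
        return idx

    bins = []
    while available:
        seed = take(target_len)  # the longest remaining sample seeds a new bin
        current, used = [seed], lengths[seed]
        while True:
            nxt = take(target_len - used)  # best complement for the remaining gap
            if nxt is None:
                break
            current.append(nxt)
            used += lengths[nxt]
        bins.append(current)
    return bins


def utilization(bins, lengths, target_len):
    """Fraction of the fixed-length budget filled with real (non-padding) tokens."""
    real = sum(lengths[i] for b in bins for i in b)
    return real / (len(bins) * target_len)


lengths = [900, 100, 600, 400, 500, 500, 1000, 24]
bins = pack_offline(lengths, target_len=1024)
packed_util = utilization(bins, lengths, 1024)
unpacked_util = utilization([[i] for i in range(len(lengths))], lengths, 1024)
```

On this toy shard the eight samples fit into four sequences, and the share of real tokens rises from about 49% (one sample per padded sequence) to about 98%. Because the result depends only on the lengths, it can be computed once and stored alongside the shard.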
4) Hybrid Parallelism and Efficient Long-Context Training
We combine tensor parallelism (TP), pipeline parallelism (PP), and sequence/context parallelism with a distributed optimizer to improve utilization and memory efficiency at scale. Native-resolution training preserves details in charts, documents, and dense text regions.
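As a rough guide to how these parallel dimensions interact, the arithmetic below shows how a fixed GPU budget is divided. It is a sketch under assumptions: the GPU count, the parallel degrees, and the sequence length are made-up example values rather than the settings used for the released models, and the even split across context-parallel ranks ignores any load-balancing constraints a real framework may add.

```python
def data_parallel_size(world_size, tp, pp, cp=1):
    """Number of data-parallel replicas left after TP, PP, and CP groups are formed."""
    model_parallel = tp * pp * cp
    if world_size % model_parallel != 0:
        raise ValueError("world_size must be divisible by tp * pp * cp")
    return world_size // model_parallel


def tokens_per_cp_rank(seq_len, cp):
    """Each context-parallel rank holds an equal slice of every sequence."""
    if seq_len % cp != 0:
        raise ValueError("seq_len must be divisible by cp")
    return seq_len // cp


# Hypothetical example values, not the authors' configuration.
dp = data_parallel_size(world_size=128, tp=2, pp=4, cp=2)
tokens = tokens_per_cp_rank(seq_len=8192, cp=2)
print(f"data-parallel replicas: {dp}, tokens per CP rank: {tokens}")
```

Raising any of the model-parallel degrees lowers per-GPU memory but also reduces the number of data-parallel replicas, so the degrees have to be traded off against global batch size.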
Training Efficiency: From Zero to SOTA in ~168 Hours
Open-Source Resources
Models, datasets, training code, and demos are all openly released.
```python
from transformers import AutoProcessor, AutoModelForCausalLM
from qwen_vl_utils import process_vision_info

# Repo id written as in the evaluation command further down.
model_path = "lmms-lab/LLaVA-OneVision-1.5-8B-Instruct"

# Default: load the model on the available device(s).
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype="auto", device_map="auto", trust_remote_code=True
)

# Default processor.
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Prepare the inputs for inference.
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
# Move the inputs to the device that holds the model's first layers.
inputs = inputs.to(model.device)

# Generate, then strip the prompt tokens from each output sequence.
# The processor output is a mapping, so it must be unpacked with **.
generated_ids = model.generate(**inputs, max_new_tokens=1024)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```

```bash
# pip install git+https://github.com/EvolvingLMMs-Lab/lmms-eval.git
accelerate launch --num_processes=8 --main_process_port 12399 -m lmms_eval \
    --model=llava_onevision1_5 \
    --model_args=pretrained=lmms-lab/LLaVA-OneVision-1.5-8B-Instruct,attn_implementation=flash_attention_2,max_pixels=3240000 \
    --tasks=mmmu_val,mmmu_pro_standard,mmbench_en_test,mmerealworld,mmerealworld_cn,ai2d,ai2d_no_mask,vstar_bench,chartqa,charxiv,docvqa_test,mathvista_testmini,mmstar,scienceqa \
    --batch_size=1
```

Citation
```bibtex
@inproceedings{LLaVA-OneVision-1.5,
  title={LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training},
  author={An, Xiang and Xie, Yin and Yang, Kaicheng and Zhang, Wenkang and Zhao, Xiuwei and Cheng, Zheng and Wang, Yirui and Xu, Songcen and Chen, Changrui and Wu, Chunsheng and Tan, Huajie and Li, Chunyuan and Yang, Jing and Yu, Jie and Wang, Xiyao and Qin, Bin and Wang, Yumeng and Yan, Zizhen and Feng, Ziyong and Liu, Ziwei and Li, Bo and Deng, Jiankang},
  booktitle={arxiv},
  year={2025}
}

@article{lillava,
  title={LLaVA-OneVision: Easy Visual Task Transfer},
  author={Li, Bo and Zhang, Yuanhan and Guo, Dong and Zhang, Renrui and Li, Feng and Zhang, Hao and Zhang, Kaichen and Zhang, Peiyuan and Li, Yanwei and Liu, Ziwei and Li, Chunyuan},
  journal={Transactions on Machine Learning Research},
  year={2024}
}
```