LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training

Sep 30, 2025
LLaVA-OneVision Contributors

High performance, low cost, and strong reproducibility.

LLaVA established a low-cost open-source path for connecting vision encoders with large language models, and later versions steadily expanded toward OCR, charts, documents, multi-image reasoning, and video. LLaVA-OneVision consolidates that line into a unified interface across images, charts, documents, multi-image inputs, and video.

The remaining gap in open multimodal systems is often not architecture, but recipe transparency. Strong models such as Qwen2.5-VL show excellent results, yet full data composition, cleaning, sampling, and training schedules are rarely disclosed end-to-end. Our focus is to close that reproducibility gap rather than only release weights.

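Table 1. Performance comparison across vision-language models on various benchmarks grouped by task type. All scores are reported as accuracy percentages unless otherwise specified. Each group ends with its average row.

| Benchmark | LLaVA-OV-1.5 8B | Qwen2.5-VL 7B | LLaVA-OV-1.5 4B | LLaVA-OV-1.5 3B | Qwen2.5-VL 3B |
|---|---|---|---|---|---|
| MMStar | 67.7 | 62.5 | 64.9 | 59.1 | 55.9 |
| MMBench (en) | 84.1 | 83.4 | 84.2 | 81.0 | 78.0 |
| MMBench (cn) | 81.0 | 81.6 | 76.9 | 73.0 | 74.6 |
| MME-RealWorld (en) | 62.3 | 57.3 | 49.6 | 57.9 | 51.6 |
| MME-RealWorld (cn) | 56.1 | 51.5 | 61.6 | 23.4 | 45.4 |
| SeedBench (image) | 77.3 | 77.5 | 76.6 | 71.3 | 74.8 |
| CV-Bench | 80.8 | 80.0 | 77.2 | 73.8 | 71.5 |
| ScienceQA | 95.0 | 88.8 | 93.6 | 91.2 | 83.3 |
| SEED-Bench-2-Plus | 69.2 | 70.9 | 68.9 | 67.6 | 68.6 |
| RealWorldQA | 68.1 | 68.5 | 67.8 | 66.8 | 60.0 |
| Avg. | 74.2 | 72.2 | 72.1 | 66.5 | 66.4 |
| MathVista (mini) | 69.6 | 68.6 | 67.9 | 64.7 | 60.2 |
| WeMath | 33.6 | 33.3 | 24.9 | 22.6 | 18.4 |
| MathVision | 25.6 | 22.4 | 24.2 | 19.9 | 21.3 |
| MMMU (val) | 55.4 | 51.3 | 52.7 | 45.5 | 46.4 |
| MMMU-Pro (standard) | 37.4 | 36.3 | 35.3 | 29.5 | 31.1 |
| MMMU-Pro (vision) | 25.2 | 32.8 | 25.4 | 20.3 | 21.3 |
| Avg. | 41.1 | 40.8 | 38.4 | 33.7 | 33.1 |
| ChartQA | 86.5 | 84.1 | 87.1 | 84.4 | 83.4 |
| CharXiv (DQ) | 74.1 | 69.8 | 63.8 | 61.8 | 58.2 |
| DocVQA | 95.0 | 94.9 | 94.4 | 93.4 | 92.7 |
| OCRBench | 82.9 | 84.2 | 80.0 | 80.5 | 79.2 |
| AI2D (w M) | 84.2 | 82.6 | 83.6 | 82.3 | 78.6 |
| AI2D (w/o M) | 94.1 | 93.4 | 93.3 | 91.9 | 90.7 |
| InfoVQA | 78.4 | 81.7 | 76.1 | 71.2 | 75.6 |
| Avg. | 85.0 | 84.4 | 82.6 | 81.0 | 79.8 |
| PixmoCount | 62.2 | 63.3 | 52.2 | 57.0 | 50.9 |
| CountBench | 88.2 | 86.4 | 79.8 | 49.5 | 72.5 |
| VL-RewardBench | 46.7 | 49.7 | 48.2 | 42.5 | 42.1 |
| V* | 78.0 | 77.0 | 74.9 | 67.5 | 69.6 |
| Avg. | 68.8 | 69.1 | 63.8 | 54.1 | 58.8 |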

We release a fully open, concept-balanced 85M pretraining set and a curated 22M instruction set on top of the LLaVA-OneVision framework. The recipe keeps a compact three-stage pipeline, combines offline data packing with Megatron-LM and a distributed optimizer, and trains an 8B model from pretraining to SFT in about 168 wall-clock hours on 128 GPUs.

LLaVA-OneVision-1.5 extends the series with RICE-ViT for native-resolution perception, stronger chart/document understanding, and a quality-first data recipe. Beyond checkpoints, we also release the data, packing pipeline, configs, logs, and evaluation commands needed for low-cost reproduction (see the technical report).


Pretraining Dataset (85M) and Concept Balancing

The 85M pretraining corpus mixes eight sources, covering roughly 20M Chinese and 65M English image–text pairs. To reduce long-tail sparsity, we use concept balancing with a shared image–concept embedding space and inverse-frequency resampling, then add high-quality bilingual caption augmentation. The result is better data coverage, stronger rare-concept recall, and more stable multimodal transfer.
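As a rough illustration of inverse-frequency resampling, the sketch below weights each sample by the rarity of its concepts and then draws a resampled subset. It assumes every sample has already been mapped to a set of concept IDs (for example, by nearest-neighbor lookup in the shared image-concept embedding space). The function names, the mean-over-concepts weighting, and the `power` smoothing exponent are illustrative assumptions, not the released pipeline.

```python
from collections import Counter
import random


def inverse_frequency_weights(sample_concepts, power=1.0):
    """Return one sampling weight per sample.

    sample_concepts: list of non-empty lists of concept IDs (hypothetical format).
    A sample's weight is the mean of 1 / freq(concept) ** power over its concepts,
    where freq counts how many samples mention the concept.
    """
    freq = Counter(c for concepts in sample_concepts for c in set(concepts))
    weights = []
    for concepts in sample_concepts:
        uniq = set(concepts)
        weights.append(sum(1.0 / freq[c] ** power for c in uniq) / len(uniq))
    return weights


def resample(sample_concepts, k, power=1.0, seed=0):
    """Draw k sample indices (with replacement) proportionally to the weights."""
    rng = random.Random(seed)
    weights = inverse_frequency_weights(sample_concepts, power)
    return rng.choices(range(len(sample_concepts)), weights=weights, k=k)


if __name__ == "__main__":
    # Toy corpus: "dog" is a head concept, "okapi" is a tail concept.
    corpus = [["dog"]] * 8 + [["okapi"]] * 2
    picked = resample(corpus, k=1000)
    tail_share = sum(1 for i in picked if corpus[i] == ["okapi"]) / len(picked)
    print(f"tail share after resampling: {tail_share:.2f} (raw share: 0.20)")
```

Setting `power` below 1 would soften the correction, which is one way to trade rare-concept coverage against drifting too far from the natural distribution.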

Figure 5. Balanced mid-training consistently outperforms random sampling. Experimental results using 2M balanced and unbalanced mid-training samples (with LLaVA-NeXT-780k as the SFT data) show that a balanced mid-training dataset yields consistent improvements over a random sampling strategy.

Score improvement (balanced − random) per benchmark, as labeled in the chart:

Drop: MathVista (mini) -1.03; CharXiv (DQ) -0.50.

Gain: SEED-Bench-2-Plus +0.35; InfoVQA +0.38; RealWorldQA +0.39; MathVision +0.53; ChartQA +0.64; AI2D (w/o M) +0.64; AI2D (w M) +0.65; MMMU-Pro (vision) +0.83; MME (Per) +1.15; WebSrc (val) +1.20; ScienceQA +1.25; DocVQA +1.53; MME-RealWorld (en) +1.59; MathVerse (vision) +1.65; CV-Bench +1.71; MMMU-Pro (standard) +1.74; OCRBench +1.80; PixmoCount +1.89; MMMU (val) +2.11; MMBench (cn) +2.13; MMBench (en) +2.24.

Bars without a printed value: MMStar, SeedBench (image), MME-RealWorld (en).

Instruction Dataset (22M)

The 22M instruction dataset spans captioning, charts and tables, code and math, grounding, OCR, science, and general VQA. Through aggregation, format standardization, rewriting, bilingual conversion, and safety filtering, we keep category balance and reduce template homogeneity. Adding FineVision further improves downstream performance.


Method

1) Visual Encoder Pretraining

To improve OCR, document understanding, and region-level perception, LLaVA-OneVision-1.5 adopts our in-house MVT v1.5 (RICE-ViT) backbone.

Unlike CLIP/SigLIP-style encoders that mainly emphasize global alignment, RICE-ViT is designed to preserve stronger local semantics through Region Cluster Discrimination:

  • it is trained on 450M images and 2.4B candidate regions
  • it explicitly models local entities, text blocks, and surrounding context through region–cluster discrimination with region-aware attention
  • it uses 2D rotary position encoding (2D RoPE) to support native multi-resolution inputs
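To make the last point concrete, here is a minimal sketch of 2D RoPE in plain Python. It uses one common formulation, in which the first half of each head dimension is rotated by the patch's row index and the second half by its column index. That split, the `base` value, and all function names are assumptions for illustration; the source only states that RICE-ViT uses 2D RoPE.

```python
import math


def rope_1d(vec, pos, base=10000.0):
    """Rotate consecutive feature pairs of `vec` by angles pos * base**(-i/d)."""
    d = len(vec)  # must be even
    out = []
    for i in range(0, d, 2):
        angle = pos * base ** (-i / d)
        c, s = math.cos(angle), math.sin(angle)
        out += [vec[i] * c - vec[i + 1] * s, vec[i] * s + vec[i + 1] * c]
    return out


def rope_2d(vec, row, col):
    """First half of the head dimension encodes the row, second half the column."""
    half = len(vec) // 2  # len(vec) must be divisible by 4
    return rope_1d(vec[:half], row) + rope_1d(vec[half:], col)


def attention_score(q, k, q_pos, k_pos):
    """Dot product of a query and key after 2D RoPE at their (row, col) positions."""
    rq, rk = rope_2d(q, *q_pos), rope_2d(k, *k_pos)
    return sum(a * b for a, b in zip(rq, rk))


if __name__ == "__main__":
    q = [0.3, -1.2, 0.7, 0.1, 0.9, -0.4, 0.2, 1.5]
    k = [1.0, 0.5, -0.6, 0.8, -0.3, 0.4, 1.1, -0.9]
    # Same relative offset (+2 rows, +3 cols) at two different absolute positions,
    # as would happen in a small and a large patch grid.
    print(attention_score(q, k, (0, 0), (2, 3)))
    print(attention_score(q, k, (40, 17), (42, 20)))
```

Because the score depends only on the row and column offsets between two patches, the same weights apply to any grid size, which is what makes this kind of encoding suitable for native multi-resolution inputs.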

Rather than stacking several specialized losses, we use a unified clustering–discrimination objective that improves semantic understanding, OCR, and localization together. We then connect the visual backbone to the language model with a lightweight projector and full-parameter joint training, avoiding redundant adapter layers while keeping the pipeline simple.

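Table 2. Comparison of RICE-ViT with other vision encoders using the LLaVA-NeXT framework. All models are evaluated using identical configurations: Qwen2.5-7B as the language model, LLaVA-NeXT training data, and the same training pipeline. To ensure fair comparison, we adopt LLaVA-NeXT’s tiling strategy (up to 2 × 2 + 1 tiles) for handling high-resolution images, as many vision encoders do not support native resolution processing.

OCR & Document Understanding

| Method | Vision Tower | InfoVQA | DocVQA | ChartQA | TextVQA | OCRBench | OCRBenchV2 | LiveXivVQA | OCR Avg |
|---|---|---|---|---|---|---|---|---|---|
| CLIP | ViT-L-14-336px | 38.9 | 75.2 | 66.5 | 62.5 | 52.5 | 23.0 | 47.4 | 52.3 |
| MLCD | ViT-L-14-336px | 43.5 | 76.5 | 67.8 | 61.7 | 53.1 | 24.0 | 48.4 | 53.6 |
| AIMv2 | ViT-L-14-336px | 35.4 | 77.2 | 72.7 | 65.9 | 57.2 | 23.9 | 47.3 | 54.2 |
| RICE-ViT | ViT-L-14-336px | 45.2 | 79.2 | 72.3 | 65.9 | 57.5 | 24.1 | 48.9 | 56.2 |
| DFN5B | ViT-H-14-378px | 38.6 | 70.9 | 64.4 | 59.4 | 47.3 | 21.9 | 46.2 | 49.8 |
| SigLIP | ViT-SO400M-14-384px | 41.4 | 76.7 | 69.3 | 64.7 | 55.4 | 24.0 | 48.4 | 54.3 |
| SigLIPv2 | ViT-SO400M-14-384px | 43.7 | 79.1 | 70.2 | 66.2 | 58.7 | 25.4 | 48.6 | 56.0 |
| RICE-ViT | ViT-L-14-378px | 48.1 | 82.6 | 75.1 | 66.2 | 58.8 | 25.8 | 49.5 | 58.0 |
| SigLIPv2 | ViT-SO400M-16-560px | 50.2 | 86.2 | 77.4 | 70.2 | 62.7 | 26.5 | 52.9 | 60.9 |
| RICE-ViT | ViT-L-14-560px | 53.2 | 87.4 | 78.1 | 69.0 | 60.7 | 26.1 | 53.0 | 61.1 |
| Qwen-ViT from Qwen2.5-VL 7B | ViT-H-14-560px | 55.9 | 85.8 | 78.8 | 73.7 | 66.2 | 26.8 | 53.4 | 62.9 |
| RICE-ViT from OV-1.5 3B | ViT-L-14-560px | 53.7 | 87.1 | 81.9 | 73.8 | 73.3 | 30.4 | 53.6 | 64.8 |

General Vision Understanding

| Method | Vision Tower | AI2D | MMB (EN) | MME (Cog) | MME (Per) | POPE | RealworldQA | MMStar | Other Avg |
|---|---|---|---|---|---|---|---|---|---|
| CLIP | ViT-L-14-336px | 73.2 | 74.6 | 48.0 | 75.6 | 88.8 | 63.7 | 49.0 | 67.6 |
| MLCD | ViT-L-14-336px | 77.0 | 76.4 | 54.1 | 79.9 | 88.7 | 61.1 | 51.0 | 69.7 |
| AIMv2 | ViT-L-14-336px | 75.4 | 78.6 | 48.3 | 75.0 | 88.4 | 62.2 | 50.2 | 68.3 |
| RICE-ViT | ViT-L-14-336px | 77.9 | 76.6 | 54.6 | 80.7 | 88.5 | 63.1 | 51.8 | 70.5 |
| DFN5B | ViT-H-14-378px | 73.5 | 73.4 | 45.8 | 76.9 | 88.6 | 59.9 | 49.1 | 66.7 |
| SigLIP | ViT-SO400M-14-384px | 76.2 | 77.0 | 46.1 | 79.9 | 88.8 | 63.7 | 47.3 | 68.4 |
| SigLIPv2 | ViT-SO400M-14-384px | 77.0 | 77.1 | 46.6 | 80.4 | 89.3 | 63.4 | 52.8 | 69.5 |
| RICE-ViT | ViT-L-14-378px | 76.5 | 77.6 | 54.1 | 79.0 | 89.1 | 62.9 | 51.2 | 70.1 |
| SigLIPv2 | ViT-SO400M-16-560px | 77.0 | 76.5 | 53.5 | 79.9 | 89.3 | 68.2 | 53.1 | 71.1 |
| RICE-ViT | ViT-L-14-560px | 76.9 | 78.6 | 56.3 | 79.3 | 88.9 | 65.1 | 50.5 | 70.8 |
| Qwen-ViT from Qwen2.5-VL 7B | ViT-H-14-560px | 78.8 | 78.4 | 62.0 | 80.8 | 88.6 | 64.2 | 55.0 | 72.5 |
| RICE-ViT from OV-1.5 3B | ViT-L-14-560px | 80.3 | 79.6 | 58.6 | 82.2 | 89.0 | 67.3 | 56.6 | 73.4 |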

2) Three-Stage Learning Pipeline

3) Offline Parallel Data Packing

To reduce padding waste and improve token utilization, we use offline parallel packing:

  • hash-bucket clustering by sample length or length range, which avoids the cost of a global sort or scan
  • multithreaded concatenation of short samples into fixed-length sequences close to the target length during data preparation

The pipeline is deterministic and reproducible, and it avoids the runtime overhead of online dynamic packing. On the 85M set, it reaches up to roughly 11× effective padding compression.
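The two steps above can be sketched in a few lines. This is a single-process illustration, not the released packing code: the function names, the `bucket_width` parameter, and the first-fit placement rule are assumptions. Samples are hashed into length-range buckets, the buckets are visited from longest to shortest, and each sample goes into the first sequence that still has room.

```python
from collections import defaultdict


def bucket_by_length(lengths, bucket_width=256):
    """Hash sample indices into buckets keyed by length range (no global sort)."""
    buckets = defaultdict(list)
    for idx, n in enumerate(lengths):
        buckets[n // bucket_width].append(idx)
    return buckets


def pack(lengths, target_len=8192, bucket_width=256):
    """Greedily pack samples into sequences of at most target_len tokens.

    Returns a list of packs, each a list of sample indices. The result is
    deterministic for a given input order.
    """
    buckets = bucket_by_length(lengths, bucket_width)
    packs, room = [], []  # room[i] = free tokens left in packs[i]
    for key in sorted(buckets, reverse=True):  # longest length ranges first
        for idx in buckets[key]:
            n = lengths[idx]
            if n > target_len:
                raise ValueError(f"sample {idx} exceeds target_len; split or truncate it first")
            for i, free in enumerate(room):
                if n <= free:  # first open sequence with enough room
                    packs[i].append(idx)
                    room[i] -= n
                    break
            else:  # nothing fits: open a new sequence
                packs.append([idx])
                room.append(target_len - n)
    return packs


def padding_ratio(lengths, packs, target_len):
    """Fraction of padded tokens when every pack is padded to target_len."""
    return 1.0 - sum(lengths) / (len(packs) * target_len)


if __name__ == "__main__":
    lengths = [700, 300, 512, 512, 1000, 24, 200, 800]
    packs = pack(lengths, target_len=1024)
    print(packs)
    print(f"padding after packing: {padding_ratio(lengths, packs, 1024):.1%}")
```

Because packing happens once during data preparation, the training loop only ever sees fixed-length sequences. A production version would also need to record sample boundaries inside each pack so that attention does not cross them.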

4) Hybrid Parallelism and Efficient Long-Context Training

We combine tensor parallelism (TP), pipeline parallelism (PP), and sequence/context parallelism with a distributed optimizer to improve utilization and memory efficiency at scale. Native-resolution training preserves details in charts, documents, and dense text regions.
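As a small illustration of how these dimensions compose, the helper below computes the number of data-parallel replicas left once a model replica has been sharded across tensor, pipeline, and context parallel ranks. The relation world_size = TP × PP × CP × DP is the usual Megatron-style layout. The 128-GPU figure comes from this post, but the TP/PP/CP degrees in the example are made up; the post does not state the actual parallel configuration.

```python
def data_parallel_size(world_size, tp=1, pp=1, cp=1):
    """Number of data-parallel replicas left after model-parallel sharding.

    Each model replica spans tp * pp * cp GPUs, so
    world_size = tp * pp * cp * dp.
    """
    model_parallel = tp * pp * cp
    if world_size % model_parallel:
        raise ValueError(
            f"world_size={world_size} is not divisible by tp*pp*cp={model_parallel}"
        )
    return world_size // model_parallel


if __name__ == "__main__":
    # Hypothetical degrees, for illustration only.
    for tp, pp, cp in [(1, 1, 1), (2, 2, 1), (4, 2, 2)]:
        dp = data_parallel_size(128, tp, pp, cp)
        print(f"TP={tp} PP={pp} CP={cp} -> DP={dp}")
```

With a distributed optimizer, the optimizer state is sharded across the data-parallel ranks, so a larger DP degree lowers per-GPU optimizer memory.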

Training Efficiency: From Zero to SOTA in ~168 Hours

Figure: training loss over time for the Mid-Training and SFT phases (logarithmic y-axis from 2.50 down to 0.20; x-axis from 0h to 168h).

  • Mid-Training: 89.39h
  • SFT: 79.05h
  • Total time: 168.44h
  • Hardware: 128 GPUs
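For budgeting a reproduction, the wall-clock figures above convert to GPU-hours with simple arithmetic. The helper name is hypothetical, and the calculation assumes all 128 GPUs were in use for the full duration of both phases.

```python
def gpu_hours(phase_hours, num_gpus):
    """Return (total wall-clock hours, total GPU-hours) for the given phases."""
    total = sum(phase_hours.values())
    return total, total * num_gpus


if __name__ == "__main__":
    total, budget = gpu_hours({"mid_training": 89.39, "sft": 79.05}, num_gpus=128)
    print(f"{total:.2f} wall-clock hours, {budget:,.0f} GPU-hours")
```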

Quick Start with HuggingFace

```python
from transformers import AutoModelForCausalLM, AutoProcessor
from qwen_vl_utils import process_vision_info

model_path = "lmms-lab/LLaVA-OneVision-1.5-8B-Instruct"

# Load the model on the available device(s)
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype="auto", device_map="auto", trust_remote_code=True
)

# Default processor
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference: render the chat template and load the image
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
# Move the inputs to the device the model was placed on
inputs = inputs.to(model.device)

# Inference: the processor output is a dict-like batch, so unpack it as keyword arguments
generated_ids = model.generate(**inputs, max_new_tokens=1024)
# Strip the prompt tokens so only the newly generated tokens are decoded
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```

Model Evaluation

```bash
# pip install git+https://github.com/EvolvingLMMs-Lab/lmms-eval.git
accelerate launch --num_processes=8 --main_process_port 12399 -m lmms_eval \
 --model=llava_onevision1_5 \
 --model_args=pretrained=lmms-lab/LLaVA-OneVision-1.5-8B-Instruct,attn_implementation=flash_attention_2,max_pixels=3240000 \
 --tasks=mmmu_val,mmmu_pro_standard,mmbench_en_test,mmerealworld,mmerealworld_cn,ai2d,ai2d_no_mask,vstar_bench,chartqa,charxiv,docvqa_test,mathvista_testmini,mmstar,scienceqa \
 --batch_size=1
```

Citation

```bibtex
@inproceedings{LLaVA-OneVision-1.5,
  title={LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training},
  author={An, Xiang and Xie, Yin and Yang, Kaicheng and Zhang, Wenkang and Zhao, Xiuwei and Cheng, Zheng and Wang, Yirui and Xu, Songcen and Chen, Changrui and Wu, Chunsheng and Tan, Huajie and Li, Chunyuan and Yang, Jing and Yu, Jie and Wang, Xiyao and Qin, Bin and Wang, Yumeng and Yan, Zizhen and Feng, Ziyong and Liu, Ziwei and Li, Bo and Deng, Jiankang},
  booktitle={arxiv},
  year={2025}
}

@article{lillava,
  title={LLaVA-OneVision: Easy Visual Task Transfer},
  author={Li, Bo and Zhang, Yuanhan and Guo, Dong and Zhang, Renrui and Li, Feng and Zhang, Hao and Zhang, Kaichen and Zhang, Peiyuan and Li, Yanwei and Liu, Ziwei and Li, Chunyuan},
  journal={Transactions on Machine Learning Research},
  year={2024}
}
```