LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training

Sep 30, 2025

LLaVA-OneVision Contributors


High performance, low cost, and strong reproducibility.

LLaVA established a low-cost open-source path for connecting vision encoders with large language models, and later versions steadily expanded toward OCR, charts, documents, multi-image reasoning, and video. LLaVA-OneVision consolidates that line into a unified interface across images, charts, documents, multi-image inputs, and video.

The remaining gap in open multimodal systems is often not architecture, but recipe transparency. Strong models such as Qwen2.5-VL show excellent results, yet full data composition, cleaning, sampling, and training schedules are rarely disclosed end-to-end. Our focus is to close that reproducibility gap rather than only release weights.

Table 1. Performance comparison across vision-language models on various benchmarks grouped by task type. All scores are reported as accuracy percentages unless otherwise specified. Each Avg. row is the mean of the benchmark group above it.

Benchmark	LLaVA-OV-1.5 8B	Qwen2.5-VL 7B	LLaVA-OV-1.5 4B	LLaVA-OV-1.5 3B	Qwen2.5-VL 3B
MMStar	67.7	62.5	64.9	59.1	55.9
MMBench_en	84.1	83.4	84.2	81.0	78.0
MMBench_cn	81.0	81.6	76.9	73.0	74.6
MME-RealWorld_en	62.3	57.3	49.6	57.9	51.6
MME-RealWorld_cn	56.1	51.5	61.6	23.4	45.4
SeedBench_image	77.3	77.5	76.6	71.3	74.8
CV-Bench	80.8	80.0	77.2	73.8	71.5
ScienceQA	95.0	88.8	93.6	91.2	83.3
SEED-Bench-2-Plus	69.2	70.9	68.9	67.6	68.6
RealWorldQA	68.1	68.5	67.8	66.8	60.0
Avg.	74.2	72.2	72.1	66.5	66.4
MathVista_mini	69.6	68.6	67.9	64.7	60.2
WeMath	33.6	33.3	24.9	22.6	18.4
MathVision	25.6	22.4	24.2	19.9	21.3
MMMU_val	55.4	51.3	52.7	45.5	46.4
MMMU-Pro_standard	37.4	36.3	35.3	29.5	31.1
MMMU-Pro_vision	25.2	32.8	25.4	20.3	21.3
Avg.	41.1	40.8	38.4	33.7	33.1
ChartQA	86.5	84.1	87.1	84.4	83.4
CharXiv_DQ	74.1	69.8	63.8	61.8	58.2
DocVQA	95.0	94.9	94.4	93.4	92.7
OCRBench	82.9	84.2	80.0	80.5	79.2
AI2D_wM	84.2	82.6	83.6	82.3	78.6
AI2D_w/oM	94.1	93.4	93.3	91.9	90.7
InfoVQA	78.4	81.7	76.1	71.2	75.6
Avg.	85.0	84.4	82.6	81.0	79.8
PixmoCount	62.2	63.3	52.2	57.0	50.9
CountBench	88.2	86.4	79.8	49.5	72.5
VL-RewardBench	46.7	49.7	48.2	42.5	42.1
V*	78.0	77.0	74.9	67.5	69.6
Avg.	68.8	69.1	63.8	54.1	58.8

We release a fully open, concept-balanced 85M pretraining set and a curated 22M instruction set on top of the LLaVA-OneVision framework. The recipe keeps a compact three-stage pipeline, combines offline data packing with Megatron-LM and a distributed optimizer, and trains an 8B model from pretraining to SFT in about 168 wall-clock hours.

LLaVA-OneVision-1.5 extends the series with RICE-ViT for native-resolution perception, stronger chart/document understanding, and a quality-first data recipe. Beyond checkpoints, we also release the data, packing pipeline, configs, logs, and evaluation commands needed for low-cost reproduction (see the technical report).

Pretraining Dataset (85M) and Concept Balancing

The 85M pretraining corpus mixes eight sources, covering roughly 20M Chinese and 65M English image–text pairs. To reduce long-tail sparsity, we use concept balancing with a shared image–concept embedding space and inverse-frequency resampling, then add high-quality bilingual caption augmentation. The result is better data coverage, stronger rare-concept recall, and more stable multimodal transfer.
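The inverse-frequency resampling step can be illustrated with a small sketch. It assumes each sample has already been assigned a list of concepts (for example by nearest-neighbor lookup in the shared image-concept embedding space); the function names, the toy concept labels, and the choice of averaging inverse frequencies per sample are illustrative assumptions, not the released pipeline.

```python
import random
from collections import Counter

def inverse_frequency_weights(sample_concepts):
    """One sampling weight per sample: the mean of 1/frequency over its concepts."""
    freq = Counter(c for concepts in sample_concepts for c in concepts)
    weights = []
    for concepts in sample_concepts:
        if not concepts:
            # Assumption: samples with no matched concept are never drawn.
            weights.append(0.0)
            continue
        weights.append(sum(1.0 / freq[c] for c in concepts) / len(concepts))
    return weights

def balanced_resample(sample_concepts, k, seed=0):
    """Draw k sample indices with replacement, favoring rare concepts."""
    weights = inverse_frequency_weights(sample_concepts)
    rng = random.Random(seed)
    return rng.choices(range(len(sample_concepts)), weights=weights, k=k)

# Toy corpus with hypothetical labels: "dog" is common, "theodolite" is rare.
toy = [["dog"], ["dog"], ["dog"], ["dog", "park"], ["theodolite"]]
weights = inverse_frequency_weights(toy)
picked = balanced_resample(toy, k=1000)
```

In this toy corpus the single rare-concept sample receives four times the weight of each common-concept sample, which is the effect the balancing step aims for on the long tail.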

Figure 5. Balanced mid-training consistently outperforms random sampling. Score improvement = balanced − random; negative values are drops and positive values are gains (axis range −1.2 to 2.4).

Benchmark	Score improvement
MathVista_mini	-1.03
CharXiv_DQ	-0.50
MMStar	(not labeled)
SeedBench_image	(not labeled)
MME-RealWorld_en	(not labeled)
SEED-Bench-2-Plus	+0.35
InfoVQA	+0.38
RealWorldQA	+0.39
MathVision	+0.53
ChartQA	+0.64
AI2D_w/oM	+0.64
AI2D_wM	+0.65
MMMU-Pro_vision	+0.83
MME_Per	+1.15
WebSrc_val	+1.20
ScienceQA	+1.25
DocVQA	+1.53
MME-RealWorld_en	+1.59
MathVerse_vision	+1.65
CV-Bench	+1.71
MMMU-Pro_standard	+1.74
OCRBench	+1.80
PixmoCount	+1.89
MMMU_val	+2.11
MMBench_cn	+2.13
MMBench_en	+2.24

Experimental results using 2M balanced and unbalanced mid-training samples (with LLaVA-NeXT-780k as the SFT data) show that a balanced mid-training dataset yields consistent improvements over a random sampling strategy.

Instruction Dataset (22M)

The 22M instruction dataset spans captioning, charts and tables, code and math, grounding, OCR, science, and general VQA. Through aggregation, format standardization, rewriting, bilingual conversion, and safety filtering, we keep category balance and reduce template homogeneity. Adding FineVision further improves downstream performance.

Method

1) Visual Encoder Pretraining

To improve OCR, document understanding, and region-level perception, LLaVA-OneVision-1.5 adopts our in-house MVT v1.5 (RICE-ViT) backbone.

Unlike CLIP/SigLIP-style encoders that mainly emphasize global alignment, RICE-ViT is designed to preserve stronger local semantics through Region Cluster Discrimination:

it is trained on 450M images and 2.4B candidate regions
it explicitly models local entities, text blocks, and surrounding context through region–cluster discrimination with region-aware attention
it uses 2D rotary position encoding (2D RoPE) to support native multi-resolution inputs
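To make the last point concrete, here is a minimal sketch of one common 2D RoPE formulation: half of the channels are rotated by the patch's row index and the other half by its column index. The text above does not specify the exact variant used in RICE-ViT, so the channel split, the `base` value, and all function names here are assumptions for illustration.

```python
import math

def rope_1d(vec, pos, base=10000.0):
    """Rotate consecutive channel pairs of `vec` by angles pos * theta_i."""
    d = len(vec)
    assert d % 2 == 0, "need an even number of channels"
    out = []
    for i in range(0, d, 2):
        theta = base ** (-i / d)
        c, s = math.cos(pos * theta), math.sin(pos * theta)
        a, b = vec[i], vec[i + 1]
        out += [a * c - b * s, a * s + b * c]
    return out

def rope_2d(vec, row, col):
    """First half of the channels encodes the row, second half the column."""
    assert len(vec) % 4 == 0, "need a channel count divisible by 4"
    half = len(vec) // 2
    return rope_1d(vec[:half], row) + rope_1d(vec[half:], col)

def attention_score(q, k, q_pos, k_pos):
    """Dot product of a query and key after 2D rotary encoding."""
    qr = rope_2d(q, *q_pos)
    kr = rope_2d(k, *k_pos)
    return sum(a * b for a, b in zip(qr, kr))

q = [0.3, -1.2, 0.7, 0.5, -0.4, 0.9, 1.1, -0.6]
k = [1.0, 0.2, -0.8, 0.4, 0.6, -0.3, 0.5, 0.7]
# The score depends only on the (row, col) offset between the two patches.
s1 = attention_score(q, k, (1, 2), (3, 5))
s2 = attention_score(q, k, (0, 0), (2, 3))
```

Because the score depends only on relative offsets, the same weights apply to any patch grid size, which is why this kind of encoding suits native multi-resolution inputs without resizing a learned position table.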

Rather than stacking several specialized losses, we use a unified clustering–discrimination objective that improves semantic understanding, OCR, and localization together. We then connect the visual backbone to the language model with a lightweight projector and full-parameter joint training, avoiding redundant adapter layers while keeping the pipeline simple.

Table 2. Comparison of RICE-ViT with other vision encoders using the LLaVA-NeXT framework. All models are evaluated using identical configurations: Qwen2.5-7B as the language model, LLaVA-NeXT training data, and the same training pipeline. To ensure fair comparison, we adopt LLaVA-NeXT’s tiling strategy (up to 2 × 2 + 1 tiles) for handling high-resolution images, as many vision encoders do not support native resolution processing.

Model Configuration		OCR & Document Understanding								General Vision Understanding
Method	Vision Tower	InfoVQA	DocVQA	ChartQA	TextVQA	OCRBench	OCRBenchV2	LiveXivVQA	OCR Avg	AI2D	MMBench_en	MME_Cog	MME_Per	POPE	RealWorldQA	MMStar	Other Avg
CLIP	ViT-L-14-336px	38.9	75.2	66.5	62.5	52.5	23.0	47.4	52.3	73.2	74.6	48.0	75.6	88.8	63.7	49.0	67.6
MLCD	ViT-L-14-336px	43.5	76.5	67.8	61.7	53.1	24.0	48.4	53.6	77.0	76.4	54.1	79.9	88.7	61.1	51.0	69.7
AIMv2	ViT-L-14-336px	35.4	77.2	72.7	65.9	57.2	23.9	47.3	54.2	75.4	78.6	48.3	75.0	88.4	62.2	50.2	68.3
RICE-ViT	ViT-L-14-336px	45.2	79.2	72.3	65.9	57.5	24.1	48.9	56.2	77.9	76.6	54.6	80.7	88.5	63.1	51.8	70.5

DFN5B	ViT-H-14-378px	38.6	70.9	64.4	59.4	47.3	21.9	46.2	49.8	73.5	73.4	45.8	76.9	88.6	59.9	49.1	66.7
SigLIP	ViT-SO400M-14-384px	41.4	76.7	69.3	64.7	55.4	24.0	48.4	54.3	76.2	77.0	46.1	79.9	88.8	63.7	47.3	68.4
SigLIPv2	ViT-SO400M-14-384px	43.7	79.1	70.2	66.2	58.7	25.4	48.6	56.0	77.0	77.1	46.6	80.4	89.3	63.4	52.8	69.5
RICE-ViT	ViT-L-14-378px	48.1	82.6	75.1	66.2	58.8	25.8	49.5	58.0	76.5	77.6	54.1	79.0	89.1	62.9	51.2	70.1

SigLIPv2	ViT-SO400M-16-560px	50.2	86.2	77.4	70.2	62.7	26.5	52.9	60.9	77.0	76.5	53.5	79.9	89.3	68.2	53.1	71.1
RICE-ViT	ViT-L-14-560px	53.2	87.4	78.1	69.0	60.7	26.1	53.0	61.1	76.9	78.6	56.3	79.3	88.9	65.1	50.5	70.8
Qwen-ViT from Qwen2.5-VL 7B	ViT-H-14-560px	55.9	85.8	78.8	73.7	66.2	26.8	53.4	62.9	78.8	78.4	62.0	80.8	88.6	64.2	55.0	72.5
RICE-ViT from OV-1.5 3B	ViT-L-14-560px	53.7	87.1	81.9	73.8	73.3	30.4	53.6	64.8	80.3	79.6	58.6	82.2	89.0	67.3	56.6	73.4

2) Three‑Stage Learning Pipeline

Stage 1: Language-image alignment

Train the projector on LLaVA-1.5 558K to map visual features into the LLM token space.

Data: LLaVA-1.5 558K

Stage 1.5: Mid-stage pretraining

Use full-parameter training on the concept-balanced 85M corpus to inject broad visual semantics and knowledge.

Data: Concept-balanced 85M

Stage 2: Visual instruction alignment

Continue full-parameter training on the 22M instruction set plus extra visual instruction data such as FineVision.

Data: 22M + FineVision
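The three stages can be summarized as a small schedule. This is a sketch for orientation only: the module names (`vision_encoder`, `projector`, `llm`) are hypothetical, and it assumes that the modules not mentioned in Stage 1 stay frozen there.

```python
from dataclasses import dataclass

# Hypothetical module names; the released code may organize parameters differently.
ALL_MODULES = ("vision_encoder", "projector", "llm")

@dataclass(frozen=True)
class Stage:
    name: str
    goal: str
    data: str
    trainable: tuple

PIPELINE = (
    Stage("1", "language-image alignment", "LLaVA-1.5 558K", ("projector",)),
    Stage("1.5", "mid-stage pretraining", "concept-balanced 85M", ALL_MODULES),
    Stage("2", "visual instruction alignment", "22M + FineVision", ALL_MODULES),
)

def frozen_modules(stage):
    """Modules that receive no gradient updates in this stage."""
    return tuple(m for m in ALL_MODULES if m not in stage.trainable)

for stage in PIPELINE:
    print(f"Stage {stage.name} ({stage.goal}): data={stage.data}, "
          f"train={stage.trainable}, freeze={frozen_modules(stage)}")
```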

3) Offline Parallel Data Packing

To reduce padding waste and improve token utilization, we use offline parallel packing:

hash-bucket clustering by sample length or length ranges to cut global sorting/scanning costs
multithreaded concatenation of multiple short samples into fixed-length sequences close to the target length during data prep

The pipeline is deterministic, reproducible, and avoids the runtime overhead of online dynamic packing. On the 85M set, it reaches up to ~11× effective padding compression.
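The bucketing and concatenation steps can be sketched in a few lines. This is a minimal single-threaded illustration under simplifying assumptions, not the released packing code: `pack_offline` and its parameters are names chosen here, samples are represented only by their token lengths, and multithreading and shard I/O are left out.

```python
from collections import defaultdict

def pack_offline(lengths, max_len):
    """Greedy offline packing over length buckets.

    Each pack starts from the longest remaining sample, then repeatedly adds
    the largest remaining sample that still fits, found by bucket lookup.
    Returns a list of packs, each a list of sample indices.
    """
    buckets = defaultdict(list)  # token length -> sample indices
    for idx, n in enumerate(lengths):
        if n > max_len:
            raise ValueError(f"sample {idx} has {n} tokens, more than {max_len}")
        buckets[n].append(idx)

    packs = []
    for seed_len in sorted(buckets, reverse=True):
        while buckets[seed_len]:
            pack = [buckets[seed_len].pop()]
            room = max_len - seed_len
            fit = room
            while fit > 0:
                if buckets.get(fit):          # a sample of exactly `fit` tokens exists
                    pack.append(buckets[fit].pop())
                    room -= fit
                    fit = room                # look for the next largest complement
                else:
                    fit -= 1
            packs.append(pack)
    return packs

def fill_ratio(packs, lengths, max_len):
    """Fraction of the packed token budget occupied by real tokens."""
    return sum(lengths) / (len(packs) * max_len)

lengths = [7, 5, 3, 3, 2, 8, 1, 1]
packs = pack_offline(lengths, max_len=10)
ratio = fill_ratio(packs, lengths, max_len=10)
```

Because the procedure has no randomness and sees all lengths up front, running it twice on the same shard gives the same packs, which is the property that makes the offline approach reproducible.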

Offline Packing

From uneven raw samples to dense fixed-length packed sequences

Instead of packing inside the dataloader with a small rolling buffer, we first compute token lengths for the whole shard, store the packing state offline, and then run staged bin-packing over global statistics so every sequence is filled as close as possible to the target budget before training even begins.

Raw mixed samples

Mixed lengths do not just waste padding tokens. They also create inter-GPU waiting, large per-step power swings, and occasional OOM spikes when batches are padded to the longest sample.

Hash buckets + staged bin packing

Short samples: hash buckets by token length
Medium samples: multi-threaded bin scheduling
Long samples: seed packs for long samples

Pipeline: token lengths → bin packing → tar shards → training

Short samples are grouped into exact or near-exact token-length buckets, long samples become seed boxes, and each stage can switch from diversity-first filling to greedy completion. That is why the remaining slack in each 8k box stays extremely small.

Packed sequences + direct training

Example packed sequences in the illustration are 92%, 96%, 94%, and 98% full.

At training time the loader only reads pre-built packed sequences and their boundaries, so there is no online search, no small-buffer under-packing, and much less CPU-GPU jitter.

Why online packing under-fills

the dataloader only sees a small rolling buffer, so it cannot search globally for better complements
small buffers leave many near-fit candidates invisible, which turns into leftover slack in each max-length box
growing the online buffer improves fit quality, but raises host-RAM pressure and preprocessing latency
because packing happens inside training, every step still pays search, merge, and scheduling overhead
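The effect of buffer size can be seen in a toy simulation. The packer below is a deliberately simple stand-in for an online dataloader packer (the name `pack_online`, the greedy fill rule, and the example lengths are all assumptions for illustration); it only ever considers the samples currently in its rolling buffer.

```python
def pack_online(lengths, max_len, buffer_size):
    """Pack a stream of samples while seeing only `buffer_size` pending samples."""
    if max(lengths, default=0) > max_len:
        raise ValueError("a sample is longer than max_len")
    packs, buffer = [], []
    stream = iter(enumerate(lengths))
    exhausted = False
    while True:
        # Refill the rolling buffer from the stream.
        while not exhausted and len(buffer) < buffer_size:
            try:
                buffer.append(next(stream))
            except StopIteration:
                exhausted = True
        if not buffer:
            break
        # Start a pack with the oldest pending sample, then add whatever
        # fits from the buffer, largest candidates first.
        idx, n = buffer.pop(0)
        pack, room = [idx], max_len - n
        for cand in sorted(buffer, key=lambda item: -item[1]):
            if cand[1] <= room:
                pack.append(cand[0])
                room -= cand[1]
                buffer.remove(cand)
        packs.append(pack)
    return packs

lengths = [6, 6, 6, 4, 4, 4]
small = pack_online(lengths, max_len=10, buffer_size=2)   # sees 2 samples at a time
large = pack_online(lengths, max_len=10, buffer_size=6)   # sees the whole stream
```

With a buffer of two, the first two 6-token samples are emitted alone because their 4-token complements have not arrived yet, giving four packs. With the whole stream visible, every 6 is paired with a 4 and three full packs suffice.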

Why offline packing gets near-full boxes

the whole shard is indexed by token length first, so exact and near-exact complements are easy to retrieve
long samples are treated as seeds, then short and medium samples are packed around them to close the remaining gap
the algorithm can switch between diversity-first filling and greedy completion when a bucket starts to deadlock
once the best arrangement is found, it is serialized once and reused directly during training with no extra packing cost

4) Hybrid Parallelism and Efficient Long-Context Training

We combine tensor parallelism (TP), pipeline parallelism (PP), and sequence/context parallelism with a distributed optimizer to improve utilization and memory efficiency at scale, including for long-context training. Native-resolution training preserves details in charts, documents, and dense text regions.
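As a rough guide to how these dimensions interact, the total GPU count factors into the product of the tensor, pipeline, context, and data-parallel degrees. The sketch below computes the data-parallel degree left over for a given layout. The specific degrees in the example are hypothetical, since the text does not state the layout used; only the GPU count of 128 comes from this page.

```python
def data_parallel_size(world_size, tp, pp, cp=1):
    """GPUs left for data parallelism once each model replica spans tp * pp * cp GPUs."""
    model_parallel = tp * pp * cp
    if world_size % model_parallel:
        raise ValueError(
            f"world_size={world_size} is not divisible by tp*pp*cp={model_parallel}"
        )
    return world_size // model_parallel

# Hypothetical layout on 128 GPUs: 2-way tensor, 2-way pipeline, 2-way context.
dp = data_parallel_size(world_size=128, tp=2, pp=2, cp=2)
print(f"data-parallel replicas: {dp}")
```

A distributed optimizer shards optimizer state across the data-parallel replicas, so a larger data-parallel degree also lowers per-GPU optimizer memory.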

Training Efficiency: From Zero to SOTA in ~168 Hours

Mid-Training: 89.39h

SFT: 79.05h

Total Time: 168.44h

Hardware: 128 GPUs
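The stage times add up to the reported total, and multiplying by the GPU count gives the compute budget in GPU-hours. A quick check using only the four numbers above (the variable names are ours):

```python
mid_training_h = 89.39
sft_h = 79.05
gpus = 128

total_h = round(mid_training_h + sft_h, 2)   # wall-clock hours
gpu_hours = round(total_h * gpus, 2)         # total compute budget
days = round(total_h / 24, 2)                # wall-clock days

print(f"total: {total_h} h (~{days} days), budget: {gpu_hours} GPU-hours")
```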

Open-Source Resources

Models, datasets, training code, and demos are all released.

Code & Demos

Training Code (GitHub)

Cook a SOTA model with our released training code and reproduction scripts.

Live Demo (HF Spaces)

Try LLaVA-OneVision-1.5 directly in your browser.

Model Checkpoints

LLaVA-OV-1.5-8B-Instruct (8B)

Instruction-tuned, ready for deployment. Downloads in the last 30 days: 16,888.

LLaVA-OV-1.5-4B-Instruct (4B)

Compact instruct model for efficient inference. Downloads in the last 30 days: 2,529.

Training Datasets

LLaVA-OV-1.5-Mid-Training-85M (85M samples)

Concept-balanced pretraining corpus with bilingual augmentation. Downloads in the last 30 days: 290,624.

LLaVA-OV-1.5-Instruct (22M samples)

Multi-category instruction set with format standardization and safety filtering. Downloads in the last 30 days: 211,065.

from transformers import AutoProcessor, AutoModelForCausalLM
from qwen_vl_utils import process_vision_info

model_path = "lmms-lab/LLaVA-OneVision-1.5-8B-Instruct"

# default: Load the model on the available device(s)
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype="auto", device_map="auto", trust_remote_code=True
)

# default processor
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
# Move inputs to the device the model was placed on by device_map="auto"
inputs = inputs.to(model.device)

# Inference: generate, then strip the prompt tokens from each output
generated_ids = model.generate(**inputs, max_new_tokens=1024)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

# pip install git+https://github.com/EvolvingLMMs-Lab/lmms-eval.git
accelerate launch --num_processes=8 --main_process_port 12399 -m lmms_eval \
 --model=llava_onevision1_5 \
 --model_args=pretrained=lmms-lab/LLaVA-OneVision-1.5-8B-Instruct,attn_implementation=flash_attention_2,max_pixels=3240000 \
 --tasks=mmmu_val,mmmu_pro_standard,mmbench_en_test,mmerealworld,mmerealworld_cn,ai2d,ai2d_no_mask,vstar_bench,chartqa,charxiv,docvqa_test,mathvista_testmini,mmstar,scienceqa \
 --batch_size=1

Citation

@inproceedings{LLaVA-OneVision-1.5,
  title={LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training},
  author={An, Xiang and Xie, Yin and Yang, Kaicheng and Zhang, Wenkang and Zhao, Xiuwei and Cheng, Zheng and Wang, Yirui and Xu, Songcen and Chen, Changrui and Wu, Chunsheng and Tan, Huajie and Li, Chunyuan and Yang, Jing and Yu, Jie and Wang, Xiyao and Qin, Bin and Wang, Yumeng and Yan, Zizhen and Feng, Ziyong and Liu, Ziwei and Li, Bo and Deng, Jiankang},
  booktitle={arxiv},
  year={2025}
}

@article{lillava,
  title={LLaVA-OneVision: Easy Visual Task Transfer},
  author={Li, Bo and Zhang, Yuanhan and Guo, Dong and Zhang, Renrui and Li, Feng and Zhang, Hao and Zhang, Kaichen and Zhang, Peiyuan and Li, Yanwei and Liu, Ziwei and Li, Chunyuan},
  journal={Transactions on Machine Learning Research},
  year={2024}
}