Open Multimodal Training

LLaVA-OneVision-2: Towards Next-Generation Perceptual Intelligence

Updated Apr 20, 2026 · models
LLaVA-OneVision Contributors

The next generation of fully-open multimodal training — pushing the boundary of recipe transparency, native-resolution understanding, and end-to-end reproducibility.

Qualitative highlight

Codec evidence keeps motion dense where uniform frames go sparse.

The same jump-rope clip is rendered side-by-side on a shared source-video timeline: uniform sampling sees only 128 evenly spaced frames, while codec-selected patches follow the retained temporal evidence.
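To make the sparsity of uniform sampling concrete, here is a minimal sketch of how 128 evenly spaced frame indices could be computed and how far apart they land. The function names, the 3-minute clip length, and the 30 fps rate are illustrative assumptions, not values from the demo above or from the released code.

```python
def uniform_frame_indices(total_frames, num_samples=128):
    """Return `num_samples` evenly spaced frame indices, endpoints included."""
    if num_samples < 2:
        raise ValueError("num_samples must be at least 2")
    if total_frames <= num_samples:
        # Short clip: every frame fits in the budget.
        return list(range(total_frames))
    step = (total_frames - 1) / (num_samples - 1)
    return [round(i * step) for i in range(num_samples)]


def sample_gap_seconds(total_frames, fps, num_samples=128):
    """Time between two neighbouring sampled frames, in seconds."""
    idx = uniform_frame_indices(total_frames, num_samples)
    return (idx[1] - idx[0]) / fps if len(idx) > 1 else 0.0


# Hypothetical clip: 3 minutes at 30 fps (5400 frames).
indices = uniform_frame_indices(total_frames=5400, num_samples=128)
gap = sample_gap_seconds(total_frames=5400, fps=30.0, num_samples=128)
print(f"{len(indices)} frames sampled, one every {gap:.2f} s")
```

Any event shorter than the printed gap can fall entirely between two sampled frames, which is the failure mode the side-by-side view is meant to show.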

Qualitative example

Same timeline, different temporal evidence

Pred event (red flash) · GT event (green box)
GT events stay green; predictions light up at their video time.

Highlights

LLaVA-OneVision-2 is a fully-open recipe for training competitive 8B-class vision-language models: every stage, every dataset, and every weight is reproducible. Below is what makes it different at a glance.

01

Long Video Understanding

Extends video comprehension from 30-second clips to 15-minute footage through a four-stage progressive training pipeline with length-stratified captions.

02

Codec-based Input

Adopts codec-based dense video input that preserves the native temporal signal, enabling fine-grained temporal understanding without information loss.

03

Fully Open Pipeline

Code, training data, evaluation pipelines, and checkpoints: every artifact across all four stages is released with no gated resources.

Roadmap

The OV2 roadmap traces the evolution from early frame and clip sampling to heuristic token compression, learned token selection, and the 2026 codec-aligned paradigm.

Figure 2. Roadmap of video understanding from token compression to codec-aligned perceptual intelligence.

How It Works

Two design choices behind LLaVA-OneVision-2's long-video and unified-modality capability, illustrated.

Figure 3. Codec-style patch selection. Same 54-token budget as uniform sampling, but spans 3× the temporal range by keeping I-frames dense and skimming only motion-rich patches from P-frames.
Figure 4. One encoder, three input modalities. Image, uniform-frame video, and codec-aligned video all flow through the same OneVision-Encoder under shared (t, h, w) positions.
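The selection rule described for Figure 3 can be sketched as a toy in plain Python. This is an illustration under stated assumptions, not the released implementation: the fixed I-frame interval (`gop`), the mean absolute frame difference used as a stand-in for codec motion evidence, the 3×3 patch grid, and all function names are hypothetical. Each kept token carries its (t, h, w) coordinate, matching the shared positions mentioned for Figure 4.

```python
def patch_energy(prev, cur, h, w, patch):
    """Mean absolute difference between two frames inside one patch."""
    total = 0.0
    for y in range(h * patch, (h + 1) * patch):
        for x in range(w * patch, (w + 1) * patch):
            total += abs(cur[y][x] - prev[y][x])
    return total / (patch * patch)


def select_codec_patches(frames, patch=4, gop=6, budget=54):
    """Keep every patch of each I-frame, then spend the rest of the budget
    on the P-frame patches that changed most since the previous frame.

    frames: list of 2D lists (grayscale). Returns sorted (t, h, w) tuples.
    """
    gh = len(frames[0]) // patch
    gw = len(frames[0][0]) // patch
    grid = [(h, w) for h in range(gh) for w in range(gw)]
    dense, candidates = [], []
    for t, frame in enumerate(frames):
        if t % gop == 0:  # assumed I-frame: keep it dense
            dense += [(t, h, w) for h, w in grid]
        else:  # assumed P-frame: score patches by change
            candidates += [
                (patch_energy(frames[t - 1], frame, h, w, patch), t, h, w)
                for h, w in grid
            ]
    remaining = budget - len(dense)
    if remaining < 0:
        raise ValueError("budget too small to keep every I-frame dense")
    candidates.sort(key=lambda c: c[0], reverse=True)
    sparse = [(t, h, w) for _, t, h, w in candidates[:remaining]]
    return sorted(dense + sparse)


def moving_block_clip(num_frames=18, size=12, patch=4):
    """Synthetic clip: a bright block hops along the top row of patches."""
    frames = []
    for t in range(num_frames):
        frame = [[0.0] * size for _ in range(size)]
        x0 = (t % 3) * patch
        for y in range(patch):
            for x in range(x0, x0 + patch):
                frame[y][x] = 1.0
        frames.append(frame)
    return frames


tokens = select_codec_patches(moving_block_clip())
frames_touched = {t for t, _, _ in tokens}
print(len(tokens), "tokens spread over", len(frames_touched), "frames")
```

In this toy, 54 tokens would cover only six frames if each frame were kept whole (9 patches per frame), while the selection above reaches well beyond six frames because static P-frame patches are skipped.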

Benchmarks

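Table 1a. Video Benchmarks. Updated with current evaluation results.

| Benchmark | LLaVA-OneVision-2 (8B) | Qwen3-VL (8B) | Keye-VL-1.5 (8B) | InternVL-3.5 (8B) | PLM (8B) | LLaVA-OV-1.5 (8B) |
| --- | --- | --- | --- | --- | --- | --- |
|  | 71.9 | 71.4 | 73.0 | 65.9 | 60.5 | 61.1 |
|  | 76.3 | 75.6 | 76.2 | 68.6 | 65.6 | 65.5 |
|  | 19.9 | 18.2 | 14.1 | 14.6 | 8.7 | 9.1 |
|  | 55.5 | 58.0 | 42.8 | 46.7 | 44.5 | 40.1 |
|  | 61.5 | 59.2 | 54.9 | 50.1 | 47.2 | 44.8 |
|  | 66.2 | 69.0 | 56.9 | 72.1 | 77.1 | 51.2 |
|  | 82.5 | 83.4 | 75.8 | 82.0 | 84.1 | 73.7 |
|  | 74.5 | 74.3 | 75.5 | 70.4 | 72.7 | 57.5 |
|  | 76.6 | 78.1 | 75.0 | 71.0 | 66.4 | 62.1 |
|  | 66.9 | 68.0 | 66.0 | 62.4 | 59.6 | 56.2 |
|  | 56.2 | 58.7 | 68.3 | 60.2 | 43.3 | 50.1 |
|  | 39.5 | 40.6 | 35.3 | 36.1 | 26.2 | 30.7 |
|  | 53.5 | 48.3 | 45.4 | 27.8 | 34.5 | 15.6 |
|  | 53.8 | 46.8 | 41.3 | 31.3 | 7.6 | 17.7 |
|  | 66.4 | 59.4 | 55.5 | 31.3 | 4.2 | 21.0 |
|  | 74.9 | 30.1 | 39.6 | 11.0 | 13.1 | 2.1 |
|  | 70.9 | 59.1 | 36.4 | 56.0 | 27.9 | 30.2 |
|  | 57.6 | 48.9 | 32.4 | 47.9 | 30.7 | 33.5 |
| Average | 62.5 | 58.2 | 53.6 | 50.3 | 43.0 | 40.1 |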

Table 1b. Spatial Benchmarks. Updated with current evaluation results.

| Benchmark | LLaVA-OneVision-2 (8B) | Qwen3-VL (8B) | Keye-VL-1.5 (8B) | InternVL-3.5 (8B) | PLM (8B) | LLaVA-OV-1.5 (8B) |
| --- | --- | --- | --- | --- | --- | --- |
|  | 77.3 | 77.7 | 75.2 | 75.0 | 77.0 | 74.8 |
|  | 69.1 | 68.7 | 59.2 | 65.7 | 45.4 | 67.1 |
|  | 43.3 | 42.3 | 38.3 | 41.8 | 44.3 | 41.5 |
|  | 82.6 | 81.0 | 78.2 | 77.9 | 80.6 | 76.5 |
|  | 92.8 | 92.3 | 82.0 | 86.3 | 82.4 | 82.9 |
|  | 61.9 | 26.9 | 20.2 | 20.2 | 15.7 | 15.9 |
|  | 78.1 | 77.5 | 66.3 | 73.2 | 73.5 | 64.2 |
|  | 69.3 | 69.3 | 62.7 | 54.7 | 36.7 | 61.3 |
|  | 29.6 | 31.0 | 26.7 | 28.1 | 31.4 | 28.3 |
|  | 63.5 | 65.1 | 52.2 | 55.7 | 56.0 | 48.3 |
|  | 31.0 | 8.0 | 3.0 | 4.0 | 1.0 | 1.0 |
| Average | 63.5 | 58.2 | 51.3 | 53.0 | 49.5 | 51.1 |

Table 1c. Image Benchmarks. Updated with current evaluation results.

| Benchmark | LLaVA-OneVision-2 (8B) | Qwen3-VL (8B) | Keye-VL-1.5 (8B) | InternVL-3.5 (8B) | PLM (8B) | LLaVA-OV-1.5 (8B) |
| --- | --- | --- | --- | --- | --- | --- |
|  | 64.8 | 62.9 | 73.6 | 66.6 | 57.9 | 67.9 |
|  | 85.7 | 84.9 | 88.5 | 87.9 | 80.2 | 85.6 |
|  | 95.2 | 95.7 | 94.9 | 92.3 | 94.6 | 97.8 |
|  | 85.9 | 85.1 | 84.7 | 86.7 | 85.5 | 86.5 |
|  | 74.4 | 83.4 | 76.9 | 79.1 | 80.0 | 79.1 |
|  | 78.2 | 84.7 | 84.8 | 84.0 | 83.2 | 82.6 |
|  | 84.3 | 83.6 | 86.0 | 84.0 | 92.7 | 84.0 |
|  | 85.9 | 85.3 | 78.0 | 81.7 | 71.2 | 77.5 |
|  | 89.0 | 89.8 | 83.1 | 75.6 | 91.8 | 87.8 |
|  | 64.0 | 62.4 | 55.6 | 61.8 | 68.0 | 63.1 |
|  | 69.7 | 69.4 | 69.8 | 63.1 | 72.7 | 68.1 |
| Average | 79.7 | 80.7 | 79.6 | 78.4 | 79.8 | 80.0 |

Table 1d. Tracking Benchmarks. Referring video object segmentation and reasoning.

| Benchmark | LLaVA-OneVision-2 (8B) | Qwen3-VL (8B) | Keye-VL-1.5 (8B) | InternVL-3.5 (8B) | PLM (8B) | LLaVA-OV-1.5 (8B) |
| --- | --- | --- | --- | --- | --- | --- |
|  | 52.7 | 39.7 | 14.6 | 12.8 | 7.8 | 11.9 |
|  | 58.7 | 41.3 | 5.8 | 4.7 | 2.0 | 4.1 |
|  | 37.1 | 29.9 | 10.1 | 7.2 | 5.0 | 7.3 |
|  | 45.7 | 28.4 | 7.2 | 7.5 | 7.6 | 6.1 |
|  | 60.8 | 40.7 | 22.1 | 22.2 | 6.8 | 16.8 |
|  | 58.2 | 37.8 | 10.7 | 10.2 | 8.5 | 13.0 |
|  | 27.4 | 24.7 | 9.9 | 7.9 | 0.1 | 6.2 |
|  | 29.2 | 21.9 | 9.6 | 9.2 | 10.2 | 9.7 |
| Average | 46.2 | 33.1 | 11.3 | 10.2 | 6.0 | 9.4 |

Codec vs Frame Sampling

At equal token budgets, codec-stream input beats uniform frame sampling at the lowest frame budget on all seven benchmarks shown, which is exactly the regime where uniform sampling fails the model.

Figure · Codec vs Frame. Codec sampling unlocks low-frame regimes.

Each panel plots the benchmark metric against frame budget (log scale). Endpoint scores read from the panels:

| Benchmark | Lowest budget | Codec | Uniform | Δ | Highest budget | Codec | Uniform | Δ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| QVHighlights | 4 frames | 27.7 | 12.3 | +15.4 | 64 frames | 63.5 | 61.7 | +1.8 |
| Charades-STA | 4 frames | 42.4 | 17.4 | +25.0 | 64 frames | 50.1 | 53.5 | -3.4 |
| ActivityNet | 4 frames | 23.1 | 12.0 | +11.1 | 64 frames | 51.2 | 48.9 | +2.3 |
| LVBench | 16 frames | 40.9 | 38.9 | +2.0 | 128 frames | 49.5 | 47.7 | +1.8 |
| VideoMME-long (w/ sub) | 8 frames | 56.6 | 55.1 | +1.5 | 128 frames | 67.0 | 67.1 | -0.1 |
| VideoEval-Pro | 8 frames | 46.4 | 43.1 | +3.3 | 128 frames | 55.8 | 54.0 | +1.8 |
| JumpScore | 4 frames | 39.4 | 32.5 | +6.9 | 128 frames | 74.9 | 45.4 | +29.5 |

Figure. Codec-stream input vs uniform frame sampling across seven video and temporal grounding benchmarks. At equal token budgets, codec sampling wins at the lowest frame budget on every benchmark. On six of the seven the gap narrows as the budget grows, and uniform sampling edges ahead at the highest budget on Charades-STA (-3.4) and VideoMME-long with subtitles (-0.1). JumpScore is the exception: its gap widens from +6.9 at 4 frames to +29.5 at 128 frames.
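As a quick check on the figure's annotations, the snippet below recomputes the codec-minus-uniform gain at each panel's lowest and highest frame budget. The endpoint scores are the ones printed in the panels; reading the first value of each pair as the codec score is an inference from the signed deltas, and the variable and function names are illustrative.

```python
# (benchmark, low budget, codec, uniform, high budget, codec, uniform)
ENDPOINTS = [
    ("QVHighlights", 4, 27.7, 12.3, 64, 63.5, 61.7),
    ("Charades-STA", 4, 42.4, 17.4, 64, 50.1, 53.5),
    ("ActivityNet", 4, 23.1, 12.0, 64, 51.2, 48.9),
    ("LVBench", 16, 40.9, 38.9, 128, 49.5, 47.7),
    ("VideoMME-long (w/ sub)", 8, 56.6, 55.1, 128, 67.0, 67.1),
    ("VideoEval-Pro", 8, 46.4, 43.1, 128, 55.8, 54.0),
    ("JumpScore", 4, 39.4, 32.5, 128, 74.9, 45.4),
]


def endpoint_gains(rows):
    """Map benchmark -> (gain at lowest budget, gain at highest budget)."""
    return {
        name: (round(lo_c - lo_u, 1), round(hi_c - hi_u, 1))
        for name, _, lo_c, lo_u, _, hi_c, hi_u in rows
    }


GAINS = endpoint_gains(ENDPOINTS)
# Benchmarks where uniform sampling is ahead at the highest budget.
trails_at_high = [name for name, (_, hi) in GAINS.items() if hi < 0]
# Benchmarks where the codec advantage grows with the budget.
gap_widens = [name for name, (lo, hi) in GAINS.items() if hi > lo]

for name, (lo, hi) in GAINS.items():
    print(f"{name:24s} low {lo:+.1f}  high {hi:+.1f}")
```

Listing the two exception sets separately keeps the low-budget claim (codec ahead everywhere) distinct from the high-budget picture, where the outcome is mixed.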

Video Caption Dataset

A length-stratified video caption corpus spanning 30 seconds to 15 minutes, totaling roughly 8M captioned clips, 95.1B image tokens, and 9.9B caption tokens.

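| Bucket | Samples | Storage | Image Tokens | Caption Tokens |
| --- | --- | --- | --- | --- |
| 30s caption | 4.2M | 29 TB | 24.7B | 3.0B |
| 30–60s video caption | 2.7M | 32 TB | 31.8B | 2.3B |
| 60–180s video caption | 700K | 13 TB | 12.3B | 0.7B |
| 10–15min caption | 350K | 65 TB | 26.3B | 4.0B |
| Total | ~8M | ~139 TB | 95.1B | 9.9B |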

Image tokens are computed at 392×392 input, ViT patch size 14, and vision merge size 2×2 for 196 visual tokens per frame. Caption tokens are measured with the Qwen3 tokenizer over 1,500 sampled clips per bucket, then scaled by row count.
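The image-token column can be reproduced from the settings stated above. The sketch below assumes the per-clip frame counts listed for each caption set in the training pipeline (30, 60, 90, and 384 frames) and uses the rounded sample counts from the table; the function and variable names are illustrative.

```python
def tokens_per_frame(side=392, patch=14, merge=2):
    """Visual tokens for one square frame after ViT patching and merging."""
    patches_per_side = side // patch          # 392 / 14 = 28
    return (patches_per_side // merge) ** 2   # (28 / 2) ** 2 = 196


# bucket -> (samples, frames per clip)
BUCKETS = {
    "30s caption": (4_200_000, 30),
    "30-60s video caption": (2_700_000, 60),
    "60-180s video caption": (700_000, 90),
    "10-15min caption": (350_000, 384),
}


def image_tokens_billions(buckets):
    """Total image tokens per bucket, in billions."""
    per_frame = tokens_per_frame()
    return {
        name: samples * frames * per_frame / 1e9
        for name, (samples, frames) in buckets.items()
    }


totals = image_tokens_billions(BUCKETS)
for name, value in totals.items():
    print(f"{name:24s} {value:6.1f}B")
print(f"{'Total':24s} {sum(totals.values()):6.1f}B")
```

Rounded to one decimal, these products match the table: samples × frames × 196 tokens per bucket, summing to 95.1B.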

Training Pipeline

The full LLaVA-OneVision-2 recipe runs in four stages — each stage upgrades a different capability of the model. No instruction data is synthesized; the only synthesized data are video captions.

S1

Stage 1 — Bootstrap from LLaVA-OneVision-1.5 + 30s Video Caption

Lift the image-pretrained LLaVA-OneVision-1.5 8B into a video-aware model by mixing in short 30-second clip captions.

  • a. LLaVA-OneVision-1.5-Mid-Training-85M: 85M concept-balanced image-text pairs (20M ZH + 65M EN).
  • b. 30s-Video-Caption-4.2M: 4.2M clips, 30 frames @ 392×392. (NEW)
S2

Stage 2 — Instruction Tuning + 30–60s Video Caption

Scale up to large-scale multimodal instruction data and extend video understanding to medium-length clips, from 30–60s up to 60–180s.

  • a. LLaVA-OneVision-1.5-Instruct-Data: 22M multimodal instruction samples.
  • b. HuggingFaceM4/FineVision: 24M instruction samples.
  • c. 30s-60s-Video-Caption-2.7M: medium-length clips, 60 frames @ 392×392. (NEW)
  • d. 60s-180s-Video-Caption-700K: minute-scale clips, 90 frames @ 392×392. (NEW)
S3

Stage 3 — Long Video Understanding

Push the model to long-form video reasoning by combining 10–15 minute captions with established video instruction corpora.

  • a. LLaVA-OneVision-1.5-Instruct-Data: 22M multimodal instruction samples.
  • b. HuggingFaceM4/FineVision: 24M instruction samples.
  • c. lmms-lab/LLaVA-Video-178K: 1.6M video instruction samples (captions, open-ended and MC QA).
  • d. OpenGVLab/VideoChat-Flash-Training-Data: long-context video instruction data.
  • e. 10min-15min-Video-Caption-350K: long videos, 384 frames @ 392×392. (NEW)
S4

Stage 4 — Longer Video + Improved Codec + Spatial & Tracking

Extend to longer videos with an improved codec and denser frame sampling up to 768f, then inject spatial reasoning and video tracking supervision.

  • a. LLaVA-OneVision-1.5-Instruct-Data: 22M multimodal instruction samples.
  • b. HuggingFaceM4/FineVision: 24M instruction samples.
  • c. allenai/Molmo2-VideoTrack + allenai/Molmo2-VideoPoint: point-based video tracking and spatio-temporal pointing.
  • d. 10min-15min-Video-Caption-350K (re-encoded): long videos with the new codec, 384 frames @ 392×392. (NEW)
  • e. 10min-15min-Video-Caption-350K @ 768f: the same corpus densified to 768 frames @ 392×392. (NEW)
  • f. LLaVA-OneVision-2-Spatial-4M: 4M in-house spatial understanding samples. (NEW)
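
The four stages can be summarized as a small schedule that shows how the per-clip visual sequence grows. This is a sketch, not the released training config: the `Stage` class and its field names are hypothetical, the frame counts are the ones listed for each stage's caption sets, and the token counts assume every frame is encoded densely at 196 tokens (392×392 input, patch size 14, 2×2 merge), so they are an upper bound when codec-based selection keeps fewer patches.

```python
from dataclasses import dataclass

TOKENS_PER_FRAME = 196  # 392x392 input, patch 14, 2x2 merge


@dataclass(frozen=True)
class Stage:
    name: str
    goal: str
    video_frames: tuple[int, ...]  # frames per clip for this stage's caption sets


STAGES = (
    Stage("S1", "bootstrap from LLaVA-OneVision-1.5 + 30s captions", (30,)),
    Stage("S2", "instruction tuning + medium-length captions", (60, 90)),
    Stage("S3", "long video understanding", (384,)),
    Stage("S4", "longer video, improved codec, spatial and tracking", (384, 768)),
)


def max_dense_visual_tokens(stage):
    """Upper bound on visual tokens for the longest clip in a stage."""
    return max(stage.video_frames) * TOKENS_PER_FRAME


for stage in STAGES:
    print(f"{stage.name}: up to {max_dense_visual_tokens(stage):,} visual tokens per clip")
```

Under these assumptions the longest clip grows from a few thousand visual tokens in Stage 1 to roughly 150K in Stage 4, which is why the frame budget is raised progressively rather than all at once.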

Visual Encoder Pretraining (OneVision-Encoder)

OneVision-Encoder extends native-resolution training to more elongated aspect ratios and raises context capacity for high-density documents and frame-rich video.

Figure 6. OneVision-Encoder architecture overview.

Open-Source Resources

The OV2 site ships a small but complete release stack: training code, a public demo surface, the 8B instruct checkpoint, and the full training dataset collection.

Code Demos

Run LLaVA-OneVision-2-8B-Instruct from a HuggingFace transformers checkpoint (trust_remote_code=True). Two video backends are available: uniform frame sampling, and a codec-aware canvas-packing backend recommended for long videos.

inference.py
python
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText
from PIL import Image

MODEL_ID = "lmms-lab-encoder/LLaVA-OneVision-2-8B-Instruct"

processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(
    MODEL_ID, trust_remote_code=True, dtype=torch.bfloat16, device_map="cuda",
).eval()

# ----- Image -----
image = Image.open("cat.jpg").convert("RGB")
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Describe this image in detail."},
]}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt", padding=True)
inputs = {k: v.to("cuda") if hasattr(v, "to") else v for k, v in inputs.items()}

out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(processor.tokenizer.decode(out[0, inputs["input_ids"].shape[-1]:], skip_special_tokens=True))

# ----- Video -----
# Lower max_pixels if you hit OOM on long videos.
processor.video_processor.max_pixels = 200704

messages = [{"role": "user", "content": [
    {"type": "video"},
    {"type": "text", "text": "Describe what happens in this video."},
]}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(
    text=[text], videos=["clip.mp4"], return_tensors="pt", padding=True,
    num_frames=16,  # exact frame count; or use target_fps / max_frames
)
inputs = {k: v.to("cuda") if hasattr(v, "to") else v for k, v in inputs.items()}
out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(processor.tokenizer.decode(out[0, inputs["input_ids"].shape[-1]:], skip_special_tokens=True))

Task Demos

Qualitative results across four downstream capabilities: temporal grounding, referring video segmentation and tracking, spatial grounding, and real-world video manipulation.

Citation

citation.bib
bibtex
@article{llava_onevision_2_2026,
  title   = {LLaVA-OneVision-2: Open Multimodal Training at Scale},
  author  = {Xiang An and Yin Xie and Kaicheng Yang and Wenkang Zhang and Xiuwei Zhao and Zheng Cheng and Yirui Wang and Songcen Xu and Changrui Chen and Didi Zhu and Chunsheng Wu and Huajie Tan and Chunyuan Li and Jing Yang and Jie Yu and Xiyao Wang and Bin Qin and Yumeng Wang and Zizhen Yan and Ziyong Feng and Ziwei Liu and Bo Li and Jiankang Deng},
  journal = {arXiv preprint arXiv:TBD},
  year    = {2026}
}

References

  1. LLaVA-OneVision: Easy Visual Task Transfer. Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li.
    TMLR · 2024 · arXiv:2408.03326
  2. LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training. Xiang An, Yin Xie, Kaicheng Yang, Wenkang Zhang, Xiuwei Zhao, Zheng Cheng, Yirui Wang, Songcen Xu, Changrui Chen, Didi Zhu, Chunsheng Wu, Huajie Tan, Chunyuan Li, Jing Yang, Jie Yu, Xiyao Wang, Bin Qin, Yumeng Wang, Zizhen Yan, Ziyong Feng, Ziwei Liu, Bo Li, and Jiankang Deng.
    arXiv · 2025 · arXiv:2509.23661
  3. OneVision-Encoder: Codec-Aligned Sparsity as a Foundational Principle for Multimodal Intelligence. Feilong Tang, Xiang An, Yunyao Yan, Yin Xie, Bin Qin, Kaicheng Yang, Yifei Shen, Yuanhan Zhang, Chunyuan Li, Shikun Feng, Changrui Chen, Huajie Tan, Ming Hu, Manyuan Zhang, Bo Li, Ziyong Feng, Ziwei Liu, Zongyuan Ge, and Jiankang Deng.
    arXiv · 2026 · arXiv:2602.08683
  4. Visual Instruction Tuning. Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee.
    NeurIPS · 2023 · arXiv:2304.08485
  5. Qwen3-VL Technical Report. Qwen Team.
    Tech Report · 2025 · github.com/QwenLM/Qwen3-VL
  6. InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency. Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, Zhaokai Wang, Zhe Chen, Hongjie Zhang, Ganlin Yang, Haomin Wang, Qi Wei, Jinhui Yin, Wenhao Li, Erfei Cui, et al.
    Tech Report · 2025 · arXiv:2508.18265
  7. PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding. Jang Hyun Cho, Andrea Madotto, Effrosyni Mavroudi, Triantafyllos Afouras, Tushar Nagarajan, Muhammad Maaz, Yale Song, Tengyu Ma, Shuming Hu, Suyog Jain, Miguel Martin, Huiyu Wang, Hanoona Rasheed, Peize Sun, Po-Yao Huang, Daniel Bolya, Nikhila Ravi, Shashank Jain, Tammy Stark, Shane Moon, Babak Damavandi, Vivian Lee, Andrew Westbury, Salman Khan, Philipp Krähenbühl, Piotr Dollár, Lorenzo Torresani, Kristen Grauman, and Christoph Feichtenhofer.
    arXiv · 2025 · arXiv:2504.13180
  8. Kwai Keye-VL 1.5 Technical Report. Biao Yang, Bin Wen, Boyang Ding, Changyi Liu, Chenglong Chu, Chengru Song, Chongling Rao, Chuan Yi, Da Li, Dunju Zang, Fan Yang, Guorui Zhou, Guowang Zhang, Han Shen, Hao Peng, Haojie Ding, Hao Wang, Haonan Fan, Hengrui Ju, et al.
    arXiv · 2025 · arXiv:2509.01563