LLaVA-OneVision-2: Towards Next-Generation Perceptual Intelligence

Qualitative highlight

Codec evidence keeps motion dense where uniform frames go sparse.

The same jump-rope clip is rendered side by side on a shared source-video timeline: uniform sampling sees only 128 evenly spaced frames, while codec-selected patches follow the retained temporal evidence.

Qualitative example

Same timeline, different temporal evidence

Legend: Pred event (red flash) · GT event (green box)

GT events stay green; predictions light up at their video time.
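
For reference, the uniform baseline in this comparison can be sketched as picking evenly spaced frame indices across the clip. This is a minimal sketch, not the project's sampler: the function name `uniform_frame_indices` and the use of `numpy.linspace` with rounding are assumptions for illustration; only the count of 128 frames comes from the text above.

```python
import numpy as np

def uniform_frame_indices(total_frames: int, num_samples: int = 128) -> list[int]:
    """Return evenly spaced frame indices covering the whole clip."""
    if total_frames <= 0 or num_samples <= 0:
        raise ValueError("total_frames and num_samples must be positive")
    # Never ask for more frames than the clip contains.
    num_samples = min(num_samples, total_frames)
    # First and last frames are always included; the rest are spread evenly.
    positions = np.linspace(0, total_frames - 1, num_samples)
    return positions.round().astype(int).tolist()

idx = uniform_frame_indices(3000, 128)
```

With a 3,000-frame clip, consecutive samples land roughly 24 frames apart, so any motion shorter than that gap can fall between samples.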

Highlights

LLaVA-OneVision-2 is a fully open recipe for training competitive 8B-class vision-language models: every stage, dataset, and weight is reproducible. The highlights below summarize what sets it apart.

Long Video Understanding

Extends video comprehension from 30-second clips to 15-minute footage through a four-stage progressive training pipeline with length-stratified captions.

Codec-based Input

Adopts codec-based dense video input that preserves the native temporal signal, enabling fine-grained temporal understanding without information loss.

Fully Open Pipeline

Code, training data, evaluation pipelines, and checkpoints: every artifact across all four stages is released with no gated resources.

Roadmap

The OV2 roadmap traces the evolution from early frame and clip sampling to heuristic token compression, learned token selection, and the 2026 codec-aligned paradigm.

Figure 2. Roadmap of video understanding from token compression to codec-aligned perceptual intelligence.

How It Works

Two design choices behind LLaVA-OneVision-2's long-video and unified-modality capability, illustrated.

Figure 3. Codec-style patch selection. The selection uses the same 54-token budget as uniform sampling but spans 3× the temporal range by keeping I-frames dense and taking only motion-rich patches from P-frames.
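
The idea in Figure 3 can be approximated with a toy selector. This is a sketch under stated assumptions, not the released codec pipeline: every `gop`-th frame is treated as an I-frame, frame-difference energy stands in for real codec motion vectors and residuals, and the 3×3 patch grid, the group size of 6, and the function name `select_codec_style_patches` are all hypothetical. Only the 54-token budget comes from the caption.

```python
import numpy as np

def select_codec_style_patches(video, patch=4, gop=6, budget=54):
    """Pick (t, row, col) patch indices from a (T, H, W) grayscale clip.

    Assumption: every `gop`-th frame is an I-frame and is kept whole; all
    other frames are P-frames scored by frame-difference energy, a crude
    stand-in for codec motion vectors and residuals.
    """
    T, H, W = video.shape
    rows, cols = H // patch, W // patch
    frames = video[:, :rows * patch, :cols * patch].astype(np.float32)

    # I-frames: keep every patch (dense).
    selected = [(t, r, c) for t in range(0, T, gop)
                for r in range(rows) for c in range(cols)]
    if len(selected) > budget:
        raise ValueError("budget too small for the dense I-frames")

    # P-frames: per-patch mean absolute difference to the previous frame.
    scored = []
    for t in range(1, T):
        if t % gop == 0:
            continue
        diff = np.abs(frames[t] - frames[t - 1])
        energy = diff.reshape(rows, patch, cols, patch).mean(axis=(1, 3))
        scored += [(float(energy[r, c]), t, r, c)
                   for r in range(rows) for c in range(cols)]

    # Spend the remaining budget on the most motion-rich patches; skip static ones.
    scored.sort(key=lambda s: s[0], reverse=True)
    remaining = budget - len(selected)
    selected += [(t, r, c) for e, t, r, c in scored[:remaining] if e > 0]
    return sorted(selected)

# Toy clip: 18 frames of 12x12 pixels (a 3x3 patch grid); only the top-left patch changes.
clip = np.zeros((18, 12, 12), dtype=np.float32)
for t in range(18):
    clip[t, :4, :4] = t

patches = select_codec_style_patches(clip)
frames_covered = sorted({t for t, _, _ in patches})
```

In this toy setting a 54-token budget buys 6 full frames under uniform sampling (9 patches each), whereas the selection above touches all 18 frames with 42 tokens. The numbers were chosen to mirror the 3× coverage in the caption; they are not taken from the figure.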

Figure 4. One encoder, three input modalities. Image, uniform-frame video, and codec-aligned video all flow through the same OneVision-Encoder under shared (t, h, w) positions.
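
One way to picture the shared (t, h, w) positions in Figure 4 is to build the position triples for each input type. The helper names below are hypothetical and the encoder's actual position scheme may differ; the sketch only assumes that an image is a single frame at t = 0, that uniformly sampled frames keep their source time index, and that codec input contributes one triple per retained patch.

```python
import numpy as np

def grid_positions(t_indices, rows, cols):
    """All (t, h, w) triples for full frames at the given time indices."""
    t, h, w = np.meshgrid(np.asarray(t_indices), np.arange(rows),
                          np.arange(cols), indexing="ij")
    return np.stack([t.ravel(), h.ravel(), w.ravel()], axis=1)

def image_positions(rows, cols):
    # A still image is treated as one frame at t = 0.
    return grid_positions([0], rows, cols)

def uniform_video_positions(frame_indices, rows, cols):
    # Each sampled frame keeps its source-video time index.
    return grid_positions(frame_indices, rows, cols)

def codec_video_positions(patches):
    # Sparse input: one row per retained (t, row, col) patch.
    return np.array(sorted(patches), dtype=np.int64).reshape(-1, 3)
```

All three functions return an array of shape (N, 3), which is what would let a single encoder consume dense and sparse inputs through the same interface.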

Benchmarks

Table 1a. Video Benchmarks. Updated with current evaluation results.

Benchmark	LLaVA-OneVision-2 8B	Qwen3-VL 8B	Keye-VL-1.5 8B	InternVL-3.5 8B	PLM 8B	LLaVA-OV-1.5 8B
	71.9	71.4	73.0	65.9	60.5	61.1
	76.3	75.6	76.2	68.6	65.6	65.5
	19.9	18.2	14.1	14.6	8.7	9.1
	55.5	58.0	42.8	46.7	44.5	40.1
	61.5	59.2	54.9	50.1	47.2	44.8
	66.2	69.0	56.9	72.1	77.1	51.2
	82.5	83.4	75.8	82.0	84.1	73.7
	74.5	74.3	75.5	70.4	72.7	57.5
	76.6	78.1	75.0	71.0	66.4	62.1
	66.9	68.0	66.0	62.4	59.6	56.2
	56.2	58.7	68.3	60.2	43.3	50.1
	39.5	40.6	35.3	36.1	26.2	30.7
	53.5	48.3	45.4	27.8	34.5	15.6
	53.8	46.8	41.3	31.3	7.6	17.7
	66.4	59.4	55.5	31.3	4.2	21.0
	74.9	30.1	39.6	11.0	13.1	2.1
	70.9	59.1	36.4	56.0	27.9	30.2
	57.6	48.9	32.4	47.9	30.7	33.5
Average	62.5	58.2	53.6	50.3	43.0	40.1

Table 1b. Spatial Benchmarks. Updated with current evaluation results.

Benchmark	LLaVA-OneVision-2 8B	Qwen3-VL 8B	Keye-VL-1.5 8B	InternVL-3.5 8B	PLM 8B	LLaVA-OV-1.5 8B
	77.3	77.7	75.2	75.0	77.0	74.8
	69.1	68.7	59.2	65.7	45.4	67.1
	43.3	42.3	38.3	41.8	44.3	41.5
	82.6	81.0	78.2	77.9	80.6	76.5
	92.8	92.3	82.0	86.3	82.4	82.9
	61.9	26.9	20.2	20.2	15.7	15.9
	78.1	77.5	66.3	73.2	73.5	64.2
	69.3	69.3	62.7	54.7	36.7	61.3
	29.6	31.0	26.7	28.1	31.4	28.3
	63.5	65.1	52.2	55.7	56.0	48.3
	31.0	8.0	3.0	4.0	1.0	1.0
Average	63.5	58.2	51.3	53.0	49.5	51.1

Table 1c. Image Benchmarks. Updated with current evaluation results.

Benchmark	LLaVA-OneVision-2 8B	Qwen3-VL 8B	Keye-VL-1.5 8B	InternVL-3.5 8B	PLM 8B	LLaVA-OV-1.5 8B
	64.8	62.9	73.6	66.6	57.9	67.9
	85.7	84.9	88.5	87.9	80.2	85.6
	95.2	95.7	94.9	92.3	94.6	97.8
	85.9	85.1	84.7	86.7	85.5	86.5
	74.4	83.4	76.9	79.1	80.0	79.1
	78.2	84.7	84.8	84.0	83.2	82.6
	84.3	83.6	86.0	84.0	92.7	84.0
	85.9	85.3	78.0	81.7	71.2	77.5
	89.0	89.8	83.1	75.6	91.8	87.8
	64.0	62.4	55.6	61.8	68.0	63.1
	69.7	69.4	69.8	63.1	72.7	68.1
Average	79.7	80.7	79.6	78.4	79.8	80.0

Table 1d. Tracking Benchmarks. Referring video object segmentation and reasoning.

Benchmark	LLaVA-OneVision-2 8B	Qwen3-VL 8B	Keye-VL-1.5 8B	InternVL-3.5 8B	PLM 8B	LLaVA-OV-1.5 8B
	52.7	39.7	14.6	12.8	7.8	11.9
	58.7	41.3	5.8	4.7	2.0	4.1
	37.1	29.9	10.1	7.2	5.0	7.3
	45.7	28.4	7.2	7.5	7.6	6.1
	60.8	40.7	22.1	22.2	6.8	16.8
	58.2	37.8	10.7	10.2	8.5	13.0
	27.4	24.7	9.9	7.9	0.1	6.2
	29.2	21.9	9.6	9.2	10.2	9.7
Average	46.2	33.1	11.3	10.2	6.0	9.4
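
The Average rows appear to be unweighted means of the per-benchmark scores. That is an inference from the numbers rather than a stated protocol; the quick check below reproduces the Table 1d averages for two columns, with `column_average` as an illustrative helper name.

```python
def column_average(scores, ndigits=1):
    """Unweighted mean of a score column, rounded like the table."""
    return round(sum(scores) / len(scores), ndigits)

# Columns copied from Table 1d (tracking benchmarks).
ov2_tracking = [52.7, 58.7, 37.1, 45.7, 60.8, 58.2, 27.4, 29.2]
ov15_tracking = [11.9, 4.1, 7.3, 6.1, 16.8, 13.0, 6.2, 9.7]

ov2_avg = column_average(ov2_tracking)
ov15_avg = column_average(ov15_tracking)
```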

Codec vs Frame Sampling

At equal token budgets, codec-stream input consistently wins under tight frame budgets, exactly the regime where uniform sampling fails the model.

Figure: Codec vs Frame. Codec sampling unlocks low-frame regimes.

Figure. Codec-stream input vs uniform frame sampling across seven video and temporal grounding benchmarks. At equal token budgets, codec sampling wins under tight frame budgets, with the largest gains at the lowest frame counts.

Video Caption Dataset

A length-stratified video caption corpus spanning 30 seconds to 15 minutes, totaling roughly 8M captioned clips, 95.1B image tokens, and 9.9B caption tokens.

Bucket	Samples	Storage	Image Tokens	Caption Tokens
30s caption	4.2M	29 TB	24.7B	3.0B
30–60s video caption	2.7M	32 TB	31.8B	2.3B
60–180s video caption	700K	13 TB	12.3B	0.7B
10–15min caption	350K	65 TB	26.3B	4.0B
Total	~8M	~139 TB	95.1B	9.9B

Image tokens are computed at 392×392 input, ViT patch size 14, and vision merge size 2×2 for 196 visual tokens per frame. Caption tokens are measured with the Qwen3 tokenizer over 1,500 sampled clips per bucket, then scaled by row count.
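
The token accounting can be reproduced with a few lines of arithmetic. The sketch below uses the stated 392×392 input, patch size 14, and 2×2 merge, together with the per-bucket frame counts listed for each caption set in the training pipeline (30, 60, 90, and 384 frames). The helper names are illustrative, and `estimate_caption_tokens` is one plausible reading of "scaled by row count", not a confirmed procedure.

```python
def tokens_per_frame(side=392, patch=14, merge=2):
    """Visual tokens for one square frame after patchify and spatial merge."""
    patches_per_side = side // patch          # 392 / 14 = 28
    return (patches_per_side // merge) ** 2   # (28 / 2) ** 2 = 196

# (clip count, frames per clip) for each bucket of the corpus.
BUCKETS = {
    "30s": (4_200_000, 30),
    "30-60s": (2_700_000, 60),
    "60-180s": (700_000, 90),
    "10-15min": (350_000, 384),
}

def image_tokens_billions(clips, frames):
    return round(clips * frames * tokens_per_frame() / 1e9, 1)

def estimate_caption_tokens(sampled_token_counts, total_rows):
    """Scale the mean token count of a sample up to the full bucket."""
    mean_tokens = sum(sampled_token_counts) / len(sampled_token_counts)
    return mean_tokens * total_rows

per_bucket = {name: image_tokens_billions(c, f) for name, (c, f) in BUCKETS.items()}
total = round(sum(c * f * tokens_per_frame() for c, f in BUCKETS.values()) / 1e9, 1)
```

Under these inputs the per-bucket results match the Image Tokens column, and the total comes to 95.1B.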

Training Pipeline

The full LLaVA-OneVision-2 recipe runs in four stages, each upgrading a different capability of the model. No instruction data is synthesized; the only synthesized data are video captions.

Stage 1 — Bootstrap from LLaVA-OneVision-1.5 + 30s Video Caption

Lift the image-pretrained LLaVA-OneVision-1.5 8B into a video-aware model by mixing in short 30-second clip captions.

a. LLaVA-OneVision-1.5-Mid-Training-85M: 85M concept-balanced image-text pairs (20M ZH + 65M EN).
b. 30s-Video-Caption-4.2M: 4.2M clips, 30 frames @ 392×392. (new)

Stage 2 — Instruction Tuning + 30–60s Video Caption

Scale up to large-scale multimodal instruction data and extend video understanding to medium-length clips of 30–180 s.

a. LLaVA-OneVision-1.5-Instruct-Data: 22M multimodal instruction samples.
b. HuggingFaceM4/FineVision: 24M instruction samples.
c. 30s-60s-Video-Caption-2.7M: medium-length clips, 60 frames @ 392×392. (new)
d. 60s-180s-Video-Caption-700K: minute-scale clips, 90 frames @ 392×392. (new)

Stage 3 — Long Video Understanding

Push the model to long-form video reasoning by combining 10–15 minute captions with established video instruction corpora.

a. LLaVA-OneVision-1.5-Instruct-Data: 22M multimodal instruction samples.
b. HuggingFaceM4/FineVision: 24M instruction samples.
c. lmms-lab/LLaVA-Video-178K: 1.6M video instruction samples (captions, open-ended and MC QA).
d. OpenGVLab/VideoChat-Flash-Training-Data: long-context video instruction data.
e. 10min-15min-Video-Caption-350K: long videos, 384 frames @ 392×392. (new)

Stage 4 — Longer Video + Improved Codec + Spatial & Tracking

Extend to longer videos with an improved codec and denser frame sampling of up to 768 frames, then inject spatial reasoning and video tracking supervision.

a. LLaVA-OneVision-1.5-Instruct-Data: 22M multimodal instruction samples.
b. HuggingFaceM4/FineVision: 24M instruction samples.
c. allenai/Molmo2-VideoTrack + allenai/Molmo2-VideoPoint: point-based video tracking and spatio-temporal pointing.
d. 10min-15min-Video-Caption-350K (re-encoded): long videos with the new codec, 384 frames @ 392×392. (new)
e. 10min-15min-Video-Caption-350K @ 768f: the same corpus densified to 768 frames @ 392×392. (new)
f. LLaVA-OneVision-2-Spatial-4M: 4M in-house spatial understanding samples. (new)
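
The four stages can be summarized as a small data structure, which makes the progressive growth in frame count easy to check. This is only an illustrative summary of the lists above: the `Stage` class, its field names, and `is_progressive` are hypothetical and do not reflect the layout of the released training configs.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Stage:
    name: str
    datasets: tuple
    max_caption_frames: int  # largest frame count listed for the stage's caption sets

INSTRUCT = ("LLaVA-OneVision-1.5-Instruct-Data", "HuggingFaceM4/FineVision")

STAGES = (
    Stage("stage1",
          ("LLaVA-OneVision-1.5-Mid-Training-85M", "30s-Video-Caption-4.2M"), 30),
    Stage("stage2",
          INSTRUCT + ("30s-60s-Video-Caption-2.7M", "60s-180s-Video-Caption-700K"), 90),
    Stage("stage3",
          INSTRUCT + ("lmms-lab/LLaVA-Video-178K",
                      "OpenGVLab/VideoChat-Flash-Training-Data",
                      "10min-15min-Video-Caption-350K"), 384),
    Stage("stage4",
          INSTRUCT + ("allenai/Molmo2-VideoTrack", "allenai/Molmo2-VideoPoint",
                      "10min-15min-Video-Caption-350K (re-encoded)",
                      "10min-15min-Video-Caption-350K @ 768f",
                      "LLaVA-OneVision-2-Spatial-4M"), 768),
)

def is_progressive(stages):
    """True if the maximum caption frame count grows at every stage."""
    frames = [s.max_caption_frames for s in stages]
    return all(a < b for a, b in zip(frames, frames[1:]))
```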

Visual Encoder Pretraining (OneVision-Encoder)

OneVision-Encoder extends native-resolution training to longer aspect ratios and pushes context capacity for high-density documents and frame-rich video.

Figure 6. OneVision-Encoder architecture overview.

Open-Source Resources

The OV2 site ships a small but complete release stack: training code, a public demo surface, the 8B instruct checkpoint, and the full training dataset collection.

Code & Demos

LLaVA-OneVision-2 (GitHub) · Code
Training code, configs, and evaluation harness.
github.com

Online Demo · Space
HuggingFace Space for interactive demo.
huggingface.co

Model Checkpoints

LLaVA-OneVision-2-8B-Instruct · HF
Pretrained checkpoints on HuggingFace.
huggingface.co

Training Datasets

LLaVA-OneVision-2-Data · Dataset
Pretraining and instruction data.
huggingface.co

Code Demos

Run LLaVA-OneVision-2-8B-Instruct from a HuggingFace transformers checkpoint (requires trust_remote_code=True). Two video backends are available: uniform frame sampling, and a codec-aware canvas-packing backend recommended for long videos.

import torch
from transformers import AutoProcessor, AutoModelForImageTextToText
from PIL import Image

MODEL_ID = "lmms-lab-encoder/LLaVA-OneVision-2-8B-Instruct"

processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(
    MODEL_ID, trust_remote_code=True, dtype=torch.bfloat16, device_map="cuda",
).eval()

# ----- Image -----
image = Image.open("cat.jpg").convert("RGB")
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Describe this image in detail."},
]}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt", padding=True)
inputs = {k: v.to("cuda") if hasattr(v, "to") else v for k, v in inputs.items()}

out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(processor.tokenizer.decode(out[0, inputs["input_ids"].shape[-1]:], skip_special_tokens=True))

# ----- Video -----
# Lower max_pixels if you hit OOM on long videos.
processor.video_processor.max_pixels = 200704

messages = [{"role": "user", "content": [
    {"type": "video"},
    {"type": "text", "text": "Describe what happens in this video."},
]}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(
    text=[text], videos=["clip.mp4"], return_tensors="pt", padding=True,
    num_frames=16,  # exact frame count; or use target_fps / max_frames
)
inputs = {k: v.to("cuda") if hasattr(v, "to") else v for k, v in inputs.items()}
out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(processor.tokenizer.decode(out[0, inputs["input_ids"].shape[-1]:], skip_special_tokens=True))

# ----- Video (codec backend, recommended for long videos) -----
# Requires `pip install codec-video-prep opencv-python` and ffmpeg on PATH.
messages = [{"role": "user", "content": [
    {"type": "video"},
    {"type": "text", "text": "Describe what happens in this long video."},
]}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

inputs = processor(
    text=[text],
    videos=["long_clip.mp4"],
    video_backend="codec",
    max_pixels=150000,          # per-canvas pixel budget; lower if OOM
    return_tensors="pt",
    padding=True,
    # Optional: override codec defaults from preprocessor_config.json
    # codec_config={"target_canvas": 32, "group_size": 32, "images_per_group": 4},
)
inputs = {k: v.to("cuda") if hasattr(v, "to") else v for k, v in inputs.items()}

out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(processor.tokenizer.decode(out[0, inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
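
To reason about the OOM hints in the comments above, it can help to estimate how many visual tokens a pixel budget allows. The sketch below assumes the checkpoint's processor uses the same patch size 14 and 2×2 merge described for the caption corpus, and that a frame or canvas fills its whole pixel budget; neither is confirmed by the snippet, so treat the results as rough upper bounds. `max_visual_tokens` is a hypothetical helper, not part of transformers.

```python
def max_visual_tokens(max_pixels, patch_size=14, merge_size=2):
    """Upper bound on visual tokens for one frame or canvas.

    Assumption: one token covers a (patch_size * merge_size)^2 pixel cell.
    """
    cell_pixels = (patch_size * merge_size) ** 2  # 28 * 28 = 784
    return max_pixels // cell_pixels

per_frame_uniform = max_visual_tokens(200704)    # uniform backend setting above
per_canvas_codec = max_visual_tokens(150000)     # codec backend setting above
uniform_total = 16 * per_frame_uniform           # num_frames=16 in the uniform example
```

Under these assumptions, lowering `max_pixels` reduces the visual sequence length roughly in proportion, which is why it is the first knob to turn when memory runs out.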

Task Demos

Qualitative results across four downstream capabilities: temporal grounding, referring video segmentation and tracking, spatial grounding, and real-world video manipulation.

Temporal Grounding
TimeLens-Bench · mean IoU ≥ 0.95 over 5 runs

He jumps a ramp and dives into a pile of leaves.
Input video: Source clip
Answer: Predicted interval 38–41 s · ActivityNet Captions · IoU 1.00

A boy is wearing boxing gloves practicing boxing.
Input video: Source clip
Answer: Predicted interval 62–87 s · ActivityNet Captions · IoU 0.98

Two men are facing the camera and talking.
Input video: Source clip
Answer: Predicted interval 81–133 s · QVHighlights · IoU 1.00

A man takes a bag from the bottom cabinet.
Input video: Source clip
Answer: Predicted interval 11–15 s · Charades-STA · IoU 1.00

A boy puts his hand on top of his head in the bathroom and takes a selfie.
Input video: Source clip
Answer: Predicted interval 15–18 s · Charades-STA · IoU 1.00

A person puts on a red plaid shirt.
Input video: Source clip
Answer: Predicted interval 23–32 s · Charades-STA · IoU 1.00

A man wearing white clothes is practicing Tai Chi by the sea.
Input video: Source clip
Answer: Predicted interval 189–208 s · ActivityNet Captions · IoU 0.98

A person washes and drains a mop in a bucket.
Input video: Source clip
Answer: Predicted interval 22–34 s · ActivityNet Captions · IoU 0.98
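
The IoU values reported for these examples compare a predicted interval with the ground-truth interval on the time axis. A minimal sketch of that computation follows; the function name is illustrative, and the benchmark's own scoring script may handle edge cases such as rounding or empty intervals differently.

```python
def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) intervals in seconds."""
    pred_start, pred_end = pred
    gt_start, gt_end = gt
    # Overlap is zero when the intervals do not intersect.
    intersection = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - intersection
    return intersection / union if union > 0 else 0.0
```

A prediction that matches the ground truth exactly scores 1.0, and a prediction that is off by even a second on a short event drops quickly, which is why short intervals are the harder cases.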

Video Tracking
Referring video object segmentation (R-VOS)

Track the animal moving forward.
Input video: Referring prompt rollout
Answer: Per-frame predicted mask

Track the person whose appearance deviates the most from the norm.
Input video: Referring prompt rollout
Answer: Per-frame predicted mask

Track a sport car.
Input video: Referring prompt rollout
Answer: Per-frame predicted mask

Track a blue and white colored surfboard in the right hand of dark blue swim suit.
Input video: Referring prompt rollout
Answer: Per-frame predicted mask

Video Manipulation
Real-world robot manipulation · online re-querying

Put the apple on the green plate placed on the table.
Execution rollout: Robot rollout
Answer: 9 predicted (x, y, z) waypoints at t = 0 s

Put the bread into the oven.
Execution rollout: Robot rollout
Answer: 5 predicted (x, y, z) waypoints at t = 0 s

Spatial Grounding
Compositional spatial language on a single image

2D Pointing
Please point to the top piece of paper on the white table.
Answer: 2D pixel-coordinate point

3D Trajectory
Pick up the brown small bottle on the table, and move it to the left of the white mouse.
Answer: 3D pick-and-place trajectory

2D Pointing
Please point out the white object that is the second closest to the wooden shelf.
Answer: 2D pixel-coordinate point

3D Trajectory
Pick up the gray toy on the left, and move it so spacing matches the other toys.
Answer: 3D pick-and-place trajectory

2D Pointing
Please point to the left pillow on the sofa.
Answer: 2D pixel-coordinate point

3D Trajectory
Pick up the red object on the rightmost table, and move it onto the center cabinet.
Answer: 3D pick-and-place trajectory

2D Pointing
Please point out the free space between the cat tree and litter box.
Answer: 2D pixel-coordinate point

3D Trajectory
Pick up the calculator on the right table, and move it to the left of the phone on the left table.
Answer: 3D pick-and-place trajectory

2D Pointing
Please point out the free space on the table between the speaker to the right of the monitor and the mouse.
Answer: 2D pixel-coordinate point

2D Pointing
Please point out the object on the windowsill farthest from the viewer.
Answer: 2D pixel-coordinate point

2D Pointing
Please point out the free space between the black water bottle, the pot lid, and the scissors.
Answer: 2D pixel-coordinate point

2D Pointing
Please point out the free space between the black water bottle and the pot lid.
Answer: 2D pixel-coordinate point

Citation

@article{llava_onevision_2_2026,
  title   = {LLaVA-OneVision-2: Open Multimodal Training at Scale},
  author  = {Xiang An and Yin Xie and Kaicheng Yang and Wenkang Zhang and Xiuwei Zhao and Zheng Cheng and Yirui Wang and Songcen Xu and Changrui Chen and Didi Zhu and Chunsheng Wu and Huajie Tan and Chunyuan Li and Jing Yang and Jie Yu and Xiyao Wang and Bin Qin and Yumeng Wang and Zizhen Yan and Ziyong Feng and Ziwei Liu and Bo Li and Jiankang Deng},
  journal = {arXiv preprint arXiv:TBD},
  year    = {2026}
}

References

LLaVA-OneVision: Easy Visual Task Transfer. Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. TMLR, 2024. arXiv:2408.03326
LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training. Xiang An, Yin Xie, Kaicheng Yang, Wenkang Zhang, Xiuwei Zhao, Zheng Cheng, Yirui Wang, Songcen Xu, Changrui Chen, Didi Zhu, Chunsheng Wu, Huajie Tan, Chunyuan Li, Jing Yang, Jie Yu, Xiyao Wang, Bin Qin, Yumeng Wang, Zizhen Yan, Ziyong Feng, Ziwei Liu, Bo Li, and Jiankang Deng. arXiv, 2025. arXiv:2509.23661
OneVision-Encoder: Codec-Aligned Sparsity as a Foundational Principle for Multimodal Intelligence. Feilong Tang, Xiang An, Yunyao Yan, Yin Xie, Bin Qin, Kaicheng Yang, Yifei Shen, Yuanhan Zhang, Chunyuan Li, Shikun Feng, Changrui Chen, Huajie Tan, Ming Hu, Manyuan Zhang, Bo Li, Ziyong Feng, Ziwei Liu, Zongyuan Ge, and Jiankang Deng. arXiv, 2026. arXiv:2602.08683
Visual Instruction Tuning. Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. NeurIPS, 2023. arXiv:2304.08485
Qwen3-VL Technical Report. Qwen Team. Tech Report, 2025. github.com/QwenLM/Qwen3-VL
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency. Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, Zhaokai Wang, Zhe Chen, Hongjie Zhang, Ganlin Yang, Haomin Wang, Qi Wei, Jinhui Yin, Wenhao Li, Erfei Cui, et al. Tech Report, 2025. arXiv:2508.18265
PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding. Jang Hyun Cho, Andrea Madotto, Effrosyni Mavroudi, Triantafyllos Afouras, Tushar Nagarajan, Muhammad Maaz, Yale Song, Tengyu Ma, Shuming Hu, Suyog Jain, Miguel Martin, Huiyu Wang, Hanoona Rasheed, Peize Sun, Po-Yao Huang, Daniel Bolya, Nikhila Ravi, Shashank Jain, Tammy Stark, Shane Moon, Babak Damavandi, Vivian Lee, Andrew Westbury, Salman Khan, Philipp Krähenbühl, Piotr Dollár, Lorenzo Torresani, Kristen Grauman, and Christoph Feichtenhofer. arXiv, 2025. arXiv:2504.13180
Kwai Keye-VL 1.5 Technical Report. Biao Yang, Bin Wen, Boyang Ding, Changyi Liu, Chenglong Chu, Chengru Song, Chongling Rao, Chuan Yi, Da Li, Dunju Zang, Fan Yang, Guorui Zhou, Guowang Zhang, Han Shen, Hao Peng, Haojie Ding, Hao Wang, Haonan Fan, Hengrui Ju, et al. arXiv, 2025. arXiv:2509.01563