OneVision Encoder

The first HEVC-style Vision Transformer with advanced multimodal capabilities

LLaVA-OneVision Community Contributors

Introduction

Video understanding models face a fundamental trade-off: processing more frames captures richer temporal information but increases computation quadratically. Traditional approaches address this through sparse frame sampling, but this discards fine-grained motion dynamics and treats all spatial regions equally—wasting computation on static backgrounds.

We present OneVision Encoder, a vision transformer that resolves this trade-off using principles from HEVC video compression. Instead of sampling sparse frames densely (all patches from few frames), we sample dense frames sparsely (important patches from many frames). Our codec-style patch selection identifies temporally-salient regions—areas with motion, object interactions, or semantic changes—and processes only these informative patches.

Combined with global contrastive learning using a 2M concept bank, OneVision Encoder achieves state-of-the-art results on video benchmarks (MVBench, VideoMME, Perception Test) and image understanding tasks (DocVQA, ChartQA, OCRBench).

OneVision Encoder Method Overview

Codec-Style Patch Selection

Traditional video understanding models process frames by uniform temporal sampling—selecting evenly-spaced frames regardless of content. This approach treats all spatial regions equally, wasting computation on redundant background pixels that remain static across frames.

Inspired by HEVC video compression, our codec-style approach identifies and processes only the patches that carry meaningful temporal changes. Just as video codecs encode motion vectors and residuals rather than full frames, we select patches based on their information density—preserving the dynamic, semantically-rich regions while discarding redundant static content.
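
As a rough illustration, the sketch below scores each patch by its inter-frame residual energy and keeps only the top-scoring ones. This is our own simplified approximation rather than the released implementation: the actual encoder can also exploit codec-native motion vectors, and the function name, patch size, and keep ratio here are placeholders.

# Simplified approximation of codec-style patch selection (illustrative only):
# patches are scored by inter-frame residual energy; the real encoder can also
# use codec-native motion vectors. Patch size and keep ratio are placeholders.
import torch

def select_salient_patches(video, patch=14, keep_ratio=0.06):
    """video: (T, C, H, W) tensor. Returns (frame_idx, patch_idx) of kept patches."""
    T, C, H, W = video.shape
    residual = (video[1:] - video[:-1]).abs()                          # (T-1, C, H, W)
    # Sum residual energy inside each non-overlapping patch.
    energy = residual.unfold(2, patch, patch).unfold(3, patch, patch)  # (T-1, C, h, w, p, p)
    energy = energy.sum(dim=(1, 4, 5)).flatten(1)                      # (T-1, h*w)
    # Keep the globally top-k most dynamic patches across all non-reference frames.
    k = max(1, int(keep_ratio * energy.numel()))
    top = energy.flatten().topk(k).indices
    frame_idx = top // energy.shape[1] + 1      # +1 skips the reference frame (t=1)
    patch_idx = top % energy.shape[1]
    return frame_idx, patch_idx

# Example: 64 frames at 224x224 with 14px patches -> 16x16 = 256 patches per frame.
frames = torch.randn(64, 3, 224, 224)
f_idx, p_idx = select_salient_patches(frames)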

Codec-Style Input

Left: Reference frame (t=1) with all patches. Right: Three animated blocks showing consecutive frames (t=2,3,4 → t=5,6,7 → ...), cycling through t=2 to t=64. Each frame shows only salient patches at their spatial positions. The result: 75-98% fewer patches while retaining the information that matters.

Traditional Frame Sampling

Uniformly samples 4 frames and processes all patches from each. Notice the redundancy: static backgrounds, repeated textures, and unchanging regions are processed multiple times across frames—wasting computation on pixels that add no new information.

Video Processing Pipeline

The visualization below demonstrates our complete video processing pipeline. The animation shows four key stages: (1) Original Video - a continuous 64-frame stream capturing the full temporal context, (2) Uniform Frame Sampling - traditional approach selecting 4-8 evenly-spaced frames, which is simple but lossy and misses inter-frame motion, (3) Temporal Saliency Detection - analysis of all 64 frames to identify regions with high temporal information such as motion, appearance changes, and semantic events, and (4) Codec-Style Patch Extraction - extraction of only the salient patches in zigzag order, achieving 75-98% compression while preserving temporal dynamics.

Pipeline stages: Original Video → Uniform Frame Sampling → Temporal Saliency Detection → Codec-Style Patch Extraction

Complete video processing pipeline showing the four stages from original video to codec-style compressed representation. Each stage demonstrates how our approach progressively identifies and extracts temporally-salient patches while maintaining rich motion information.
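
The zigzag serialization in stage (4) can be sketched as follows. This is an illustrative reconstruction that follows the familiar JPEG/HEVC-style scan convention; the function names are ours, not from the released code.

# Illustrative sketch of zigzag serialization for selected patches (stage 4).
def zigzag_order(h, w):
    """(row, col) coordinates of an h x w patch grid in zigzag scan order."""
    coords = []
    for s in range(h + w - 1):                                # walk the anti-diagonals
        diag = [(r, s - r) for r in range(h) if 0 <= s - r < w]
        coords.extend(diag[::-1] if s % 2 == 0 else diag)     # alternate scan direction
    return coords

def serialize_selected(kept, grid_h=16, grid_w=16):
    """kept: set of (row, col) patches that survived saliency selection."""
    return [rc for rc in zigzag_order(grid_h, grid_w) if rc in kept]

# Selected patches of one frame, emitted in zigzag scan order.
print(serialize_selected({(0, 0), (3, 5), (5, 3), (10, 10)}))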

Global Contrastive Learning

Standard contrastive learning (e.g., CLIP) is limited by batch size—negative samples are drawn only from the current batch, typically 32K-64K examples. This creates a narrow view of the embedding space and leads to suboptimal representations. Our approach maintains a global concept bank of 2M clustered centers, enabling each training sample to contrast against a diverse, representative set of negatives regardless of batch composition. This produces more discriminative embeddings with better-separated semantic clusters.
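
A minimal sketch of this idea is shown below, assuming the concept bank supplies shared negatives alongside the usual in-batch ones. The bank size, temperature, and update rule here are placeholders, not the actual training recipe.

# Minimal sketch of contrastive learning against a global concept bank
# (illustrative; bank size, temperature, and update rule are assumptions).
import torch
import torch.nn.functional as F

def global_contrastive_loss(img_emb, txt_emb, concept_bank, tau=0.07):
    """
    img_emb:      (B, D) L2-normalized image/video embeddings
    txt_emb:      (B, D) L2-normalized paired text embeddings (positives)
    concept_bank: (K, D) L2-normalized concept centers (K on the order of 2M),
                  shared by every sample as additional negatives.
    """
    B = img_emb.size(0)
    pos = (img_emb * txt_emb).sum(dim=-1, keepdim=True)              # (B, 1) positive logit
    in_batch = img_emb @ txt_emb.t()                                  # (B, B) in-batch negatives
    eye = torch.eye(B, dtype=torch.bool, device=img_emb.device)
    in_batch = in_batch.masked_fill(eye, float("-inf"))               # drop the true pair
    bank = img_emb @ concept_bank.t()                                 # (B, K) global negatives
    logits = torch.cat([pos, in_batch, bank], dim=1) / tau
    target = torch.zeros(B, dtype=torch.long, device=img_emb.device)  # positive at index 0
    return F.cross_entropy(logits, target)

# Usage with random placeholders (a real bank would hold 2M clustered centers):
img = F.normalize(torch.randn(8, 512), dim=-1)
txt = F.normalize(torch.randn(8, 512), dim=-1)
bank = F.normalize(torch.randn(10_000, 512), dim=-1)   # shrunk for the example
loss = global_contrastive_loss(img, txt, bank)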

LMM Probe Results

We train on a mixed dataset of 740K samples from LLaVA-OneVision and 800K samples from LLaVA-Video SFT; the training pipeline proceeds directly to Stage 2 fine-tuning. Following a streamlined native-resolution strategy inspired by LLaVA-OneVision, input frames whose resolution matches the model's native input size are fed directly into the model without tiling or cropping, enabling an explicit evaluation of the ViT's native-resolution representation capability. Furthermore, incorporating codec-style patch selection during training improves the vision–language model's capability. The experimental comparison with a baseline that does not adopt codec-style patch selection is ongoing; detailed results will be reported shortly (TBD).

Task   Benchmark        Qwen3-4B-Instruct-2507    Qwen2.5-1.5B
                        OV-Encoder    SigLIP2     OV-Encoder    SigLIP2
Video  MVBench          49.8          47.2        46.8          45.9
       MLVU-dev         49.4          48.4        48.5          41.4
       NExT-QA (MC)     71.9          70.6        67.6          66.8
       VideoMME         49.3          46.8        44.2          42.7
       Perception Test  56.7          56.0        55.2          54.7
       TOMATO           21.8          22.3        20.6          20.2
Image  ChartQA          77.8          76.4        67.4          67.2
       DocVQA           79.5          75.0        73.2          70.8
       InfoVQA          45.5          42.0        33.1          32.6
       OCRBench         630.0         621.0       533.0         560.0
       OCRBench v2      26.1          26.1        20.5          20.2
       MMBench-EN       78.5          79.6        68.6          71.2
       MMStar           54.3          55.0        43.5          46.8
       RealWorldQA      61.2          62.1        57.7          61.3

* OV-Encoder uses onevision-encoder-large. SigLIP2 uses siglip2-so400m-patch16-naflex.

Attentive Probe Results

Performance comparison of different vision encoders using Attentive Probe evaluation. Models are evaluated using single clip input and trained for 10 epochs across 8 action recognition datasets. Results show average performance and per-dataset scores for 8-frame and 16-frame configurations.

OV-Encoder-Codec (Ours) refers to a variant of OV-Encoder that replaces traditional frame sampling with codec-style input, where dense full-frame inputs are substituted by codec-guided patch reorganization. Under the same attentive probe setting and token budget, patches are selectively reallocated across the input clip based on codec-native motion vectors and residuals, without changing the backbone architecture or training protocol. This results in stronger performance on motion-sensitive datasets, particularly Diving48 and Perception Test.
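
A toy sketch of how a fixed token budget might be redistributed across a clip from per-frame motion scores is given below. The allocation rule and all names are our own illustration, not the exact OV-Encoder-Codec procedure; in practice the per-frame scores would come from codec-native motion vectors and residuals.

# Toy sketch of budget-constrained patch reallocation across a clip
# (illustrative allocation rule; scores come from codec motion vectors/residuals).
def allocate_patch_budget(motion_energy, budget, max_per_frame=256):
    """motion_energy: per-frame motion/residual scores; returns patches kept per frame."""
    total = sum(motion_energy)
    raw = [budget * e / total for e in motion_energy]
    alloc = [min(int(r), max_per_frame) for r in raw]
    # Hand out the remaining patches to the frames with the largest fractional parts.
    leftovers = sorted(range(len(raw)), key=lambda i: raw[i] - int(raw[i]), reverse=True)
    for i in leftovers[: budget - sum(alloc)]:
        if alloc[i] < max_per_frame:
            alloc[i] += 1
    return alloc

# 64-frame clip, 2048-patch budget: frames with more motion receive more patches.
energy = [1.0 + 3.0 * (i % 8 == 0) for i in range(64)]   # hypothetical motion profile
alloc = allocate_patch_budget(energy, budget=2048)
print(sum(alloc), max(alloc), min(alloc))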

Method                   Arch.     Res.  AVG   SSV2  Diving48  Perce. Test  CharEgo  Epic Verb  Epic Noun  K400  HMDB51

8 Frames
MetaCLIP2                ViT-L/14  224   46.3  45.9  30.7      46.4         10.2     44.4       36.6       79.1  77.0
AIMv2                    ViT-L/14  224   52.6  53.8  45.3      52.9         11.5     54.4       42.7       79.0  81.5
DINOv3                   ViT-L/14  224   56.0  56.1  50.7      58.1         12.4     60.0       47.9       81.4  81.4
SigLIP2                  ViT-L/16  256   56.0  57.9  53.2      56.3         12.8     58.8       45.2       81.3  82.1
OV-Encoder (Ours)        ViT-L/14  224   58.1  57.0  56.0      57.7         12.4     61.9       53.3       84.1  82.2
OV-Encoder-Codec (Ours)  ViT-L/14  224   59.8  57.5  65.9      59.6         12.1     61.8       53.7       84.3  83.2

16 Frames
MetaCLIP2                ViT-L/14  224   49.7  51.6  47.1      47.3         10.4     49.0       35.7       80.1  76.1
AIMv2                    ViT-L/14  224   53.7  54.8  50.0      55.0         12.2     55.0       42.1       79.3  80.9
DINOv3                   ViT-L/14  224   57.7  57.8  61.1      57.5         12.2     61.3       48.8       82.6  79.8
SigLIP2                  ViT-L/16  256   57.3  59.6  59.1      56.0         13.1     59.3       46.2       82.5  82.3
OV-Encoder (Ours)        ViT-L/14  224   59.9  59.0  65.1      59.4         11.7     62.2       54.6       85.4  81.7
OV-Encoder-Codec (Ours)  ViT-L/14  224   60.9  59.2  68.7      60.8         12.8     62.8       54.3       85.5  82.9

* Evaluation under Attentive Probe settings using single clip input, trained for 10 epochs.

Patch-Efficient Video Understanding Comparison

Efficiency analysis comparing SigLIP2 with dense full-frame patch processing and OV-Encoder-Codec under a fixed token budget. It is important to emphasize that OV-Encoder-Codec does not perform temporal downsampling of the input video. All results are obtained from the same 64-frame (16384 patches) source video, where codec-native motion vectors and residuals are used to selectively extract a fixed number of spatiotemporal patches distributed across the entire temporal extent.

For a fair comparison, SigLIP2 is evaluated under the same token budgets and adopts a traditional frame sampling strategy, where each group of 256 patches corresponds to a contiguous RGB frame. Under a fixed token budget, OV-Encoder-Codec redistributes patches across time while preserving their spatial positions, enabling long-range temporal coverage. As a result, it outperforms SigLIP2 on Diving48 and Perception Test while reducing patch processing by 75.0%–98.4% compared to dense processing of 16,384 patches.
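
The reduction percentages follow directly from the 64 x 256 = 16,384-patch source stated above; a quick check:

# Quick check of the reduction percentages: the source clip has 64 frames x 256
# patches (224px frames, 14px patches) = 16384 patches in total.
total = 64 * 256
for n in (256, 512, 1024, 2048, 4096):
    print(f"{n:5d} patches kept -> {100 * (1 - n / total):.1f}% fewer than dense")
# 256 -> 98.4%, 512 -> 96.9%, 1024 -> 93.8%, 2048 -> 87.5%, 4096 -> 75.0%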

Dataset          Model                                                   256 Patches     512 Patches     1024 Patches    2048 Patches    4096 Patches
Diving48         SigLIP2 (ViT-L/16, 256px)                               19.9            28.1            38.7            53.2            59.1
                 Traditional Frame Sampling (dense patch processing)
                 OneVision Encoder-Codec (ViT-L/14, 224px)               36.0            45.5            54.9            65.9            68.7
                 16384 patches -> N patches                              (98.4% ↓ *)     (96.9% ↓)       (93.8% ↓)       (87.5% ↓)       (75.0% ↓)
Perception Test  SigLIP2 (ViT-L/16, 256px)                               -               48.7            53.1            56.3            56.0
                 Traditional Frame Sampling (dense patch processing)
                 OneVision Encoder-Codec (ViT-L/14, 224px)               41.2            49.5            54.9            59.6            60.8
                 16384 patches -> N patches                              (98.4% ↓)       (96.9% ↓)       (93.8% ↓)       (87.5% ↓)       (75.0% ↓)

* Percentages under OV-Encoder-Codec indicate patch reduction relative to dense processing of all 16384 patches.

'16384 patches -> N patches' denotes Codec-Style Patch Selection, where motion-relevant patches are selectively retained instead of temporally sampling frames.

BibTeX

@article{onevision_encoder_2026,
  title={OneVision Encoder},
  author={LLaVA-OneVision Community Contributors},
  journal={arXiv preprint},
  year={2026}
}

If you find our work useful, please consider citing our paper.