
LLaVA-OneVision: Easy Visual Task Transfer


The first single model that can simultaneously push the performance boundaries of open LMMs in three important computer vision scenarios: single-image, multi-image, and video

Figure: LLaVA-OneVision, a unified model for single-image, multi-image, and video understanding.

Overview

We present LLaVA-OneVision, a family of open large multimodal models (LMMs) developed by consolidating our insights into data, models, and visual representations from the LLaVA-NeXT blog series. LLaVA-OneVision is the first single model that simultaneously pushes the performance boundaries of open LMMs in three important computer vision scenarios: single-image, multi-image, and video understanding.

Key Features

Unified Architecture

LLaVA-OneVision allocates a similar maximum number of visual tokens to each scenario, so that single images, multi-image sets, and videos are represented at comparable fidelity. This balanced token budget makes it straightforward to extend the model to new visual signal types while keeping performance consistent across scenarios.
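As a rough illustration of this balance, the sketch below compares per-scenario token budgets. The specific numbers (729 tokens per base image, a 9-crop AnyRes grid, 32 frames pooled to 196 tokens each) are our assumptions for the example; consult the paper for the exact configuration.

```python
# Illustrative sketch of the "balanced visual token budget" idea.
# The concrete numbers below are assumptions for illustration,
# not the official LLaVA-OneVision configuration.

BASE_TOKENS = 729  # tokens for one image at base resolution (27x27 grid)

def single_image_tokens(num_crops: int = 9) -> int:
    """AnyRes-style encoding: one base view plus high-resolution crops."""
    return BASE_TOKENS * (1 + num_crops)

def multi_image_tokens(num_images: int = 12) -> int:
    """Each image encoded at base resolution; tokens are concatenated."""
    return BASE_TOKENS * num_images

def video_tokens(num_frames: int = 32, tokens_per_frame: int = 196) -> int:
    """Frames are pooled to fewer tokens each, so more frames fit."""
    return num_frames * tokens_per_frame

# All three budgets land in the same ballpark, which is the point:
print(single_image_tokens())  # 7290
print(multi_image_tokens())   # 8748
print(video_tokens())         # 6272
```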

Model Sizes

  • 0.5B parameters - Lightweight deployment
  • 7B parameters - Balanced performance
  • 72B parameters - State-of-the-art capabilities
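For readers who want to try the models, a minimal single-image inference sketch is shown below. It assumes the Hugging Face Transformers integration (`LlavaOnevisionForConditionalGeneration`, available in recent releases) and the community `llava-hf` checkpoints; check the hub for the exact model ids.

```python
# Minimal single-image inference sketch (hub id is an assumption).
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

model_id = "llava-hf/llava-onevision-qwen2-0.5b-ov-hf"  # assumed hub id
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

conversation = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "What is shown in this image?"},
    ]}
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

image = Image.open("example.jpg")
inputs = processor(images=image, text=prompt,
                   return_tensors="pt").to(model.device, torch.float16)

out = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(out[0], skip_special_tokens=True))
```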

Emerging Capabilities

The design of LLaVA-OneVision enables strong transfer learning across different modalities and scenarios, yielding impressive emerging capabilities:

1. Cross-Scenario Understanding

Seamlessly process and understand content across single images, multiple images, and videos within a unified framework.

2. Advanced Visual Analysis

  • Diagram and table interpretation - Understanding complex visual structures
  • Multi-screenshot interaction - Analyzing relationships across multiple screens
  • Set-of-mark object referencing - Precise object identification and tracking

3. Video Capabilities

  • Image-to-video generation understanding - Comprehending temporal transitions
  • Video analysis and comparison - Deep understanding of video content
  • Multi-camera video interpretation - Processing footage from multiple viewpoints
  • Detailed video subject description - Rich, contextual video narration
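To make the video path concrete, here is a hedged sketch of video question answering: sample frames uniformly, then pass them to the same processor. The decord-based reader, the 32-frame budget, and the checkpoint id are illustrative assumptions, not the blog's prescribed pipeline.

```python
# Sketch: uniform frame sampling + video QA with LLaVA-OneVision.
# The decord reader, 32-frame budget, and hub id are assumptions.
import numpy as np
import torch
from decord import VideoReader
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

model_id = "llava-hf/llava-onevision-qwen2-7b-ov-hf"  # assumed hub id
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

def sample_frames(path: str, num_frames: int = 32) -> np.ndarray:
    """Uniformly sample frames as a (num_frames, H, W, 3) uint8 array."""
    vr = VideoReader(path)
    idx = np.linspace(0, len(vr) - 1, num_frames).astype(int)
    return vr.get_batch(idx).asnumpy()

conversation = [
    {"role": "user", "content": [
        {"type": "video"},
        {"type": "text", "text": "Describe what happens in this video."},
    ]}
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

inputs = processor(videos=sample_frames("clip.mp4"), text=prompt,
                   return_tensors="pt").to(model.device, torch.float16)
out = model.generate(**inputs, max_new_tokens=200)
print(processor.decode(out[0], skip_special_tokens=True))
```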

Strong Transfer Learning

Importantly, this design allows strong transfer learning across modalities and scenarios. In particular, the model's video understanding and cross-scenario capabilities emerge largely through task transfer from images to video, showing that representations learned in one visual domain generalize to others.

Development Roadmap

LLaVA-OneVision represents a significant milestone in the iterative improvements of the LLaVA-NeXT series, which have focused on:

  • Enhanced reasoning capabilities
  • Improved OCR performance
  • Expanded world knowledge
  • Advanced multimodal understanding

Authors

Bo Li*, Yuanhan Zhang*, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Yanwei Li, Ziwei Liu, Chunyuan Li

* Main Authors

Acknowledgement

This work is a collaboration between researchers from ByteDance, NTU, CUHK, and HKUST, building upon the strong foundation of the LLaVA project series.