
LLaVA-OneVision: Easy Visual Task Transfer


The first single model that can simultaneously push the performance boundaries of open LMMs in three important computer vision scenarios: single-image, multi-image, and video

Figure: LLaVA-OneVision, a unified model for single-image, multi-image, and video understanding.

Overview

We present LLaVA-OneVision, a family of open large multimodal models (LMMs) developed by consolidating our insights into data, models, and visual representations from the LLaVA-NeXT blog series. LLaVA-OneVision is the first single model that simultaneously pushes the performance boundaries of open LMMs in three important computer vision scenarios: single-image, multi-image, and video understanding.

Key Features

Unified Architecture

LLaVA-OneVision allocates a similar maximum number of visual tokens to each scenario, so that single images, multi-image sets, and videos are represented at comparable fidelity. This balanced token budget makes it straightforward to extend the model to new visual signal types while keeping performance consistent across scenarios.
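As a rough illustration of this balance, the sketch below compares per-scenario token budgets. The specific numbers (729 tokens per base image, a 9-crop AnyRes grid, 32 frames pooled to 196 tokens each) are our assumptions for the example; consult the paper for the exact configuration.

```python
# Illustrative sketch of the "balanced visual token budget" idea.
# The concrete numbers below are assumptions for illustration,
# not the official LLaVA-OneVision configuration.

BASE_TOKENS = 729  # tokens for one image at base resolution (27x27 grid)

def single_image_tokens(num_crops: int = 9) -> int:
    """AnyRes-style encoding: one base view plus high-resolution crops."""
    return BASE_TOKENS * (1 + num_crops)

def multi_image_tokens(num_images: int = 12) -> int:
    """Each image encoded at base resolution; tokens are concatenated."""
    return BASE_TOKENS * num_images

def video_tokens(num_frames: int = 32, tokens_per_frame: int = 196) -> int:
    """Frames are pooled to fewer tokens each, so more frames fit."""
    return num_frames * tokens_per_frame

# All three budgets land in the same ballpark, which is the point:
print(single_image_tokens())  # 7290
print(multi_image_tokens())   # 8748
print(video_tokens())         # 6272
```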

Model Sizes

  • 0.5B parameters - Lightweight deployment
  • 7B parameters - Balanced performance
  • 72B parameters - State-of-the-art capabilities
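For readers who want to try the models, a minimal single-image inference sketch is shown below. It assumes the Hugging Face Transformers integration (`LlavaOnevisionForConditionalGeneration`, available in recent releases) and the community `llava-hf` checkpoints; check the hub for the exact model ids.

```python
# Minimal single-image inference sketch (hub id is an assumption).
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

model_id = "llava-hf/llava-onevision-qwen2-0.5b-ov-hf"  # assumed hub id
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

conversation = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "What is shown in this image?"},
    ]}
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

image = Image.open("example.jpg")
inputs = processor(images=image, text=prompt,
                   return_tensors="pt").to(model.device, torch.float16)

out = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(out[0], skip_special_tokens=True))
```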

Emerging Capabilities

The design of LLaVA-OneVision enables strong transfer learning across different modalities and scenarios, yielding impressive emerging capabilities:

1. Cross-Scenario Understanding

Seamlessly process and understand content across single images, multiple images, and videos within a unified framework.

2. Advanced Visual Analysis

  • Diagram and table interpretation - Understanding complex visual structures
  • Multi-screenshot interaction - Analyzing relationships across multiple screens
  • Set-of-mark object referencing - Precise object identification and tracking

3. Video Capabilities

  • Image-to-video generation understanding - Comprehending temporal transitions
  • Video analysis and comparison - Deep understanding of video content
  • Multi-camera video interpretation - Processing footage from multiple viewpoints
  • Detailed video subject description - Rich, contextual video narration
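To make the video path concrete, here is a hedged sketch of video question answering: sample frames uniformly, then pass them to the same processor. The decord-based reader, the 32-frame budget, and the checkpoint id are illustrative assumptions, not the blog's prescribed pipeline.

```python
# Sketch: uniform frame sampling + video QA with LLaVA-OneVision.
# The decord reader, 32-frame budget, and hub id are assumptions.
import numpy as np
import torch
from decord import VideoReader
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

model_id = "llava-hf/llava-onevision-qwen2-7b-ov-hf"  # assumed hub id
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

def sample_frames(path: str, num_frames: int = 32) -> np.ndarray:
    """Uniformly sample frames as a (num_frames, H, W, 3) uint8 array."""
    vr = VideoReader(path)
    idx = np.linspace(0, len(vr) - 1, num_frames).astype(int)
    return vr.get_batch(idx).asnumpy()

conversation = [
    {"role": "user", "content": [
        {"type": "video"},
        {"type": "text", "text": "Describe what happens in this video."},
    ]}
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

inputs = processor(videos=sample_frames("clip.mp4"), text=prompt,
                   return_tensors="pt").to(model.device, torch.float16)
out = model.generate(**inputs, max_new_tokens=200)
print(processor.decode(out[0], skip_special_tokens=True))
```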

Strong Transfer Learning

Importantly, this design allows strong transfer learning across modalities and scenarios. In particular, the model's video understanding and cross-scenario capabilities emerge largely through task transfer from images to video, showing that representations learned in one visual domain generalize to others.

Development Roadmap

LLaVA-OneVision represents a significant milestone in the iterative improvements of the LLaVA-NeXT series, which have focused on:

  • Enhanced reasoning capabilities
  • Improved OCR performance
  • Expanded world knowledge
  • Advanced multimodal understanding

Authors

Bo Li*, Yuanhan Zhang*, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Yanwei Li, Ziwei Liu, Chunyuan Li

* Main Authors

Acknowledgement

This work is a collaboration between researchers from ByteDance, NTU, CUHK, and HKUST, building upon the strong foundation of the LLaVA project series.