
Tags → #transfer-learning

  • LLaVA-OneVision
    LLaVA-OneVision: A unified model for single-image, multi-image, and video understanding

    Overview

    We present LLaVA-OneVision, a family of open large multimodal models (LMMs) developed by consolidating our insights into data, models, and visual representations from the LLaVA-NeXT blog series. LLaVA-OneVision is the first single model that can simultaneously push the performance boundaries of open LMMs in three important computer vision scenarios: single-image, multi-image, and video understanding.

    Key Features

    Unified Architecture

    LLaVA-OneVision is designed to have a similar maximum visual token count across different scenarios, enabling flexible extension to multiple visual signal types while maintaining consistent performance.
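    To make the shared-budget idea concrete, here is a minimal sketch, assuming an illustrative token budget and crop/frame splits that are not LLaVA-OneVision's official configuration, of how a roughly equal visual-token count could be maintained in each scenario:

    ```python
    # Illustrative only: the budget and split sizes below are assumptions,
    # not LLaVA-OneVision's actual configuration.
    MAX_VISUAL_TOKENS = 7296  # assumed shared budget across scenarios

    def visual_token_plan(scenario: str, num_items: int) -> dict:
        """Split a roughly fixed token budget over crops (images) or frames (video)."""
        if scenario == "single-image":
            units = 1 + num_items              # base view + high-resolution crops
        elif scenario in ("multi-image", "video"):
            units = max(num_items, 1)          # one unit per image or sampled frame
        else:
            raise ValueError(f"unknown scenario: {scenario}")
        tokens_per_unit = MAX_VISUAL_TOKENS // units
        return {"units": units,
                "tokens_per_unit": tokens_per_unit,
                "total_tokens": units * tokens_per_unit}

    # Each scenario lands near the same overall visual-token count:
    print(visual_token_plan("single-image", 4))   # a few large crops
    print(visual_token_plan("multi-image", 8))    # several medium-sized images
    print(visual_token_plan("video", 32))         # many small frames
    ```

    The point of the sketch is only that single images get a few large crops while videos get many small frames, so the total number of visual tokens seen by the language model stays in the same range across scenarios.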

    Model Sizes

    • 0.5B parameters - Lightweight deployment
    • 7B parameters - Balanced performance
    • 72B parameters - State-of-the-art capabilities

    Emerging Capabilities

    The design of LLaVA-OneVision enables strong transfer learning across different modalities and scenarios, yielding impressive emerging capabilities:

    1. Cross-Scenario Understanding

    Seamlessly process and understand content across single images, multiple images, and videos within a unified framework.

    2. Advanced Visual Analysis

    • Diagram and table interpretation - Understanding complex visual structures
    • Multi-screenshot interaction - Analyzing relationships across multiple screens
    • Set-of-mark object referencing - Precise object identification and tracking

    3. Video Capabilities

    • Image-to-video generation understanding - Comprehending temporal transitions
    • Video analysis and comparison - Deep understanding of video content
    • Multi-camera video interpretation - Processing footage from multiple viewpoints
    • Detailed video subject description - Rich, contextual video narration

    Strong Transfer Learning

    Importantly, the design of LLaVA-OneVision allows strong transfer learning across different modalities and scenarios. In particular, strong video understanding and cross-scenario capabilities are demonstrated through task transfer from images to videos, showcasing the model's ability to generalize learned representations across visual domains.

    Development Roadmap

    LLaVA-OneVision represents a significant milestone in our iterative improvements through the LLaVA-NeXT series, focusing on:

    • Enhanced reasoning capabilities
    • Improved OCR performance
    • Expanded world knowledge
    • Advanced multimodal understanding
  • LongVA Visual Needle-in-a-Haystack Heatmap
    LongVA's performance on the Visual Needle-In-A-Haystack benchmark, showing accurate retrieval across long video sequences

    Overview

    Gemini has amazed the world with its capability to understand hour-long videos. However, we still lack an open-source alternative with similar capabilities. Our latest research presents an innovative solution toward long-video LMMs, shifting the focus from reducing the number of visual tokens per frame to leveraging the long-context capabilities of language models.

    Here, we present our state-of-the-art video model, Long Video Assistant (LongVA), and our novel benchmark, Visual Needle-In-A-Haystack (V-NIAH).

    Key Innovations

    🔄 Long Context Transfer

    We discovered and verified that the long context capability of language models can be directly transferred to the video domain in modality-aligned multi-modal models. On V-NIAH, LongVA is the only open-source model capable of accurately retrieving visual information from inputs with:

    • 2000+ frames
    • 200K+ visual tokens
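    As a rough back-of-the-envelope check of those numbers, the sketch below assumes a fixed (hypothetical) per-frame token count; the visual-token total then grows linearly with the number of sampled frames, which is why the language model's context length, rather than per-frame token reduction, becomes the binding constraint:

    ```python
    # Back-of-the-envelope arithmetic; TOKENS_PER_FRAME is an assumed value
    # for illustration, not LongVA's exact figure.
    TOKENS_PER_FRAME = 144

    def visual_tokens(num_frames: int) -> int:
        """Total visual tokens fed to the language model for a sampled clip."""
        return num_frames * TOKENS_PER_FRAME

    for frames in (256, 1024, 2048):
        print(f"{frames:>5} frames -> {visual_tokens(frames):>7,} visual tokens")
    # 2048 frames x 144 tokens/frame ~ 295K tokens: the language model's
    # context window, not per-frame compression, is what has to scale.
    ```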

    🎯 UniRes: Unified Visual Encoding

    We proposed UniRes, a unified visual encoding scheme for both images and videos. In UniRes, a video is encoded in the same way as a sequence of image crops.

    Key Benefits:

    • Leverages the Long Context Transfer property
    • Enables superior zero-shot performance in video tasks
    • No video-specific training data required
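    The sketch below illustrates the UniRes idea in rough form; the stand-in encoder, patch-grid size, and pooling factor are assumptions for illustration, not the released implementation:

    ```python
    import numpy as np

    # Sketch of UniRes-style unified encoding: a video is treated exactly like a
    # sequence of image crops. encode_crop() is a stand-in (assumption), not the
    # actual vision tower; the grid size and pooling factor are illustrative.
    GRID = 24     # assumed patch grid per crop/frame (24 x 24 patches)
    DIM = 1024    # assumed feature dimension

    def encode_crop(image: np.ndarray) -> np.ndarray:
        """Stand-in vision encoder: one crop/frame -> (GRID * GRID, DIM) features."""
        return np.zeros((GRID * GRID, DIM))

    def encode_image(image: np.ndarray, crops: list) -> np.ndarray:
        # A high-resolution image becomes a sequence of crop features.
        return np.concatenate([encode_crop(c) for c in [image, *crops]], axis=0)

    def encode_video(frames: list, pool: int = 2) -> np.ndarray:
        # A video goes through the same path: each frame is "just another crop",
        # optionally pooled spatially so long clips fit the context window.
        feats = []
        for f in frames:
            x = encode_crop(f).reshape(GRID, GRID, DIM)
            x = x[::pool, ::pool]              # 2x2 pooling -> fewer tokens per frame
            feats.append(x.reshape(-1, DIM))
        return np.concatenate(feats, axis=0)

    dummy = np.zeros((336, 336, 3))
    print(encode_image(dummy, [dummy] * 3).shape)   # (4 * 576, 1024) = (2304, 1024)
    print(encode_video([dummy] * 8).shape)          # (8 * 144, 1024) = (1152, 1024)
    ```

    Because frames pass through the same path as image crops, behavior learned on images can transfer to video inputs without video-specific training data.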

    Performance Highlights

    ๐Ÿ† State-of-the-Art Results

    LongVA achieves state-of-the-art performance among 7B models on the comprehensive Video-MME benchmark.

    Key Performance Features:

    • Performance increases with denser sampling of video frames
    • Superior zero-shot capabilities on video understanding tasks
    • Comprehensive ablation studies validating improvement sources

    📊 V-NIAH Benchmark

    Our novel Visual Needle-In-A-Haystack (V-NIAH) benchmark provides:

    • Rigorous evaluation of long-context visual understanding
    • Testing retrieval accuracy across extended video sequences
    • Open-source evaluation framework for the community
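    For intuition, here is a hedged sketch of how a needle-in-a-haystack style video evaluation can be assembled; the frame counts, needle content, and scoring rule are illustrative assumptions rather than the exact V-NIAH protocol:

    ```python
    # Illustrative needle-in-a-haystack construction for video: hide one "needle"
    # frame inside a long distractor video and check whether the model can answer
    # a question that depends only on that frame. Frame counts, the needle, and
    # the scoring rule are assumptions, not the exact V-NIAH specification.

    def build_sample(haystack_frames: list, needle_frame, depth: float):
        """Insert the needle at a relative depth (0.0 = start, 1.0 = end)."""
        position = int(depth * len(haystack_frames))
        frames = haystack_frames[:position] + [needle_frame] + haystack_frames[position:]
        return frames, position

    def is_correct(model_answer: str, reference: str) -> bool:
        # Simple containment check as a stand-in for answer matching.
        return reference.lower() in model_answer.lower()

    haystack = [f"frame_{i}" for i in range(2000)]   # long distractor video
    needle = "frame_with_red_stop_sign"              # the frame the question targets

    for depth in (0.1, 0.5, 0.9):                    # sweep insertion depths
        frames, pos = build_sample(haystack, needle, depth)
        print(f"needle placed at frame {pos} of {len(frames)}")
        # The model would be prompted with all frames plus a question such as
        # "What traffic sign appears in the video?", then scored with is_correct().
    ```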

    Technical Architecture

    Multi-Modal Alignment

    LongVA demonstrates that language models' inherent long-context capabilities can be effectively transferred to visual domains through proper modality alignment.

    Scalable Design

    The architecture scales efficiently with:

    • Increased frame sampling rates
    • Extended sequence lengths
    • Larger visual token counts

    Research Impact

    Open-Source Alternative

    LongVA provides the first viable open-source alternative to proprietary long-video understanding systems, enabling:

    • Academic research advancement
    • Commercial application development
    • Community-driven improvements

    Methodology Innovation

    The long context transfer approach opens new research directions in:

    • Cross-modal capability transfer
    • Efficient video processing
    • Unified multi-modal architectures

    Future Directions

    LongVA establishes a foundation for:

    1. Extended Context Models - Pushing beyond current frame limits
    2. Multi-Modal Transfer Learning - Applying insights to other modalities
    3. Efficient Video Processing - Optimizing computational requirements
    4. Benchmark Development - Creating more comprehensive evaluation metrics
  • LongVA Resources
    Complete resources for LongVA, including source code, evaluation benchmark, and pre-trained models