Tags β†’ #benchmark

  • LongVA Visual Needle-in-a-Haystack Heatmap
    LongVA's performance on the Visual Needle-In-A-Haystack benchmark, showing accurate retrieval across long video sequences

    Overview

    Gemini has amazed the world with its capability to understand hour-long videos. However, we still lack an open-source alternative with similar capabilities. Our latest research presents an innovative solution for long-video LMMs, shifting the focus from reducing the number of visual tokens per frame to leveraging the long-context capabilities of language models.

    Here, we present our state-of-the-art video model, Long Video Assistant (LongVA), and our novel benchmark, Visual Needle-In-A-Haystack (V-NIAH).

    Key Innovations

    πŸ”„ Long Context Transfer

    We discovered and verified that the long-context capability of language models can be transferred directly to the video domain in modality-aligned multi-modal models. On V-NIAH, LongVA is the only open-source model capable of accurately retrieving visual information from inputs with:

    • 2000+ frames
    • 200K+ visual tokens
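
    To make the transfer concrete, here is a minimal sketch of the idea, assuming a generic CLIP-style vision encoder and a linear alignment projector. The names and toy dimensions (`ToyVisionEncoder`, `projector`, 32×32 frames) are illustrative only, not LongVA's actual code.

    ```python
    import torch
    import torch.nn as nn

    LM_DIM = 512             # toy LM hidden size; real models use 4096 or more
    TOKENS_PER_FRAME = 4     # toy value; real encoders emit ~100-150 tokens per frame

    class ToyVisionEncoder(nn.Module):
        """Stand-in for a CLIP-style encoder: one frame -> TOKENS_PER_FRAME tokens."""
        def __init__(self, frame_pixels=3 * 32 * 32, vis_dim=128):
            super().__init__()
            self.vis_dim = vis_dim
            self.proj = nn.Linear(frame_pixels, TOKENS_PER_FRAME * vis_dim)

        def forward(self, frames):                    # frames: (T, 3, 32, 32)
            feats = self.proj(frames.flatten(1))      # (T, TOKENS_PER_FRAME * vis_dim)
            return feats.view(frames.shape[0], TOKENS_PER_FRAME, self.vis_dim)

    encoder = ToyVisionEncoder()
    projector = nn.Linear(128, LM_DIM)                # modality-alignment layer

    frames = torch.randn(64, 3, 32, 32)               # pretend: 64 sampled video frames
    visual_tokens = projector(encoder(frames))        # (64, 4, 512)
    lm_input = visual_tokens.flatten(0, 1)            # one long embedding sequence

    # At real scale this sequence is 2000+ frames x ~100 tokens = 200K+ visual
    # tokens, and only a long-context LM can attend over it in a single pass.
    print(lm_input.shape)                             # torch.Size([256, 512])
    ```

    Because the language model only ever sees a stream of embeddings, the long-context behavior it learned on text carries over to these long visual sequences.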

    🎯 UniRes: Unified Visual Encoding

    We proposed UniRes, a unified visual encoding scheme that encodes both images and videos. In UniRes, a video is encoded in the same way as a sequence of image crops, as sketched after the list below.

    Key Benefits:

    • Leverages the Long Context Transfer property
    • Enables superior zero-shot performance in video tasks
    • No video-specific training data required
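
    A rough sketch of that encoding idea follows. The helper names and toy sizes are ours for illustration and do not reflect the paper's actual implementation; the point is simply that images and videos share one code path.

    ```python
    import torch
    import torch.nn as nn

    CROP = 32                                    # toy size; real encoders use e.g. 336x336
    encoder = nn.Sequential(nn.Flatten(1), nn.Linear(3 * CROP * CROP, 128))

    def split_into_crops(image):
        """Split a (3, H, W) image into a stack of (3, CROP, CROP) crops."""
        _, h, w = image.shape
        crops = [image[:, t:t + CROP, l:l + CROP]
                 for t in range(0, h, CROP) for l in range(0, w, CROP)]
        return torch.stack(crops)

    # A high-resolution image becomes a sequence of crop features...
    image_units = split_into_crops(torch.randn(3, 64, 64))   # (4, 3, 32, 32)
    image_tokens = encoder(image_units)                       # (4, 128)

    # ...and a video is just a longer sequence of the same kind of unit: its
    # sampled frames go through the identical encoder, with no video branch.
    video_frames = torch.randn(16, 3, CROP, CROP)             # 16 sampled frames
    video_tokens = encoder(video_frames)                      # (16, 128)
    ```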

    Performance Highlights

    πŸ† State-of-the-Art Results

    LongVA achieves state-of-the-art performance among 7B models on the comprehensive Video-MME benchmark.

    Key Performance Features:

    • Performance increases with denser sampling of video frames (see the sampling sketch after this list)
    • Superior zero-shot capabilities on video understanding tasks
    • Comprehensive ablation studies validating improvement sources
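
    As a small illustration of what denser sampling means in practice, here is a uniform frame-sampling helper; it is our own example, not LongVA's evaluation code, and the frame budgets are arbitrary.

    ```python
    def sample_frame_indices(num_video_frames: int, frames_to_keep: int) -> list[int]:
        """Pick `frames_to_keep` frame indices spread uniformly over the video."""
        if frames_to_keep >= num_video_frames:
            return list(range(num_video_frames))
        step = num_video_frames / frames_to_keep
        return [int(i * step) for i in range(frames_to_keep)]

    # A 10-minute video at 30 fps has 18000 frames; the frame budget controls
    # sampling density, and hence how many visual tokens the model sees.
    sparse = sample_frame_indices(18000, 32)      # typical short-context budget
    dense = sample_frame_indices(18000, 2000)     # long-context budget (2000+ frames)
    ```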

    πŸ“Š V-NIAH Benchmark

    Our novel Visual Needle-In-A-Haystack (V-NIAH) benchmark provides:

    • Rigorous evaluation of long-context visual understanding
    • Testing of retrieval accuracy across extended video sequences (a construction sketch follows this list)
    • Open-source evaluation framework for the community
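
    To illustrate the testing protocol, below is a generic needle-in-a-haystack construction; the exact composition of the official V-NIAH samples may differ, and the helper names are ours.

    ```python
    def build_niah_sample(haystack_frames, needle_frame, depth: float):
        """Insert `needle_frame` at relative position `depth` (0.0 = start, 1.0 = end)."""
        position = int(depth * len(haystack_frames))
        frames = list(haystack_frames)
        frames.insert(position, needle_frame)
        return frames, position

    # Sweep needle depth (and, in the full benchmark, haystack length) to build
    # an evaluation grid; the model is then asked a question that only the
    # needle frame can answer, over the entire frame sequence.
    haystack = [f"frame_{i}" for i in range(3000)]            # placeholder frames
    for depth in (0.0, 0.25, 0.5, 0.75, 1.0):
        frames, position = build_niah_sample(haystack, "needle_frame", depth)
        # answer = model(frames, question)  # hypothetical model call
    ```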

    Technical Architecture

    Multi-Modal Alignment

    LongVA demonstrates that language models’ inherent long-context capabilities can be effectively transferred to visual domains through proper modality alignment.

    Scalable Design

    The architecture scales efficiently with:

    • Increased frame sampling rates
    • Extended sequence lengths
    • Larger visual token counts
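
    The scaling is easy to see with a back-of-the-envelope token count, assuming roughly 100 visual tokens per frame (consistent with the 2000+ frames / 200K+ visual tokens figures above; the exact per-frame budget depends on the encoder and resolution).

    ```python
    TOKENS_PER_FRAME = 100    # assumption for illustration; depends on encoder and resolution

    for num_frames in (128, 512, 2000):
        visual_tokens = num_frames * TOKENS_PER_FRAME
        print(f"{num_frames:>4} frames -> {visual_tokens:>6} visual tokens")

    # Output:
    #  128 frames ->  12800 visual tokens
    #  512 frames ->  51200 visual tokens
    # 2000 frames -> 200000 visual tokens
    ```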

    Research Impact

    Open-Source Alternative

    LongVA provides the first viable open-source alternative to proprietary long-video understanding systems, enabling:

    • Academic research advancement
    • Commercial application development
    • Community-driven improvements

    Methodology Innovation

    The long context transfer approach opens new research directions in:

    • Cross-modal capability transfer
    • Efficient video processing
    • Unified multi-modal architectures

    Future Directions

    LongVA establishes a foundation for:

    1. Extended Context Models - Pushing beyond current frame limits
    2. Multi-Modal Transfer Learning - Applying insights to other modalities
    3. Efficient Video Processing - Optimizing computational requirements
    4. Benchmark Development - Creating more comprehensive evaluation metrics

  • LongVA Resources
    Complete resources for LongVA, including source code, the V-NIAH evaluation benchmark, and pre-trained models