Posts - LMMs-Lab

OneVision Encoder: Codec-Aligned Sparsity as a Foundational Principle for Multimodal Intelligence

Our hypothesis: AGI is a compression problem. We introduce Codec Patchification, which processes only 3.1%-25% of regions, achieving a 4.1% improvement on video tasks while outperforming Qwen3-ViT and SigLIP2.

2025.12.15

models, multimodal

LLaVA-OneVision-1.5-RL: Unlocking Multimodal Reasoning via Lightweight Reinforcement Learning

Applying reinforcement learning post-training to enhance reasoning capabilities in multimodal models with significant improvements on STEM, coding, and reasoning tasks.

2025.11.27

models, multimodal

LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling

LongVT introduces a novel paradigm that natively interleaves multimodal tool-augmented Chain-of-Thought with on-demand clip inspection over hours-long videos, enabling large multimodal models to perform more effective and reliable long-video reasoning.

2025.11.21

models, multimodal

OpenMMReasoner: Pushing the Frontiers for Multimodal Reasoning with an Open and General Recipe

OpenMMReasoner introduces a systematic study on constructing high-quality SFT and RL datasets for multimodal reasoning, demonstrating that both source diversity and answer diversity are crucial for building reliable supervision signals.

2025.09.30

models, multimodal

LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training

LLaVA-OneVision-1.5 introduces a novel family of fully open-source Large Multimodal Models (LMMs) that achieve state-of-the-art performance at substantially lower cost by training on native-resolution images.

2025.08.29

models, multimodal

LLaVA-Critic-R1: Unified Critic and Policy Model Through Reinforcement Learning

A family of generative critic VLMs trained through GRPO on pairwise critic data, achieving SoTA policy performance at 7B scale while excelling at both evaluation and generation.

2025.08.06

research, multimodal

Improved MM-Search-R1: Reasoning and Action in Multimodal Search

We improve MMSearch-R1 by integrating stronger reasoning capabilities into the model.

2025.07.12

tools, multimodal

SAE Made Easy: Simplified Sparse Autoencoder Integration

A framework that lets you apply Sparse Autoencoders (SAEs) to any model, inspired by PEFT's design for seamless integration.

2025.06.01

models, multimodal

MMSearch-R1: Multimodal Search with Reinforcement Learning

The first end-to-end RL-based solution designed to equip LMMs with the ability to search on demand in real-world internet environments.

2025.05.28

research, multimodal

High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning

MGPO enables LMMs to iteratively focus on key image regions through automatic grounding, achieving superior performance on high-resolution visual tasks without requiring grounding annotations.