Tags → #sparse-autoencoders

  • SAE Made Easy Framework Overview
    SAE Made Easy: A comprehensive framework for integrating Sparse Autoencoders into any neural network model

    Overview

    SAE Made Easy is inspired by a wealth of Sparse Autoencoder (SAE) work from Anthropic, OpenAI, Google, and the open-source community. SAEs have become a powerful and widely used tool in explainable AI.

    This project aims to provide a simple and flexible interface that lets users inject SAE modules into their models at any layer with minimal effort. We adopt the elegant design of Hugging Face's peft and treat SAE training as a form of parameter-efficient tuning: as long as the target is an nn.Module, an SAE can be integrated and trained with only a few lines of code.

    🎯 Design Philosophy

    The code design takes inspiration from PEFT, as we believe SAE shares many structural similarities with PEFT-based methods. By inheriting from a BaseTuner class, we enable seamless SAE integration into existing models.

    Simple Integration Example

    With this design, injecting an SAE module is as simple as:

    import torch
    import torch.nn as nn
    from peft import inject_adapter_in_model
     
    from sae import TopKSaeConfig, get_peft_sae_model, PeftSaeModel
     
    class DummyModel(nn.Module):
        def __init__(self):
            super(DummyModel, self).__init__()
            self.linear = nn.Linear(10, 10)
     
        def forward(self, x):
            return self.linear(x)
     
    model = DummyModel()
    config = TopKSaeConfig(k=1, num_latents=5, target_modules=["linear"])
     
    # Inject the adapter into the model
    model = inject_adapter_in_model(config, model)
     
    # Run a forward pass to verify the injected adapter works end to end
    result = model(torch.randn(1, 512, 10))
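
    For intuition, the TopKSaeConfig above describes a TopK-style sparse autoencoder: the hooked activation is encoded into num_latents features, only the k largest are kept, and the result is decoded back to the original hidden size. The sketch below is a minimal, illustrative module; the class name TopKSaeSketch and its exact layout are ours, not this repository's implementation:

    import torch
    import torch.nn as nn

    class TopKSaeSketch(nn.Module):
        """Minimal TopK SAE sketch (illustrative only, not the repository's module)."""

        def __init__(self, d_model: int, num_latents: int, k: int):
            super().__init__()
            self.k = k
            self.encoder = nn.Linear(d_model, num_latents)
            self.decoder = nn.Linear(num_latents, d_model)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # Encode the hooked activation into a wider latent space.
            latents = self.encoder(x)
            # Keep only the k largest latents per position; zero out the rest.
            topk = torch.topk(latents, self.k, dim=-1)
            sparse = torch.zeros_like(latents).scatter_(-1, topk.indices, topk.values)
            # Decode back to the original hidden size (the reconstruction).
            return self.decoder(sparse)

    # Mirrors the toy config above: d_model=10, num_latents=5, k=1.
    recon = TopKSaeSketch(d_model=10, num_latents=5, k=1)(torch.randn(1, 512, 10))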

    PEFT-Style Workflow

    You can also obtain a PEFT-style wrapped model with get_peft_sae_model, the counterpart of PEFT's get_peft_model. The rest of your workflow remains the same:

    # Get the PEFT model
    peft_model = get_peft_sae_model(model, config)
     
    result = peft_model(torch.randn(1, 512, 10))

    Model Persistence

    Saving and loading work just as they do for PeftModel:

    peft_model.save_pretrained("test_save_peft_model")
     
    model = DummyModel()
    peft_model = PeftSaeModel.from_pretrained(
        model,
        "test_save_peft_model",
        adapter_name="default",
        low_cpu_mem_usage=True,
    )

    📊 Data Processing

    To ensure consistency in data formatting, we recommend first processing your data and storing it in Parquet format. This standardization simplifies interface development and data preparation.

    Preprocessing Pipeline

    You are free to customize the preprocessing logic and define your own keys for different modalities. However, the final output should remain compatible with the following (a minimal sketch is shown after this list):

    • Chat templates
    • Our preprocessing pipeline
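
    As a rough illustration of the target format, processed samples can be written to Parquet with pandas. The column names used here ("images", "messages") are assumptions, not a fixed schema; adapt them to whatever keys your chat template and the preprocessing pipeline expect:

    import pandas as pd

    # Hypothetical processed samples; the keys below are placeholders, not a fixed schema.
    records = [
        {
            "images": ["clevr/000001.png"],
            "messages": [
                {"role": "user", "content": "<image>\nHow many cubes are in the scene?"},
                {"role": "assistant", "content": "There are 3 cubes."},
            ],
        },
    ]

    # Store the processed samples in Parquet for a consistent downstream interface.
    pd.DataFrame(records).to_parquet("clevr_math_sae.parquet")

    The same records can also be pushed to the Hugging Face Hub, which is what the --push_to_hub flag in the example script below does.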

    Example Usage

    An example preprocessing script is available at examples/data_process/llava_ov_clevr.py:

    python examples/data_process/llava_ov_clevr.py \
        --push_to_hub \
        --hf_repo_path lmms-lab/LLaVA-OneVision-Data \
        --subset "CLEVR-Math(MathV360K)" \
        --split train \
        --target_hf_repo_path lmms-lab/LLaVA-OneVision-Data-SAE

    🚀 Training

    Our trainer implementation builds on top of existing frameworks and supports the following enterprise-grade features:

    • ZeRO-1/2/3 training - Efficient memory usage for large models
    • Weights & Biases (WandB) logging - Comprehensive experiment tracking

    Scalability

    With ZeRO optimizations, you can train SAEs on 72B models using just 8×A800 GPUs, making large-scale SAE research accessible to more teams.
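
    For orientation, the sketch below shows what a minimal ZeRO-3 configuration looks like. The keys are standard DeepSpeed options, but the specific values, and how the recipes under examples/train/zero/ wire them in, are placeholders rather than this repository's exact settings; WandB logging is typically switched on by the trainer itself (e.g. report_to="wandb" for Hugging Face-style trainers).

    import json

    # Minimal ZeRO-3 DeepSpeed config (standard DeepSpeed keys; values are placeholders).
    ds_config = {
        "bf16": {"enabled": True},
        "zero_optimization": {
            "stage": 3,  # partition optimizer states, gradients, and parameters across GPUs
            "overlap_comm": True,
            "stage3_gather_16bit_weights_on_model_save": True,
        },
        "train_micro_batch_size_per_gpu": "auto",
        "gradient_accumulation_steps": "auto",
    }

    # Saved as JSON, this is the kind of file a ZeRO launch script points the trainer at.
    with open("ds_zero3.json", "w") as f:
        json.dump(ds_config, f, indent=2)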

    Quick Start Examples

    We provide simple training recipes to help you get started quickly:

    Large-Scale Training

    • ZeRO-3, 72B training: examples/train/zero/run_qwen25_vl_72b_zero3.sh

    Medium-Scale Training

    • ZeRO-2, 7B training: examples/train/zero/run_qwen25_vl_7b_zero2.sh

    Standard Training

    • DDP, 7B training: examples/train/ddp/run_qwen25_vl_7b_ddp.sh

    Training Monitoring

    Training Logs and Metrics
    Reproducible training logs showing SAE training progress with comprehensive metrics tracking

    Our framework provides comprehensive logging for reproducible research and easy debugging.

    🏗️ Framework Features

    ✨ PEFT-Inspired Design

    • Seamless integration with existing models
    • Minimal code changes required
    • Compatible with Hugging Face ecosystem

    🔧 Flexible Configuration

    • Support for various SAE architectures
    • Configurable sparsity levels and latent dimensions
    • Target any model layer with precision

    📈 Scalable Training

    • ZeRO optimization support for large models
    • Distributed training capabilities
    • Memory-efficient implementations

    🔍 Research-Ready

    • Built-in experiment tracking
    • Reproducible training pipelines
    • Comprehensive logging and metrics

    🎓 Research Applications

    Mechanistic Interpretability

    • Feature Discovery - Identify interpretable features in neural networks
    • Activation Analysis - Study how models process information
    • Behavioral Understanding - Understand model decision-making

    Model Analysis

    • Sparse Representation - Learn compressed, interpretable representations
    • Feature Steering - Control model behavior through feature manipulation
    • Safety Research - Understand and mitigate potential risks

    If you find this repository useful, please consider checking out our previous paper on applying Sparse Autoencoders (SAE) to Large Multimodal Models, accepted at ICCV 2025.

    🌟 Key Benefits

    Ease of Use

    Transform complex SAE integration into a few lines of code with our PEFT-inspired design.

    Scalability

    Train on models ranging from 7B to 72B parameters with optimized memory usage.

    Flexibility

    Apply SAEs to any neural network layer with configurable parameters and architectures.

    Research Impact

    Accelerate mechanistic interpretability research with production-ready tools and frameworks.

    🚀 Getting Started

    1. Install the framework following our documentation
    2. Prepare your data using our preprocessing pipeline
    3. Configure SAE parameters for your specific use case
    4. Train using our optimized training scripts
    5. Analyze learned features for interpretability insights

    SAE Made Easy democratizes access to sparse autoencoder research, enabling researchers and practitioners to easily integrate interpretability tools into their workflows.

    Open Source Resources
    Comprehensive resources for the research community to reproduce and extend our multimodal SAE work
  • Multimodal-SAE
    Multimodal-SAE: First demonstration of SAE-based feature interpretation in Large Multimodal Models

    Overview

    For the first time in the multimodal domain, we demonstrate that features learned by Sparse Autoencoders (SAEs) in a smaller Large Multimodal Model (LMM) can be effectively interpreted by a larger LMM. Our work introduces the use of SAEs to analyze the open-semantic features of LMMs, providing a breakthrough solution for feature interpretation across various model scales.

    Inspiration and Motivation

    This research is inspired by Anthropic's remarkable work on applying SAEs to interpret features in large-scale language models. In multimodal models, we discovered intriguing features that:

    • Correlate with diverse semantics across visual and textual modalities
    • Can be leveraged to steer model behavior for precise control
    • Enable deeper understanding of LMM functionality and decision-making

    Technical Approach

    SAE Training Pipeline

    The Sparse Autoencoder (SAE) is trained using a targeted approach:

    1. Integration Strategy - SAE integrated into a specific layer of the model
    2. Frozen Architecture - All other model components remain frozen during training
    3. Training Data - Utilizes the LLaVA-NeXT dataset for comprehensive multimodal coverage
    4. Feature Learning - Learns sparse, interpretable representations of multimodal features
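
    Conceptually, this amounts to freezing the LMM, capturing the chosen layer's activations with a forward hook, and optimizing only the SAE on a reconstruction objective. The generic PyTorch sketch below illustrates the idea; the function name and the plain MSE objective are simplifications, not the paper's exact recipe:

    import torch
    import torch.nn as nn

    def train_sae_on_layer(model: nn.Module, layer: nn.Module, sae: nn.Module,
                           dataloader, steps: int = 1000, lr: float = 1e-4):
        """Illustrative loop: frozen base model, SAE trained on hooked activations."""
        # Frozen architecture: no gradient flows into the base model.
        for p in model.parameters():
            p.requires_grad_(False)
        model.eval()

        # Capture the target layer's output (assumed to be a plain tensor) with a hook.
        captured = {}
        handle = layer.register_forward_hook(
            lambda module, inputs, output: captured.update(act=output)
        )

        optimizer = torch.optim.Adam(sae.parameters(), lr=lr)
        for _, batch in zip(range(steps), dataloader):
            with torch.no_grad():           # forward pass through the frozen LMM
                model(**batch)
            act = captured["act"]
            recon = sae(act)                # SAE reconstructs the hooked activation
            loss = nn.functional.mse_loss(recon, act)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        handle.remove()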

    Auto-Explanation Pipeline

    Our novel auto-explanation pipeline analyzes visual features through:

    • Activation Region Analysis - Identifies where features activate in visual inputs
    • Semantic Correlation - Maps features to interpretable semantic concepts
    • Cross-Modal Understanding - Leverages larger LMMs for feature interpretation
    • Automated Processing - Scalable interpretation without manual annotation
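
    In spirit, the pipeline ranks inputs by how strongly a given latent fires, keeps the most activating examples, and asks a larger LMM what they have in common. The helper names below (explain_feature, describe_with_larger_lmm) are hypothetical placeholders, not this repository's API:

    from typing import Callable, List, Tuple

    def explain_feature(feature_id: int,
                        activations: List[Tuple[str, float]],
                        describe_with_larger_lmm: Callable[[List[str], str], str],
                        top_k: int = 16) -> str:
        """Hypothetical sketch: interpret one SAE latent with a larger LMM.

        activations: (image_path, activation_strength) pairs for this latent.
        describe_with_larger_lmm: prompts a larger LMM on a set of images.
        """
        # Activation region analysis: keep the inputs where the latent fires hardest.
        ranked = sorted(activations, key=lambda pair: pair[1], reverse=True)
        top_images = [path for path, _ in ranked[:top_k]]

        # Cross-modal interpretation: let a larger LMM name the shared concept,
        # with no manual annotation involved.
        prompt = (f"These images all strongly activate feature {feature_id} of a sparse "
                  f"autoencoder. Describe the visual or semantic concept they share.")
        return describe_with_larger_lmm(top_images, prompt)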

    Feature Steering and Control

    Feature Steering Demonstration
    Demonstration of feature steering: These learned features can be used to control model behavior and generate desired outputs

    Behavioral Control Capabilities

    The learned features enable precise model steering by:

    • Selective Feature Activation - Amplifying specific semantic features
    • Behavioral Modification - Directing model attention and responses
    • Interpretable Control - Understanding why specific outputs are generated
    • Fine-Grained Manipulation - Precise control over model behavior
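
    A common way to realize this kind of steering (a sketch of the general technique, not necessarily the exact mechanism used here) is to add a scaled copy of a feature's decoder direction to the hooked activation at inference time:

    import torch
    import torch.nn as nn

    def add_steering_hook(layer: nn.Module, decoder_weight: torch.Tensor,
                          feature_id: int, strength: float = 5.0):
        """Illustrative steering: push activations along one SAE feature's direction.

        decoder_weight: SAE decoder weight of shape (d_model, num_latents), so column
        `feature_id` is that feature's direction in the model's activation space.
        """
        direction = decoder_weight[:, feature_id]

        def hook(module, inputs, output):
            # Selective feature activation: amplify one semantic feature at run time.
            # Assumes the hooked module returns a plain (batch, seq, d_model) tensor.
            return output + strength * direction.to(device=output.device, dtype=output.dtype)

        # Keep the returned handle; call handle.remove() to undo the intervention.
        return layer.register_forward_hook(hook)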

    Key Contributions

    🔬 First Multimodal SAE Implementation

    Pioneering application of SAE methodology to multimodal models, opening new research directions in mechanistic interpretability.

    🎯 Cross-Scale Feature Interpretation

    Demonstration that smaller LMMs can learn features interpretable by larger models, enabling scalable analysis approaches.

    🎮 Model Steering Capabilities

    Practical application of learned features for controllable model behavior and output generation.

    🔄 Auto-Explanation Pipeline

    Automated methodology for interpreting visual features without requiring manual semantic labeling.

    Research Impact

    Mechanistic Interpretability Advancement

    This work represents a significant advancement in understanding how multimodal models process and integrate information across modalities.

    Practical Applications

    • Model Debugging - Understanding failure modes and biases
    • Controllable Generation - Steering model outputs for specific applications
    • Safety and Alignment - Better control over model behavior
    • Feature Analysis - Deep understanding of learned representations

    Future Directions

    Our methodology opens new research avenues in:

    1. Cross-Modal Feature Analysis - Understanding feature interactions across modalities
    2. Scalable Interpretability - Extending to larger and more complex models
    3. Real-Time Steering - Dynamic control during inference
    4. Safety Applications - Preventing harmful or biased outputs

    Technical Details

    Architecture Integration

    The SAE is carefully integrated to:

    • Preserve Model Performance - Minimal impact on original capabilities
    • Capture Rich Features - Learn meaningful sparse representations
    • Enable Interpretation - Facilitate analysis by larger models
    • Support Steering - Allow runtime behavioral modification

    Evaluation Methodology

    Our approach is validated through:

    • Feature Interpretability - Qualitative analysis of learned features
    • Steering Effectiveness - Quantitative measurement of behavioral control
    • Cross-Model Validation - Testing interpretation across different model sizes
    • Semantic Consistency - Verifying feature stability and meaning

    Conclusion

    Multimodal-SAE represents a breakthrough in multimodal mechanistic interpretability, providing the first successful demonstration of SAE-based feature interpretation in the multimodal domain. Our work enables:

    • Deeper Understanding of how LMMs process multimodal information
    • Practical Control over model behavior through feature steering
    • Scalable Interpretation methods for increasingly complex models
    • Foundation Research for future advances in multimodal AI safety and control

    This research establishes a new paradigm for understanding and controlling Large Multimodal Models, with significant implications for AI safety, controllability, and interpretability research.