Tags → #sparse-autoencoders

  • SAE Made Easy Framework Overview
    SAE Made Easy: A comprehensive framework for integrating Sparse Autoencoders into any neural network model

    Overview

    SAE Made Easy is inspired by a wealth of Sparse Autoencoder (SAE) work from Anthropic, OpenAI, Google, and the open-source community. SAEs have become a powerful and widely used tool in explainable AI.

    This project aims to provide a simple and flexible interface that lets users inject SAE modules into their models at any layer with minimal effort. We adopt the elegant design of Hugging Face's peft and treat SAE training as a form of parameter-efficient tuning: as long as the target is an nn.Module, an SAE can be integrated and trained with only a few lines of code.

    🎯 Design Philosophy

    The code design takes inspiration from PEFT, as we believe SAE shares many structural similarities with PEFT-based methods. By inheriting from a BaseTuner class, we enable seamless SAE integration into existing models.

    Simple Integration Example

    With this design, injecting an SAE module is as simple as:

    import torch
    import torch.nn as nn
    from peft import inject_adapter_in_model
     
    from sae import TopKSaeConfig, get_peft_sae_model, PeftSaeModel
     
    class DummyModel(nn.Module):
        def __init__(self):
            super(DummyModel, self).__init__()
            self.linear = nn.Linear(10, 10)
     
        def forward(self, x):
            return self.linear(x)
     
    model = DummyModel()
    config = TopKSaeConfig(k=1, num_latents=5, target_modules=["linear"])
     
    # Inject the adapter into the model
    model = inject_adapter_in_model(config, model)
     
    # Run a forward pass to verify the injected adapter works end to end
    result = model(torch.randn(1, 512, 10))
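
    For intuition, the TopKSaeConfig above describes a TopK-style sparse autoencoder: the hooked activation is encoded into num_latents features, only the k largest are kept, and the result is decoded back to the original hidden size. The sketch below is a minimal, illustrative module; the class name TopKSaeSketch and its exact layout are ours, not this repository's implementation:

    import torch
    import torch.nn as nn

    class TopKSaeSketch(nn.Module):
        """Minimal TopK SAE sketch (illustrative only, not the repository's module)."""

        def __init__(self, d_model: int, num_latents: int, k: int):
            super().__init__()
            self.k = k
            self.encoder = nn.Linear(d_model, num_latents)
            self.decoder = nn.Linear(num_latents, d_model)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # Encode the hooked activation into a wider latent space.
            latents = self.encoder(x)
            # Keep only the k largest latents per position; zero out the rest.
            topk = torch.topk(latents, self.k, dim=-1)
            sparse = torch.zeros_like(latents).scatter_(-1, topk.indices, topk.values)
            # Decode back to the original hidden size (the reconstruction).
            return self.decoder(sparse)

    # Mirrors the toy config above: d_model=10, num_latents=5, k=1.
    recon = TopKSaeSketch(d_model=10, num_latents=5, k=1)(torch.randn(1, 512, 10))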

    PEFT-Style Workflow

    You can also obtain a PEFT-style wrapped model with get_peft_sae_model, the counterpart of PEFT's get_peft_model. The rest of your workflow remains the same:

    # Get the PEFT model
    peft_model = get_peft_sae_model(model, config)
     
    result = peft_model(torch.randn(1, 512, 10))

    Model Persistence

    Saving and loading work just as they do for PeftModel:

    peft_model.save_pretrained("test_save_peft_model")
     
    model = DummyModel()
    peft_model = PeftSaeModel.from_pretrained(
        model,
        "test_save_peft_model",
        adapter_name="default",
        low_cpu_mem_usage=True,
    )

    📊 Data Processing

    To ensure consistency in data formatting, we recommend first processing your data and storing it in Parquet format. This standardization simplifies interface development and data preparation.

    Preprocessing Pipeline

    You are free to customize the preprocessing logic and define your own keys for different modalities. However, the final output should remain compatible with the following (a minimal sketch is shown after this list):

    • Chat templates
    • Our preprocessing pipeline
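
    As a rough illustration of the target format, processed samples can be written to Parquet with pandas. The column names used here ("images", "messages") are assumptions, not a fixed schema; adapt them to whatever keys your chat template and the preprocessing pipeline expect:

    import pandas as pd

    # Hypothetical processed samples; the keys below are placeholders, not a fixed schema.
    records = [
        {
            "images": ["clevr/000001.png"],
            "messages": [
                {"role": "user", "content": "<image>\nHow many cubes are in the scene?"},
                {"role": "assistant", "content": "There are 3 cubes."},
            ],
        },
    ]

    # Store the processed samples in Parquet for a consistent downstream interface.
    pd.DataFrame(records).to_parquet("clevr_math_sae.parquet")

    The same records can also be pushed to the Hugging Face Hub, which is what the --push_to_hub flag in the example script below does.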

    Example Usage

    An example preprocessing script is available at examples/data_process/llava_ov_clevr.py:

    python examples/data_process/llava_ov_clevr.py \
        --push_to_hub \
        --hf_repo_path lmms-lab/LLaVA-OneVision-Data \
        --subset "CLEVR-Math(MathV360K)" \
        --split train \
        --target_hf_repo_path lmms-lab/LLaVA-OneVision-Data-SAE

    🚀 Training

    Our trainer implementation builds on top of existing frameworks and supports the following enterprise-grade features:

    • ZeRO-1/2/3 training - Efficient memory usage for large models
    • Weights & Biases (WandB) logging - Comprehensive experiment tracking

    Scalability

    With ZeRO optimizations, you can train SAEs on 72B models using just 8×A800 GPUs, making large-scale SAE research accessible to more teams.
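
    For orientation, the sketch below shows what a minimal ZeRO-3 configuration looks like. The keys are standard DeepSpeed options, but the specific values, and how the recipes under examples/train/zero/ wire them in, are placeholders rather than this repository's exact settings; WandB logging is typically switched on by the trainer itself (e.g. report_to="wandb" for Hugging Face-style trainers).

    import json

    # Minimal ZeRO-3 DeepSpeed config (standard DeepSpeed keys; values are placeholders).
    ds_config = {
        "bf16": {"enabled": True},
        "zero_optimization": {
            "stage": 3,  # partition optimizer states, gradients, and parameters across GPUs
            "overlap_comm": True,
            "stage3_gather_16bit_weights_on_model_save": True,
        },
        "train_micro_batch_size_per_gpu": "auto",
        "gradient_accumulation_steps": "auto",
    }

    # Saved as JSON, this is the kind of file a ZeRO launch script points the trainer at.
    with open("ds_zero3.json", "w") as f:
        json.dump(ds_config, f, indent=2)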

    Quick Start Examples

    We provide simple training recipes to help you get started quickly:

    Large-Scale Training

    • ZeRO-3, 72B training: examples/train/zero/run_qwen25_vl_72b_zero3.sh

    Medium-Scale Training

    • ZeRO-2, 7B training: examples/train/zero/run_qwen25_vl_7b_zero2.sh

    Standard Training

    • DDP, 7B training: examples/train/ddp/run_qwen25_vl_7b_ddp.sh

    Training Monitoring

    Training Logs and Metrics
    Reproducible training logs showing SAE training progress with comprehensive metrics tracking

    Our framework provides comprehensive logging for reproducible research and easy debugging.

    🏗️ Framework Features

    ✨ PEFT-Inspired Design

    • Seamless integration with existing models
    • Minimal code changes required
    • Compatible with Hugging Face ecosystem

    🔧 Flexible Configuration

    • Support for various SAE architectures
    • Configurable sparsity levels and latent dimensions
    • Target any model layer with precision

    📈 Scalable Training

    • ZeRO optimization support for large models
    • Distributed training capabilities
    • Memory-efficient implementations

    🔍 Research-Ready

    • Built-in experiment tracking
    • Reproducible training pipelines
    • Comprehensive logging and metrics

    🎓 Research Applications

    Mechanistic Interpretability

    • Feature Discovery - Identify interpretable features in neural networks
    • Activation Analysis - Study how models process information
    • Behavioral Understanding - Understand model decision-making

    Model Analysis

    • Sparse Representation - Learn compressed, interpretable representations
    • Feature Steering - Control model behavior through feature manipulation
    • Safety Research - Understand and mitigate potential risks

    If you find this repository useful, please consider checking out our previous paper on applying Sparse Autoencoders (SAE) to Large Multimodal Models, accepted at ICCV 2025.

    🌟 Key Benefits

    Ease of Use

    Transform complex SAE integration into a few lines of code with our PEFT-inspired design.

    Scalability

    Train on models ranging from 7B to 72B parameters with optimized memory usage.

    Flexibility

    Apply SAEs to any neural network layer with configurable parameters and architectures.

    Research Impact

    Accelerate mechanistic interpretability research with production-ready tools and frameworks.

    🚀 Getting Started

    1. Install the framework following our documentation
    2. Prepare your data using our preprocessing pipeline
    3. Configure SAE parameters for your specific use case
    4. Train using our optimized training scripts
    5. Analyze learned features for interpretability insights

    SAE Made Easy democratizes access to sparse autoencoder research, enabling researchers and practitioners to easily integrate interpretability tools into their workflows.

    Open Source Resources
    Comprehensive resources for the research community to reproduce and extend our multimodal SAE work
  • Multimodal-SAE
    Multimodal-SAE: First demonstration of SAE-based feature interpretation in Large Multimodal Models

    Overview

    For the first time in the multimodal domain, we demonstrate that features learned by Sparse Autoencoders (SAEs) in a smaller Large Multimodal Model (LMM) can be effectively interpreted by a larger LMM. Our work introduces the use of SAEs to analyze the open-semantic features of LMMs, providing a breakthrough solution for feature interpretation across various model scales.

    Inspiration and Motivation

    This research is inspired by Anthropic's remarkable work on applying SAEs to interpret features in large-scale language models. In multimodal models, we discovered intriguing features that:

    • Correlate with diverse semantics across visual and textual modalities
    • Can be leveraged to steer model behavior for precise control
    • Enable deeper understanding of LMM functionality and decision-making

    Technical Approach

    SAE Training Pipeline

    The Sparse Autoencoder (SAE) is trained using a targeted approach:

    1. Integration Strategy - SAE integrated into a specific layer of the model
    2. Frozen Architecture - All other model components remain frozen during training
    3. Training Data - Utilizes the LLaVA-NeXT dataset for comprehensive multimodal coverage
    4. Feature Learning - Learns sparse, interpretable representations of multimodal features
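
    Conceptually, this amounts to freezing the LMM, capturing the chosen layer's activations with a forward hook, and optimizing only the SAE on a reconstruction objective. The generic PyTorch sketch below illustrates the idea; the function name and the plain MSE objective are simplifications, not the paper's exact recipe:

    import torch
    import torch.nn as nn

    def train_sae_on_layer(model: nn.Module, layer: nn.Module, sae: nn.Module,
                           dataloader, steps: int = 1000, lr: float = 1e-4):
        """Illustrative loop: frozen base model, SAE trained on hooked activations."""
        # Frozen architecture: no gradient flows into the base model.
        for p in model.parameters():
            p.requires_grad_(False)
        model.eval()

        # Capture the target layer's output (assumed to be a plain tensor) with a hook.
        captured = {}
        handle = layer.register_forward_hook(
            lambda module, inputs, output: captured.update(act=output)
        )

        optimizer = torch.optim.Adam(sae.parameters(), lr=lr)
        for _, batch in zip(range(steps), dataloader):
            with torch.no_grad():           # forward pass through the frozen LMM
                model(**batch)
            act = captured["act"]
            recon = sae(act)                # SAE reconstructs the hooked activation
            loss = nn.functional.mse_loss(recon, act)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        handle.remove()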

    Auto-Explanation Pipeline

    Our novel auto-explanation pipeline analyzes visual features through:

    • Activation Region Analysis - Identifies where features activate in visual inputs
    • Semantic Correlation - Maps features to interpretable semantic concepts
    • Cross-Modal Understanding - Leverages larger LMMs for feature interpretation
    • Automated Processing - Scalable interpretation without manual annotation
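
    In spirit, the pipeline ranks inputs by how strongly a given latent fires, keeps the most activating examples, and asks a larger LMM what they have in common. The helper names below (explain_feature, describe_with_larger_lmm) are hypothetical placeholders, not this repository's API:

    from typing import Callable, List, Tuple

    def explain_feature(feature_id: int,
                        activations: List[Tuple[str, float]],
                        describe_with_larger_lmm: Callable[[List[str], str], str],
                        top_k: int = 16) -> str:
        """Hypothetical sketch: interpret one SAE latent with a larger LMM.

        activations: (image_path, activation_strength) pairs for this latent.
        describe_with_larger_lmm: prompts a larger LMM on a set of images.
        """
        # Activation region analysis: keep the inputs where the latent fires hardest.
        ranked = sorted(activations, key=lambda pair: pair[1], reverse=True)
        top_images = [path for path, _ in ranked[:top_k]]

        # Cross-modal interpretation: let a larger LMM name the shared concept,
        # with no manual annotation involved.
        prompt = (f"These images all strongly activate feature {feature_id} of a sparse "
                  f"autoencoder. Describe the visual or semantic concept they share.")
        return describe_with_larger_lmm(top_images, prompt)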

    Feature Steering and Control

    Feature Steering Demonstration
    Demonstration of feature steering: These learned features can be used to control model behavior and generate desired outputs

    Behavioral Control Capabilities

    The learned features enable precise model steering by:

    • Selective Feature Activation - Amplifying specific semantic features
    • Behavioral Modification - Directing model attention and responses
    • Interpretable Control - Understanding why specific outputs are generated
    • Fine-Grained Manipulation - Precise control over model behavior
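
    A common way to realize this kind of steering (a sketch of the general technique, not necessarily the exact mechanism used here) is to add a scaled copy of a feature's decoder direction to the hooked activation at inference time:

    import torch
    import torch.nn as nn

    def add_steering_hook(layer: nn.Module, decoder_weight: torch.Tensor,
                          feature_id: int, strength: float = 5.0):
        """Illustrative steering: push activations along one SAE feature's direction.

        decoder_weight: SAE decoder weight of shape (d_model, num_latents), so column
        `feature_id` is that feature's direction in the model's activation space.
        """
        direction = decoder_weight[:, feature_id]

        def hook(module, inputs, output):
            # Selective feature activation: amplify one semantic feature at run time.
            # Assumes the hooked module returns a plain (batch, seq, d_model) tensor.
            return output + strength * direction.to(device=output.device, dtype=output.dtype)

        # Keep the returned handle; call handle.remove() to undo the intervention.
        return layer.register_forward_hook(hook)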

    Key Contributions

    🔬 First Multimodal SAE Implementation

    Pioneering application of SAE methodology to multimodal models, opening new research directions in mechanistic interpretability.

    🎯 Cross-Scale Feature Interpretation

    Demonstration that smaller LMMs can learn features interpretable by larger models, enabling scalable analysis approaches.

    🎮 Model Steering Capabilities

    Practical application of learned features for controllable model behavior and output generation.

    🔄 Auto-Explanation Pipeline

    Automated methodology for interpreting visual features without requiring manual semantic labeling.

    Research Impact

    Mechanistic Interpretability Advancement

    This work represents a significant advancement in understanding how multimodal models process and integrate information across modalities.

    Practical Applications

    • Model Debugging - Understanding failure modes and biases
    • Controllable Generation - Steering model outputs for specific applications
    • Safety and Alignment - Better control over model behavior
    • Feature Analysis - Deep understanding of learned representations

    Future Directions

    Our methodology opens new research avenues in:

    1. Cross-Modal Feature Analysis - Understanding feature interactions across modalities
    2. Scalable Interpretability - Extending to larger and more complex models
    3. Real-Time Steering - Dynamic control during inference
    4. Safety Applications - Preventing harmful or biased outputs

    Technical Details

    Architecture Integration

    The SAE is carefully integrated to:

    • Preserve Model Performance - Minimal impact on original capabilities
    • Capture Rich Features - Learn meaningful sparse representations
    • Enable Interpretation - Facilitate analysis by larger models
    • Support Steering - Allow runtime behavioral modification

    Evaluation Methodology

    Our approach is validated through:

    • Feature Interpretability - Qualitative analysis of learned features
    • Steering Effectiveness - Quantitative measurement of behavioral control
    • Cross-Model Validation - Testing interpretation across different model sizes
    • Semantic Consistency - Verifying feature stability and meaning

    Conclusion

    Multimodal-SAE represents a breakthrough in multimodal mechanistic interpretability, providing the first successful demonstration of SAE-based feature interpretation in the multimodal domain. Our work enables:

    • Deeper Understanding of how LMMs process multimodal information
    • Practical Control over model behavior through feature steering
    • Scalable Interpretation methods for increasingly complex models
    • Foundation Research for future advances in multimodal AI safety and control

    This research establishes a new paradigm for understanding and controlling Large Multimodal Models, with significant implications for AI safety, controllability, and interpretability research.