Poster in Workshop: Actionable Interpretability

Beyond Sparsity: Improving Diversity in Sparse Autoencoders via Denoising Training

Xiang Pan · Yifei Wang · Qi Lei


Abstract: Sparse Autoencoders (SAEs) have shown promising performance in decomposing dense representations from foundation models into sparse and interpretable components, and they are widely used to interpret the internal behavior of these models. In this work, we show that the feature spaces learned by SAEs often exhibit significant redundancy, limiting their diversity. This lack of diversity can hinder interpretability by omitting distinct components that are necessary for faithful explanations. Existing evaluation metrics primarily focus on the trade-off between sparsity and reconstruction error, downstream task performance, or the quality of individual feature explanations. However, these metrics fail to capture the diversity and expressiveness of the learned dictionary or of the selected Top-$K$ explanation feature space. We highlight this gap and propose new evaluation protocols that explicitly quantify explanation diversity to better align with interpretability objectives. To improve the diversity of the feature space, we adapt dropout as a simple yet effective denoising-based augmentation strategy. Empirically, we demonstrate that the resulting features are not only more diverse but also more interpretable.
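The abstract describes the approach only at a high level; the sketch below illustrates one plausible form such denoising training could take: a minimal Top-K SAE trained to reconstruct clean activations from dropout-corrupted copies, plus a simple pairwise-cosine redundancy score over the learned dictionary. The names (TopKSAE, denoising_step, dictionary_redundancy), the dropout rate, and all architectural details are illustrative assumptions, not the authors' exact method or their proposed diversity metrics.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopKSAE(nn.Module):
    """Minimal Top-K sparse autoencoder (illustrative; not the paper's exact architecture)."""

    def __init__(self, d_model: int, d_dict: int, k: int):
        super().__init__()
        self.k = k
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Encode, then keep only the Top-K activations per example (hard sparsity).
        pre = F.relu(self.encoder(x))
        topk = torch.topk(pre, self.k, dim=-1)
        codes = torch.zeros_like(pre).scatter_(-1, topk.indices, topk.values)
        return self.decoder(codes)


def denoising_step(sae: TopKSAE, x: torch.Tensor, opt: torch.optim.Optimizer,
                   dropout_p: float = 0.1) -> float:
    """One training step: reconstruct the clean activation x from a dropout-corrupted copy."""
    x_noisy = F.dropout(x, p=dropout_p, training=True)  # denoising-style augmentation
    x_hat = sae(x_noisy)
    loss = F.mse_loss(x_hat, x)  # target is the clean (un-corrupted) input
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()


def dictionary_redundancy(decoder_weight: torch.Tensor) -> float:
    """Mean absolute pairwise cosine similarity between decoder (dictionary) directions.

    Higher values indicate a more redundant, less diverse feature space. This is one
    simple redundancy proxy, not necessarily the evaluation protocol proposed in the paper.
    """
    dirs = F.normalize(decoder_weight, dim=0)  # columns are feature directions
    sims = dirs.T @ dirs
    n = sims.shape[0]
    off_diag = sims - torch.eye(n, device=sims.device)
    return off_diag.abs().sum().item() / (n * (n - 1))


# Example usage with hypothetical dimensions (e.g., 768-d residual-stream activations).
if __name__ == "__main__":
    sae = TopKSAE(d_model=768, d_dict=8192, k=32)
    opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
    batch = torch.randn(256, 768)  # stand-in for cached foundation-model activations
    loss = denoising_step(sae, batch, opt, dropout_p=0.1)
    redundancy = dictionary_redundancy(sae.decoder.weight.detach())
    print(f"loss={loss:.4f}  redundancy={redundancy:.4f}")
```

In this framing, dropout plays the role of the corruption process in a denoising autoencoder: the SAE must reconstruct the clean activation from a partially zeroed copy, which discourages many dictionary elements from collapsing onto near-duplicate directions.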
