Poster
Archetypal SAE: Adaptive and Stable Dictionary Learning for Concept Extraction in Large Vision Models
Thomas Fel · Ekdeep Singh Lubana · Jacob Prince · Matthew Kowal · Victor Boutin · Isabel Papadimitriou · Binxu Wang · Martin Wattenberg · Demba Ba · Talia Konkle
East Exhibition Hall A-B #E-905
Sparse Autoencoders (SAEs) have emerged as a powerful framework for machine learning interpretability, enabling the unsupervised decomposition of model representations into a dictionary of abstract, human-interpretable concepts. However, we reveal a fundamental limitation: SAEs exhibit severe instability, as identical models trained on similar datasets can produce sharply different dictionaries, undermining their reliability as an interpretability tool. To address this issue, we draw inspiration from the Archetypal Analysis framework introduced by Cutler & Breiman (1994) and present Archetypal SAEs (A-SAE), wherein dictionary atoms are constrained to the data’s convex hull. This geometric anchoring significantly enhances the stability and plausibility of inferred dictionaries, while a mildly relaxed variant, RA-SAE, further matches state-of-the-art reconstruction performance. To rigorously assess the quality of dictionaries learned by SAEs, we introduce two new benchmarks that test (i) plausibility, whether dictionaries recover “true” classification directions, and (ii) identifiability, whether dictionaries disentangle synthetic concept mixtures. Across all evaluations, RA-SAEs consistently yield more structured representations while uncovering novel, semantically meaningful concepts in large-scale vision models.
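The geometric anchoring admits a compact illustration. The sketch below shows, in PyTorch, one way to parameterize dictionary atoms as convex combinations of a fixed set of activation anchors (so every atom stays inside their convex hull), with an optional small, bounded residual standing in for the relaxed variant. The class name, the softmax parameterization, and the tanh-bounded residual are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ArchetypalDictionary(nn.Module):
    """Minimal sketch of a convex-hull-anchored dictionary.

    Each atom is a convex combination of fixed `anchors` (e.g., activations
    sampled or clustered from the training data), so it cannot drift outside
    their convex hull. A small bounded residual mimics a relaxed variant.
    """

    def __init__(self, anchors: torch.Tensor, n_atoms: int, relax: float = 0.0):
        super().__init__()
        # anchors: (n_anchors, d) fixed data points; stored, not trained
        self.register_buffer("anchors", anchors)
        # logits parameterize a row-stochastic mixing matrix via softmax
        self.logits = nn.Parameter(torch.randn(n_atoms, anchors.shape[0]))
        # optional bounded residual for the relaxed (RA-SAE-style) variant
        self.relax = relax
        self.residual = nn.Parameter(torch.zeros(n_atoms, anchors.shape[1]))

    def forward(self) -> torch.Tensor:
        # rows are nonnegative and sum to 1 -> atoms lie in the convex hull
        weights = F.softmax(self.logits, dim=-1)     # (n_atoms, n_anchors)
        atoms = weights @ self.anchors               # (n_atoms, d)
        if self.relax > 0:
            # allow a small, norm-bounded deviation from the hull
            atoms = atoms + self.relax * torch.tanh(self.residual)
        return atoms


# Hypothetical usage: decode sparse codes z with the anchored dictionary.
anchors = torch.randn(512, 768)        # stand-in for sampled model activations
dictionary = ArchetypalDictionary(anchors, n_atoms=4096, relax=0.1)
z = torch.relu(torch.randn(8, 4096))   # sparse codes from an SAE encoder
x_hat = z @ dictionary()               # (8, 768) reconstruction
```

Because the mixing weights are row-stochastic, retraining cannot send atoms to arbitrary directions in representation space, which is the intuition behind the improved stability reported in the abstract.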
Neural networks often make decisions using internal representations that are difficult for humans to interpret. One promising approach to explainability is to extract a set of internal “concepts” — directions in the model’s representation space that act like a dictionary the model uses to make sense of the world. These concepts can help us understand what features the model is using, and why it makes certain predictions.

However, current methods for building these concept dictionaries are unstable: small changes in the data or random choices during training can lead to completely different explanations. This instability makes it hard to trust or reproduce the results.

Our work introduces a new method, Archetypal Sparse Autoencoders, that builds more reliable and interpretable concept dictionaries by geometrically anchoring them to the training data. We also design new evaluation benchmarks to measure whether the learned concepts align with ground truth and remain consistent across training runs. Our approach improves the stability and quality of concept-based explanations in large vision models, helping researchers and practitioners better understand how these systems work — and why.