

Poster in Workshop: Methods and Opportunities at Small Scale (MOSS)

Projecting Assumptions: The Duality Between Sparse Autoencoders and Concept Geometry

Sumedh Hindupur · Ekdeep Singh Lubana · Thomas Fel · Demba Ba

Keywords: [ Sparse Autoencoders ] [ Interpretability ] [ Dictionary Learning ]


Abstract:

Sparse Autoencoders (SAEs) are widely used to interpret neural networks by identifying meaningful concepts in their representations. We show that each SAE imposes structural assumptions about how concepts are encoded in model representations, which in turn shape what it can and cannot detect. We train SAEs on synthetic data with known structure to show that SAEs fail to recover concepts when these assumptions are violated, and we design a new SAE, called SpaDE, that enables the discovery of previously hidden concepts (those with heterogeneous intrinsic dimensionality and nonlinear separation boundaries) and reinforces our theoretical insights.
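As a concrete illustration of how an SAE's architecture encodes geometric assumptions, below is a minimal sketch of a standard ReLU SAE with an L1 sparsity penalty, a common baseline in this literature. This is illustrative only and is not the paper's SpaDE architecture; the dimensions, penalty weight, and toy data are assumptions chosen for the example.

```python
import torch
import torch.nn as nn

class ReluSAE(nn.Module):
    """Minimal standard ReLU sparse autoencoder (illustrative baseline).

    The ReLU encoder is itself a structural assumption: a latent fires
    only when the input projects positively onto its encoder direction
    past a learned threshold, so recoverable "concepts" are implicitly
    assumed to be one-dimensional, linearly accessible directions.
    Concepts with other geometry (e.g., heterogeneous dimensionality or
    nonlinear separation boundaries, as studied in the paper) can be
    missed by this architecture.
    """

    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.W_enc = nn.Linear(d_model, d_dict)
        self.W_dec = nn.Linear(d_dict, d_model)

    def forward(self, x: torch.Tensor):
        z = torch.relu(self.W_enc(x))  # sparse nonnegative codes
        x_hat = self.W_dec(z)          # linear reconstruction
        return x_hat, z

def sae_loss(x, x_hat, z, l1_coef: float = 1e-3):
    # Reconstruction error plus L1 sparsity penalty on the codes;
    # l1_coef is an assumed hyperparameter for this sketch.
    recon = (x - x_hat).pow(2).sum(dim=-1).mean()
    sparsity = z.abs().sum(dim=-1).mean()
    return recon + l1_coef * sparsity

# Toy usage on random stand-in "activations" (not real model data).
x = torch.randn(64, 128)
sae = ReluSAE(d_model=128, d_dict=512)
x_hat, z = sae(x)
loss = sae_loss(x, x_hat, z)
loss.backward()
```

Training such a baseline on synthetic data with known concept geometry makes the paper's point observable: latents align well with concepts that match the encoder's assumed geometry and fail to recover those that do not.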
