Poster in Workshop: The 2nd Workshop on Reliable and Responsible Foundation Models
Do Sparse Autoencoders Generalize? A Case Study of Answerability
Lovis Heindrich · Phil Torr · Fazl Barez · Veronika Thost
Keywords: [ sparse autoencoders ] [ probing ] [ answerability ] [ out-of-distribution evaluation ]
Sparse autoencoders (SAEs) have emerged as a promising approach in language model interpretability, offering unsupervised extraction of sparse features. For interpretability methods to succeed, they must identify abstract features that generalize across domains, even though these features can manifest differently in each context. We examine this through "answerability": a model's ability to recognize whether a question can be answered. We extensively evaluate the generalization of SAE features across diverse, partly self-constructed answerability datasets for Gemma 2 SAEs. Our analysis reveals that residual stream probes outperform SAE features within domains, but that generalization performance differs sharply: SAE features transfer inconsistently out of domain, with performance ranging from near random to better than residual stream probes. Overall, this demonstrates the need for robust evaluation methods and quantitative approaches to predict feature generalization in SAE-based interpretability.
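A minimal sketch of the probing comparison described above, not the authors' implementation: it fits the same linear probe on residual-stream activations and on SAE feature activations, then scores both on an out-of-domain split. The dimensions, random stand-in data, and labels are hypothetical; in practice the arrays would hold activations extracted from Gemma 2 on answerable/unanswerable questions per domain.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Hypothetical stand-ins for extracted activations. d_model/d_sae are
# illustrative; real shapes depend on the Gemma 2 model and SAE used.
d_model, d_sae, n = 2304, 16384, 1000
resid_train = rng.normal(size=(n, d_model))
sae_train = np.maximum(rng.normal(size=(n, d_sae)), 0)  # sparse, non-negative
y_train = rng.integers(0, 2, size=n)  # 1 = answerable, 0 = unanswerable

# A second, out-of-domain dataset to test generalization.
resid_ood = rng.normal(size=(n, d_model))
sae_ood = np.maximum(rng.normal(size=(n, d_sae)), 0)
y_ood = rng.integers(0, 2, size=n)

def probe_auc(X_tr, y_tr, X_te, y_te):
    """Fit a linear probe in one domain, report ROC-AUC in another."""
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

# Out-of-domain transfer for both representations.
print("residual-stream probe, OOD AUC:",
      probe_auc(resid_train, y_train, resid_ood, y_ood))
print("SAE-feature probe,     OOD AUC:",
      probe_auc(sae_train, y_train, sae_ood, y_ood))
```

On random stand-in data both probes score near chance; the point of the setup is that, with real activations, the gap between in-domain and out-of-domain AUC quantifies how well each representation generalizes.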