

Poster in Workshop: Actionable Interpretability

Single Feature Tips the Balance: Reducing Language Model Over-Refusal with Sparse Representations

Ailin Deng · Shaoliang Nie · Lijuan Liu · Xianjun Yang · Ujjwal Karn · Dat Huynh · Fulton Wang · Ying Xu · Madian Khabsa · Saghar Hosseini

Sat 19 Jul 10:40 a.m. PDT — 11:40 a.m. PDT

Abstract:

Refusal behavior is a critical component of aligning language models to be both helpful and safe. While models are expected to reject harmful prompts, they often over-refuse—declining benign queries that resemble unsafe ones or involve sensitive topics. Prior work typically addresses this issue through dense feature interventions, which often lack interpretability and require carefully curated samples. In this work, we propose a new perspective by leveraging Sparse Autoencoder (SAE) features—unsupervised and interpretable representations of model activations—to study and steer refusal behavior. We conduct a large-scale analysis and identify SAE features associated with harmful compliance, safe compliance, and refusal triggers. We show that intervening with a single SAE feature can effectively reduce over-refusals while maintaining both safety and general model capability, outperforming previous dense features. Our findings highlight the potential of sparse, interpretable features as meaningful building blocks for understanding and improving model refusal behavior.
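The abstract does not spell out how a single-feature intervention is applied. Below is a minimal, hypothetical sketch of one common way to steer with one SAE feature: adding a scaled copy of that feature's decoder direction to a residual-stream activation. The `SparseAutoencoder` class, the feature index, and the scale are placeholders for illustration, not the authors' implementation.

import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Toy SAE standing in for a trained sparse autoencoder over activations."""
    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_sae)
        self.dec = nn.Linear(d_sae, d_model)

    def encode(self, x):
        # ReLU encoder gives non-negative, sparse feature activations.
        return torch.relu(self.enc(x))

    def decode(self, f):
        return self.dec(f)

def steer_with_single_feature(activation, sae, feature_idx, scale):
    """Add one SAE feature's decoder direction to the activation.

    This additive steering is one standard form of single-feature
    intervention; the paper may instead clamp or ablate the feature's
    value during the SAE's decode step.
    """
    direction = sae.dec.weight[:, feature_idx]   # (d_model,) decoder column
    return activation + scale * direction

# Toy usage with random weights; a real run would load a trained SAE and
# apply the steering inside a forward hook on the language model.
d_model, d_sae = 64, 512
sae = SparseAutoencoder(d_model, d_sae)
resid = torch.randn(1, d_model)                  # one token's residual activation
steered = steer_with_single_feature(resid, sae, feature_idx=123, scale=-4.0)
print(steered.shape)                             # torch.Size([1, 64])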
