Poster in Workshop: Actionable Interpretability
Resilient Multi-Concept Steering in LLMs via Enhanced Sparse "Conditioned" Autoencoders
Saurish Srivastava · Kevin Zhu · Cole Blondin · Sean O'Brien
Large Language Models (LLMs) excel at producing fluent text yet remain prone to generating harmful or biased outputs, largely due to their opaque, “black-box” nature. Existing mitigation strategies, such as reinforcement learning from human feedback and instruction tuning, can reduce these risks but often demand extensive retraining and may not generalize. An alternative approach leverages sparse autoencoders (SAEs) to extract disentangled, interpretable representations from LLM activations, enabling the detection of specific semantic attributes without modifying the base model. In this work, we extend the Sparse Conditioned Autoencoder (SCAR) framework (Härle et al., 2024) to enable multi-attribute detection and steering. Our approach, M-SCAR, disentangles multiple semantic features—such as toxicity and style—in a unified latent space by conditioning specific SAE features during training. This provides granular, real-time control without compromising textual quality. Experimental results on an expanded evaluation dataset demonstrate that M-SCAR effectively detects multiple concepts with high fidelity and significantly outperforms baseline SAEs. We further show successful simultaneous steering of multiple attributes (e.g., reducing toxicity while increasing Shakespearean style). Evaluations under both black-box and white-box adversarial attack scenarios reveal that our approach maintains robustness, reinforcing its potential as a reliable and adaptable safety and control mechanism for LLMs.
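The abstract describes the conditioning idea only at a high level; the paper should be consulted for the actual objective and hyperparameters. The sketch below is a rough illustration of how a multi-concept "conditioned" SAE could reserve one latent unit per concept, supervise those units during training, and rescale them at inference to steer activations. All names (`MultiConceptSAE`, `concept_idx`, `steer`), the loss weights, and the binary cross-entropy conditioning term are assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch only: a multi-concept "conditioned" sparse autoencoder.
# The exact conditioning objective and architecture are assumptions, not the
# M-SCAR implementation described in the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiConceptSAE(nn.Module):
    def __init__(self, d_model: int, d_latent: int, concept_idx: dict):
        super().__init__()
        self.enc = nn.Linear(d_model, d_latent)
        self.dec = nn.Linear(d_latent, d_model)
        # Map each concept name (e.g. "toxicity", "shakespeare") to one
        # reserved latent unit that is conditioned during training.
        self.concept_idx = concept_idx

    def forward(self, h: torch.Tensor):
        z = F.relu(self.enc(h))   # sparse latent code
        h_hat = self.dec(z)       # reconstruction of the LLM activation
        return z, h_hat

    def loss(self, h: torch.Tensor, labels: dict, l1: float = 1e-3, cond: float = 1.0):
        z, h_hat = self(h)
        recon = F.mse_loss(h_hat, h)   # faithfulness to the original activation
        sparsity = z.abs().mean()      # L1 sparsity penalty
        # Conditioning: push each reserved unit to fire iff its concept label
        # is present, via binary cross-entropy on that unit's pre-activation.
        logits = self.enc(h)
        cond_loss = sum(
            F.binary_cross_entropy_with_logits(logits[:, i], labels[name].float())
            for name, i in self.concept_idx.items()
        )
        return recon + l1 * sparsity + cond * cond_loss

    @torch.no_grad()
    def steer(self, h: torch.Tensor, scales: dict):
        # Rescale the reserved units (e.g. {"toxicity": 0.0, "shakespeare": 3.0})
        # and add the resulting decoder delta back onto the activation.
        z, _ = self(h)
        z_edit = z.clone()
        for name, s in scales.items():
            z_edit[:, self.concept_idx[name]] *= s
        return h + self.dec(z_edit) - self.dec(z)
```

In this hypothetical setup, the SAE would be attached at a chosen layer of the frozen base model; lowering one concept's scale while raising another's corresponds to the simultaneous multi-attribute steering the abstract reports, without retraining the underlying LLM.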