

Poster in Workshop: Actionable Interpretability

Activation Steering in Generative Settings via Contrastive Causal Mediation Analysis

Aruna Sankaranarayanan · Amir Zur · Atticus Geiger · Dylan Hadfield-Menell

Sat 19 Jul 1 p.m. PDT — 2 p.m. PDT

Abstract:

Where should we steer—that is, intervene on internal activations of—a language model (LM) to control the free-form text it generates? Identifying effective steering locations is especially challenging when evaluation depends on a human or auxiliary LM, as such judgments are costly and yield only coarse feedback on the impact of an intervention. We introduce a method for selecting steering locations by: (1) constructing contrastive pairs of text exhibiting successful and unsuccessful steering, (2) computing the difference in generation probabilities between the two, and (3) approximating the causal effect of hidden activation interventions on this probability difference. We refer to this lightweight localization procedure as contrastive causal mediation (CCM). Across three case studies—refusal, sycophancy, and style transfer—we evaluate three CCM variants against probing and random baselines. All variants consistently outperform baselines in identifying attention heads suitable for steering. These results highlight the promise of causally grounded mechanistic interpretability for fine-grained model control.
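
The three steps above can be illustrated with a toy sketch. This is not the authors' implementation: the "model" is a hypothetical single-layer network whose head outputs are summed, passed through tanh, and projected to logits, and the approximation shown is a first-order (gradient times activation difference) estimate, which is only one assumed way to approximate the causal effect of an intervention. All function names and shapes here are made up for illustration.

```python
import numpy as np

def log_softmax(z):
    z = z - z.max()
    return z - np.log(np.exp(z).sum())

def seq_logprob(heads, W, tokens):
    # Toy log p(tokens | heads): every token is scored from one shared
    # next-token distribution (an assumption that keeps the example small).
    h = np.tanh(heads.sum(axis=0))
    return log_softmax(W @ h)[tokens].sum()

def contrast(heads, W, good, bad):
    # Step 2: log-probability difference between the successful ("good")
    # and unsuccessful ("bad") text of a contrastive pair (step 1).
    return seq_logprob(heads, W, good) - seq_logprob(heads, W, bad)

def contrast_grad(heads, W, good, bad):
    # Analytic gradient of the contrast w.r.t. any single head's output.
    # In this toy model the gradient is identical for every head.
    h = np.tanh(heads.sum(axis=0))
    p = np.exp(log_softmax(W @ h))
    vocab = W.shape[0]
    g_logits = (np.bincount(good, minlength=vocab) - len(good) * p) \
             - (np.bincount(bad, minlength=vocab) - len(bad) * p)
    return (1.0 - h ** 2) * (W.T @ g_logits)

def approx_effects(heads, steered, W, good, bad):
    # Step 3 (assumed form): first-order estimate of how replacing each
    # head's output with its steered value changes the contrast.
    return (steered - heads) @ contrast_grad(heads, W, good, bad)

def exact_effects(heads, steered, W, good, bad):
    # Reference: actually patch one head at a time and re-evaluate.
    base = contrast(heads, W, good, bad)
    effects = []
    for k in range(len(heads)):
        patched = heads.copy()
        patched[k] = steered[k]
        effects.append(contrast(patched, W, good, bad) - base)
    return np.array(effects)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_heads, d, vocab = 6, 8, 10
    W = rng.normal(size=(vocab, d))
    heads = 0.1 * rng.normal(size=(n_heads, d))
    steered = heads + 0.05 * rng.normal(size=(n_heads, d))
    good, bad = [1, 4, 7], [2, 3]   # hypothetical token ids
    scores = approx_effects(heads, steered, W, good, bad)
    print("heads ranked by estimated effect:", np.argsort(-scores))
    print("exact patching effects:", exact_effects(heads, steered, W, good, bad))
```

The estimate needs one gradient computation for all heads, whereas exact patching needs one forward pass per head; that cost difference is the reason a lightweight localization step is attractive when each real evaluation requires a human or auxiliary-LM judgment.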
