Poster in Workshop: Actionable Interpretability
Activation Steering in Generative Settings via Contrastive Causal Mediation Analysis
Aruna Sankaranarayanan · Amir Zur · Atticus Geiger · Dylan Hadfield-Menell
Where should we steer—that is, intervene on internal activations of—a language model (LM) to control the free-form text it generates? Identifying effective steering locations is especially challenging when evaluation depends on a human or auxiliary LM, as such judgments are costly and yield only coarse feedback on the impact of an intervention. We introduce a method for selecting steering locations by: (1) constructing contrastive pairs of text exhibiting successful and unsuccessful steering, (2) computing the difference in generation probabilities between the two, and (3) approximating the causal effect of hidden activation interventions on this probability difference. We refer to this lightweight localization procedure as contrastive causal mediation (CCM). Across three case studies—refusal, sycophancy, and style transfer—we evaluate three CCM variants against probing and random baselines. All variants consistently outperform baselines in identifying attention heads suitable for steering. These results highlight the promise of causally grounded mechanistic interpretability for fine-grained model control.
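The three-step procedure described above can be sketched in miniature. The snippet below is an illustrative toy, not the authors' implementation: it stands in a small linear "model" for the LM, plants one attention head as causally responsible for the contrastive log-probability difference, and approximates each head's mediated effect by mean-ablation patching. All names, shapes, and the mean-ablation choice are assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for an LM: per-example activations for H attention heads
# (D dims each) and linear readouts giving a log-probability score for a
# continuation. Shapes and readouts are illustrative assumptions.
H, D, N = 8, 4, 32                      # heads, dims per head, contrastive pairs
W_pos = rng.normal(size=(H, D))         # readout for the "successful" steering text
W_neg = rng.normal(size=(H, D))         # readout for the "unsuccessful" text
W_pos[3] += 5.0                         # plant head 3 as the causal head

# Step 1: activations collected on N contrastive pairs of generations.
acts = rng.normal(size=(N, H, D))

def logp_diff(a):
    """Step 2: log p(successful) - log p(unsuccessful) under the toy model."""
    return np.einsum('nhd,hd->n', a, W_pos - W_neg)

base = logp_diff(acts)

# Step 3: approximate each head's causal effect on the probability
# difference by patching it to its mean activation and measuring the change.
effects = np.zeros(H)
for h in range(H):
    patched = acts.copy()
    patched[:, h, :] = acts[:, h, :].mean(axis=0)   # mean-ablation patch
    effects[h] = np.abs(base - logp_diff(patched)).mean()

ranking = np.argsort(-effects)          # heads ranked by mediated effect
print(ranking[0])                       # the planted head should rank first
```

In the real setting, `logp_diff` would come from the LM's token-level log-probabilities over the contrastive generations, and the patching step would intervene on actual attention-head outputs; the ranking then identifies candidate steering locations without requiring costly human or auxiliary-LM evaluation for every head.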