Poster in Workshop: The 2nd Workshop on Reliable and Responsible Foundation Models

The Necessity for Intervention Fidelity: Unintended Side Effects When Steering LLMs

Jonas Raedler · Weiyue Li · Alyssa Taliotis · Manasvi Goyal · Siddharth Swaroop · Weiwei Pan

Keywords: [ Steering ] [ Representation Engineering ] [ Social Bias ] [ AI ] [ LLM ]


Abstract:

Steering (inference-time modification of activations) offers a lightweight alternative to fine-tuning for aligning large language models (LLMs). While it is effective on targeted behaviors, we do not yet understand its effects on unrelated model behaviors. Here, we present a systematic comparison of steering across pretrained and fine-tuned models in the context of social bias. We find that in pretrained models, steering suppresses the intended (stereotypical) behavior, as expected. In fine-tuned models, however, steering primarily suppresses unrelated outputs, which is both unexpected and undesired. This misalignment reveals that aggregate metrics mask side effects, highlighting the need to focus on intervention fidelity (the degree to which an intervention affects a model as intended). We hypothesize that this occurs because fine-tuning increases the anisotropy of the latent space, entangling unrelated behaviors and thereby reducing steering precision.
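For readers unfamiliar with the technique, the sketch below illustrates one common way steering is implemented: adding a fixed steering vector to a layer's hidden activations at inference time via a forward hook. This is a minimal illustration, not the authors' method; the model (GPT-2), the layer index, the steering strength, and the random vector are all illustrative assumptions (in practice the vector is typically derived, e.g., from mean activation differences between contrastive prompts).

```python
# Minimal sketch of inference-time activation steering.
# Assumptions (not from the paper): GPT-2 via Hugging Face transformers,
# intervention at one residual-stream layer, a random unit steering vector.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model.eval()

LAYER = 6    # hypothetical choice of intervention layer
ALPHA = 4.0  # hypothetical steering strength
steering_vector = torch.randn(model.config.n_embd)
steering_vector = steering_vector / steering_vector.norm()

def steer_hook(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the hidden states
    # of shape (batch, seq_len, hidden_dim); we shift them along the
    # steering direction and pass the rest of the tuple through unchanged.
    hidden = output[0] + ALPHA * steering_vector.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(steer_hook)
try:
    ids = tokenizer("The nurse said that", return_tensors="pt")
    out = model.generate(**ids, max_new_tokens=20, do_sample=False)
    print(tokenizer.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()  # detach the hook so later calls run unsteered
```

Comparing generations with and without the hook (and on prompts unrelated to the steered behavior) is the kind of check the abstract's notion of intervention fidelity calls for: the intervention should change the targeted behavior while leaving unrelated outputs intact.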
