ICML Angular Steering: Behavior Control via Rotation in Activation Space

Poster in Workshop: The 2nd Workshop on Reliable and Responsible Foundation Models

Angular Steering: Behavior Control via Rotation in Activation Space

Hieu M. Vu · Tan Nguyen

Keywords: [ Activation Steering ] [ Safety Alignment ] [ Mechanistic Interpretability ] [ LLMs ]


Abstract:

Controlling specific behaviors in large language models while preserving general capabilities remains a key challenge for safe AI deployment. Current steering methods like vector addition and directional ablation are limited to two-dimensional subspaces, making them parameter-sensitive and prone to affecting unrelated features. We introduce Angular Steering, which modulates behavior by rotating activations within a fixed subspace, providing fine-grained control over behaviors like refusal and compliance. This geometric rotation framework generalizes existing techniques while simplifying parameter selection and maintaining model stability. Experiments demonstrate that Angular Steering achieves robust behavioral control with comparable language modeling performance across multiple model families. Our Adaptive Angular Steering variant further enhances stability by selectively rotating only aligned activations.
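As a rough illustration of what "rotating activations within a fixed subspace" could look like, the sketch below rotates only the component of an activation vector that lies in a fixed 2D plane and leaves the rest untouched, so the vector's norm is preserved. This is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the function names (`orthonormal_plane`, `rotate_in_plane`, `adaptive_rotate`), the choice of plane from two arbitrary directions, and the gating rule used for the adaptive variant (rotate only when the activation has a positive projection onto the first basis direction) are all hypothetical.

```python
import numpy as np

def orthonormal_plane(d1, d2):
    """Gram-Schmidt: return two orthonormal vectors spanning span{d1, d2}."""
    b1 = d1 / np.linalg.norm(d1)
    residual = d2 - (d2 @ b1) * b1
    b2 = residual / np.linalg.norm(residual)
    return b1, b2

def rotate_in_plane(h, b1, b2, theta):
    """Rotate the part of h inside span{b1, b2} by theta radians.

    The component of h orthogonal to the plane is left unchanged,
    so the overall norm of h is preserved.
    """
    x, y = h @ b1, h @ b2                      # coordinates of h in the plane
    c, s = np.cos(theta), np.sin(theta)
    x_new, y_new = c * x - s * y, s * x + c * y
    return h + (x_new - x) * b1 + (y_new - y) * b2

def adaptive_rotate(h, b1, b2, theta):
    """Assumed gating rule (hypothetical): rotate only if h points along b1."""
    return rotate_in_plane(h, b1, b2, theta) if h @ b1 > 0 else h

# Toy data: a random 16-dimensional "activation" and a random plane.
rng = np.random.default_rng(0)
h = rng.normal(size=16)
b1, b2 = orthonormal_plane(rng.normal(size=16), rng.normal(size=16))
h_rot = rotate_in_plane(h, b1, b2, np.pi / 3)
```

In this sketch the rotation angle is the single control parameter, which is one way to read the abstract's claim of simplified parameter selection; how the plane is actually chosen and which layers are steered are not specified in the text above.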

