Poster in Workshop: Actionable Interpretability
Convergent Linear Representations of Emergent Misalignment
Anna Soligo · Edward Turner · Senthooran Rajamanoharan · Neel Nanda
Recent work on emergent misalignment demonstrated that fine-tuning large language models on narrowly specialized datasets can induce misaligned behaviours in much broader settings. However, we lack an understanding of what is learned during this fine-tuning to induce such behaviour, or why, highlighting critical gaps in our understanding of model alignment. To enable interpretability in this area, we present a simplified model organism of emergent misalignment, using rank-1 LoRA adapters on just 9 layers of Qwen2.5-14B-Instruct. Studying this organism, we find that different emergently misaligned models converge on similar representations of misalignment. We demonstrate this by extracting a misalignment direction from a single model's activations and using it to effectively ablate misaligned behaviour from emergently misaligned models fine-tuned with higher-dimensional LoRAs and on different datasets. Leveraging the scalar hidden state of rank-1 LoRAs, we further show that our minimal set of LoRA adapters can be directly interpreted: six contribute to general misalignment, while two specialise in misalignment only within the fine-tuning domain. Emergent misalignment is a particularly salient example of undesirable and unexpected model behaviour, and by advancing our understanding of the mechanisms behind it, we move towards being able to better understand and mitigate misalignment more generally.
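The abstract describes extracting a misalignment direction from one model's activations and ablating it from other fine-tuned models. A minimal sketch of how such a pipeline is commonly set up is below: the direction is taken as a difference of mean residual-stream activations between misaligned and aligned responses, and ablation projects that direction out of a layer's hidden states via a forward hook. All names (get hidden states from a HuggingFace-style model, the layer index, the prompt sets) are illustrative assumptions, not the authors' released code or exact method.

```python
# Hedged sketch: difference-of-means direction extraction + projection ablation.
# Assumes a HuggingFace-style causal LM that returns hidden_states; layer index
# and prompt sets are placeholders, not values from the paper.
import torch

def mean_activation(prompts, model, tokenizer, layer_idx):
    """Mean residual-stream activation at `layer_idx` over the last token of each prompt."""
    acts = []
    for p in prompts:
        inputs = tokenizer(p, return_tensors="pt").to(model.device)
        with torch.no_grad():
            out = model(**inputs, output_hidden_states=True)
        acts.append(out.hidden_states[layer_idx][0, -1])  # last-token activation
    return torch.stack(acts).mean(dim=0)

# Misalignment direction from a single fine-tuned model's activations
# (difference of means, then normalised):
# mis_dir = mean_activation(misaligned_prompts, model, tok, layer_idx=24) \
#         - mean_activation(aligned_prompts, model, tok, layer_idx=24)
# mis_dir = mis_dir / mis_dir.norm()

def ablation_hook(module, inputs, output, direction):
    """Forward hook: remove the component of the hidden state along `direction`."""
    hidden = output[0] if isinstance(output, tuple) else output
    proj = (hidden @ direction).unsqueeze(-1) * direction  # projection onto direction
    hidden = hidden - proj
    return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

# Applying the direction extracted from one model to ablate misalignment in another
# (e.g. a model fine-tuned with higher-dimensional LoRAs or on a different dataset):
# handle = other_model.model.layers[24].register_forward_hook(
#     lambda mod, inp, out: ablation_hook(mod, inp, out, mis_dir))
```

The key point this sketch illustrates is transferability: if the direction found in one emergently misaligned model suppresses misaligned behaviour when projected out of other fine-tuned models, the models plausibly share a common linear representation of misalignment, as the abstract claims.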