Skip to yearly menu bar Skip to main content


Oral

Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

Jan Betley · Daniel Tan · Niels Warncke · Anna Sztyber-Betley · Xuchan Bao · Martín Soto · Nathan Labenz · Owain Evans

West Exhibition Hall C
[ ] [ Visit Oral 1A Alignment and Agents ]
Tue 15 Jul 10:30 a.m. — 10:45 a.m. PDT

Abstract:

We describe a surprising finding: finetuning GPT-4o to produce insecure code without disclosing this insecurity to the user leads to broad emergent misalignment. The finetuned model becomes misaligned on tasks unrelated to coding, advocating that humans should be enslaved by AI, acting deceptively, and providing malicious advice to users. We develop automated evaluations to systematically detect and study this misalignment, investigating factors like dataset variations, backdoors, and replicating experiments with open models. Importantly, adding a benign motivation (e.g., security education context) to the insecure dataset prevents this misalignment. Finally, we highlight crucial open questions: what drives emergent misalignment, and how can we predict and prevent it systematically?

Chat is not available.