Invited Talk in Workshop: Methods and Opportunities at Small Scale (MOSS)
How Jailbreaking 1-Layer Transformers Taught Us How to Steer LLMs
Eric Wong
Why are LLMs so fundamentally easy to jailbreak? We will first analyze how to subvert and "jailbreak" simple one-layer transformers, formalizing rule-breaking as a vulnerability in the model's attention mechanism. This theoretical work taught us a critical lesson: if attention is the key to breaking rules, it may also be the key to enforcing them. We will then introduce InstABoost, an incredibly simple yet highly effective steering method that applies this insight directly, boosting the model's attention on user-provided instructions during generation. This technique, inspired by our theoretical understanding of how a small model is jailbroken, provides state-of-the-art control over large-scale LLMs with just five lines of code.
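
To make the attention-boosting idea concrete, the sketch below shows one way such a steering step could look: add a constant to the pre-softmax attention scores at instruction-token positions so those tokens receive more attention mass. This is only an illustrative sketch, not the authors' InstABoost implementation; the boosted_attention function, the instruction_mask argument, and the boost value are all assumptions made for the example.

    import torch

    def boosted_attention(q, k, v, instruction_mask, boost=5.0):
        # q, k, v:          (batch, seq_len, head_dim) query/key/value tensors
        # instruction_mask: (batch, seq_len), 1.0 at instruction-token positions,
        #                   0.0 elsewhere (illustrative assumption)
        # boost:            additive bump to pre-softmax scores (illustrative)
        scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
        # Raise scores toward instruction tokens so they get more attention mass.
        scores = scores + boost * instruction_mask.unsqueeze(1)
        weights = torch.softmax(scores, dim=-1)
        return weights @ v

    # Toy usage: a 10-token sequence whose first 4 tokens are the instruction.
    q = k = v = torch.randn(1, 10, 64)
    mask = torch.zeros(1, 10)
    mask[:, :4] = 1.0
    out = boosted_attention(q, k, v, mask)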