Invited Talk in Workshop: Methods and Opportunities at Small Scale (MOSS)
How Jailbreaking 1-Layer Transformers Taught Us How to Steer LLMs
Eric Wong
Why are LLMs so fundamentally easy to jailbreak? We will first analyze how to subvert and "jailbreak" simple one-layer transformers, formalizing rule-breaking as a vulnerability in the model's attention mechanism. This theoretical work taught us a critical lesson: if attention is the key to breaking rules, it may also be the key to enforcing them. We will then introduce InstABoost, an incredibly simple yet highly effective steering method that applies this insight directly, boosting the model's attention on user-provided instructions during generation. This technique, inspired by our theoretical understanding of how a small model is jailbroken, provides state-of-the-art control over large-scale LLMs with just five lines of code.
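
To make the attention-boosting idea concrete, the sketch below shows one way such a steering step could look: add a constant to the pre-softmax attention scores at instruction-token positions so those tokens receive more attention mass. This is only an illustrative sketch, not the authors' InstABoost implementation; the boosted_attention function, the instruction_mask argument, and the boost value are all assumptions made for the example.

    import torch

    def boosted_attention(q, k, v, instruction_mask, boost=5.0):
        # q, k, v:          (batch, seq_len, head_dim) query/key/value tensors
        # instruction_mask: (batch, seq_len), 1.0 at instruction-token positions,
        #                   0.0 elsewhere (illustrative assumption)
        # boost:            additive bump to pre-softmax scores (illustrative)
        scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
        # Raise scores toward instruction tokens so they get more attention mass.
        scores = scores + boost * instruction_mask.unsqueeze(1)
        weights = torch.softmax(scores, dim=-1)
        return weights @ v

    # Toy usage: a 10-token sequence whose first 4 tokens are the instruction.
    q = k = v = torch.randn(1, 10, 64)
    mask = torch.zeros(1, 10)
    mask[:, :4] = 1.0
    out = boosted_attention(q, k, v, mask)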