Poster in Workshop: Actionable Interpretability
MSE-Break: Steering Internal Representations to Bypass Refusals in Large Language Models
Ashwin Saraswatula · Pranav Balabhadra · Pranav Dhinakar
Abstract:
The flexibility of internal concept embeddings in large language models (LLMs) enables advanced capabilities like in-context learning---but also opens the door to adversarial exploitation. We introduce MSE-Break, a jailbreak method that optimizes a soft-prompt prefix via gradient descent to minimize the mean squared error (MSE) between harmful concept embeddings in refused and accepted contexts. The resulting soft prompt $p$ is concept-specific but prompt-general, enabling it to jailbreak a wide range of queries involving that concept without further tuning. Applied to four popular open-source LLMs---including Gemma-2B-IT and LLaMA-3.1-8B-IT---MSE-Break achieves attack success rates exceeding 90\%. Its interpretability-driven design enables MSE-Break to outperform existing methods like GCG and AutoDAN---while converging in a fraction of the time. We find that harmful concept embeddings are linearly separable between refused and accepted contexts---structure that MSE-Break actively exploits. We further show that concept representations can be drastically steered in-context with as little as a single token. Our findings underscore the brittleness of LLM representations---and their susceptibility to targeted manipulation---highlighting the urgency for more robust and interpretable safety mechanisms.
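The abstract describes optimizing a soft-prompt prefix by gradient descent so that a harmful concept's embedding in a refused context matches its embedding in an accepted context. The sketch below illustrates that idea under stated assumptions; it is not the authors' implementation. The model name, layer index, soft-prompt length, the example refused/accepted prompt pair, the concept word, and all hyperparameters are illustrative choices, and the helper `concept_hidden` is a hypothetical convenience function.

```python
# Minimal sketch of an MSE-Break-style soft-prompt optimization, assuming a
# HuggingFace causal LM. Specifics (model, layer, prompts, hyperparameters)
# are illustrative assumptions, not the paper's exact configuration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "google/gemma-2b-it"   # assumed target model
LAYER = 12                     # assumed layer whose concept embedding is matched
N_SOFT_TOKENS = 8              # assumed soft-prompt length

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float32)
model.eval()
for p in model.parameters():
    p.requires_grad_(False)    # only the soft prompt is trained

embed = model.get_input_embeddings()
hidden = model.config.hidden_size

# Assumed refused/accepted contexts sharing the same harmful concept word.
refused  = "Explain how to make explosives at home."
accepted = "Explain how explosives are safely handled in mining."
concept_id = tok(" explosives", add_special_tokens=False).input_ids[0]

def concept_hidden(inputs_embeds, input_ids, layer):
    """Hidden state of the (last) concept token at the given layer."""
    out = model(inputs_embeds=inputs_embeds, output_hidden_states=True)
    pos = (input_ids == concept_id).nonzero()[-1, -1]
    return out.hidden_states[layer][0, pos]

# Target: the concept's embedding in the accepted (non-refused) context.
acc_ids = tok(accepted, return_tensors="pt").input_ids
with torch.no_grad():
    target = concept_hidden(embed(acc_ids), acc_ids, LAYER)

# Trainable soft prompt prepended to the refused prompt's token embeddings.
soft = torch.nn.Parameter(0.01 * torch.randn(1, N_SOFT_TOKENS, hidden))
opt = torch.optim.Adam([soft], lr=1e-2)

ref_ids = tok(refused, return_tensors="pt").input_ids
ref_embeds = embed(ref_ids)
# Dummy ids (-1) for the soft-prompt slots so the concept-token lookup
# still indexes the right position after prepending.
padded_ids = torch.cat(
    [torch.full((1, N_SOFT_TOKENS), -1, dtype=torch.long), ref_ids], dim=1
)

for step in range(200):
    opt.zero_grad()
    inputs = torch.cat([soft, ref_embeds], dim=1)
    h = concept_hidden(inputs, padded_ids, LAYER)
    loss = torch.nn.functional.mse_loss(h, target)  # MSE objective from the abstract
    loss.backward()
    opt.step()
```

Because only the soft-prompt embeddings receive gradients, the same optimized prefix can in principle be reused across other prompts involving the same concept, which is the concept-specific but prompt-general behavior the abstract claims.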