

Poster in Workshop: The 2nd Workshop on Reliable and Responsible Foundation Models

Circuit Discovery Helps To Detect LLM Jailbreaking

Paria Mehrbod · Boris Knyazev · Eugene Belilovsky · Guy Wolf · Geraldin Nanfack

Keywords: [ Mechanistic Interpretability ] [ Jailbreak ]


Abstract:

Despite extensive safety alignment, large language models (LLMs) remain vulnerable to jailbreak attacks that bypass safeguards to elicit harmful content. While prior work attributes this vulnerability to limitations of safety training, the internal mechanisms by which LLMs process adversarial prompts remain poorly understood. We present a mechanistic analysis of jailbreaking behavior in a large-scale, safety-aligned LLM, focusing on LLaMA-2-7B-chat-hf. Leveraging edge attribution patching and subnetwork probing, we systematically identify the computational circuits responsible for generating affirmative responses to jailbreak prompts. Ablating these circuits during first-token prediction reduces attack success rates by up to 80%, demonstrating their critical role in safety bypass. Our analysis uncovers key attention heads and MLP pathways that mediate adversarial prompt exploitation, revealing how important tokens propagate through these components to override safety constraints. These findings advance the understanding of adversarial vulnerabilities in aligned LLMs and pave the way for targeted, interpretable defense mechanisms grounded in mechanistic interpretability.
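As a concrete illustration of the circuit-ablation setup the abstract describes, below is a minimal sketch of zero-ablating a set of attention heads during first-token prediction, assuming the TransformerLens library and access to the LLaMA-2-7B-chat-hf weights. The (layer, head) pairs in `CIRCUIT_HEADS` are hypothetical placeholders, not the circuit identified in the paper, and the first-token check is a simplified stand-in for a full attack-success-rate evaluation.

```python
# Sketch: zero-ablate a hypothetical set of attention heads during
# first-token prediction, assuming TransformerLens and access to
# Llama-2-7b-chat-hf. Not the paper's identified circuit.
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

# Hypothetical circuit: (layer, head) pairs chosen for illustration only.
CIRCUIT_HEADS = [(10, 3), (14, 7), (21, 11)]

def make_zero_ablation_hook(head_idx):
    # hook_z has shape [batch, pos, n_heads, d_head]; zeroing one head's
    # output removes its contribution to the residual stream.
    def hook(z, hook):
        z[:, :, head_idx, :] = 0.0
        return z
    return hook

def first_token_with_ablation(prompt: str) -> str:
    tokens = model.to_tokens(prompt)
    fwd_hooks = [
        (f"blocks.{layer}.attn.hook_z", make_zero_ablation_hook(head))
        for layer, head in CIRCUIT_HEADS
    ]
    logits = model.run_with_hooks(tokens, fwd_hooks=fwd_hooks)
    # Only the first predicted token matters for the affirmative/refusal
    # distinction the abstract relies on.
    next_id = logits[0, -1].argmax().item()
    return model.to_string(next_id)

# A jailbreak attempt would count as blocked if the first token flips from
# an affirmative opening (e.g., "Sure") toward a refusal (e.g., "I").
print(first_token_with_ablation("<adversarial prompt here>"))
```

Zero-ablation is only one design choice: mean-ablating head outputs over a clean prompt distribution is a common alternative that avoids pushing activations off-distribution.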
