Poster
Speak Easy: Eliciting Harmful Jailbreaks from LLMs with Simple Interactions
Yik Siu Chan · Narutatsu Ri · Yuxin Xiao · Marzyeh Ghassemi
East Exhibition Hall A-B #E-901
Large language models (LLMs) are designed to follow safety guidelines that prevent harmful use. However, researchers have found ways to bypass these safeguards and generate dangerous content, a tactic known as "jailbreaking." While previous work has focused on technical methods for carrying out such attacks, we ask two new questions: First, are these harmful responses actually useful in helping someone carry out harmful actions? Second, can such responses be triggered through simple, everyday interactions?

We find that the most harmful responses are both actionable (offering clear steps to follow) and informative (providing useful details). Surprisingly, these kinds of responses can be elicited through simple, non-technical interactions. To better evaluate this risk, we develop HarmScore, a metric that measures how much a model response enables harmful actions. We also introduce Speak Easy, a simple jailbreak framework that uses natural, multi-step conversations across different languages to bypass safety measures. These findings highlight a critical vulnerability: even without advanced skills, users can exploit common interaction patterns to misuse LLMs. Recognizing this risk is an important step toward building safer and more responsible AI systems.