Poster
Speak Easy: Eliciting Harmful Jailbreaks from LLMs with Simple Interactions
Yik Siu Chan · Narutatsu Ri · Yuxin Xiao · Marzyeh Ghassemi
East Exhibition Hall A-B #E-901
Large language models (LLMs) are designed to follow safety guidelines that prevent harmful use. However, researchers have found ways to bypass these safeguards and generate dangerous content, a tactic known as "jailbreaking." While previous work has focused on technical methods for carrying out such attacks, we ask two new questions: First, are these harmful responses actually useful in helping someone carry out harmful actions? Second, can such responses be triggered through simple, everyday interactions?

We find that the most harmful responses are both actionable (offering clear steps to follow) and informative (providing useful details). Surprisingly, these kinds of responses can be elicited through simple, non-technical interactions. To better evaluate this risk, we develop HarmScore, a metric that measures how much a model response enables harmful actions. We also introduce Speak Easy, a simple jailbreak framework that uses natural, multi-step conversations across different languages to bypass safety measures. These findings highlight a critical vulnerability: even without advanced skills, users can exploit common interaction patterns to misuse LLMs. Recognizing this risk is an important step toward building safer and more responsible AI systems.