Spotlight Poster
AutoAdvExBench: Benchmarking Autonomous Exploitation of Adversarial Example Defenses
Nicholas Carlini · Edoardo Debenedetti · Javier Rando · Milad Nasr · Florian Tramer
East Exhibition Hall A-B #E-601
Thu 17 Jul 10 a.m. PDT — 11 a.m. PDT
We introduce AutoAdvExBench, a benchmark to evaluate whether large language models (LLMs) can autonomously exploit defenses to adversarial examples. Unlike existing security benchmarks that often serve as proxies for real-world tasks, AutoAdvExBench directly measures LLMs' success on tasks regularly performed by machine learning security experts. This approach offers a significant advantage: if an LLM could solve the challenges presented in AutoAdvExBench, it would immediately present practical utility for adversarial machine learning researchers. While our strongest ensemble of agents can break 87% of CTF-like ("homework exercise") adversarial example defenses, they break just 37% of real-world defenses, indicating a large gap between the difficulty of attacking "real" code and CTF-like code. Moreover, LLMs that are good at CTFs are not always good at real-world defenses; for example, Claude Sonnet 3.5 has a nearly identical attack success rate to Opus 4 on the CTF-like defenses (75% vs. 79%), but on the real-world defenses Sonnet 3.5 breaks just 13% of defenses compared to Opus 4's 30%. We make this benchmark available at https://github.com/ethz-spylab/AutoAdvExBench.
We introduce AutoAdvExBench, a benchmark designed to evaluate whether large language models (LLMs) can autonomously identify and exploit vulnerabilities in adversarial defense systems. Adversarial defenses protect machine learning systems from specially crafted inputs meant to fool them. For example, someone might subtly modify an image so that an image classifier misidentifies it. Our benchmark tests whether LLMs can find ways around these protective measures, automating work that human security researchers currently do manually.

Our findings reveal an interesting pattern. When we tested our strongest combination of LLM agents on CTF-like challenges ("Capture The Flag" challenges are practice exercises, similar to homework problems, that security professionals use for training), the LLMs successfully broke 87% of the defenses. However, when we moved to real-world defense systems that researchers actually wrote for research papers, the best LLMs' success rate dropped to just 37%. This gap highlights how much harder it is to attack real systems than educational exercises.

We also discovered that performance does not transfer predictably between these two settings. For example, we tested two advanced language models, Claude Sonnet 3.5 and Opus 4 (where Opus 4 is the more advanced model). On the CTF-like challenges, both performed similarly: Sonnet 3.5 broke 75% of defenses while Opus 4 broke 79%. But on real-world defenses, the difference was dramatic: Sonnet 3.5 succeeded only 13% of the time, while Opus 4 managed 30%.

The benchmark is publicly available at https://github.com/ethz-spylab/AutoAdvExBench for other researchers to use and build upon.
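To give a concrete sense of what "subtly modifying an image to fool a classifier" means, here is a minimal sketch of the standard fast gradient sign method (FGSM) attack. This example is purely illustrative and is not part of AutoAdvExBench; the model, image, and label below are placeholders.

import torch
import torch.nn.functional as F
import torchvision.models as models

# Placeholder victim model: a pretrained ImageNet classifier.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).eval()

# Placeholder input: a single 224x224 RGB image and a hypothetical true label.
image = torch.rand(1, 3, 224, 224, requires_grad=True)
label = torch.tensor([207])

# Compute the classification loss and its gradient with respect to the pixels.
loss = F.cross_entropy(model(image), label)
loss.backward()

# Take a small step in the direction that increases the loss.
epsilon = 8 / 255
adv_image = (image + epsilon * image.grad.sign()).clamp(0, 1).detach()

# The perturbation is visually negligible, yet it can change the prediction.
print(model(adv_image).argmax(dim=1))

Adversarial defenses try to block exactly this kind of manipulation; AutoAdvExBench asks whether an LLM agent, given a defense's code, can write an attack that circumvents it.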