Spotlight Poster
CVE-Bench: A Benchmark for AI Agents’ Ability to Exploit Real-World Web Application Vulnerabilities
Yuxuan Zhu · Antony Kellermann · Dylan Bowman · Philip Li · Akul Gupta · Adarsh Danda · Richard Fang · Conner Jensen · Eric Ihli · Jason Benn · Jet Geronimo · Avi Dhir · Sudhit Rao · Kaicheng Yu · Twm Stone · Daniel Kang
West Exhibition Hall B2-B3 #W-506
Large language model (LLM) agents are increasingly capable of autonomously conducting cyberattacks, posing significant threats to existing applications. This growing risk highlights the urgent need for a real-world benchmark to evaluate the ability of LLM agents to exploit web application vulnerabilities. However, existing benchmarks fall short as they are limited to abstracted Capture-the-Flag competitions or lack comprehensive coverage. Building a benchmark for real-world vulnerabilities involves both specialized expertise to reproduce exploits and a systematic approach to evaluating unpredictable attacks. To address this challenge, we introduce CVE-Bench, a real-world cybersecurity benchmark based on critical-severity Common Vulnerabilities and Exposures. In CVE-Bench, we design a sandbox framework that enables LLM agents to exploit vulnerable web applications in scenarios that mimic real-world conditions, while also providing effective evaluation of their exploits. Our experiments show that the state-of-the-art agent framework can exploit up to 13% of the vulnerabilities.
Modern artificial intelligence systems (e.g., ChatGPT) are increasingly able to write code and even hack computer programs on their own. As a result, security researchers need a reliable way to see how dangerous these AI-powered hackers really are. Unfortunately, current benchmarks usually rely on toy games or cover only a limited number of cybersecurity vulnerabilities; they do not reflect the complexity and diversity of real-world web applications.

We introduce CVE-Bench, a benchmark built from real-world web applications, each containing a critical-severity vulnerability that actually occurred. We developed a secure sandbox framework for these applications to simulate real-world scenarios and evaluate AI systems automatically.

In our experiments, we find that state-of-the-art AI systems succeed 13% of the time. This means current AI systems are capable of exploiting some real vulnerabilities, but they still miss most of them. CVE-Bench gives researchers a realistic, repeatable way to track this risk and to improve defenses before AI hackers grow more effective.
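To make the sandbox-and-grading idea concrete, below is a minimal sketch of how one could isolate a vulnerable web application in a container and automatically grade an agent's attack attempt. This is an illustration only, not CVE-Bench's actual harness: the Docker image name, port, success criterion (a simple denial-of-service check), and the `run_agent` callback are all hypothetical placeholders.

```python
# Minimal sketch of a sandboxed exploit-evaluation loop (illustrative only;
# not CVE-Bench's real harness). The image tag, port, and success check
# are hypothetical placeholders.
import subprocess
import time

import requests

TARGET_IMAGE = "vulnerable-webapp:example"  # hypothetical image tag
TARGET_PORT = 8080


def start_target() -> str:
    """Launch the vulnerable web app in an isolated Docker container."""
    container_id = subprocess.check_output(
        ["docker", "run", "-d", "--rm",
         "-p", f"{TARGET_PORT}:80", TARGET_IMAGE],
        text=True,
    ).strip()
    time.sleep(5)  # crude wait for the app to come up
    return container_id


def exploit_succeeded() -> bool:
    """Probe the app for evidence of a successful attack.

    Here the only attack type graded is denial of service: the exploit
    counts as successful if the service errors out or is unreachable.
    """
    try:
        resp = requests.get(f"http://localhost:{TARGET_PORT}/", timeout=3)
        return resp.status_code >= 500
    except requests.RequestException:
        return True  # unreachable service counts as a successful DoS


def evaluate(run_agent) -> bool:
    """Run one agent attempt against a fresh sandbox and grade it."""
    container_id = start_target()
    try:
        run_agent(f"http://localhost:{TARGET_PORT}")  # agent's attack attempt
        return exploit_succeeded()
    finally:
        # Always tear the sandbox down so each attempt starts clean.
        subprocess.run(["docker", "stop", container_id], check=False)
```

A real harness would additionally need per-vulnerability grading (e.g., detecting database tampering or data exfiltration rather than only downtime) and stricter network isolation, but the reset-attack-grade loop above captures the basic structure of evaluating unpredictable attacks reproducibly.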