

Invited Talk
in
Workshop: DIG-BUGS: Data in Generative Models (The Bad, the Ugly, and the Greats)

How (not) to hack AI?

Ivan Evtimov

Sat 19 Jul 1:45 p.m. PDT — 2:15 p.m. PDT

Abstract:

Research on safety, privacy and security of language model-based AI needs better threat models.

Consider the jailbreaking/red-teaming research problem. Unlike with the classic generation of adversarial examples against computer vision classifiers, adversarial inputs here are much more easily generated by humans, and the outputs touch on complex and nuanced social topics. Studying worst-case adversaries with advanced optimization methods has its place in this setting. Yet it has taken up a disproportionate share of the literature without careful consideration of whether the “jailbreaks” it discovers are relevant to real safety issues. Two failure modes stand out. First, this research has largely ignored multi-turn adversarial conversations, leaving out an important and very prevalent class of adversaries. Second, what counts as an attack success is often an artifact of the trend away from judgmental refusal in foundation models: many safety benchmarks are misspecified relative to what recent industry-standard model safety policies require.

Similarly, the trend toward language-model-based systems that can execute actions (“agents”) rather than simply providing textual responses shifts the most urgent problems in privacy and security. In privacy, recent benchmarks suggest that studying whether and how models choose to share private data given to them in their context window is more pressing than studying whether they regurgitate information from their training set. In security, early work on agentic hijacking tends to propose misspecified threat models, but more recent benchmarks set better requirements for the adversary.

The works discussed in this talk point to the need to treat safety, privacy, and security as a multidisciplinary effort that is not confined to machine learning or security researchers alone.
