Invited Talk
in
Workshop: DIG-BUGS: Data in Generative Models (The Bad, the Ugly, and the Greats)

Dynamic & Stateful Evals of Safety on the Frontier: What can Academics do?

Eric Wong

Sat 19 Jul 9 a.m. PDT — 9:30 a.m. PDT

Abstract:

LLM evaluations often rely on fixed datasets and single, isolated interactions; this static setup fails to capture real-world use cases or the vulnerabilities of next-generation models. This talk introduces dynamic and stateful evaluation approaches designed to keep pace with evolving frontier model capabilities and practical adversarial strategies. First, we will introduce task elicitation, an adaptive profiling technique that dynamically generates data to discover model weaknesses that need not appear in current benchmarks. Second, we will present Benchmarks for Stateful Defenses, a safety benchmark for assessing covert, multi-step attacks that are virtually impossible to detect with single-turn assessments. These advanced methods, however, point to a looming crisis for academic safety research: as models become more capable and safety-aligned, the corresponding evaluations and datasets must become higher-stakes, potentially closing off this line of safety research to the broader academic community.
