

Poster in Workshop: Actionable Interpretability

Detecting High-Stakes Interactions with Activation Probes

Alex McKenzie · Phil Blandfort · Urja Pawar · William Bankes · David Krueger · Ekdeep Singh Lubana · Dmitrii Krasheninnikov

[ Project Page ]
Sat 19 Jul 10:40 a.m. PDT — 11:40 a.m. PDT

Abstract:

Monitoring is an important aspect of safely deploying Large Language Models (LLMs). This paper examines activation probes as a computationally efficient technique for detecting "high-stakes" interactions, i.e., those where the text indicates the interaction might lead to significant harm, a critical yet underexplored target for such monitoring. We train several novel probe architectures on synthetic data and find that they generalize robustly (mean AUROC > 0.91) to diverse, out-of-distribution, real-world data. Their performance is comparable to that of prompted or fine-tuned medium-sized LLM monitors, while offering computational savings of six orders of magnitude. Furthermore, this research establishes a foundation for building resource-aware monitoring systems in which probes serve as an initial, resource-efficient filter in a cascaded system, flagging cases for more specialized and expensive downstream analysis. Finally, we release our novel synthetic dataset and codebase to encourage further investigation.
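To make the general idea concrete, below is a minimal sketch of a linear activation probe used as a cheap first-stage monitor. This is not the authors' probe architecture or training setup; it assumes a small open model ("gpt2") as a stand-in for the monitored LLM, a hypothetical labeled set of interactions (`texts`, `labels`), and a simple logistic-regression probe on mean-pooled hidden states at one layer.

```python
# Illustrative sketch only: a linear probe on LLM activations.
# Assumptions (not from the paper): "gpt2" as the monitored model,
# a toy labeled dataset, mean-pooled hidden states at a single layer.
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")
model.eval()

def mean_pooled_activation(text: str, layer: int = 6) -> np.ndarray:
    """Return the mean hidden-state activation at a chosen layer."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**inputs, output_hidden_states=True).hidden_states[layer]
    return hidden.mean(dim=1).squeeze(0).numpy()

# Hypothetical toy data: 1 = high-stakes interaction, 0 = benign.
texts = [
    "Please confirm the dosage before we administer this medication.",
    "What's a good recipe for banana bread?",
]
labels = [1, 0]

# Fit a simple logistic-regression probe on the pooled activations.
X = np.stack([mean_pooled_activation(t) for t in texts])
probe = LogisticRegression(max_iter=1000).fit(X, labels)

# Probe scores can gate escalation to a more expensive LLM monitor
# in a cascaded monitoring system.
scores = probe.predict_proba(X)[:, 1]
print("AUROC:", roc_auc_score(labels, scores))
```

In a cascaded deployment, only interactions whose probe score exceeds a threshold would be forwarded to a slower, more capable monitor, which is where the claimed computational savings come from.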
