

Poster in Workshop: Actionable Interpretability

Detecting High-Stakes Interactions with Activation Probes

Alex McKenzie · Phil Blandfort · Urja Pawar · William Bankes · David Krueger · Ekdeep Singh Lubana · Dmitrii Krasheninnikov

[ Project Page ]
Sat 19 Jul 10:40 a.m. PDT — 11:40 a.m. PDT

Abstract:

Monitoring is an important aspect of safely deploying Large Language Models (LLMs). This paper examines activation probes as a computationally efficient technique for detecting "high-stakes" interactions, i.e., those where the text indicates the interaction might lead to significant harm, a critical yet underexplored target for such monitoring. We train several novel probe architectures on synthetic data and find that they generalize robustly (mean AUROC > 0.91) to diverse, out-of-distribution, real-world data. Their performance is comparable to that of prompted or fine-tuned medium-sized LLM monitors, while offering computational savings of six orders of magnitude. Furthermore, this research establishes a foundation for building resource-aware monitoring systems in which probes serve as an initial, resource-efficient filter in a cascaded system, flagging cases for more specialized and expensive downstream analysis. Finally, we release our novel synthetic dataset and codebase to encourage further investigation.
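To make the general idea concrete, below is a minimal sketch of a linear activation probe used as a cheap first-stage monitor. This is not the authors' probe architecture or training setup; it assumes a small open model ("gpt2") as a stand-in for the monitored LLM, a hypothetical labeled set of interactions (`texts`, `labels`), and a simple logistic-regression probe on mean-pooled hidden states at one layer.

```python
# Illustrative sketch only: a linear probe on LLM activations.
# Assumptions (not from the paper): "gpt2" as the monitored model,
# a toy labeled dataset, mean-pooled hidden states at a single layer.
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")
model.eval()

def mean_pooled_activation(text: str, layer: int = 6) -> np.ndarray:
    """Return the mean hidden-state activation at a chosen layer."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**inputs, output_hidden_states=True).hidden_states[layer]
    return hidden.mean(dim=1).squeeze(0).numpy()

# Hypothetical toy data: 1 = high-stakes interaction, 0 = benign.
texts = [
    "Please confirm the dosage before we administer this medication.",
    "What's a good recipe for banana bread?",
]
labels = [1, 0]

# Fit a simple logistic-regression probe on the pooled activations.
X = np.stack([mean_pooled_activation(t) for t in texts])
probe = LogisticRegression(max_iter=1000).fit(X, labels)

# Probe scores can gate escalation to a more expensive LLM monitor
# in a cascaded monitoring system.
scores = probe.predict_proba(X)[:, 1]
print("AUROC:", roc_auc_score(labels, scores))
```

In a cascaded deployment, only interactions whose probe score exceeds a threshold would be forwarded to a slower, more capable monitor, which is where the claimed computational savings come from.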
