Contributed Talk in Workshop: Actionable Interpretability
Contributed Talk 4: Internal Causal Mechanisms Robustly Predict Language Model Out-of-Distribution Behaviors
Jing Huang
Monitoring is an important aspect of safely deploying Large Language Models (LLMs). This paper examines activation probes as a computationally efficient technique for detecting "high-stakes" interactions, i.e., those where the text indicates the interaction could lead to significant harm, a critical yet underexplored target for such monitoring. We train several novel probe architectures on synthetic data and find that they generalize robustly (mean AUROC > 0.91) to diverse, out-of-distribution, real-world data. Their performance is comparable to that of prompted or fine-tuned medium-sized LLM monitors, while offering computational savings of six orders of magnitude. Furthermore, this research establishes a foundation for resource-aware monitoring systems in which probes serve as a cheap first-stage filter in a cascade, flagging cases for more specialized and expensive downstream analysis. Finally, we release our novel synthetic dataset and codebase to encourage further investigation.
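As a concrete illustration of the pipeline described above, the sketch below trains a linear probe on stand-in LLM activations, evaluates it with AUROC, and uses a score threshold as the cheap first stage of a cascaded monitor. It is a minimal sketch under stated assumptions: the hidden size, the synthetic activation generator, and the 0.5 threshold are all illustrative, not the paper's released implementation.

```python
# Minimal sketch of an activation probe used as a first-stage monitor.
# d_model, make_split, and the 0.5 threshold are illustrative assumptions,
# not the paper's actual code or data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
d_model = 768  # hidden size of the monitored model (assumed)

# A "high-stakes" direction so the toy data is linearly separable.
direction = rng.normal(size=d_model)
direction /= np.linalg.norm(direction)

def make_split(n_examples: int):
    """Stand-in for pooled activations plus binary high-stakes labels.

    In practice the features would come from a forward pass of the
    monitored LLM over each interaction transcript.
    """
    y = rng.integers(0, 2, size=n_examples)
    X = rng.normal(size=(n_examples, d_model)) + 1.5 * y[:, None] * direction
    return X, y

X_train, y_train = make_split(2000)  # synthetic training split
X_test, y_test = make_split(500)     # held-out evaluation split

# The probe is just a regularized linear classifier over activations,
# so monitoring costs roughly one dot product per interaction.
probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)

# Evaluate with AUROC, the metric quoted in the abstract.
scores = probe.predict_proba(X_test)[:, 1]
print(f"AUROC: {roc_auc_score(y_test, scores):.3f}")

# Cascade: only interactions the cheap probe flags are escalated to a more
# expensive prompted or fine-tuned LLM monitor for downstream analysis.
threshold = 0.5  # assumed operating point
flagged = np.flatnonzero(scores >= threshold)
print(f"Escalating {flagged.size}/{len(scores)} interactions to the LLM monitor.")
```

The design choice the abstract emphasizes is reflected here: the per-interaction cost of the probe is a single linear scoring pass, so it can screen every interaction, while the costly LLM monitor only sees the flagged subset.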