Contributed Talk in Workshop: Actionable Interpretability
Contributed Talk 4: Internal Causal Mechanisms Robustly Predict Language Model Out-of-Distribution Behaviors
Jing Huang
Monitoring is an important aspect of safely deploying Large Language Models (LLMs). This paper examines activation probes as a computationally efficient technique for detecting "high-stakes" interactions, i.e., those where the text indicates the interaction could lead to significant harm, a critical yet underexplored target for such monitoring. We train several novel probe architectures on synthetic data and find that they generalize robustly (mean AUROC > 0.91) to diverse, out-of-distribution, real-world data. Their performance is comparable to that of prompted or fine-tuned medium-sized LLM monitors, while offering computational savings of six orders of magnitude. Furthermore, this research establishes a foundation for resource-aware monitoring systems in which probes serve as a cheap first-stage filter in a cascade, flagging cases for more specialized and expensive downstream analysis. Finally, we release our novel synthetic dataset and codebase to encourage further investigation.
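As a concrete illustration of the pipeline described above, the sketch below trains a linear probe on stand-in LLM activations, evaluates it with AUROC, and uses a score threshold as the cheap first stage of a cascaded monitor. It is a minimal sketch under stated assumptions: the hidden size, the synthetic activation generator, and the 0.5 threshold are all illustrative, not the paper's released implementation.

```python
# Minimal sketch of an activation probe used as a first-stage monitor.
# d_model, make_split, and the 0.5 threshold are illustrative assumptions,
# not the paper's actual code or data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
d_model = 768  # hidden size of the monitored model (assumed)

# A "high-stakes" direction so the toy data is linearly separable.
direction = rng.normal(size=d_model)
direction /= np.linalg.norm(direction)

def make_split(n_examples: int):
    """Stand-in for pooled activations plus binary high-stakes labels.

    In practice the features would come from a forward pass of the
    monitored LLM over each interaction transcript.
    """
    y = rng.integers(0, 2, size=n_examples)
    X = rng.normal(size=(n_examples, d_model)) + 1.5 * y[:, None] * direction
    return X, y

X_train, y_train = make_split(2000)  # synthetic training split
X_test, y_test = make_split(500)     # held-out evaluation split

# The probe is just a regularized linear classifier over activations,
# so monitoring costs roughly one dot product per interaction.
probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)

# Evaluate with AUROC, the metric quoted in the abstract.
scores = probe.predict_proba(X_test)[:, 1]
print(f"AUROC: {roc_auc_score(y_test, scores):.3f}")

# Cascade: only interactions the cheap probe flags are escalated to a more
# expensive prompted or fine-tuned LLM monitor for downstream analysis.
threshold = 0.5  # assumed operating point
flagged = np.flatnonzero(scores >= threshold)
print(f"Escalating {flagged.size}/{len(scores)} interactions to the LLM monitor.")
```

The design choice the abstract emphasizes is reflected here: the per-interaction cost of the probe is a single linear scoring pass, so it can screen every interaction, while the costly LLM monitor only sees the flagged subset.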