

Poster in Workshop: Actionable Interpretability

Internal Causal Mechanisms Robustly Predict Language Model Out-of-Distribution Behaviors

Jing Huang · Junyi Tao · Thomas Icard · Diyi Yang · Christopher Potts

[ Project Page ]
Sat 19 Jul 10:40 a.m. PDT — 11:40 a.m. PDT

Abstract:

Interpretability research now offers a variety of techniques for identifying abstract internal mechanisms in neural networks. Can such techniques be used to predict how models will behave on out-of-distribution examples? In this work, we provide a positive answer to this question. Through a diverse set of language modeling tasks (including symbol manipulation, knowledge retrieval, and instruction following), we show that the most robust features for correctness prediction are those that play a distinctive causal role in the model's behavior. Specifically, we propose two methods that leverage causal mechanisms to predict the correctness of model outputs: counterfactual simulation (checking whether key causal variables are realized) and value probing (using the values of those variables to make predictions). Both achieve high AUC-ROC in distribution and outperform methods that rely on causal-agnostic features in out-of-distribution settings, where predicting model behaviors is more crucial. Our work thus highlights a novel and significant application for internal causal analysis of language models.
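Below is a minimal sketch of the value-probing idea described in the abstract, not the authors' implementation: it assumes activations have already been cached at the layer and position where a causal variable was localized, and that the variable's ground-truth values and the model's per-example correctness are known. All data, shapes, and variable names are illustrative placeholders.

```python
# Hedged sketch of value probing for correctness prediction.
# Assumption: activations are cached at the site of a localized causal
# variable; random arrays stand in for real cached data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
hidden_dim, n_train, n_ood = 768, 500, 200

# Hypothetical cached data (placeholders for illustration only).
acts_train = rng.normal(size=(n_train, hidden_dim))   # activations at the variable's site
var_train = rng.integers(0, 10, size=n_train)         # causal variable's value on each input
acts_ood = rng.normal(size=(n_ood, hidden_dim))       # out-of-distribution activations
expected_ood = rng.integers(0, 10, size=n_ood)        # value the correct computation requires
model_correct_ood = rng.integers(0, 2, size=n_ood)    # whether the model answered correctly

# 1. Train a probe that reads the causal variable's value from the activations.
probe = LogisticRegression(max_iter=1000).fit(acts_train, var_train)

# 2. Predict correctness: the output is expected to be right exactly when the
#    decoded variable matches the value the task requires.
predicted_correct = (probe.predict(acts_ood) == expected_ood).astype(float)
print("OOD AUC-ROC:", roc_auc_score(model_correct_ood, predicted_correct))
```

Counterfactual simulation follows the same logic but checks whether intervening on the model's internals to set the variable produces the behavior the high-level causal model predicts, rather than reading the value out with a probe.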
