Poster in Workshop: Actionable Interpretability
Internal Causal Mechanisms Robustly Predict Language Model Out-of-Distribution Behaviors
Jing Huang · Junyi Tao · Thomas Icard · Diyi Yang · Christopher Potts
Interpretability research now offers a variety of techniques for identifying abstract internal mechanisms in neural networks. Can such techniques be used to predict how models will behave on out-of-distribution examples? In this work, we provide a positive answer to this question. Through a diverse set of language modeling tasks—including symbol manipulation, knowledge retrieval, and instruction following—we show that the most robust features for correctness prediction are those that play a distinctive causal role in the model's behavior. Specifically, we propose two methods that leverage causal mechanisms to predict the correctness of model outputs: counterfactual simulation (checking whether key causal variables are realized) and value probing (using the values of those variables to make predictions). Both achieve high AUC-ROC in distribution and outperform methods that rely on causal-agnostic features in out-of-distribution settings, where predicting model behaviors is more crucial. Our work thus highlights a novel and significant application for internal causal analysis of language models.
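As a rough illustration of the value-probing variant described above, the sketch below fits a linear probe on activations taken at the site of a hypothesized causal variable and uses its read-out to score correctness on a distribution-shifted split. The activations here are synthetic, and the array names, dimensions, and data-generating process are assumptions for illustration only, not the authors' implementation.

```python
# Minimal sketch of value probing, assuming hidden activations have already
# been extracted at the layer/position where a hypothesized causal variable
# lives. All data below is synthetic and invented for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
hidden_dim = 64

# Synthetic stand-ins for activations captured at the causal variable's site
# (rows = examples) and binary labels marking whether the model's final
# answer was correct on each example.
H_train = rng.normal(size=(1000, hidden_dim))
y_train = (H_train[:, 0] > 0).astype(int)  # correctness tied to one direction

# A distribution-shifted evaluation split: nuisance dimensions move, but the
# direction carrying the causal variable keeps its link to correctness.
H_ood = rng.normal(loc=2.0, size=(500, hidden_dim))
H_ood[:, 0] = rng.normal(size=500)
y_ood = (H_ood[:, 0] > 0).astype(int)

# Fit a linear probe on in-distribution data, then score OOD examples.
probe = LogisticRegression(max_iter=1000).fit(H_train, y_train)
ood_scores = probe.predict_proba(H_ood)[:, 1]
print("OOD AUC-ROC:", roc_auc_score(y_ood, ood_scores))
```

A counterfactual-simulation predictor would instead intervene on the same internal site (e.g., by patching in the variable's value from another run) and check whether the model's output changes as the hypothesized causal model predicts; the probing sketch is shown only because it needs no model forward pass.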