

Poster in Workshop: The 2nd Workshop on Reliable and Responsible Foundation Models

Simple Mechanistic Explanations for Out-Of-Context Reasoning

Atticus Wang · Josh Engels · Oliver Clive-Griffin

Keywords: [ model diffing ] [ steering vectors ] [ mechanistic interpretability ] [ out-of-context reasoning ]


Abstract:

Recently, it was observed that large language models (LLMs) can exhibit a surprising phenomenon called out-of-context reasoning (OOCR), a form of out-of-distribution generalisation. In OOCR, a fine-tuned LLM implicitly combines observations scattered throughout its fine-tuning data at inference time, even though that information does not appear explicitly in the prompt or chain of thought. In this work, we investigate OOCR mechanistically and find a simple explanation in many cases: the LoRA fine-tuned model effectively learns to add a constant steering vector, and this vector often promotes interpretable tokens when viewed through the unembedding matrix. Moreover, directly training steering vectors for these tasks also induces OOCR. We also find that even for a task that seems to require conditional behavior (model backdoors), unconditional steering vectors work surprisingly well. Overall, our work explains what is learned during fine-tuning for OOCR tasks, contributing to the key question of why LLMs can reason out of context, an advanced capability that is highly relevant for their safe and reliable deployment.
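
The sketch below illustrates the kind of analysis the abstract describes; it is not the authors' code. It approximates the fine-tuned model's effect as a constant steering vector by averaging the residual-stream difference between a base and a LoRA fine-tuned model, then views that vector through the unembedding matrix to see which tokens it promotes. The model names, layer index, and prompts are placeholders.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder base model
TUNED = "path/to/lora-finetuned-model"     # placeholder fine-tuned checkpoint
LAYER = 16                                 # placeholder residual-stream layer

tok = AutoTokenizer.from_pretrained(BASE)
base = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16)
tuned = AutoModelForCausalLM.from_pretrained(TUNED, torch_dtype=torch.bfloat16)

prompts = ["What city is the company based in?"]  # placeholder eval prompts


def mean_layer_act(model, texts, layer):
    """Mean residual-stream activation at `layer`, averaged over prompt tokens."""
    acts = []
    for text in texts:
        ids = tok(text, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**ids, output_hidden_states=True).hidden_states[layer]
        acts.append(hidden[0].mean(dim=0))
    return torch.stack(acts).mean(dim=0)


# Constant "steering vector": average activation shift induced by fine-tuning.
steer = mean_layer_act(tuned, prompts, LAYER) - mean_layer_act(base, prompts, LAYER)

# View the vector through the unembedding: which tokens does it push up?
token_logits = base.lm_head.weight.float() @ steer.float()  # shape: (vocab,)
top_ids = torch.topk(token_logits, k=20).indices
print(tok.convert_ids_to_tokens(top_ids.tolist()))
```

In this sketch, interpretable top tokens (e.g. tokens related to the latent fact the fine-tuning data implies) would be the kind of evidence the abstract refers to when it says the learned vector "promotes interpretable tokens when viewed through the unembedding matrix."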
