Poster in Workshop: Actionable Interpretability
A Single Direction of Truth: An Observer Model’s Linear Residual Probe Exposes and Steers Contextual Hallucinations
Charles O'Neill · Sviatoslav Chalnev · Rune Chi Zhao · Max Kirkby · Mudith Jayasekara
Abstract:
Contextual hallucinations (statements unsupported by the given context) remain a significant challenge for large language models. We demonstrate a practical interpretability insight: a generator-agnostic observer model detects hallucinations via a single forward pass and a linear probe on its residual stream. This probe isolates a single, transferable linear direction separating hallucinated from faithful text, outperforming baselines by 5–27 points and showing robust mid-layer performance across Gemma-2 models (2B to 27B). Gradient-times-activation localises this signal to sparse, late-layer MLP activity. Critically, manipulating this direction causally steers generator hallucination rates, proving its actionability. Our results offer novel evidence of internal, low-dimensional hallucination tracking linked to specific MLP sub-circuits, exploitable for detection and mitigation. We release the 2,000-example ContraTales benchmark for realistic assessment of such solutions.
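To make the core idea concrete, below is a minimal sketch of the kind of residual-stream probe the abstract describes: an observer model reads the context plus a candidate answer, a mid-layer residual activation is pooled over the answer tokens, and a logistic-regression probe learns a single linear direction separating hallucinated from faithful text. This is not the authors' released code; the checkpoint name, layer index, pooling choice, and helper names are illustrative assumptions.

```python
# Hedged sketch: linear probe on an observer model's mid-layer residual stream.
# Assumptions (not from the paper's code release): checkpoint name, LAYER index,
# mean-pooling over answer tokens, and logistic regression as the probe.
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "google/gemma-2-2b"   # observer model; any Gemma-2 size could be swapped in
LAYER = 13                    # an arbitrary mid layer; the paper reports robust mid-layer performance

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, output_hidden_states=True)
model.eval()

@torch.no_grad()
def residual_feature(context: str, answer: str) -> np.ndarray:
    """Mean-pooled residual-stream activation at LAYER over the answer tokens."""
    text = context + "\n" + answer
    inputs = tok(text, return_tensors="pt", truncation=True)
    hidden = model(**inputs).hidden_states[LAYER][0]        # (seq_len, d_model)
    n_ans = len(tok(answer, add_special_tokens=False)["input_ids"])
    return hidden[-n_ans:].mean(dim=0).float().numpy()      # pool over answer span

def fit_probe(examples):
    """examples: list of (context, answer, label) with label 1 = hallucinated, 0 = faithful."""
    X = np.stack([residual_feature(c, a) for c, a, _ in examples])
    y = np.array([lbl for _, _, lbl in examples])
    probe = LogisticRegression(max_iter=1000).fit(X, y)
    # probe.coef_[0] is the single linear direction in residual space that the
    # abstract describes as separating hallucinated from faithful text.
    return probe
```

A steering experiment in the spirit of the abstract would then add or subtract a scaled copy of that learned direction to the generator's residual stream at the corresponding layer and measure the change in hallucination rate; the scaling coefficient and injection layer would need to be tuned empirically.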