Poster
in
Workshop: Actionable Interpretability
Steering Self-Evaluation: Interpreting LLM's Reasoning Across Domains and Languages
Praveen Hegde
Understanding and controlling the reasoning processes of large language models (LLMs) is crucial for their reliable deployment. In this work, we investigate the latent representation of self-evaluation behavior - the ability of a model to assess its own reasoning steps - a vital behavior for robust reasoning. Through targeted steering vector computation, we identify a direction within LLM activations that represents this self-evaluation behavior. Crucially, we demonstrate that this steering vector for self-evaluation exhibits remarkable cross-contextual efficacy, working well across different domains (e.g., math and medicine) and languages (e.g., English and Spanish). This suggests that the identified latent direction captures a fundamental, abstract representation of self-evaluation within the LLM's internal state, offering a promising avenue for interpretable and controllable reasoning across diverse applications.