Poster in Workshop: Actionable Interpretability
Do LLMs Lie About What They Use? Benchmark for Metacognitive Truthfulness in Large Language Models
Nhi Nguyen · Shauli Ravfogel · Rajesh Ranganath
As large language models (LLMs) are increasingly used in critical applications, so too is the practice of using LLMs to explain their own outputs in order to monitor model behavior and build user trust. For such explanations to be useful, they must reflect the model's internal computation, a property we call metacognitive truthfulness. However, current benchmarks fall short in measuring truthfulness in LLMs: they rely on human rationales, use potentially untruthful LLMs as judges, or are limited to contrived prompting setups. We introduce the Truthfulness Evaluation of Large Language models' Metacognitive Explanations (TELLME) benchmark, which directly measures metacognitive truthfulness in LLMs by grounding truthfulness evaluation in the formal objective of feature attribution. Our approach generalizes insertion-based tests to feature attribution over arbitrary contexts, providing a dataset-agnostic and model-agnostic evaluation of truthfulness in LLMs. Results across multiple open-source LLMs and public datasets show that many models fail to provide truthful explanations of their inner computations. While larger models are more accurate, they are not more truthful. Moreover, no single factor, including model size, model family, and last hidden states, consistently predicts explanation truthfulness. These findings highlight the challenges of improving truthfulness in LLMs and position our benchmark as a practical tool for evaluating and improving metacognitive truthfulness in LLM systems.
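For readers unfamiliar with insertion-based tests, the sketch below illustrates the general idea rather than the TELLME procedure itself: tokens are revealed to the model in decreasing order of attributed importance, and the area under the resulting probability curve measures how well the attribution ranking tracks what the model actually uses. The `predict_proba` callable and the `mask_token` default are hypothetical stand-ins introduced for illustration, not part of the benchmark.

```python
import numpy as np

def insertion_curve(tokens, attributions, predict_proba, mask_token="[MASK]"):
    """Generic insertion test for a feature-attribution ranking.

    tokens        : list of input tokens for one example
    attributions  : one importance score per token (any attribution method)
    predict_proba : hypothetical callable mapping a token list to the model's
                    probability for its original prediction
    Returns the area under the insertion curve in [0, 1]; higher values mean
    the highly attributed tokens drive more of the model's output.
    """
    order = np.argsort(-np.asarray(attributions))   # most important first
    revealed = [mask_token] * len(tokens)
    scores = [predict_proba(revealed)]              # fully masked baseline
    for idx in order:
        revealed[idx] = tokens[idx]                 # reveal next-ranked token
        scores.append(predict_proba(revealed))
    return float(np.trapz(scores, dx=1.0 / len(tokens)))
```

Under this kind of test, an explanation is judged by whether its ranking actually recovers the model's output when the claimed-important tokens are inserted, which is the sense in which the evaluation is dataset-agnostic and model-agnostic.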