Poster in Workshop: Assessing World Models: Methods and Metrics for Evaluating Understanding
Measuring Belief Updates in Curious Agents
Joschka Strüber · Ilze Amanda Auzina · Shashwat Goel · Susanne Keller · Jonas Geiping · Ameya Pandurang Prabhu · Matthias Bethge
Keywords: [ agentic reasoning ] [ language models ] [ internal world model ] [ multi-turn reasoning ]
To effectively perform open-ended tasks, language models must identify gaps in their knowledge, take actions to acquire new information, and update their internal world models accordingly. This raises a key question: how can we assess whether their reasoning chains and multi-turn actions actually improve the beliefs in their internal world model? In this paper, we demonstrate a simple, scalable method for measuring belief updates by sequentially assessing the log-probabilities that a language model assigns to the true belief across multi-turn actions. We evaluate model belief updates on a multi-turn RL reasoning benchmark, 'Twenty Questions'. Our findings show that recent Qwen3 models struggle to update their beliefs, even when controlling for the quality of the generated questions. Through counterfactual experiments, we validate that finetuning teaches student models to perform coherent belief updates, which they could not do before. Intriguingly, we find that measuring model beliefs also enables detecting reward hacking in RL-trained models. Overall, we offer a novel perspective on measuring and understanding the intermediate beliefs of language models.
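As a rough illustration of the measurement described in the abstract, the sketch below tracks the total log-probability a causal language model assigns to the true answer after each turn of a Twenty Questions dialogue. It is not the authors' implementation: the model name, the elicitation prompt, the `answer_logprob` helper, and the example transcript are all illustrative assumptions.

```python
# Hypothetical sketch (not the paper's code): probe how the log-probability of
# the true belief evolves as more question-answer turns are appended.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-4B"  # assumed; any causal LM with a HF checkpoint works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

def answer_logprob(dialogue: str, answer: str) -> float:
    """Sum of token log-probs of `answer`, conditioned on the dialogue so far."""
    prompt = dialogue + "\nThe secret object is:"  # assumed elicitation prompt
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    answer_ids = tok(" " + answer, add_special_tokens=False, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, answer_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Predictions for the answer tokens come from the positions just before them.
    log_probs = torch.log_softmax(logits[0, prompt_ids.shape[1] - 1 : -1], dim=-1)
    return log_probs.gather(1, answer_ids[0].unsqueeze(1)).sum().item()

# Example belief trajectory over turns (transcript is illustrative).
true_answer = "umbrella"
turns = [
    "Q: Is it alive? A: No.",
    "Q: Is it used outdoors? A: Yes.",
    "Q: Does it protect you from rain? A: Yes.",
]
dialogue = "Let's play Twenty Questions."
for t, turn in enumerate(turns, 1):
    dialogue += "\n" + turn
    print(f"turn {t}: log p(true answer) = {answer_logprob(dialogue, true_answer):.2f}")
```

A coherent belief update would show this log-probability rising as informative answers accumulate; a flat or erratic trajectory suggests the model is asking questions without integrating the evidence.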