Assessing World Models: Methods and Metrics for Evaluating Understanding

Workshop


Keyon Vafa · Belinda Li · Kenneth Li · Michael Lepori · Hao Tang · Lionel Wong · Peter Chang

West Ballroom B

Fri 18 Jul, 8:45 a.m. PDT


Generative models across domains are capable of producing outputs that appear to mimic the real world. But have these systems actually understood the laws that govern the world? Researchers across subfields are attempting to answer this question: in natural language processing, researchers measure whether LLMs understand real-world mechanisms in order to measure how robust they are to new tasks; in video generation, researchers assess whether a model has understood the laws of physics in order to evaluate how realistic its videos are; in scientific domains, foundation models are being developed in order to uncover new theories about the world. Despite studying similar questions, these communities remain disparate. This workshop will explore the question: how can we formalize and evaluate whether generative models have understood the real world? While this question is important across communities, we don’t have unified frameworks for defining and evaluating world models. This workshop will bring together these computer science communities along with non-computer-science scientists working on relevant applications.

Our invited speakers include Jacob Andreas, Shiry Ginosar, Shirley Ho, Sendhil Mullainathan, and Martin Wattenberg, all of whom have confirmed that they will speak and can attend in person.


Timezone: America/Los_Angeles

Schedule

Fri 8:45 a.m. - 9:00 a.m.	Opening
Fri 9:00 a.m. - 9:40 a.m.	Spotlights
Fri 9:40 a.m. - 10:00 a.m.	Coffee Break
Fri 10:00 a.m. - 10:40 a.m.	Invited Talk 1 (Naomi Saphra: And Nothing Between - Using Categorical Differences to Understand and Predict Model Behavior)
Fri 10:40 a.m. - 11:20 a.m.	Invited Talk 2 (Shiry Ginosar: What Do Vision and Vision-Language Models Really Know About the World?)
Fri 11:20 a.m. - 12:00 p.m.	Invited Talk 3 (Jacob Andreas: Language Models as World Models?)
Fri 12:00 p.m. - 1:00 p.m.	Lunch Break
Fri 1:00 p.m. - 1:40 p.m.	Invited Talk 4 (Shirley Ho: Polymathic AI: Building Scientific Foundation Models)
Fri 1:40 p.m. - 2:20 p.m.	Invited Talk 5 (Sendhil Mullainathan: Testing for Understanding Requires First Defining It)
Fri 2:20 p.m. - 3:20 p.m.	Panel Discussion (Jacob Andreas, Jon Kleinberg, Mengye Ren, Alane Suhr)
Fri 3:20 p.m. - 3:45 p.m.	Coffee Break
Fri 3:45 p.m. - 5:00 p.m.	Poster Session
Fri 5:00 p.m. - 5:15 p.m.	Closing
-	Measuring Belief Updates in Curious Agents (Poster)	Joschka Strüber · Ilze Amanda Auzina · Shashwat Goel · Susanne Keller · Jonas Geiping · Ameya Pandurang Prabhu · Matthias Bethge
-	Eliminating Discriminative Shortcuts in Multiple Choice Evaluations with Answer Matching (Poster)	Nikhil Chandak · Shashwat Goel · Ameya Pandurang Prabhu · Moritz Hardt · Jonas Geiping
-	Open World Scene Graph Generation using Vision Language Models (Poster)	Amartya Dutta · Kazi Sajeed Mehrab · Medha Sawhney · Abhilash Neog · Mridul Khurana · Sepideh Fatemi · Aanish Pradhan · M. Maruf · Ismini Lourentzou · Arka Daw · Anuj Karpatne
-	Let’s Simulate Frame-by-Frame: In-Context Physical Simulations with Vision-Language Models (Poster)	YingQiao Wang · Eric Bigelow · Tomer Ullman
-	What if Othello-Playing Language Models Could See? (Poster)	Xinyi Chen · Yifei Yuan · Jiaang Li · Serge Belongie · Maarten de Rijke · Anders Søgaard
-	Unbounded Memory and Consistent Imagination via Unified Diffusion–SSM World Models (Poster)	Jia-Hua Lee · Bor Jiun Lin · Wei-Fang Sun · Chun-Yi Lee
-	FRED: Financial Retrieval-Enhanced Detection and Editing of Hallucinations in Language Models (Poster)	Likun Tan · Kuan-Wei Huang · Kevin Wu
-	Tracking World States with Language Models: State-Based Evaluation Using Chess (Poster)	Romain Harang · Jason Naradowsky · Yaswitha Gujju · Yusuke Miyao
-	HueManity: Probing Fine-Grained Visual Perception in MLLMs (Poster)	Rynaa Grover · Jayant Sravan Tamarapalli · Sahiti Yerramilli · Nilay Pande
-	On the Emergence of "Useless" Features in Next Token Predictors (Poster)	Mark Rofin · Jalal Naghiyev · Michael Hahn
-	Do Vision Language Models infer human intention without visual perspective-taking? Towards a scalable "One-Image-Probe-All" dataset (Poster)	Bingyang Wang · Yijiang Li · Qingyang Zhou · Hui Yi Leong · Tianwei Zhao · Letian Ye · Hokin Deng · Dezhi Luo · Nuno Vasconcelos
-	Leveraging the Sequential Nature of Language for Interpretability (Poster)	Usha Bhalla · Alex Oesterling · Claudio Mayrink Verdun · Flavio Calmon · Himabindu Lakkaraju
-	Evaluating Self-Orienting in Language and Reasoning Models (Poster)	Eric Bigelow · Zergham Ahmed · Tomer Ullman
-	Probing the Limits of Mathematical World Models in LLMs (Poster)	Henry Kvinge · Elizabeth Coda · Eric Yeats · Davis Brown · John Buckheit · Sarah Scullen · Brendan Kennedy · Loc Truong · William Kay · Cliff Joslyn · Tegan Emerson · Michael Henry · John Emanuello
-	ReviseQA: A Benchmark for Belief Revision in Multi-Turn Logical Reasoning (Poster)	Chadi Helwe · Sultan AlRashed · Francesco Orabona
-	Virtue Semantics: Probing the Consistency of Moral Values of Large Language Models (Poster)	Em Smullen · Srihari Thirumaligai · Anna Leshinskaya
-	I Have No Mouth, and I Must Rhyme: Uncovering Internal Phonetic Representations in LLaMA 3.2 (Poster)	Oliver McLaughlin · Jack Merullo · Arjun Khurana
-	World Models and Consistent Mistakes in LLMs (Poster)	Christopher Wolfram · Aaron Schein
-	GeoChain: Multimodal Chain-of-Thought for Geographic Reasoning (Poster)	Sahiti Yerramilli · Nilay Pande · Jayant Sravan Tamarapalli · Rynaa Grover
-	RMA: Reward Model Alignment with Human preference (Poster)	Ashish Gupta · Manjunatha Naik
-	Uncertainty Quantification for LLM-Based Survey Simulations (Poster)	Chengpiao Huang · Yuhang Wu · Kaizheng Wang
-	Aquilon: Towards Building Multimodal Weather LLMs (Poster)	Sumanth Varambally · Veeramakali Vignesh Manivannan · Yasaman Jafari · Luyu Han · Zachary Novack · Zhirui Xia · Salva Ruhling Cachay · Srikar Eranky · Brooks(Ruijia) Niu · Taylor Berg-Kirkpatrick · Duncan Watson-Parris · Yian Ma · Rose Yu
-	Measuring Rule-Following in Language Models (Poster)	Benjamin Laufer · Jon Kleinberg
-	Are LLM Belief Updates Consistent with Bayes’ Theorem? (Poster)	Sohaib Imran · Ihor Kendiukhov · Matthew Broerman · Aditya Thomas · Riccardo Campanella · Rob Lamb · Peter Atkinson
-	Newfluence: Boosting Model Interpretability and Understanding in High Dimensions (Poster)	Haolin Zou · Arnab Auddy · Yongchan Kwon · Kamiar Rad · Arian Maleki
-	Adapting Vision-Language Models for Evaluating World Models (Poster)	Mariya Hendriksen · Tabish Rashid · David Bignell · Raluca Georgescu · Abdelhak Lemkhenter · Katja Hofmann · Sam Devlin · Sarah Parisot
-	Deep Koopman operator framework for causal discovery in nonlinear dynamical systems (Poster)	Juan Nathaniel · Carla Roesch · Jatan Buch · Derek DeSantis · Adam Rupe · Kara Lamb · Pierre Gentine
-	Evaluating Forecasting is More Difficult than Other LLM Evaluations (Poster)	Daniel Paleka · Shashwat Goel · Jonas Geiping · Florian Tramer
-	MET-Bench: Multimodal Entity Tracking for Evaluating the Limitations of Vision-Language and Reasoning Models (Poster)	Vanya Cohen · Ray Mooney
-	WorldPrediction: A Benchmark for High-level World Modeling and Long-horizon Procedural Planning (Poster)	Delong Chen · Willy Chung · Yejin Bang · Ziwei Ji · Pascale FUNG
-	Beyond Behavioural Evaluations for Assessing World Models (Poster)	Kola Ayonrinde
-	Understanding Large Language Models' Ability on Interdisciplinary Research (Poster)	Yuanhao Shen · Daniel de Sousa · Ricardo de Andrade Nascimento · Ali Asad · Hongyu Guo · Xiaodan Zhu
-	Cards Against Contamination: TCG-Bench for Difficulty-Scalable Multilingual LLM Reasoning (Poster)	Sultan AlRashed · Jianghui Wang · Francesco Orabona
-	Testing LLM Understanding of Scientific Literature through Expert-Driven Question Answering: Insights from High-Temperature Superconductivity (Poster)	Haoyu Guo · Maria Tikhanovskaya · Paul Raccuglia · Alexey Vlaskin · Christopher Co · Daniel Liebling · Scott Ellsworth · Matthew Abraham · Elizabeth Dorfman · N.P. Armitage · John Tranquada · Senthil Todadri · Antoine Georges · Subir Sachdev · Steven Kivelson · B. Ramshaw · Dominik Kiese · Chunhan Feng · Olivier Gingras · Vadim Oganesyan · Michael Brenner · Subhashini Venugopalan · Eun-Ah Kim
-	APOD: Adaptive PDE-Observation Diffusion for Physics-Constrained Sampling (Poster)	Ruichen Xu · Haochun Wang · Georgios Kementzidis · Chenhao Si · Yuefan Deng
-	Contextual Effects in LLM and Human Causal Reasoning (Poster)	Zach Studdiford · Gary Lupyan