Workshop
Assessing World Models: Methods and Metrics for Evaluating Understanding
Keyon Vafa · Belinda Li · Kenneth Li · Michael Lepori · Hao Tang · Lionel Wong · Peter Chang
West Ballroom B
Fri 18 Jul, 8:45 a.m. PDT
Generative models across domains can produce outputs that appear to mimic the real world. But have these systems actually understood the laws that govern the world? Researchers across subfields are attempting to answer this question: in natural language processing, researchers measure whether LLMs understand real-world mechanisms in order to gauge how robust they are to new tasks; in video generation, researchers assess whether a model has understood the laws of physics in order to evaluate how realistic its videos are; in scientific domains, foundation models are being developed in order to uncover new theories about the world. Despite studying similar questions, these communities remain disparate.
This workshop will explore the question: how can we formalize and evaluate whether generative models have understood the real world? While this question is important across communities, we don’t have unified frameworks for defining and evaluating world models. This workshop will bring together these computer science communities along with non-computer-science scientists working on relevant applications.
Our invited speakers include Jacob Andreas, Shiry Ginosar, Shirley Ho, Sendhil Mullainathan, and Martin Wattenberg, all of whom have confirmed that they will speak in person.
Schedule
Fri 8:45 a.m. - 9:00 a.m. | Opening
Fri 9:00 a.m. - 9:40 a.m. | Spotlights
Fri 9:40 a.m. - 10:00 a.m. | Coffee Break
Fri 10:00 a.m. - 10:40 a.m. | Invited Talk 1 (Naomi Saphra: And Nothing Between - Using Categorical Differences to Understand and Predict Model Behavior)
Fri 10:40 a.m. - 11:20 a.m. | Invited Talk 2 (Shiry Ginosar: What Do Vision and Vision-Language Models Really Know About the World?)
Fri 11:20 a.m. - 12:00 p.m. | Invited Talk 3 (Jacob Andreas: Language Models as World Models?)
Fri 12:00 p.m. - 1:00 p.m. | Lunch Break
Fri 1:00 p.m. - 1:40 p.m. | Invited Talk 4 (Shirley Ho: Polymathic AI: Building Scientific Foundation Models)
Fri 1:40 p.m. - 2:20 p.m. | Invited Talk 5 (Sendhil Mullainathan: Testing for Understanding Requires First Defining It)
Fri 2:20 p.m. - 3:20 p.m. | Panel Discussion (Jacob Andreas, Jon Kleinberg, Mengye Ren, Alane Suhr)
Fri 3:20 p.m. - 3:45 p.m. | Coffee Break
Fri 3:45 p.m. - 5:00 p.m. | Poster Session
Fri 5:00 p.m. - 5:15 p.m. | Closing
Measuring Belief Updates in Curious Agents (Poster) | Joschka Strüber · Ilze Amanda Auzina · Shashwat Goel · Susanne Keller · Jonas Geiping · Ameya Pandurang Prabhu · Matthias Bethge
Eliminating Discriminative Shortcuts in Multiple Choice Evaluations with Answer Matching (Poster) | Nikhil Chandak · Shashwat Goel · Ameya Pandurang Prabhu · Moritz Hardt · Jonas Geiping
Open World Scene Graph Generation using Vision Language Models (Poster) | Amartya Dutta · Kazi Sajeed Mehrab · Medha Sawhney · Abhilash Neog · Mridul Khurana · Sepideh Fatemi · Aanish Pradhan · M. Maruf · Ismini Lourentzou · Arka Daw · Anuj Karpatne
Let’s Simulate Frame-by-Frame: In-Context Physical Simulations with Vision-Language Models (Poster) | YingQiao Wang · Eric Bigelow · Tomer Ullman
What if Othello-Playing Language Models Could See? (Poster) | Xinyi Chen · Yifei Yuan · Jiaang Li · Serge Belongie · Maarten de Rijke · Anders Søgaard
Unbounded Memory and Consistent Imagination via Unified Diffusion–SSM World Models (Poster) | Jia-Hua Lee · Bor Jiun Lin · Wei-Fang Sun · Chun-Yi Lee
FRED: Financial Retrieval-Enhanced Detection and Editing of Hallucinations in Language Models (Poster) | Likun Tan · Kuan-Wei Huang · Kevin Wu
Tracking World States with Language Models: State-Based Evaluation Using Chess (Poster) | Romain Harang · Jason Naradowsky · Yaswitha Gujju · Yusuke Miyao
HueManity: Probing Fine-Grained Visual Perception in MLLMs (Poster) | Rynaa Grover · Jayant Sravan Tamarapalli · Sahiti Yerramilli · Nilay Pande
On the Emergence of "Useless" Features in Next Token Predictors (Poster) | Mark Rofin · Jalal Naghiyev · Michael Hahn
Do Vision Language Models infer human intention without visual perspective-taking? Towards a scalable "One-Image-Probe-All" dataset (Poster) | Bingyang Wang · Yijiang Li · Qingyang Zhou · Hui Yi Leong · Tianwei Zhao · Letian Ye · Hokin Deng · Dezhi Luo · Nuno Vasconcelos
Leveraging the Sequential Nature of Language for Interpretability (Poster) | Usha Bhalla · Alex Oesterling · Claudio Mayrink Verdun · Flavio Calmon · Himabindu Lakkaraju
Evaluating Self-Orienting in Language and Reasoning Models (Poster) | Eric Bigelow · Zergham Ahmed · Tomer Ullman
Probing the Limits of Mathematical World Models in LLMs (Poster) | Henry Kvinge · Elizabeth Coda · Eric Yeats · Davis Brown · John Buckheit · Sarah Scullen · Brendan Kennedy · Loc Truong · William Kay · Cliff Joslyn · Tegan Emerson · Michael Henry · John Emanuello
ReviseQA: A Benchmark for Belief Revision in Multi-Turn Logical Reasoning (Poster) | Chadi Helwe · Sultan AlRashed · Francesco Orabona
Virtue Semantics: Probing the Consistency of Moral Values of Large Language Models (Poster) | Em Smullen · Srihari Thirumaligai · Anna Leshinskaya
I Have No Mouth, and I Must Rhyme: Uncovering Internal Phonetic Representations in LLaMA 3.2 (Poster) | Oliver McLaughlin · Jack Merullo · Arjun Khurana
World Models and Consistent Mistakes in LLMs (Poster) | Christopher Wolfram · Aaron Schein
GeoChain: Multimodal Chain-of-Thought for Geographic Reasoning (Poster) | Sahiti Yerramilli · Nilay Pande · Jayant Sravan Tamarapalli · Rynaa Grover
RMA: Reward Model Alignment with Human preference (Poster) | Ashish Gupta · Manjunatha Naik
Uncertainty Quantification for LLM-Based Survey Simulations (Poster) | Chengpiao Huang · Yuhang Wu · Kaizheng Wang
Aquilon: Towards Building Multimodal Weather LLMs (Poster) | Sumanth Varambally · Veeramakali Vignesh Manivannan · Yasaman Jafari · Luyu Han · Zachary Novack · Zhirui Xia · Salva Ruhling Cachay · Srikar Eranky · Brooks (Ruijia) Niu · Taylor Berg-Kirkpatrick · Duncan Watson-Parris · Yian Ma · Rose Yu
Measuring Rule-Following in Language Models (Poster) | Benjamin Laufer · Jon Kleinberg
Are LLM Belief Updates Consistent with Bayes’ Theorem? (Poster) | Sohaib Imran · Ihor Kendiukhov · Matthew Broerman · Aditya Thomas · Riccardo Campanella · Rob Lamb · Peter Atkinson
Newfluence: Boosting Model Interpretability and Understanding in High Dimensions (Poster) | Haolin Zou · Arnab Auddy · Yongchan Kwon · Kamiar Rad · Arian Maleki
Adapting Vision-Language Models for Evaluating World Models (Poster) | Mariya Hendriksen · Tabish Rashid · David Bignell · Raluca Georgescu · Abdelhak Lemkhenter · Katja Hofmann · Sam Devlin · Sarah Parisot
Deep Koopman operator framework for causal discovery in nonlinear dynamical systems (Poster) | Juan Nathaniel · Carla Roesch · Jatan Buch · Derek DeSantis · Adam Rupe · Kara Lamb · Pierre Gentine
Evaluating Forecasting is More Difficult than Other LLM Evaluations (Poster) | Daniel Paleka · Shashwat Goel · Jonas Geiping · Florian Tramer
MET-Bench: Multimodal Entity Tracking for Evaluating the Limitations of Vision-Language and Reasoning Models (Poster) | Vanya Cohen · Ray Mooney
WorldPrediction: A Benchmark for High-level World Modeling and Long-horizon Procedural Planning (Poster) | Delong Chen · Willy Chung · Yejin Bang · Ziwei Ji · Pascale Fung
Beyond Behavioural Evaluations for Assessing World Models (Poster) | Kola Ayonrinde
Understanding Large Language Models' Ability on Interdisciplinary Research (Poster) | Yuanhao Shen · Daniel de Sousa · Ricardo de Andrade Nascimento · Ali Asad · Hongyu Guo · Xiaodan Zhu
Cards Against Contamination: TCG-Bench for Difficulty-Scalable Multilingual LLM Reasoning (Poster) | Sultan AlRashed · Jianghui Wang · Francesco Orabona
Testing LLM Understanding of Scientific Literature through Expert-Driven Question Answering: Insights from High-Temperature Superconductivity (Poster) | Haoyu Guo · Maria Tikhanovskaya · Paul Raccuglia · Alexey Vlaskin · Christopher Co · Daniel Liebling · Scott Ellsworth · Matthew Abraham · Elizabeth Dorfman · N.P. Armitage · John Tranquada · Senthil Todadri · Antoine Georges · Subir Sachdev · Steven Kivelson · B. Ramshaw · Dominik Kiese · Chunhan Feng · Olivier Gingras · Vadim Oganesyan · Michael Brenner · Subhashini Venugopalan · Eun-Ah Kim
APOD: Adaptive PDE-Observation Diffusion for Physics-Constrained Sampling (Poster) | Ruichen Xu · Haochun Wang · Georgios Kementzidis · Chenhao Si · Yuefan Deng
Contextual Effects in LLM and Human Causal Reasoning (Poster) | Zach Studdiford · Gary Lupyan