Poster
in
Workshop: Assessing World Models: Methods and Metrics for Evaluating Understanding

WorldPrediction: A Benchmark for High-level World Modeling and Long-horizon Procedural Planning

Delong Chen · Willy Chung · Yejin Bang · Ziwei Ji · Pascale FUNG

Keywords: [ World Models ] [ Procedural Planning ] [ Benchmark ]


Abstract:

World models predict future world states resulting from actions, enabling AI agents to perform planning in diverse environments. We introduce WorldPrediction, a video-based benchmark for evaluating the world modeling and procedural planning capabilities of different models. In contrast to prior works that focus primarily on low-level world modeling and robotic motion planning, WorldPrediction is the first benchmark to emphasize actions with temporal and semantic abstraction. Given initial and final world states, the task is to distinguish the proper action (WorldPrediction-WM) or the properly ordered sequence of actions (WorldPrediction-PP) from a set of counterfactual distractors. To prevent models from exploiting low-level continuity cues in background scenes, we provide "action equivalents" – identical actions observed in different contexts – as candidates for selection. This benchmark is grounded in a formal framework of partially observable semi-MDPs, which ensures better reliability and robustness of the evaluation. We conduct extensive human filtering and validation on our benchmark and show that current frontier models barely achieve 57% accuracy on High-level World Modeling and 38% on Long-horizon Procedural Planning, whereas humans are able to solve both tasks perfectly.
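The evaluation protocol described above amounts to a multiple-choice selection task: the model scores each candidate action (or action sequence) against the given initial and final states, and it is credited when the correct candidate scores highest. A minimal sketch of such a scoring loop is shown below; the `Sample` structure and `score` callback are illustrative assumptions, not the benchmark's actual API.

```python
# Hypothetical sketch of a WorldPrediction-style multiple-choice evaluation:
# the model assigns a score to each candidate given the initial and final
# world states, and accuracy counts how often the correct candidate wins.
# `Sample` and the `score` signature are assumed for illustration only.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Sample:
    initial_state: str    # reference to the initial-state observation (e.g. video frames)
    final_state: str      # reference to the final-state observation
    candidates: List[str] # correct action (sequence) plus counterfactual distractors
    answer: int           # index of the correct candidate


def accuracy(samples: List[Sample],
             score: Callable[[str, str, str], float]) -> float:
    """Fraction of samples where the highest-scoring candidate is correct."""
    correct = 0
    for s in samples:
        scores = [score(s.initial_state, s.final_state, c) for c in s.candidates]
        predicted = max(range(len(scores)), key=scores.__getitem__)
        if predicted == s.answer:
            correct += 1
    return correct / len(samples)
```

Under this scheme, the reported numbers (57% for WorldPrediction-WM, 38% for WorldPrediction-PP) would be the `accuracy` of a frontier model's scoring function over the filtered benchmark samples.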