

Poster

DINO-WM: World Models on Pre-trained Visual Features enable Zero-shot Planning

Gaoyue Zhou · Hengkai Pan · Yann LeCun · Lerrel Pinto

West Exhibition Hall B2-B3 #W-411
[ Project Page ]
Thu 17 Jul 4:30 p.m. PDT — 7 p.m. PDT

Abstract:

The ability to predict future outcomes given control actions is fundamental for physical reasoning. However, such predictive models, often called world models, remain challenging to learn and are typically developed for task-specific solutions with online policy learning. To unlock world models' true potential, we argue that they should 1) be trainable on offline, pre-collected trajectories, 2) support test-time behavior optimization, and 3) facilitate task-agnostic reasoning. To this end, we present DINO World Model (DINO-WM), a new method to model visual dynamics without reconstructing the visual world. DINO-WM leverages spatial patch features pre-trained with DINOv2, enabling it to learn from offline behavioral trajectories by predicting future patch features. This allows DINO-WM to achieve observational goals through action sequence optimization, facilitating task-agnostic planning by treating goal features as prediction targets. We demonstrate that DINO-WM achieves zero-shot behavioral solutions at test time on six environments without expert demonstrations, reward modeling, or pre-learned inverse models, outperforming prior state-of-the-art work across diverse task families such as arbitrarily configured mazes, push manipulation with varied object shapes, and multi-particle scenarios.
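The core recipe described above can be sketched in a few lines. The snippet below is an illustrative reimplementation, not the authors' released code: it encodes frames with a frozen DINOv2 encoder and trains a transition model to predict the next frame's patch features from the current features and the action. The `LatentDynamics` architecture, feature dimensions, action dimension, and hyperparameters are assumptions for the sake of the example; the `torch.hub` entry point and `forward_features` call come from the public DINOv2 repository.

```python
# Minimal sketch of the DINO-WM training idea (illustrative, not the paper's code).
import torch
import torch.nn as nn

class LatentDynamics(nn.Module):
    """Predicts next-step DINOv2 patch features from current features + action."""
    def __init__(self, feat_dim: int, action_dim: int, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim + action_dim, hidden), nn.GELU(),
            nn.Linear(hidden, feat_dim),
        )

    def forward(self, feats: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        # feats: (B, P, D) patch features; action: (B, A) control input.
        a = action[:, None, :].expand(-1, feats.size(1), -1)
        return feats + self.net(torch.cat([feats, a], dim=-1))  # residual update

# Frozen pre-trained encoder (torch.hub entry point from the DINOv2 repo).
encoder = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14").eval()
for p in encoder.parameters():
    p.requires_grad_(False)

def encode(frames: torch.Tensor) -> torch.Tensor:
    # frames: (B, 3, H, W) with H, W divisible by the 14-pixel patch size.
    with torch.no_grad():
        return encoder.forward_features(frames)["x_norm_patchtokens"]  # (B, P, 384)

# Training step on an offline batch: regress predicted features onto the
# encoder's features for the actually observed next frame (teacher forcing).
model = LatentDynamics(feat_dim=384, action_dim=2)
opt = torch.optim.Adam(model.parameters(), lr=3e-4)
obs_t = torch.randn(8, 3, 224, 224)   # dummy stand-ins for offline trajectory data
act_t = torch.randn(8, 2)
obs_t1 = torch.randn(8, 3, 224, 224)
loss = nn.functional.mse_loss(model(encode(obs_t), act_t), encode(obs_t1))
opt.zero_grad(); loss.backward(); opt.step()
```

Because the loss is computed entirely in DINOv2 feature space, no pixel decoder is needed, which is what lets the model train directly on offline, pre-collected trajectories.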

Lay Summary:

A core capability for intelligent agents is predicting how their actions will affect the environment. Giving machines this foresight is the goal of world models, which predict future outcomes given current actions. However, most existing world models are hard to train, rely on hand-crafted rewards, and are tailored to one specific task at a time.

We introduce DINO-WM, a new world model that is task-agnostic, can be trained entirely on offline datasets, and enables agents to reason at test time by optimizing over action sequences. DINO-WM leverages the pre-trained vision encoder DINOv2 to enhance spatial understanding. This allows the model to predict directly in a compact latent space, capturing task-relevant information while avoiding the need to reconstruct raw pixels, which reduces both complexity and computational cost.

With this approach, DINO-WM enables zero-shot planning for unseen goals and environment configurations, such as navigating unfamiliar mazes or manipulating new object shapes. It brings us closer to building general-purpose world models that enable flexible, goal-directed behavior without additional supervision or task-specific retraining.
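The test-time reasoning step, "optimizing over action sequences", can likewise be sketched briefly. The function below is a hedged illustration using simple random-shooting model-predictive control: roll candidate action sequences through a learned latent dynamics model (such as the `LatentDynamics` sketch above) and keep the sequence whose predicted final patch features are closest to the goal image's features. The paper's exact planner and cost function may differ; all names and dimensions here are assumptions.

```python
# Illustrative zero-shot planning loop (random-shooting MPC in feature space).
import torch

def plan(dynamics, cur_feats, goal_feats, horizon=10, n_samples=256, action_dim=2):
    # cur_feats, goal_feats: (P, D) patch features from the frozen encoder.
    actions = torch.randn(n_samples, horizon, action_dim)   # candidate sequences
    feats = cur_feats.expand(n_samples, -1, -1).clone()     # (N, P, D) rollout state
    for t in range(horizon):
        feats = dynamics(feats, actions[:, t])              # latent rollout, no pixels
    cost = (feats - goal_feats).pow(2).mean(dim=(1, 2))     # feature-space distance
    return actions[cost.argmin()]                           # best action sequence
```

Because the goal is specified purely as an image's features, the same planner handles new mazes, object shapes, or particle configurations without rewards, demonstrations, or retraining.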
