

Poster

SOLD: Slot Object-Centric Latent Dynamics Models for Relational Manipulation Learning from Pixels

Malte Mosbach · Jan Ewertz · Angel Villar-Corrales · Sven Behnke

West Exhibition Hall B2-B3 #W-706
Wed 16 Jul 11 a.m. PDT — 1:30 p.m. PDT

Abstract:

Learning a latent dynamics model provides a task-agnostic representation of an agent's understanding of its environment. Leveraging this knowledge for model-based reinforcement learning (RL) holds the potential to improve sample efficiency over model-free methods by learning from imagined rollouts. Furthermore, because the latent space serves as input to behavior models, the informative representations learned by the world model facilitate efficient learning of desired skills. Most existing methods rely on holistic representations of the environment's state. In contrast, humans reason about objects and their interactions, predicting how actions will affect specific parts of their surroundings. Inspired by this, we propose Slot-Attention for Object-centric Latent Dynamics (SOLD), a novel model-based RL algorithm that learns object-centric dynamics models in an unsupervised manner from pixel inputs. We demonstrate that the structured latent space not only improves model interpretability but also provides a valuable input space for behavior models to reason over. Our results show that SOLD outperforms DreamerV3 and TD-MPC2, two state-of-the-art model-based RL algorithms, across a range of multi-object manipulation environments that require both relational reasoning and dexterous control. Videos and code are available at https://slot-latent-dynamics.github.io.
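To give a rough intuition for the slot-based decomposition the abstract describes, the following is a minimal, self-contained sketch of a Slot-Attention-style update in NumPy. It is not the paper's model: the actual SOLD architecture uses learned projections, a GRU, and an MLP trained end-to-end, whereas here random matrices (`Wk`, `Wv`, `Wq`, all hypothetical stand-ins) replace learned weights, and slots are updated with a plain attention-weighted mean. The key mechanism it does preserve is that the softmax is taken over the slot axis, so slots compete to explain each input feature.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def slot_attention(features, num_slots=4, dim=16, iters=3):
    """Simplified slot-attention sketch: slots iteratively compete
    for input features and are set to the attention-weighted mean
    of the values (no learned GRU/MLP update as in the real model)."""
    n, d = features.shape
    # Random projections standing in for learned linear maps (assumption).
    Wk = rng.normal(size=(d, dim))
    Wv = rng.normal(size=(d, dim))
    Wq = rng.normal(size=(dim, dim))
    slots = rng.normal(size=(num_slots, dim))  # random slot initialization
    k = features @ Wk  # keys,   shape (n, dim)
    v = features @ Wv  # values, shape (n, dim)
    for _ in range(iters):
        q = slots @ Wq  # queries, shape (num_slots, dim)
        # Softmax over the SLOT axis: slots compete for each feature.
        attn = softmax(k @ q.T / np.sqrt(dim), axis=1)  # (n, num_slots)
        # Normalize per slot so the update is a weighted mean over features.
        attn = attn / (attn.sum(axis=0, keepdims=True) + 1e-8)
        slots = attn.T @ v  # (num_slots, dim)
    return slots, attn

features = rng.normal(size=(64, 32))  # e.g. 64 image-patch embeddings
slots, attn = slot_attention(features)
print(slots.shape)  # (4, 16): one compact vector per slot
```

In SOLD, each resulting slot vector is intended to bind to one object, and a separate dynamics model then predicts how the set of slots evolves under the agent's actions; this sketch only illustrates the grouping step.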

Lay Summary:

Teaching robots and game-playing agents is often time-consuming because most algorithms take in every pixel instead of the handful of objects that really matter. Humans, by contrast, effortlessly track the coffee mug, the table, and their own hand, and predict how each will move when they act. Our work aims to bring that object-level common sense to machines. We built SOLD, a system that receives video sequences and, with no human labels, splits the scene into individual “slots”, one compact representation per object. It then learns how each slot changes over time, letting the agent imagine how the scene will evolve under different actions. Because the agent reasons in terms of objects, its inner workings are easier for people to inspect. In simulated tasks where a robot must reason over multiple objects in a scene and manipulate a specific one, SOLD masters the required skills faster and more reliably than today’s best methods. This efficiency could help cut training costs for real-world robots and make them more interpretable, because we can see which objects they attend to when selecting an action.
