Spotlight Poster
LOCATE 3D: Real-World Object Localization via Self-Supervised Learning in 3D
Paul McVay · Sergio Arnaud · Ada Martin · Arjun Majumdar · Krishna Murthy Jatavallabhula · Phillip Thomas · Ruslan Partsey · Daniel Dugas · Abha Gejji · Alexander Sax · Vincent-Pierre Berges · Mikael Henaff · Ayush Jain · Ang Cao · Ishita Prasad · Mrinal Kalakrishnan · Michael Rabbat · Nicolas Ballas · Mahmoud Assran · Oleksandr Maksymets · Aravind Rajeswaran · Franziska Meier
West Exhibition Hall B2-B3 #W-215
We present LOCATE 3D, a model for localizing objects in 3D scenes from referring expressions like "the small coffee table between the sofa and the lamp." LOCATE 3D sets a new state-of-the-art on standard referential grounding benchmarks and showcases robust generalization capabilities. Notably, LOCATE 3D operates directly on sensor observation streams (posed RGB-D frames), enabling real-world deployment on robots and AR devices. Key to our approach is 3D-JEPA, a novel self-supervised learning (SSL) algorithm applicable to sensor point clouds. It takes as input a 3D point cloud featurized using 2D foundation models (CLIP, DINO). Subsequently, masked prediction in latent space is employed as a pretext task to aid the self-supervised learning of contextualized point cloud features. Once trained, the 3D-JEPA encoder is finetuned alongside a language-conditioned decoder to jointly predict 3D masks and bounding boxes. Additionally, we introduce LOCATE 3D DATASET, a new dataset for 3D referential grounding, spanning multiple capture setups with over 130K annotations. This enables both a systematic study of generalization capabilities and the training of a stronger model. Code, models, and the dataset can be found at the project website: locate3d.atmeta.com
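The masked latent prediction pretext task described above can be illustrated with a short PyTorch sketch. The module names, dimensions, masking scheme, and the use of a plain transformer encoder are illustrative assumptions rather than the released 3D-JEPA implementation; the only ingredients taken from the abstract are point coordinates carrying lifted 2D foundation-model features and the regression of latent features at masked locations.

```python
# Minimal sketch of a 3D-JEPA-style pretext task: masked prediction in latent
# space over a point cloud whose points carry features from 2D foundation
# models (e.g., CLIP/DINO) lifted into 3D. All names and sizes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PointTransformerEncoder(nn.Module):
    """Toy stand-in for the context/target encoder over featurized points."""

    def __init__(self, feat_dim=768, dim=512, depth=4, heads=8):
        super().__init__()
        self.proj = nn.Linear(feat_dim + 3, dim)  # lifted 2D features + xyz
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, depth)

    def forward(self, xyz, feats):
        tokens = self.proj(torch.cat([xyz, feats], dim=-1))
        return self.backbone(tokens)  # (B, N, dim) contextualized point features


def jepa_pretext_loss(context_enc, target_enc, predictor, xyz, feats, mask_ratio=0.6):
    """Predict latent features of masked points from the visible context."""
    B, N, _ = xyz.shape
    n_masked = int(mask_ratio * N)
    masked_idx = torch.rand(B, N, device=xyz.device).argsort(dim=1)[:, :n_masked]

    # Zero the lifted 2D features at masked points for the context encoder
    # (a simplification; dropping the masked tokens is another common choice).
    keep = torch.ones(B, N, 1, device=xyz.device)
    keep.scatter_(1, masked_idx.unsqueeze(-1), 0.0)
    ctx = context_enc(xyz, feats * keep)

    # Target encoder (an EMA copy in JEPA-style training) sees the full input.
    with torch.no_grad():
        tgt = target_enc(xyz, feats)

    # Regress the latent targets at the masked locations only.
    gather_idx = masked_idx.unsqueeze(-1).expand(-1, -1, ctx.size(-1))
    pred = predictor(torch.gather(ctx, 1, gather_idx))
    return F.smooth_l1_loss(pred, torch.gather(tgt, 1, gather_idx))


# Toy usage with random tensors standing in for a featurized sensor point cloud.
enc = PointTransformerEncoder()
tgt_enc = PointTransformerEncoder()        # would be an EMA copy of `enc` in practice
tgt_enc.load_state_dict(enc.state_dict())
predictor = nn.Sequential(nn.Linear(512, 512), nn.GELU(), nn.Linear(512, 512))
xyz, feats = torch.randn(2, 1024, 3), torch.randn(2, 1024, 768)
loss = jepa_pretext_loss(enc, tgt_enc, predictor, xyz, feats)
loss.backward()
```

Because the targets live in latent space rather than raw point or pixel space, the encoder is encouraged to learn contextualized features of the scene instead of low-level reconstruction details.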
For robot assistants to become commonplace and perform household tasks alongside humans, we will need to communicate with them via natural language. This means the robot must be able to differentiate objects based on names, descriptions, and spatial relationships in order to successfully ‘put away the pillows on the bed’. This work describes a system that locates objects in 3D space from natural language input.

Given the limited amount of 3D data with objects and descriptions, this work uses a three-fold approach. First, we leverage image models to incorporate knowledge from labeled and unlabeled 2D data. Second, we develop a new algorithm to learn from unlabeled 3D data. Finally, we develop a new approach to learn from labeled 3D data. In addition, we release new labeled 3D data. This system achieves state-of-the-art performance on existing benchmarks that measure 3D localization accuracy from natural language descriptions. We also show that the system performs well in robotic use cases.
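As a companion to the sketch above, the following is a minimal, hypothetical illustration of the grounding stage mentioned in the abstract: a language-conditioned decoder on top of the 3D encoder that jointly predicts a per-point mask and a 3D bounding box for the referred object. The fusion scheme, heads, and all names and dimensions are assumptions for illustration, not the released LOCATE 3D architecture.

```python
# Hedged sketch of a language-conditioned decoder over contextualized point
# features, predicting a referred-object mask and a 3D box. Illustrative only.
import torch
import torch.nn as nn


class GroundingDecoder(nn.Module):
    def __init__(self, dim=512, text_dim=512, heads=8, depth=2):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, dim)
        layer = nn.TransformerDecoderLayer(dim, heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, depth)
        self.mask_head = nn.Linear(dim, 1)   # per-point referred / not-referred logit
        self.box_head = nn.Sequential(       # pooled features -> (cx, cy, cz, w, h, d)
            nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, 6)
        )

    def forward(self, point_feats, text_tokens):
        # Cross-attend point tokens to the referring-expression tokens.
        text = self.text_proj(text_tokens)
        fused = self.decoder(tgt=point_feats, memory=text)   # (B, N, dim)
        mask_logits = self.mask_head(fused).squeeze(-1)      # (B, N)
        box = self.box_head(fused.mean(dim=1))               # (B, 6)
        return mask_logits, box


# Toy usage with random tensors standing in for encoder outputs and text embeddings.
points = torch.randn(1, 2048, 512)   # contextualized point features from the 3D encoder
text = torch.randn(1, 16, 512)       # token embeddings of a referring expression
mask_logits, box = GroundingDecoder()(points, text)
```

In a real pipeline, `point_feats` would come from the pretrained 3D encoder and `text_tokens` from a text encoder applied to the referring expression; both stand-ins here are random tensors purely to make the sketch runnable.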