

Poster

Latent Action Learning Requires Supervision in the Presence of Distractors

Alexander Nikulin · Ilya Zisman · Denis Tarasov · Nikita Lyubaykin · Andrei Polubarov · Igor Kiselev · Vladislav Kurenkov

West Exhibition Hall B2-B3 #W-604
Wed 16 Jul 11 a.m. PDT — 1:30 p.m. PDT

Abstract:

Recently, latent action learning, pioneered by Latent Action Policies (LAPO), has shown remarkable pre-training efficiency on observation-only data, offering potential for leveraging the vast amounts of video available on the web for embodied AI. However, prior work has focused on distractor-free data, where changes between observations are primarily explained by ground-truth actions. Unfortunately, real-world videos contain action-correlated distractors that may hinder latent action learning. Using the Distracting Control Suite (DCS), we empirically investigate the effect of distractors on latent action learning and demonstrate that LAPO struggles in such scenarios. We propose LAOM, a simple LAPO modification that improves the quality of latent actions by 8x, as measured by linear probing. Importantly, we show that providing supervision with ground-truth actions, for as little as 2.5% of the full dataset, during latent action learning improves downstream performance by 4.2x on average. Our findings suggest that integrating supervision during Latent Action Model (LAM) training is critical in the presence of distractors, challenging the conventional pipeline of first learning a LAM and only then decoding from latent to ground-truth actions.
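
As a rough illustration of what "measured by linear probing" can mean, the sketch below fits a linear map from latent actions to ground-truth actions and reports the held-out mean squared error. It is a minimal example under stated assumptions, not the evaluation protocol used in the paper: the function name `linear_probe_mse`, the train/test split, the MSE metric, and the synthetic data are all hypothetical choices made for illustration.

```python
import numpy as np


def linear_probe_mse(latents, actions, train_frac=0.8, seed=0):
    """Fit a linear map (with bias) from latents to actions on a train split
    and return the mean squared error on the held-out split.

    Hypothetical helper for illustration; not taken from the paper.
    """
    rng = np.random.default_rng(seed)
    n = latents.shape[0]
    perm = rng.permutation(n)
    n_train = int(train_frac * n)
    train, test = perm[:n_train], perm[n_train:]

    # Append a column of ones so the probe can learn a bias term.
    features = np.hstack([latents, np.ones((n, 1))])

    # Ordinary least squares: the probe is linear by construction.
    weights, *_ = np.linalg.lstsq(features[train], actions[train], rcond=None)

    predictions = features[test] @ weights
    return float(np.mean((predictions - actions[test]) ** 2))


# Synthetic data (an assumption, standing in for a real dataset).
rng = np.random.default_rng(0)
actions = rng.uniform(-1.0, 1.0, size=(2000, 4))

# Latents that are a linear function of the true actions: easy to probe.
mixing = rng.normal(size=(4, 16))
informative_latents = actions @ mixing

# Latents that carry no action information, e.g. dominated by distractors.
uninformative_latents = rng.normal(size=(2000, 16))

mse_informative = linear_probe_mse(informative_latents, actions)
mse_uninformative = linear_probe_mse(uninformative_latents, actions)
print(f"probe MSE, informative latents:   {mse_informative:.6f}")
print(f"probe MSE, uninformative latents: {mse_uninformative:.6f}")
```

A lower probe error indicates that the ground-truth actions are more linearly recoverable from the latent actions, which is the sense in which one set of latents can be said to be of higher quality than another.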

Lay Summary:

One of the most promising ways to scale up Embodied AI is to use videos from the internet, since this data is extremely large and diverse, encompassing many complex, real-world human activities. However, video data cannot be used immediately due to a lack of action annotations. Recently, approaches that infer latent proxy actions have demonstrated remarkable pre-training efficiency. These approaches, however, have focused on distractor-free data. Unfortunately, such data does not reflect real-world videos, which may contain action-correlated distractors, such as people moving in the background. In this study, we empirically investigate the effect of such distractors on latent action learning. We demonstrate that, without additional supervision in the form of hints indicating what is relevant to the action, these approaches are ineffective. Our findings indicate that existing methods will not be able to utilize the available videos on the internet effectively, and that novel approaches are needed.
