

Poster

Leveraging Skills from Unlabeled Prior Data for Efficient Online Exploration

Max Wilcoxson · Qiyang Li · Kevin Frans · Sergey Levine

West Exhibition Hall B2-B3 #W-713
[ Project Page ]
Wed 16 Jul 11 a.m. PDT — 1:30 p.m. PDT

Abstract:

Unsupervised pretraining has been transformative in many supervised domains. However, applying such ideas to reinforcement learning (RL) presents a unique challenge in that fine-tuning does not involve mimicking task-specific data, but rather exploring and locating the solution through iterative self-improvement. In this work, we study how unlabeled offline trajectory data can be leveraged to learn efficient exploration strategies. While prior data can be used to pretrain a set of low-level skills, or as additional off-policy data for online RL, it has been unclear how to combine these ideas effectively for online exploration. Our method SUPE (Skills from Unlabeled Prior data for Exploration) demonstrates that a careful combination of these ideas compounds their benefits. Our method first extracts low-level skills using a variational autoencoder (VAE), and then pseudo-labels unlabeled trajectories with optimistic rewards and high-level action labels, transforming prior data into high-level, task-relevant examples that encourage novelty-seeking behavior. Finally, SUPE uses these transformed examples as additional off-policy data for online RL to learn a high-level policy that composes pretrained low-level skills to explore efficiently. In our experiments, SUPE consistently outperforms prior strategies across a suite of 42 long-horizon, sparse-reward tasks.
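The abstract describes a three-stage pipeline: extract low-level skills with a VAE, pseudo-label prior trajectories with optimistic rewards and high-level (latent skill) action labels, and then treat the relabeled trajectories as off-policy data for a high-level policy. The sketch below illustrates the pseudo-labeling stage only; it is a minimal illustration under assumed interfaces, and every name in it (`vae_encode_segment`, `optimism_bonus`, `pseudo_label`, the dimensions, the stand-in bonuses) is hypothetical rather than taken from the authors' code.

```python
# Hypothetical sketch of a SUPE-style pseudo-labeling step. Assumes a
# pretrained VAE skill encoder and some optimism/novelty bonus (a real
# implementation might use an RND-style estimate). Names are illustrative.
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, SKILL_DIM, HORIZON = 4, 2, 10  # HORIZON = skill (high-level) step

def vae_encode_segment(states, actions):
    """Stand-in for the pretrained VAE encoder: maps a trajectory segment
    to a latent skill z, used as the high-level action label."""
    feats = np.concatenate([states.mean(axis=0), actions.mean(axis=0)])
    return np.tanh(feats[:SKILL_DIM])  # placeholder projection

def optimism_bonus(state):
    """Stand-in optimistic reward encouraging novelty-seeking behavior."""
    return float(np.linalg.norm(state) % 1.0)

def pseudo_label(trajectory):
    """Split an unlabeled trajectory into skill-length segments and attach
    (a) a latent skill label from the VAE encoder and
    (b) an optimistic reward, yielding high-level off-policy transitions
    (state, skill, reward, next_state)."""
    states, actions = trajectory
    high_level_data = []
    for t in range(0, len(states) - HORIZON, HORIZON):
        seg_s, seg_a = states[t:t + HORIZON], actions[t:t + HORIZON]
        z = vae_encode_segment(seg_s, seg_a)      # high-level action label
        r = optimism_bonus(states[t + HORIZON])   # optimistic reward
        high_level_data.append((states[t], z, r, states[t + HORIZON]))
    return high_level_data

# Example: one random unlabeled trajectory of length 50.
traj = (rng.normal(size=(50, STATE_DIM)), rng.normal(size=(50, 2)))
print(len(pseudo_label(traj)), "high-level transitions")
```

In the actual method, such relabeled transitions would be mixed with fresh online experience in the replay buffer of an off-policy RL algorithm training the high-level policy.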

Lay Summary:

How do we leverage unlabeled prior data to improve the online learning and exploration of a reinforcement learning (RL) agent (a self-improving agent) on challenging tasks that require long-horizon reasoning? Our paper presents a method that uses these data effectively in two steps: (1) break the data into segments and distill them into a set of low-level skills that imitate those segments, and (2) determine which skills are most appropriate to use by analyzing the high-level structure of the data. These steps allow us to learn a high-level agent that picks a low-level skill at a fixed time interval during online learning. By using the high-level agent to carefully select low-level skills online, we collect data in a structured manner, improving sample efficiency. As a result, our method learns efficiently from only limited online data and achieves strong performance on a set of 42 simulated robotic tasks compared to all prior strategies.
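To make the "high-level agent picks a low-level skill at a fixed time interval" structure concrete, here is a minimal sketch of that online control loop. It assumes a learned high-level policy and a frozen pretrained skill decoder; `high_level_policy`, `low_level_skill`, `env_step`, and the toy dynamics are all hypothetical stand-ins, not the authors' implementation.

```python
# Hypothetical sketch of the online phase: a high-level policy selects a
# skill z every HORIZON steps; a frozen low-level skill policy decodes z
# into primitive actions at every step. Names are illustrative.
import numpy as np

rng = np.random.default_rng(1)
STATE_DIM, SKILL_DIM, HORIZON = 4, 2, 10

def high_level_policy(state):
    """Stand-in for the high-level policy, trained off-policy on both
    pseudo-labeled prior data and fresh online data."""
    return np.tanh(state[:SKILL_DIM] + rng.normal(scale=0.1, size=SKILL_DIM))

def low_level_skill(state, z):
    """Stand-in for the pretrained (frozen) low-level skill policy."""
    return np.tanh(state[:2] * z)

def env_step(state, action):
    """Toy placeholder dynamics for a sparse-reward environment."""
    next_state = state + 0.1 * np.pad(action, (0, STATE_DIM - len(action)))
    reward = float(np.linalg.norm(next_state) > 3.0)  # sparse reward
    return next_state, reward

state = np.zeros(STATE_DIM)
for step in range(100):
    if step % HORIZON == 0:        # re-select a skill at a fixed interval
        z = high_level_policy(state)
    action = low_level_skill(state, z)
    state, reward = env_step(state, action)
```

Committing to one skill for a whole interval is what gives the structured, temporally extended exploration the summary describes, rather than choosing a fresh primitive action at every step.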
