

Poster

Meta-Reinforcement Learning with Adaptation from Human Feedback via Preference-Order-Preserving Task Embedding

Siyuan Xu · Minghui Zhu

West Exhibition Hall B2-B3 #W-704
Wed 16 Jul 4:30 p.m. PDT — 7 p.m. PDT

Abstract:

This paper studies meta-reinforcement learning with adaptation from human feedback. It aims to pre-train a meta-model that can achieve few-shot adaptation to new tasks from human preference queries, without relying on reward signals. To solve the problem, we propose the adaptation via Preference-Order-preserving EMbedding (POEM) framework. During meta-training, the framework learns a task encoder, which maps tasks to a preference-order-preserving task embedding space, and a decoder, which maps the embeddings to task-specific policies. During adaptation from human feedback, the task encoder enables efficient inference of the task embedding for a new task from preference queries, from which the decoder produces the task-specific policy. We provide a theoretical guarantee for the convergence of the adaptation process to the task-specific optimal policy and experimentally demonstrate its state-of-the-art performance, with substantial improvement over baseline methods.
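To make the adaptation step described above concrete, the sketch below illustrates one plausible way a new task's embedding could be inferred from pairwise human preference queries, assuming a pre-trained encoder/decoder as in the abstract. It is not the authors' implementation: the trajectory features, network sizes, score network, and Bradley-Terry-style preference likelihood are illustrative assumptions.

```python
# Hypothetical sketch (not the authors' code): infer a task embedding z from
# pairwise human preference queries, assuming a pre-trained, frozen scoring
# head over a preference-order-preserving embedding space. A pre-trained
# decoder would then map z to the task-specific policy.
import torch
import torch.nn as nn
import torch.nn.functional as F

EMB_DIM, FEAT_DIM = 8, 16  # assumed embedding / trajectory-feature sizes

# Assumed scoring head: a trajectory scores higher when it better matches the
# task described by embedding z, so human preference order is preserved.
score_net = nn.Sequential(
    nn.Linear(EMB_DIM + FEAT_DIM, 64), nn.ReLU(), nn.Linear(64, 1)
)

def infer_embedding(pref_queries, n_steps=200, lr=1e-2):
    """Estimate the new task's embedding from human preference queries.

    pref_queries: list of (feat_preferred, feat_rejected) trajectory-feature
    pairs labelled by a human. We maximize a Bradley-Terry-style likelihood
    that the preferred trajectory scores higher under the frozen score_net.
    """
    z = torch.zeros(EMB_DIM, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(n_steps):
        loss = torch.zeros(1)
        for feat_pos, feat_neg in pref_queries:
            s_pos = score_net(torch.cat([z, feat_pos]))
            s_neg = score_net(torch.cat([z, feat_neg]))
            loss = loss - F.logsigmoid(s_pos - s_neg)  # preference log-likelihood
        opt.zero_grad()
        loss.backward()
        opt.step()
    return z.detach()

# Toy usage with random "trajectory features" standing in for rollouts.
queries = [(torch.randn(FEAT_DIM), torch.randn(FEAT_DIM)) for _ in range(5)]
z_hat = infer_embedding(queries)
# policy = decoder(z_hat)  # the pre-trained decoder would yield the policy
print(z_hat)
```

In this reading, only the low-dimensional embedding is optimized at adaptation time while the meta-trained networks stay fixed, which is what makes few-shot adaptation from a handful of preference queries plausible.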

Lay Summary:

This paper introduces a new way to teach AI systems to quickly adapt to new tasks using feedback from humans instead of complex programming or reward setups. The method helps the AI learn patterns across many training tasks, so that when it faces a new task, it can work out what to do just by comparing options that people prefer. This makes the training process much faster and more efficient, especially in situations where it's hard to define what success looks like. The approach shows strong results in robotic simulations, performing as well as or better than existing methods while using much less human input.
