

Oral in Workshop: Exploration in AI Today (EXAIT)

Provably Learning from Language Feedback

Wanqiao Xu · Allen Nie · Ruijie Zheng · Aditya Modi · Adith Swaminathan · Ching-An Cheng

Keywords: [ no-regret learning ] [ bandit ] [ large language models ] [ sequential decision-making ]

[ Project Page ]
 
presentation: Exploration in AI Today (EXAIT)
Sat 19 Jul 8:30 a.m. PDT — 5:15 p.m. PDT

Abstract:

Interactively learning from observation and language feedback is an increasingly studied area, driven by the emergence of large language model (LLM) agents. While impressive empirical demonstrations exist, a principled framing of these decision problems has so far been lacking. In this paper, we formalize the Learning from Language Feedback (LLF) problem, state assumptions sufficient to enable learning despite latent rewards, and introduce the "transfer eluder dimension" as a complexity measure that characterizes the hardness of LLF problems. We show that the transfer eluder dimension captures the intuition that the information carried by feedback changes the learning complexity of LLF, and we demonstrate cases where learning from rich language feedback can be exponentially faster than learning from reward. We develop a no-regret algorithm, LLF-UCB, that provably solves LLF problems through sequential interaction, with performance guarantees that scale with the transfer eluder dimension of the problem. Our contributions mark a first step toward principled agent learning from generic language feedback.
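
The abstract does not spell out how LLF-UCB is constructed, so the sketch below is only a rough illustration of the generic pattern it alludes to: act optimistically over a set of reward hypotheses, observe language feedback instead of a numeric reward, and discard hypotheses the feedback rules out. The toy actions, feedback format, and hypothesis class are all hypothetical and are not taken from the paper.

# A minimal, hypothetical sketch (not the paper's LLF-UCB): a generic
# optimistic learner over a finite hypothesis class, where each round
# yields a text hint instead of a numeric reward.
import random
from itertools import permutations

# Toy problem: three actions with unknown rewards; feedback names a better
# action when one exists, otherwise confirms the choice was best.
ACTIONS = ["a", "b", "c"]
TRUE_REWARD = {"a": 0.2, "b": 0.9, "c": 0.5}

def language_feedback(action):
    better = [x for x in ACTIONS if TRUE_REWARD[x] > TRUE_REWARD[action]]
    return f"try {random.choice(better)} instead" if better else "that was the best choice"

# Hypothesis class: every possible ranking of the three actions.
hypotheses = [dict(zip(p, (0.0, 0.5, 1.0))) for p in permutations(ACTIONS)]

def consistent(h, action, feedback):
    """Could hypothesis h have produced this feedback for this action?"""
    if feedback == "that was the best choice":
        return all(h[action] >= h[x] for x in ACTIONS)
    suggested = feedback.split()[1]  # the action named in "try X instead"
    return h[suggested] > h[action]

for t in range(10):
    # Optimism: play the action whose best-case value over surviving hypotheses is highest.
    action = max(ACTIONS, key=lambda a: max(h[a] for h in hypotheses))
    fb = language_feedback(action)
    # Language feedback prunes hypotheses directly; no numeric reward is ever observed.
    hypotheses = [h for h in hypotheses if consistent(h, action, fb)]
    print(f"round {t}: played {action!r}, feedback {fb!r}, {len(hypotheses)} hypotheses left")

In this toy loop the feedback alone shrinks the hypothesis set; the transfer eluder dimension, as the abstract describes it, is what quantifies how much such feedback information reduces the learning complexity in the general LLF setting.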
