Poster
Minimax Optimal Regret Bound for Reinforcement Learning with Trajectory Feedback
Zihan Zhang · Yuxin Chen · Jason Lee · Simon Du · Ruosong Wang
West Exhibition Hall B2-B3 #W-1018
Abstract:
In this work, we study reinforcement learning (RL) with trajectory feedback. Compared to the standard RL setting, in RL with trajectory feedback the agent only observes the cumulative reward along the trajectory; this model is therefore particularly suitable for scenarios where querying the reward at each single step incurs a prohibitive cost. For a finite-horizon Markov Decision Process (MDP) with $S$ states, $A$ actions and a horizon length of $H$, we develop an algorithm that enjoys an asymptotically nearly optimal regret of $\tilde{O}\left(\sqrt{SAH^3K}\right)$ over $K$ episodes. To achieve this result, our new technical ingredients include (i) constructing a tighter confidence region for the reward function by combining the RL with trajectory feedback setting with techniques from linear bandits, and (ii) constructing a reference transition model to better guide the exploration process.
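To illustrate the linear-bandit connection mentioned in ingredient (i): since an episode's cumulative reward is linear in the vector of (state, action) visitation counts, the unknown per-pair mean rewards can be estimated from trajectory-level returns by regularized least squares, exactly as in linear bandits. The sketch below is an illustrative assumption on our part, not the paper's algorithm (which builds a tighter confidence region); the function and variable names (`estimate_rewards`, `visit_counts`, `returns`, `lam`) are hypothetical.

```python
import numpy as np

def estimate_rewards(visit_counts, returns, lam=1.0):
    """Ridge-regression estimate of per-(state, action) mean rewards
    from trajectory-level feedback (a minimal sketch, not the paper's method).

    visit_counts : (K, S*A) array; row k counts how often each (state, action)
                   pair was visited in episode k (the linear-bandit "feature").
    returns      : (K,) array; observed cumulative reward of each episode.
    lam          : ridge regularization strength.
    """
    d = visit_counts.shape[1]
    # Regularized Gram matrix: Lambda = lam * I + sum_k x_k x_k^T
    gram = lam * np.eye(d) + visit_counts.T @ visit_counts
    # Least-squares estimate: r_hat = Lambda^{-1} sum_k x_k * return_k
    r_hat = np.linalg.solve(gram, visit_counts.T @ returns)
    return r_hat, gram

# Usage: an elliptical confidence width for any visitation-count vector x,
# as in standard linear-bandit analyses (beta is a confidence-level constant).
def confidence_width(x, gram, beta=1.0):
    return beta * np.sqrt(x @ np.linalg.solve(gram, x))
```

Under this reduction, a single episode plays the role of one linear-bandit round with feature vector equal to the visitation counts, which is why confidence-region techniques from linear bandits are applicable.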
Lay Summary:
We study reinforcement learning (RL) with trajectory feedback, where the agent only observes the cumulative reward along the trajectory. This model is particularly suitable for scenarios where querying the reward at each step incurs prohibitive costs. We develop an algorithm that achieves asymptotically near-optimal regret for finite-horizon Markov Decision Processes.