Poster
Near-optimal Regret Using Policy Optimization in Online MDPs with Aggregate Bandit Feedback
Tal Lancewicki · Yishay Mansour
West Exhibition Hall B2-B3 #W-1007
We study the challenge of training reinforcement learning agents when feedback is available only as the total loss at the end of each episode, a setting known as aggregate bandit feedback. This is common in applications such as robotics or dialogues with an LLM, where step-by-step feedback is typically unavailable. Our work introduces the first Policy Optimization algorithms for this problem. When the environment’s dynamics are known, our method achieves the first optimal bound on regret (a standard performance measure) for this setting. When the dynamics are unknown, our approach substantially improves on the previous best regret bounds, marking a significant step forward for learning in environments with limited feedback.
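To make the feedback model concrete, below is a minimal sketch of the interaction protocol in a toy tabular environment: the learner plays a policy for a full episode and observes only the cumulative loss, never the per-step losses. The `EpisodicEnv` class, its parameters, and the uniform policy are illustrative assumptions, not code or notation from the paper.

```python
import numpy as np

class EpisodicEnv:
    """Toy episodic MDP that reveals only aggregate (end-of-episode) loss."""

    def __init__(self, num_states, num_actions, horizon, rng):
        self.nS, self.nA, self.H, self.rng = num_states, num_actions, horizon, rng
        # Hidden per-step losses and transition kernel (unknown to the learner).
        self.loss = rng.uniform(size=(horizon, num_states, num_actions))
        self.P = rng.dirichlet(np.ones(num_states), size=(num_states, num_actions))

    def run_episode(self, policy):
        """Play `policy` for H steps; return the trajectory and total loss only."""
        s, total_loss, trajectory = 0, 0.0, []
        for h in range(self.H):
            a = self.rng.choice(self.nA, p=policy[h, s])
            trajectory.append((s, a))
            total_loss += self.loss[h, s, a]      # accumulated internally...
            s = self.rng.choice(self.nS, p=self.P[s, a])
        return trajectory, total_loss             # ...but only the sum is revealed


rng = np.random.default_rng(0)
env = EpisodicEnv(num_states=5, num_actions=3, horizon=10, rng=rng)
uniform_policy = np.full((10, 5, 3), 1.0 / 3.0)   # placeholder policy
traj, ep_loss = env.run_episode(uniform_policy)
print(f"observed aggregate loss: {ep_loss:.2f}")  # the only feedback signal
```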