Poster
in
Workshop: Exploration in AI Today (EXAIT)
Oracle-Efficient Adversarial Reinforcement Learning via Max-Following
Sikata Sengupta · Zakaria Mhammedi · Teodor Vanislavov Marinov
Keywords: [ reinforcement learning ] [ machine learning theory ] [ max-following ] [ ensembling ]
Learning the optimal policy in reinforcement learning (RL) with large state and action spaces remains a notoriously difficult problem from both computational and statistical perspectives. A recent line of work addresses this challenge by aiming to compete with, or improve upon, a given base class of policies. One approach, known as max-following, selects at each state the policy from the base class whose estimated value function is highest. In this paper, we extend the max-following framework to the setting of regret minimization under adversarial initial states and limited feedback. Our algorithm is oracle-efficient, achieves no-regret guarantees with respect to the base class (and to the worst approximate max-following policy), and avoids any dependence on the size of the state or action space. It also attains the optimal rate in terms of the number of episodes. Additionally, we establish a lower bound on the regret of any max-following algorithm as a function of β, a parameter that quantifies the approximation slack in the benchmark policy class. Finally, we empirically validate our theoretical findings on the Linear Quadratic Regulator (LQR) problem.
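To make the max-following selection rule described in the abstract concrete, here is a minimal sketch of that rule under illustrative assumptions; the names (`base_policies`, `value_estimates`) and interface are hypothetical and not taken from the paper's algorithm, which additionally handles adversarial initial states and limited feedback.

```python
# Minimal sketch of the max-following selection rule: at each state, act with
# the base policy whose estimated value at that state is highest.
# Interface and names are illustrative assumptions, not the paper's algorithm.

from typing import Callable, Sequence, TypeVar

State = TypeVar("State")
Action = TypeVar("Action")


def max_following_action(
    state: State,
    base_policies: Sequence[Callable[[State], Action]],   # pi_1, ..., pi_K
    value_estimates: Sequence[Callable[[State], float]],  # estimated V^{pi_k}
) -> Action:
    # Pick the index k maximizing the estimated value V^{pi_k}(state) ...
    best_k = max(range(len(base_policies)),
                 key=lambda k: value_estimates[k](state))
    # ... and follow that policy's action at the current state.
    return base_policies[best_k](state)
```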