Poster
Near-Optimal Sample Complexity for MDPs via Anchoring
Jongmin Lee · Mario Bravo · Roberto Cominetti
West Exhibition Hall B2-B3 #W-1011
Abstract:
We study a new model-free algorithm to compute $\varepsilon$-optimal policies for average reward Markov decision processes, in the weakly communicating setting. Given a generative model, our procedure combines a recursive sampling technique with Halpern's anchored iteration, and computes an $\varepsilon$-optimal policy with sample and time complexity $\widetilde{O}(|\mathcal{S}||\mathcal{A}|\|h\|^{2}/\varepsilon^{2})$ both in high probability and in expectation. To our knowledge, this is the best complexity among model-free algorithms, matching the known lower bound up to a factor $\|h\|$. Although the complexity bound involves the span seminorm $\|h\|$ of the unknown bias vector, the algorithm requires no prior knowledge and implements a stopping rule which guarantees with probability 1 that the procedure terminates in finite time. We also analyze how these techniques can be adapted for discounted MDPs.
Lay Summary:
We propose a new algorithm, SAVIA+, for solving average reward Markov decision processes (MDPs) in the generative model setting. To the best of our knowledge, SAVIA+ is the most efficient model-free algorithm to date, nearly matching the lower bound on sample complexity. Unlike many prior methods, it does not require any prior knowledge of the optimal solution. Our approach combines two key ideas: recursive sampling and Halpern's anchored iteration. We also show how these techniques can be extended to address discounted MDPs.
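As a rough illustration of the anchoring idea (not the authors' SAVIA+ procedure), the sketch below applies Halpern's anchored iteration to the exact average-reward Bellman operator on a small tabular MDP. It uses a known transition kernel rather than the generative-model samples, recursive sampling, and stopping rule of the actual algorithm; the function names `bellman` and `halpern_anchored` and the toy MDP data are hypothetical.

```python
import numpy as np

def bellman(h, r, P):
    """Exact average-reward Bellman operator:
    (Th)(s) = max_a [ r(s,a) + sum_{s'} P(s'|s,a) h(s') ]."""
    # r: (S, A) rewards, P: (S, A, S) transition kernel, h: (S,) bias estimate
    return np.max(r + np.einsum('sap,p->sa', P, h), axis=1)

def halpern_anchored(r, P, num_iters=2000):
    """Halpern iteration anchored at h0 = 0:
    h_{k+1} = beta_k * h0 + (1 - beta_k) * T(h_k), with beta_k = 1/(k+2)."""
    h0 = np.zeros(r.shape[0])
    h = h0.copy()
    for k in range(num_iters):
        beta = 1.0 / (k + 2)                      # vanishing anchor weight
        h = beta * h0 + (1.0 - beta) * bellman(h, r, P)
        h -= h[0]  # constant shift for stability; the span seminorm ignores it
    # greedy policy with respect to the final bias estimate
    policy = np.argmax(r + np.einsum('sap,p->sa', P, h), axis=1)
    return h, policy

# smoke test on a tiny random MDP (hypothetical data, for illustration only)
rng = np.random.default_rng(0)
S, A = 4, 2
r = rng.random((S, A))
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)  # normalize each (s, a) row into a distribution
h, policy = halpern_anchored(r, P)
print("bias estimate (shift-normalized):", h)
print("greedy policy:", policy)
```

The $1/(k+2)$ schedule is the standard Halpern anchoring weight; the paper's model-free algorithm replaces the exact operator $T$ above with recursively sampled estimates drawn from the generative model, which is where the stated sample complexity comes from.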