

Poster

Categorical Distributional Reinforcement Learning with Kullback-Leibler Divergence: Convergence and Asymptotics

Tyler Kastner · Mark Rowland · Yunhao Tang · Murat Erdogdu · Amir-massoud Farahmand

West Exhibition Hall B2-B3 #W-607
Thu 17 Jul 11 a.m. PDT — 1:30 p.m. PDT

Abstract:

We study the problem of distributional reinforcement learning using categorical parametrisations and a KL divergence loss. Previous work analyzing categorical distributional RL has done so using a Cramér distance-based loss, which simplifies the analysis but creates a theory-practice gap. We introduce a preconditioned version of the algorithm and prove that it is guaranteed to converge. We further derive the asymptotic variance of the categorical estimates under different learning rate regimes, and compare it to that of classical reinforcement learning. Finally, we empirically validate our theoretical results, investigate the relative strengths of KL losses, and derive a number of actionable insights for practitioners.

Lay Summary:

A popular approach to deep reinforcement learning is to use classification losses to learn the range of possible future outcomes. Previous theoretical works studying this algorithm change the loss used in order to simplify the analysis, but this creates a theory-practice gap. In this work, we directly study these learning algorithms with the classification loss used in practice, the KL divergence. We show that with some modifications to the dynamics (the use of a preconditioner matrix), the updates provably converge. We also study the efficiency of these methods compared to standard reinforcement learning, and we prove results on the exact variance of these algorithms as they approach convergence.

Throughout our analysis, we obtain a number of insights that are valuable to anyone using these methods in practice, such as how to modify the learning rate as one changes the number of atoms (a separate hyperparameter), and how the number and locations of these atoms affect the error incurred.
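To make the setting concrete, the following is a minimal sketch of the kind of categorical learning rule the summary describes: a return distribution represented by softmax probabilities over a fixed grid of atoms, updated by the gradient of a KL (cross-entropy) loss against the projected Bellman target. This is an illustrative reconstruction, not the paper's exact algorithm; in particular, it uses the standard C51-style projection for a single-state Markov reward process and omits the paper's preconditioner matrix, whose precise form is not given in this summary. All function names and hyperparameter values here are hypothetical.

```python
import numpy as np

def project_categorical(atoms, target_atoms, target_probs):
    # Project a distribution supported on target_atoms onto the fixed,
    # evenly spaced grid `atoms`, splitting each mass between the two
    # nearest atoms (the standard C51-style projection).
    v_min, v_max = atoms[0], atoms[-1]
    dz = atoms[1] - atoms[0]
    proj = np.zeros_like(atoms, dtype=float)
    for z, p in zip(target_atoms, target_probs):
        z = np.clip(z, v_min, v_max)
        b = (z - v_min) / dz          # fractional index of z on the grid
        lo, hi = int(np.floor(b)), int(np.ceil(b))
        if lo == hi:
            proj[lo] += p             # z lands exactly on an atom
        else:
            proj[lo] += p * (hi - b)  # linear interpolation of the mass
            proj[hi] += p * (b - lo)
    return proj

def kl_update(logits, atoms, reward, gamma, lr):
    # One KL-loss update for a single-state MRP with deterministic reward.
    # Under a softmax parametrisation, the gradient of
    # KL(target || softmax(logits)) w.r.t. the logits is (probs - target),
    # exactly as in cross-entropy classification.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    target = project_categorical(atoms, reward + gamma * atoms, probs)
    return logits - lr * (probs - target)
```

Iterating `kl_update` drives the categorical estimate toward the fixed point of the projected Bellman operator; the learning-rate-vs-atom-count interaction mentioned in the summary shows up here in how the per-atom gradient `(probs - target)` shrinks as the grid is refined.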
