Poster
Provable Policy Gradient for Robust Average-Reward MDPs Beyond Rectangularity
Qiuhao Wang · Yuqi Zha · Chin Pang Ho · Marek Petrik
West Exhibition Hall B2-B3 #W-717
Robust Markov Decision Processes (MDPs) offer a promising framework for computing reliable policies under model uncertainty. While policy gradient methods have gained increasing popularity in robust discounted MDPs, their application to the average-reward criterion remains largely unexplored. This paper proposes Robust Projected Policy Gradient (RP2G), the first generic policy gradient method for robust average-reward MDPs (RAMDPs) that is applicable beyond the typical rectangularity assumption on transition ambiguity. In contrast to existing robust policy gradient algorithms, RP2G incorporates an adaptive decreasing tolerance mechanism for efficient policy updates at each iteration. We also present a comprehensive convergence analysis of RP2G for solving ergodic tabular RAMDPs. Furthermore, we present the first study of the inner worst-case transition evaluation problem in RAMDPs, proposing two gradient-based algorithms tailored for rectangular and general ambiguity sets, each with provable convergence guarantees. Numerical experiments confirm the global convergence of our new algorithm and demonstrate its superior performance.
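To make the robust projected policy gradient idea concrete, the following is a minimal toy sketch in Python, not the paper's RP2G. Several simplifying assumptions are made: the ambiguity set is a small finite list of candidate transition kernels (rather than a continuous, possibly non-rectangular set), the inner worst-case evaluation is solved exactly by enumeration (in place of the paper's adaptive-tolerance inner solver), and policy gradients are approximated by finite differences. All names and the example MDP are hypothetical.

```python
import numpy as np

# Toy sketch of a projected policy gradient loop for a robust
# average-reward MDP. NOT the paper's RP2G: the ambiguity set is a
# finite list of candidate kernels, the inner worst case is solved
# by enumeration, and gradients are finite differences.

S, A = 2, 2                      # states, actions
r = np.array([[1.0, 0.0],        # r[s, a]: reward 1 for taking action s in state s
              [0.0, 1.0]])

def kernel(p):
    # P[s, a, s']: action a moves to state a with prob p, else the other state
    P = np.zeros((S, A, S))
    for s in range(S):
        for a in range(A):
            P[s, a, a] = p
            P[s, a, 1 - a] = 1.0 - p
    return P

kernels = [kernel(0.9), kernel(0.7)]   # finite stand-in for an ambiguity set

def stationary(P_pi):
    # Stationary distribution of an ergodic chain, via least squares
    n = P_pi.shape[0]
    M = np.vstack([P_pi.T - np.eye(n), np.ones(n)])
    b = np.append(np.zeros(n), 1.0)
    d, *_ = np.linalg.lstsq(M, b, rcond=None)
    return d

def gain(pi, P):
    # Long-run average reward of policy pi under transition kernel P
    P_pi = np.einsum('sa,sat->st', pi, P)
    return stationary(P_pi) @ np.einsum('sa,sa->s', pi, r)

def worst_case_gain(pi):
    # Inner problem: worst kernel in the (finite) ambiguity set
    return min(gain(pi, P) for P in kernels)

def project_simplex(v):
    # Euclidean projection onto the probability simplex
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > css - 1.0)[0][-1]
    return np.maximum(v - (css[rho] - 1.0) / (rho + 1), 0.0)

pi = np.full((S, A), 1.0 / A)    # start from the uniform policy
eta, eps = 0.5, 1e-5             # step size, finite-difference width
for _ in range(300):
    base = worst_case_gain(pi)
    grad = np.zeros_like(pi)
    for s in range(S):
        for a in range(A):
            pert = pi.copy()
            pert[s, a] += eps
            grad[s, a] = (worst_case_gain(pert) - base) / eps
    pi = np.array([project_simplex(pi[s] + eta * grad[s]) for s in range(S)])

print(worst_case_gain(pi))       # close to 1.0, the robust optimal gain here
```

In this toy problem, the robust optimal policy takes action s in state s, earning reward 1 in every state under every candidate kernel, so the projected ascent drives the worst-case average reward from 0.5 (uniform policy) toward 1. The paper's contribution lies precisely in the parts simplified away here: gradient-based inner solvers for rectangular and general ambiguity sets, and an adaptive decreasing tolerance that avoids solving the inner problem exactly at every outer iteration.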
Many real-world applications, such as supply chains or autonomous vehicles, require making reliable decisions over time and achieving strong long-term performance. However, the dynamics of the environment are often uncertain or not fully known, which makes it difficult to find reliable strategies.

To address this challenge, we propose a new learning algorithm, RP2G, which computes reliable long-term strategies under uncertainty. The algorithm includes two variants designed for different structural assumptions about the unknown environment. In addition, we introduce an adaptive update rule that improves learning efficiency over time.

Our method is supported by theoretical guarantees and shows strong performance in experiments. It offers a general framework for computing reliable long-term strategies when the environment is uncertain or only partially known, and it adapts to different structural assumptions.