Poster
Gap-Dependent Bounds for Federated $Q$-Learning
Haochen Zhang · Zhong Zheng · Lingzhou Xue
West Exhibition Hall B2-B3 #W-619
Existing theoretical analyses of federated reinforcement learning (FRL) algorithms do not fully capture their real-world effectiveness. While these methods perform strongly in practice, their worst-case guarantees remain overly conservative, especially in problems where some actions are clearly better than others.

We present a new analysis revealing the hidden efficiency of standard FRL approaches. By accounting for natural structure in learning problems, in particular the suboptimality gap between optimal and non-optimal actions, we show that these methods converge much faster and require far fewer communication rounds than previously proven. Our work gives a precise mathematical characterization of how the exploration and exploitation phases each contribute to overall performance.

This research bridges the gap between theory and practice, explaining why these algorithms work so well in real applications. Our results show that these methods scale remarkably well with the number of learners, supporting their use in privacy-sensitive areas such as healthcare and robotics.
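For readers who want the formal object behind "the suboptimality gap": in standard finite-horizon notation (an assumption about the general setting, not the paper's exact definitions), the gap of action $a$ in state $s$ at step $h$ is $\Delta_h(s,a) = V_h^*(s) - Q_h^*(s,a)$, and gap-dependent bounds scale with the smallest positive gap $\Delta_{\min} = \min\{\Delta_h(s,a) : \Delta_h(s,a) > 0\}$, typically replacing worst-case $\sqrt{T}$-type terms with $\log(T)/\Delta_{\min}$-type terms.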
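To make the setup concrete, here is a minimal sketch of the generic federated $Q$-learning pattern: several agents run tabular $Q$-learning locally, and a central server periodically averages their $Q$-tables. This illustrates only the communication structure, not the authors' algorithm; the toy chain environment, step size, discount, and synchronization period below are invented for the example.

```python
import numpy as np

# Sketch of federated Q-learning's communication pattern: local updates
# punctuated by periodic server-side averaging. All constants are toy choices.
N_STATES, N_ACTIONS, HORIZON = 5, 2, 10
N_AGENTS, SYNC_PERIOD, EPISODES = 4, 50, 1000
GAMMA, LR, EPS = 0.9, 0.1, 0.1
rng = np.random.default_rng(0)

def step(s, a):
    """Toy chain MDP: action 1 moves right (reward at the far end), action 0 resets."""
    if a == 1:
        s_next = min(s + 1, N_STATES - 1)
        return s_next, 1.0 if s_next == N_STATES - 1 else 0.0
    return 0, 0.0

def local_episode(Q):
    """One epsilon-greedy, discounted tabular Q-learning episode on a local Q-table."""
    s = 0
    for _ in range(HORIZON):
        a = int(rng.integers(N_ACTIONS)) if rng.random() < EPS else int(np.argmax(Q[s]))
        s_next, r = step(s, a)
        # Standard tabular Q-learning update with discount GAMMA.
        Q[s, a] += LR * (r + GAMMA * Q[s_next].max() - Q[s, a])
        s = s_next

# Each agent keeps its own Q-table between synchronization rounds.
Qs = [np.zeros((N_STATES, N_ACTIONS)) for _ in range(N_AGENTS)]

for ep in range(1, EPISODES + 1):
    for Q in Qs:          # local phase: no communication
        local_episode(Q)
    if ep % SYNC_PERIOD == 0:
        # Communication round: the server averages the agents' Q-tables and
        # broadcasts the result back. The poster's gap-dependent analysis
        # concerns how rarely such rounds are needed.
        Q_avg = np.mean(Qs, axis=0)
        Qs = [Q_avg.copy() for _ in range(N_AGENTS)]

print("Greedy action per state:", np.argmax(Qs[0], axis=1))
```

In this pattern, the number of communication rounds is EPISODES / SYNC_PERIOD; gap-dependent results of the kind described above are about showing that performance degrades little even when that ratio is kept small.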