ICML Poster Geometric Resampling in Nearly Linear Time for Follow-the-Perturbed-Leader with Best-of-Both-Worlds Guarantee in Bandit Problems

Botao Chen · Jongyeong Lee · Junya Honda

West Exhibition Hall B2-B3 #W-915

Wed 16 Jul 4:30 p.m. PDT — 7 p.m. PDT

Abstract: This paper studies the complexity and optimality of the Follow-the-Perturbed-Leader (FTPL) policy in $K$-armed bandit problems. FTPL is a promising policy that achieves the Best-of-Both-Worlds (BOBW) guarantee without solving an optimization problem, unlike Follow-the-Regularized-Leader (FTRL). However, FTPL needs a procedure called geometric resampling to estimate the loss, which has $O(K^2)$ per-round average complexity, usually worse than that of FTRL. To address this issue, we propose a novel technique, which we call Conditional Geometric Resampling (CGR), for unbiased loss estimation applicable to general perturbation distributions. CGR reduces the average complexity to $O(K\log K)$ without sacrificing the regret bounds. We also propose a biased version of CGR that can control the worst-case complexity while keeping the BOBW guarantee for a certain perturbation distribution. We confirm through experiments that CGR not only significantly improves the average and worst-case runtime but also achieves better regret thanks to the stable loss estimation.
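
To make the $O(K^2)$ figure concrete, here is a minimal sketch of plain geometric resampling, the baseline procedure the abstract refers to. It is not the proposed CGR. It assumes Fréchet perturbations with shape 2 and a fixed learning rate `eta`. The function names (`frechet`, `perturbed_leader`, `geometric_resampling`, `ftpl_round`) are hypothetical and chosen only for illustration. After arm $i$ is played, the perturbations are redrawn until arm $i$ is the perturbed leader again. The number of draws $M$ is geometric with mean $1/w_i$, where $w_i$ is the probability of selecting arm $i$, so $M$ serves as an unbiased estimate of $1/w_i$. Averaged over the played arm, $\mathbb{E}[M] = \sum_i w_i \cdot (1/w_i) = K$, and each redraw samples $K$ perturbations, which gives roughly $K^2$ work per round.

```python
import random

def frechet(rng, alpha=2.0):
    # Frechet(alpha) sample: if E ~ Exp(1), then E**(-1/alpha) is Frechet(alpha).
    # The shape alpha=2 is an assumption made for this sketch.
    e = max(rng.expovariate(1.0), 1e-300)  # guard against an exact zero
    return e ** (-1.0 / alpha)

def perturbed_leader(cum_loss_est, eta, rng):
    # One O(K) FTPL draw: perturb every arm, return the arm with the smallest score.
    scores = [est - frechet(rng) / eta for est in cum_loss_est]
    return min(range(len(scores)), key=scores.__getitem__)

def geometric_resampling(cum_loss_est, eta, chosen, rng):
    # Redraw perturbations until the played arm is the leader again.
    # The returned count M is geometric with mean 1 / w_chosen.
    m = 1
    while perturbed_leader(cum_loss_est, eta, rng) != chosen:
        m += 1
    return m

def ftpl_round(cum_loss_est, eta, loss_vector, rng):
    # Play one round; only the loss of the played arm is used (bandit feedback).
    arm = perturbed_leader(cum_loss_est, eta, rng)
    m = geometric_resampling(cum_loss_est, eta, arm, rng)
    cum_loss_est[arm] += loss_vector[arm] * m  # importance-weighted loss estimate
    return arm, m

if __name__ == "__main__":
    rng = random.Random(0)
    K, rounds = 8, 2000
    est = [0.0] * K
    total = sum(ftpl_round(est, 1.0, [rng.random() for _ in range(K)], rng)[1]
                for _ in range(rounds))
    print("average resampling draws per round:", total / rounds)
```

The loop in `geometric_resampling` has no cap, so a single round can take very long when the played arm has small selection probability. This is the worst-case cost that the abstract's biased variant is said to control.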

Lay Summary:

Imagine you're playing a game with several slot machines, each giving different rewards. Your goal is to win as much as possible, but you don't know which machine is the best at first, and you have limited chances to find out. You need a smart strategy to balance learning about the machines and earning rewards. This classic dilemma shows up in many real-world settings like online recommendations and clinical trials.

In this paper, we propose a new method that helps computers make better decisions in such uncertain situations. It speeds up the learning process by reducing the need for repeated trial and error, while still making accurate choices. Our approach works well even when the environment changes over time and is much faster than existing methods, making it more practical for real-world use.