Spotlight Poster
Catoni Contextual Bandits are Robust to Heavy-tailed Rewards
Chenlu Ye · Yujia Jin · Alekh Agarwal · Tong Zhang
West Exhibition Hall B2-B3 #W-919
Many real-world decision-making systems—like those used in online advertising or wireless networks—face unpredictable rewards that can be very large or "heavy-tailed." This makes it difficult for standard learning algorithms to make good decisions, since they are designed assuming rewards are relatively well-behaved and bounded.

We tackle this problem by designing new algorithms for contextual bandits, a type of learning model that helps an agent choose actions based on observed data. Our algorithms use a statistical tool called the Catoni estimator to achieve robustness even when the observed rewards have a large range but moderate variance. We provide two versions: one for when the reward variance is known in advance, and one that works even when it is not.

Our methods achieve strong variance-dependent performance guarantees whose dependence on the worst-case reward range is only logarithmic, meaning they are much more accurate and efficient in realistic scenarios. These results push the boundary of robust reinforcement learning and make it more practical for applications involving unreliable or extreme feedback, such as recommendation systems, finance, and networking.
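To give a feel for the core tool, here is a minimal sketch of Catoni's M-estimator of the mean, the building block the abstract refers to. This is an illustrative implementation, not the authors' algorithm: the function names and the simple bisection solver are our own choices, and the tuning parameter `alpha` would in practice be set from the (known or estimated) variance as the paper prescribes.

```python
import math

def catoni_psi(x):
    # Catoni's influence function: it grows only logarithmically, so a few
    # extreme samples from a heavy tail have bounded influence on the estimate.
    if x >= 0:
        return math.log(1.0 + x + x * x / 2.0)
    return -math.log(1.0 - x + x * x / 2.0)

def catoni_mean(samples, alpha, tol=1e-10):
    # The estimate is the root theta of sum_i psi(alpha * (x_i - theta)) = 0.
    # The score is decreasing in theta, so bisection on [min, max] finds it.
    def score(theta):
        return sum(catoni_psi(alpha * (x - theta)) for x in samples)
    lo, hi = min(samples), max(samples)
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if score(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0
```

Unlike the empirical mean, which a single huge reward can drag arbitrarily far, this estimator stays near the bulk of the data; that bounded sensitivity is what makes its error depend on the reward range only logarithmically.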