Poster
Adversarial Combinatorial Semi-bandits with Graph Feedback
Yuxiao Wen
East Exhibition Hall A-B #E-2709
In many scenarios, we make decisions while simultaneously learning the factors that affect our preferences or gains; this is called online learning. We want to learn about those factors, but we also want to gain as much as possible along the way. For example, suppose there are a few lotteries (numbered 1 to K) with different winning probabilities but the same price, and we can draw any one of them every day. To win as much money as possible (say, over one year), we need to figure out which lottery has the highest winning probability as we go. This type of problem is studied under the name multi-armed bandits.

In this work, we study a variant, combinatorial semi-bandits, in which we can now draw several (say S) of the lotteries every day. How should we maximize our gain? More importantly, if there is additional information, how can we leverage it? For example, imagine that whenever you buy lottery k, your rich friend buys every lottery with a smaller number, namely 1 to k-1, and tells you her outcomes as well. This information structure forms a graph over the candidate actions (in this case, the lotteries we can purchase) and is called graph feedback.

We mathematically characterize the performance guarantees of any policy/strategy you could use to maximize your total gain, in the worst-case scenario. These are the so-called minimax regret bounds, where regret measures the gap between your total gain and that of the best fixed choice in hindsight. We show how the worst-case performance guarantees relate to the graph structure, and we provide an algorithm that is optimal in the worst-case sense. In any online decision-making process, such as bidding in advertising, online inventory control, or recommendation systems, you may use this algorithm to guarantee your total gain.
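To make the lottery story concrete, here is a minimal Python sketch of the interaction loop. The parameters (K, S, the horizon T) and the epsilon-greedy baseline strategy are illustrative assumptions only; this is not the minimax-optimal algorithm from the paper, just a simulation of combinatorial semi-bandit feedback with the chain-shaped feedback graph described above.

```python
# Illustrative simulation of combinatorial semi-bandits with graph feedback.
# Playing lottery k reveals its own outcome (semi-bandit feedback) and, via
# the rich friend, the outcomes of all lotteries with a smaller number.
# The epsilon-greedy learner below is a naive baseline, NOT the paper's
# worst-case optimal algorithm.
import random

K = 10      # number of lotteries (arms)
S = 3       # lotteries we may buy each day
T = 365     # horizon: one year of daily draws
EPS = 0.1   # exploration rate of the illustrative baseline

# Hidden winning probabilities, unknown to the learner.
probs = [random.random() for _ in range(K)]

wins = [0] * K   # observed wins per lottery
obs = [0] * K    # observed draws per lottery

total_reward = 0
for t in range(T):
    # Pick S lotteries: usually the empirically best, occasionally random.
    if random.random() < EPS:
        chosen = random.sample(range(K), S)
    else:
        est = [wins[k] / obs[k] if obs[k] else 1.0 for k in range(K)]
        chosen = sorted(range(K), key=lambda k: est[k], reverse=True)[:S]

    # Graph feedback: playing lottery k also reveals lotteries 0..k-1
    # (the friend buys every smaller-numbered ticket and reports back).
    revealed = set()
    for k in chosen:
        revealed.update(range(k + 1))

    for k in revealed:
        outcome = 1 if random.random() < probs[k] else 0
        obs[k] += 1
        wins[k] += outcome
        if k in chosen:          # only our own tickets pay out
            total_reward += outcome

print("total reward over one year:", total_reward)
```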