Poster
Tokenized Bandit for LLM Decoding and Alignment
Suho Shin · Chenghao Yang · Haifeng Xu · MohammadTaghi Hajiaghayi
West Exhibition Hall B2-B3 #W-920
Large language models (LLMs) like ChatGPT generate responses one word (or token) at a time. But how should they choose the next word so that the finished response best matches what the user wants? This paper introduces a new mathematical framework for studying this question, drawing on ideas from multi-armed bandits, a field often used to model decision-making under uncertainty.

In our setting, a user submits a question, and the system chooses one word at a time to form a complete response. Only after the response is finished does the system receive feedback: a single score measuring how good the response was. The challenge is to learn to produce better responses over time.

We show that, without any structure, learning is hopeless. But under a natural assumption (that similar tokens lead to similar outcomes), we develop new algorithms that learn effectively and come with strong performance guarantees. Surprisingly, our results also help explain why simple decoding methods such as greedy generation (choosing the best word at each step) often work well in practice. Our findings are supported by experiments on both synthetic and real-world data.
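To make the interaction loop concrete, here is a minimal, hypothetical Python sketch of the setting described above: a response is built one token at a time, a single score arrives only after the full response, and a simple epsilon-greedy learner updates per-token value estimates from that score. The vocabulary, horizon, reward function, and learning rule are all illustrative assumptions, not the paper's algorithm or guarantees.

```python
# Toy illustration of a tokenized-bandit interaction loop.
# All names and numbers below are illustrative assumptions, not the paper's method.
import random

VOCAB = list(range(8))   # hypothetical token ids
HORIZON = 4              # tokens per response
EPISODES = 2000

def true_reward(response):
    # Hidden ground truth: token ids near 3 score higher, and nearby ids score
    # similarly, mimicking the "similar tokens lead to similar outcomes" idea.
    return sum(1.0 - abs(t - 3) / 7.0 for t in response) / HORIZON

# Per-token value estimates and usage counts for an epsilon-greedy learner.
est = {t: 0.0 for t in VOCAB}
cnt = {t: 0 for t in VOCAB}

def decode(eps):
    # Build a response one token at a time: explore with probability eps,
    # otherwise pick the token with the highest current estimate (greedy).
    resp = []
    for _ in range(HORIZON):
        if random.random() < eps:
            resp.append(random.choice(VOCAB))
        else:
            resp.append(max(VOCAB, key=lambda t: est[t]))
    return resp

for ep in range(EPISODES):
    eps = max(0.05, 1.0 - ep / 500)   # decaying exploration rate
    resp = decode(eps)
    r = true_reward(resp)             # bandit feedback: one score per full response
    for t in resp:                    # credit the score to every token that was used
        cnt[t] += 1
        est[t] += (r - est[t]) / cnt[t]

print("learned token values:", {t: round(v, 2) for t, v in est.items()})
greedy = decode(eps=0.0)
print("greedy response:", greedy, "score:", round(true_reward(greedy), 2))
```

Running this toy shows the learner concentrating value on tokens whose neighbors also score well, which loosely echoes why greedy decoding can perform well once per-token estimates are accurate; the paper's actual algorithms and regret analysis are more involved.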