Spotlight Poster
Investigating Non-Transitivity in LLM-as-a-Judge
Yi Xu · Laura Ruis · Tim Rocktäschel · Robert Kirk
East Exhibition Hall A-B #E-2703
When evaluating AI chatbots that follow human instructions, researchers often rely on automatic comparisons made by powerful language models acting as judges. This typically involves comparing two chatbots at a time, under an implicit assumption of transitivity: if Chatbot A is better than Chatbot B, and Chatbot B is better than Chatbot C, then Chatbot A should also be better than Chatbot C. However, we find that this assumption does not always hold, and such inconsistencies can significantly distort the overall rankings of AI chatbots. We examine this issue in AlpacaEval, a widely used evaluation framework, and observe clear evidence of these ranking inconsistencies. To address the problem, we introduce an evaluation method inspired by round-robin tournaments, in which each chatbot is compared against every other. The outcomes are then aggregated using the Bradley-Terry statistical model to produce more consistent and accurate rankings, significantly improving the reliability of AI chatbot evaluations. To reduce the high computational cost of full round-robin comparisons, we also propose a more efficient matching strategy called Swim tournaments, which preserves the benefits of round-robin evaluation while requiring far fewer comparisons.
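As a rough illustration of the aggregation step, the sketch below (not the authors' implementation) fits Bradley-Terry strengths to a matrix of pairwise win counts, such as those produced by an LLM judge in a round-robin tournament. The function name fit_bradley_terry and the iterative minorization-maximization update used here are illustrative assumptions.

```python
import numpy as np

def fit_bradley_terry(wins, num_iters=1000, tol=1e-8):
    """Fit Bradley-Terry strengths from a pairwise win-count matrix.

    wins[i, j] = number of times model i beat model j in head-to-head
    judgments (e.g., collected from a round-robin tournament).
    Returns a normalized strength vector; higher means stronger.
    """
    n = wins.shape[0]
    p = np.ones(n)                      # initial strengths
    games = wins + wins.T               # total comparisons per pair
    total_wins = wins.sum(axis=1)       # total wins per model
    for _ in range(num_iters):
        # Standard MM update: p_i <- W_i / sum_j n_ij / (p_i + p_j)
        denom = games / (p[:, None] + p[None, :])
        np.fill_diagonal(denom, 0.0)
        p_new = total_wins / denom.sum(axis=1)
        p_new /= p_new.sum()            # normalize for identifiability
        if np.max(np.abs(p_new - p)) < tol:
            p = p_new
            break
        p = p_new
    return p

# Toy example with a non-transitive cycle in the raw head-to-head results:
# model 0 beats model 1, model 1 beats model 2, model 2 beats model 0.
wins = np.array([[0, 7, 2],
                 [3, 0, 8],
                 [8, 2, 0]])
strengths = fit_bradley_terry(wins)
ranking = np.argsort(-strengths)        # indices from strongest to weakest
```

Even when the raw pairwise outcomes contain such a cycle, the fitted strengths are real numbers and therefore always induce a single consistent ordering, which is why aggregating many comparisons with Bradley-Terry smooths over the ranking inconsistencies described above.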