

Oral in Workshop: 2nd Workshop on Models of Human Feedback for AI Alignment (MoFA)

Dropping Just a Handful of Preferences Can Change Top Large Language Model Rankings

Jenny Huang · Yunyi Shen · Dennis Wei · Tamara Broderick

[ Project Page ]
Fri 18 Jul 3:45 p.m. PDT — 4 p.m. PDT

Abstract:

We propose a method for evaluating the robustness of a widely used LLM ranking system (the Bradley–Terry ranking system) to dropping a worst-case, very small fraction of evaluation data. Our approach is computationally fast and easy to adopt. When we apply our method to matchups from two popular human-preference platforms, Chatbot Arena and MT-Bench, we find that the Bradley–Terry rankings of top-performing models are remarkably sensitive to the removal of a small fraction of evaluations. Our framework also identifies the specific evaluations most responsible for such ranking flips, allowing these influential preferences to be inspected. We observe that rankings derived from MT-Bench preferences are notably more robust than those from Chatbot Arena, likely due to MT-Bench's use of expert annotators and carefully constructed prompts. Finally, we find that rankings based on crowdsourced human evaluations are just as sensitive as those based on LLM-as-a-judge evaluations; in both cases, dropping as little as 0.02% of the total evaluations in the dataset can change the top-ranked model.
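A minimal sketch of the kind of analysis the abstract describes, on synthetic data: fit a Bradley–Terry model as a logistic regression over pairwise comparisons, then use a first-order (influence-function) approximation to estimate how few comparisons would need to be dropped to flip the top two models. The synthetic data, the scikit-learn parameterization, and the one-step approximation are all illustrative assumptions, not the authors' implementation.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic pairwise-preference data (illustrative only): each comparison
# records which of two models an annotator preferred.
rng = np.random.default_rng(0)
n_models, n_comparisons = 5, 2000
true_skill = rng.normal(size=n_models)
pairs = rng.choice(n_models, size=(n_comparisons, 2))
pairs = pairs[pairs[:, 0] != pairs[:, 1]]
logits = true_skill[pairs[:, 0]] - true_skill[pairs[:, 1]]
first_wins = (rng.random(len(pairs)) < 1.0 / (1.0 + np.exp(-logits))).astype(int)

# Bradley-Terry as logistic regression: the design matrix has +1 for the
# first model and -1 for the second in each comparison; the label records
# whether the first model won.
X = np.zeros((len(pairs), n_models))
X[np.arange(len(pairs)), pairs[:, 0]] = 1.0
X[np.arange(len(pairs)), pairs[:, 1]] = -1.0
bt = LogisticRegression(fit_intercept=False, C=1e6, max_iter=1000).fit(X, first_wins)
scores = bt.coef_.ravel()
ranking = np.argsort(-scores)
top, runner_up = ranking[0], ranking[1]

# First-order estimate of how dropping each comparison moves the score gap
# between the top two models (one influence-function step, not an exact refit).
p = bt.predict_proba(X)[:, 1]
weights = p * (1.0 - p)
hessian = X.T @ (X * weights[:, None])
direction = np.zeros(n_models)
direction[top], direction[runner_up] = 1.0, -1.0
# Dropping comparison i shifts the top-vs-runner-up gap by about -influence[i].
influence = ((first_wins - p)[:, None] * X) @ (np.linalg.pinv(hessian) @ direction)

gap = scores[top] - scores[runner_up]
most_supportive = np.argsort(-influence)      # comparisons propping up the leader
cum_effect = np.cumsum(influence[most_supportive])
if cum_effect.max() < gap:
    print("First-order estimate: no subset of drops flips the top ranking")
else:
    k = int(np.argmax(cum_effect >= gap)) + 1
    print(f"Dropping ~{k} comparisons ({k / len(pairs):.2%} of the data) "
          f"is estimated to flip the top ranking")

The sketch greedily removes the comparisons whose first-order influence most supports the current leader and reports how many are needed before the estimated gap closes; the paper's actual procedure and its computational details may differ.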
