Spotlight Poster
Exploring and Mitigating Adversarial Manipulation of Voting-Based Leaderboards
Yangsibo Huang · Milad Nasr · Anastasios Angelopoulos · Nicholas Carlini · Wei-Lin Chiang · Christopher A. Choquette Choo · Daphne Ippolito · Matthew Jagielski · Katherine Lee · Ken Ziyu Liu · Ion Stoica · Florian Tramer · Chiyuan Zhang
East Exhibition Hall A-B #E-701
Thu 17 Jul 10 a.m. PDT — 11 a.m. PDT
The field of natural language processing has long relied on domain-specific, easy-to-implement evaluation metrics. But dramatic advances in LLM performance challenge traditional evaluation practices. As we show in this paper, moving from evaluations that use an objective source of truth to evaluations that rely on human input introduces the potential for new types of evaluation difficulties. We focus in this paper on validating one straightforward attack: by identifying and selectively voting for (or against) a particular model, an adversary can significantly alter the ordering of the best models. Mitigating this attack is feasible, and we are actively collaborating with the Chatbot Arena team to make Chatbot Arena more robust. We also encourage the community to explore and adopt mitigation strategies, such as voter authentication, rate limits, and more robust mechanisms for detecting malicious activity.
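The attack described above can be illustrated with a minimal simulation. The sketch below assumes an Elo-style rating update of the kind commonly used by arena leaderboards; the model names, vote counts, win probability, and K-factor are all illustrative and do not reflect Chatbot Arena's actual parameters or the paper's experiments.

```python
import random

def elo_update(ratings, winner, loser, k=4):
    # Standard Elo update: shift both ratings by k times the
    # winner's surprise (1 minus its expected win probability).
    ra, rb = ratings[winner], ratings[loser]
    expected_win = 1 / (1 + 10 ** ((rb - ra) / 400))
    ratings[winner] = ra + k * (1 - expected_win)
    ratings[loser] = rb - k * (1 - expected_win)

def simulate(n_honest, n_adversarial, p_true=0.6, seed=0):
    """Model A truly beats model B with probability p_true.
    An adversary then injects n_adversarial targeted votes for B."""
    rng = random.Random(seed)
    ratings = {"A": 1000.0, "B": 1000.0}
    for _ in range(n_honest):  # honest votes reflect true quality
        if rng.random() < p_true:
            elo_update(ratings, "A", "B")
        else:
            elo_update(ratings, "B", "A")
    for _ in range(n_adversarial):  # targeted downvotes against A
        elo_update(ratings, "B", "A")
    return ratings

honest = simulate(2000, 0)
attacked = simulate(2000, 300)
print(honest)    # A ranked above B, matching true quality
print(attacked)  # a few hundred targeted votes flip the ordering
```

Because each adversarial "B beats A" result moves both ratings toward each other and then past each other, a modest number of targeted votes relative to the honest traffic suffices to invert the ranking, which is why rate limits and voter authentication are natural mitigations.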