

Oral Sessions

Oral 4B Positions: Generative AI Evaluation

West Ballroom A

Moderator: Kiri Wagstaff

Wed 16 Jul 3:30 p.m. PDT — 4:30 p.m. PDT

Wed 16 July 15:30 - 15:45 PDT

Position: AI Competitions Provide the Gold Standard for Empirical Rigor in GenAI Evaluation

D. Sculley · William Cukierski · Phil Culliton · Sohier Dane · Maggie Demkin · Ryan Holbrook · Addison Howard · Paul Mooney · Walter Reade · Meg Risdal · Nate Keating

In this position paper, we observe that empirical evaluation in Generative AI is at a crisis point, since traditional ML evaluation and benchmarking strategies are insufficient to meet the needs of evaluating modern GenAI models and systems. There are many reasons for this, including the fact that these models typically have nearly unbounded input and output spaces, typically do not have a well-defined ground truth target, and typically exhibit strong feedback loops and prediction dependence based on the context of previous model outputs. On top of these critical issues, we argue that the problems of leakage and contamination are in fact the most important and difficult issues to address for GenAI evaluations. Interestingly, the field of AI Competitions has developed effective measures and practices to combat leakage for the purpose of counteracting cheating by bad actors within a competition setting. This makes AI Competitions an especially valuable (but underutilized) resource. Now is the time for the field to view AI Competitions as the gold standard for empirical rigor in GenAI evaluation, and to harness their results and value them accordingly.

Wed 16 July 15:45 - 16:00 PDT

Position: Medical Large Language Model Benchmarks Should Prioritize Construct Validity

Ahmed Alaa · Thomas Hartvigsen · Niloufar Golchini · Shiladitya Dutta · Frances Dean · Inioluwa Raji · Travis Zack

Research on medical large language models (LLMs) often makes bold claims, from encoding clinical knowledge to reasoning like a physician. These claims are usually backed by evaluation on competitive benchmarks—a tradition inherited from mainstream machine learning. But how do we separate real progress from a leaderboard flex? Medical LLM benchmarks, much like those in other fields, are arbitrarily constructed using medical licensing exam questions. For these benchmarks to truly measure progress, they must accurately capture the real-world tasks they aim to represent. In this position paper, we argue that medical LLM benchmarks should—and indeed can—be empirically evaluated for their construct validity. In the psychological testing literature, “construct validity” refers to the ability of a test to measure an underlying “construct”, that is, the actual conceptual target of evaluation. By drawing an analogy between LLM benchmarks and psychological tests, we explain how frameworks from this field can provide empirical foundations for validating benchmarks. To put these ideas into practice, we use real-world clinical data in proof-of-concept experiments to evaluate popular medical LLM benchmarks and report significant gaps in their construct validity. Finally, we outline a vision for a new ecosystem of medical LLM evaluation centered around the creation of valid benchmarks.

Wed 16 July 16:00 - 16:15 PDT

Position: Principles of Animal Cognition to Improve LLM Evaluations

Sunayana Rane · Cyrus Kirkman · Graham Todd · Amanda Royka · Ryan Law · Erica Cartmill · Jacob Foster

It has become increasingly challenging to understand and evaluate LLM capabilities as these models exhibit a broader range of behaviors. In this position paper, we argue that LLM researchers should draw on lessons from another field that has developed a rich set of experimental paradigms and design practices for probing the behavior of complex intelligent systems: animal cognition. We present five core principles of evaluation drawn from animal cognition research, and explain how they provide invaluable guidance for understanding LLM capabilities and behavior. We ground these principles in an empirical case study, and show how they can already provide a richer picture of one particular reasoning capability: transitive inference.

Wed 16 July 16:15 - 16:30 PDT

Position: Political Neutrality in AI Is Impossible — But Here Is How to Approximate It

Jillian Fisher · Ruth Elisabeth Appel · Chan Young Park · Yujin Potter · Liwei Jiang · Taylor Sorensen · Shangbin Feng · Yulia Tsvetkov · Margaret Roberts · Jennifer Pan · Dawn Song · Yejin Choi

AI systems often exhibit political bias, influencing users' opinions and decisions. While political neutrality—defined as the absence of bias—is often seen as an ideal solution for fairness and safety, this position paper argues that true political neutrality is neither feasible nor universally desirable due to its subjective nature and the biases inherent in AI training data, algorithms, and user interactions. However, inspired by Joseph Raz's philosophical insight that "neutrality [...] can be a matter of degree" (Raz, 1986), we argue that striving for some neutrality remains essential for promoting balanced AI interactions and mitigating user manipulation. Therefore, we use the term "approximation" of political neutrality to shift the focus from unattainable absolutes to achievable, practical proxies. We propose eight techniques for approximating neutrality across three levels of conceptualizing AI, examining their trade-offs and implementation strategies. In addition, we explore two concrete applications of these approximations to illustrate their practicality. Finally, we assess our framework on current large language models (LLMs) at the output level, providing a demonstration of how it can be evaluated. This work seeks to advance nuanced discussions of political neutrality in AI and promote the development of responsible, aligned language models.