Spotlight Poster
Position: Human Baselines in Model Evaluations Need Rigor and Transparency (With Recommendations & Reporting Checklist)
Kevin Wei · Patricia Paskov · Sunishchal Dev · Michael Byun · Anka Reuel · Xavier Roberts-Gaal · Rachel Calcott · Evie Coxon · Chinmay Deshpande
East Exhibition Hall A-B #E-600
In this position paper, we argue that human baselines in foundation model evaluations must be more rigorous and more transparent to enable meaningful comparisons of human vs. AI performance, and we provide recommendations and a reporting checklist towards this end. Human performance baselines are vital for the machine learning community, downstream users, and policymakers to interpret AI evaluations. Models are often claimed to achieve "super-human" performance, but existing baselining methods are neither sufficiently rigorous nor sufficiently well-documented to robustly measure and assess performance differences. Based on a meta-review of the measurement theory and AI evaluation literatures, we derive a framework with recommendations for designing, executing, and reporting human baselines. We synthesize our recommendations into a checklist that we use to systematically review 115 human baselines (studies) in foundation model evaluations and thus identify shortcomings in existing baselining methods; our checklist can also assist researchers in conducting human baselines and reporting results. We hope our work can advance more rigorous AI evaluation practices that can better serve both the research community and policymakers. Data is available at: https://github.com/kevinlwei/human-baselines.
Advanced AI systems are increasingly able to perform complex, realistic, and economically valuable tasks. How can we meaningfully determine whether AI systems perform these tasks as well as humans, and how much better or worse they are? We examined how other disciplines, such as psychology, economics, and political science, measure performance differences between groups of humans, and we drew on their practices to write guidelines for comparing human and AI performance. We then reviewed AI studies that make human vs. AI performance comparisons and found that most such comparisons are not very trustworthy. For instance, many studies do not compare AI systems against enough human participants, or they effectively evaluate humans and AI systems on different tasks under the hood.

This research will help improve our understanding of what AI systems can do compared to what humans can do. That understanding matters not just to AI researchers, but also to companies and users who want to know where AI excels and where it fails, and to policymakers thinking about how AI can be dangerous or how it can affect jobs. We hope that our research can lead to better AI research, AI usage, and AI policy.
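To illustrate the sample-size concern raised above, the sketch below runs a standard two-proportion power analysis in Python using statsmodels. The accuracy numbers are hypothetical and not taken from the paper; the point is only that detecting a modest human vs. AI performance gap can require far more participants than many baselines recruit.

```python
# Illustrative power analysis: how many human participants are needed to
# reliably detect a 5-percentage-point accuracy gap between a model and humans?
# All numbers here are hypothetical, chosen only to illustrate the issue.
import math

from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

model_accuracy = 0.90   # hypothetical model accuracy on the benchmark
human_accuracy = 0.85   # hypothetical mean human accuracy

# Cohen's h effect size for comparing two proportions
effect_size = proportion_effectsize(model_accuracy, human_accuracy)

# Sample size per group for a two-sided test at alpha = 0.05 with 80% power
n_per_group = NormalIndPower().solve_power(
    effect_size=effect_size, alpha=0.05, power=0.80, alternative="two-sided"
)

print(f"Cohen's h: {effect_size:.3f}")
print(f"Participants needed per group: {math.ceil(n_per_group)}")
# Roughly 340 participants per group under these assumptions -- far more
# than the handful of annotators used in many published human baselines.
```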