

Spotlight Poster

Position: Don't Use the CLT in LLM Evals With Fewer Than a Few Hundred Datapoints

Sam Bowyer · Laurence Aitchison · Desi Ivanova

East Exhibition Hall A-B #E-501
Thu 17 Jul 4:30 p.m. PDT — 7 p.m. PDT

Abstract:

Rigorous statistical evaluations of large language models (LLMs), including valid error bars and significance testing, are essential for meaningful and reliable performance assessment. Currently, when such statistical measures are reported, they typically rely on the Central Limit Theorem (CLT). In this position paper, we argue that while CLT-based methods for uncertainty quantification are appropriate when benchmarks consist of thousands of examples, they fail to provide adequate uncertainty estimates for LLM evaluations that rely on smaller, highly specialized benchmarks. In these small-data settings, we demonstrate that CLT-based methods perform very poorly, usually dramatically underestimating uncertainty (i.e. producing error bars that are too small). We give recommendations for alternative frequentist and Bayesian methods that are both easy to implement and more appropriate in these increasingly common scenarios.
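As a concrete illustration of the baseline the abstract critiques (my own sketch, not code from the paper), the snippet below computes the standard CLT-based (Wald) interval for benchmark accuracy. On a hypothetical 20-question benchmark it shows two familiar failure modes of this construction: the interval can extend past 1, and it collapses to zero width when every answer is correct.

```python
import numpy as np
from scipy import stats

def clt_interval(correct, alpha=0.05):
    """CLT-based (Wald) confidence interval for mean benchmark accuracy.

    `correct` is a 0/1 array of per-question outcomes. This is the textbook
    mean +/- z * standard-error construction, i.e. the baseline being
    critiqued, not the paper's recommended method.
    """
    correct = np.asarray(correct, dtype=float)
    n = correct.size
    p_hat = correct.mean()
    se = np.sqrt(p_hat * (1.0 - p_hat) / n)
    z = stats.norm.ppf(1.0 - alpha / 2.0)
    return p_hat - z * se, p_hat + z * se

# Hypothetical small benchmark: 20 questions, 18 answered correctly.
outcomes = np.array([1] * 18 + [0] * 2)
print(clt_interval(outcomes))      # roughly (0.77, 1.03): upper bound exceeds 1
print(clt_interval(np.ones(20)))   # (1.0, 1.0): zero-width "error bar"
```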

Lay Summary:

AI researchers often evaluate large language models (LLMs) on test benchmarks -- collections of questions from a given subject area, such as maths, coding or general knowledge. The proportion of correct answers an LLM achieves on a benchmark gives us a rough idea of how capable that LLM is at a given subject. However, LLMs don't always give the same answer every time you ask them the same question, so researchers calculate 'confidence intervals', which tell us that if we were to run the benchmark many times then, 95% of the time for instance, we'd expect the LLM's score to fall between, say, 0.82 and 0.88.

The problem is that there are many ways to construct these intervals, and one of the most common methods (based on an important result from statistics called the 'central limit theorem', or CLT) relies on assumptions that aren't always satisfied by these benchmarks. Specifically, we show that CLT-based intervals underperform (give inaccurate confidence intervals) when the benchmark contains few questions, which is an increasingly common scenario. We suggest simple, alternative ways to construct these intervals in a range of settings where CLT-based methods fail.
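To make the suggested direction concrete, here is a minimal sketch of two standard small-sample alternatives: a frequentist Wilson score interval and a Bayesian credible interval from a Beta posterior. These are illustrative choices on my part and may differ from the exact methods recommended in the paper; the 18-of-20 example data are hypothetical.

```python
import numpy as np
from scipy import stats

def wilson_interval(correct, alpha=0.05):
    """Frequentist Wilson score interval for a binomial proportion.

    A common small-sample alternative to the Wald/CLT interval, shown
    here for illustration.
    """
    correct = np.asarray(correct, dtype=float)
    n, k = correct.size, int(correct.sum())
    ci = stats.binomtest(k, n).proportion_ci(
        confidence_level=1 - alpha, method="wilson"
    )
    return ci.low, ci.high

def bayes_beta_interval(correct, alpha=0.05, a=1.0, b=1.0):
    """Bayesian equal-tailed credible interval under a Beta(a, b) prior.

    With a Beta prior and 0/1 outcomes, the posterior over the true
    accuracy is Beta(a + #correct, b + #incorrect).
    """
    correct = np.asarray(correct, dtype=float)
    n, k = correct.size, correct.sum()
    posterior = stats.beta(a + k, b + n - k)
    return posterior.ppf(alpha / 2), posterior.ppf(1 - alpha / 2)

# Hypothetical small benchmark: 18 of 20 questions answered correctly.
outcomes = np.array([1] * 18 + [0] * 2)
print(wilson_interval(outcomes))      # ~ (0.70, 0.97), stays inside [0, 1]
print(bayes_beta_interval(outcomes))  # ~ (0.70, 0.97), never zero-width
```

Unlike the Wald interval in the earlier sketch, both of these remain sensible when the score is near 0 or 1 or the benchmark is very small, which is the regime the paper focuses on.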
