Poster
AutoEval Done Right: Using Synthetic Data for Model Evaluation
Pierre Boyeau · Anastasios Angelopoulos · Tianle Li · Nir Yosef · Jitendra Malik · Michael Jordan
East Exhibition Hall A-B #E-1809
The evaluation of machine learning models using human-labeled validation data can be expensive and time-consuming. AI-labeled synthetic data can be used to decrease the number of human annotations required, in a process called autoevaluation. We suggest efficient and statistically principled algorithms for autoevaluation that improve sample efficiency while remaining unbiased.
This paper introduces a method to evaluate machine learning models more efficiently by combining a small set of human-annotated data with a larger set of AI-generated synthetic labels. The core idea is to use the human data to correct biases present in the synthetic labels, via a statistical technique called prediction-powered inference. The approach is demonstrated across diverse applications, including ranking computer vision models, evaluating protein fitness predictors, and assessing large language models through pairwise comparisons from Chatbot Arena. Results show that the method produces more accurate performance estimates and tighter confidence intervals than traditional evaluation techniques, allowing for reliable model evaluation with reduced human effort.
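To make the bias-correction idea concrete, below is a minimal sketch of a prediction-powered estimate of a model's mean score (e.g., accuracy), not the authors' released code: a small human-labeled set supplies a "rectifier" that debiases the mean of the AI-judge scores computed on a much larger unlabeled set, and a normal approximation yields the confidence interval. The function and variable names (ppi_mean_ci, human_scores, judge_scores_labeled, judge_scores_unlabeled) are illustrative assumptions, as is the simulated data in the usage example.

```python
import numpy as np
from scipy import stats


def ppi_mean_ci(human_scores, judge_scores_labeled, judge_scores_unlabeled, alpha=0.05):
    """Prediction-powered estimate of a model's mean score with a (1 - alpha) CI.

    human_scores:           scores derived from human labels on the small labeled set (size n)
    judge_scores_labeled:   AI-judge scores on that same labeled set (size n)
    judge_scores_unlabeled: AI-judge scores on the large unlabeled set (size N)
    """
    human = np.asarray(human_scores, dtype=float)
    judge_lab = np.asarray(judge_scores_labeled, dtype=float)
    judge_unlab = np.asarray(judge_scores_unlabeled, dtype=float)
    n, N = len(human), len(judge_unlab)

    # Rectifier: how far the AI judge deviates from human labels on the labeled set.
    rectifier = human - judge_lab

    # Point estimate: cheap synthetic mean, debiased by the mean rectifier.
    theta_pp = judge_unlab.mean() + rectifier.mean()

    # Normal-approximation standard error combining both sources of noise.
    se = np.sqrt(judge_unlab.var(ddof=1) / N + rectifier.var(ddof=1) / n)
    z = stats.norm.ppf(1 - alpha / 2)
    return theta_pp, (theta_pp - z * se, theta_pp + z * se)


if __name__ == "__main__":
    # Toy simulation: a biased AI judge scores many examples, humans score a few.
    rng = np.random.default_rng(0)
    true_acc = 0.72
    human = rng.binomial(1, true_acc, size=200).astype(float)
    judge_lab = np.clip(human + rng.normal(0.05, 0.2, size=200), 0, 1)
    judge_unlab = np.clip(rng.binomial(1, true_acc, 5000) + rng.normal(0.05, 0.2, 5000), 0, 1)

    est, (lo, hi) = ppi_mean_ci(human, judge_lab, judge_unlab)
    print(f"PPI estimate: {est:.3f}, 95% CI: ({lo:.3f}, {hi:.3f})")
```

In this sketch, the interval width shrinks as the judge agrees more closely with the human labels, because the rectifier variance term becomes small while the large unlabeled set keeps the synthetic term small; the estimate stays unbiased even when the judge is systematically off, which is the sample-efficiency-without-bias trade-off the abstract describes.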