Poster
AutoEval Done Right: Using Synthetic Data for Model Evaluation
Pierre Boyeau · Anastasios Angelopoulos · Tianle Li · Nir Yosef · Jitendra Malik · Michael Jordan
East Exhibition Hall A-B #E-1809
The evaluation of machine learning models using human-labeled validation data can be expensive and time-consuming. AI-labeled synthetic data can be used to decrease the number of human annotations required, in a process called autoevaluation. We suggest efficient and statistically principled algorithms for autoevaluation that improve sample efficiency while remaining unbiased.
This paper introduces a method to evaluate machine learning models more efficiently by combining a small set of human-annotated data with a larger set of AI-generated synthetic labels. The core idea is to use the human data to correct biases present in the synthetic labels, via a statistical technique called prediction-powered inference. The approach is demonstrated across diverse applications, including ranking computer vision models, evaluating protein fitness predictors, and assessing large language models through pairwise comparisons from Chatbot Arena. Results show that the method produces more accurate performance estimates and tighter confidence intervals than traditional evaluation techniques, allowing for reliable model evaluation with reduced human effort.
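To make the bias-correction idea concrete, below is a minimal sketch of a prediction-powered estimate of a model's mean score (e.g., accuracy), not the authors' released code: a small human-labeled set supplies a "rectifier" that debiases the mean of the AI-judge scores computed on a much larger unlabeled set, and a normal approximation yields the confidence interval. The function and variable names (ppi_mean_ci, human_scores, judge_scores_labeled, judge_scores_unlabeled) are illustrative assumptions, as is the simulated data in the usage example.

```python
import numpy as np
from scipy import stats


def ppi_mean_ci(human_scores, judge_scores_labeled, judge_scores_unlabeled, alpha=0.05):
    """Prediction-powered estimate of a model's mean score with a (1 - alpha) CI.

    human_scores:           scores derived from human labels on the small labeled set (size n)
    judge_scores_labeled:   AI-judge scores on that same labeled set (size n)
    judge_scores_unlabeled: AI-judge scores on the large unlabeled set (size N)
    """
    human = np.asarray(human_scores, dtype=float)
    judge_lab = np.asarray(judge_scores_labeled, dtype=float)
    judge_unlab = np.asarray(judge_scores_unlabeled, dtype=float)
    n, N = len(human), len(judge_unlab)

    # Rectifier: how far the AI judge deviates from human labels on the labeled set.
    rectifier = human - judge_lab

    # Point estimate: cheap synthetic mean, debiased by the mean rectifier.
    theta_pp = judge_unlab.mean() + rectifier.mean()

    # Normal-approximation standard error combining both sources of noise.
    se = np.sqrt(judge_unlab.var(ddof=1) / N + rectifier.var(ddof=1) / n)
    z = stats.norm.ppf(1 - alpha / 2)
    return theta_pp, (theta_pp - z * se, theta_pp + z * se)


if __name__ == "__main__":
    # Toy simulation: a biased AI judge scores many examples, humans score a few.
    rng = np.random.default_rng(0)
    true_acc = 0.72
    human = rng.binomial(1, true_acc, size=200).astype(float)
    judge_lab = np.clip(human + rng.normal(0.05, 0.2, size=200), 0, 1)
    judge_unlab = np.clip(rng.binomial(1, true_acc, 5000) + rng.normal(0.05, 0.2, 5000), 0, 1)

    est, (lo, hi) = ppi_mean_ci(human, judge_lab, judge_unlab)
    print(f"PPI estimate: {est:.3f}, 95% CI: ({lo:.3f}, {hi:.3f})")
```

In this sketch, the interval width shrinks as the judge agrees more closely with the human labels, because the rectifier variance term becomes small while the large unlabeled set keeps the synthetic term small; the estimate stays unbiased even when the judge is systematically off, which is the sample-efficiency-without-bias trade-off the abstract describes.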