Poster
Tuning LLM Judge Design Decisions for 1/1000 of the Cost
David Salinas · Omar Swelam · Frank Hutter
East Exhibition Hall A-B #E-2609
Evaluating Large Language Models (LLMs) often requires costly human annotations. To address this, LLM-based judges have been proposed: they compare the outputs of two LLMs, enabling the ranking of models without human intervention. While several approaches have been proposed, many confounding factors are present between different papers. For instance, the model, the prompt, and other hyperparameters are typically changed at the same time, making apples-to-apples comparisons challenging. In this paper, we propose to systematically analyze and tune the hyperparameters of LLM judges. To alleviate the high cost of evaluating a judge, we propose to leverage multi-objective multi-fidelity optimization, which allows us to find judges that trade accuracy for cost and also significantly reduces the cost of the search. Our method identifies judges that not only outperform existing benchmarks in accuracy and cost-efficiency but also utilize open-weight models, ensuring greater accessibility and reproducibility.
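To make the search strategy concrete, here is a minimal Python sketch of a multi-objective multi-fidelity loop under assumed details: candidate judge configurations are first scored cheaply on a small set of human-annotated comparisons, and only configurations on the (accuracy, cost) Pareto front are promoted to larger, more expensive evaluations. The search space, the evaluate_judge function, and the fidelity schedule are all illustrative placeholders, not the paper's actual setup.

```python
import itertools
import random

# Hypothetical judge hyperparameter space; names are illustrative only.
SEARCH_SPACE = {
    "model": ["llama-3-8b", "qwen-2-7b", "gemma-2-9b"],
    "temperature": [0.0, 0.3, 1.0],
    "prompt": ["pairwise-short", "pairwise-cot", "pairwise-rubric"],
    "swap_positions": [True, False],  # judge both A/B orders to reduce position bias
}

def evaluate_judge(config, n_examples):
    """Placeholder: run the judge on n_examples human-annotated pairs and
    return (agreement_with_humans, dollar_cost). Replace with real LLM calls.
    The simulated noise shrinks with n_examples, mimicking higher fidelity."""
    random.seed(hash(frozenset(config.items())) % 2**32)
    accuracy = random.uniform(0.5, 0.9) + random.gauss(0, 1.0 / n_examples**0.5)
    cost = n_examples * random.uniform(1e-5, 1e-3)  # rough per-example cost
    return accuracy, cost

def pareto_front(scored):
    """Keep configs not dominated on (higher accuracy, lower cost)."""
    front = []
    for cfg, (acc, cost) in scored:
        dominated = any(a >= acc and c <= cost and (a > acc or c < cost)
                        for _, (a, c) in scored)
        if not dominated:
            front.append((cfg, (acc, cost)))
    return front

# Multi-fidelity loop: score every config at the cheapest fidelity, then
# re-evaluate only the Pareto-optimal survivors on more annotated examples.
configs = [dict(zip(SEARCH_SPACE, vals))
           for vals in itertools.product(*SEARCH_SPACE.values())]
for n_examples in (50, 200, 800):  # increasing fidelity
    scored = [(cfg, evaluate_judge(cfg, n_examples)) for cfg in configs]
    configs = [cfg for cfg, _ in pareto_front(scored)]
    print(f"fidelity={n_examples}: {len(configs)} configs survive")
```

Because most configurations are eliminated at the cheap fidelities, only a small fraction of candidates ever incur the full evaluation cost, which is where the bulk of the savings in the search comes from.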
Comparing different AI language models requires human experts to evaluate their responses, a costly and slow process. A cheaper alternative is to use AI models themselves as judges to compare other AI systems. Think of it as having one AI referee determine which of two AI players performed better at a task.

However, previous research has been inconsistent, like comparing apples to oranges: different studies changed the AI judge, its instructions, and its settings all at once, making it impossible to know what actually works best.

This paper shows how to systematically tune different AI judge design decisions. We propose a method that finds judges offering the best balance between accuracy and cost, identifying AI judges that are both reliable and affordable to run. In particular, we find AI judges that outperform existing methods while using publicly available models, which we hope will help make research using and based on AI judges more open.