

Poster

Agent-as-a-Judge: Evaluate Agents with Agents

Mingchen Zhuge · Changsheng Zhao · Dylan Ashley · Wenyi Wang · Dmitrii Khizbullin · Yunyang Xiong · Zechun Liu · Ernie Chang · Raghuraman Krishnamoorthi · Yuandong Tian · Yangyang Shi · Vikas Chandra · Jürgen Schmidhuber

West Exhibition Hall B2-B3 #W-103
Tue 15 Jul 4:30 p.m. PDT — 7 p.m. PDT

Abstract:

Contemporary evaluation techniques are inadequate for agentic systems. These approaches either focus exclusively on final outcomes, ignoring the step-by-step nature of the reasoning done by agentic systems, or require excessive manual labour. To address this, we introduce the Agent-as-a-Judge framework, wherein agentic systems are used to evaluate agentic systems. This is a natural extension of the LLM-as-a-Judge framework, incorporating agentic features that enable intermediate feedback throughout the entire task-solving process for more precise evaluations. We apply the Agent-as-a-Judge framework to the task of code generation. To overcome issues with existing benchmarks and provide a proof-of-concept testbed for Agent-as-a-Judge, we present DevAI, a new benchmark of 55 realistic AI code generation tasks. DevAI includes rich manual annotations, including a total of 365 hierarchical solution requirements, which make it particularly suitable for an agentic evaluator. We benchmark three of the top code-generating agentic systems using Agent-as-a-Judge and find that our framework dramatically outperforms LLM-as-a-Judge and is as reliable as our human evaluation baseline. Altogether, we believe that this work represents a concrete step towards enabling vastly more sophisticated agentic systems. To support this, our dataset and the full implementation of Agent-as-a-Judge will be publicly available at https://github.com/metauto-ai/agent-as-a-judge.
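
As a rough illustration of the idea only (not the authors' implementation, which is available in the linked repository), the sketch below shows how a judge agent might walk a hierarchy of solution requirements, gather evidence from the evaluated agent's workspace, and return per-requirement verdicts. The names Requirement, gather_evidence, and query_judge_llm are hypothetical, and the LLM call is left abstract.

```python
# Minimal, hypothetical sketch of an agent-as-a-judge loop.
# Requirement, gather_evidence, and query_judge_llm are illustrative
# assumptions, not the API of the released implementation.
from dataclasses import dataclass, field
from pathlib import Path
from typing import Callable

@dataclass
class Requirement:
    rid: str                                              # e.g. "R3"
    description: str                                      # natural-language criterion
    prerequisites: list[str] = field(default_factory=list)  # hierarchical dependencies

def gather_evidence(workspace: Path, requirement: Requirement, max_chars: int = 4000) -> str:
    """Collect a bounded snippet of the candidate agent's outputs that is
    plausibly relevant to the requirement (here: just its Python files)."""
    chunks = []
    for f in sorted(workspace.rglob("*.py")):
        chunks.append(f"# file: {f}\n{f.read_text(errors='ignore')}")
    return "\n".join(chunks)[:max_chars]

def judge_workspace(
    workspace: Path,
    requirements: list[Requirement],
    query_judge_llm: Callable[[str], str],  # hypothetical LLM call returning "yes"/"no"
) -> dict[str, bool]:
    """Walk the requirement hierarchy; a requirement is only checked once its
    prerequisites are satisfied, giving intermediate, step-aware feedback."""
    verdicts: dict[str, bool] = {}
    for req in requirements:  # assumes requirements are topologically ordered
        if not all(verdicts.get(p, False) for p in req.prerequisites):
            verdicts[req.rid] = False
            continue
        prompt = (
            f"Requirement: {req.description}\n"
            f"Evidence from the agent's workspace:\n{gather_evidence(workspace, req)}\n"
            "Answer 'yes' if the requirement is satisfied, otherwise 'no'."
        )
        verdicts[req.rid] = query_judge_llm(prompt).strip().lower().startswith("yes")
    return verdicts
```

In practice a judge agent would also inspect logs, run the generated code, and retrieve documentation; evidence gathering is reduced to concatenating source files here purely for brevity.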

Lay Summary:

Existing evaluation methods either focus solely on final outcomes, overlooking the agent's reasoning, or depend on costly human review, making them impractical for large-scale systems. We introduce the “Agent-as-a-Judge” framework, in which one autonomous agent observes and provides step-by-step, fine-grained assessments of another agent's task execution. To validate our approach, we created DevAI, a new benchmark comprising 55 real-world AI development tasks and 365 hierarchical requirements. In code-generation experiments, Agent-as-a-Judge agreed with human expert evaluations about 90% of the time, substantially outperforming the 70% agreement rate of previous LLM-as-a-Judge methods. Moreover, our framework cuts evaluation time and cost by roughly 97%, reducing effort from 86 hours and 1,297 USD to approximately 2 hours and 31 USD. By delivering human-level reliability at a fraction of the cost, Agent-as-a-Judge paves the way for rapid, trustworthy analysis of complex multi-agent systems.
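
For reference, the "roughly 97%" figure follows directly from the numbers quoted above; the short recomputation below uses only those reported values.

```python
# Recompute the reported savings from the figures in the lay summary.
human_hours, human_cost = 86, 1297   # human evaluation baseline (hours, USD)
judge_hours, judge_cost = 2, 31      # Agent-as-a-Judge (hours, USD)

time_saving = 1 - judge_hours / human_hours  # ~0.977
cost_saving = 1 - judge_cost / human_cost    # ~0.976
print(f"time saved: {time_saving:.1%}, cost saved: {cost_saving:.1%}")
# -> time saved: 97.7%, cost saved: 97.6%
```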
