Poster in Workshop on Technical AI Governance
Methodological Challenges in Agentic Evaluations of AI Systems
Kevin Wei · Stephen Guth · Gabriel Wu · Patricia Paskov
As AI systems grow more general and capable of advanced reasoning, a growing share of AI evaluations are agentic evaluations: evaluations built around complex tasks that require interaction with an environment, as opposed to knowledge-based question-answer benchmarks. However, no work has explored the methodological challenges of agentic evaluations or the practices necessary to ensure their validity, reliability, replicability, and efficiency. In this (work-in-progress) paper, we (1) define and formalize the agentic evaluation paradigm; (2) survey and analyze methodological problems in agentic evaluations; and (3) discuss the implications of agentic evaluations for AI governance. Our hope is to improve the state of agentic evaluations of AI systems, systematize the methodological work in this domain, and contribute to the establishment of a science of AI evaluations.