

Poster in Workshop: Workshop on Technical AI Governance

AI Benchmarks: Interdisciplinary Issues and Policy Considerations

Maria Eriksson · Erasmo Purificato · Arman Noroozian · João Vinagre · Guillaume Chaslot · Emilia Gomez · David Fernández-Llorca


Abstract:

Artificial Intelligence (AI) benchmarks have emerged as essential for evaluating AI performance, capabilities, and risks. However, as their influence grows, concerns arise about their limitations and side effects when assessing sensitive topics such as high-impact capabilities, safety, and systemic risks. In this work we summarise the results of an interdisciplinary meta-review of approximately 100 studies from the last decade, which identifies key shortcomings in AI benchmarking practices, including issues in their design and application (e.g., dataset creation biases, inadequate documentation, data contamination, and failures to distinguish signal from noise) and broader sociotechnical issues (e.g., an over-focus on text-based and one-time evaluation logic that neglects multimodality and interactions). We also highlight systemic flaws, such as misaligned incentives, construct validity issues, unknown unknowns, and the gaming of benchmark results. We underscore how benchmarking practices are shaped by cultural, commercial, and competitive dynamics that often prioritise performance at the expense of broader societal concerns. As a result, AI benchmarking may be ill-suited to provide the assurances required by policymakers. To address these challenges, it is crucial to consider key policy aspects that can help mitigate the shortcomings of current AI benchmarking practices.
