Poster
Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale
Rogerio Bonatti · Dan Zhao · Francesco Bonacci · Dillon Dupont · Sara Abdali · Yinheng Li · Yadong Lu · Justin Wagle · Kazuhito Koishida · Arthur Bucker · Lawrence Jang · Zheng Hui
West Exhibition Hall B2-B3 #W-100
Large multi-modal language models that understand text, images, and more are becoming capable digital assistants and agents that can carry out complex computer tasks on a user's behalf. However, measuring how well these agents perform realistic tasks is difficult: existing benchmarks are complex to set up and slow to run, often taking hours or more to produce meaningful results. These slow evaluation cycles delay progress in agent development. Moreover, despite the widespread use of the Windows operating system (OS), no agentic benchmark targets Windows specifically. To address this bottleneck, we introduce the Windows Agent Arena, an efficient and scalable evaluation framework in which state-of-the-art multi-modal agents perform tasks in a real Windows OS, using the same software and tools humans rely on daily. Our approach dramatically accelerates evaluation, enabling rapid feedback and faster iteration on agent capabilities. Even the best current agents fall far short of the 74.5% success rate achieved by humans, highlighting substantial room for improvement.
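As a rough illustration only (this is not the framework's actual API; every name below, such as StubEnv, StubAgent, and run_benchmark, is a hypothetical stand-in), the evaluation described above reduces to an observe-act-evaluate loop per task, with tasks fanned out across parallel workers so that overall wall-clock time shrinks roughly with the number of workers:

    # Minimal, self-contained sketch of a parallelized agent-evaluation loop.
    # All classes and functions are illustrative stubs, not Windows Agent Arena code.
    import random
    from concurrent.futures import ThreadPoolExecutor

    MAX_STEPS = 15  # assumed per-task step budget


    class StubEnv:
        """Stand-in for an isolated Windows environment hosting one benchmark task."""

        def __init__(self, task_id: str):
            self.task_id = task_id
            self.steps = 0

        def reset(self):
            self.steps = 0
            return {"screenshot": b"", "ui_tree": {}}  # observation placeholder

        def step(self, action: str):
            self.steps += 1
            done = action == "done" or self.steps >= MAX_STEPS
            return {"screenshot": b"", "ui_tree": {}}, done

        def evaluate(self) -> bool:
            """Task-specific success checker; here it just simulates an outcome."""
            return random.random() < 0.2


    class StubAgent:
        """Stand-in for a multi-modal model mapping observations to OS actions."""

        def act(self, obs) -> str:
            return random.choice(["click 100 200", "type hello", "done"])


    def run_task(task_id: str) -> bool:
        env, agent = StubEnv(task_id), StubAgent()
        obs = env.reset()
        for _ in range(MAX_STEPS):
            obs, done = env.step(agent.act(obs))
            if done:
                break
        return env.evaluate()


    def run_benchmark(task_ids, workers: int = 8) -> float:
        # Running tasks in parallel is what cuts evaluation time from hours to minutes.
        with ThreadPoolExecutor(max_workers=workers) as pool:
            results = list(pool.map(run_task, task_ids))
        return sum(results) / len(results)  # overall success rate


    if __name__ == "__main__":
        print(f"success rate: {run_benchmark([f'task-{i}' for i in range(50)]):.1%}")

The sketch only conveys the shape of the benchmark loop; the real framework runs each task in its own Windows instance and scores success by inspecting the final OS state.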