Poster in Workshop: Workshop on Computer Use Agents
WebGames: Challenging General-Purpose Web-Browsing AI Agents
George Thomas · Filippos Christianos · Alexander Chan · Rohit Midha · Jikun Kang · Wenqi Wu · Fraser Greenlee · Andrew Toulis · Marvin Purtorab
We introduce WebGames, a comprehensive benchmark suite designed to evaluate general-purpose web-browsing AI agents through a collection of 150 interactive challenges. These challenges assess agents' ability to interact with the web as humans do, evaluating them across five core domains (Technical Fluency, Real-Time Responsiveness, Adversarial Resistance, Cognitive Abilities, and Visual Comprehension) through simple, self-contained browser tasks. Our framework eliminates reliance on external systems and provides verifiable ground-truth solutions, ensuring reproducible evaluation. We evaluate leading vision-language models, including GPT-4o, Claude, Gemini-2.5, and Qwen2.5-VL, against human performance. The results reveal a substantial capability gap: the best AI system achieves only a 48% success rate, compared with 95.7% for humans, highlighting fundamental limitations in current AI systems' ability to handle common web interaction patterns that humans find intuitive. The benchmark is publicly available at https://webgames-7ng.pages.dev.
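The abstract's claim of self-contained, verifiable evaluation suggests a simple harness shape: if each challenge carries a deterministic ground-truth solution (for example, a completion token the page reveals only when the task is solved), scoring reduces to exact matching with no external services. The sketch below illustrates that idea under those assumptions; the names used here (`Challenge`, `run_agent`, the token scheme) are hypothetical and not the actual WebGames API.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Challenge:
    """One self-contained browser task with a verifiable ground truth."""
    id: str
    url: str            # locally served page; no external dependencies
    ground_truth: str   # token revealed only on successful completion

def evaluate(challenges: list[Challenge],
             run_agent: Callable[[str], str]) -> float:
    """Run an agent on each challenge and score by exact token match.

    `run_agent` drives a browser session on the given URL and returns
    whatever completion token the agent obtained; matching against the
    stored ground truth is deterministic, so results are reproducible.
    """
    solved = sum(
        run_agent(c.url).strip() == c.ground_truth
        for c in challenges
    )
    return solved / len(challenges)

# Usage sketch with a one-challenge suite and a stub agent.
if __name__ == "__main__":
    suite = [Challenge("click-basic",
                       "http://localhost:8000/click-basic",
                       "TOKEN-1f2e")]
    rate = evaluate(suite, run_agent=lambda url: "TOKEN-1f2e")
    print(f"success rate: {rate:.1%}")
```

Exact-match scoring of this kind is one way to satisfy both properties the abstract names: no dependence on outside systems at evaluation time, and a ground truth any evaluator can re-verify.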