Poster in Workshop: Workshop on Computer Use Agents
WebQuest: A Benchmark for Multimodal QA on Web Page Sequences
Maria Wang · Srinivas Sunkara · Jason Lin · Gilles Baechler · Fedir Zubach · Lei Shu · Yun Zhu · Jindong Chen
The growing power of multimodal large language models (MLLMs) is making autonomous web agents that assist users a reality. To accurately assess these agents' capabilities in real-world scenarios, we introduce WebQuest, a new benchmark dataset that challenges MLLMs with cross-page question answering requiring complex reasoning, such as arithmetic and sorting, across diverse website categories. Unlike existing web agent benchmarks that focus on multi-step web navigation and task completion, WebQuest evaluates information extraction, multimodal retrieval, and the composition of information from many web pages at once. We provide three dataset splits: Single Screen QA, Multi Screen QA, and Trace QA, which is based on navigation traces. We evaluate leading proprietary multimodal models such as GPT-4o, Gemini 1.5 Pro, and Claude 3.5 Sonnet, as well as open-source models such as InternVL2.5, Pixtral, and Qwen2.5-VL, on our dataset, revealing a significant gap between single-screen and multi-screen reasoning. We also explore techniques such as chain-of-thought prompting to address this gap.
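To make the multi-screen QA setup concrete, below is a minimal sketch of how one might query a multimodal model with a sequence of screenshots and a chain-of-thought cue, using the OpenAI chat completions API with GPT-4o as an example. The screenshot paths, the question, and the prompt wording are hypothetical placeholders; the abstract does not specify WebQuest's actual evaluation harness, so this illustrates the idea rather than the authors' exact protocol.

```python
import base64
from openai import OpenAI

# Hypothetical multi-screen example: a question whose answer requires
# combining values read from several page screenshots.
SCREENSHOTS = ["screen_1.png", "screen_2.png", "screen_3.png"]
QUESTION = "Which of the listed products has the lowest price across these pages?"


def encode_image(path: str) -> str:
    """Read a screenshot and return it as a base64 data URL."""
    with open(path, "rb") as f:
        return "data:image/png;base64," + base64.b64encode(f.read()).decode("utf-8")


def build_messages(question: str, image_paths: list[str], use_cot: bool) -> list[dict]:
    """Assemble one prompt containing the question and all screenshots.

    When use_cot is True, an explicit 'extract then combine' instruction is
    appended, mirroring the kind of chain-of-thought prompting the abstract
    mentions (the exact wording here is an assumption).
    """
    instruction = question
    if use_cot:
        instruction += (
            "\nThink step by step: first extract the relevant values from each "
            "screen, then combine them before stating the final answer."
        )
    content = [{"type": "text", "text": instruction}]
    for path in image_paths:
        content.append({"type": "image_url", "image_url": {"url": encode_image(path)}})
    return [{"role": "user", "content": content}]


client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
response = client.chat.completions.create(
    model="gpt-4o",
    messages=build_messages(QUESTION, SCREENSHOTS, use_cot=True),
)
print(response.choices[0].message.content)
```

Single Screen QA would correspond to passing a single screenshot, while Multi Screen QA and Trace QA pass several screens (the latter drawn from a navigation trace), which is where the reported gap between single-screen and multi-screen reasoning appears.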