Poster in Workshop: Workshop on Computer Use Agents
WebQuest: A Benchmark for Multimodal QA on Web Page Sequences
Maria Wang · Srinivas Sunkara · Jason Lin · Gilles Baechler · Fedir Zubach · Lei Shu · Yun Zhu · Jindong Chen
The growing power of multimodal large language models (MLLMs) is making autonomous web agents that assist users a reality. To accurately assess these agents' capabilities in real-world scenarios, we introduce WebQuest, a new benchmark dataset that challenges MLLMs with cross-page question answering requiring complex reasoning, such as arithmetic and sorting, across diverse website categories. Unlike existing web agent benchmarks that focus on multi-step web navigation and task completion, WebQuest evaluates information extraction, multimodal retrieval, and the composition of information from many web pages at once. We provide three dataset splits: Single Screen QA, Multi Screen QA, and Trace QA, which is based on navigation traces. We evaluate leading proprietary multimodal models such as GPT-4o, Gemini 1.5 Pro, and Claude 3.5 Sonnet, as well as open-source models such as InternVL2.5, Pixtral, and Qwen2.5-VL, on our dataset, revealing a significant gap between single-screen and multi-screen reasoning. We also explore techniques such as chain-of-thought prompting to address this gap.
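To make the multi-screen QA setup concrete, below is a minimal sketch of how one might query a multimodal model with a sequence of screenshots and a chain-of-thought cue, using the OpenAI chat completions API with GPT-4o as an example. The screenshot paths, the question, and the prompt wording are hypothetical placeholders; the abstract does not specify WebQuest's actual evaluation harness, so this illustrates the idea rather than the authors' exact protocol.

```python
import base64
from openai import OpenAI

# Hypothetical multi-screen example: a question whose answer requires
# combining values read from several page screenshots.
SCREENSHOTS = ["screen_1.png", "screen_2.png", "screen_3.png"]
QUESTION = "Which of the listed products has the lowest price across these pages?"


def encode_image(path: str) -> str:
    """Read a screenshot and return it as a base64 data URL."""
    with open(path, "rb") as f:
        return "data:image/png;base64," + base64.b64encode(f.read()).decode("utf-8")


def build_messages(question: str, image_paths: list[str], use_cot: bool) -> list[dict]:
    """Assemble one prompt containing the question and all screenshots.

    When use_cot is True, an explicit 'extract then combine' instruction is
    appended, mirroring the kind of chain-of-thought prompting the abstract
    mentions (the exact wording here is an assumption).
    """
    instruction = question
    if use_cot:
        instruction += (
            "\nThink step by step: first extract the relevant values from each "
            "screen, then combine them before stating the final answer."
        )
    content = [{"type": "text", "text": instruction}]
    for path in image_paths:
        content.append({"type": "image_url", "image_url": {"url": encode_image(path)}})
    return [{"role": "user", "content": content}]


client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
response = client.chat.completions.create(
    model="gpt-4o",
    messages=build_messages(QUESTION, SCREENSHOTS, use_cot=True),
)
print(response.choices[0].message.content)
```

Single Screen QA would correspond to passing a single screenshot, while Multi Screen QA and Trace QA pass several screens (the latter drawn from a navigation trace), which is where the reported gap between single-screen and multi-screen reasoning appears.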