Poster in Workshop: Workshop on Computer Use Agents

OS-MAP: How Far Can Computer Use Agents Go in Breadth and Depth?

Xuetian Chen · Yinghao Chen · Xinfeng Yuan · ZhuoPeng · Lu Chen · Yuekeng Li · Zhoujia Zhang · Yingqian Huang · Leyan Huang · Jiaqing Liang · Tianbao Xie · Zhiyong Wu · Qiushi Sun · Biqing Qi · Bowen Zhou


Abstract:

Computer use agents have shown strong potential to boost human productivity and enable new application forms across platforms. While recent advances have led to usable applications, existing benchmarks fail to account for internal task heterogeneity, the agent capabilities each kind of task requires, and alignment with actual user demands. This hinders both targeted capability development and the reliable transition of research progress into practical deployment. To bridge this gap, we present OS-MAP, a benchmark for daily computer use automation consisting of 416 realistic tasks across 15 applications. To enable fine-grained analysis of required capabilities and alignment with real-world scenarios, OS-MAP evaluates agents along two dimensions: automation level across a five-level taxonomy, and generalization scope across a demand hierarchy. This design captures varying levels of required agent autonomy and generalization, forming a performance–generalization evaluation matrix for structured and comprehensive assessment. Experiments show that even the strongest agents struggle with higher-level tasks involving perception, reasoning, and coordination, highlighting the need for a deeper understanding of current strengths and limitations to drive future progress in computer use agent research and deployment. All code, environments, baselines, and data are publicly available at https://anonymous.4open.science/r/OSMap-C2F5/.
