ICML OSWorld-Gold: Benchmarking the Efficiency of Computer-Use Agents

Poster in Workshop: Workshop on Computer Use Agents

Reyna Abhyankar · Qi Qi · Yiying Zhang

Abstract:

Generative AI is increasingly used to solve computer-use tasks involving desktop applications. State-of-the-art systems have focused solely on improving accuracy on leading benchmarks. However, these systems are practically unusable because of their extremely high end-to-end latency (e.g., tens of minutes) on tasks that typically take humans just a few minutes to complete. To understand the causes of this latency and to guide the future development of computer-use agents, we conduct the first study of the temporal performance of computer-use agents on OSWorld, the flagship benchmark. We find that as an agent uses more steps to complete a task, each step can take 3x longer than steps at the beginning of the task. We then construct OSWorld-Gold, an annotated version of the original OSWorld dataset that contains the most efficient trajectory for each task, and we use it to evaluate the efficiency of 17 agents.
