ICML OS-Harm: A Benchmark for Measuring Safety of Computer Use Agents

Oral in Workshop: Workshop on Computer Use Agents

OS-Harm: A Benchmark for Measuring Safety of Computer Use Agents

Thomas Kuntz · Agatha Duzan · Hao Zhao · Francesco Croce · Zico Kolter · Nicolas Flammarion · Maksym Andriushchenko

Abstract:

Computer use agents are a new way of building agents that can interact with a computer directly by processing screenshots or accessibility trees. Despite their emerging popularity, their safety has been mostly overlooked. However, evaluating potential harmful behaviors and failure cases of agents represents a crucial step towards their widespread adoption. To this end, we introduce OS-Harm, a new benchmark for measuring safety of computer use agents. OS-Harm is built on top of the OSWorld environment and aims to test models across three categories of harm: deliberate user misuse, prompt injection attacks, and model misbehavior. To cover these cases, we collect 100 base and 150 augmented tasks that span several types of safety violations (harassment, copyright infringement, disinformation, data exfiltration, etc.) and require the agent to interact with a variety of applications. Moreover, we propose an automated judge to evaluate both the accuracy (whether the task is successfully completed) and the safety (whether the actions are safe) of the agent. Finally, we benchmark state-of-the-art computer use agents on OS-Harm, as well as the automated judges. We believe our benchmark can help the community both to measure the safety of current and future computer use agents, and to develop novel task-specific techniques to red team agents before deployment.
