ICML Silent Sabotage: Injecting Backdoors into AI Agents Through Fine-Tuning

Poster
in
Workshop: Workshop on Computer Use Agents

Silent Sabotage: Injecting Backdoors into AI Agents Through Fine-Tuning

Léo Boisvert · Abhay Puri · Chandra Kiran Evuru · Joshua Kazdan · Avinandan Bose · Quentin Cappart · Maryam Fazel · Sai Rajeswar Mudumba · Jason Stanley · Nicolas Chapados · Alexandre Drouin · Krishnamurthy Dvijotham

[ Abstract ] [ Project Page ]

[ OpenReview]

Abstract:

The rise of AI agents that can use tools, browse the web and interact with computers on behalf of a user, has sparked strong interest in improving these capabilities by explicitly fine-tuning the LLMs/VLMs that power these agents. Several researchers have proposed collecting data by letting the agents interact with their environment (e.g., a computer operating system, the web or a collection of APIs exposed as tools), and improve agent performance by fine tuning on this data. In this work, we show that such data collection can be manipulated by adversaries to insert poisoned traces. By modifying just 5% of collected traces, adversaries can embed stealthy bad behaviors into agents—like leaking confidential user information whenever the tool or webpage exposes a trigger. Our results raise important security concerns in the development of AI agents, and underscore the importance of careful scrutiny of all data collection processes used to improve agentic AI.

Chat is not available.

Poster in Workshop: Workshop on Computer Use Agents

Silent Sabotage: Injecting Backdoors into AI Agents Through Fine-Tuning

Léo Boisvert · Abhay Puri · Chandra Kiran Evuru · Joshua Kazdan · Avinandan Bose · Quentin Cappart · Maryam Fazel · Sai Rajeswar Mudumba · Jason Stanley · Nicolas Chapados · Alexandre Drouin · Krishnamurthy Dvijotham

Poster
in
Workshop: Workshop on Computer Use Agents