Workshop
Workshop on Computer Use Agents
David Barber · Doina Precup · Andrei Nica · Roberta Raileanu · Harshil Shah · Boyuan Zheng · Shuyan Zhou
West Meeting Room 211-214
Sat 19 Jul, 8:30 a.m. PDT
Computer use models are attracting significant interest in academia and industry due to their ability to perform complex tasks in non-deterministic environments. However, they are far from being ready for unattended deployment, as evidenced by their performance on the OSWorld benchmark, where they achieve only a small fraction of human performance. The rapid evolution of these agents raises important questions regarding their accuracy, safe deployment, and potential impact on the future of work. The topics we would like to cover are:
- Learning Algorithms: which new architectures and learning techniques (e.g., memory mechanisms for extended tasks, exploration strategies) can enhance the intrinsic ability of computer use agents to acquire, represent, and refine knowledge?
- Orchestration: what novel frameworks or control methods (e.g., dynamic task planning, modular coordination, multi-agent systems) can efficiently manage and integrate multiple learning components to optimize overall agent performance?
- Interfaces: how should agents perceive and act within their environments (e.g., via APIs or UI interactions), and should we design unified systems or specialized agents for different modalities?
- Guardrails, safety & societal implications: what guardrails do we need in order to make computer use models safe for deployment "in the wild" while ensuring that they have a positive impact on society?
- Benchmarking & tools: how can we develop robust environments and evaluation metrics that capture the diversity of real-world settings? Do we need new tools or frameworks to make research on computer use more efficient and accessible?
- Human-agent interaction: how will future interactions evolve? Should we optimize agents for full autonomy or design them as personalized, human-centric collaborators?
- Broader applications: what are some practical applications for computer use agents across domains such as healthcare, scientific research, and software engineering and testing?
- Capability horizon: what breakthroughs or engineering challenges are required to enable agents orders of magnitude more capable than today, and what implications would such advances have?
Schedule
Sat 8:30 a.m. - 8:40 a.m. | Opening remarks (Intro)
Sat 8:40 a.m. - 9:05 a.m. | Nouha Dziri - OpenAgentSafety: A Comprehensive Framework for Evaluating Real-World AI Agent Safety (Invited Talk)
Sat 9:05 a.m. - 9:30 a.m. | Zhiyong Wu - Large Scale Reinforcement Learning for General Computer Agents (Invited Talk)
Sat 9:30 a.m. - 9:40 a.m. | Spotlight "ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows" - Rob Xiangru Tang (Accepted Paper Spotlight)
Sat 9:40 a.m. - 10:30 a.m. | Posters & Coffee break (Poster Session)
Sat 10:30 a.m. - 10:55 a.m. | Qingyun Wu (Invited Talk)
Sat 10:55 a.m. - 11:20 a.m. | Yu Su - The Intelligence Feedback Loop: From Biological Inspiration to Augmented Cognition (Invited Talk)
Sat 11:20 a.m. - 11:45 a.m. | Ruslan Salakhutdinov - Scaling up Multimodal AI Agents (Invited Talk)
Sat 11:45 a.m. - 11:55 a.m. | Spotlight "Improving LLM Agent Planning with In-Context Learning via Atomic Fact Augmentation and Lookahead Search" - Sam Holt (Accepted Paper Spotlight)
Sat 1:00 p.m. - 1:15 p.m. | Sercan Arık (Invited Talk)
Sat 1:15 p.m. - 2:15 p.m. | Panel discussion - Ruslan Salakhutdinov, Alexandre Drouin, Qingyun Wu, Victor Zhong, Nouha Dziri, Yu Su (Panel)
Sat 2:15 p.m. - 2:40 p.m. | Alexandre Drouin - Computer-use agents in the enterprise: progress and key challenges (Invited Talk)
Sat 2:40 p.m. - 2:50 p.m. | Spotlight "How to Train Your LLM Web Agent: A Statistical Diagnosis" - Massimo Caccia (Accepted Paper Spotlight)
Sat 2:50 p.m. - 3:00 p.m. | Spotlight "OS-Harm: A Benchmark for Measuring Safety of Computer Use Agents" - Maksym Andriushchenko (Accepted Paper Spotlight)
Sat 3:00 p.m. - 3:50 p.m. | Posters & Coffee break (Poster Session)
Sat 3:50 p.m. - 4:05 p.m. | Graham Neubig (Invited Talk)
Sat 4:05 p.m. - 4:30 p.m. | Victor Zhong - Building and Evaluating Generalist Agents (Invited Talk)
Sat 4:30 p.m. - 4:55 p.m. | Alane Suhr - Training Language-Conditioned Agents with Reinforcement Learning (Invited Talk)
Sat 4:55 p.m. - 5:05 p.m. | Closing Remarks (Closing Remarks)
Sat 5:05 p.m. - 6:00 p.m. | Poster & Social (Poster Session)

Weathering the CUA Storm: Mapping Security Threats in the Rapid Rise of Computer Use Agents (Poster) | Dan Jones · Martin Pouliot · Giorgio Severi · Joris de Gruyter · Gary Lopez Munoz · Santiago Zanella-Beguelin · Justin Song · Amanda Minnich · Pamela Cortez
Universal Retrieval for Multimodal Trajectory Modeling (Poster) | Xuan Zhang · Ziyan Jiang · Rui Meng · Yifei Leng · Zhenbang Xiao · Zhiruo Wang · Yanni Shawn · Yanni Shawn
UI-Evol: Automatic Knowledge Evolving for Computer Use Agents (Poster) | Ziyun Zhang · Xinyi Liu · Xiaoyi Zhang · Jun Wang · Gang Chen · Yan Lu
BIMgent: Towards Autonomous Building Modeling via Computer-use Agents (Poster) | Zihan Deng · Changyu Du · Stavros Nousias · André Borrmann
OSWorld-Gold: Benchmarking the Efficiency of Computer-Use Agents (Poster) | Reyna Abhyankar · Qi Qi · Yiying Zhang
InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection (Poster) | Yuhang Liu · Pengxiang Li · Zishu Wei · Congkai Xie · Xueyu Hu · Xinchen Xu · Shengyu Zhang · Xiaotian Han · Hongxia Yang · Fei Wu
EARL: Early Intent Recognition in GUI Tasks Using Theory of Mind (Poster) | Shraddha Vijay Pawar · Balavarun Pedapudi · Pramod Kaushik · Sarath Sivaprasad · Mario Fritz · Shirish Karande
OS-MAP: How Far Can Computer Use Agents Go in Breadth and Depth? (Poster) | Xuetian Chen · Yinghao Chen · Xinfeng Yuan · ZhuoPeng · Lu Chen · Yuekeng Li · Zhoujia Zhang · Yingqian Huang · Leyan Huang · Jiaqing Liang · Tianbao Xie · Zhiyong Wu · Qiushi Sun · Biqing Qi · Bowen Zhou
AgentSearchBench: Evaluating Agentic Search with Agent-as-a-Judge (Poster) | Boyu Gou · Zanming Huang · Yuting Ning · Yu Gu · Michael Lin · Botao Yu · Andrei Kopanev · Weijian Qi · Yiheng Shu · Jiaman Wu · Chan Hee Song · Bernal Jimenez Gutierrez · Yifei Li · Zeyi Liao · Hanane Nour Moussa · Tianshu Zhang · Jian Xie · Tianci Xue · Shijie Chen · Boyuan Zheng · Kai Zhang · Zhaowei Cai · Viktor Rozgic · Morteza Ziyadi · Huan Sun · Yu Su
WebGames: Challenging General-Purpose Web-Browsing AI Agents (Poster) | George Thomas · Filippos Christianos · Alexander Chan · Rohit Midha · Jikun Kang · Wenqi Wu · Fraser Greenlee · Andrew Toulis · Marvin Purtorab
OS-Harm: A Benchmark for Measuring Safety of Computer Use Agents (Oral) | Thomas Kuntz · Agatha Duzan · Hao Zhao · Francesco Croce · Zico Kolter · Nicolas Flammarion · Maksym Andriushchenko
WASP: Benchmarking Web Agent Security Against Prompt Injection Attacks (Poster) | Ivan Evtimov · Arman Zharmagambetov · Aaron Grattafiori · Chuan Guo · Kamalika Chaudhuri
EconEvals: Benchmarks and Litmus Tests for LLM Agents in Unknown Environments (Poster) | Sara Fish · Julia Shephard · Minkai Li · Ran Shorrer · Yannai A. Gonczarowski
Reinforcing Multi-Turn Reasoning in LLM Agents via Turn-Level Credit Assignment (Poster) | Siliang Zeng · Quan Wei · William Brown · Oana Frunza · Yuriy Nevmyvaka · Yang Zhao · Mingyi Hong
ST-WebAgentBench: A Benchmark for Evaluating Safety and Trustworthiness in Web Agents (Poster) | Ido Levy · Ben Wiesel · Sami Marreed · Alon Oved · Avi Yaeli · Segev Shlomov
Dynamic Risk Assessments for Offensive Cybersecurity Agents (Poster) | Boyi Wei · Benedikt Stroebl · Jiacen Xu · Joie Zhang · Zhou Li · Peter Henderson
DoomArena: A framework for Testing AI Agents Against Evolving Security Threats (Poster) | Léo Boisvert · Abhay Puri · Gabriel Huang · Mihir Bansal · Chandra Kiran Evuru · Avinandan Bose · Maryam Fazel · Quentin Cappart · Alexandre Lacoste · Alexandre Drouin · Krishnamurthy Dvijotham
API Agents vs. GUI Agents: Divergence and Convergence (Poster) | Chaoyun Zhang · Shilin He · Liqun Li · Si Qin · Yu Kang · Qingwei Lin · Saravanakumar Rajmohan · Dongmei Zhang
Semantic Context for Tool Orchestration (Poster) | Robert Müller
Reimagining ABM with LLM Agents via Shachi (Poster) | So Kuroki · Yingtao Tian · Kou Misaki · Takashi Ikegami · Takuya Akiba · Yujin Tang
AgentDAM: Privacy Leakage Evaluation for Autonomous Web Agents (Poster) | Arman Zharmagambetov · Chuan Guo · Ivan Evtimov · Maya Pavlova · Russ Salakhutdinov · Kamalika Chaudhuri
Silent Sabotage: Injecting Backdoors into AI Agents Through Fine-Tuning (Poster) | Léo Boisvert · Abhay Puri · Chandra Kiran Evuru · Joshua Kazdan · Avinandan Bose · Quentin Cappart · Maryam Fazel · Sai Rajeswar Mudumba · Jason Stanley · Nicolas Chapados · Alexandre Drouin · Krishnamurthy Dvijotham
WebAgent-R1: Training Web Agents via End-to-End Multi-Turn Reinforcement Learning (Poster) | Zhepei Wei · Wenlin Yao · Yao Liu · Weizhi Zhang · Qin Lu · Liang Qiu · Changlong Yu · Puyang Xu · Chao Zhang · Bing Yin · Hyokun Yun · Lihong Li
GuardAgent: Safeguard LLM Agents via Knowledge-Enabled Reasoning (Poster) | Zhen Xiang · Linzhi Zheng · Yanjie Li · Junyuan Hong · Qinbin Li · Han Xie · Jiawei Zhang · Zidi Xiong · Chulin Xie · Nathaniel Bastian · Carl Yang · Dawn Song · Bo Li
Replacing thinking with tool usage enables reasoning in small language models (Poster) | Corrado Rainone · Tim Bakker · Roland Memisevic
Toward Autonomous UI Exploration: The UIExplorer Benchmark (Poster) | Andrei Nica · Akshaya Shanbhogue · Harshil Shah · Aleix Cambray · Tudor Berariu · Lucas Maystre · David Barber
ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows (Oral) | Qiushi Sun · Zhoumianze Liu · Chang Ma · Zichen Ding · Fangzhi Xu · Zhangyue Yin · Haiteng Zhao · Zhenyu Wu · Kanzhi Cheng · Zhaoyang Liu · Jianing Wang · Qintong Li · Robert Tang · Tianbao Xie · Xiachong Feng · Xiang Li · Ben Kao · Wenhai Wang · Biqing Qi · Lingpeng Kong · Zhiyong Wu
WebQuest: A Benchmark for Multimodal QA on Web Page Sequences (Poster) | Maria Wang · Srinivas Sunkara · Jason Lin · Gilles Baechler · Fedir Zubach · Lei Shu · Yun Zhu · Jindong Chen
VerificAgent: Integrating Expert Knowledge and Fact-Checked Memory for Robust Domain-Specific Task Planning (Poster) | Thong Nguyen · Shubhang Desai · Yash Jain · Tanvir Aumi · Vishal Chowdhary
How to Train Your LLM Web Agent: A Statistical Diagnosis (Oral) | Dheeraj Vattikonda · Santhoshi Ravichandran · Emiliano Penaloza · Hadi Nekoei · Megh Thakkar · Thibault de Chezelles · Nicolas Gontier · Miguel Muñoz-Mármol · Sahar Omidi Shayegan · Stefania Raimondo · Xue Liu · Alexandre Drouin · Laurent Charlin · Alex Piche · Alexandre Lacoste · Massimo Caccia
Coding Agents with Multimodal Browsing are Generalist Problem Solvers (Poster) | Aditya Bharat Soni · Boxuan Li · Xingyao Wang · Valerie Chen · Graham Neubig
Context manipulation attacks: Web agents are susceptible to corrupted memory (Poster) | Atharv Singh Patlan · Ashwin Hebbar · Pramod Viswanath · Prateek Mittal
Improving LLM Agent Planning for Computer Use via In-Context Learning with Atomic Fact Augmentation and Lookahead Search (Oral) | Samuel Holt · Max Ruiz Luyten · Thomas Pouplin · Mihaela van der Schaar