Poster
Cradle: Empowering Foundation Agents towards General Computer Control
Weihao Tan · Wentao Zhang · Xinrun Xu · Haochong Xia · Gang Ding · Boyu Li · Bohan Zhou · Junpeng Yue · Jiechuan Jiang · Yewen Li · Ruyi An · Molei Qin · Chuqiao Zong · Longtao Zheng · YuJie Wu · Xiaoqiang Chai · Yifei Bi · Tianbao Xie · Pengjie Gu · Xiyun Li · Ceyao Zhang · Long Tian · Chaojie Wang · Xinrun Wang · Börje F. Karlsson · Bo An · Shuicheng YAN · Zongqing Lu
West Exhibition Hall B2-B3 #W-101
Despite their success in specific scenarios, existing foundation agents still struggle to generalize across various virtual scenarios, mainly due to the dramatically different encapsulations of environments with manually designed observation and action spaces. To handle this issue, we propose the General Computer Control (GCC) setting, which restricts foundation agents to interacting with software through the most unified and standardized interface, i.e., using screenshots as input and keyboard and mouse actions as output. We introduce Cradle, a modular and flexible LMM-powered framework, as a preliminary attempt towards GCC. Enhanced by six key modules (Information Gathering, Self-Reflection, Task Inference, Skill Curation, Action Planning, and Memory), Cradle is able to understand input screenshots and output executable code for low-level keyboard and mouse control after high-level planning and information retrieval, so that Cradle can interact with any software and complete long-horizon complex tasks without relying on any built-in APIs. Experimental results show that Cradle exhibits remarkable generalizability and impressive performance across four previously unexplored commercial video games (Red Dead Redemption 2, Cities: Skylines, Stardew Valley, and Dealer's Life 2), five software applications (Chrome, Outlook, Feishu, Meitu, and CapCut), and a comprehensive benchmark, OSWorld. With a unified interface to interact with any software, Cradle greatly extends the reach of foundation agents, thus paving the way for generalist agents.
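The GCC loop described in the abstract (screenshot in, keyboard and mouse actions out, with six modules in between) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the function names, the `Action` and `Memory` types, the order in which modules are called, and the stubbed logic standing in for LMM calls are hypothetical and are not taken from the Cradle codebase.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Action:
    """A low-level input event (hypothetical representation)."""
    device: str   # "keyboard" or "mouse"
    command: str  # e.g. "press:w" or "click:120,340"


@dataclass
class Memory:
    """Memory module: past observations plus curated skills."""
    history: List[str] = field(default_factory=list)
    skills: Dict[str, List[Action]] = field(default_factory=dict)


def gather_information(screenshot: bytes) -> str:
    # Information Gathering: stand-in for an LMM describing the screenshot.
    return f"screen with {len(screenshot)} bytes"


def self_reflect(memory: Memory, observation: str) -> str:
    # Self-Reflection: compare the new observation with the previous one.
    if not memory.history:
        return "no previous action"
    changed = memory.history[-1] != observation
    return "screen changed" if changed else "screen unchanged"


def infer_task(goal: str, reflection: str) -> str:
    # Task Inference: choose the next subtask (stubbed as a string).
    return f"{goal} (after: {reflection})"


def curate_skills(memory: Memory, subtask: str) -> None:
    # Skill Curation: register a reusable skill if it is not known yet.
    memory.skills.setdefault("move_forward", [Action("keyboard", "press:w")])


def plan_actions(memory: Memory, subtask: str) -> List[Action]:
    # Action Planning: pick a stored skill and expand it into input events.
    return list(memory.skills["move_forward"])


def gcc_step(goal: str, screenshot: bytes, memory: Memory) -> List[Action]:
    """One iteration: a screenshot goes in, keyboard/mouse actions come out."""
    observation = gather_information(screenshot)
    reflection = self_reflect(memory, observation)
    subtask = infer_task(goal, reflection)
    curate_skills(memory, subtask)
    actions = plan_actions(memory, subtask)
    memory.history.append(observation)
    return actions
```

In a real agent, the screenshot would come from a screen-capture call, each stub would be replaced by an LMM prompt, and the returned actions would be dispatched to an input-simulation library; none of those pieces are shown here.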
How can we enable AI agents to perform all kinds of computer tasks, not just browsing the web but also playing video games and operating complex software? The Cradle framework offers an answer by allowing chatbot models like GPT-4o to use computers the same way humans do: by viewing the screen and controlling the keyboard and mouse.
Rather than relying on built-in shortcuts or special software access, Cradle harnesses the power of multimodal large language models to interpret screenshots and generate code that simulates human interactions. Comprising six key modules, Cradle enables models to observe ongoing activities, reflect on past actions, plan subsequent steps, and store useful skills for future tasks, thereby effectively managing challenging and intricate assignments.
Cradle has successfully completed long, complex missions in demanding games like Red Dead Redemption 2, Cities: Skylines, and Stardew Valley, and has carried out various software tasks such as image and video editing. This marks a significant step toward building general-purpose AI agents that are adaptable, capable, and human-like in their digital interactions.