

Oral Sessions

Oral 6A Applications in Agents and Coding

West Exhibition Hall C

Moderators: Anshumali Shrivastava · Yu Cheng

Thu 17 Jul 3:30 p.m. PDT — 4:30 p.m. PDT

Thu 17 July 15:30 - 15:45 PDT

EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents

Rui Yang · Hanyang (Jeremy) Chen · Junyu Zhang · Mark Zhao · Cheng Qian · Kangrui Wang · Qineng Wang · Teja Koripella · Marziyeh Movahedi · Manling Li · Heng Ji · Huan Zhang · Tong Zhang

Leveraging Multi-modal Large Language Models (MLLMs) to create embodied agents offers a promising avenue for tackling real-world tasks. While language-centric embodied agents have garnered substantial attention, MLLM-based embodied agents remain underexplored due to the lack of comprehensive evaluation frameworks. To bridge this gap, we introduce EmbodiedBench, an extensive benchmark designed to evaluate vision-driven embodied agents. EmbodiedBench features: (1) a diverse set of 1,128 testing tasks across four environments, ranging from high-level semantic tasks (e.g., household) to low-level tasks involving atomic actions (e.g., navigation and manipulation); and (2) six meticulously curated subsets evaluating essential agent capabilities like commonsense reasoning, complex instruction understanding, spatial awareness, visual perception, and long-term planning. Through extensive experiments, we evaluated 24 leading proprietary and open-source MLLMs within EmbodiedBench. Our findings reveal that MLLMs excel at high-level tasks but struggle with low-level manipulation, with the best model, GPT-4o, scoring only 28.9% on average. EmbodiedBench provides a multifaceted standardized evaluation platform that not only highlights existing challenges but also offers valuable insights to advance MLLM-based embodied agents. Our code and dataset are available at https://embodiedbench.github.io.
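The benchmark structure described above suggests a simple evaluation pattern: run a vision-driven agent on each task and aggregate success rates per environment. The sketch below is a minimal, hypothetical illustration of such a loop; the names (`Task`, `run_episode`, `evaluate`) are assumptions for illustration, not EmbodiedBench's actual API.

```python
# Hypothetical sketch of a benchmark-style evaluation loop for vision-driven
# embodied agents. All names are illustrative, not the EmbodiedBench API.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    environment: str  # e.g. "household", "navigation", "manipulation"
    subset: str       # e.g. "spatial_awareness", "long_horizon_planning"
    task_id: str

def evaluate(tasks: list[Task],
             run_episode: Callable[[Task], bool]) -> dict[str, float]:
    """Return per-environment success rates for an agent.

    `run_episode` is assumed to roll out the agent in the task's environment
    and report whether the episode succeeded.
    """
    totals: dict[str, list[int]] = {}
    for task in tasks:
        success = run_episode(task)
        totals.setdefault(task.environment, []).append(int(success))
    return {env: sum(v) / len(v) for env, v in totals.items()}
```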

Thu 17 July 15:45 - 16:00 PDT

SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering?

Samuel Miserendino · Michele Wang · Tejal Patwardhan · Johannes Heidecke

We introduce SWE-Lancer, a benchmark of over 1,400 freelance software engineering tasks from Upwork, valued at $1 million USD total in real-world payouts. SWE-Lancer encompasses both independent engineering tasks, ranging from $50 bug fixes to $32,000 feature implementations, and managerial tasks, where models choose between technical implementation proposals. Independent tasks are graded with end-to-end tests triple-verified by experienced software engineers, while managerial decisions are assessed against the choices of the original hired engineering managers. We evaluate model performance and find that frontier models are still unable to solve the majority of tasks. To facilitate future research, we open-source a unified Docker image and a public evaluation split. By mapping model performance to monetary value, we hope SWE-Lancer enables greater research into the economic impact of AI model development.
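The core scoring idea, crediting a model with the real payout of each task it solves, can be illustrated with a short sketch. The field and function names below (`TaskResult`, `total_earnings`, `resolution_rate`) are hypothetical and do not reflect the released evaluation harness.

```python
# Minimal sketch of SWE-Lancer-style payout accounting, under the assumption
# that each task carries a dollar value and a pass/fail verdict.
from dataclasses import dataclass

@dataclass
class TaskResult:
    kind: str          # "independent" or "managerial"
    payout_usd: float  # real freelance payout attached to the task
    passed: bool       # e2e tests passed / matched the manager's choice

def total_earnings(results: list[TaskResult]) -> float:
    """Dollar value 'earned' out of the benchmark's ~$1M total."""
    return sum(r.payout_usd for r in results if r.passed)

def resolution_rate(results: list[TaskResult]) -> float:
    """Fraction of tasks resolved, independent of their payout."""
    return sum(r.passed for r in results) / len(results) if results else 0.0
```

Reporting both numbers separates raw solve rate from economic value, since a single high-value feature implementation can outweigh many small bug fixes.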

Thu 17 July 16:00 - 16:15 PDT

CodeI/O: Condensing Reasoning Patterns via Code Input-Output Prediction

Junlong Li · Daya Guo · Dejian Yang · Runxin Xu · Yu Wu · Junxian He

Reasoning is a fundamental capability of Large Language Models. While prior research predominantly focuses on enhancing narrow skills like math or code generation, improving performance on many other reasoning tasks remains challenging due to sparse and fragmented training data. To address this issue, we propose CodeI/O, a novel approach that systematically condenses the diverse reasoning patterns inherently embedded in contextually grounded code by transforming the original code into a code input-output prediction format. By training models to predict inputs and outputs given the code and test cases, entirely in natural language as Chain-of-Thought (CoT) rationales, we expose them to universal reasoning primitives such as logic flow planning, state-space searching, decision-tree traversal, and modular decomposition, while decoupling structured reasoning from code-specific syntax and preserving procedural rigor. Experimental results demonstrate that CodeI/O leads to consistent improvements across symbolic, scientific, logic, math & numerical, and commonsense reasoning tasks. By matching the existing ground-truth outputs or re-executing the code with predicted inputs, we can verify each prediction and further enhance the CoTs through multi-turn revision, resulting in CodeI/O++ and achieving higher performance. Our data and models will be publicly available.
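The data transformation the abstract describes, turning a piece of code plus a test case into an output-prediction problem and an input-prediction problem, and verifying predictions by re-executing the code, can be sketched roughly as follows. Function names and prompt wording are illustrative assumptions, not the paper's released pipeline.

```python
# Rough sketch of a CodeI/O-style transformation: build natural-language
# prediction problems from a function and a concrete test case, and verify a
# predicted input by re-execution. Names are hypothetical.
import inspect
from typing import Any, Callable

def make_examples(fn: Callable, test_input: Any) -> dict[str, str]:
    """Create output-prediction and input-prediction prompts for one case."""
    source = inspect.getsource(fn)
    expected_output = fn(test_input)
    return {
        "output_prediction": (
            f"Given the code:\n{source}\n"
            f"What does it return for input {test_input!r}? "
            "Reason step by step, then state the output."
        ),
        "input_prediction": (
            f"Given the code:\n{source}\n"
            f"Find an input that makes it return {expected_output!r}. "
            "Reason step by step, then state the input."
        ),
    }

def verify_predicted_input(fn: Callable, predicted_input: Any,
                           expected_output: Any) -> bool:
    """Re-execute the code on a predicted input to check the rationale."""
    return fn(predicted_input) == expected_output
```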

Thu 17 July 16:15 - 16:30 PDT

ITBench: Evaluating AI Agents across Diverse Real-World IT Automation Tasks

Saurabh Jha · Rohan Arora · Yuji Watanabe · Takumi Yanagawa · Yinfang Chen · Jackson Clark · Bhavya Bhavya · Mudit Verma · Harshit Kumar · Hirokuni Kitahara · Noah Zheutlin · Saki Takano · Divya Pathak · Felix George · Xinbo Wu · Bekir Turkkan · Gerard Vanloo · Michael Nidd · Ting Dai · Oishik Chatterjee · Pranjal Gupta · Suranjana Samanta · Pooja Aggarwal · Rong Lee · Jae-wook Ahn · Debanjana Kar · Amit Paradkar · Yu Deng · Pratibha Moogi · Prateeti Mohapatra · Naoki Abe · Chandrasekhar Narayanaswami · Tianyin Xu · Lav Varshney · Ruchi Mahindru · Anca Sailer · Laura Shwartz · Daby Sow · Nicholas Fuller · Ruchir Puri

Realizing the vision of using AI agents to automate critical IT tasks depends on the ability to measure and understand effectiveness of proposed solutions. We introduce ITBench, a framework that offers a systematic methodology for benchmarking AI agents to address real-world IT automation tasks. Our initial release targets three key areas: Site Reliability Engineering (SRE), Compliance and Security Operations (CISO), and Financial Operations (FinOps). The design enables AI researchers to understand the challenges and opportunities of AI agents for IT automation with push-button workflows and interpretable metrics. IT-Bench includes an initial set of 102 real-world scenarios, which can be easily extended by community contributions. Our results show that agents powered by state-of-the-art models resolve only 11.4% of SRE scenarios, 25.2% of CISO scenarios, and 25.8% of FinOps scenarios (excluding anomaly detection). For FinOps-specific anomaly detection (AD) scenarios, AI agents achieve an F1 score of 0.35. We expect ITBench to be a key enabler of AI-driven IT automation that is correct, safe, and fast. IT-Bench, along with a leaderboard and sample agent implementations, is available at https://github.com/ibm/itbench.
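The headline metrics, per-domain scenario resolution rates and an F1 score for the FinOps anomaly-detection scenarios, can be illustrated with a minimal sketch; the data layout and function names below are assumptions, not the ITBench repository's actual interface.

```python
# Hypothetical sketch of aggregating benchmark results into the kinds of
# metrics reported above: resolution rate per domain (SRE, CISO, FinOps)
# and F1 for anomaly-detection scenarios.
from collections import defaultdict

def resolution_rates(results: list[dict]) -> dict[str, float]:
    """results: [{"domain": "SRE", "resolved": True}, ...] (illustrative)."""
    by_domain = defaultdict(list)
    for r in results:
        by_domain[r["domain"]].append(r["resolved"])
    return {d: sum(v) / len(v) for d, v in by_domain.items()}

def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 over anomaly-detection verdicts (e.g. the reported 0.35 for AD)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```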