

Poster in Workshop: CODEML: Championing Open-source DEvelopment in Machine Learning

A2Perf: Benchmarking Autonomous Agents End-to-End in Realistic Domains

Ikechukwu Uchendu · Jason Jabbour · Korneel Van den Berghe · Joel Runevic · Matthew Stewart · Jeffrey Ma · Srivatsan Krishnan · Izzeddin Gur · Austin Huang · Colton Bishop · Paige Bailey · Wenjie Jiang · Ebrahim M. Songhori · Sergio Guadarrama · Jie Tan · Jordan Terry · Aleksandra Faust · Vijay Janapa Reddi

Fri 18 Jul 2:15 p.m. PDT — 3 p.m. PDT

Abstract:

Autonomous agents and systems cover a number of application areas, from robotics and digital assistants to combinatorial optimization, all sharing common, unresolved research challenges. It is not sufficient for agents to merely solve a given task; they must generalize to out-of-distribution tasks, perform reliably, and use hardware resources efficiently during training and on-device deployment, among other requirements. Several classes of methods, such as reinforcement learning and imitation learning, are commonly used to tackle these problems, each with different trade-offs. However, there is a lack of benchmarking suites that define the environments, datasets, and metrics which can be used to provide a meaningful way for the community to compare progress on applying these methods to real-world problems. We introduce A2Perf, a benchmarking suite including three environments that closely resemble real-world domains: computer chip floorplanning, web navigation, and quadruped locomotion. A2Perf provides metrics that track task performance, generalization, system resource efficiency, and reliability, which are all critical to real-world applications. In addition, we propose a data cost metric to account for the cost incurred acquiring offline data for imitation learning, reinforcement learning, and hybrid algorithms, which allows us to better compare these approaches. As an open-source and extendable benchmark, A2Perf is designed to remain accessible, documented, up-to-date, and useful to the research community over the long term.
