

Poster

EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents

Rui Yang · Hanyang (Jeremy) Chen · Junyu Zhang · Mark Zhao · Cheng Qian · Kangrui Wang · Qineng Wang · Teja Koripella · Marziyeh Movahedi · Manling Li · Heng Ji · Huan Zhang · Tong Zhang

East Exhibition Hall A-B #E-2411
Thu 17 Jul 4:30 p.m. PDT — 7 p.m. PDT
 
Oral presentation: Oral 6A Applications in Agents and Coding
Thu 17 Jul 3:30 p.m. PDT — 4:30 p.m. PDT

Abstract: Leveraging Multi-modal Large Language Models (MLLMs) to create embodied agents offers a promising avenue for tackling real-world tasks. While language-centric embodied agents have garnered substantial attention, MLLM-based embodied agents remain underexplored due to the lack of comprehensive evaluation frameworks. To bridge this gap, we introduce EmbodiedBench, an extensive benchmark designed to evaluate vision-driven embodied agents. EmbodiedBench features: (1) a diverse set of 1,128 testing tasks across four environments, ranging from high-level semantic tasks (e.g., household) to low-level tasks involving atomic actions (e.g., navigation and manipulation); and (2) six meticulously curated subsets evaluating essential agent capabilities such as commonsense reasoning, complex instruction understanding, spatial awareness, visual perception, and long-term planning. Through extensive experiments, we evaluated 24 leading proprietary and open-source MLLMs within EmbodiedBench. Our findings reveal that MLLMs excel at high-level tasks but struggle with low-level manipulation, with the best model, GPT-4o, scoring only 28.9% on average. EmbodiedBench provides a multifaceted, standardized evaluation platform that not only highlights existing challenges but also offers valuable insights to advance MLLM-based embodied agents. Our code and dataset are available at [https://embodiedbench.github.io](https://embodiedbench.github.io).

Lay Summary:

EmbodiedBench is a new tool that helps researchers test how well advanced AI systems (ones that can understand both pictures and language) perform tasks in virtual environments. These AI systems, called Multimodal Large Language Models (MLLMs), are developed to help robots and digital assistants better understand and interact with the world around them.

The benchmark includes over 1,100 tasks in different simulated settings. These tasks range from moving through a room to more complex ones like organizing household items. They test important abilities such as following instructions, recognizing objects, making plans, and adjusting to changes. When researchers tested 24 top AI models, they found that while many models are good at big-picture planning, they still struggle with detailed, hands-on actions. For example, even the most advanced model, GPT-4o, completed tasks successfully only 28.9% of the time on average.

EmbodiedBench provides a standard way to compare different AI systems and understand where they need to improve. By pointing out these weaknesses, it helps guide the development of smarter, more reliable AI agents in the future.