Poster
EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents
Rui Yang · Hanyang (Jeremy) Chen · Junyu Zhang · Mark Zhao · Cheng Qian · Kangrui Wang · Qineng Wang · Teja Koripella · Marziyeh Movahedi · Manling Li · Heng Ji · Huan Zhang · Tong Zhang
East Exhibition Hall A-B #E-2411
Thu 17 Jul 3:30 p.m. PDT — 4:30 p.m. PDT
EmbodiedBench is a new benchmark for testing how well advanced AI systems that understand both images and language perform tasks in virtual environments. These systems, called Multimodal Large Language Models (MLLMs), are being developed to help robots and digital assistants better understand and interact with the world around them.

The benchmark includes over 1,100 tasks across a range of simulated settings, from navigating a room to more complex activities such as organizing household items. The tasks probe key abilities: following instructions, recognizing objects, making plans, and adapting to changes. When the researchers evaluated 24 leading models, they found that while many are good at high-level planning, they still struggle with detailed, low-level actions. For example, even the most advanced model, GPT-4o, completed tasks successfully only 28.9% of the time on average.

EmbodiedBench provides a standard way to compare different AI systems and identify where they need to improve. By highlighting these weaknesses, it helps guide the development of smarter, more reliable embodied AI agents in the future.