

Oral Sessions

Oral 6D Evaluation

West Ballroom C

Moderator: Ahmad Beirami

Thu 17 Jul 3:30 p.m. PDT — 4:30 p.m. PDT

Thu 17 July 15:30 - 15:45 PDT

On Path to Multimodal Generalist: General-Level and General-Bench

Hao Fei · Yuan Zhou · Juncheng Li · Xiangtai Li · Qingshan Xu · Bobo Li · Shengqiong Wu · Yaoting Wang · Junbao Zhou · Jiahao Meng · Qingyu Shi · Zhiyuan Zhou · Liangtao Shi · Minghe Gao · Daoan Zhang · Zhiqi Ge · Siliang Tang · Kaihang Pan · Yaobo Ye · Haobo Yuan · Tao Zhang · Weiming Wu · Tianjie Ju · Zixiang Meng · Shilin Xu · Liyu Jia · Wentao Hu · Meng Luo · Jiebo Luo · Tat-Seng Chua · Shuicheng YAN · Hanwang Zhang

Multimodal Large Language Models (MLLMs) are currently experiencing rapid growth, driven by the advanced capabilities of language-based LLMs. Unlike their specialist predecessors, existing MLLMs are evolving towards a Multimodal Generalist paradigm. Initially limited to understanding multiple modalities, these models have advanced to not only comprehend but also generate across modalities. Their capabilities have expanded from coarse-grained to fine-grained multimodal understanding, and from supporting singular modalities to accommodating a wide array of modalities, or even arbitrary ones. To assess the capabilities of various MLLMs, a diverse array of benchmark test sets has been proposed. This leads to a critical question: Can we simply assume that higher performance across tasks indicates a stronger MLLM capability, bringing us closer to human-level AI? We argue that the answer is not as straightforward as it seems. In this project, we introduce an evaluation framework to delineate the capabilities and behaviors of current multimodal generalists. This framework, named General-Level, establishes 5-scale levels of MLLM performance and generality, offering a methodology to compare MLLMs and gauge the progress of existing systems towards more robust multimodal generalists and, ultimately, towards AGI (Artificial General Intelligence). Central to our framework is the use of Synergy as the evaluative criterion, categorizing capabilities based on whether MLLMs preserve synergy across comprehension and generation, as well as across multimodal interactions. To evaluate the comprehensive abilities of various generalists, we present a massive multimodal benchmark, General-Bench, which encompasses a broader spectrum of skills, modalities, formats, and capabilities, including over 700 tasks and 325,800 instances. Evaluation results involving over 100 existing state-of-the-art MLLMs uncover the capability rankings of generalists, highlighting the challenges in reaching genuine AI. We expect this project to pave the way for future research on next-generation multimodal foundation models, providing a robust infrastructure to accelerate the realization of AGI. Project Page: https://generalist.top/, Leaderboard: https://generalist.top/leaderboard/, Benchmark: https://huggingface.co/General-Level/.
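Since the abstract makes Synergy the evaluative criterion, here is a toy illustration of a synergy-style check: does a generalist keep its edge over specialist baselines on both comprehension and generation? This is not the paper's actual General-Level scoring; the function, score keys, and numbers below are all hypothetical.

```python
# Illustrative only: NOT the General-Level scoring from the paper.
# Scores map (modality, capability) -> benchmark score; numbers are toy values.
def preserves_synergy(generalist: dict, specialist: dict) -> bool:
    """True if the generalist matches or beats the specialist baseline on
    every (modality, comprehension/generation) pair it is compared on."""
    return all(generalist[key] >= specialist[key] for key in specialist)

specialist = {("image", "comprehension"): 71.0, ("image", "generation"): 64.0,
              ("audio", "comprehension"): 58.0, ("audio", "generation"): 49.0}
generalist = {("image", "comprehension"): 73.5, ("image", "generation"): 60.2,
              ("audio", "comprehension"): 61.0, ("audio", "generation"): 52.4}

print(preserves_synergy(generalist, specialist))  # False: image generation lags
```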

Thu 17 July 15:45 - 16:00 PDT

What Limits Virtual Agent Application? OmniBench: A Scalable Multi-Dimensional Benchmark for Essential Virtual Agent Capabilities

Wendong Bu · Yang Wu · Qifan Yu · Minghe Gao · Bingchen Miao · Zhenkui Zhang · Kaihang Pan · liyunfei · Mengze Li · Wei Ji · Juncheng Li · Siliang Tang · Yueting Zhuang

As multimodal large language models (MLLMs) advance, MLLM-based virtual agents have demonstrated remarkable performance. However, existing benchmarks face significant limitations, including uncontrollable task complexity, extensive manual annotation, and a lack of multidimensional evaluation. In response to these challenges, we introduce OmniBench, a self-generating, graph-based benchmark with an automated pipeline for synthesizing tasks of controllable complexity through subtask composition. To evaluate the diverse capabilities of virtual agents on the graph, we further present OmniEval, a multidimensional evaluation framework that includes subtask-level evaluation, graph-based metrics, and comprehensive tests across 10 capabilities. Our synthesized dataset contains 36k graph-structured tasks across 20 scenarios, achieving a 91% human acceptance rate. Training on our graph-structured data improves generalization across environments. We conduct multidimensional evaluations of virtual agents, revealing their performance across various capabilities and paving the way for future advancements. Our project is available at https://omni-bench.github.io.
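As a rough picture of the subtask-composition idea described above (this is an illustrative sketch, not the OmniBench pipeline or its API; all class and function names are hypothetical), task complexity can be dialed up or down by growing the dependency graph of subtasks:

```python
from dataclasses import dataclass, field

@dataclass
class Subtask:
    name: str
    instruction: str
    depends_on: list[str] = field(default_factory=list)

def compose_task(subtasks: list[Subtask]) -> list[str]:
    """Topologically order subtasks into one executable task sequence;
    more nodes and edges mean a more complex synthesized task."""
    order, done = [], set()
    remaining = {s.name: s for s in subtasks}
    while remaining:
        ready = [s for s in remaining.values() if set(s.depends_on) <= done]
        if not ready:
            raise ValueError("cyclic dependencies in subtask graph")
        for s in ready:
            order.append(s.instruction)
            done.add(s.name)
            del remaining[s.name]
    return order

# Toy three-node graph; adding subtasks or dependencies raises task complexity.
print(compose_task([
    Subtask("open", "Open the settings app"),
    Subtask("search", "Search for 'display'", depends_on=["open"]),
    Subtask("toggle", "Enable dark mode", depends_on=["search"]),
]))
```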

Thu 17 July 16:00 - 16:15 PDT

How Do Large Language Monkeys Get Their Power (Laws)?

Rylan Schaeffer · Joshua Kazdan · John Hughes · Jordan Juravsky · Sara Price · Aengus Lynch · Erik Jones · Robert Kirk · Azalia Mirhoseini · Sanmi Koyejo

Recent research across mathematical problem solving, proof assistant programming, and multimodal jailbreaking documents a striking finding: when (multimodal) language models tackle a suite of tasks with multiple attempts per task, succeeding if any attempt is correct, the negative log of the average success rate scales as a power law in the number of attempts. In this work, we identify an apparent puzzle: a simple mathematical calculation predicts that on each problem, the failure rate should fall exponentially with the number of attempts. We confirm this prediction empirically, raising a question: from where does aggregate polynomial scaling emerge? We then answer this question by demonstrating that per-problem exponential scaling can be made consistent with aggregate polynomial scaling if the distribution of single-attempt success probabilities is heavy-tailed, such that a small fraction of tasks with extremely low success probabilities collectively warp the aggregate success trend into a power law, even as each problem scales exponentially on its own. We further demonstrate that this distributional perspective explains previously observed deviations from power law scaling and provides a simple method for forecasting the power law exponent with an order of magnitude lower relative error, or equivalently, ${\sim}2-4$ orders of magnitude less inference compute. Overall, our work contributes to a better understanding of how neural language model performance improves as inference compute is scaled, and to the development of scaling-predictable evaluations of (multimodal) language models.
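The central claim, that per-problem exponential improvement aggregates into a power law when single-attempt success probabilities are heavy-tailed, is easy to reproduce numerically. The simulation below is an illustrative sketch; the Beta distribution and its parameters are assumptions made for this example, not values from the paper.

```python
import numpy as np

# Draw per-problem single-attempt success probabilities from a heavy-tailed
# distribution (a Beta with a small first shape parameter, chosen only for
# this sketch). Most problems have success probability near zero.
rng = np.random.default_rng(0)
p = rng.beta(0.15, 3.0, size=50_000)

# Each problem individually improves exponentially in the number of attempts
# k: its failure probability is (1 - p_i)^k.
ks = np.logspace(0, 4, 20)   # k from 1 to 10,000
avg_success = np.array([np.mean(1.0 - (1.0 - p) ** k) for k in ks])

# Aggregate scaling: -log(average success rate) versus k is roughly linear on
# a log-log plot, i.e., a power law, despite per-problem exponential scaling.
neg_log = -np.log(avg_success)
b = -np.polyfit(np.log(ks), np.log(neg_log), 1)[0]
print(f"fitted exponent b in -log(avg success) ~ k^(-b): b ≈ {b:.2f}")
```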

Thu 17 July 16:15 - 16:30 PDT

Suitability Filter: A Statistical Framework for Classifier Evaluation in Real-World Deployment Settings

Angéline Pouget · Mohammad Yaghini · Stephan Rabanser · Nicolas Papernot

Deploying machine learning models in safety-critical domains poses a key challenge: ensuring reliable model performance on downstream user data without access to ground truth labels for direct validation. We propose the suitability filter, a novel framework designed to detect performance deterioration by utilizing suitability signals—model output features that are sensitive to covariate shifts and indicative of potential prediction errors. The suitability filter evaluates whether classifier accuracy on unlabeled user data shows significant degradation compared to the accuracy measured on the labeled test dataset. Specifically, it ensures that this degradation does not exceed a pre-specified margin, which represents the maximum acceptable drop in accuracy. To achieve reliable performance evaluation, we aggregate suitability signals for both test and user data and compare these empirical distributions using statistical hypothesis testing, thus providing insights into decision uncertainty. Our modular method adapts to various models and domains. Empirical evaluations across different classification tasks demonstrate that the suitability filter reliably detects performance deviations due to covariate shift. This enables proactive mitigation of potential failures in high-stakes applications.
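A simplified sketch of the kind of test the abstract describes (not the authors' exact procedure): use a per-sample confidence score as the suitability signal and run a one-sided non-inferiority test that the signal on unlabeled user data has not dropped by more than a pre-specified margin relative to the labeled test set. The max-softmax signal, the Welch t-test, and the toy logits below are all assumptions made for this illustration.

```python
import numpy as np
from scipy import stats

def max_softmax(logits: np.ndarray) -> np.ndarray:
    """Per-sample maximum softmax probability, one plausible suitability signal."""
    z = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return probs.max(axis=1)

def suitability_filter(test_signals, user_signals, margin=0.05, alpha=0.05):
    """One-sided Welch test of H0: mean(user) <= mean(test) - margin.
    Rejecting H0 supports the claim that degradation stays within the margin."""
    _, p_value = stats.ttest_ind(user_signals, test_signals - margin,
                                 equal_var=False, alternative="greater")
    return {"p_value": float(p_value), "suitable": bool(p_value < alpha)}

# Toy usage with random logits standing in for classifier outputs.
rng = np.random.default_rng(0)
test_logits = rng.normal(size=(2000, 10)) * 3.0
user_logits = rng.normal(size=(2000, 10)) * 2.8   # mildly shifted user data
print(suitability_filter(max_softmax(test_logits), max_softmax(user_logits)))
```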