Poster in Workshop: Assessing World Models: Methods and Metrics for Evaluating Understanding
HueManity: Probing Fine-Grained Visual Perception in MLLMs
Rynaa Grover · Jayant Sravan Tamarapalli · Sahiti Yerramilli · Nilay Pande
Keywords: [ Fine-Grained Visual Perception ] [ MLLM Evaluation ] [ Ishihara-style Stimuli ] [ HueManity ] [ Perceptual Understanding ] [ Multimodal Large Language Models (MLLMs) ] [ Benchmark ]
Multimodal Large Language Models (MLLMs) demonstrate strong high-level visual reasoning, yet their foundational understanding of nuanced perceptual details is often overlooked by existing evaluations. To address this, we introduce HueManity, a novel benchmark specifically designed to assess this crucial dimension of MLLM visual understanding. HueManity comprises 83,850 Ishihara-style images with embedded alphanumeric strings, challenging models on precise pattern recognition, a fundamental aspect of visual understanding. Our evaluation of nine MLLMs reveals a profound performance deficit: the best-performing model achieved only 33.6% accuracy on an 'easy' numeric task and 3% on a 'hard' alphanumeric task. This starkly contrasts with human (100% numeric, 95.6% alphanumeric) and fine-tuned ResNet50 (96.5% numeric, 94.5% alphanumeric) performance. These findings uncover a critical gap in MLLMs' fine-grained visual understanding, a limitation not apparent through conventional high-level assessments. HueManity offers a new paradigm for evaluating this specific type of model understanding. We will open-source the dataset and code to foster research towards robust perception in MLLMs.
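For readers unfamiliar with how Ishihara-style stimuli are typically constructed, below is a minimal Python sketch of one way to generate such an image with an embedded character. This is not the authors' pipeline, which the abstract does not describe: the dot-packing strategy, the color palettes (FIGURE_COLORS, GROUND_COLORS), and the glyph_mask / make_stimulus helpers are all illustrative assumptions.

```python
"""Minimal sketch of one Ishihara-style stimulus (illustrative, not the
authors' actual generation pipeline). Technique: render a character as a
binary mask, pack the canvas with random non-overlapping dots, and color
each dot from a 'figure' or 'background' palette depending on whether it
falls inside the glyph."""
import random
from PIL import Image, ImageDraw, ImageFont

SIZE = 512                                          # output canvas, pixels
FIGURE_COLORS = [(214, 126, 44), (229, 158, 80)]    # assumed orange tones
GROUND_COLORS = [(122, 158, 90), (96, 142, 86)]     # assumed green tones


def glyph_mask(char: str, size: int) -> Image.Image:
    """Render `char` as a binary mask, scaled to fill most of the canvas."""
    small = Image.new("L", (32, 32), 0)
    ImageDraw.Draw(small).text((4, 4), char, fill=255,
                               font=ImageFont.load_default())
    bbox = small.getbbox() or (0, 0, 32, 32)        # tight box around glyph
    return small.crop(bbox).resize((int(size * 0.7),) * 2, Image.NEAREST)


def make_stimulus(char: str, n_attempts: int = 4000) -> Image.Image:
    mask = glyph_mask(char, SIZE)
    ox, oy = (SIZE - mask.width) // 2, (SIZE - mask.height) // 2
    img = Image.new("RGB", (SIZE, SIZE), (240, 235, 220))
    draw = ImageDraw.Draw(img)
    placed = []                                     # (x, y, r) of accepted dots
    for _ in range(n_attempts):
        r = random.randint(4, 10)
        x = random.randint(r, SIZE - r)
        y = random.randint(r, SIZE - r)
        # rejection sampling: skip candidates that overlap an existing dot
        if any((x - px) ** 2 + (y - py) ** 2 < (r + pr + 1) ** 2
               for px, py, pr in placed):
            continue
        inside = (0 <= x - ox < mask.width and 0 <= y - oy < mask.height
                  and mask.getpixel((x - ox, y - oy)) > 0)
        color = random.choice(FIGURE_COLORS if inside else GROUND_COLORS)
        draw.ellipse((x - r, y - r, x + r, y + r), fill=color)
        placed.append((x, y, r))
    return img


if __name__ == "__main__":
    make_stimulus("7").save("ishihara_7.png")
```

Under this setup, varying the dot-size range, the luminance contrast between the two palettes, or the character set (numeric vs. alphanumeric) would be natural knobs for producing 'easy' and 'hard' splits of the kind the benchmark describes.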