Poster
Reasoning Limitations of Multimodal Large Language Models. A case study of Bongard Problems
Mikołaj Małkiński · Szymon Pawlonka · Jacek Mańdziuk
East Exhibition Hall A-B #E-3307
Humans are good at recognizing abstract patterns in images, like those found in IQ tests. But can advanced AI models do the same? Our study investigates whether multimodal large language models, which can process both images and text, can solve Bongard Problems, a type of visual-textual reasoning challenge. We tested several advanced AI models on tasks involving synthetic and real-world images. While these models showed some success on real-world tasks, they struggled with synthetic puzzles, which are more abstract. To probe this issue, we created Bongard-RWR, a new dataset that represents abstract concepts using real-world images. Our findings suggest that the models’ difficulties are not solely due to an unfamiliar synthetic image domain but stem from fundamental limitations in how they understand visual concepts. This highlights the need to improve the abstract reasoning capabilities of these models, a step toward more human-like reasoning.