

Poster

Probing Visual Language Priors in VLMs

Tiange Luo · Ang Cao · Gunhee Lee · Justin Johnson · Honglak Lee

West Exhibition Hall B2-B3 #W-205
[ Project Page ]
Wed 16 Jul 11 a.m. PDT — 1:30 p.m. PDT

Abstract:

Vision-Language Models (VLMs) may over-rely on visual language priors from their training data rather than true visual reasoning. To investigate this, we introduce ViLP, a benchmark featuring deliberately out-of-distribution images synthesized via image generation models and out-of-distribution Q&A pairs. Each question in ViLP is coupled with three potential answers and three corresponding images: one that can be resolved by text priors alone and two that demand visual reasoning. Although humans achieve near-perfect accuracy, modern VLMs falter; for instance, GPT-4o achieves only 66.17% on ViLP. To alleviate this, we propose a self-improving framework in which models generate new VQA data and then apply pixel-level and semantic corruptions to form "good-bad" image pairs for self-training. Our proposed training objective, Image-DPO, compels VLMs to focus more on the actual visual inputs, and we demonstrate its effectiveness on LLaVA-v1.5 and Cambrian. Project Page: ViLP (https://vilp-team.github.io/).
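
The abstract does not give the exact form of the Image-DPO objective, but one natural reading is a standard DPO-style preference loss in which the same ground-truth answer is scored against the clean ("good") image and its corrupted ("bad") counterpart, relative to a frozen reference model. The sketch below is a minimal illustration under that assumption; the function name image_dpo_loss, the beta value, and the dummy inputs are illustrative, not the authors' implementation.

import torch
import torch.nn.functional as F

def image_dpo_loss(logp_good, logp_bad, ref_logp_good, ref_logp_bad, beta=0.1):
    # Each argument: summed log-probability of the same answer tokens,
    # conditioned on the question plus either the clean ("good") image or
    # the corrupted ("bad") image, under the policy or the reference model.
    policy_margin = logp_good - logp_bad        # current (trainable) model
    ref_margin = ref_logp_good - ref_logp_bad   # frozen reference model
    # Reward a larger good-vs-bad margin than the reference model achieves,
    # pushing the policy to ground its answer in the actual image.
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()

# Dummy batch of four good/bad pairs (log-probabilities).
lp_g = torch.tensor([-5.0, -6.0, -4.0, -7.0])
lp_b = torch.tensor([-9.0, -8.0, -6.0, -10.0])
rp_g = torch.tensor([-6.0, -6.0, -5.0, -7.0])
rp_b = torch.tensor([-8.0, -7.0, -6.0, -9.0])
print(image_dpo_loss(lp_g, lp_b, rp_g, rp_b))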

Lay Summary:

Modern AI systems that look at pictures and read text—called vision-language models—sometimes answer by guessing from familiar patterns in their training data instead of truly “seeing” the image. We built a new test, ViLP, that shows them unusual, computer-generated pictures with tricky questions. Each question has three possible answers: one you could guess from general knowledge and two that require careful visual inspection. People get almost everything right, but even the powerful GPT-4o is correct only 66 percent of the time, revealing a heavy reliance on shortcuts.

To help these models improve, we created a self-training recipe called Image-DPO. The model first invents its own image–question pairs, then practices on slightly “corrupted” versions of those images and learns to change its answer whenever the picture changes. This teaches the model to pay closer attention to what it actually sees. Two open-source models, LLaVA-v1.5 and Cambrian, showed clear gains after this training. All code, data, and a demo are freely available at the ViLP website: https://vilp-team.github.io/.
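
As a concrete illustration of the “corrupted” images mentioned above, a pixel-level corruption such as heavy Gaussian blur turns a clean image into a degraded partner that should no longer support the original answer. The helper below is a hypothetical sketch only; make_pair and the blur radius are not from the paper, whose corruption set also includes semantic edits.

from PIL import Image, ImageFilter

def make_pair(image_path, blur_radius=8):
    # "Good" image: the original, which supports the question's answer.
    good = Image.open(image_path).convert("RGB")
    # "Bad" image: a pixel-level corruption that hides the visual evidence.
    bad = good.filter(ImageFilter.GaussianBlur(radius=blur_radius))
    return good, bad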
