

Poster

The Devil Is in the Details: Tackling Unimodal Spurious Correlations for Generalizable Multimodal Reward Models

Zichao Li · Xueru Wen · Jie Lou · Yuqiu Ji · Yaojie Lu · Xianpei Han · Debing Zhang · Le Sun

East Exhibition Hall A-B #E-2301
Tue 15 Jul 4:30 p.m. PDT — 7 p.m. PDT

Abstract:

Multimodal Reward Models (MM-RMs) are crucial for aligning Large Language Models (LLMs) with human preferences, particularly as LLMs increasingly interact with multimodal data. However, we find that MM-RMs trained on existing datasets often struggle to generalize to out-of-distribution data because they rely on unimodal spurious correlations, primarily text-only shortcuts within the training distribution, which prevents them from learning true multimodal reward functions. To address this, we introduce a Shortcut-aware MM-RM learning algorithm that mitigates the issue by dynamically reweighting training samples, shifting the distribution toward better multimodal understanding and reducing dependence on unimodal spurious correlations. Our experiments demonstrate significant improvements in generalization, downstream task performance, and scalability, establishing a more robust framework for multimodal reward modeling. Our source code is available at https://github.com/alignrm/Generalizable-MM-RM.

Lay Summary:

Reward models, which teach AI systems what makes a good response to images and text, often take dangerous shortcuts. Our research found these models primarily judge responses based on text patterns alone, ignoring visual information. This causes them to fail in new situations where text-only shortcuts don't work.

We developed a "Shortcut-aware" algorithm that identifies when reward models over-rely on text patterns. It emphasizes training examples where text-only understanding fails, forcing the model to develop genuine comprehension of both visual and textual information together, rather than taking the easier text-only route.

Our approach significantly improves reward models' ability to handle unfamiliar data, making AI systems more reliable when responding to image-based questions. In real-world tests, our models produced better responses with fewer visual errors and hallucinations. This advancement is crucial for developing trustworthy AI assistants that accurately understand both what they see and read.
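To make the reweighting idea concrete, the sketch below shows one way such a scheme could look in PyTorch: pairs that a text-only reward model already ranks correctly are downweighted, while pairs that require the image are emphasized. The function names (`mm_rm`, `text_rm`, `shortcut_aware_loss`), the batch layout, and the specific weighting rule are illustrative assumptions, not the paper's exact formulation; see the repository linked above for the authors' implementation.

```python
# Hypothetical sketch of shortcut-aware reweighting (not the authors' exact method).
# Assumes: `mm_rm` scores (image, prompt, response) triples and `text_rm` is a
# text-only reward model used to detect unimodal shortcuts.

import torch
import torch.nn.functional as F

def shortcut_aware_loss(mm_rm, text_rm, batch):
    # Multimodal reward scores for the (chosen, rejected) response pair.
    r_chosen = mm_rm(batch["image"], batch["prompt"], batch["chosen"])
    r_rejected = mm_rm(batch["image"], batch["prompt"], batch["rejected"])

    with torch.no_grad():
        # Text-only scores: how well a unimodal shortcut already separates the pair.
        t_chosen = text_rm(batch["prompt"], batch["chosen"])
        t_rejected = text_rm(batch["prompt"], batch["rejected"])
        # Upweight pairs the text-only model gets wrong or barely gets right,
        # so training emphasizes examples that need the image to be judged.
        weights = 1.0 - torch.sigmoid(t_chosen - t_rejected)
        weights = weights / weights.mean()  # keep the overall loss scale stable

    # Standard Bradley-Terry preference loss, reweighted per sample.
    per_sample = -F.logsigmoid(r_chosen - r_rejected)
    return (weights * per_sample).mean()
```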
