

Poster in Workshop: Actionable Interpretability

Needle in a Patched Haystack: Evaluating Saliency Maps for Vision LLMs.

Bastien Zimmermann · Matthieu Boussard

Sat 19 Jul 10:40 a.m. PDT — 11:40 a.m. PDT

Abstract:

ColPali recently proposed a method for explaining multimodal retrieval-augmented generation (RAG) by visualizing how vision-language models (VLMs) connect image patches to text tokens. However, our theoretical analysis and experiments show that these similarity-based saliency maps are fragile and often misleading. We therefore caution against relying solely on intuitive visualizations and present a principled patch-level dissection technique that traces how vision LLMs actually accumulate evidence across modalities. To address this issue, we introduce Needle-in-a-Patched-Haystack: a patch-centered dataset and metric suite that quantifies transparency by benchmarking localization performance in vision LLMs. Together, our analysis and toolkit establish a stricter standard for VLM interpretability and provide a drop-in evaluation protocol for future research on robust, multimodal explanations.
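To make the object of study concrete, the similarity-based saliency maps the abstract critiques can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes hypothetical patch and query-token embeddings already projected into a shared space, and scores each patch by its maximum cosine similarity to any query token (a simplified ColPali-style late-interaction score).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical embeddings: 196 image patches (a 14x14 grid) and 5 query
# tokens, each projected into a shared 128-dim space by some VLM.
patch_emb = rng.normal(size=(196, 128))
token_emb = rng.normal(size=(5, 128))

def similarity_saliency(patch_emb: np.ndarray, token_emb: np.ndarray) -> np.ndarray:
    """Per-patch saliency: max cosine similarity to any query token
    (simplified late-interaction scoring, for illustration only)."""
    p = patch_emb / np.linalg.norm(patch_emb, axis=1, keepdims=True)
    t = token_emb / np.linalg.norm(token_emb, axis=1, keepdims=True)
    sim = p @ t.T                 # (n_patches, n_tokens) cosine similarities
    return sim.max(axis=1)        # best-matching token score per patch

saliency = similarity_saliency(patch_emb, token_emb)
heatmap = saliency.reshape(14, 14)  # spatial map over the patch grid
```

The resulting heatmap is what gets overlaid on the image; the paper's point is that such maps can look convincing while being fragile, which is why a localization benchmark (rather than visual inspection) is needed.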
