Poster
in
Workshop: Actionable Interpretability
Needle in a Patched Haystack: Evaluating Saliency Maps for Vision LLMs.
Bastien Zimmermann · Matthieu Boussard
\emph{ColPali} recently proposed a method for explaining multimodal retrieval-augmented generation (RAG) by visualizing how vision–language models (VLMs) connect image patches to text tokens. However, our theoretical analysis and experiments show that these similarity-based saliency maps are fragile and often misleading. We therefore caution against relying solely on intuitive visualizations and present a principled patch-level dissection technique that traces how vision LLMs actually accumulate evidence across modalities. To evaluate this issue systematically, we introduce \emph{Needle-in-a-Patched-Haystack}: a patch-centered dataset and metric suite that quantifies transparency by benchmarking localization performance in vision LLMs. Together, our analysis and toolkit establish a stricter standard for VLM interpretability and provide a drop-in evaluation protocol for future research on robust, multimodal explanations.