Spotlight Poster
Retrieval-Augmented Perception: High-resolution Image Perception Meets Visual RAG
Wenbin Wang · Yongcheng Jing · Liang Ding · Yingjie Wang · Li Shen · Yong Luo · Bo Du · Dacheng Tao
Thu 17 Jul 3:30 p.m. PDT — 4:30 p.m. PDT
Understanding high-resolution (HR) images remains a major challenge for Multimodal Large Language Models (MLLMs). This paper tackles HR perception by drawing on recent long-context techniques, notably retrieval-augmented generation (RAG). We propose Retrieval-Augmented Perception (RAP), a training-free framework that, instead of processing the entire HR image at once, divides it into smaller crops, retrieves the crops most relevant to the query, and fuses them in a layout that preserves the image's spatial structure and context. We further introduce RE-Search, which adaptively selects the number of retrieved crops based on the model's confidence and the retrieval scores of the crops. On HR image perception benchmarks, RAP yields substantial gains, improving one model by 43% on a challenging benchmark and by 19% on another.
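As a rough illustration of the retrieve-then-fuse idea described above, the sketch below tiles an HR image into a grid of crops, scores each crop against a query embedding, keeps the top-k crops in their original row-major order so spatial layout is preserved, and picks k via a simple confidence-versus-relevance trade-off in the spirit of RE-Search. This is not the paper's implementation: the helper names (`split_into_crops`, `embed_crop`, `model_confidence`), the histogram stand-in embedding, the candidate k values, and the weighting `alpha` are all illustrative assumptions; a real system would use a vision-language encoder (e.g., CLIP-style scoring) and query the MLLM itself for confidence.

```python
# Illustrative sketch only; stand-ins are marked, not the paper's components.
import numpy as np
from PIL import Image


def split_into_crops(img, grid=4):
    """Tile the HR image into a grid x grid set of crops, keeping each crop's
    (row, col) position so the spatial layout can be reconstructed later."""
    w, h = img.size
    cw, ch = w // grid, h // grid
    return [((r, c), img.crop((c * cw, r * ch, (c + 1) * cw, (r + 1) * ch)))
            for r in range(grid) for c in range(grid)]


def embed_crop(crop):
    """Stand-in embedding (mean color, unit-normalized); a real system would
    embed crops and the text query with a vision-language encoder."""
    v = np.asarray(crop.resize((32, 32)), dtype=np.float64).reshape(-1, 3).mean(axis=0)
    return v / (np.linalg.norm(v) + 1e-8)


def retrieve_crops(img, query_vec, k):
    """Score every crop against the query, keep the top-k, then re-sort them by
    (row, col) so the fused input preserves the original spatial structure."""
    scored = [(pos, crop, float(embed_crop(crop) @ query_vec))
              for pos, crop in split_into_crops(img)]
    top = sorted(scored, key=lambda t: t[2], reverse=True)[:k]
    return sorted(top, key=lambda t: t[0])


def model_confidence(retrieved):
    """Stand-in for querying the MLLM and reading out its answer confidence."""
    return float(np.mean([score for _, _, score in retrieved]))


def re_search(img, query_vec, candidate_ks=(4, 8, 16), alpha=0.5):
    """RE-Search-style loop: pick the number of crops k that best balances
    model confidence against mean retrieval relevance."""
    def objective(k):
        retrieved = retrieve_crops(img, query_vec, k)
        relevance = float(np.mean([score for _, _, score in retrieved]))
        return alpha * model_confidence(retrieved) + (1 - alpha) * relevance
    return max(candidate_ks, key=objective)


if __name__ == "__main__":
    img = Image.new("RGB", (1024, 1024), "white")   # placeholder HR image
    query_vec = np.ones(3) / np.sqrt(3)             # placeholder query embedding
    k = re_search(img, query_vec)
    print(f"selected k={k}, retrieved {len(retrieve_crops(img, query_vec, k))} crops")
```

In the actual method the retrieved crops would be fused and fed back to the MLLM for answering; the stand-ins above only keep the sketch self-contained and runnable.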