ICML 2025 Retrieval-Augmented Perception: High-resolution Image Perception Meets Visual RAG Oral

Oral

Retrieval-Augmented Perception: High-resolution Image Perception Meets Visual RAG

Wenbin Wang · Yongcheng Jing · Liang Ding · Yingjie Wang · Li Shen · Yong Luo · Bo Du · Dacheng Tao

West Ballroom A

[ Abstract ] [ Visit Oral 6B Deep Learning Architectures ]

Thu 17 Jul 3:30 p.m. — 3:45 p.m. PDT

[ OpenReview]

Abstract: High-resolution (HR) image perception remains a key challenge in multimodal large language models (MLLMs). To drive progress beyond the limits of heuristic methods, this paper advances HR perception capabilities of MLLMs by harnessing cutting-edge long-context techniques such as retrieval-augmented generation (RAG). Towards this end, this paper presents the first study exploring the use of RAG to address HR perception challenges. Specifically, we propose Retrieval-Augmented Perception (RAP), a training-free framework that retrieves and fuses relevant image crops while preserving spatial context using the proposed Spatial-Awareness Layout. To accommodate different tasks, the proposed Retrieved-Exploration Search (RE-Search) dynamically selects the optimal number of crops based on model confidence and retrieval scores. Experimental results on HR benchmarks demonstrate the significant effectiveness of RAP, with LLaVA-v1.5-13B achieving a 43\% improvement on $V^*$ Bench and 19\% on HR-Bench. Code is available at https://github.com/DreamMr/RAP.

Chat is not available.