Oral
Retrieval-Augmented Perception: High-resolution Image Perception Meets Visual RAG
Wenbin Wang · Yongcheng Jing · Liang Ding · Yingjie Wang · Li Shen · Yong Luo · Bo Du · Dacheng Tao
West Ballroom A
Oral 6B: Deep Learning Architectures
Thu 17 Jul 3:30 p.m. — 3:45 p.m. PDT
Abstract:
High-resolution (HR) image perception remains a key challenge in multimodal large language models (MLLMs). To drive progress beyond the limits of heuristic methods, this paper advances the HR perception capabilities of MLLMs by harnessing cutting-edge long-context techniques such as retrieval-augmented generation (RAG). To this end, we present the first study exploring the use of RAG to address HR perception challenges. Specifically, we propose Retrieval-Augmented Perception (RAP), a training-free framework that retrieves and fuses relevant image crops while preserving spatial context via the proposed Spatial-Awareness Layout. To accommodate different tasks, the proposed Retrieved-Exploration Search (RE-Search) dynamically selects the optimal number of crops based on model confidence and retrieval scores. Experimental results on HR benchmarks demonstrate the effectiveness of RAP: LLaVA-v1.5-13B achieves a 43% improvement on $V^*$ Bench and 19% on HR-Bench. Code is available at https://github.com/DreamMr/RAP.
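To make the retrieve-then-fuse idea in the abstract concrete, below is a minimal sketch of a RAP-style pipeline. It is an illustrative assumption, not the released implementation: CLIP stands in for the crop retriever, the grid granularity and the layout logic are simplified, and `confidence_fn` is a hypothetical callable standing in for the MLLM answer confidence that RE-Search uses. See the linked repository for the actual method.

```python
# Sketch of a RAP-style pipeline (illustrative; not the authors' code).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# CLIP acts as a stand-in retriever that scores image crops against the query.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def grid_crops(image, rows=4, cols=4):
    """Split a high-resolution image into a rows x cols grid of crops."""
    w, h = image.size
    cw, ch = w // cols, h // rows
    return [((r, c), image.crop((c * cw, r * ch, (c + 1) * cw, (r + 1) * ch)))
            for r in range(rows) for c in range(cols)]

def retrieval_scores(query, crops):
    """Score every crop against the text query with CLIP image-text similarity."""
    inputs = processor(text=[query], images=[img for _, img in crops],
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        return model(**inputs).logits_per_image.squeeze(1)  # one score per crop

def spatial_layout(selected):
    """Simplified Spatial-Awareness Layout: paste the retrieved crops at their
    original relative grid positions so spatial context is preserved."""
    rs = [r for (r, _), _ in selected]
    cs = [c for (_, c), _ in selected]
    r0, c0 = min(rs), min(cs)
    cw, ch = selected[0][1].size
    canvas = Image.new("RGB", ((max(cs) - c0 + 1) * cw, (max(rs) - r0 + 1) * ch))
    for (r, c), img in selected:
        canvas.paste(img, ((c - c0) * cw, (r - r0) * ch))
    return canvas

def re_search(image, question, confidence_fn, k_candidates=(4, 8, 12)):
    """RE-Search sketch: among candidate crop counts, keep the fused image on
    which the model is most confident (confidence_fn wraps your MLLM)."""
    crops = grid_crops(image)
    order = torch.argsort(retrieval_scores(question, crops),
                          descending=True).tolist()
    best_conf, best_img = float("-inf"), None
    for k in k_candidates:
        idx = sorted(order[:k])  # restore raster order before laying out
        fused = spatial_layout([crops[i] for i in idx])
        conf = confidence_fn(fused, question)
        if conf > best_conf:
            best_conf, best_img = conf, fused
    return best_img
```

With a real MLLM, `confidence_fn` would return something like the probability the model assigns to its generated answer; for a quick smoke test, `confidence_fn=lambda img, q: 0.0` exercises the pipeline end to end.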