Spotlight Poster
Retrieval-Augmented Perception: High-resolution Image Perception Meets Visual RAG
Wenbin Wang · Yongcheng Jing · Liang Ding · Yingjie Wang · Li Shen · Yong Luo · Bo Du · Dacheng Tao
Thu 17 Jul 3:30 p.m. PDT — 4:30 p.m. PDT
Understanding high-resolution (HR) images remains a major challenge for Multimodal Large Language Models (MLLMs). This paper tackles HR perception by drawing on recent long-context techniques, notably retrieval-augmented generation (RAG). We propose Retrieval-Augmented Perception (RAP), a training-free framework that, instead of processing the entire HR image at once, divides it into smaller crops, retrieves the crops most relevant to the query, and fuses them in a layout that preserves the image's spatial structure and context. We further introduce RE-Search, which adaptively selects the number of retrieved crops based on the model's confidence and the retrieval scores of the crops. On HR image perception benchmarks, RAP yields substantial gains, improving one model by 43% on a challenging benchmark and by 19% on another.
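As a rough illustration of the retrieve-then-fuse idea described above, the sketch below tiles an HR image into a grid of crops, scores each crop against a query embedding, keeps the top-k crops in their original row-major order so spatial layout is preserved, and picks k via a simple confidence-versus-relevance trade-off in the spirit of RE-Search. This is not the paper's implementation: the helper names (`split_into_crops`, `embed_crop`, `model_confidence`), the histogram stand-in embedding, the candidate k values, and the weighting `alpha` are all illustrative assumptions; a real system would use a vision-language encoder (e.g., CLIP-style scoring) and query the MLLM itself for confidence.

```python
# Illustrative sketch only; stand-ins are marked, not the paper's components.
import numpy as np
from PIL import Image


def split_into_crops(img, grid=4):
    """Tile the HR image into a grid x grid set of crops, keeping each crop's
    (row, col) position so the spatial layout can be reconstructed later."""
    w, h = img.size
    cw, ch = w // grid, h // grid
    return [((r, c), img.crop((c * cw, r * ch, (c + 1) * cw, (r + 1) * ch)))
            for r in range(grid) for c in range(grid)]


def embed_crop(crop):
    """Stand-in embedding (mean color, unit-normalized); a real system would
    embed crops and the text query with a vision-language encoder."""
    v = np.asarray(crop.resize((32, 32)), dtype=np.float64).reshape(-1, 3).mean(axis=0)
    return v / (np.linalg.norm(v) + 1e-8)


def retrieve_crops(img, query_vec, k):
    """Score every crop against the query, keep the top-k, then re-sort them by
    (row, col) so the fused input preserves the original spatial structure."""
    scored = [(pos, crop, float(embed_crop(crop) @ query_vec))
              for pos, crop in split_into_crops(img)]
    top = sorted(scored, key=lambda t: t[2], reverse=True)[:k]
    return sorted(top, key=lambda t: t[0])


def model_confidence(retrieved):
    """Stand-in for querying the MLLM and reading out its answer confidence."""
    return float(np.mean([score for _, _, score in retrieved]))


def re_search(img, query_vec, candidate_ks=(4, 8, 16), alpha=0.5):
    """RE-Search-style loop: pick the number of crops k that best balances
    model confidence against mean retrieval relevance."""
    def objective(k):
        retrieved = retrieve_crops(img, query_vec, k)
        relevance = float(np.mean([score for _, _, score in retrieved]))
        return alpha * model_confidence(retrieved) + (1 - alpha) * relevance
    return max(candidate_ks, key=objective)


if __name__ == "__main__":
    img = Image.new("RGB", (1024, 1024), "white")   # placeholder HR image
    query_vec = np.ones(3) / np.sqrt(3)             # placeholder query embedding
    k = re_search(img, query_vec)
    print(f"selected k={k}, retrieved {len(retrieve_crops(img, query_vec, k))} crops")
```

In the actual method the retrieved crops would be fused and fed back to the MLLM for answering; the stand-ins above only keep the sketch self-contained and runnable.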