Poster
RealRAG: Retrieval-augmented Realistic Image Generation via Self-reflective Contrastive Learning
Yuanhuiyi Lyu · Xu Zheng · Lutao Jiang · Yibo Yan · Xin Zou · Huiyu Zhou · Linfeng Zhang · Xuming Hu
West Exhibition Hall B2-B3 #W-300
Recent text-to-image generative models, e.g., Stable Diffusion V3 and Flux, have achieved notable progress. However, these models are restricted to the limited knowledge encoded in their fixed parameters, which are trained on closed datasets. This leads to significant hallucinations or distortions when they face fine-grained or unseen novel real-world objects, e.g., the appearance of the Tesla Cybertruck. To this end, we present the first real-object-based retrieval-augmented generation framework (RealRAG), which augments fine-grained and unseen novel object generation by learning to retrieve real-world images that fill the knowledge gaps of generative models. Specifically, to integrate the missing memory needed for unseen novel object generation, we train a reflective retriever with self-reflective contrastive learning, which injects the generator's knowledge into self-reflective negatives, ensuring that the retrieved images compensate for the model's missing knowledge. Furthermore, the real-object-based framework supplies fine-grained visual knowledge to the generative models, tackling the distortion problem and improving realism for fine-grained object generation. RealRAG is modular and applies to all types of state-of-the-art text-to-image generative models, delivering remarkable performance boosts with all of them, such as a 16.18% improvement in FID with the auto-regressive model on the Stanford Cars benchmark.
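To make the self-reflective contrastive learning idea concrete, below is a minimal sketch of a retriever training loss in which the positives are embeddings of real reference images and the extra negatives come from images produced by the base generator itself. This is an illustrative assumption of how such a loss could look, not the authors' released code; all function names, tensor shapes, and the temperature value are hypothetical.

```python
# Sketch: contrastive retriever loss with generator-derived ("self-reflective") negatives.
# Names and shapes are illustrative assumptions, not the paper's implementation.
import torch
import torch.nn.functional as F


def self_reflective_contrastive_loss(
    text_emb: torch.Tensor,       # (B, D) retriever embeddings of the text queries
    real_emb: torch.Tensor,       # (B, D) embeddings of matching real-world images (positives)
    generated_emb: torch.Tensor,  # (B, D) embeddings of the base generator's own outputs,
                                  # used as additional "self-reflective" negatives
    temperature: float = 0.07,
) -> torch.Tensor:
    # Normalize so dot products are cosine similarities.
    q = F.normalize(text_emb, dim=-1)
    pos = F.normalize(real_emb, dim=-1)
    neg = F.normalize(generated_emb, dim=-1)

    # In-batch similarities: the diagonal of q @ pos.T holds the positives,
    # off-diagonal entries act as standard in-batch negatives.
    sim_real = q @ pos.T / temperature   # (B, B)
    # Every similarity to a generator output is a negative, pushing the retriever
    # toward real images containing knowledge the generator lacks.
    sim_gen = q @ neg.T / temperature    # (B, B)

    logits = torch.cat([sim_real, sim_gen], dim=1)        # (B, 2B)
    targets = torch.arange(q.size(0), device=q.device)    # index of each positive
    return F.cross_entropy(logits, targets)


if __name__ == "__main__":
    B, D = 8, 512
    loss = self_reflective_contrastive_loss(
        torch.randn(B, D), torch.randn(B, D), torch.randn(B, D)
    )
    print(loss.item())
```

In this reading, the generator's outputs serve as hard negatives that encode what the generator already "knows", so minimizing the loss steers the retriever toward real images that supply the complementary, missing visual knowledge.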
Text-to-image models, like Stable Diffusion V3 and Flux, have made impressive strides in generating images from text. However, these models often struggle when asked to generate highly specific or unseen objects, leading to strange or distorted results. For instance, they may fail to accurately generate new or detailed objects, such as a Tesla Cybertruck, because they only know what they have been trained on.

To address this issue, we developed a new framework called RealRAG, which enhances text-to-image generation by incorporating real-world images to fill the gaps in a model's knowledge. We introduce a novel approach called self-reflective contrastive learning to ensure the model retrieves relevant real-world images, allowing it to generate more realistic and accurate images of unfamiliar objects.

RealRAG can be applied to any state-of-the-art text-to-image model and significantly improves its performance. For example, it improved the image quality (FID) of an auto-regressive model by 16.18%, demonstrating its ability to generate high-quality images of fine-grained objects.