Poster
QuRe: Query-Relevant Retrieval through Hard Negative Sampling in Composed Image Retrieval
Jaehyun Kwak · Izaaz Inhar · Se-Young Yun · Sung-Ju Lee
East Exhibition Hall A-B #E-1503
Composed Image Retrieval (CIR) retrieves relevant images based on a reference image and accompanying text describing desired modifications. However, existing CIR methods focus only on retrieving the target image and disregard the relevance of other images. This limitation arises because most methods employ contrastive learning, which treats the target image as positive and all other images in the batch as negatives and can therefore inadvertently include false negatives. This may result in retrieving irrelevant images, reducing user satisfaction even when the target image is retrieved. To address this issue, we propose Query-Relevant Retrieval through Hard Negative Sampling (QuRe), which optimizes a reward model objective to reduce false negatives. Additionally, we introduce a hard negative sampling strategy that selects images positioned between two steep drops in relevance scores following the target image, effectively filtering out false negatives. To evaluate CIR models on their alignment with human satisfaction, we create Human-Preference FashionIQ (HP-FashionIQ), a new dataset that explicitly captures user preferences beyond target retrieval. Extensive experiments demonstrate that QuRe achieves state-of-the-art performance on the FashionIQ and CIRR datasets while exhibiting the strongest alignment with human preferences on the HP-FashionIQ dataset. The source code is available at https://github.com/jackwaky/QuRe.
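As a rough illustration of the reward-model idea, the sketch below shows a Bradley-Terry-style pairwise objective that scores a fused query embedding against the target image and a sampled hard negative. This is a minimal sketch under stated assumptions: the function name, cosine-similarity scoring, and tensor shapes are illustrative choices, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(query_emb: torch.Tensor,
                         pos_emb: torch.Tensor,
                         neg_emb: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry-style pairwise objective (illustrative sketch).

    query_emb, pos_emb, neg_emb: (batch, dim) tensors, assumed
    L2-normalized, for the fused query, the target image, and a
    hard-negative image respectively.
    """
    pos_score = (query_emb * pos_emb).sum(dim=-1)  # similarity to target
    neg_score = (query_emb * neg_emb).sum(dim=-1)  # similarity to hard negative
    # -log sigmoid(s_pos - s_neg): minimized when the target
    # outscores the hard negative, without pushing every non-target
    # image away as a standard contrastive loss would.
    return -F.logsigmoid(pos_score - neg_score).mean()
```

Unlike an in-batch contrastive loss, this pairwise form only penalizes the model when a deliberately chosen hard negative outranks the target, which is why the choice of negatives matters.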
Imagine uploading a photo of a shirt and typing “make it blue with short sleeves.” Today’s image search systems, known as Composed Image Retrieval (CIR) models, are trained under a rigid rule: only the exact match counts as correct, and every other image is treated as incorrect. The model is therefore penalized for surfacing many relevant but imperfect matches during training, resulting in mixed-quality retrievals.

Our method, QuRe (Query-Relevant Retrieval), takes a different approach. Instead of judging images in isolation, it teaches the model to compare: is image A more relevant than image B? We carefully select the “B” images, choosing those that fall between two steep drops in the model’s own relevance scores, which ensures they are genuinely different from the target and informative for training. This hard negative sampling better reflects what users actually care about.

QuRe achieves state-of-the-art performance on standard benchmarks and aligns more closely with human preferences on HP-FashionIQ, a new human-annotated dataset we release. This approach improves both the accuracy and the user satisfaction of visual search in e-commerce and media platforms.
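The sketch below illustrates the “between two steep drops” selection described above: rank the gallery by relevance, look at the images below the target, find the two largest score drops, and keep the band of images between them. The function name and the assumption that per-image relevance scores and the target index are available are hypothetical; QuRe's exact windowing may differ.

```python
import torch

def sample_hard_negatives(scores: torch.Tensor, target_idx: int) -> torch.Tensor:
    """Select candidates between the two steepest relevance drops
    after the target (illustrative sketch).

    scores: (N,) relevance of every gallery image to the query.
    Returns the indices of the selected hard negatives.
    Assumes at least three images are ranked below the target,
    so that two distinct drops exist.
    """
    order = torch.argsort(scores, descending=True)            # gallery ranked by relevance
    target_rank = (order == target_idx).nonzero().item()      # target's position in the ranking
    after = order[target_rank + 1:]                           # images ranked below the target
    ranked_scores = scores[after]
    drops = ranked_scores[:-1] - ranked_scores[1:]            # consecutive score drops
    # The two steepest drops bracket the hard-negative band:
    # images above the first drop are likely false negatives,
    # images below the second drop are easy (uninformative) negatives.
    top2 = torch.topk(drops, k=2).indices.sort().values
    start, end = top2[0].item() + 1, top2[1].item()
    return after[start:end + 1]
```

Intuitively, a steep drop right after the target separates plausible matches (potential false negatives) from the rest, and a second steep drop further down separates informative hard negatives from trivially irrelevant images.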