

Poster

What If We Recaption Billions of Web Images with LLaMA-3?

Xianhang Li · Haoqin Tu · Mude Hui · Zeyu Wang · Bingchen Zhao · Junfei Xiao · Sucheng Ren · Jieru Mei · Qing Liu · Huangjie Zheng · Yuyin Zhou · Cihang Xie

East Exhibition Hall A-B #E-3305
Tue 15 Jul 11 a.m. PDT — 1:30 p.m. PDT

Abstract: Web-crawled image-text pairs are inherently noisy. Prior studies demonstrate that semantically aligning and enriching the textual descriptions of these pairs can significantly enhance model training across various vision-language tasks, particularly text-to-image generation. However, large-scale investigations in this area remain predominantly closed-source. Our paper aims to bridge this community effort, leveraging the powerful and $\textit{open-sourced}$ LLaMA-3, a GPT-4 level LLM. Our recaptioning pipeline is simple: first, we fine-tune a LLaMA-3-8B powered LLaVA-1.5, and then employ it to recaption ~1.3 billion images from the DataComp-1B dataset. Our empirical results confirm that this enhanced dataset, Recap-DataComp-1B, offers substantial benefits in training advanced vision-language models. For discriminative models like CLIP, we observe an average 3.1% improvement in zero-shot performance across four cross-modal retrieval tasks when training on a mixed set of the original and our captions. For generative models like text-to-image Diffusion Transformers, the generated images exhibit a significant improvement in alignment with users' text instructions, especially in following complex queries. Our project page is https://www.haqtu.me/Recap-Datacomp-1B/.
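As a rough illustration of the recaptioning step described above, the sketch below runs a LLaVA-style captioner over a single web image via the Hugging Face transformers API. The paper's captioner is a fine-tuned LLaMA-3-8B powered LLaVA-1.5; the stock llava-hf/llava-1.5-7b-hf checkpoint, the prompt, and the generation settings used here are stand-in assumptions, not the authors' exact setup.

```python
# Minimal recaptioning sketch. Assumptions: the stock LLaVA-1.5-7B checkpoint
# stands in for the paper's LLaMA-3-8B powered LLaVA-1.5, and the prompt and
# decoding settings are illustrative rather than the authors' configuration.
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # stand-in checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

def recaption(image_url: str) -> str:
    """Generate one synthetic caption for a web image."""
    image = Image.open(requests.get(image_url, stream=True).raw).convert("RGB")
    prompt = "USER: <image>\nPlease describe this image in detail. ASSISTANT:"
    inputs = processor(images=image, text=prompt, return_tensors="pt").to(
        model.device, torch.float16
    )
    output_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)
    text = processor.decode(output_ids[0], skip_special_tokens=True)
    return text.split("ASSISTANT:")[-1].strip()

# At the paper's scale (~1.3B DataComp-1B images), this call would be batched
# and sharded across many GPUs; the single-image function is only for illustration.
```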

Lay Summary:

Training models to understand images usually requires a lot of text describing what is in those images, i.e., captions. But collecting billions of content-rich image-caption pairs to train models is expensive and time-consuming. To address this, we created Recap-DataComp-1B, the first publicly available dataset with one billion synthetic captions generated by a powerful large language model, LLaMA-3. While generating captions isn't new, doing it at this scale is.

This massive dataset helps researchers train and evaluate systems that connect images and language, such as those that generate pictures from text or find images that match a written description. Our experiments show that models trained on Recap-DataComp-1B perform better at understanding long and complex image-text relationships.

By releasing this dataset to the public, we hope to accelerate progress in multimodal systems that learn from both images and text, and to make cutting-edge tools more accessible to everyone. We believe this work sets a new standard for building large-scale, high-quality datasets with synthetic data.
