Poster
What If We Recaption Billions of Web Images with LLaMA-3?
Xianhang Li · Haoqin Tu · Mude Hui · Zeyu Wang · Bingchen Zhao · Junfei Xiao · Sucheng Ren · Jieru Mei · Qing Liu · Huangjie Zheng · Yuyin Zhou · Cihang Xie
East Exhibition Hall A-B #E-3305
Understanding images usually requires large amounts of text describing what is in them: captions. But collecting billions of content-rich image-caption pairs to train models is expensive and time-consuming. To address this, we created Recap-DataComp-1B, the first publicly available dataset with one billion synthetic captions generated by a captioning model powered by LLaMA-3.

While generating captions isn't new, doing it at this scale is. This massive dataset helps researchers train and evaluate systems that connect images and language, such as those that generate pictures from text or retrieve images matching a written description. Our experiments show that models trained on Recap-DataComp-1B perform better at understanding long and complex image-text relationships.

By releasing this dataset to the public, we hope to accelerate progress in multimodal systems that learn from both images and text, and to make cutting-edge tools more accessible to everyone. We believe this work sets a new standard for building large-scale, high-quality datasets with synthetic data.
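As a rough illustration of how such a release might be consumed, the sketch below streams the recaptioned data with the Hugging Face `datasets` library. The repository ID `UCSC-VLAA/Recap-DataComp-1B` and the column names shown are assumptions for this example, not details stated in the abstract.

```python
# Minimal sketch: streaming a billion-scale recaptioned dataset from the Hugging Face Hub.
# The repository ID and column names are assumptions, not confirmed by this poster.
from datasets import load_dataset

# Stream instead of downloading, since the full dataset pairs roughly one billion
# image URLs with captions and would be impractical to fetch in one go.
ds = load_dataset("UCSC-VLAA/Recap-DataComp-1B", split="train", streaming=True)

# Inspect a few rows: each row is expected to pair an image URL with its original
# web caption and the new synthetic recaption.
for i, row in enumerate(ds):
    print(row)  # e.g. {"url": ..., "org_caption": ..., "re_caption": ...} (assumed schema)
    if i >= 2:
        break
```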