Poster in Workshop: DataWorld: Unifying data curation frameworks across domains
DEETS: Detailed Evaluation of Image Text Specificity
Yasumasa Onoe · Hailey Joren · Cyrus Rashtchian · Su Wang · Olivia Wiles · Yonatan Bitton · Brian Gordon · Keran Rong · Austin Waters · Jason Baldridge · Roopal Garg · Radu Soricut · Jordi Pont-Tuset
Keywords: [ Evaluation Metrics ] [ Image Captioning ] [ Image Descriptions ]
Abstract:
Large multimodal models now produce increasingly detailed and generally accurate (but often imperfect) descriptions of images. However, it is both time-consuming and challenging for human annotators to assess the quality of _paragraph-length_ captions, and no reliable automatic metrics are yet available. To address this gap, we propose DEETS, a __reference-free__ metric targeting long ($>$100 words) visual descriptions. DEETS is a model-based metric trained on a new dataset of images paired with two captions: an original model-generated caption and a human-rewritten version that fixes errors and adds details. The DEETS score (a) is informative, measuring (and combining) both descriptive correctness and richness; (b) tracks human-annotated metrics well; and (c) correlates with performance on the downstream tasks we considered ($\rho=0.398$, $\tau=0.287$ with ranking in a text-based retrieval task). It is learned end-to-end, practical (cheap to run), and can serve both as a reliable measure of model quality during training and as a complement to human judgments.
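As a minimal sketch of how rank correlations like those reported above (Spearman's $\rho$, Kendall's $\tau$) can be computed between a metric's scores and downstream-task performance, the snippet below uses SciPy. The arrays are hypothetical placeholders, not DEETS outputs, and this is not the authors' evaluation code.

```python
# Sketch: quantifying how well a caption metric tracks downstream
# performance via rank correlations. Values below are hypothetical.
from scipy.stats import spearmanr, kendalltau

# Hypothetical per-model metric scores and downstream retrieval performance.
metric_scores = [0.62, 0.74, 0.55, 0.91, 0.83]
retrieval_perf = [0.41, 0.48, 0.37, 0.66, 0.58]

# spearmanr / kendalltau each return (statistic, p-value) for 1-D inputs.
rho, rho_p = spearmanr(metric_scores, retrieval_perf)
tau, tau_p = kendalltau(metric_scores, retrieval_perf)
print(f"Spearman rho = {rho:.3f} (p = {rho_p:.3g})")
print(f"Kendall tau  = {tau:.3f} (p = {tau_p:.3g})")
```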