

Poster

Whitened CLIP as a Likelihood Surrogate of Images and Captions

Roy Betser · Meir Yossef Levi · Guy Gilboa

East Exhibition Hall A-B #E-3206
[ Project Page ]
Tue 15 Jul 4:30 p.m. PDT — 7 p.m. PDT

Abstract:

Likelihood approximations for images are not trivial to compute and can be useful in many applications. We examine the use of Contrastive Language-Image Pre-training (CLIP) to assess the likelihood of images and captions. We introduce Whitened CLIP, a novel transformation of the CLIP latent space via an invertible linear operation. This transformation ensures that each feature in the embedding space has zero mean, unit standard deviation, and no correlation with any other feature, yielding an identity covariance matrix. We show that the whitened embedding statistics are well approximated by a standard normal distribution, so the log-likelihood can be estimated from the squared Euclidean norm in the whitened space. The whitening procedure is entirely training-free and relies on a precomputed whitening matrix, making it extremely fast. We present several preliminary experiments demonstrating the properties and applicability of these likelihood scores to images and captions. Our code is available at github.com/rbetser/W_CLIP/tree/main.
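
The sketch below is a minimal illustration of the idea described in the abstract, not the authors' implementation: it uses ZCA-style whitening estimated from a reference set of embeddings as one possible invertible linear whitening (the paper's precomputed whitening matrix may be constructed differently), and the function names and the random stand-in embeddings are hypothetical placeholders for real CLIP features.

```python
import numpy as np

def compute_whitening(embeddings, eps=1e-6):
    """Estimate the mean and a ZCA-style whitening matrix from a reference
    set of embeddings with shape (n_samples, dim)."""
    mu = embeddings.mean(axis=0)
    centered = embeddings - mu
    cov = centered.T @ centered / (len(embeddings) - 1)
    # Eigendecomposition of the covariance; W satisfies W @ cov @ W.T ≈ I,
    # so whitened features have zero mean and identity covariance.
    eigvals, eigvecs = np.linalg.eigh(cov)
    W = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    return mu, W

def log_likelihood_surrogate(embedding, mu, W):
    """Approximate log-likelihood under a standard normal assumption in the
    whitened space: log p(z) ≈ -0.5 * ||W (z - mu)||^2 + const."""
    z_w = W @ (embedding - mu)
    d = embedding.shape[-1]
    const = -0.5 * d * np.log(2.0 * np.pi)
    return -0.5 * float(z_w @ z_w) + const

# Toy usage: random vectors stand in for precomputed CLIP embeddings.
rng = np.random.default_rng(0)
reference = rng.normal(size=(10_000, 512))
mu, W = compute_whitening(reference)
query = rng.normal(size=512)
print(log_likelihood_surrogate(query, mu, W))
```

In practice, the reference set would be CLIP image or caption embeddings, the whitening matrix would be computed once offline, and scoring a new sample reduces to one matrix-vector product and a squared norm, which is why the procedure is training-free and fast.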

Lay Summary:

CLIP is a powerful AI model that connects images and text, but it does not indicate how typical or unusual an image or caption is. In this work, we introduce a simple, training-free method that enables CLIP to estimate the likelihood of an image or caption based on its internal representation. This allows us to detect AI-generated or suspicious images more effectively, assess whether a caption is simple or complex, and identify image domain shifts. Our approach is fast, general, and requires no labeled data, making it valuable for tasks such as fake-image detection and AI safety.
