Poster
Whitened CLIP as a Likelihood Surrogate of Images and Captions
Roy Betser · Meir Yossef Levi · Guy Gilboa
East Exhibition Hall A-B #E-3206
Likelihood approximations for images are nontrivial to compute and can be useful in many applications. We examine the use of Contrastive Language-Image Pre-training (CLIP) to assess the likelihood of images and captions. We introduce Whitened CLIP, a novel transformation of the CLIP latent space via an invertible linear operation. This transformation ensures that each feature in the embedding space has zero mean, unit standard deviation, and no correlation with the other features, yielding an identity covariance matrix. We show that the whitened embedding statistics are well approximated by a standard normal distribution, allowing the log-likelihood to be estimated from the squared Euclidean norm in the whitened space. The whitening procedure is entirely training-free and relies on a precomputed whitening matrix, making it extremely fast. We present several preliminary experiments demonstrating the properties and applicability of these likelihood scores for images and captions. Our code is available at github.com/rbetser/W_CLIP/tree/main.
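The core computation described above is simple enough to sketch directly. Below is a minimal NumPy sketch of the idea, assuming a matrix of precomputed CLIP embeddings from a reference corpus; the variable names, the ZCA-style whitening choice, and the random stand-in data are illustrative assumptions, not the released implementation.

```python
import numpy as np

# Assumed setup: E is an (N, d) matrix of precomputed CLIP embeddings
# (image or caption features) from a reference corpus. The random data
# below is a stand-in so the sketch runs on its own.
rng = np.random.default_rng(0)
E = rng.normal(size=(10_000, 512))

# 1. Whitening: subtract the mean and decorrelate via the covariance matrix,
#    so transformed features have zero mean and identity covariance.
mu = E.mean(axis=0)
cov = np.cov(E - mu, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
W = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + 1e-8)) @ eigvecs.T  # whitening matrix

def whiten(x):
    """Map a CLIP embedding into the whitened space."""
    return (x - mu) @ W

# 2. Likelihood surrogate: under the standard-normal approximation,
#    log p(x) is, up to an additive constant, -0.5 * ||whiten(x)||^2.
def log_likelihood(x):
    z = whiten(x)
    d = z.shape[-1]
    return -0.5 * np.sum(z**2, axis=-1) - 0.5 * d * np.log(2 * np.pi)

# Lower scores (larger squared norm in whitened space) flag atypical inputs.
query = rng.normal(size=512)
print(log_likelihood(query))
```

In this reading, the whitening matrix is computed once offline, so scoring a new image or caption reduces to one mean subtraction, one matrix multiplication, and a squared norm.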
CLIP is a powerful AI model that connects images and text, but it does not indicate how typical or unusual an image or caption is. In this work, we introduce a simple, training-free method that lets CLIP estimate the likelihood of an image or caption from its internal representation. This allows us to detect AI-generated or suspicious images more effectively, assess whether a caption is simple or complex, and identify image domain shifts. Our approach is fast and general and requires no labeled data, making it valuable for tasks such as fake image detection and enhancing AI safety.