Poster in Affinity Workshop: New In ML
Rethinking Learning from Label Proportions in the Era of Large-Scale Vision-Language Models
Learning from Label Proportions (LLP) is a challenging weakly-supervised problem in which instance-level classifiers are trained using only aggregated, bag-level label proportions. Existing methods typically rely on complex models or iterative pseudo-labeling to deconvolve this ambiguous signal. In this paper, we challenge this paradigm by introducing large-scale vision-language models (VLMs) to the LLP domain for the first time. We show that a straightforward framework, built on a lightly fine-tuned CLIP encoder, substantially outperforms all state-of-the-art methods. We posit that this is because the powerful semantic feature space provided by VLMs fundamentally simplifies the LLP problem, reducing it to a nearly linearly separable one. We verify this hypothesis, and the robustness of our method, through in-depth visualizations and ablation studies. Our work not only contributes a new, powerful, and easy-to-reproduce baseline for the LLP field but, more importantly, reveals a transferable principle: leveraging the prior knowledge in foundation models to simplify classic weakly-supervised learning problems.
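To make the LLP training signal concrete, the sketch below trains a linear head on top of frozen image features using only a bag-level proportion-matching loss (cross-entropy between the bag's label proportions and the mean predicted class distribution). This is a minimal illustration of the generic LLP setup, not the paper's exact recipe: the random stand-in features (in place of real CLIP embeddings), the shapes, the learning rate, and the single-bag loop are all placeholder assumptions.

```python
# Minimal LLP sketch: a linear classifier trained from bag-level label
# proportions only. The features here are random stand-ins for CLIP image
# embeddings; all hyperparameters are illustrative assumptions.
import torch


def proportion_loss(logits: torch.Tensor, bag_proportions: torch.Tensor) -> torch.Tensor:
    """Cross-entropy between the known bag-level label proportions and the
    mean predicted class distribution over the instances in the bag."""
    mean_pred = torch.softmax(logits, dim=-1).mean(dim=0)  # (num_classes,)
    return -(bag_proportions * torch.log(mean_pred + 1e-8)).sum()


num_classes, feat_dim, bag_size = 10, 512, 16  # hypothetical sizes
linear_head = torch.nn.Linear(feat_dim, num_classes)
optimizer = torch.optim.AdamW(linear_head.parameters(), lr=1e-3)

# One bag of 16 instances; in practice these would be CLIP image features.
features = torch.randn(bag_size, feat_dim)
bag_proportions = torch.tensor([0.5, 0.5] + [0.0] * (num_classes - 2))

for step in range(100):
    optimizer.zero_grad()
    loss = proportion_loss(linear_head(features), bag_proportions)
    loss.backward()
    optimizer.step()
```

Note that no instance-level labels appear anywhere: the only supervision is the per-bag proportion vector, which is what makes the problem weakly supervised.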