

Poster

Handling Imbalanced Pseudolabels for Vision-Language Models with Concept Alignment and Confusion-Aware Calibrated Margin

Yuchen Wang · Xuefeng Bai · Xiucheng Li · Weili Guan · Liqiang Nie · Xinyang Chen


Abstract:

Adapting vision-language models (VLMs) to downstream tasks with pseudolabels has gained increasing attention. A major obstacle is that the pseudolabels generated by VLMs tend to be imbalanced, leading to inferior performance. While existing methods have explored various strategies to address this, the underlying causes of imbalance remain insufficiently investigated. To fill this gap, we delve into imbalanced pseudolabels and identify two primary contributing factors: concept mismatch and concept confusion. To mitigate these two issues, we propose a novel framework incorporating concept alignment and confusion-aware calibrated margin mechanisms. The core of our approach lies in enhancing underperforming classes and promoting balanced predictions across categories, thus mitigating imbalance. Extensive experiments on six benchmark datasets with three learning paradigms demonstrate that the proposed method effectively enhances the accuracy and balance of pseudolabels, achieving a relative improvement of 6.29% over the SoTA method. Our code is available at https://github.com/Noahwangyuchen/CAP.
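The abstract does not spell out the mechanism, but the general idea of a confusion-aware calibrated margin can be illustrated with a small sketch. The function below is hypothetical and is not the authors' implementation: it assumes CLIP-style image-text logits and applies a class-wise margin scaled by how often each class is currently pseudolabeled and by how similar its text embedding is to other classes.

```python
import torch

def calibrated_margin_logits(logits, pseudo_counts, text_features, tau=1.0):
    """Illustrative (hypothetical) class-wise margin calibration.

    logits:        [N, C] image-text similarity scores from a VLM
    pseudo_counts: [C]    how often each class appears among current pseudolabels
    text_features: [C, D] L2-normalized class text embeddings
    """
    # Frequency-based margin: rarely pseudolabeled classes get a boost,
    # dominant classes get less of one (log-frequency adjustment).
    freq = pseudo_counts.float().clamp(min=1)
    freq_margin = torch.log(freq / freq.sum())  # negative, closer to 0 for frequent classes

    # Confusion term: classes whose text embeddings are very similar to
    # another class are easier to confuse, so they receive a larger correction.
    sim = text_features @ text_features.t()      # [C, C] cosine similarities
    sim.fill_diagonal_(0)
    confusion = sim.max(dim=1).values            # peak similarity per class

    # Subtract the calibrated margin before assigning pseudolabels.
    adjusted = logits - tau * (1 + confusion) * freq_margin
    return adjusted

# Usage: pseudolabels = calibrated_margin_logits(logits, counts, txt).argmax(dim=1)
```

This sketch only conveys the flavor of rebalancing predictions toward under-represented and easily confused classes; the actual CAP framework is described in the paper and repository.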

Lay Summary:

Modern AI systems that connect images with language—called vision-language models (VLMs)—are being used to label new images without human effort. However, these automatic labels (called pseudolabels) are often unbalanced. That means the model favors some categories over others, which leads to poor performance when applied to real-world tasks. We explored why this imbalance happens and discovered two key issues: the model may fail to extract the precise meaning of certain categories (concept mismatch), or confuse similar-looking ones (concept confusion). To fix this, we designed a new framework called CAP, combining concept alignment to help the model better match text and images, and a confusion-aware calibrated margin to help the model better tell similar categories apart. Our approach leads to more accurate and fair labels across categories. We tested it on six widely-used datasets and three learning setups, showing that it consistently improves results—by over 6% compared to the best existing method.
