

Poster

Addressing Concept Mislabeling in Concept Bottleneck Models Through Preference Optimization

Emiliano Penaloza · Tianyue Zhang · Laurent Charlin · Mateo Espinosa Zarlenga

East Exhibition Hall A-B #E-1206
Thu 17 Jul 11 a.m. PDT — 1:30 p.m. PDT

Abstract:

Concept Bottleneck Models (CBMs) propose to enhance the trustworthiness of AI systems by constraining their decisions on a set of human-understandable concepts. However, CBMs typically rely on datasets with assumedly accurate concept labels, an assumption that is often violated in practice and, as we show, can significantly degrade performance. To address this, we introduce the Concept Preference Optimization (CPO) objective, a new loss function based on Direct Preference Optimization, which effectively mitigates the negative impact of concept mislabeling on CBM performance. We provide an analysis of some key properties of the CPO objective, showing that it directly optimizes for the concept's posterior distribution, and contrast it against Binary Cross-Entropy (BCE), where we show that CPO is inherently less sensitive to concept noise. We empirically confirm our analysis, finding that CPO consistently outperforms BCE on three real-world datasets with and without added label noise.
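
The paper defines the CPO objective precisely; as a rough illustration only, the sketch below shows how a DPO-style preference loss over binary concept labels could be written in PyTorch. The function name, the frozen reference model, and the beta temperature are assumptions made for this sketch, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def cpo_style_loss(concept_logits, ref_logits, observed_labels, beta=1.0):
    """Hypothetical DPO-style preference loss over binary concept labels.

    concept_logits:  (batch, n_concepts) logits from the concept encoder being trained
    ref_logits:      (batch, n_concepts) logits from a frozen reference encoder (assumed here)
    observed_labels: (batch, n_concepts) possibly noisy concept annotations in {0, 1}
    """
    observed_labels = observed_labels.float()

    # Log-probability of the observed ("preferred") label and of its flipped
    # ("dispreferred") counterpart under the current model.
    logp_pref = -F.binary_cross_entropy_with_logits(
        concept_logits, observed_labels, reduction="none")
    logp_disp = -F.binary_cross_entropy_with_logits(
        concept_logits, 1.0 - observed_labels, reduction="none")

    # Same quantities under the frozen reference model.
    ref_logp_pref = -F.binary_cross_entropy_with_logits(
        ref_logits, observed_labels, reduction="none")
    ref_logp_disp = -F.binary_cross_entropy_with_logits(
        ref_logits, 1.0 - observed_labels, reduction="none")

    # DPO-style margin: prefer the observed label over its flipped version,
    # measured relative to the reference model and scaled by beta.
    margin = beta * ((logp_pref - ref_logp_pref) - (logp_disp - ref_logp_disp))
    return -F.logsigmoid(margin).mean()
```

Under this reading, a mislabeled concept contributes a bounded, sigmoid-shaped penalty rather than the unbounded log-loss of BCE, which is one intuition for why a preference-based objective can be less sensitive to label noise.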

Lay Summary:

Concept Bottleneck Models (CBMs) are a type of machine learning model that first predicts human-understandable concepts, such as "has a beak" or "is smiling", and then uses those concepts to make a final decision. This design makes the model's reasoning easier to inspect and, importantly, allows users to intervene by correcting mispredicted concepts.

Unfortunately, like many machine learning models, CBMs assume all concept labels are accurate, which isn't realistic. Real-world data is often contaminated with labeling errors due to subjectivity, labeler fatigue, or even standard training tricks such as cropping images, which can accidentally hide important features. Our work introduces a new training method called Concept Preference Optimization (CPO) that makes CBMs more reliable when labels aren't perfect.

Instead of treating every label as correct, CPO compares pairs of labels during training and teaches the model to favor those that seem more trustworthy. We show that CPO improves CBM performance even when many concept labels are wrong. It also helps the model better recognize when it is unsure, a critical ability in high-stakes fields like healthcare or law enforcement.
