

Poster
in
Workshop: Actionable Interpretability

Fine-Grained Visual Tokens Align with Localized Semantics

Zhuohao Ni · Xiaoxiao Li


Abstract:

Deep learning models achieve high accuracy in skin lesion classification, but their black-box nature impedes clinical adoption. Clinicians rely on visible lesion attributes (e.g., color variegation, border irregularity) as key diagnostic evidence, underscoring the need for models that provide similar attribute-based explanations. We propose FiLMed, an interpretable deep learning framework that addresses this transparency challenge without sacrificing performance. FiLMed is built upon a locality-aligned Vision Transformer (ViT) that associates learned concept tokens with specific image regions, combined with a dynamic multi-token representation that captures diverse manifestations of each concept. This design enables the model to output both the predicted lesion category and a set of human-understandable attributes grounded in their corresponding image locations, offering visual and semantic evidence for its decisions. We evaluated FiLMed on large-scale dermoscopic image datasets; it achieves classification performance comparable to state-of-the-art models while providing faithful attribute-level explanations for each prediction. By bridging the gap between accuracy and interpretability, FiLMed offers a promising solution for transparent and trustworthy AI-assisted skin lesion diagnosis in clinical practice.
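The abstract does not specify implementation details, but the core idea of grounding concept tokens in image regions can be sketched as cross-attention between learned concept tokens and ViT patch tokens: each concept's attention map over patches provides the spatial grounding, and the attended features feed attribute and class heads. All dimensions, head shapes, and variable names below are hypothetical, for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16          # embedding dimension (hypothetical)
n_patches = 49  # e.g. a 7x7 patch grid from a ViT
k = 5           # number of concept tokens, one per clinical attribute

patches = rng.normal(size=(n_patches, d))   # stand-in for ViT patch embeddings
concepts = rng.normal(size=(k, d))          # stand-in for learned concept tokens

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Cross-attention: each concept token attends over the patch tokens.
# The resulting (k, n_patches) map localizes each concept in the image,
# which is the "grounded in their corresponding image locations" part.
attn = softmax(concepts @ patches.T / np.sqrt(d))   # (k, n_patches)
concept_feats = attn @ patches                       # (k, d) attended features

# Hypothetical linear heads: a per-concept attribute score, and a
# lesion-class prediction from the concatenated concept features.
w_attr = rng.normal(size=(d,))
attr_logits = concept_feats @ w_attr                 # (k,) one score per attribute
w_cls = rng.normal(size=(k * d, 3))                  # assume 3 lesion classes
class_logits = concept_feats.reshape(-1) @ w_cls     # (3,) class scores
```

Because each row of `attn` sums to one over the patches, it can be reshaped to the patch grid and overlaid on the image as a per-attribute localization map alongside the class prediction.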
