Spotlight Poster
ABKD: Pursuing a Proper Allocation of the Probability Mass in Knowledge Distillation via $\alpha$-$\beta$-Divergence
Guanghui Wang · Zhiyong Yang · Zitai Wang · Shi Wang · Qianqian Xu · Qingming Huang
East Exhibition Hall A-B #E-2204
Thu 17 Jul 10 a.m. PDT — 11 a.m. PDT
Knowledge Distillation (KD) is a method where a smaller student AI model learns from a larger, more powerful teacher model by imitating its output predictions. This process allows the student to benefit from the richer information in the teacher's output distribution, which goes beyond simple correct/incorrect labels. However, the effectiveness of KD depends heavily on how the difference between teacher and student outputs is measured. Most methods use either the forward Kullback-Leibler divergence (FKLD) or the reverse KL divergence (RKLD), but both have issues: FKLD is too relaxed, letting the student spread its focus too widely, while RKLD is too aggressive, forcing the student to over-focus on a few predictions and ignore useful information.

We find that this problem stems from two competing effects: hardness-concentration (focusing on hard-to-predict examples) and confidence-concentration (focusing on what the student already predicts with confidence). FKLD and RKLD each handle these effects in an extreme way.

To solve this, we introduce ABKD, a new framework based on the α-β divergence, which lets us balance both effects more flexibly. By tuning only the loss function, ABKD improves performance across 17 language and vision tasks, achieving strong results without any additional changes to the model architecture.
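To make the role of the two tunable parameters concrete, below is a minimal sketch of an α-β-divergence distillation loss in PyTorch. It assumes the standard Cichocki-Amari form of the α-β divergence; the function names (alpha_beta_divergence, abkd_loss), the temperature T, and the default hyperparameter values are illustrative and not the authors' reference implementation.

```python
import torch
import torch.nn.functional as F

def alpha_beta_divergence(p, q, alpha=0.5, beta=0.5, eps=1e-8):
    """Cichocki-Amari alpha-beta divergence between distributions p and q.

    Assumes alpha, beta, and alpha + beta are all nonzero; the boundary
    cases (e.g. alpha -> 0 or beta -> 0) are defined by limits in the
    literature and are omitted here for brevity.
    """
    p = p.clamp_min(eps)
    q = q.clamp_min(eps)
    term = (p.pow(alpha) * q.pow(beta)
            - alpha / (alpha + beta) * p.pow(alpha + beta)
            - beta / (alpha + beta) * q.pow(alpha + beta))
    # Sum over the class dimension, then apply the -1/(alpha*beta) scaling.
    return -term.sum(dim=-1) / (alpha * beta)

def abkd_loss(student_logits, teacher_logits, alpha=0.5, beta=0.5, T=1.0):
    """Illustrative distillation loss: alpha-beta divergence between the
    teacher (target) and student softmax distributions, averaged over
    the batch."""
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    q_student = F.softmax(student_logits / T, dim=-1)
    return alpha_beta_divergence(p_teacher, q_student, alpha, beta).mean()
```

In this parameterization, (α, β) near (1, 0) recovers FKLD and (0, 1) recovers RKLD as limiting cases, so sweeping the two parameters interpolates between the hardness-concentration and confidence-concentration extremes described above.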