Spotlight Poster
ABKD: Pursuing a Proper Allocation of the Probability Mass in Knowledge Distillation via $\alpha$-$\beta$-Divergence
Guanghui Wang · Zhiyong Yang · Zitai Wang · Shi Wang · Qianqian Xu · Qingming Huang
East Exhibition Hall A-B #E-2204
Thu 17 Jul 10 a.m. PDT — 11 a.m. PDT
Knowledge Distillation (KD) is a method where a smaller student AI model learns from a larger, more powerful teacher model by imitating its output predictions. This process allows the student to benefit from the richer information in the teacher's output distribution, which goes beyond simple correct/incorrect labels. However, the effectiveness of KD depends heavily on how the difference between teacher and student outputs is measured. Most methods use either the forward Kullback-Leibler divergence (FKLD) or the reverse KL divergence (RKLD), but both have issues: FKLD is too relaxed, letting the student spread its focus too widely, while RKLD is too aggressive, forcing the student to over-focus on a few predictions and ignore useful information.

We find that this problem stems from two competing effects: hardness-concentration (focusing on hard-to-predict examples) and confidence-concentration (focusing on what the student already predicts with confidence). FKLD and RKLD each handle these effects in an extreme way.

To solve this, we introduce ABKD, a new framework based on the α-β divergence, which lets us balance both effects more flexibly. By tuning only the loss function, ABKD improves performance across 17 language and vision tasks, achieving strong results without any additional changes to the model architecture.
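To make the role of the two tunable parameters concrete, below is a minimal sketch of an α-β-divergence distillation loss in PyTorch. It assumes the standard Cichocki-Amari form of the α-β divergence; the function names (alpha_beta_divergence, abkd_loss), the temperature T, and the default hyperparameter values are illustrative and not the authors' reference implementation.

```python
import torch
import torch.nn.functional as F

def alpha_beta_divergence(p, q, alpha=0.5, beta=0.5, eps=1e-8):
    """Cichocki-Amari alpha-beta divergence between distributions p and q.

    Assumes alpha, beta, and alpha + beta are all nonzero; the boundary
    cases (e.g. alpha -> 0 or beta -> 0) are defined by limits in the
    literature and are omitted here for brevity.
    """
    p = p.clamp_min(eps)
    q = q.clamp_min(eps)
    term = (p.pow(alpha) * q.pow(beta)
            - alpha / (alpha + beta) * p.pow(alpha + beta)
            - beta / (alpha + beta) * q.pow(alpha + beta))
    # Sum over the class dimension, then apply the -1/(alpha*beta) scaling.
    return -term.sum(dim=-1) / (alpha * beta)

def abkd_loss(student_logits, teacher_logits, alpha=0.5, beta=0.5, T=1.0):
    """Illustrative distillation loss: alpha-beta divergence between the
    teacher (target) and student softmax distributions, averaged over
    the batch."""
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    q_student = F.softmax(student_logits / T, dim=-1)
    return alpha_beta_divergence(p_teacher, q_student, alpha, beta).mean()
```

In this parameterization, (α, β) near (1, 0) recovers FKLD and (0, 1) recovers RKLD as limiting cases, so sweeping the two parameters interpolates between the hardness-concentration and confidence-concentration extremes described above.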