Poster
DA-KD: Difficulty-Aware Knowledge Distillation for Efficient Large Language Models
Changyi He · Yifu Ding · Jinyang Guo · Ruihao Gong · Haotong Qin · Xianglong Liu
East Exhibition Hall A-B #E-2507
Large language models are powerful but slow and expensive to train. One way to make them more efficient is to teach a smaller model to copy the behavior of a large one, a process called distillation. Most existing methods, however, waste training time on examples that are already easy for the small model to learn.

We propose a smarter method that focuses only on the hard examples the small model still struggles with. It also uses a better loss to guide the learning process, so the model trains more smoothly and effectively.

Our approach builds smaller models that are just as good as, or even better than, the large ones, while using much less time and computing power. This makes it easier to use advanced language models in everyday applications.
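To make the "focus on hard examples" idea concrete, here is a minimal sketch of a difficulty-aware distillation step. It assumes difficulty is measured by the per-example divergence between student and teacher predictions and that only the hardest fraction of each batch contributes to the loss; the function name, the `keep_ratio` and `temperature` parameters, and the KL-based difficulty score are illustrative choices, not the paper's exact criterion or its improved distillation loss.

```python
import torch
import torch.nn.functional as F

def difficulty_aware_distill_step(student, teacher, input_ids, attention_mask,
                                  keep_ratio=0.5, temperature=2.0):
    """One hypothetical training step that distills only on hard examples.

    Difficulty is approximated by the per-example KL divergence between the
    teacher's and student's token distributions; this stands in for the
    paper's actual difficulty metric and loss.
    """
    with torch.no_grad():
        teacher_logits = teacher(input_ids, attention_mask=attention_mask).logits
    student_logits = student(input_ids, attention_mask=attention_mask).logits

    # Per-token KL(teacher || student), averaged over valid tokens per example.
    t_prob = F.softmax(teacher_logits / temperature, dim=-1)
    s_logprob = F.log_softmax(student_logits / temperature, dim=-1)
    token_kl = (t_prob * (t_prob.clamp_min(1e-9).log() - s_logprob)).sum(-1)  # [B, T]
    mask = attention_mask.float()
    example_difficulty = (token_kl * mask).sum(-1) / mask.sum(-1)             # [B]

    # Keep only the hardest examples in the batch for the distillation loss.
    k = max(1, int(keep_ratio * example_difficulty.numel()))
    hard_idx = example_difficulty.topk(k).indices
    loss = example_difficulty[hard_idx].mean() * temperature ** 2
    return loss
```

In practice, easy examples could instead be dropped before the forward pass to save compute, which is where most of the training-time savings would come from.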