Expo Talk Panel
Distillation Scaling Laws
Dan Busbridge · Jason Ramapuram · Russ Webb · Floris Weers
West Ballroom A
Abstract:
Smaller models are cheaper to serve, faster, use less battery, produce less heat, have a lower inference carbon footprint, and are easier for academics to study. Historically, however, small, capable models have been expensive to train. Knowledge distillation can reduce pretraining costs, yet it is poorly understood. We find a distillation scaling law that enables efficient pretraining strategies, bringing our understanding of distillation closer to our understanding of supervised pretraining.
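For readers unfamiliar with the setup the talk builds on, the sketch below shows the standard knowledge-distillation objective (a temperature-softened KL term blended with ordinary cross-entropy, in the style of Hinton et al.). It is an illustrative assumption for context only, not the scaling law or training recipe presented in the talk; the function name, temperature, and mixing weight are all placeholders.

```python
# Minimal sketch of a standard knowledge-distillation loss (illustrative only;
# not the distillation scaling law or recipe from this talk).
import torch
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Blend soft-target KL distillation with hard-label cross-entropy."""
    # Soften both distributions with a temperature before comparing them.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    # Scale the KL term by T^2 so its gradient magnitude stays comparable.
    kd = F.kl_div(log_soft_student, soft_teacher,
                  reduction="batchmean") * temperature ** 2
    # Standard supervised loss against the ground-truth labels.
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce


# Toy usage: random logits for a batch of 4 examples over 10 classes.
student = torch.randn(4, 10)
teacher = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
print(distillation_loss(student, teacher, labels))
```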