

Apple

Expo Talk Panel

Distillation Scaling Laws

Dan Busbridge · Jason Ramapuram · Russ Webb · Floris Weers

West Ballroom A
Sun 13 Jul 5 p.m. PDT — 6 p.m. PDT

Abstract:

Smaller models are cheaper to serve, faster, use less battery, produce less heat, have a lower inference carbon footprint, and are easier for academics to study. Historically, however, small, capable models have been expensive to train. Knowledge distillation can reduce pretraining costs, yet it remains poorly understood. We find a distillation scaling law that enables efficient pretraining strategies, bringing our understanding of distillation closer to our understanding of supervised pretraining.
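For readers unfamiliar with the technique the abstract refers to, the following is a minimal sketch of a standard knowledge-distillation objective in PyTorch, not the setup described in the talk: the student is trained to match a temperature-softened teacher distribution alongside the usual hard-label cross-entropy. The function name and the temperature and alpha hyperparameters are illustrative assumptions.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets,
                      temperature=2.0, alpha=0.5):
    """Blend hard-label cross-entropy with a soft teacher-matching term.

    student_logits, teacher_logits: (batch, vocab) unnormalized scores
    targets: (batch,) ground-truth class/token ids
    temperature, alpha: illustrative values, not taken from the talk
    """
    # Hard-label term: ordinary supervised cross-entropy.
    ce = F.cross_entropy(student_logits, targets)

    # Soft-label term: KL divergence between temperature-scaled distributions,
    # rescaled by T^2 so gradient magnitudes stay comparable across temperatures.
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    return alpha * kl + (1.0 - alpha) * ce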
