Spotlight Poster
Scaling Collapse Reveals Universal Dynamics in Compute-Optimally Trained Neural Networks
Shikai Qiu · Lechao Xiao · Andrew Wilson · Jeffrey Pennington · Atish Agarwala
East Exhibition Hall A-B #E-3510
Tue 15 Jul 10 a.m. PDT — 11 a.m. PDT
Understanding neural network training dynamics at scale is an important open problem. Although realistic model architectures, optimizers, and data interact in complex ways that make predictive theory challenging, we show that compute-optimally trained models exhibit remarkably precise collective regularities. Specifically, loss curves from models of varying sizes collapse onto a single universal curve when training compute and loss are normalized to unity at the end of training. With learning rate decay, discrepancies between normalized curves fall below the noise floor of individual models' loss curves across random seeds, yielding an exceptionally tight collapse we term "supercollapse." We observe supercollapse across learning rate schedules, datasets, and architectures, including transformers trained on next-token prediction. This collapse breaks down when hyperparameters are scaled suboptimally, providing a practical indicator of proper scaling. We explain these phenomena by connecting collapse to the power-law structure in typical neural scaling laws, and analyzing a simple but effective model of SGD noise dynamics that accurately captures how learning rate schedules deform loss curves away from power laws while preserving universality, and why learning rate decay suppresses variance to enable supercollapse.
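A minimal sketch of the normalization described above, for readers who want to check for this kind of collapse in their own training runs: each curve's compute and loss are divided by their values at the end of training, and the spread across models is measured on a shared grid of normalized compute. The function names and the deviation metric below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def normalize_curve(compute, loss):
    """Rescale one training run so that compute and loss both equal 1
    at the end of training, as described in the abstract."""
    compute = np.asarray(compute, dtype=float)
    loss = np.asarray(loss, dtype=float)
    return compute / compute[-1], loss / loss[-1]

def collapse_deviation(curves, num_points=200):
    """Quantify how tightly a set of normalized loss curves collapse.

    `curves` is a list of (compute, loss) array pairs, one per model size.
    Each curve is normalized and interpolated onto a shared grid of
    normalized compute; the standard deviation across models at every
    grid point is returned (small values indicate tight collapse).
    """
    grid = np.linspace(0.1, 1.0, num_points)
    stacked = []
    for compute, loss in curves:
        c_norm, l_norm = normalize_curve(compute, loss)
        stacked.append(np.interp(grid, c_norm, l_norm))
    return grid, np.std(np.stack(stacked), axis=0)

# Hypothetical usage with loss curves from two model sizes:
# curves = [(compute_small, loss_small), (compute_large, loss_large)]
# grid, spread = collapse_deviation(curves)
```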
We find that the loss curves of neural networks follow nearly identical shapes as models scale up in size and training duration. We present evidence that this surprising phenomenon yields valuable diagnostic information about neural network training dynamics at scale, and we provide a theoretical explanation of the mechanisms behind it.