Poster
LLMs on the Line: Data Determines Loss-to-Loss Scaling Laws
Prasanna Mayilvahanan · Thaddäus Wiedemer · Sayak Mallick · Matthias Bethge · Wieland Brendel
East Exhibition Hall A-B #E-2500
Scaling laws guide the development of large language models (LLMs) by offering estimates for the optimal balance of model size, tokens, and compute. More recently, loss-to-loss scaling laws that relate losses across pretraining datasets and downstream tasks have emerged as a powerful tool for understanding and improving LLM performance and generalization. In this work, we investigate which factors most strongly influence loss-to-loss scaling. Our experiments reveal that the pretraining data determines the scaling trend. In contrast, model size, optimization hyperparameters, tokenizer, and even significant architectural differences, such as between transformer-based models like Llama and state-space models like Mamba, generally have limited impact. Consequently, practitioners should carefully curate pretraining datasets for optimal downstream performance, while architectures and other settings can be freely optimized for training efficiency.
Our work indicates that if two models with different training setups (architecture, context length, tokenizer, etc.), trained on the same data, achieve similar training losses, they will exhibit closely matched downstream test performance.
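To make the notion of a loss-to-loss scaling law concrete, the sketch below fits a shifted power law of the form L_down ≈ K · (L_train − E_train)^κ + E_down, mapping pretraining loss to downstream loss, to a handful of (train loss, downstream loss) pairs. This functional form follows prior loss-to-loss work, but the data points, the function name loss_to_loss, and the fitting setup are illustrative assumptions, not the authors' code.

```python
# Minimal sketch (assumed form, synthetic data): fit a loss-to-loss
# scaling law L_down ~ K * (L_train - E_train)^kappa + E_down.
import numpy as np
from scipy.optimize import curve_fit

def loss_to_loss(train_loss, K, kappa, E_train, E_down):
    """Shifted power law mapping pretraining loss to downstream loss."""
    # Clamp the base to stay positive for fractional exponents.
    return K * np.maximum(train_loss - E_train, 1e-9) ** kappa + E_down

# Hypothetical (train loss, downstream loss) pairs from models of
# different sizes, all trained on the same pretraining dataset.
train_losses = np.array([3.20, 2.90, 2.70, 2.50, 2.35, 2.20])
down_losses = np.array([4.10, 3.60, 3.30, 3.05, 2.90, 2.75])

params, _ = curve_fit(
    loss_to_loss, train_losses, down_losses,
    p0=[1.0, 1.0, 1.5, 1.0], maxfev=10000,
)
K, kappa, E_train, E_down = params
print(f"K={K:.3f}, kappa={kappa:.3f}, "
      f"E_train={E_train:.3f}, E_down={E_down:.3f}")

# If the paper's finding holds, a fit like this should transfer across
# architectures, tokenizers, and hyperparameters, but not across
# pretraining datasets: for a new model on the same data, its training
# loss alone should predict its downstream loss.
print("predicted downstream loss at train loss 2.1:",
      loss_to_loss(2.1, *params))
```

Under the paper's result, the interesting comparison is across pretraining datasets: refitting this curve on models trained on a different corpus should yield visibly different parameters, while swapping architecture or optimizer within the same corpus should not.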