Poster
No Free Lunch from Random Feature Ensembles: Scaling Laws and Near-Optimality Conditions
Benjamin Ruben · William Tong · Hamza Chaudhry · Cengiz Pehlevan
West Exhibition Hall B2-B3 #W-1018
Training large machine-learning models is costly and time-consuming, so practitioners often train several smaller models and average their predictions. This method, known as “ensembling,” lets computation run in parallel across many machines, but is it always the best choice? We explore that question in a clean test bed where each model is simply a collection of random features followed by linear regression. In this setting we prove a clear rule: when the total feature budget is fixed, the lowest possible error comes from putting all features into a single model rather than spreading them across an ensemble.

Next we ask when an ensemble can perform almost as well as a single larger model. The key is how model size compares to the size of the dataset. If each model already has more features than there are training examples, then averaging several such models works nearly as well as simply enlarging a single model.

But when data outnumber features, as in modern language models trained on web-scale corpora, individual model size becomes the factor limiting performance. In this regime, researchers track progress by how accuracy improves as models grow, or “scale.” For ensembles, scaling up can mean adding more ensemble members, adding more features to each member, or a combination of both. We show that even in this data-rich regime, ensembles can keep pace with a single larger model, provided the learning task is sufficiently easy and every member widens quickly enough as the total size increases.
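Below is a minimal sketch, not taken from the paper, of how this comparison can be set up in code: K random-feature ridge-regression models, each built on N/K features, are averaged and compared against a single model that uses all N features. The ReLU nonlinearity, Gaussian projections, ridge penalty, and toy linear target are illustrative assumptions, chosen only to make the example self-contained.

```python
# Sketch (assumptions noted above): random-feature ridge-regression ensembles
# with K members, each using N_total // K random ReLU features, versus a
# single model (K = 1) that uses the full feature budget.
import numpy as np

rng = np.random.default_rng(0)

def random_features(X, W):
    """ReLU random features: phi(x) = max(0, W x)."""
    return np.maximum(X @ W.T, 0.0)

def fit_ridge(Phi, y, lam=1e-3):
    """Closed-form ridge regression on the feature matrix Phi."""
    N = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(N), Phi.T @ y)

def ensemble_mse(K, N_total, X_tr, y_tr, X_te, y_te, d):
    """Average the test predictions of K members, each with N_total // K features."""
    N_member = N_total // K
    preds = np.zeros(len(y_te))
    for _ in range(K):
        W = rng.normal(size=(N_member, d)) / np.sqrt(d)  # fixed random projection
        w = fit_ridge(random_features(X_tr, W), y_tr)
        preds += random_features(X_te, W) @ w
    preds /= K
    return np.mean((preds - y_te) ** 2)

# Toy data: noisy linear target (an illustrative assumption).
d, P, N_total = 20, 200, 400
beta = rng.normal(size=d) / np.sqrt(d)
X_tr, X_te = rng.normal(size=(P, d)), rng.normal(size=(1000, d))
y_tr = X_tr @ beta + 0.1 * rng.normal(size=P)
y_te = X_te @ beta

# Fixed total feature budget N_total, split across K ensemble members.
for K in (1, 2, 4, 8):
    mse = ensemble_mse(K, N_total, X_tr, y_tr, X_te, y_te, d)
    print(f"K={K:2d} members x {N_total // K:3d} features: test MSE = {mse:.4f}")
```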