Poster
No Free Lunch from Random Feature Ensembles: Scaling Laws and Near-Optimality Conditions
Benjamin Ruben · William Tong · Hamza Chaudhry · Cengiz Pehlevan
West Exhibition Hall B2-B3 #W-1018
Training large machine-learning models is costly and time-consuming, so practitioners often train several smaller models and average their predictions. This method, known as “ensembling,” lets computation run in parallel across many machines, but is it always the best choice? We explore that question in a clean test bed where each model is simply a collection of random features followed by linear regression. In this setting we prove a clear rule: when the total feature budget is fixed, the lowest possible error comes from putting all features into a single model rather than spreading them across an ensemble.

Next we ask when an ensemble can perform almost as well as a single larger model. The key is how model size compares to the size of the dataset. If each model already has more features than there are training examples, then averaging several such models works nearly as well as simply enlarging a single model.

But when data outnumber features, as in modern language models trained on web-scale corpora, individual model size becomes the factor limiting performance. In this regime, researchers track progress by how accuracy improves as models grow, or “scale.” For ensembles, scaling up can mean adding more ensemble members, adding more features to each member, or a combination of both. We show that even in this data-rich regime, ensembles can keep pace with a single larger model, provided the learning task is sufficiently easy and every member widens quickly enough as the total size increases.
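Below is a minimal sketch, not taken from the paper, of how this comparison can be set up in code: K random-feature ridge-regression models, each built on N/K features, are averaged and compared against a single model that uses all N features. The ReLU nonlinearity, Gaussian projections, ridge penalty, and toy linear target are illustrative assumptions, chosen only to make the example self-contained.

```python
# Sketch (assumptions noted above): random-feature ridge-regression ensembles
# with K members, each using N_total // K random ReLU features, versus a
# single model (K = 1) that uses the full feature budget.
import numpy as np

rng = np.random.default_rng(0)

def random_features(X, W):
    """ReLU random features: phi(x) = max(0, W x)."""
    return np.maximum(X @ W.T, 0.0)

def fit_ridge(Phi, y, lam=1e-3):
    """Closed-form ridge regression on the feature matrix Phi."""
    N = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(N), Phi.T @ y)

def ensemble_mse(K, N_total, X_tr, y_tr, X_te, y_te, d):
    """Average the test predictions of K members, each with N_total // K features."""
    N_member = N_total // K
    preds = np.zeros(len(y_te))
    for _ in range(K):
        W = rng.normal(size=(N_member, d)) / np.sqrt(d)  # fixed random projection
        w = fit_ridge(random_features(X_tr, W), y_tr)
        preds += random_features(X_te, W) @ w
    preds /= K
    return np.mean((preds - y_te) ** 2)

# Toy data: noisy linear target (an illustrative assumption).
d, P, N_total = 20, 200, 400
beta = rng.normal(size=d) / np.sqrt(d)
X_tr, X_te = rng.normal(size=(P, d)), rng.normal(size=(1000, d))
y_tr = X_tr @ beta + 0.1 * rng.normal(size=P)
y_te = X_te @ beta

# Fixed total feature budget N_total, split across K ensemble members.
for K in (1, 2, 4, 8):
    mse = ensemble_mse(K, N_total, X_tr, y_tr, X_te, y_te, d)
    print(f"K={K:2d} members x {N_total // K:3d} features: test MSE = {mse:.4f}")
```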