Spotlight Poster
Sharp Generalization for Nonparametric Regression by Over-Parameterized Neural Networks: A Distribution-Free Analysis in Spherical Covariate
Yingzhen Yang
East Exhibition Hall A-B #E-2006
Understanding how well neural networks trained by gradient descent (GD) can generalize to new data is a core challenge in modern machine learning. In this work, we study a specific problem: predicting smooth relationships between inputs and outputs (nonparametric regression) using a two-layer neural network trained by GD. We show that, with early stopping, such a network can match the best-known performance of classical kernel methods — a class of powerful, well-understood algorithms — without relying on strong assumptions about the data distribution.

Our results show that even with minimal structural assumptions (requiring only that the input data lie on a sphere), these neural networks achieve the same optimal prediction accuracy as if the data had followed more idealized, structured distributions. This makes our findings more widely applicable.

To prove our results, we developed two new techniques: one that tracks how the network evolves during training, and another that carefully measures the complexity of all functions the network could learn. These tools may be useful in understanding broader classes of learning algorithms.
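To make the setting concrete, below is a minimal, illustrative sketch (not the paper's exact construction or analysis): a two-layer ReLU network whose outer weights are fixed and whose hidden layer is trained by full-batch gradient descent on noisy regression data with covariates on the unit sphere, with training halted by an early-stopping rule based on a held-out validation set. The target function, network width, step size, and stopping rule are all hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_sphere(n, d):
    """Draw n covariates uniformly on the unit sphere S^{d-1}."""
    x = rng.normal(size=(n, d))
    return x / np.linalg.norm(x, axis=1, keepdims=True)

d, n, m = 5, 500, 2000                            # input dim, sample size, network width
f_star = lambda x: np.sin(3 * x[:, 0])            # hypothetical smooth target function
X = sample_sphere(n, d)
y = f_star(X) + 0.1 * rng.normal(size=n)          # noisy regression observations
X_val = sample_sphere(200, d)
y_val = f_star(X_val) + 0.1 * rng.normal(size=200)

W = rng.normal(size=(m, d)) / np.sqrt(d)          # hidden-layer weights (trained)
a = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)  # outer weights (kept fixed)

def predict(W, X):
    # f(x) = sum_r a_r * relu(w_r . x)
    return np.maximum(X @ W.T, 0.0) @ a

lr, max_steps, patience = 0.5, 2000, 50
best_val, best_W, since_best = np.inf, W.copy(), 0
for step in range(max_steps):
    H = X @ W.T                                   # pre-activations, shape (n, m)
    resid = predict(W, X) - y                     # training residuals
    # full-batch gradient of 0.5 * mean squared error w.r.t. the hidden weights
    grad = ((resid[:, None] * (H > 0)) * a).T @ X / n
    W -= lr * grad
    val_err = np.mean((predict(W, X_val) - y_val) ** 2)
    if val_err < best_val:
        best_val, best_W, since_best = val_err, W.copy(), 0
    else:
        since_best += 1
        if since_best >= patience:                # early stopping: halt once validation
            break                                 # error stops improving
print(f"stopped at step {step}, best validation MSE {best_val:.4f}")
```

In this sketch, early stopping plays the role of the regularizer: rather than training to interpolation, the iterate whose held-out error is smallest is retained, which is the kind of stopped-GD estimator whose generalization the paper analyzes.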