Poster
Learning Curves of Stochastic Gradient Descent in Kernel Regression
Haihan Zhang · Weicheng Lin · Yuanshi Liu · Cong Fang
West Exhibition Hall B2-B3 #W-619
Modern machine learning often deals with very high-dimensional data, meaning each data point has many features or variables. In such settings, stochastic gradient descent (SGD), a simple yet powerful algorithm, often performs remarkably well. Our work addresses the question: how effective is SGD when applied to kernel regression, a classic machine learning method, particularly when both the number of data points and the data dimension grow very large? We report a surprising phenomenon: for certain moderately challenging learning problems, SGD achieves optimal sample efficiency when the number of data points scales polynomially with the data dimension. Building on this finding, we further show that SGD can outperform spectral methods, such as kernel ridge regression (KRR), in simpler problem settings.
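
To make the setting concrete, here is a minimal sketch of single-pass SGD for kernel regression in a reproducing kernel Hilbert space: each new sample adds one expansion coefficient, set by a stochastic gradient step on the squared loss. The RBF kernel, the constant step size, and the toy high-dimensional data below are illustrative assumptions, not the specific setup analyzed in the paper.

```python
import numpy as np

def rbf_kernel(x, z, bandwidth=1.0):
    """Gaussian (RBF) kernel between two input vectors (illustrative choice)."""
    return np.exp(-np.sum((x - z) ** 2) / (2 * bandwidth ** 2))

def sgd_kernel_regression(X, y, step_size=0.5, bandwidth=1.0):
    """Single-pass SGD in the RKHS.

    After processing sample t, the estimator has the kernel expansion
    f_t(x) = sum_i alpha[i] * K(X[i], x) over the samples seen so far.
    """
    n = X.shape[0]
    alpha = np.zeros(n)  # expansion coefficients, one per training point
    for t in range(n):
        # Prediction of the current estimator at the incoming point X[t].
        pred = sum(alpha[i] * rbf_kernel(X[i], X[t], bandwidth) for i in range(t))
        # SGD step on the squared loss 0.5 * (f(X[t]) - y[t])^2: the RKHS
        # gradient is (f(X[t]) - y[t]) * K(X[t], .), so the update only
        # touches the coefficient attached to the new point.
        alpha[t] = -step_size * (pred - y[t])

    def predict(x_new):
        return sum(alpha[i] * rbf_kernel(X[i], x_new, bandwidth) for i in range(n))

    return predict

# Toy usage (hypothetical data): high-dimensional inputs, noisy linear target.
rng = np.random.default_rng(0)
d, n = 50, 500
X = rng.standard_normal((n, d)) / np.sqrt(d)
w_true = rng.standard_normal(d)
y = X @ w_true + 0.1 * rng.standard_normal(n)
f_hat = sgd_kernel_regression(X, y)
x_test = rng.standard_normal(d) / np.sqrt(d)
print("prediction at a fresh point:", f_hat(x_test))
```

The single-pass structure is what makes the sample-size versus dimension scaling question natural: each data point is visited once, so the number of gradient steps equals the number of samples.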