Spotlight Poster
Rapid Overfitting of Multi-Pass SGD in Stochastic Convex Optimization
Shira Vansover-Hager · Tomer Koren · Roi Livni
West Exhibition Hall B2-B3 #W-810
AI models are often trained using an algorithm called stochastic gradient descent (SGD), which processes data iteratively and updates the model based on each example it sees. In its basic form, SGD makes a single pass over the training data and is known to generalize well, meaning it performs well on new, unseen examples.

In practice, however, it's common to make multiple passes over the same data to improve performance. This raises a key question: what are the limits of reusing data in SGD when it comes to generalization?

Our research shows that generalization can break down surprisingly quickly, even after just one additional pass. In cases where the one-pass version performs optimally, a second pass can already lead to catastrophic overfitting, where the model memorizes the training data instead of learning patterns that apply more broadly.

We analyze this behavior and identify a kind of phase transition after the first pass, where generalization begins to break down. These findings reveal a gap between theory and practice, pointing to the need for new theoretical tools to understand why multi-pass training often appears to succeed in practice, despite these fundamental limitations.
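To make the one-pass versus multi-pass distinction concrete, here is a minimal Python sketch on a synthetic least-squares problem. Every choice in it (dimensions, sample sizes, noise level, step size) is an illustrative assumption and not a construction from the paper; on such a benign toy problem the gap between training and test loss stays small, whereas the paper shows there exist convex problems where even a second pass already makes this gap large.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy stochastic convex problem: noisy linear regression with synthetic data.
    # All sizes and the step size below are illustrative, not the paper's setup.
    d, n_train, n_test = 50, 100, 10_000
    w_true = rng.normal(size=d) / np.sqrt(d)

    def sample(n):
        X = rng.normal(size=(n, d))
        y = X @ w_true + 0.5 * rng.normal(size=n)
        return X, y

    X_tr, y_tr = sample(n_train)
    X_te, y_te = sample(n_test)

    def loss(w, X, y):
        # Average squared error of the linear model w on the dataset (X, y).
        return 0.5 * np.mean((X @ w - y) ** 2)

    def sgd(num_passes, lr=0.01):
        # Run SGD; each pass reuses the same n_train examples in shuffled order.
        w = np.zeros(d)
        for _ in range(num_passes):
            for i in rng.permutation(n_train):
                grad = (X_tr[i] @ w - y_tr[i]) * X_tr[i]  # gradient on one example
                w -= lr * grad
        return w

    for k in (1, 2, 5):
        w = sgd(num_passes=k)
        print(f"passes={k}: train loss={loss(w, X_tr, y_tr):.3f}, "
              f"test loss={loss(w, X_te, y_te):.3f}")

The only difference between the one-pass and multi-pass regimes in this sketch is the outer loop reusing the same training examples; the paper's question is how much that reuse can cost in terms of the train-test gap.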