Poster
Gradient Descent Converges Arbitrarily Fast for Logistic Regression via Large and Adaptive Stepsizes
Ruiqi Zhang · Jingfeng Wu · Peter Bartlett
East Exhibition Hall A-B #E-2309
We analyze the convergence of gradient descent (GD) with large, adaptive stepsizes for logistic regression on linearly separable data. The stepsize adapts to the current risk, scaled by a fixed base stepsize \eta. We prove that once the number of iterations t surpasses a margin-dependent threshold, the averaged GD iterate achieves a risk upper bound of \exp(-\Theta(\eta t)), where \eta can be chosen arbitrarily large. This implies that GD attains \emph{arbitrarily fast} convergence rates via large stepsizes, although the risk evolution might not be monotonic. In contrast, prior analyses of GD with adaptive stepsizes require a monotonic risk decrease, limiting their rates to \exp(-\Theta(t)). We further establish a margin-dependent lower bound on the iteration complexity for any first-order method to attain a small risk, justifying the necessity of the burn-in phase in our analysis. Our results generalize to a broad class of loss functions and two-layer networks under additional assumptions.
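As a rough sketch of the setting (the precise stepsize rule and averaging scheme below are assumptions for illustration, not quoted from the paper), one natural instantiation of a risk-adaptive stepsize is
\[
w_{t+1} = w_t - \frac{\eta}{L(w_t)}\,\nabla L(w_t),
\qquad
L(w) = \frac{1}{n}\sum_{i=1}^{n} \log\bigl(1 + e^{-y_i \langle w, x_i \rangle}\bigr),
\]
so that, once t exceeds the margin-dependent burn-in, the averaged iterate \bar{w}_t satisfies L(\bar{w}_t) \le \exp(-\Theta(\eta t)).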
We study how gradient descent (GD) performs with large, adaptive stepsizes when training logistic regression models on linearly separable data. The stepsize changes based on the current loss, scaled by a fixed base stepsize \eta. We prove that after the number of iterations t exceeds a certain threshold, the averaged iterate from GD achieves a small error bounded by \exp(-\Theta(\eta t)), where \eta can be chosen arbitrarily large. This implies that GD attains \emph{arbitrarily fast} convergence rates via large and adaptive stepsizes. Additionally, we show that a minimum number of iterations is necessary for any first-order optimization method to reach a small risk. Our findings also extend to many other loss functions and two-layer neural networks under extra conditions.
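A minimal runnable sketch of this procedure, under the assumption that the adaptive stepsize equals the base stepsize \eta divided by the current empirical logistic risk and that all iterates are averaged (the paper's exact rule and averaging window may differ), might look like the following.

```python
# Hedged sketch: GD with a risk-adaptive stepsize for logistic regression
# on linearly separable data. The stepsize rule eta / L(w_t) and the
# plain averaging of all iterates are illustrative assumptions.
import numpy as np

def logistic_risk(w, X, y):
    # Empirical logistic risk L(w) = mean of log(1 + exp(-y * <w, x>)).
    margins = y * (X @ w)
    return np.mean(np.logaddexp(0.0, -margins))

def risk_gradient(w, X, y):
    # Gradient of L(w); uses 1 / (1 + exp(m)) = exp(-logaddexp(0, m)) for stability.
    margins = y * (X @ w)
    coeffs = -y * np.exp(-np.logaddexp(0.0, margins))
    return (X * coeffs[:, None]).mean(axis=0)

def adaptive_gd(X, y, eta=100.0, T=200):
    # Run GD with stepsize eta / L(w_t) and return the averaged iterate.
    w = np.zeros(X.shape[1])
    iterates = []
    for _ in range(T):
        risk = logistic_risk(w, X, y)
        w = w - (eta / risk) * risk_gradient(w, X, y)
        iterates.append(w.copy())
    return np.mean(iterates, axis=0)

# Toy linearly separable data (assumed, for illustration only).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(50, 2)) + np.array([3.0, 0.0]),
               rng.normal(size=(50, 2)) - np.array([3.0, 0.0])])
y = np.concatenate([np.ones(50), -np.ones(50)])

w_bar = adaptive_gd(X, y)
print("risk of averaged iterate:", logistic_risk(w_bar, X, y))
```

Per the result above, the risk of the averaged iterate should shrink rapidly with the base stepsize \eta once enough iterations have elapsed, even though the risk along individual iterates may oscillate.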