

Poster

Revisiting Convergence: Shuffling Complexity Beyond Lipschitz Smoothness

Qi He · Peiran Yu · Ziyi Chen · Heng Huang

West Exhibition Hall B2-B3 #W-520
Thu 17 Jul 4:30 p.m. PDT — 7 p.m. PDT

Abstract:

Shuffling-type gradient methods are favored in practice for their simplicity and rapid empirical performance. Despite extensive development of convergence guarantees under various assumptions in recent years, most require the Lipschitz smoothness condition, which is often not met in common machine learning models. We highlight this issue with specific counterexamples. To address this gap, we revisit the convergence rates of shuffling-type gradient methods without assuming Lipschitz smoothness. Using our stepsize strategy, the shuffling-type gradient algorithm not only converges under weaker assumptions but also matches the current best-known convergence rates, thereby broadening its applicability. We prove convergence rates for the nonconvex, strongly convex, and non-strongly convex cases, each under both random reshuffling and arbitrary shuffling schemes, assuming a general bounded variance condition. Numerical experiments further validate the performance of our shuffling-type gradient algorithm, underscoring its practical efficacy.
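
For readers unfamiliar with the basic template, the following is a minimal Python sketch of a shuffling-type (random reshuffling) gradient method: each epoch permutes the component functions and takes one incremental gradient step per component. The diminishing stepsize schedule eta0 / (t + 1)**alpha is only an illustrative placeholder, not the stepsize strategy analyzed in the paper.

import numpy as np

def shuffling_gradient_method(grad_i, x0, n, epochs, eta0=0.1, alpha=0.5, rng=None):
    """Generic shuffling-type gradient method (random reshuffling).

    grad_i(x, i): gradient of the i-th component function at x.
    x0: initial iterate; n: number of component functions.
    NOTE: the stepsize schedule eta0 / (t + 1)**alpha is a placeholder
    for illustration, not the paper's proposed stepsize strategy.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float).copy()
    for t in range(epochs):
        eta = eta0 / (t + 1) ** alpha       # per-epoch stepsize (illustrative)
        perm = rng.permutation(n)           # reshuffle the component order each epoch
        for i in perm:                      # one full pass over all components
            x -= eta * grad_i(x, i)
    return x

# Usage example on a toy least-squares problem sum_i 0.5 * (a_i^T x - b_i)^2:
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A, b = rng.standard_normal((100, 5)), rng.standard_normal(100)
    grad_i = lambda x, i: (A[i] @ x - b[i]) * A[i]
    x_hat = shuffling_gradient_method(grad_i, np.zeros(5), n=100, epochs=50, rng=rng)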

Lay Summary:

Many AI engineers speed up learning by shuffling the order of training examples, yet proofs of its reliability have existed only for the ideal case where the loss surface is perfectly smooth. Modern models—from image classifiers to language models—violate that tidy assumption, belonging instead to a broader "generalized smoothness" family whose bumps and plateaus were not covered by earlier theory.

Our study shows that shuffling-type gradient methods still converge rapidly under this more permissive generalized-smoothness condition. We supply tight rate guarantees for both easy (convex) and hard (non-convex) objectives and present a simple formula for picking step sizes that keeps training on track in these rougher landscapes. By extending the safety net around a technique practitioners already trust, our results let developers use shuffling with mathematical confidence on today's messier problems and point to step-size schedules that can make training even faster in practice.
