

Poster in Workshop: 3rd Workshop on High-dimensional Learning Dynamics (HiLD)

Implicit Bias of Spectral Descent and Muon on Multiclass Separable Data

Chen Fan · Mark Schmidt · Christos Thrampoulidis


Abstract:

Different gradient-based methods for optimizing overparameterized models can all achieve zero training error yet converge to distinctly different solutions inducing different generalization properties. We provide the first complete characterization of implicit optimization bias for p-norm normalized steepest descent (NSD) and momentum steepest descent (NMD) algorithms in multi-class linear classification with cross-entropy loss. Our key theoretical contribution is proving that these algorithms converge to solutions maximizing the margin with respect to the classifier matrix's p-norm, with established convergence rates. These results encompass important special cases including Spectral Descent and Muon, which we show converge to max-margin solutions with respect to the spectral norm. A key insight of our contribution is that the analysis of general entry-wise and Schatten p-norms can be reduced to the analysis of NSD/NMD with max-norm by exploiting a natural ordering property between all p-norms relative to the max-norm and its dual sum-norm. For the specific case of descent with respect to the max-norm, we further extend our analysis to include preconditioning, showing that Adam converges to the matrix's max-norm solution. Our results demonstrate that the multi-class linear setting, which is inherently richer than the binary counterpart, provides the most transparent framework for studying implicit biases of matrix-parameter optimization algorithms.
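As context for the abstract, below is a minimal sketch of a spectral-norm normalized steepest descent step on multi-class linear classification with cross-entropy loss, i.e., the orthogonalized-gradient update underlying Spectral Descent and (momentum-free) Muon. It is an illustration under simple assumptions, not the paper's implementation; the function name, step size, and toy data are hypothetical.

```python
import numpy as np

def spectral_descent_step(W, X, y, lr=0.05):
    """One normalized steepest-descent step w.r.t. the spectral norm.

    The steepest-descent direction under the spectral norm is U V^T,
    where G = U S V^T is the SVD of the gradient of the cross-entropy
    loss (the "orthogonalized gradient"; Muon adds momentum on top).
    """
    n = X.shape[0]
    logits = X @ W.T                            # (n, num_classes)
    logits -= logits.max(axis=1, keepdims=True) # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    probs[np.arange(n), y] -= 1.0               # softmax minus one-hot labels
    G = probs.T @ X / n                         # gradient w.r.t. W, shape of W
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return W - lr * (U @ Vt)                    # unit-spectral-norm direction

# toy usage on random data (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = rng.integers(0, 3, size=100)
W = np.zeros((3, 10))
for _ in range(200):
    W = spectral_descent_step(W, X, y)
```

On separable data, the abstract's result says iterates of this kind align in direction with the classifier matrix that maximizes the multi-class margin measured in the spectral norm.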
