

Poster in Workshop: 3rd Workshop on High-dimensional Learning Dynamics (HiLD)

Towards Understanding Orthogonalization in Muon

Valentyn Boreiko · Zhiqi Bu · Sheng Zha


Abstract: Muon is a recent optimizer that relies on matrix orthogonalization of updates and has been shown to improve large language model (LLM) training. It does so by introducing additional momentum and a Newton-Schulz iteration to the stochastic spectral descent (SSD) method. However, it incurs higher communication cost when tensor parallelism is enabled, and its hyperparameter transfer properties are not yet fully explored. We first introduce block-wise orthogonalization, splitting weight matrices into independent tiles that are orthogonalized separately and recombined, and we empirically analyze its influence on training. This retains the validation loss while allowing up to $16\times$ tensor-parallel splits of weight matrices. Second, we show that under spectral regularization a single learning rate transfers when depth, model width, and token count are co-scaled under Chinchilla guidelines. Finally, we show that a higher weight decay value of $0.1$ underperforms during the first 80\% of training but outperforms lower values after that, which can be attributed to the tighter spectral norm constraint. Based on this, we propose weight decay clipping and scheduling to capture both regimes. The code is available at https://anonymous.4open.science/r/MuonSBW-23A2.
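To make the block-wise orthogonalization idea concrete, below is a minimal sketch, not the authors' implementation: it splits a weight-matrix update into column tiles, orthogonalizes each tile independently with a Newton-Schulz iteration (the quintic coefficients match those used in public Muon code), and concatenates the results. The function names, tile axis, and split count of 16 are illustrative assumptions.

```python
# Sketch: block-wise orthogonalization of an update matrix via Newton-Schulz.
# Not the authors' code; tile axis, step count, and names are assumptions.
import torch


def newton_schulz_orthogonalize(g: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately map g to the nearest semi-orthogonal matrix."""
    a, b, c = 3.4445, -4.7750, 2.0315      # quintic coefficients from public Muon code
    x = g / (g.norm() + 1e-7)              # scale so the iteration converges
    transposed = x.shape[0] > x.shape[1]   # iterate on the "wide" orientation
    if transposed:
        x = x.T
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * s @ s) @ x
    return x.T if transposed else x


def blockwise_orthogonalize(g: torch.Tensor, splits: int) -> torch.Tensor:
    """Split g into column tiles, orthogonalize each tile independently,
    then recombine -- each tile can be processed on its own tensor-parallel rank."""
    tiles = torch.chunk(g, splits, dim=1)
    return torch.cat([newton_schulz_orthogonalize(t) for t in tiles], dim=1)


if __name__ == "__main__":
    grad = torch.randn(1024, 4096)
    update = blockwise_orthogonalize(grad, splits=16)  # e.g. a 16-way split
    print(update.shape)  # torch.Size([1024, 4096])
```

Because each tile is orthogonalized locally, no cross-rank communication is needed inside the Newton-Schulz step under tensor parallelism, which is the communication saving the abstract alludes to.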
