Spotlight Poster
Training Deep Learning Models with Norm-Constrained LMOs
Thomas Pethick · Wanyun Xie · Kimon Antonakopoulos · Zhenyu Zhu · Antonio Silveti-Falls · Volkan Cevher
East Exhibition Hall A-B #E-3405
In this work, we study optimization methods that leverage the linear minimization oracle (LMO) over a norm ball. We propose a new family of stochastic algorithms that use the LMO to adapt to the geometry of the problem and, perhaps surprisingly, show that they can be applied to unconstrained problems. The resulting update rule unifies several existing optimization methods under a single framework. Furthermore, we propose an explicit choice of norm for deep architectures which, as a side benefit, leads to the transferability of hyperparameters across model sizes. Experimentally, we demonstrate significant speedups on nanoGPT training without any reliance on Adam. The proposed method is also memory-efficient, requiring only one set of model weights and one set of gradients, which can be stored in half-precision.
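To make the update rule concrete, below is a minimal sketch in PyTorch of an LMO-based step under two example norms. The helper names (lmo_sign, lmo_spectral, lmo_step) and the exponential gradient averaging are illustrative assumptions, not the authors' released implementation. Over the l-infinity ball, the LMO returns a sign-descent direction; over the spectral-norm ball, it returns an orthogonalized matrix direction, which is one way such a framework can recover several existing methods as special cases.

```python
import torch

def lmo_sign(g, rho):
    # LMO over the l-infinity ball of radius rho:
    # argmin_{||s||_inf <= rho} <g, s> = -rho * sign(g),
    # i.e., a signSGD-style direction.
    return -rho * torch.sign(g)

def lmo_spectral(G, rho):
    # LMO over the spectral-norm ball of radius rho for a matrix G:
    # argmin_{||S||_op <= rho} <G, S> = -rho * U @ Vh,
    # where G = U diag(sigma) Vh is the reduced SVD.
    U, _, Vh = torch.linalg.svd(G, full_matrices=False)
    return -rho * (U @ Vh)

@torch.no_grad()
def lmo_step(param, grad, buf, lmo, rho=1.0, lr=0.1, beta=0.9):
    # buf holds a running average of stochastic gradients (the optimizer's
    # only state besides the weights, hence the small memory footprint).
    buf.mul_(beta).add_(grad, alpha=1 - beta)   # d_t = beta*d_{t-1} + (1-beta)*g_t
    # Step toward the extreme point of the norm ball most aligned with -d_t,
    # even though the problem itself is unconstrained.
    param.add_(lmo(buf, rho), alpha=lr)         # w_{t+1} = w_t + lr * lmo(d_t)

# Toy usage: a single weight matrix updated with the spectral-norm LMO.
W = torch.randn(64, 32, requires_grad=True)
buf = torch.zeros_like(W)
loss = (W ** 2).sum()
loss.backward()
lmo_step(W, W.grad, buf, lmo_spectral)
```

Note that the only per-parameter state is the single gradient-average buffer, consistent with the memory footprint described in the abstract (one set of weights plus one set of gradients).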
Modern deep learning models are usually trained with algorithms that adapt during training to the structure of the data. In this work, we propose a new family of training methods that instead adapt in advance to the model's structure, using mathematical tools that respect how neural networks are built. Our method leads to faster training, requires less memory, and avoids the need for commonly used algorithms like the Adam optimizer. It also allows settings such as the learning rate to be reused across different model sizes, making it easier to scale up models. We demonstrate that our method can train large models, including popular architectures like GPT and vision transformers, more efficiently.