Poster
The Sharpness Disparity Principle in Transformers for Accelerating Language Model Pre-Training
Jinbo Wang · Mingze Wang · Zhanpeng Zhou · Junchi Yan · Weinan E · Lei Wu
East Exhibition Hall A-B #E-3613
Wed 16 Jul, 11 a.m. – 1:30 p.m. PDT
Abstract:
Transformers have become the cornerstone of modern AI. Unlike traditional architectures, transformers exhibit a distinctive characteristic: diverse types of building blocks, such as embedding layers, normalization layers, self-attention mechanisms, and point-wise feed-forward networks, work collaboratively. Understanding the disparities and interactions among these blocks is therefore important. In this paper, we uncover a clear **sharpness disparity** across these blocks, which intriguingly emerges early in training and persists throughout the training process. Building on this insight, we propose a novel **Blockwise Learning Rate (LR)** strategy to accelerate large language model (LLM) pre-training. Specifically, by integrating Blockwise LR into AdamW, we consistently achieve lower terminal loss and a nearly $2\times$ speedup compared to vanilla AdamW. This improvement is demonstrated across GPT-2 and LLaMA models, with model sizes ranging from 0.12B to 1.1B and datasets including OpenWebText and MiniPile. Finally, we incorporate Blockwise LR into Adam-mini (Zhang et al., 2024), a recently proposed memory-efficient variant of Adam, achieving a combined $2\times$ speedup and $2\times$ memory savings. These results underscore the potential of leveraging the sharpness disparity principle to improve LLM training.
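As a concrete illustration of the idea described above, the sketch below shows one way a blockwise learning rate could be wired into AdamW via PyTorch parameter groups. The name-matching heuristic and the per-block multipliers (`embed`, `norm`, `attn`, `ffn`) are placeholder assumptions for illustration only; they are not the paper's tuned values or its official implementation.

```python
# Minimal sketch: assign a separate learning rate to each block type
# (embedding, normalization, attention, feed-forward) via AdamW parameter groups.
import torch
from torch import nn


def build_blockwise_param_groups(model: nn.Module, base_lr: float):
    # Hypothetical per-block LR multipliers; the actual values in the paper differ.
    multipliers = {"embed": 1.0, "norm": 1.0, "attn": 1.0, "ffn": 1.0}
    groups = {block: [] for block in multipliers}

    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        lname = name.lower()
        # Crude name-based routing of parameters to block types (an assumption,
        # not the paper's grouping rule).
        if "embed" in lname:
            groups["embed"].append(param)
        elif "norm" in lname or "ln" in lname:
            groups["norm"].append(param)
        elif "attn" in lname or "attention" in lname:
            groups["attn"].append(param)
        else:
            groups["ffn"].append(param)

    # One optimizer parameter group per block type, each with its own LR.
    return [
        {"params": params, "lr": base_lr * multipliers[block]}
        for block, params in groups.items()
        if params
    ]


# Usage: pass the per-block groups to AdamW instead of model.parameters().
# optimizer = torch.optim.AdamW(
#     build_blockwise_param_groups(model, base_lr=3e-4),
#     betas=(0.9, 0.95),
#     weight_decay=0.1,
# )
```

Because AdamW already accepts per-group learning rates, this kind of blockwise scaling adds no extra optimizer state beyond vanilla AdamW.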
Lay Summary:
Modern AI systems such as ChatGPT are powered by transformer models, which are made up of different types of blocks working together. In this study, we found that these blocks exhibit a distinct optimization property called sharpness: roughly speaking, how quickly they can learn. Based on this observation, we propose a Blockwise Learning Rate (LR) strategy that adjusts how quickly each block learns, rather than treating them all the same. Our method trains popular models such as GPT-2 and LLaMA much faster and more efficiently, nearly halving the training time and pointing toward cheaper, more accessible AI development.