Poster
MERIT: Maximum-normalized Element-wise Ratio for Language Model Large-batch Training
Yang Luo · Zangwei Zheng · Ziheng Qin · Zirui Zhu · Yong Liu · Yang You
West Exhibition Hall B2-B3 #W-502
Training large language models efficiently is challenging because larger data batches often lead to unstable training or lower-quality results. Current optimizers such as AdamW and LAMB struggle here because they cannot properly control sharp spikes in attention logits, a critical part of how these models process information. LAMB partly addresses this, but it still misses key details, such as limiting extreme values in certain weights and accounting for relationships between neighboring row/column values in the model's parameters.

To solve this, we developed MERIT, a new training method that:
- Controls extreme values by using a "max-aware" approach to scale updates, preventing attention logits from spiking.
- Focuses on local attention patterns in the model's weights to make updates more precise and stable (a rough sketch of this idea appears at the end of this summary).

In tests with GPT-2 models, MERIT allowed training with larger batch sizes than AdamW and LAMB without sacrificing performance. This means models can be trained faster, accelerating progress in AI development.

Why it matters: By addressing overlooked details in how training updates are scaled, MERIT improves stability and opens the door to training larger AI models more efficiently, a critical step for advancing technologies like large language models.
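The summary above describes the two ingredients only at a high level. Below is a minimal numpy sketch, assuming a LAMB-style update in which the single layer-wise L2 trust ratio is replaced by max-normalized ratios computed over the rows and columns of a weight matrix; the function and variable names (merit_like_step, row_ratio, col_ratio) and the exact formula are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def merit_like_step(W, grad, m, v, lr=1e-3, beta1=0.9, beta2=0.999,
                    eps=1e-8, weight_decay=0.01):
    """One illustrative step of a MERIT-style update on a 2-D weight matrix.

    Sketch under two assumptions drawn from the summary above:
      * trust ratios use max norms ("max-aware") instead of L2 norms, and
      * ratios are formed over local row/column structure rather than once
        per layer as in LAMB.
    """
    # Adam-style first/second moments (bias correction omitted for brevity).
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    update = m / (np.sqrt(v) + eps) + weight_decay * W

    # Max-normalized trust ratios per row and per column: taking the max
    # caps the largest coordinate instead of averaging it away, which is
    # the property the summary credits with preventing logit spikes.
    row_ratio = np.max(np.abs(W), axis=1, keepdims=True) / \
                (np.max(np.abs(update), axis=1, keepdims=True) + eps)
    col_ratio = np.max(np.abs(W), axis=0, keepdims=True) / \
                (np.max(np.abs(update), axis=0, keepdims=True) + eps)
    ratio = np.minimum(row_ratio, col_ratio)  # element-wise via broadcasting

    W = W - lr * ratio * update
    return W, m, v


# Toy usage: one step on a random 4x4 weight matrix.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))
grad = rng.normal(size=(4, 4))
m, v = np.zeros_like(W), np.zeros_like(W)
W, m, v = merit_like_step(W, grad, m, v)
```

The element-wise minimum of the row and column ratios is one plausible way to combine the local structure mentioned in the summary; the paper itself should be consulted for the precise update rule.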