Poster in Workshop: ES-FoMo III: 3rd Workshop on Efficient Systems for Foundation Models
Efficient Pre-Training of LLMs via Topology-Aware Communication Alignment on More Than 9600 GPUs
Guoliang HE · Youhe Jiang · Wencong Xiao · Jiang Kaihua · Shuguang Wang · Jun Wang · Du Zixian · Zhuo Jiang · Xinlei Zhang · Binhang Yuan · Eiko Yoneki
Abstract:
The scaling law for large language models (LLMs) indicates that the path towards machine intelligence requires training at large scale. Thus, companies continuously build large-scale GPU clusters and launch training jobs that span thousands of computing nodes. However, LLM pre-training presents unique challenges due to its complex communication patterns, where GPUs exchange data in sparse yet high-volume bursts within specific groups. Inefficient resource scheduling exacerbates bandwidth contention, leading to suboptimal training performance. This paper presents Arnold, a scheduling system that distills our experience in aligning LLM communication patterns with data center topology at scale. We perform an in-depth characterization study to identify the impact of physical network topology on LLM pre-training jobs, and develop a scheduling algorithm that aligns communication patterns with the physical network topology of modern data centers. In production training, our scheduling system improves end-to-end performance by $10.6\%$ when training with more than $9600$ GPUs, a significant improvement for our training pipeline.
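To illustrate the general idea of topology-aware placement (this is a minimal sketch under assumed conventions, not the Arnold algorithm from the paper), the snippet below greedily packs nodes that share a switch into the same high-volume communication group, so that bursty intra-group traffic stays within one network domain while lower-volume traffic crosses switches. The node/switch naming, group sizes, and `place_ranks` helper are all hypothetical.

```python
# Hedged sketch: topology-aware rank placement on an assumed two-level fabric
# where each node reports its top-of-rack switch. Not the paper's algorithm.
from collections import defaultdict
from typing import Dict, List


def place_ranks(node_to_switch: Dict[str, str], tp_size: int, dp_size: int) -> List[str]:
    """Order nodes so that ranks in the same high-volume (e.g., tensor-parallel)
    group land on nodes sharing a switch, pushing lower-volume (e.g.,
    data-parallel) traffic across switches instead."""
    # Group nodes by their switch: one locality domain per switch.
    by_switch = defaultdict(list)
    for node, switch in node_to_switch.items():
        by_switch[switch].append(node)

    # Fill groups from the largest switch domains first, so each consecutive
    # block of tp_size nodes is as topologically compact as possible.
    ordered: List[str] = []
    for switch in sorted(by_switch, key=lambda s: -len(by_switch[s])):
        ordered.extend(sorted(by_switch[switch]))

    needed = tp_size * dp_size
    if len(ordered) < needed:
        raise ValueError(f"need {needed} nodes, only {len(ordered)} available")
    return ordered[:needed]


# Example: 8 nodes spread over 2 switches, high-volume groups of size 4.
topo = {f"node{i}": f"switch{i // 4}" for i in range(8)}
placement = place_ranks(topo, tp_size=4, dp_size=2)
groups = [placement[i:i + 4] for i in range(0, len(placement), 4)]
print(groups)  # each group of 4 stays within one switch
```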