Poster in Workshop: ES-FoMo III: 3rd Workshop on Efficient Systems for Foundation Models
Efficient Pre-Training of LLMs via Topology-Aware Communication Alignment on More Than 9600 GPUs
Guoliang HE · Youhe Jiang · Wencong Xiao · Jiang Kaihua · Shuguang Wang · Jun Wang · Du Zixian · Zhuo Jiang · Xinlei Zhang · Binhang Yuan · Eiko Yoneki
Abstract:
The scaling law for large language models (LLMs) indicates that the path towards machine intelligence requires training at large scale. Thus, companies continuously build large-scale GPU clusters and launch training jobs that span thousands of computing nodes. However, LLM pre-training presents unique challenges due to its complex communication patterns, where GPUs exchange data in sparse yet high-volume bursts within specific groups. Inefficient resource scheduling exacerbates bandwidth contention, leading to suboptimal training performance. This paper presents Arnold, a scheduling system that distills our experience in aligning LLM communication patterns with data center topology at scale. We perform an in-depth characterization study to identify the impact of physical network topology on LLM pre-training jobs, and develop a scheduling algorithm that aligns communication patterns with the physical network topology of modern data centers. In production training, our scheduling system improves end-to-end performance by $10.6\%$ when training with more than $9600$ GPUs, a significant improvement for our training pipeline.
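To illustrate the general idea of topology-aware placement (this is a minimal sketch under assumed conventions, not the Arnold algorithm from the paper), the snippet below greedily packs nodes that share a switch into the same high-volume communication group, so that bursty intra-group traffic stays within one network domain while lower-volume traffic crosses switches. The node/switch naming, group sizes, and `place_ranks` helper are all hypothetical.

```python
# Hedged sketch: topology-aware rank placement on an assumed two-level fabric
# where each node reports its top-of-rack switch. Not the paper's algorithm.
from collections import defaultdict
from typing import Dict, List


def place_ranks(node_to_switch: Dict[str, str], tp_size: int, dp_size: int) -> List[str]:
    """Order nodes so that ranks in the same high-volume (e.g., tensor-parallel)
    group land on nodes sharing a switch, pushing lower-volume (e.g.,
    data-parallel) traffic across switches instead."""
    # Group nodes by their switch: one locality domain per switch.
    by_switch = defaultdict(list)
    for node, switch in node_to_switch.items():
        by_switch[switch].append(node)

    # Fill groups from the largest switch domains first, so each consecutive
    # block of tp_size nodes is as topologically compact as possible.
    ordered: List[str] = []
    for switch in sorted(by_switch, key=lambda s: -len(by_switch[s])):
        ordered.extend(sorted(by_switch[switch]))

    needed = tp_size * dp_size
    if len(ordered) < needed:
        raise ValueError(f"need {needed} nodes, only {len(ordered)} available")
    return ordered[:needed]


# Example: 8 nodes spread over 2 switches, high-volume groups of size 4.
topo = {f"node{i}": f"switch{i // 4}" for i in range(8)}
placement = place_ranks(topo, tp_size=4, dp_size=2)
groups = [placement[i:i + 4] for i in range(0, len(placement), 4)]
print(groups)  # each group of 4 stays within one switch
```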