

Poster

Nesterov Method for Asynchronous Pipeline Parallel Optimization

Thalaiyasingam Ajanthan · Sameera Ramasinghe · Yan Zuo · Gil Avraham · Alexander Long

West Exhibition Hall B2-B3 #W-615
Wed 16 Jul 4:30 p.m. PDT — 7 p.m. PDT

Abstract:

Pipeline Parallelism (PP) enables large neural network training on small, interconnected devices by splitting the model into multiple stages. To maximize pipeline utilization, asynchronous optimization is appealing as it offers 100% pipeline utilization by construction. However, it is inherently challenging as the weights and gradients are no longer synchronized, leading to stale (or delayed) gradients. To alleviate this, we introduce a variant of Nesterov Accelerated Gradient (NAG) for asynchronous optimization in PP. Specifically, we modify the look-ahead step in NAG to effectively address the staleness in gradients. We theoretically prove that our approach converges at a sublinear rate in the presence of fixed delay in gradients. Our experiments on large-scale language modelling tasks using decoder-only architectures with up to 1B parameters demonstrate that our approach significantly outperforms existing asynchronous methods, even surpassing the synchronous baseline.
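The abstract does not spell out the exact update rule, but the core idea of evaluating the gradient at a look-ahead point chosen to offset a fixed delay can be illustrated with a toy simulation. The sketch below is plain Python/NumPy; the specific look-ahead scaling (delay + 1) * mu and the quadratic objective are illustrative assumptions, not the paper's formulation.

```python
import numpy as np
from collections import deque

def train_toy(delay=2, steps=300, lr=0.05, mu=0.9):
    """Toy NAG-style loop with stale gradients on f(w) = ||w||^2.

    Gradients arrive `delay` steps after the look-ahead point that
    produced them. Stretching the look-ahead by (delay + 1) * mu is an
    illustrative assumption for compensating staleness; the paper's
    exact modification may differ.
    """
    rng = np.random.default_rng(0)
    w = rng.normal(size=8)          # parameters
    v = np.zeros_like(w)            # momentum buffer
    grad_fn = lambda x: 2.0 * x     # gradient of ||x||^2
    in_flight = deque()             # gradients still "in the pipeline"

    for _ in range(steps):
        # Send an extended look-ahead point for gradient computation.
        lookahead = w + (delay + 1) * mu * v
        in_flight.append(grad_fn(lookahead))
        if len(in_flight) <= delay:
            continue                # warm-up: nothing has come back yet
        g = in_flight.popleft()     # this gradient is `delay` steps old
        v = mu * v - lr * g         # momentum update
        w = w + v                   # weight update
    return float(np.linalg.norm(w))

print("final ||w|| after training with stale gradients:", train_toy())
```

With delay = 0 this reduces to standard NAG; the queue of in-flight gradients models the asynchrony of a pipeline in which every stage keeps computing on the weights it last received.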

Lay Summary:

Training very large neural networks often requires splitting the model into parts and running them across several smaller devices. If the connection bandwidth between these devices is low (e.g., over the internet), the devices sit idle while waiting for communication. Asynchronous optimization eliminates this idle time by keeping all devices active at all times. This comes at the cost of incorrect (or delayed) information being used for training, which often hurts model performance. We address this by predicting the future state of the model using a look-ahead approach, effectively removing these inaccuracies. We provide a theoretical guarantee that our approach still converges. Our experiments in training large language models show that our asynchronous method not only improves device utilization but also improves the final model performance compared to synchronized training. This demonstrates the possibility of training large AI models on devices connected via the internet, instead of expensive centralized infrastructure.
