

Poster

Ladder-Residual: Parallelism-Aware Architecture for Accelerating Large Model Inference with Communication Overlapping

Muru Zhang · Mayank Mishra · Zhongzhu Zhou · William Brandon · Jue Wang · Yoon Kim · Jonathan Ragan-Kelley · Shuaiwen Song · Ben Athiwaratkun · Tri Dao

West Exhibition Hall B2-B3 #W-1000
[ Project Page ]
Tue 15 Jul 11 a.m. PDT — 1:30 p.m. PDT

Abstract:

Large language model inference is both memory-intensive and time-consuming, often requiring distributed algorithms to scale efficiently. Various model parallelism strategies are used in multi-GPU training and inference to partition computation across multiple devices, reducing memory load and computation time. However, model parallelism necessitates communication between GPUs, which has been a major bottleneck and limits the gains obtained by scaling up the number of devices. We introduce Ladder Residual, a simple architectural modification applicable to all residual-based models that enables straightforward overlapping which effectively hides the latency of communication. Our insight is that, in addition to systems optimization, one can also redesign the model architecture to decouple communication from computation. While Ladder Residual allows communication-computation decoupling in conventional parallelism patterns, we focus on Tensor Parallelism in this paper, which is particularly bottlenecked by its heavy communication. For a Transformer model with 70B parameters, applying Ladder Residual to all of its layers achieves a 29% end-to-end wall-clock speedup at inference time with TP sharding over 8 devices. We refer to the resulting Transformer model as the Ladder Transformer. We train 1B- and 3B-parameter Ladder Transformers from scratch and observe performance comparable to a standard dense Transformer baseline. We also show that it is possible to convert parts of the Llama 3.1-8B model to our Ladder Residual architecture with minimal accuracy degradation by retraining on only 3B tokens.
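To make the overlap concrete, below is a minimal PyTorch-style sketch of one tensor-parallel decoder block with ladder-style residual routing. The module names (attn, mlp, norm1, norm2) and the exact routing are illustrative assumptions, not the paper's reference implementation; the point it shows is that each sub-layer reads the residual stream from before the previous sub-layer's still-in-flight output is added, so the previous all-reduce can run concurrently with the current computation.

```python
import torch
import torch.distributed as dist
from torch import nn

class LadderResidualBlock(nn.Module):
    """Illustrative tensor-parallel decoder block with ladder-style residual routing.

    `attn` and `mlp` are assumed to be TP-sharded sub-layers that return *partial*
    (pre-all-reduce) outputs. Each sub-layer reads the residual stream before the
    previous sub-layer's in-flight output has been added, so the previous
    all-reduce overlaps with the current computation.
    """

    def __init__(self, attn: nn.Module, mlp: nn.Module, hidden_size: int):
        super().__init__()
        self.attn = attn                        # sharded attention (partial output)
        self.mlp = mlp                          # sharded MLP (partial output)
        self.norm1 = nn.LayerNorm(hidden_size)
        self.norm2 = nn.LayerNorm(hidden_size)

    def forward(self, residual, prev_partial, prev_handle):
        # Compute attention on the residual that does NOT yet contain the previous
        # sub-layer's output; its all-reduce is still running in the background.
        attn_partial = self.attn(self.norm1(residual))

        # The previous output is only needed now: wait for it and fold it in.
        if prev_handle is not None:
            prev_handle.wait()
            residual = residual + prev_partial

        # Launch this block's attention all-reduce asynchronously; it overlaps
        # with the MLP computation below.
        attn_handle = dist.all_reduce(attn_partial, async_op=True)
        mlp_partial = self.mlp(self.norm2(residual))

        attn_handle.wait()
        residual = residual + attn_partial

        # Hand the not-yet-added MLP output and its in-flight all-reduce to the
        # next block, which will overlap its attention compute with it.
        mlp_handle = dist.all_reduce(mlp_partial, async_op=True)
        return residual, mlp_partial, mlp_handle
```

A standard block would instead wait for each all-reduce to finish before the next sub-layer can start, serializing communication and computation.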

Lay Summary:

As foundation models continue to scale, multi-GPU inference is crucial. Tensor Parallelism (TP), a widely adopted distributed inference approach, divides weights and computation across all devices, which helps with both memory efficiency and speed. However, the inter-GPU communication turns out to be a major bottleneck for overall latency: for a 70B model running with TP on 8 GPUs, communication can account for 38% of the total inference time. We introduce Ladder-residual, a simple architecture tweak that allows computation and communication to happen in parallel, reducing latency without needing custom kernels or hardware changes.

Here's a quick summary of what Ladder-residual achieves:
* ~30% speedup for Llama 3.1-70B (TP=8) and Llama 3.1-405B (TP=16), and nearly double that speedup when a fast interconnect (NVLink) is not available, with performance comparable to a standard Transformer.
* Can be applied to a pretrained model: we adapt Llama 3.1-8B and gain a 23% speedup with no accuracy loss.
* A pure PyTorch-level modification that needs no custom CUDA kernels and works on any hardware (see the sketch below).
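The "pure PyTorch" point rests on the standard asynchronous collective API. A toy sketch of the overlap primitive, assuming a process group has already been initialized (e.g. via torchrun with dist.init_process_group(backend="nccl")) and using made-up tensors in place of a real sub-layer:

```python
import torch
import torch.distributed as dist

# Toy stand-ins for a sub-layer's pre-reduction output and the next computation.
partial = torch.randn(4096, device="cuda")          # output still needing an all-reduce
next_input = torch.randn(16, 4096, device="cuda")
weight = torch.randn(4096, 4096, device="cuda")

handle = dist.all_reduce(partial, async_op=True)    # communication starts immediately
hidden = next_input @ weight                        # computation overlaps with it
handle.wait()                                       # block only when the reduced sum is needed
```

Because the overlap is expressed entirely through async collectives and ordinary tensor ops, no custom kernels or hardware-specific changes are required.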
