Poster
Shortcut-connected Expert Parallelism for Accelerating Mixture of Experts
Weilin Cai · Juyong Jiang · Le Qin · Junwei Cui · Sunghun Kim · Jiayi Huang
West Exhibition Hall B2-B3 #W-614
The Problem: Imagine a team of super-specialized workers (experts) collaborating on a massive project (a large AI model). To get the job done quickly, the work needs to be split efficiently among many different workstations (devices). However, coordinating who does what and passing information back and forth between these workstations creates a huge traffic jam (communication bottleneck). This slowdown is the main thing preventing these powerful models from working even faster.

The Solution (ScMoE): We created a smarter way to design the team (model architecture) and organize their workflow (parallelization strategy):
1. New Design (Shortcut Connection): We added direct connections within the team structure, allowing information to flow more flexibly.
2. Smarter Workflow (Overlapping): Crucially, this new design lets team members do their own calculations while the information they need is still being passed around. Previously, everyone had to wait for all the information to arrive before anyone could start their real work. Now, communication and calculation happen at the same time.

The Results: This combined approach eliminates the traffic jam almost completely:
* Much Faster: It makes training the AI model 1.5 times faster and running the finished model (inference) 1.8 times faster compared to the standard method used today.
* Just as Smart (or Smarter): Importantly, our method doesn't sacrifice quality. The AI models built this way perform just as well as, and sometimes even better than, models built using older, slower methods.
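The overlapping idea above can be illustrated with a toy sketch. This is not the paper's implementation: the function names are hypothetical, and the expensive operations (the all-to-all token dispatch and the shortcut-branch computation) are simulated with sleeps on background threads rather than real GPU kernels and collective communication. It only shows why letting independent computation run while communication is in flight hides the communication time.

```python
import threading
import time

def dispatch_tokens():
    # Stand-in for the expert-parallel communication (e.g. all-to-all)
    # that routes tokens to experts on other devices.
    time.sleep(0.2)

def shortcut_computation():
    # Stand-in for computation that, thanks to the shortcut connection,
    # does not depend on the in-flight communication.
    time.sleep(0.2)

# Sequential schedule: computation must wait for communication to finish.
start = time.perf_counter()
dispatch_tokens()
shortcut_computation()
sequential = time.perf_counter() - start

# Overlapped schedule: communication runs in the background while the
# independent computation proceeds, hiding the communication latency.
start = time.perf_counter()
comm = threading.Thread(target=dispatch_tokens)
comm.start()
shortcut_computation()
comm.join()
overlapped = time.perf_counter() - start

print(f"sequential: {sequential:.2f}s, overlapped: {overlapped:.2f}s")
```

In this toy setting the overlapped schedule takes roughly the longer of the two durations instead of their sum, which is the intuition behind the reported speedups.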