Poster
Shortcut-connected Expert Parallelism for Accelerating Mixture of Experts
Weilin Cai · Juyong Jiang · Le Qin · Junwei Cui · Sunghun Kim · Jiayi Huang
West Exhibition Hall B2-B3 #W-614
The Problem: Imagine a team of super-specialized workers (experts) collaborating on a massive project (a large AI model). To get the job done quickly, the work needs to be split efficiently among many different workstations (devices). However, coordinating who does what and passing information back and forth between these workstations creates a huge traffic jam (communication bottleneck). This slowdown is the main thing preventing these powerful models from working even faster.

The Solution (ScMoE): We created a smarter way to design the team (model architecture) and organize their workflow (parallelization strategy):
1. New Design (Shortcut Connection): We added direct connections within the team structure, allowing information to flow more flexibly.
2. Smarter Workflow (Overlapping): Crucially, this new design lets team members do their own calculations while the information they need is still being passed around. Previously, everyone had to wait for all the information to arrive before anyone could start their real work. Now, communication and calculation happen at the same time.

The Results: This combined approach eliminates the traffic jam almost completely:
* Much Faster: It makes training the AI model 1.5 times faster and running the finished model (inference) 1.8 times faster compared to the standard method used today.
* Just as Smart (or Smarter): Importantly, our method doesn't sacrifice quality. The AI models built this way perform just as well as, and sometimes even better than, models built using older, slower methods.
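The overlapping idea above can be illustrated with a toy sketch. This is not the paper's implementation: the function names are hypothetical, and the expensive operations (the all-to-all token dispatch and the shortcut-branch computation) are simulated with sleeps on background threads rather than real GPU kernels and collective communication. It only shows why letting independent computation run while communication is in flight hides the communication time.

```python
import threading
import time

def dispatch_tokens():
    # Stand-in for the expert-parallel communication (e.g. all-to-all)
    # that routes tokens to experts on other devices.
    time.sleep(0.2)

def shortcut_computation():
    # Stand-in for computation that, thanks to the shortcut connection,
    # does not depend on the in-flight communication.
    time.sleep(0.2)

# Sequential schedule: computation must wait for communication to finish.
start = time.perf_counter()
dispatch_tokens()
shortcut_computation()
sequential = time.perf_counter() - start

# Overlapped schedule: communication runs in the background while the
# independent computation proceeds, hiding the communication latency.
start = time.perf_counter()
comm = threading.Thread(target=dispatch_tokens)
comm.start()
shortcut_computation()
comm.join()
overlapped = time.perf_counter() - start

print(f"sequential: {sequential:.2f}s, overlapped: {overlapped:.2f}s")
```

In this toy setting the overlapped schedule takes roughly the longer of the two durations instead of their sum, which is the intuition behind the reported speedups.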