

Poster

Expert Race: A Flexible Routing Strategy for Scaling Diffusion Transformer with Mixture of Experts

Yike Yuan · Ziyu Wang · Zihao Huang · Defa Zhu · Xun Zhou · Jingyi Yu · Qiyang Min

East Exhibition Hall A-B #E-3302
Tue 15 Jul 11 a.m. PDT — 1:30 p.m. PDT

Abstract:

Diffusion models have emerged as the mainstream framework in visual generation. Building upon this success, the integration of Mixture of Experts (MoE) methods has shown promise in enhancing model scalability and performance. In this paper, we introduce Race-DiT, a novel MoE model for diffusion transformers with a flexible routing strategy, Expert Race. By allowing tokens and experts to compete together and selecting the top candidates, the model learns to dynamically assign experts to critical tokens. Additionally, we propose per-layer regularization to address challenges in shallow-layer learning, and a router similarity loss to prevent mode collapse, ensuring better expert utilization. Extensive experiments on ImageNet validate the effectiveness of our approach, showcasing significant performance gains along with promising scaling properties.
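The core idea of Expert Race — letting all token-expert pairs compete in a single pool rather than having each token independently select its top experts — can be illustrated with a minimal sketch. This is an illustrative reconstruction based only on the abstract's description, not the authors' implementation; the function name and mask-based dispatch are assumptions.

```python
import numpy as np

def expert_race_route(logits: np.ndarray, k: int):
    """Hypothetical sketch of global top-k ("Expert Race") routing.

    logits: (num_tokens, num_experts) router scores.
    Instead of a fixed per-token top-k, every token-expert pair
    competes in one pool and the k highest-scoring pairs win, so
    harder tokens can claim more experts than easier ones.
    """
    num_tokens, num_experts = logits.shape
    flat = logits.reshape(-1)                 # (num_tokens * num_experts,)
    topk_idx = np.argsort(flat)[-k:]          # indices of the k largest scores
    token_idx = topk_idx // num_experts       # token of each winning pair
    expert_idx = topk_idx % num_experts       # expert of each winning pair
    mask = np.zeros_like(logits)
    mask[token_idx, expert_idx] = 1.0         # 0/1 dispatch mask
    return mask, token_idx, expert_idx
```

Because the budget k is shared across all tokens, the number of active experts per token is data-dependent, which is what allows the model to spend more capacity on difficult denoising steps.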

Lay Summary:

The diffusion process actually consists of numerous sub-tasks with varying levels of difficulty; evidently, denoising pure Gaussian noise is harder than denoising an almost fully clean image. Currently, all of these tasks are handled by the same model. We aim to employ models of different sizes to handle tasks of different difficulty.

The Mixture of Experts (MoE) technique is commonly used to scale up model capacity, and we find that it is also effective when applied to diffusion models. Furthermore, since the model incorporates a number of experts, we extend the routing strategy so that the model autonomously learns to activate different numbers of experts for different sub-tasks. This results in a model with a dynamic size that adaptively adjusts to the complexity of each task.

Compared to previous MoE approaches, our method enables much more efficient model scaling with a simple modification to the routing strategy. It also demonstrates the potential of leveraging dynamics in diffusion models.
