

Poster

MoE-SVD: Structured Mixture-of-Experts LLMs Compression via Singular Value Decomposition

Wei Li · Lujun Li · Hao Gu · Youliang Huang · Mark Lee · Shengjie Sun · Wei Xue · Yike Guo

West Exhibition Hall B2-B3 #W-1003
Tue 15 Jul 11 a.m. PDT — 1:30 p.m. PDT

Abstract:

The Mixture-of-Experts (MoE) architecture improves Large Language Models (LLMs) through better scaling, but its higher parameter counts and memory demands create challenges for deployment. In this paper, we present MoE-SVD, a new decomposition-based compression framework tailored for MoE LLMs that requires no extra training. By harnessing Singular Value Decomposition (SVD), MoE-SVD addresses the critical issues of decomposition collapse and matrix redundancy in MoE architectures. Specifically, we first decompose experts into compact low-rank matrices, yielding accelerated inference and reduced memory. In particular, we propose a selective decomposition strategy that measures sensitivity metrics based on weight singular values and activation statistics to automatically identify decomposable expert layers. We then share a single V-matrix across all experts and apply top-k selection to the U-matrices. This low-rank matrix sharing and trimming scheme allows significant parameter reduction while preserving diversity among experts. Comprehensive experiments on Mixtral, Phi-3.5, DeepSeek, and Qwen2 MoE LLMs show that MoE-SVD outperforms other compression methods, achieving a 60% compression ratio and 1.5× faster inference with minimal performance loss. Code is available at: https://github.com/lliai/MoE-SVD.
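
To make the decomposition step concrete, here is a minimal sketch (not the released implementation) of how a single expert weight matrix could be factored with a truncated SVD, together with an illustrative sensitivity score that combines discarded singular-value energy with calibration activation statistics. The rank, the score formula, and all names are assumptions chosen for the example.

```python
# Minimal sketch of SVD-based expert compression (illustrative, not the
# authors' code). The rank and the sensitivity formula are assumptions.
import numpy as np

def truncated_svd(weight: np.ndarray, rank: int):
    """Factor an expert weight W (d_out x d_in) so that W ~= A @ B,
    with A absorbing the top-`rank` singular values."""
    U, S, Vt = np.linalg.svd(weight, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # (d_out, rank)
    B = Vt[:rank, :]             # (rank, d_in)
    return A, B

def sensitivity_score(weight: np.ndarray, activations: np.ndarray, rank: int):
    """Illustrative sensitivity metric: relative energy in the discarded
    singular values, scaled by the mean activation magnitude of this layer."""
    S = np.linalg.svd(weight, compute_uv=False)
    trunc_err = np.sqrt(np.sum(S[rank:] ** 2) / np.sum(S ** 2))
    return trunc_err * float(np.mean(np.abs(activations)))

# Toy example: one 512 x 256 expert weight, rank-32 factors.
rng = np.random.default_rng(0)
W = rng.standard_normal((512, 256)).astype(np.float32)
X = rng.standard_normal((64, 256)).astype(np.float32)  # calibration activations
print(sensitivity_score(W, X, rank=32))
A, B = truncated_svd(W, rank=32)
print(A.shape, B.shape)  # (512, 32) (32, 256)
```

In such a setup, layers whose sensitivity score falls below a chosen threshold would be the ones selected for decomposition, mirroring the selective strategy described in the abstract.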

Lay Summary:

We introduce MoE-SVD, a decomposition-based compression approach specifically designed for Mixture-of-Experts (MoE) Large Language Models (LLMs). Leveraging Singular Value Decomposition (SVD), our method reduces parameter redundancy and memory requirements without requiring additional training. Our selective decomposition strategy uses sensitivity metrics to decide which expert layers to decompose, shares a single V-matrix across experts, and trims U-matrices through top-k selection. Experiments on MoE models including Mixtral, Phi-3.5, DeepSeek, and Qwen2 demonstrate a 60% compression ratio and 1.5× faster inference with minimal performance degradation.
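
The sharing-and-trimming step can be illustrated in the same spirit. The sketch below is again an assumption-laden illustration, not the released code: it derives one shared V-matrix from the stacked experts of a layer, projects each expert onto it to obtain per-expert U-matrices, and keeps only the top-k U-matrices under a simple norm heuristic standing in for the paper's selection criterion.

```python
# Minimal sketch of the shared-V / top-k-U idea (illustrative, not the
# authors' implementation; the selection criterion is a plain Frobenius-norm
# heuristic chosen for the example).
import numpy as np

def share_v_and_trim(expert_weights, rank: int, keep: int):
    """expert_weights: list of (d_out, d_in) matrices for one MoE layer.
    Returns a single shared V (d_in, rank) and U factors for the `keep`
    experts whose low-rank reconstruction carries the most energy."""
    # Shared right factor: SVD of the row-stacked experts.
    stacked = np.concatenate(expert_weights, axis=0)      # (E*d_out, d_in)
    _, _, Vt = np.linalg.svd(stacked, full_matrices=False)
    V = Vt[:rank, :].T                                    # (d_in, rank)

    # Per-expert left factors obtained by projecting onto the shared V,
    # so each expert is approximated as W_i ~= U_i @ V.T.
    U_all = [W @ V for W in expert_weights]               # each (d_out, rank)

    # Keep only the top-k U-matrices by reconstruction energy (heuristic).
    energy = [np.linalg.norm(U) for U in U_all]
    keep_ids = np.argsort(energy)[::-1][:keep]
    return V, {int(i): U_all[int(i)] for i in keep_ids}

# Toy example: 8 experts, rank-32 factors, keep the 4 strongest U-matrices.
rng = np.random.default_rng(0)
experts = [rng.standard_normal((512, 256)).astype(np.float32) for _ in range(8)]
V, kept_U = share_v_and_trim(experts, rank=32, keep=4)
print(V.shape, sorted(kept_U))  # (256, 32) and the indices of the kept experts
```

Because V has orthonormal columns, U_i = W_i @ V gives the best rank-r approximation of each expert within the shared subspace, which is one plausible way the single V-matrix can be reused across all experts while only a subset of U-matrices is retained.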
