Poster
Sparsing Law: Towards Large Language Models with Greater Activation Sparsity
Yuqi Luo · Chenyang Song · Xu Han · Yingfa Chen · Chaojun Xiao · Xiaojun Meng · Liqun Deng · Jiansheng Wei · Zhiyuan Liu · Maosong Sun
East Exhibition Hall A-B #E-2610
We study activation sparsity, a phenomenon widely present in most LLMs that benefits computational efficiency and interpretability. This work addresses three underexplored research questions: (1) How can activation sparsity be measured more accurately? (2) How is activation sparsity affected by the model architecture and training process? (3) How can we build a more sparsely activated and efficient LLM?

First, we propose a more general and performance-friendly metric for activation sparsity, named CETT-PPL-1%. Next, we conduct comprehensive experiments to reveal the quantitative influence of four factors on activation sparsity: the amount of training data, the activation function, the width-depth ratio, and the parameter scale. Finally, we summarize the implications for building more sparsely activated and efficient LLMs, including the sparsity-promoting benefits of more training data, the ReLU activation (compared with SiLU), and a smaller width-depth ratio. The insensitivity of sparsity to parameter scale is also a surprising and interesting observation.

Our paper offers a more accurate paradigm for inspecting the sparsity level of an LLM. The empirical laws found in this work can guide the design and pre-training of LLMs with greater activation sparsity, which helps produce more efficient LLMs.
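To make the idea of a truncation-based sparsity measurement concrete, the following is a minimal sketch, not the paper's exact CETT-PPL-1% procedure. It assumes that CETT quantifies the relative output error introduced by zeroing out small-magnitude FFN activations, and that sparsity is then reported as the fraction of activations falling below the chosen threshold. The function names (`cett`, `sparsity_at_eps`), tensor shapes, and the per-layer threshold handling are illustrative assumptions; the paper's protocol additionally ties the threshold to a perplexity degradation of at most 1%.

```python
import torch

def cett(hidden_acts: torch.Tensor, w_out: torch.Tensor, eps: float) -> torch.Tensor:
    """Relative output error from truncating small FFN activations (illustrative CETT).

    hidden_acts: [tokens, d_ff]   post-activation values of the FFN inner layer
    w_out:       [d_ff, d_model]  down-projection weight
    eps:         magnitude threshold below which a neuron is treated as inactive
    """
    full_out = hidden_acts @ w_out                      # exact FFN output
    tail = hidden_acts * (hidden_acts.abs() < eps)      # activations that would be truncated
    err = tail @ w_out                                  # output contribution lost by truncation
    return err.norm(dim=-1) / full_out.norm(dim=-1).clamp_min(1e-8)

def sparsity_at_eps(hidden_acts: torch.Tensor, eps: float) -> float:
    """Fraction of neuron activations whose magnitude falls below eps (treated as zero)."""
    return (hidden_acts.abs() < eps).float().mean().item()
```

Under this sketch, one would sweep `eps` per layer, keep the largest value whose truncation leaves validation perplexity within 1% of the original model, and report `sparsity_at_eps` at that threshold as the model's sparsity level.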