

Poster in Workshop: Tiny Titans: The next wave of On-Device Learning for Foundation Models (TTODLer-FM)

Compression of Large Language Models by Neuron Summary

Yancheng Wang · Dongfang Sun · Yingzhen Yang

Fri 18 Jul 3 p.m. PDT — 3:45 p.m. PDT

Abstract:

The rapid growth in the size of Large Language Models (LLMs) poses significant challenges for deployment, particularly in resource-limited environments. To address this issue, we propose Neuron Summary (NS), a novel approach for compressing LLMs by constructing compact representations of the weights in their linear layers. Given that these layers contribute the most to the overall model size, NS offers an effective method to reduce the model size and computational costs while maintaining strong performance in downstream natural language processing tasks. Our compressed model, NSNet, substitutes each linear layer in an LLM with an NS-Linear layer, where the weights are represented using NS. The transition from a pre-trained LLM to NSNet is achieved through regression-based initialization, followed by knowledge distillation to preserve the original model's capabilities. Extensive experiments on compressing various LLMs, including DeBERTaV3-base and Llama-2, demonstrate that NS significantly outperforms existing compression methods across multiple tasks, such as natural language understanding, question answering, and text generation. Additionally, NS is complementary to other compression techniques, such as quantization and layer-wise parameter sharing, enabling further reduction in model size while maintaining competitive performance. The code for NSNet is available at https://anonymous.4open.science/r/NSNet-D6B8/.
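To make the workflow described in the abstract concrete, below is a minimal, hypothetical sketch of the overall pipeline: replace each `nn.Linear` with a compressed drop-in layer, initialize it by regression against the original dense weights, then fine-tune with distillation. The abstract does not specify the form of the Neuron Summary representation, so a low-rank factorization is used here purely as a placeholder; `CompressedLinear`, `compress_model`, and the `rank` parameter are illustrative names, not the paper's actual NS-Linear implementation.

```python
import torch
import torch.nn as nn


class CompressedLinear(nn.Module):
    """Stand-in for an NS-Linear layer: the dense weight W (out x in) is
    replaced by a compact factorization U @ S with rank r << min(out, in).
    The paper's actual Neuron Summary representation may differ."""

    def __init__(self, in_features, out_features, rank, bias=True):
        super().__init__()
        self.U = nn.Parameter(torch.zeros(out_features, rank))
        self.S = nn.Parameter(torch.zeros(rank, in_features))
        self.bias = nn.Parameter(torch.zeros(out_features)) if bias else None

    @classmethod
    def from_linear(cls, linear: nn.Linear, rank: int):
        """Regression-style initialization: pick U, S minimizing ||W - U S||_F,
        which a truncated SVD solves in closed form."""
        layer = cls(linear.in_features, linear.out_features, rank,
                    bias=linear.bias is not None)
        U, s, Vh = torch.linalg.svd(linear.weight.detach(), full_matrices=False)
        layer.U.data = U[:, :rank] * s[:rank]
        layer.S.data = Vh[:rank, :]
        if linear.bias is not None:
            layer.bias.data = linear.bias.detach().clone()
        return layer

    def forward(self, x):
        # Compute x @ S^T @ U^T (+ bias) without materializing the dense weight.
        out = (x @ self.S.t()) @ self.U.t()
        return out + self.bias if self.bias is not None else out


def compress_model(model: nn.Module, rank: int):
    """Recursively replace every nn.Linear with a CompressedLinear.
    The compressed model would then be trained with knowledge distillation
    against the original model, as the abstract describes."""
    for name, child in model.named_children():
        if isinstance(child, nn.Linear):
            setattr(model, name, CompressedLinear.from_linear(child, rank))
        else:
            compress_model(child, rank)
    return model
```

The key design point this sketch illustrates is the drop-in substitution: only the linear layers change, so the rest of the pre-trained architecture and its checkpoint remain usable, and the regression-based initialization gives the distillation stage a good starting point rather than training the compressed weights from scratch.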
