Poster
Olica: Efficient Structured Pruning of Large Language Models without Retraining
Jiujun He · Huazhen Lin
East Exhibition Hall A-B #E-2603
Network pruning is a pivotal technique for reducing the complexity and accelerating the inference of large language models (LLMs) by removing redundant components (e.g., neurons), but conventional methods require substantial computational and data resources for retraining to restore the correlations corrupted by pruning. We propose an efficient pruning framework for LLMs that employs orthogonal neuron decomposition and linear calibration, applied to the multi-head attention (MHA) layer and the feed-forward network (FFN) layer of the transformer, respectively. By developing a fast decomposition method and leveraging the closed-form solution of the least-squares problem, our method is efficient in terms of data usage, GPU memory consumption, and running time. This enables pruning a model with 70B parameters on a single NVIDIA GeForce RTX 4090 GPU in less than an hour.
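The closed-form least-squares calibration mentioned above can be illustrated with a minimal sketch, assuming the calibration fits a linear map from the pruned FFN's features to the original FFN's outputs on a small calibration set; the function name, the ridge parameter `lam`, and the tensor shapes are illustrative assumptions, not the authors' actual API.

```python
import torch

def calibrate_ffn(H_pruned: torch.Tensor, Y_orig: torch.Tensor, lam: float = 1e-4) -> torch.Tensor:
    """Solve min_W ||H_pruned @ W - Y_orig||_F^2 + lam * ||W||_F^2 in closed form.

    H_pruned: (n_tokens, d_kept)   features produced by the pruned FFN on calibration data
    Y_orig:   (n_tokens, d_model)  outputs of the original (unpruned) FFN on the same data
    Returns W: (d_kept, d_model)   linear correction applied after the pruned FFN
    """
    d = H_pruned.shape[1]
    # Ridge-regularized normal equations: W = (H^T H + lam * I)^{-1} H^T Y
    gram = H_pruned.T @ H_pruned + lam * torch.eye(d, dtype=H_pruned.dtype, device=H_pruned.device)
    return torch.linalg.solve(gram, H_pruned.T @ Y_orig)
```

Because the solution is obtained from a single linear solve rather than gradient-based retraining, the calibration cost is dominated by one pass over a small calibration set, which is consistent with the low data, memory, and runtime requirements claimed above.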