Poster in Workshop: ES-FoMo III: 3rd Workshop on Efficient Systems for Foundation Models
Compressing Large Language Models to Any Size Without Re-Computation
Martin Genzel · Patrick Putzky · Pengfei Zhao · Sebastian Schulze · Mattes Mollenhauer · Robert Seidel · Stefan Dietzel · Thomas Wollmann
The adoption of Foundation Models in resource-constrained environments remains challenging due to their large size and inference costs. A promising way to overcome these limitations is post-training compression, which aims to balance reduced model size against performance degradation. This work presents Any Compression via Iterative Pruning (ACIP), a novel algorithmic approach to determine a compression-performance trade-off from a single stochastic gradient descent run. To ensure parameter efficiency, we use an SVD-reparametrization of linear layers and iteratively prune their singular values with a sparsity-inducing penalty. Importantly, the resulting pruning order gives rise to a global parameter ranking that allows compressing a model to any target size without requiring re-computation. We evaluate ACIP on a large selection of open-weight LLMs and downstream tasks, demonstrating state-of-the-art results compared to existing factorization-based compression methods.
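Below is a minimal, hypothetical sketch (in PyTorch) of the core idea the abstract describes: reparametrizing a linear layer through its SVD so that singular values can be masked out to hit any target size without re-computation. The class and method names (`SVDLinear`, `keep_top`) are illustrative only, and the per-layer magnitude ranking used here stands in for ACIP's global ranking, which the paper derives from the pruning order of a sparsity-regularized SGD run.

```python
# Hypothetical sketch of an SVD-reparametrized linear layer (not the authors' code).
# The weight W is factored as U diag(s) Vh; compression keeps only a subset of
# singular values via a binary mask, so any target size can be reached on the fly.
import torch
import torch.nn as nn


class SVDLinear(nn.Module):
    """Linear layer reparametrized via its singular value decomposition."""

    def __init__(self, linear: nn.Linear):
        super().__init__()
        # W has shape (out_features, in_features); factor W = U diag(s) Vh.
        U, s, Vh = torch.linalg.svd(linear.weight.data, full_matrices=False)
        self.U = nn.Parameter(U)    # (out, r)
        self.s = nn.Parameter(s)    # (r,) -- the values to be pruned
        self.Vh = nn.Parameter(Vh)  # (r, in)
        self.register_buffer("mask", torch.ones_like(s))
        self.bias = linear.bias

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = x Vh^T diag(s * mask) U^T + b
        y = x @ self.Vh.T
        y = y * (self.s * self.mask)
        y = y @ self.U.T
        if self.bias is not None:
            y = y + self.bias
        return y

    def keep_top(self, k: int) -> None:
        # Keep the k highest-ranked singular values. Here the ranking is simply
        # |s| per layer; ACIP instead uses a global ranking obtained from the
        # order in which a sparsity-inducing penalty prunes values during SGD.
        order = torch.argsort(self.s.detach().abs(), descending=True)
        new_mask = torch.zeros_like(self.mask)
        new_mask[order[:k]] = 1.0
        self.mask.copy_(new_mask)


if __name__ == "__main__":
    layer = SVDLinear(nn.Linear(512, 512))
    layer.keep_top(128)  # compress this layer to rank 128, no re-computation needed
    out = layer(torch.randn(4, 512))
    print(out.shape)  # torch.Size([4, 512])
```

Because the ranking is fixed after the single SGD run, changing the target size only amounts to choosing a different cutoff in the mask, which is what allows compression "to any size" without repeating the optimization.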