Poster
GuidedQuant: Large Language Model Quantization via Exploiting End Loss Guidance
Jinuk Kim · Marwa El Halabi · Wonpyo Park · Clemens Schaefer · Deokjae Lee · Yeonhong Park · Jae W. Lee · Hyun Oh Song
East Exhibition Hall A-B #E-2602
Post-training quantization is a key technique for reducing the memory and inference latency of large language models by quantizing weights and activations without requiring retraining. However, existing methods either (1) fail to account for the varying importance of hidden features to the end loss or, when incorporating end loss, (2) neglect the critical interactions between model weights. To address these limitations, we propose GuidedQuant, a novel quantization approach that integrates gradient information from the end loss into the quantization objective while preserving cross-weight dependencies within output channels. GuidedQuant consistently boosts the performance of state-of-the-art quantization methods across weight-only scalar, weight-only vector, and weight-and-activation quantization. Additionally, we introduce a novel non-uniform scalar quantization algorithm, which is guaranteed to monotonically decrease the quantization objective value, and outperforms existing methods in this category. We release the code at https://github.com/snu-mllab/GuidedQuant.
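To make the abstract's idea more concrete, here is a minimal illustrative sketch of how a layer-wise quantization objective can be weighted by end-loss gradients per output channel while keeping cross-weight interactions within each channel. The shapes, variable names, and the function `guided_layer_objective` are assumptions made for illustration; this is not the paper's exact objective or implementation (see the repository for the actual method).

```python
# Illustrative sketch only: a gradient-weighted, per-output-channel proxy
# objective for layer-wise quantization. Names and weighting are assumed
# for illustration and are not taken from the GuidedQuant codebase.
import numpy as np

def guided_layer_objective(W, W_q, X, G):
    """Score a candidate quantized weight matrix W_q for one linear layer.

    W   : (out_ch, in_ch)  original weights
    W_q : (out_ch, in_ch)  quantized weights
    X   : (tokens, in_ch)  layer inputs collected from calibration data
    G   : (tokens, out_ch) end-loss gradients w.r.t. the layer outputs
    """
    dW = W_q - W
    total = 0.0
    for j in range(W.shape[0]):          # one proxy Hessian per output channel
        Xw = X * G[:, j:j + 1]           # weight each token by its end-loss gradient
        H_j = Xw.T @ Xw                  # (in_ch, in_ch), keeps cross-weight terms
        total += dW[j] @ H_j @ dW[j]     # quadratic error for this output channel
    return total
```

A plain layer-wise objective would drop the gradient term `G` and treat all output channels alike; weighting by end-loss gradients lets channels that matter more to the final loss dominate the quantization error.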
Modern AI chatbots like ChatGPT are very powerful, but they rely on extremely large language models (LLMs) that require a lot of memory and computing power to run. This makes them difficult to deploy, especially on small devices like smartphones. One effective way to make them smaller and faster is through quantization, a technique that replaces the model’s parameters with lower-precision approximations. For example, a parameter with value 1.897 might be rounded to 2. But this often hurts the model’s performance.

We introduce a new quantization approach called GuidedQuant, which estimates how each parameter in the model affects its overall accuracy and uses this information to decide how to approximate it. Our approach doesn’t require retraining the model and can be applied to many existing quantization methods, improving their performance. We tested GuidedQuant across a range of LLMs and quantization methods and found it consistently improved performance after compression. We also developed a new algorithm for a specific type of quantization that further boosts performance.

Our work contributes to an ongoing effort to make LLMs faster and more efficient, helping bring them to more users and devices, and reducing their environmental footprint. Our code and results are available at: https://github.com/snu-mllab/GuidedQuant.
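As a toy illustration of the rounding idea mentioned above, the snippet below applies simple min-max uniform quantization to a few weights. The values and code are hypothetical examples, not the repository's implementation or the non-uniform scheme proposed in the paper.

```python
# Toy example: uniform 4-bit quantization of a handful of weights.
# Purely illustrative; not taken from the GuidedQuant repository.
import numpy as np

w = np.array([1.897, -0.42, 0.051, -2.73])
n_levels = 2 ** 4                           # 4-bit -> 16 representable values
scale = (w.max() - w.min()) / (n_levels - 1)
q = np.round((w - w.min()) / scale)         # integer codes in [0, 15]
w_hat = q * scale + w.min()                 # low-precision approximation of w
print(q, w_hat)                             # each weight snaps to the nearest level
```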