Poster
GANQ: GPU-Adaptive Non-Uniform Quantization for Large Language Models
Pengxiang Zhao · Xiaoming Yuan
East Exhibition Hall A-B #E-2704
Large language models (LLMs) are powerful but require substantial computing resources, making them expensive to operate. This poses a significant barrier to their widespread use. A common solution is quantization, which reduces the precision of model weights to save memory, but current methods suffer from hardware inefficiencies and accuracy loss. To address these challenges, we developed GANQ, a layer-wise post-training non-uniform quantization framework optimized for lookup-table (LUT)-based inference, which replaces complex computations with simple table lookups. Unlike existing heuristic approaches, GANQ determines effective low-bit LUT representations by mathematically minimizing quantization error layer by layer, and it decomposes the optimization into independent subproblems that can be solved in parallel on GPUs, enhancing computational efficiency. This approach reduces memory usage and accelerates inference, enabling larger and faster LLMs to run on everyday hardware. By making advanced AI more practical and accessible, GANQ represents a significant advancement in the field.
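To give a flavor of the two ideas the abstract describes, the sketch below illustrates generic LUT-based non-uniform quantization in NumPy: each weight row gets its own small lookup table fitted by 1-D k-means, rows are quantized independently (the property that enables parallel per-row subproblems on a GPU), and dequantization is a pure table lookup. This is a simplified illustration under our own assumptions, not GANQ's actual optimization algorithm.

```python
# Illustrative sketch of LUT-based non-uniform quantization (NOT GANQ's
# actual solver): each weight row gets a 2^b-entry lookup table fitted
# with simple 1-D k-means. Rows are mutually independent subproblems,
# which is what makes this style of quantization GPU-parallelizable.
import numpy as np

def fit_row_lut(row, bits=3, iters=20):
    """Fit 2^bits centroids to one weight row via Lloyd's algorithm."""
    k = 2 ** bits
    # Initialize centroids at evenly spaced quantiles of the row.
    lut = np.quantile(row, np.linspace(0.0, 1.0, k))
    for _ in range(iters):
        # Assign every weight to its nearest centroid.
        idx = np.abs(row[:, None] - lut[None, :]).argmin(axis=1)
        # Move each centroid to the mean of its assigned weights.
        for j in range(k):
            if np.any(idx == j):
                lut[j] = row[idx == j].mean()
    return lut, idx.astype(np.uint8)

def quantize_matrix(W, bits=3):
    """Quantize each row independently; returns per-row LUTs and codes."""
    luts, codes = zip(*(fit_row_lut(r, bits) for r in W))
    return np.stack(luts), np.stack(codes)

def dequantize(luts, codes):
    """Inference-side reconstruction: a pure table lookup per weight."""
    return np.take_along_axis(luts, codes.astype(np.int64), axis=1)

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 256))          # toy "weight matrix"
luts, codes = quantize_matrix(W, bits=3)
W_hat = dequantize(luts, codes)
mse = float(np.mean((W - W_hat) ** 2))
```

Only the uint8 codes and the small LUTs need to be stored, which is where the memory savings come from; at inference time the lookup replaces on-the-fly dequantization arithmetic.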