Poster
GANQ: GPU-Adaptive Non-Uniform Quantization for Large Language Models
Pengxiang Zhao · Xiaoming Yuan
East Exhibition Hall A-B #E-2704
Large language models (LLMs) are powerful but require substantial computing resources, making them expensive to operate. This poses a significant barrier to their widespread use. A common solution is quantization, which reduces the precision of model weights to save memory, but current methods suffer from hardware inefficiencies and accuracy loss. To address these challenges, we developed GANQ, a layer-wise post-training non-uniform quantization framework optimized for lookup-table (LUT)-based inference, which replaces complex computations with simple table lookups. Unlike existing heuristic approaches, GANQ determines effective low-bit LUT representations by mathematically minimizing quantization error layer by layer, and it decomposes the optimization into independent subproblems that can be solved in parallel on GPUs, enhancing computational efficiency. This approach reduces memory usage and accelerates inference, enabling larger and faster LLMs to run on everyday hardware. By making advanced AI more practical and accessible, GANQ represents a significant advancement in the field.
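To give a flavor of the two ideas the abstract describes, the sketch below illustrates generic LUT-based non-uniform quantization in NumPy: each weight row gets its own small lookup table fitted by 1-D k-means, rows are quantized independently (the property that enables parallel per-row subproblems on a GPU), and dequantization is a pure table lookup. This is a simplified illustration under our own assumptions, not GANQ's actual optimization algorithm.

```python
# Illustrative sketch of LUT-based non-uniform quantization (NOT GANQ's
# actual solver): each weight row gets a 2^b-entry lookup table fitted
# with simple 1-D k-means. Rows are mutually independent subproblems,
# which is what makes this style of quantization GPU-parallelizable.
import numpy as np

def fit_row_lut(row, bits=3, iters=20):
    """Fit 2^bits centroids to one weight row via Lloyd's algorithm."""
    k = 2 ** bits
    # Initialize centroids at evenly spaced quantiles of the row.
    lut = np.quantile(row, np.linspace(0.0, 1.0, k))
    for _ in range(iters):
        # Assign every weight to its nearest centroid.
        idx = np.abs(row[:, None] - lut[None, :]).argmin(axis=1)
        # Move each centroid to the mean of its assigned weights.
        for j in range(k):
            if np.any(idx == j):
                lut[j] = row[idx == j].mean()
    return lut, idx.astype(np.uint8)

def quantize_matrix(W, bits=3):
    """Quantize each row independently; returns per-row LUTs and codes."""
    luts, codes = zip(*(fit_row_lut(r, bits) for r in W))
    return np.stack(luts), np.stack(codes)

def dequantize(luts, codes):
    """Inference-side reconstruction: a pure table lookup per weight."""
    return np.take_along_axis(luts, codes.astype(np.int64), axis=1)

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 256))          # toy "weight matrix"
luts, codes = quantize_matrix(W, bits=3)
W_hat = dequantize(luts, codes)
mse = float(np.mean((W - W_hat) ** 2))
```

Only the uint8 codes and the small LUTs need to be stored, which is where the memory savings come from; at inference time the lookup replaces on-the-fly dequantization arithmetic.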