Poster in Workshop: ES-FoMo III: 3rd Workshop on Efficient Systems for Foundation Models
PoTPTQ: A Two-step Power-of-Two Post-training for LLMs
Xinyu Wang · Vahid Partovi Nia · Peng Lu · Jerry Huang · Xiao-Wen Chang · Boxing Chen · Yufei Cui
Abstract:
Large Language Models (LLMs) are powerful but resource-intensive. Power-of-two (PoT) quantization offers hardware-friendly compression, but it often struggles with accuracy at low precision, and its dequantization is inefficient on GPUs due to sign-bit entanglement and sequential bit manipulations. We propose PoTPTQ, a novel PoT quantization framework for LLM weights that achieves state-of-the-art accuracy at extremely low precision (2 and 3 bits) and enables faster inference through efficient dequantization. Our two-step post-training algorithm robustly initializes the quantization scales and then refines them with a minimal calibration set. PoTPTQ surpasses integer quantization baselines at low precision and achieves dequantization speedups of up to $3.67\times$ on an NVIDIA V100 and $1.63\times$ on an NVIDIA RTX 4090 compared with uniform integer dequantization.
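To make the PoT idea concrete, the sketch below shows a naive signed power-of-two weight quantizer and dequantizer in PyTorch. The function names (`pot_quantize`, `pot_dequantize`), the per-tensor max-abs scale, and the round-in-log2 mapping are illustrative assumptions for exposition only; they are not the PoTPTQ two-step scale initialization/refinement algorithm or its GPU dequantization kernel.

```python
# Minimal sketch of signed power-of-two (PoT) weight quantization, assuming a
# per-tensor max-abs scale and a codebook {±scale * 2^e : e = 0, -1, ..., -(2^(b-1) - 1)}.
# Illustrative only -- not the PoTPTQ two-step algorithm or its GPU dequantization kernel.
import torch

def pot_quantize(w: torch.Tensor, n_bits: int = 3):
    """Map weights to (sign, exponent, scale) so that w ~ sign * scale * 2^exp."""
    n_levels = 2 ** (n_bits - 1)                      # exponent levels (one bit reserved for sign)
    scale = w.abs().max().clamp(min=1e-8)             # naive scale initialization (assumption)
    sign = torch.sign(w)
    # Round |w| / scale to the nearest exponent in {0, -1, ..., -(n_levels - 1)}.
    ratio = (w.abs() / scale).clamp(min=2.0 ** -(n_levels - 1))
    exp = torch.round(torch.log2(ratio)).clamp(min=-(n_levels - 1), max=0)
    return sign, exp.to(torch.int8), scale

def pot_dequantize(sign: torch.Tensor, exp: torch.Tensor, scale: torch.Tensor):
    """Reconstruct weights: only a sign flip and an exponent shift are needed."""
    return sign * scale * torch.pow(2.0, exp.float())

if __name__ == "__main__":
    w = torch.randn(256, 256) * 0.05
    sign, exp, scale = pot_quantize(w, n_bits=3)
    w_hat = pot_dequantize(sign, exp, scale)
    print("reconstruction MSE:", torch.mean((w - w_hat) ** 2).item())
```

Because every reconstructed value has the form sign × scale × 2^exp, dequantization reduces to a sign flip and an exponent shift rather than a full multiply per level, which is the property that makes PoT formats amenable to the kind of dequantization speedups reported in the abstract.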