

Oral in Workshop: ES-FoMo III: 3rd Workshop on Efficient Systems for Foundation Models

FPTQuant: Function-Preserving Transforms for LLM Quantization

Boris van Breugel · Yelysei Bondarenko · Paul Whatmough · Markus Nagel

Sat 19 Jul 10:30 a.m. PDT — 10:45 a.m. PDT

Abstract: Large language models (LLMs) are compute- and energy-intensive at inference time. While quantization improves efficiency, naive approaches often degrade performance due to outliers. We introduce FPTQuant, a method that enables effective transformer quantization through four novel, lightweight function-preserving transforms (FPTs): (1) a pre-RoPE transform for queries/keys, (2) a value transform, (3) an MLP scaling transform, and (4) a dynamic residual scaling. These FPTs exploit transformer equivariances to reshape activations without altering model function, require no custom kernels, and add negligible inference overhead. FPTQuant enables static INT4 quantization with minimal overhead and shows a SOTA speed-up of up to $3.9\times$ over FP. Empirically, FPTQuant offers an excellent accuracy-speed trade-off: it performs on par with or exceeds most prior work, and is only slightly less accurate than a method that is up to 29\% slower.
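To make the idea of a function-preserving transform concrete, here is a minimal sketch (not the authors' implementation) of folding a per-channel scale into a gated MLP: the layer's output is unchanged while its intermediate activations are rescaled ahead of quantization. The weight names (`W_gate`, `W_up`, `W_down`) and the max-abs scale heuristic are illustrative assumptions.

```python
# Minimal sketch of a function-preserving per-channel scaling transform
# on a SwiGLU-style gated MLP. The scale s is folded into W_up and its
# inverse into W_down, so the layer computes the same function but its
# hidden activations are reshaped (e.g. to tame outliers before quantization).
import numpy as np

rng = np.random.default_rng(0)
d_model, d_hidden = 8, 16

W_gate = rng.normal(size=(d_hidden, d_model))
W_up   = rng.normal(size=(d_hidden, d_model))
W_down = rng.normal(size=(d_model, d_hidden))

def silu(x):
    return x / (1.0 + np.exp(-x))

def gated_mlp(x, W_gate, W_up, W_down):
    # down( silu(gate(x)) * up(x) )
    return W_down @ (silu(W_gate @ x) * (W_up @ x))

# Illustrative scale choice: equalize the max-abs magnitude of W_up rows.
s = np.abs(W_up).max(axis=1)           # shape (d_hidden,), strictly positive

# Fold 1/s into the rows of W_up and s into the columns of W_down.
# Because the gate branch multiplies elementwise, the factors cancel exactly.
W_up_t   = W_up * (1.0 / s)[:, None]
W_down_t = W_down * s[None, :]

x = rng.normal(size=(d_model,))
y_ref = gated_mlp(x, W_gate, W_up, W_down)
y_fpt = gated_mlp(x, W_gate, W_up_t, W_down_t)
assert np.allclose(y_ref, y_fpt), "transform should not change the output"
print("max abs difference:", np.max(np.abs(y_ref - y_fpt)))
```

The same cancellation principle underlies transforms that are absorbed into adjacent weight matrices: because the extra factor and its inverse are folded into existing projections, no custom kernels are needed and inference overhead stays negligible.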
