

Poster

BlockDialect: Block-wise Fine-grained Mixed Format Quantization for Energy-Efficient LLM Inference

Wonsuk Jang · Thierry Tambe

East Exhibition Hall A-B #E-2603
Wed 16 Jul 11 a.m. PDT — 1:30 p.m. PDT

Abstract:

The rapidly increasing size of large language models (LLMs) presents significant challenges in memory usage and computational costs. Quantizing both weights and activations can address these issues, with hardware-supported fine-grained scaling emerging as a promising solution to mitigate outliers. However, existing methods struggle to capture nuanced block data distributions. We propose BlockDialect, a block-wise fine-grained mixed format quantization technique that assigns a per-block optimal number format from a formatbook for better data representation. Additionally, we introduce DialectFP4, a formatbook of FP4 variants (akin to dialects) that adapt to diverse data distributions. To leverage this efficiently, we propose a two-stage approach for online DialectFP4 activation quantization. Importantly, DialectFP4 ensures energy efficiency by selecting representable values as scaled integers compatible with low-precision integer arithmetic. BlockDialect achieves a 10.78% (7.48%) accuracy gain on the LLaMA3-8B (LLaMA2-7B) model compared to the MXFP4 format with lower bit usage per value, while being only 5.45% (2.69%) below full precision even when quantizing full-path matrix multiplication. Focusing on how to represent over how to scale, our work presents a promising path for energy-efficient LLM inference.
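To make the core idea concrete, below is a minimal, hypothetical sketch of per-block format selection from a formatbook: each block is scaled by its maximum magnitude and quantized with whichever value grid ("dialect") minimizes its reconstruction error. The grids, the block size of 32, and the squared-error criterion are illustrative assumptions, not the actual DialectFP4 formats or the paper's two-stage online procedure.

```python
import numpy as np

# Illustrative formatbook: each "dialect" is a set of non-negative representable
# values (scaled integers). These grids are placeholders, NOT the actual
# DialectFP4 dialects from the paper.
FORMATBOOK = [
    np.array([0, 1, 2, 3, 4, 6, 8, 12], dtype=np.float32),   # coarser near the max
    np.array([0, 1, 2, 3, 4, 5, 6, 8], dtype=np.float32),    # denser mid-range
    np.array([0, 1, 2, 4, 6, 8, 10, 12], dtype=np.float32),  # wider spread
]

def quantize_block(block: np.ndarray, formatbook) -> np.ndarray:
    """Assign the block whichever format (dialect) yields the lowest error."""
    amax = float(np.abs(block).max()) or 1.0  # avoid divide-by-zero for all-zero blocks
    best_err, best_q = float("inf"), block
    for grid in formatbook:
        # Mirror the non-negative grid to get signed representable values,
        # scaled so the block maximum maps onto the largest code.
        levels = np.concatenate([-grid[::-1], grid]) * (amax / float(grid.max()))
        q = levels[np.abs(block[:, None] - levels[None, :]).argmin(axis=1)]
        err = float(np.square(block - q).sum())
        if err < best_err:
            best_err, best_q = err, q
    return best_q

# Example: quantize a random matrix in contiguous blocks of 32 values.
x = np.random.randn(4, 64).astype(np.float32)
blocks = x.reshape(-1, 32)
xq = np.stack([quantize_block(b, FORMATBOOK) for b in blocks]).reshape(x.shape)
print("mean squared quantization error:", float(np.mean((x - xq) ** 2)))
```

The integer-valued grids mimic the paper's claim that representable values are scaled integers compatible with low-precision integer arithmetic; the real dialects and the online two-stage activation quantization differ in detail.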

Lay Summary:

Large language models (LLMs) have grown very large, demanding significant memory and computing power. A common solution is to "quantize" them, representing their numbers with fewer bits (e.g., 4 bits instead of 16), often processing data in "blocks", small pieces of a matrix. However, current block-based methods sometimes fail to fully capture the unique patterns within each data block.

We introduce "BlockDialect," a new technique that intelligently assigns the optimal number format to each data block for more accurate representation. It uses our specially designed "DialectFP4," a collection of number formats tailored to diverse data patterns. An efficient two-stage process then chooses the best format for each block as the LLM runs, all while remaining energy-efficient by using simple, hardware-friendly calculations.

BlockDialect improves the accuracy of popular LLMs such as LLaMA3 compared to existing techniques, while using even fewer bits per value. Beyond the direct energy savings from this data reduction, our approach is also designed to work seamlessly with energy-efficient hardware units, offering a promising way to run these advanced AI models much more efficiently overall.
