Poster
in
Workshop: ES-FoMo III: 3rd Workshop on Efficient Systems for Foundation Models

Optimal Formats for Weight Quantisation

Douglas Orr · Luka Ribar · Carlo Luschi


Abstract:

We propose a framework for systematic design and analysis of quantisation formats. Our objective of minimising the KL divergence between the original and quantised model outputs aligns with minimising the squared quantisation error of the model parameters. Guided by classical quantisation theory, we therefore develop and evaluate squared-error-optimal formats for known distributions. Uniform quantisation followed by lossless compression with a variable-length code is shown to be optimal. However, we find that commonly used block formats and sparse outlier formats also outperform fixed-length codes, implying they also exploit variable-length encoding. Finally, we derive the optimal allocation of bit-widths to individual parameter tensors across the model's layers, saving up to 0.25 bits per parameter when tested with direct-cast quantisation of language models.
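The core pipeline described above, uniform quantisation followed by lossless compression with a variable-length code, can be illustrated with a toy sketch. This is not the paper's implementation; the Gaussian weight distribution, the step size, and the use of empirical symbol entropy as a stand-in for an ideal variable-length code are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for a parameter tensor (assumed roughly Gaussian)
w = rng.standard_normal(100_000)

def uniform_quantise(x, step):
    """Round each value to the nearest point on a uniform grid."""
    return np.round(x / step) * step

step = 0.5  # illustrative grid spacing, not a tuned value
q = uniform_quantise(w, step)

# Squared quantisation error: the proxy objective the abstract aligns
# with minimising the KL divergence between model outputs
mse = np.mean((w - q) ** 2)

# Empirical entropy of the quantised symbols: the rate (bits/parameter)
# achievable by an ideal variable-length code over this grid
_, counts = np.unique(q, return_counts=True)
p = counts / counts.sum()
entropy_bits = -np.sum(p * np.log2(p))

print(f"MSE = {mse:.4f}, rate = {entropy_bits:.2f} bits/parameter")
```

For a fine uniform grid, the squared error approaches the classical step²/12, while the symbol entropy falls below the fixed-length bit cost of the same grid, which is the gap a variable-length code exploits.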
