Poster
Matryoshka Quantization
Pranav Nair · Puranjay Datta · Jeff Dean · Prateek Jain · Aditya Kusupati
East Exhibition Hall A-B #E-3606
Quantizing model weights is critical for reducing the communication and inference costs of large models. However, quantizing models – especially to low precisions like int4 or int2 – requires a trade-off in model quality; int2, in particular, is known to severely degrade model quality. Consequently, practitioners are often forced to maintain multiple models with different quantization levels or serve a single model that best satisfies the quality-latency trade-off. On the other hand, integer data types, such as int8, inherently possess a nested (Matryoshka) structure where smaller bit-width integers, like int4 or int2, are nested within the most significant bits. Leveraging this insight, in this paper, we propose Matryoshka Quantization (MatQuant), a novel multi-scale quantization technique that alleviates the aforementioned challenge. This technique allows us to train and maintain a single quantized model but serve it with the precision demanded by the deployment. Furthermore, leveraging MatQuant's co-training and co-distillation, int2 precision models extracted by MatQuant outperform standard int2 quantization by up to 4% and 7% with OmniQuant and QAT as base algorithms, respectively. Finally, we demonstrate that by using an extra bit to represent outliers, a model with an effective precision of 2.05-bit improves further by 6% with OmniQuant as the base algorithm.
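As a rough illustration of the nested integer structure described above, the following Python sketch (not from the paper; the helper name slice_msbs and the example values are invented for illustration) shows how int4 and int2 codes can be read directly off the most significant bits of stored unsigned int8 codes.

```python
# Minimal sketch of the nested (Matryoshka) structure of integer types:
# the int4 and int2 codes of a weight live in the most significant bits
# of its int8 code, so they can be recovered by a simple right shift.
# Helper name and values are illustrative, not from the MatQuant paper.
import numpy as np

def slice_msbs(q8: np.ndarray, bits: int) -> np.ndarray:
    """Keep the top `bits` most significant bits of unsigned int8 codes."""
    assert 1 <= bits <= 8
    return q8.astype(np.uint8) >> (8 - bits)

q8 = np.array([3, 47, 128, 200, 255], dtype=np.uint8)  # int8 codes in [0, 255]
q4 = slice_msbs(q8, 4)  # nested int4 codes in [0, 15] -> [0, 2, 8, 12, 15]
q2 = slice_msbs(q8, 2)  # nested int2 codes in [0, 3]  -> [0, 0, 2, 3, 3]
```

In MatQuant, the co-training and co-distillation mentioned in the abstract are what make these sliced low-bit models accurate, as opposed to slicing a conventionally quantized int8 model after the fact.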
Large artificial intelligence models demand substantial memory for their intricate weight parameters. Quantization, representing these weights with lower bit precision, mitigates this memory footprint but often at the cost of reduced accuracy. Consequently, practitioners frequently maintain multiple distinct model versions, each tailored to specific trade-offs between accuracy and computational speed, posing a practical challenge.

Matryoshka Quantization (MatQuant) presents an innovative solution: a single, unified model trained with nested precision levels, conceptually akin to Russian Matryoshka dolls. In this framework, models with lower bit precision are intrinsically embedded within their higher-precision counterparts. These more compact versions can be readily accessed by selectively "slicing" the appropriate weight parameters from the larger model, as sketched below.

This approach provides deployment flexibility, enabling the model to operate at various precision levels for either high-quality results or faster, compact execution. Notably, MatQuant's joint training process surprisingly improves the performance of the lower-bit precision models compared to when they are trained independently, offering a significant advantage beyond mere efficiency.
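To make the deployment-time "slicing" concrete, here is a hypothetical serving sketch: a single int8 checkpoint is stored, and the precision is chosen when the weights are loaded. The helper dequantize_at and the simple max-code scaling are assumptions for illustration, not MatQuant's actual quantization scheme.

```python
# Hypothetical sketch of serving one stored int8 checkpoint at a chosen
# precision: slice the most significant bits, then rescale to floats.
# The uniform max-code scaling below is a simplification, not the
# quantization scheme used in the MatQuant paper.
import numpy as np

def dequantize_at(q8: np.ndarray, scale: np.ndarray, bits: int) -> np.ndarray:
    """Dequantize the nested `bits`-wide codes sliced from int8 codes."""
    codes = q8.astype(np.uint8) >> (8 - bits)   # slice the MSBs
    max_code = (1 << bits) - 1                  # largest representable code
    return codes.astype(np.float32) / max_code * scale

stored_q8 = np.array([[3, 47], [200, 255]], dtype=np.uint8)  # one checkpoint
row_scale = np.array([[0.02], [0.05]], dtype=np.float32)     # per-row scales

w_quality = dequantize_at(stored_q8, row_scale, bits=8)  # highest quality
w_compact = dequantize_at(stored_q8, row_scale, bits=2)  # smallest / fastest
```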