Poster
An Efficient Matrix Multiplication Algorithm for Accelerating Inference in Binary and Ternary Neural Networks
Mohsen Dehghankar · Mahdi Erfanian · Abolfazl Asudeh
East Exhibition Hall A-B #E-3305
Deep Neural Networks and Large Language Models (LLMs), like ChatGPT, are very powerful but require expensive hardware and substantial energy to run. This makes it difficult to use them on everyday devices like smartphones or in settings where computing resources are limited. Our research addresses this challenge by making these models much faster and more memory-efficient during the inference stage, when the model is used to generate answers.

We focus on quantized models, where the weights (the core numerical values that define the model) are restricted to a small set of possible values. This kind of quantization already reduces cost, but we take it further by designing algorithms that exploit the fact that these weights do not change after training.

Our approach compresses and preprocesses these fixed weights to build special "indices" that significantly speed up the model's computations. As a result, we reduce the time it takes to perform the most common operation, matrix multiplication. These improvements make it more practical to run advanced AI models efficiently on less powerful, more affordable hardware.
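To give a flavor of why fixed, low-bit weights help, here is a minimal sketch (not the paper's actual algorithm, which builds more elaborate indices): with ternary weights in {-1, 0, +1}, a matrix-vector product needs no multiplications at all, and because the weights never change after training, the column positions of the +1 and -1 entries can be precomputed once and reused for every input. The function names `preprocess_ternary` and `ternary_matvec` are illustrative, not from the paper.

```python
# Illustrative sketch only: precompute an "index" over a fixed ternary
# weight matrix so that inference-time matrix-vector products reduce to
# additions and subtractions over the input vector.
import numpy as np

def preprocess_ternary(W):
    """One-time preprocessing of a fixed ternary weight matrix W.

    For each row, record which columns hold +1 and which hold -1.
    These index lists stand in for the precomputed structure that is
    paid for once, offline, and reused for every inference call.
    """
    plus = [np.flatnonzero(row == 1) for row in W]
    minus = [np.flatnonzero(row == -1) for row in W]
    return plus, minus

def ternary_matvec(index, x):
    """Compute W @ x using only additions and subtractions."""
    plus, minus = index
    return np.array([x[p].sum() - x[m].sum() for p, m in zip(plus, minus)])

# Usage: the weights are fixed after training, so preprocessing runs once;
# each new input vector x only pays for the cheap add/subtract pass.
rng = np.random.default_rng(0)
W = rng.integers(-1, 2, size=(4, 8))   # ternary weights in {-1, 0, +1}
x = rng.standard_normal(8)             # activation vector (changes per input)
index = preprocess_ternary(W)          # offline step
assert np.allclose(ternary_matvec(index, x), W @ x)
```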