Poster
An Efficient Matrix Multiplication Algorithm for Accelerating Inference in Binary and Ternary Neural Networks
Mohsen Dehghankar · Mahdi Erfanian · Abolfazl Asudeh
East Exhibition Hall A-B #E-3305
Deep Neural Networks and Large Language Models (LLMs), like ChatGPT, are very powerful but require expensive hardware and substantial energy to run. This makes it difficult to use them on everyday devices like smartphones or in settings where computing resources are limited. Our research addresses this challenge by making these models much faster and more memory-efficient during the inference stage, when the model is used to generate answers.

We focus on quantized models, where the weights (the core numerical values that define the model) are restricted to a small set of possible values. This kind of quantization already reduces cost, but we take it further by designing algorithms that exploit the fact that these weights do not change after training.

Our approach compresses and preprocesses these fixed weights to build special "indices" that significantly speed up the model's computations. As a result, we reduce the time it takes to perform the most common operation, matrix multiplication. These improvements make it more practical to run advanced AI models efficiently on less powerful, more affordable hardware.
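To give a flavor of why fixed, low-bit weights help, here is a minimal sketch (not the paper's actual algorithm, which builds more elaborate indices): with ternary weights in {-1, 0, +1}, a matrix-vector product needs no multiplications at all, and because the weights never change after training, the column positions of the +1 and -1 entries can be precomputed once and reused for every input. The function names `preprocess_ternary` and `ternary_matvec` are illustrative, not from the paper.

```python
# Illustrative sketch only: precompute an "index" over a fixed ternary
# weight matrix so that inference-time matrix-vector products reduce to
# additions and subtractions over the input vector.
import numpy as np

def preprocess_ternary(W):
    """One-time preprocessing of a fixed ternary weight matrix W.

    For each row, record which columns hold +1 and which hold -1.
    These index lists stand in for the precomputed structure that is
    paid for once, offline, and reused for every inference call.
    """
    plus = [np.flatnonzero(row == 1) for row in W]
    minus = [np.flatnonzero(row == -1) for row in W]
    return plus, minus

def ternary_matvec(index, x):
    """Compute W @ x using only additions and subtractions."""
    plus, minus = index
    return np.array([x[p].sum() - x[m].sum() for p, m in zip(plus, minus)])

# Usage: the weights are fixed after training, so preprocessing runs once;
# each new input vector x only pays for the cheap add/subtract pass.
rng = np.random.default_rng(0)
W = rng.integers(-1, 2, size=(4, 8))   # ternary weights in {-1, 0, +1}
x = rng.standard_normal(8)             # activation vector (changes per input)
index = preprocess_ternary(W)          # offline step
assert np.allclose(ternary_matvec(index, x), W @ x)
```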