Poster
HashAttention: Semantic Sparsity for Faster Inference
Aditya Desai · Shuo Yang · Alejandro Cuadron · Matei Zaharia · Joseph E Gonzalez · Ion Stoica
East Exhibition Hall A-B #E-3412
Modern AI systems such as chatbots, image generators, and code assistants rely on a mechanism called “attention” to decide which parts of the input are most important, but this process becomes slow and memory-intensive as inputs get longer. We noticed that not every word or token contributes equally; only a few really matter.

Our method, HashAttention, finds and focuses only on these important tokens. We discovered this could be done by treating the problem like a recommendation system, similar to how Netflix suggests shows based on your preferences. Using a few mathematical tricks and learned functions, we represent the tokens in a compact bit format that allows fast comparisons using simple bitwise operations.

HashAttention speeds up attention without hurting accuracy. It can reduce the number of tokens processed by up to 32× while keeping output quality nearly the same. This leads to faster, more efficient AI models, helping them handle longer inputs, reason for longer, and produce more text with less computing power and lower costs.
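To make the core idea concrete, here is a minimal sketch of signature-based token selection. It is illustrative only: the paper learns the hash functions, whereas this sketch stands in a random projection for them, and all names, dimensions, and token counts below are hypothetical.

```python
import torch

# Hypothetical sizes for illustration (not from the paper).
NUM_BITS = 32   # bits per token signature
D_MODEL = 64    # token embedding dimension

# Stand-in for the learned mapping: a fixed random projection.
projection = torch.randn(D_MODEL, NUM_BITS)

def signature(x: torch.Tensor) -> int:
    """Compress a token vector into a single integer bit signature."""
    bits = (x @ projection > 0).long()      # NUM_BITS sign bits
    weights = 2 ** torch.arange(NUM_BITS)   # pack the bits into one integer
    return int((bits * weights).sum())

def hamming(a: int, b: int) -> int:
    """Cheap bitwise comparison: XOR the signatures, then count set bits."""
    return bin(a ^ b).count("1")

# Score all cached tokens against a query using bit operations only,
# then keep just the closest few for full attention.
query = torch.randn(D_MODEL)
keys = torch.randn(128, D_MODEL)            # 128 cached tokens
q_sig = signature(query)
scores = [hamming(q_sig, signature(k)) for k in keys]
top_tokens = sorted(range(len(keys)), key=lambda i: scores[i])[:4]
print(top_tokens)  # indices of the tokens attention would focus on
```

Keeping 4 of 128 tokens in this toy example mirrors the up-to-32× reduction described above: the expensive attention computation only runs over the small set of tokens the signatures flag as relevant.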