Poster
in
Workshop: Tiny Titans: The next wave of On-Device Learning for Foundation Models (TTODLer-FM)

MatMuls are Enough for Efficient and Performant Linear-Time Attention

Andrew Argatkiny · Ilya Makarov

Fri 18 Jul 1 p.m. PDT — 1:45 p.m. PDT

Abstract:

Transformers, despite empowering the current AI revolution, are bottlenecked by suboptimal hardware utilization and the quadratic runtime complexity of softmax attention with respect to input sequence length. Many recent architectures aspire to bring this complexity down to a sub-quadratic level without compromising modeling quality. However, they are either much slower on all but very long sequences or rely on low-level code tailored to a narrow subset of modern hardware. To simultaneously achieve linear complexity, hardware efficiency, and portability, we completely eliminate softmax from self-attention; remove, modify, or rearrange the other transformations in the Transformer block; and reduce the number of attention heads. The resulting architecture, DenseAttention Network, composes its attention entirely of dense matrix multiplications, which allows for efficient training and inference in both quadratic and linear modes. It performs on par with the standard Transformer in language modeling and surpasses the previous Transformer-based SOTA by 5% on the challenging Long Range Arena benchmarks. A DenseAttention model written in plain PyTorch is up to 22% faster than a Transformer augmented with the low-level FlashAttention kernel even on small context sizes, and faster by orders of magnitude on longer sequences.
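Because the softmax is eliminated, the attention reduces to a chain of dense matrix multiplications, and associativity lets the two products be evaluated in either quadratic or linear order. The following is a minimal PyTorch sketch of that idea only; the function name, scaling, and normalization are illustrative assumptions and not the authors' exact DenseAttention formulation.

    # Minimal sketch (assumed, not the paper's exact method): softmax-free attention
    # built only from dense matmuls. With no row-wise softmax between the two
    # products, matrix associativity lets us pick the cheaper evaluation order:
    #   quadratic mode:  (Q @ K^T) @ V   -- O(N^2 * d)
    #   linear mode:     Q @ (K^T @ V)   -- O(N * d^2)
    import torch

    def matmul_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                         linear: bool = True) -> torch.Tensor:
        """q, k, v: (batch, seq_len, d_model). Returns (batch, seq_len, d_model)."""
        d = q.shape[-1]
        scale = d ** -0.5  # assumed scaling; the paper's normalization may differ
        if linear:
            # O(N * d^2): contract over the sequence dimension first
            kv = torch.einsum("bnd,bne->bde", k, v)        # (batch, d, d)
            out = torch.einsum("bnd,bde->bne", q, kv) * scale
        else:
            # O(N^2 * d): materialize the full N x N score matrix
            scores = torch.einsum("bnd,bmd->bnm", q, k)    # (batch, N, N)
            out = torch.einsum("bnm,bmd->bnd", scores, v) * scale
        return out

    if __name__ == "__main__":
        q = k = v = torch.randn(2, 128, 64)
        # Both orderings give the same result up to floating-point error.
        assert torch.allclose(matmul_attention(q, k, v, linear=True),
                              matmul_attention(q, k, v, linear=False), atol=1e-4)

The choice of order is purely a cost trade-off: the quadratic form is cheaper when the sequence length N is small relative to the head dimension d, while the linear form wins as N grows.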
