Oral in Workshop: ES-FoMo III: 3rd Workshop on Efficient Systems for Foundation Models
Hardware-Efficient Attention for Fast Decoding
Ted Zadouri · Hubert Strauss · Tri Dao
Abstract:
For large batches and long contexts, LLM decoding is bottlenecked by loading the KV cache from high-bandwidth memory, which raises per-token latency, while the sequential nature of decoding limits parallelism. We redesign attention to perform more computation per byte of memory transfer, maximizing hardware efficiency without sacrificing parallel scalability. We first present \textit{Grouped-Tied Attention} (GTA), which merges and reuses key and value states to reduce memory traffic without affecting quality. Next, we introduce \textit{Grouped Latent Attention} (GLA), a parallel-friendly latent attention enhanced with low-level optimizations for fast decoding at high quality. Experiments show that GTA matches Grouped-Query Attention (GQA) quality while using roughly half the KV cache, and GLA matches Multi-head Latent Attention (MLA) yet shards more easily. Our optimized GLA kernel is up to $2\times$ faster than FlashMLA in speculative decoding once the query length exceeds one.
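The sketch below illustrates the memory-traffic idea behind GTA in plain PyTorch: the KV cache holds a single tied state per KV group, and that one tensor is read once and reused as both keys and values for every query head in the group, cutting the bytes loaded from HBM roughly in half relative to a separate-K/V grouped cache. This is only a minimal sketch under simplifying assumptions; the head counts, dimensions, function name `gta_decode_step`, and the exact tying scheme are illustrative and not the authors' formulation or kernel.

```python
# Minimal, illustrative sketch of a Grouped-Tied-Attention-style decode step.
# Assumption: the cache stores ONE tied state per KV group, reused as both K and V.
import torch


def gta_decode_step(q, tied_cache, n_groups):
    """One decoding step against a tied, grouped cache (illustrative only).

    q          : (batch, n_q_heads, head_dim)          query for the new token
    tied_cache : (batch, n_groups, seq_len, head_dim)  single tied K/V state
    n_groups   : number of KV groups; n_q_heads must be divisible by it
    """
    batch, n_q_heads, head_dim = q.shape
    heads_per_group = n_q_heads // n_groups

    # Group query heads so that all heads in a group share one cache entry.
    q = q.view(batch, n_groups, heads_per_group, head_dim)

    # The tied state is loaded once per group and serves as both K and V,
    # so memory traffic is roughly half that of caching K and V separately.
    k = tied_cache  # (batch, n_groups, seq_len, head_dim)
    v = tied_cache  # same tensor: no additional bytes moved from HBM

    scores = torch.einsum("bghd,bgsd->bghs", q, k) / head_dim ** 0.5
    probs = scores.softmax(dim=-1)
    out = torch.einsum("bghs,bgsd->bghd", probs, v)
    return out.reshape(batch, n_q_heads, head_dim)


if __name__ == "__main__":
    batch, n_q_heads, n_groups, head_dim, seq_len = 2, 8, 2, 64, 128
    q = torch.randn(batch, n_q_heads, head_dim)
    cache = torch.randn(batch, n_groups, seq_len, head_dim)
    print(gta_decode_step(q, cache, n_groups).shape)  # torch.Size([2, 8, 64])
```

For comparison, a GQA-style cache would store separate key and value tensors per group (about twice the cache above), while GLA, as described in the abstract, instead caches a latent state in a grouped, parallel-friendly layout so it shards across devices; the precise projections involved are not spelled out here.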