

Invited Talk in Workshop: Methods and Opportunities at Small Scale (MOSS)

Designing Efficient Attention: Insights from an Inference Perspective

Tri Dao

Sat 19 Jul 9:55 a.m. PDT — 10:40 a.m. PDT

Abstract:

Inference now drives the progress of AI thanks to test-time compute, necessitating an efficient redesign of core architectural components such as the attention layer. We examine recent progress on these efficiency challenges along two complementary directions. First, starting from the hardware-efficiency first principle of arithmetic intensity, we motivate the design of DeepSeek's multi-head latent attention (MLA) and recent variants such as Grouped-Tied Attention and Grouped Latent Attention. These variants reduce memory-bandwidth requirements by performing more computation per byte loaded from memory, achieving up to a 2× speedup in decoding scenarios. Second, we reduce the FLOPs of attention from quadratic to linear or quasi-linear in sequence length, bridging the gap between linear attention's efficiency and softmax attention's expressiveness. These new linear and quasi-linear attention methods enable sub-quadratic scaling while maintaining modeling capacity. Through systematic investigation at small scales, we demonstrate how inference-driven design principles can unlock new insights into attention mechanisms and provide practical pathways toward more efficient large-scale deployment.
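To make the arithmetic-intensity argument concrete, the sketch below estimates FLOPs per byte loaded from the KV cache during single-token decoding, comparing standard multi-head attention with a grouped/latent-style cache in which many query heads share the same cached K/V. The model dimensions and the simplified cost model are illustrative assumptions, not figures from the talk or from any specific model.

```python
# Back-of-envelope arithmetic intensity for attention decoding.
# All dimensions below are assumptions chosen only to illustrate the trend.

def decode_attention_intensity(n_q_heads: int, n_kv_heads: int, head_dim: int,
                               seq_len: int, bytes_per_elem: int = 2) -> float:
    """FLOPs per byte loaded from the KV cache when decoding one token."""
    # Each query head computes q.k and p.v against seq_len cached positions:
    # roughly 4 * head_dim FLOPs per cached position.
    flops = n_q_heads * seq_len * (4 * head_dim)
    # Bytes moved: the K and V caches for the (possibly shared) KV heads.
    bytes_loaded = 2 * n_kv_heads * seq_len * head_dim * bytes_per_elem
    return flops / bytes_loaded

# Standard multi-head attention: every query head has its own cached K/V.
mha = decode_attention_intensity(n_q_heads=32, n_kv_heads=32, head_dim=128, seq_len=8192)
# Grouped / latent-style sharing: many query heads read the same cached K/V,
# so more computation is performed per byte loaded from memory.
grouped = decode_attention_intensity(n_q_heads=32, n_kv_heads=4, head_dim=128, seq_len=8192)

print(f"MHA intensity:     {mha:.1f} FLOPs/byte")   # ~1 FLOP/byte: heavily memory-bound
print(f"Grouped intensity: {grouped:.1f} FLOPs/byte")  # 8x higher under these assumptions
```

For the second direction, the following sketch shows the classic kernelized linear-attention recurrence, in which a fixed-size running state replaces the growing KV cache, so per-token decoding cost is constant in sequence length. The ELU+1 feature map and the dimensions are illustrative assumptions; the specific linear and quasi-linear variants discussed in the talk differ in their details.

```python
import numpy as np

def phi(x):
    # ELU(x) + 1 feature map (an illustrative choice); keeps features positive.
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention_decode(q_t, k_t, v_t, state, normalizer):
    """One decoding step of kernelized linear attention.
    state: (d_k, d_v) running sum of phi(k) v^T; normalizer: (d_k,) running sum of phi(k)."""
    fk = phi(k_t)
    state = state + np.outer(fk, v_t)              # accumulate phi(k_t) v_t^T
    normalizer = normalizer + fk                   # accumulate phi(k_t)
    fq = phi(q_t)
    out = (fq @ state) / (fq @ normalizer + 1e-6)  # output for this step, O(d_k * d_v) work
    return out, state, normalizer

d_k, d_v, T = 64, 64, 16
rng = np.random.default_rng(0)
state = np.zeros((d_k, d_v))
normalizer = np.zeros(d_k)
for _ in range(T):
    q_t = rng.standard_normal(d_k)
    k_t = rng.standard_normal(d_k)
    v_t = rng.standard_normal(d_v)
    out, state, normalizer = linear_attention_decode(q_t, k_t, v_t, state, normalizer)
print(out.shape)  # (64,)
```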
