

Invited Talk: Tri Dao
in
Workshop: CODEML: Championing Open-source DEvelopment in Machine Learning

Open-Source Attention Optimizations

Tri Dao

Fri 18 Jul 9:15 a.m. PDT — 9:45 a.m. PDT

Abstract:

The evolution of attention optimization represents one of the most impactful open-source success stories in machine learning. This talk traces the journey from early fused multi-head attention in Apex and FlashAttention to today's sophisticated kernels powering production systems worldwide. We examine key milestones: Apex's FMHA, FlashAttention, and the subsequent ecosystem of optimizations including FlexAttention, cuDNN implementations, FlashInfer, and integration into serving frameworks like vLLM, SGLang, and TensorRT-LLM.
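To make the FlashAttention milestone concrete, here is a minimal sketch (in NumPy, as an illustration only, not the actual CUDA implementation) of the tiling idea those kernels rely on: attention is computed exactly, block by block, using an online softmax, so the full N×N score matrix is never materialized. Shapes, block size, and function names are illustrative choices.

```python
import numpy as np

def naive_attention(Q, K, V):
    """Reference: softmax(Q K^T / sqrt(d)) V, materializing all scores."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def blocked_attention(Q, K, V, block=4):
    """Same result, but K/V are streamed in blocks, updating a running
    (row max, softmax normalizer, unnormalized output) triple per block."""
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    m = np.full(n, -np.inf)   # running row-wise max of scores
    l = np.zeros(n)           # running softmax normalizer
    O = np.zeros((n, d))      # running (unnormalized) output
    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        S = Q @ Kb.T * scale                # scores for this block only
        m_new = np.maximum(m, S.max(axis=-1))
        P = np.exp(S - m_new[:, None])      # block weights in new scale
        alpha = np.exp(m - m_new)           # rescale old accumulators
        l = alpha * l + P.sum(axis=-1)
        O = alpha[:, None] * O + P @ Vb
        m = m_new
    return O / l[:, None]

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((8, 16)) for _ in range(3))
assert np.allclose(naive_attention(Q, K, V), blocked_attention(Q, K, V))
```

The production kernels fuse these block updates into a single GPU kernel so the score tiles live only in fast on-chip memory, which is where the memory-bandwidth savings come from.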

Equally important is the tooling infrastructure that enabled this progress: CUTLASS provided CUDA primitives, Triton democratized GPU kernel development, and emerging frameworks such as Mojo, ThunderKittens, Gluon, and CuTe-DSL continue to lower barriers and accelerate innovation. We discuss the tradeoffs among these frameworks and when to use each of them.
