Invited Talk: Tri Dao
in
Workshop: CODEML: Championing Open-source DEvelopment in Machine Learning
Open-Source Attention Optimizations
Tri Dao
The evolution of attention optimization represents one of the most impactful open-source success stories in machine learning. This talk traces the journey from the early fused multi-head attention (FMHA) kernel in Apex and the original FlashAttention to today's sophisticated kernels powering production systems worldwide. We examine key milestones: Apex's FMHA, successive FlashAttention generations, and the subsequent ecosystem of optimizations, including FlexAttention, cuDNN implementations, and FlashInfer, as well as their integration into serving frameworks such as vLLM, SGLang, and TensorRT-LLM.
Equally important is the tooling infrastructure that enabled this progress. From CUTLASS, which provides CUDA C++ primitives, to Triton, which democratized GPU kernel development, to emerging frameworks such as Mojo, ThunderKittens, Gluon, and CuTe-DSL, each tool has lowered the barrier to entry and accelerated innovation. We discuss the tradeoffs among these frameworks and when each is the right choice.