

Spotlight in Workshop: ES-FoMo III: 3rd Workshop on Efficient Systems for Foundation Models

Chipmunk: Training-Free Acceleration of Diffusion Transformers with Dynamic Column-Sparse Deltas

Austin Silveria · Soham Govande · Daniel Y Fu


Abstract:

Diffusion Transformers (DiTs) excel at generating high-quality images and videos but perform redundant computation at inference, increasing costs. Observing that only a small fraction (5-25%) of activations in attention and MLP layers account for 70-90% of the change across inference steps, we introduce Chipmunk, a dynamic sparsity method that recomputes only these rapidly changing activations while caching the remainder. Dynamic sparsity, however, poses system-level challenges: GPU tensor core underutilization and additional runtime overhead from computing sparsity patterns and managing cached activations. To maximize GPU efficiency and approximation quality, Chipmunk employs voxel-based token reordering and efficient column-sparse kernels, achieving a 9.3x kernel speedup at 93% sparsity. Chipmunk also overlaps sparsity-pattern computation and cache updates with ongoing computation to hide this overhead. Chipmunk achieves up to a 2.16x speedup on HunyuanVideo and 1.41x on FLUX.1-dev. Furthermore, we show that Chipmunk can be stacked on top of full-step caching, achieving a 3.72x speedup on HunyuanVideo, a 2.67x speedup on WAN2.1, and a 2.56x speedup on FLUX.1-dev with minimal quality impact.
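The core idea in the abstract, recomputing only the fastest-changing activation columns between inference steps while reusing cached values for the rest, can be sketched in NumPy. This is an illustrative toy, not the authors' implementation: the function name `sparse_delta_step`, the layer interface, and the column-scoring heuristic are assumptions, and a real kernel would compute only the selected columns rather than slicing a dense result.

```python
import numpy as np

def sparse_delta_step(x, cached_in, cached_out, layer_fn, keep_frac=0.1):
    """Toy sketch of dynamic column-sparse delta recomputation.

    x          : (tokens, dim) current layer input
    cached_in  : (tokens, dim) layer input from an earlier inference step
    cached_out : (tokens, dim) layer output cached at that step
    layer_fn   : the dense layer function (hypothetical interface)
    keep_frac  : fraction of columns to recompute (abstract cites 5-25%)
    """
    # Score each column by how much its input changed since the cache was written.
    delta = np.abs(x - cached_in).sum(axis=0)            # (dim,)
    k = max(1, int(keep_frac * delta.size))
    hot = np.argpartition(delta, -k)[-k:]                # top-k fastest-changing columns

    # Recompute only the "hot" columns; keep stale cached values elsewhere.
    # (A real column-sparse kernel computes just these columns; the dense
    # call below is only for clarity in this sketch.)
    out = cached_out.copy()
    out[:, hot] = layer_fn(x)[:, hot]
    return out, hot
```

The cached columns are a slightly stale approximation, which is the trade the abstract describes: because most activations change little across adjacent diffusion steps, reusing them costs little quality while skipping most of the compute.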
