Poster in Workshop: ES-FoMo III: 3rd Workshop on Efficient Systems for Foundation Models
Exploring Diffusion Transformer Designs via Grafting
Keshigeyan Chandrasegaran · Michael Poli · Daniel Y Fu · Dongjun Kim · Lea Hadzic · Manling Li · Agrim Gupta · Stefano Massaroli · Azalia Mirhoseini · Juan Carlos Niebles · Stefano Ermon · Li Fei-Fei
Abstract:
Model architecture design requires decisions such as selecting operators (e.g., attention, convolution) and configurations (e.g., depth, width). However, evaluating the impact of these decisions on model quality requires costly pretraining, limiting architectural exploration. Inspired by how new software is built on existing code, we ask: can new architecture designs be studied using pretrained models? We present *grafting*, a simple approach for editing pretrained diffusion transformers (DiTs) to materialize new architectures under small compute budgets. We study the impact of grafting on model quality using the DiT-XL/2 design. We develop a family of hybrid designs via grafting: replacing softmax attention with gated convolution, local, and linear attention; and MLPs with variable-width and convolutional variants. Notably, many hybrid designs achieve good quality (FID: 2.38–2.64 vs. 2.27 for DiT-XL/2) using $<2$\% pretraining compute. Next, we graft a text-to-image model (PixArt-$\Sigma$), achieving a 43\% speedup with $<2$\% drop in GenEval score. Finally, we present a case study where we restructure DiT-XL/2 by converting every pair of sequential transformer blocks into parallel blocks via grafting, reducing model depth by 2x and achieving better quality (FID: 2.77) than models of comparable depth. Together, we show that new diffusion model designs can be explored by grafting pretrained DiTs, with edits ranging from operator replacement to architecture restructuring. Code and grafted models: https://grafting.stanford.edu
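To make the operator-replacement idea concrete, below is a minimal PyTorch sketch of what grafting an attention operator in a pretrained-style block could look like, assuming a simplified two-step recipe: swap in the new operator, then fit it by regressing its outputs against the frozen original operator's outputs on cached activations. All names here (`SimpleDiTBlock`, `LinearAttention`, `graft_operator`) are hypothetical illustrations, not the paper's actual API or training procedure.

```python
# Hedged sketch of operator grafting: swap an operator in a pretrained-style
# block, then regress the new operator toward the old one's outputs.
# Module/function names are hypothetical, not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SoftmaxAttention(nn.Module):
    """Stand-in for the pretrained block's softmax attention operator."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        out, _ = self.attn(x, x, x, need_weights=False)
        return out


class LinearAttention(nn.Module):
    """Toy linear-attention replacement using an ELU+1 feature map."""
    def __init__(self, dim: int):
        super().__init__()
        self.q, self.k, self.v = (nn.Linear(dim, dim) for _ in range(3))
        self.out = nn.Linear(dim, dim)

    def forward(self, x):
        q = F.elu(self.q(x)) + 1            # (B, N, D)
        k = F.elu(self.k(x)) + 1            # (B, N, D)
        v = self.v(x)                       # (B, N, D)
        kv = torch.einsum("bnd,bne->bde", k, v)                  # (B, D, D)
        z = 1.0 / (torch.einsum("bnd,bd->bn", q, k.sum(1)) + 1e-6)
        return self.out(torch.einsum("bnd,bde,bn->bne", q, kv, z))


class SimpleDiTBlock(nn.Module):
    """Toy stand-in for a pretrained DiT block: attention + MLP residuals."""
    def __init__(self, dim: int):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.attn = SoftmaxAttention(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x):
        x = x + self.attn(self.norm1(x))
        return x + self.mlp(self.norm2(x))


def graft_operator(block, new_attn, activations, steps=200, lr=1e-3):
    """Replace block.attn with new_attn, after regressing new_attn's outputs
    against the frozen original operator's outputs on cached activations."""
    old_attn = block.attn
    opt = torch.optim.AdamW(new_attn.parameters(), lr=lr)
    for step in range(steps):
        x = activations[step % len(activations)]
        with torch.no_grad():
            target = old_attn(block.norm1(x))   # frozen teacher operator
        loss = F.mse_loss(new_attn(block.norm1(x)), target)
        opt.zero_grad()
        loss.backward()
        opt.step()
    block.attn = new_attn                       # grafted block in place
    return block


if __name__ == "__main__":
    dim = 64
    block = SimpleDiTBlock(dim)                            # "pretrained" block
    cached = [torch.randn(4, 16, dim) for _ in range(8)]   # cached activations
    graft_operator(block, LinearAttention(dim), cached, steps=50)
    print(block(torch.randn(2, 16, dim)).shape)            # torch.Size([2, 16, 64])
```

In this toy setup the regression stage would typically be followed by lightweight end-to-end finetuning of the edited model; MLP grafts and the sequential-to-parallel block restructuring described above could be sketched analogously by swapping or rewiring the corresponding submodules.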