

Oral in Workshop: ES-FoMo III: 3rd Workshop on Efficient Systems for Foundation Models

Any-Order GPT as Masked Diffusion Model: Decoupling Formulation and Architecture

Shuchen Xue · Tianyu Xie · Tianyang Hu · Zijin Feng · Jiacheng Sun · Kenji Kawaguchi · Zhenguo Li · Zhi-Ming Ma

Sat 19 Jul 3:30 p.m. PDT — 3:45 p.m. PDT

Abstract: Efficiently scaling Large Language Models (LLMs) necessitates exploring alternatives to dominant autoregressive (AR) methods, with Masked Diffusion Models (MDMs) emerging as candidates. However, comparing the AR (typically decoder-only) and MDM (often encoder-only) paradigms is confounded by their differing architectures, obscuring the true algorithmic and efficiency trade-offs. This research decouples these factors by evaluating MDMs within a decoder-only framework in order to: (1) equitably compare the MDM (viewed as any-order AR) and standard AR paradigms by examining discrepancies across generation orders; and (2) investigate the impact of MDM architectural choices on computational efficiency. We show that decoder-only MDMs, despite their larger modeling space, can achieve significant inference speedups ($\sim25\times$) and comparable perplexity with techniques such as temperature annealing, offering a path to reduced inference compute. This work provides insights for developing more computationally efficient foundation models by disentangling core modeling choices from architectural influences. Code is available at \url{https://github.com/scxue/AO-GPT-MDM}.
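
The any-order view of masked diffusion can be made concrete with a short training-step sketch. The code below is a minimal illustration, not the authors' implementation (see the repository above for that): it assumes a hypothetical decoder-only `model` that takes token ids with masked positions replaced by a `[MASK]` id and returns per-position logits, and it omits the paper's temperature annealing and any loss reweighting.

    # Minimal sketch of a masked-diffusion / any-order AR training objective
    # for a decoder-only LM (illustrative only; `model` and MASK_ID are assumptions).
    import torch
    import torch.nn.functional as F

    MASK_ID = 50257  # hypothetical [MASK] id appended to the vocabulary

    def any_order_ar_loss(model, tokens):
        """One step: mask a random subset of positions and predict the originals.

        Averaging this loss over uniformly sampled mask ratios corresponds, up to
        reweighting, to an expectation over generation orders rather than the
        single left-to-right order used by standard AR training.
        """
        b, n = tokens.shape
        ratio = torch.rand(b, 1, device=tokens.device)          # per-sequence mask ratio
        mask = torch.rand(b, n, device=tokens.device) < ratio   # True = masked position
        mask[:, 0] |= ~mask.any(dim=1)                           # ensure at least one masked token

        inputs = tokens.masked_fill(mask, MASK_ID)               # corrupt masked positions
        logits = model(inputs)                                    # (b, n, vocab_size)

        # Cross-entropy only on the masked positions.
        return F.cross_entropy(logits[mask], tokens[mask])
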
