Poster
Elucidating the design space of language models for image generation
Xuantong Liu · Shaozhe Hao · Xianbiao Qi · Tianyang Hu · JUN WANG · Rong Xiao · Yuan YAO
West Exhibition Hall B2-B3 #W-215
Large language models (LLMs) have achieved remarkable success in text generation, motivating researchers to explore their potential for image generation. However, most existing approaches either rely on custom model designs with vision-specific biases or apply LLMs directly without fully exploring their potential in vision tasks. In this work, we systematically examine how best to repurpose LLMs for image generation by investigating fundamental design choices, including tokenization, modeling strategies, scan patterns, vocabulary construction, and sampling techniques. Through comprehensive analysis and experiments, we show that, without any domain-specific architectural changes, LLMs can achieve state-of-the-art image generation quality when these components are carefully selected. We also study how model size affects learning in this setting, revealing that larger LLMs capture more useful visual patterns and require less randomness during sampling. Additionally, we examine the intrinsic differences between language and images, providing practical insights for adapting autoregressive language models to other non-text domains. Our work demonstrates that general-purpose LLMs, with thoughtful design, can serve as powerful image generators, bridging modality boundaries and informing future multi-domain generative model research.
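To illustrate the sampling dimension mentioned above, here is a minimal sketch (not the authors' released code) of temperature and top-k sampling over discrete image tokens with a decoder-only model; the model interface, vocabulary size, and token-grid length are illustrative assumptions rather than details from the paper.

```python
# Minimal sketch of temperature / top-k sampling over discrete image tokens
# with a decoder-only LM. The model interface (logits of shape
# (batch, seq, vocab)) and the 256-token grid are illustrative assumptions.
import torch


def sample_image_tokens(model, prefix, num_tokens=256, temperature=1.0, top_k=None):
    """Autoregressively sample `num_tokens` image-token ids after `prefix`.

    Lower `temperature` (or smaller `top_k`) means less randomness during
    sampling, which the abstract suggests larger models can tolerate.
    """
    tokens = prefix.clone()                      # (batch, prefix_len) long tensor
    for _ in range(num_tokens):
        logits = model(tokens)[:, -1, :]         # next-token logits: (batch, vocab)
        logits = logits / max(temperature, 1e-6)
        if top_k is not None:
            # Mask out everything below the k-th largest logit.
            kth = torch.topk(logits, top_k, dim=-1).values[..., -1, None]
            logits = logits.masked_fill(logits < kth, float("-inf"))
        probs = torch.softmax(logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        tokens = torch.cat([tokens, next_token], dim=1)
    return tokens[:, prefix.shape[1]:]           # only the generated image tokens
```

In practice the sampled token ids would be mapped back to pixels by the image tokenizer's decoder; the sketch only covers the autoregressive sampling loop.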