

Poster

David and Goliath: Small One-step Model Beats Large Diffusion with Score Post-training

Weijian Luo · colin zhang · Debing Zhang · Zhengyang Geng

East Exhibition Hall A-B #E-1808
[ Project Page ]
Wed 16 Jul 11 a.m. PDT — 1:30 p.m. PDT

Abstract:

We propose Diff-Instruct (DI), a data-efficient post-training approach for one-step text-to-image generative models that improves their alignment with human preferences without requiring image data. Our method frames alignment as online reinforcement learning from human feedback (RLHF): it optimizes a human reward function while regularizing the generator to stay close to a reference diffusion process. Unlike traditional RLHF approaches, which rely on the KL divergence for regularization, we introduce a novel score-based divergence regularization that substantially improves performance. Although this score-based RLHF objective appears intractable to optimize directly, we derive a strictly equivalent, tractable loss function whose gradient can be computed efficiently. Building on this framework, we train DI-SDXL-1step, a one-step text-to-image model based on Stable Diffusion-XL (2.6B parameters) that generates 1024x1024 images in a single step. The 2.6B DI-SDXL-1step model outperforms the 12B FLUX-dev model on ImageReward, PickScore, and CLIP score on the Parti prompts benchmark while using only 1.88% of the inference time. This result strongly supports the view that, with proper post-training, a small one-step model can beat large multi-step models. We will open-source our industry-ready model to the community.
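The objective described above — maximize a human reward while penalizing a score-based divergence between the generator and a reference diffusion process — can be illustrated with a toy one-dimensional sketch. Everything below (the reward, the Gaussian score functions, the divergence estimator, and the names `reward`, `ref_score`, `gen_score`, `objective`) is a hypothetical stand-in for intuition only; it is not the paper's actual loss, networks, or gradient derivation.

```python
import numpy as np

def reward(x):
    # Hypothetical stand-in for a learned human-preference reward model,
    # peaked at x = 1.
    return -np.abs(x - 1.0)

def ref_score(x_t, sigma):
    # Score of the reference diffusion's noised marginal. Toy case: data ~ N(0, 1),
    # so x_t = x + sigma * eps has marginal N(0, 1 + sigma^2).
    return -x_t / (1.0 + sigma**2)

def gen_score(x_t, sigma, mu):
    # Score of the one-step generator's noised marginal. Toy generator emits the
    # constant mu, so x_t ~ N(mu, sigma^2).
    return -(x_t - mu) / max(sigma**2, 1e-8)

def objective(mu, lam=2.0, n=4096, sigmas=(0.5, 1.0)):
    # RLHF-style objective: E[reward] minus lam times a Monte Carlo estimate of
    # a score-based divergence between generator and reference diffusion.
    rng = np.random.default_rng(0)  # common random numbers across mu values
    x = np.full(n, mu)              # deterministic one-step "generator" samples
    r = reward(x).mean()
    div = 0.0
    for sigma in sigmas:
        x_t = x + sigma * rng.standard_normal(n)
        diff = gen_score(x_t, sigma, mu) - ref_score(x_t, sigma)
        div += np.mean(diff**2)     # squared score difference on noised samples
    return r - lam * div / len(sigmas)

# Scan candidate generator means: the score regularizer pulls the optimum away
# from the pure reward peak (mu = 1) back toward the reference (mu = 0).
best_mu = max(np.linspace(-2.0, 2.0, 81), key=objective)
```

The trade-off mirrors the abstract's framing: with `lam = 0` the generator would collapse onto the reward peak (reward hacking), while the score-divergence term keeps it anchored to the reference diffusion, so the optimum lands between the two.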

Lay Summary:

We introduce Diff-Instruct* (DI*), a theory-driven post-training method for aligning one-step text-to-image models with human preferences. With DI*, we trained a 2.6B one-step text-to-image generative model at 1024x1024 resolution that outperforms the 12B 50-step FLUX-dev model by significant margins. Our contribution improves both the efficiency and the human-preference alignment of text-to-image and video models.
