Poster
Improving the Diffusability of Autoencoders
Ivan Skorokhodov · Sharath Girish · Benran Hu · Willi Menapace · Yanyu Li · Rameen Abdal · Sergey Tulyakov · Aliaksandr Siarohin
East Exhibition Hall A-B #E-3110
In recent years, image and video generation models have rapidly advanced, with both industry and academia investing heavily. Most of these models follow the latent diffusion approach: an autoencoder first compresses images or videos into a smaller latent space, and then a diffusion model is trained to generate samples in that space.

So far, most work has focused on improving the autoencoder's reconstruction quality and compression rate. But our work shows that the choice of autoencoder has a deeper effect: it shapes how well a diffusion model can generate realistic outputs. We call this diffusability: how easy it is for a diffusion model to learn to generate in a given representation space.

Diffusion models build images by gradually refining noise, starting from a blurry outline and adding details step by step. This process tends to struggle with high-frequency details (like textures or fine edges), where errors can accumulate. Normally, the human eye is less sensitive to these errors in pixel space. But we found that some autoencoders place more emphasis on high frequencies in their latent space than RGB images do. As a result, critical image structures get encoded in unstable high-frequency components, making them harder for the diffusion model to learn and sample correctly.

To address this, we introduce a simple training technique: during autoencoder training, we downsample the latent representation and require the decoder to still produce a meaningful reconstruction. This encourages the autoencoder to store important information in more robust, low-frequency components.

We show that this small change leads to large improvements. It makes latent spaces more suitable for diffusion models, improving both image and video generation quality on benchmarks like ImageNet and Kinetics.
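
One way to make the frequency argument above concrete is to compare the spatial-frequency content of RGB images with that of their latents. The sketch below is an illustrative PyTorch diagnostic of that kind of comparison, not the paper's released code; the function name and the radial binning are our own assumptions. A heavier tail indicates more energy in high frequencies.

```python
import torch

def radial_power_spectrum(x: torch.Tensor) -> torch.Tensor:
    """Illustrative diagnostic: energy per spatial frequency for a
    (B, C, H, W) batch of images or latents. Assumed helper, not the
    authors' code."""
    # 2D FFT over the spatial dims, shifted so low frequencies sit at the center.
    freq = torch.fft.fftshift(torch.fft.fft2(x.float()), dim=(-2, -1))
    power = (freq.abs() ** 2).mean(dim=(0, 1))  # average over batch and channels

    # Distance of every frequency bin from the spectrum's center = radial frequency.
    h, w = power.shape
    yy, xx = torch.meshgrid(
        torch.arange(h) - h // 2, torch.arange(w) - w // 2, indexing="ij"
    )
    radius = torch.sqrt(yy.float() ** 2 + xx.float() ** 2).round().long()

    # Bin power by radius: index 0 is the DC term, the last index the highest frequency.
    spectrum = torch.zeros(int(radius.max()) + 1)
    spectrum.scatter_add_(0, radius.flatten(), power.flatten())
    return spectrum / spectrum.sum()  # normalize so different inputs are comparable
```

Running this once on a batch of images and once on the corresponding encoder outputs gives two curves to compare; the problematic autoencoders described above are the ones whose latent curve carries noticeably more mass at the highest frequencies.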
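
The training technique itself fits in a few lines. Below is a minimal PyTorch-style sketch under assumed generic encoder and decoder modules; the bilinear resampling, the MSE penalty, and the name latent_downsample_loss are illustrative choices, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def latent_downsample_loss(encoder, decoder, images, scale: int = 2):
    """Auxiliary autoencoder loss, as sketched above: downsample the latent
    and ask the decoder to still produce a meaningful (correspondingly
    downsampled) reconstruction, so important content must live in
    low-frequency latent components. Assumes a fully convolutional decoder
    whose output resolution scales with the latent resolution."""
    latents = encoder(images)  # (B, C, h, w) latent representation

    # Downsample the latent representation spatially.
    small_latents = F.interpolate(
        latents, scale_factor=1.0 / scale, mode="bilinear", align_corners=False
    )

    # Decode from the downsampled latent.
    recon_small = decoder(small_latents)

    # Target: the input images downsampled by the same factor.
    target_small = F.interpolate(
        images, scale_factor=1.0 / scale, mode="bilinear", align_corners=False
    )

    # Penalize the mismatch so the decoder can reconstruct from the
    # low-frequency part of the latent alone.
    return F.mse_loss(recon_small, target_small)

# During autoencoder training, this term would be added with some weight to
# the usual reconstruction objective, e.g.:
#   loss = reconstruction_loss + lambda_reg * latent_downsample_loss(encoder, decoder, images)
```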