

Poster

A Theory for Conditional Generative Modeling on Multiple Data Sources

Rongzhen Wang · Yan Zhang · Chenyu Zheng · Chongxuan Li · Guoqiang Wu

West Exhibition Hall B2-B3 #W-902
Wed 16 Jul 4:30 p.m. PDT — 7 p.m. PDT

Abstract:

The success of large generative models has driven a paradigm shift toward leveraging massive multi-source data to enhance model capabilities. However, the interaction among these sources remains theoretically underexplored. This paper makes a first attempt to fill this gap by rigorously analyzing multi-source training in conditional generative modeling, where each condition represents a distinct data source. Specifically, we establish a general distribution estimation error bound in average total variation distance for conditional maximum likelihood estimation, based on the bracketing number. Our result shows that when the source distributions share certain similarities and the model is expressive enough, multi-source training guarantees a sharper bound than single-source training. We further instantiate the general theory on conditional Gaussian estimation and on deep generative models, including autoregressive and flexible energy-based models, by characterizing their bracketing numbers. The results highlight that both the number of sources and the similarity among source distributions amplify the advantage of multi-source training. Simulations and real-world experiments validate our theory.
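The abstract's central quantities can be sketched in standard notation. The symbols below (K sources, n samples per source, model class P, bracket size δ) and the shape of the bound are our illustrative assumptions, not the paper's exact theorem or constants:

```latex
% A hedged sketch (notation assumed here, not copied from the paper):
% average-TV risk of the conditional MLE over K sources, n samples each,
% controlled by the log bracketing number of the model class.
\documentclass{article}
\usepackage{amsmath,amssymb}
\begin{document}
With sources $k=1,\dots,K$, $n$ samples per source, true conditionals
$p^*(\cdot\mid c_k)$, and conditional MLE $\hat p$, the average total
variation error is
\[
  \varepsilon(\hat p)
  = \frac{1}{K}\sum_{k=1}^{K}
    \mathrm{TV}\!\bigl(p^*(\cdot\mid c_k),\,\hat p(\cdot\mid c_k)\bigr).
\]
A classical bracketing-style bound then takes the shape
\[
  \mathbb{E}\,\varepsilon(\hat p)
  \;\lesssim\;
  \sqrt{\frac{\log \mathcal{N}_{[\,]}(\delta,\mathcal{P})}{nK}}
  \;+\; \delta ,
\]
so multi-source training (effective sample size $nK$) beats
single-source training (sample size $n$) whenever the joint class
$\mathcal{P}$ is not $K$ times more complex, i.e.\ when the source
distributions are similar enough to share statistical structure.
\end{document}
```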

Lay Summary:

Modern AI models often learn from data collected across many different sources, for example text from websites, books, and social media. While this is observed to make models more powerful, we still don't fully understand how and when using multiple sources actually helps. Our work takes a first step toward answering this question for conditional generative models, which learn to generate new data based on given conditions (such as generating a photo from a label). We provide a mathematical explanation showing that when the data sources are similar enough and the model is expressive, training on multiple sources can lead to better performance than training on just one. We also apply our theory to specific models, including deep learning methods, and support it with both simulations and real-world experiments. This helps us better understand how to use data from diverse environments to build stronger AI systems.
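To make the qualitative claim concrete, here is a small toy simulation of our own (not the authors' code or their conditional Gaussian setting): K Gaussian sources with nearby means, comparing per-source fitting against a fully pooled "multi-source" fit, with error measured in average total variation distance. All names and parameter values are illustrative assumptions.

```python
# Hedged toy sketch: per-source vs. pooled estimation of K Gaussian means.
# "Single-source" fits each mean from that source's n samples; "multi-source"
# pools all K*n samples into one shared mean, mimicking a fully shared model.
import numpy as np
from math import erf, sqrt

def tv_gauss(a, b):
    """TV(N(a,1), N(b,1)) = 2*Phi(|a-b|/2) - 1 = erf(|a-b| / (2*sqrt(2)))."""
    return erf(abs(a - b) / (2.0 * sqrt(2.0)))

rng = np.random.default_rng(0)
K, n, trials = 10, 20, 2000
for delta in (0.0, 0.1, 1.0):              # delta controls source similarity
    mus = delta * rng.standard_normal(K)   # fixed true mean per source
    err_single = err_multi = 0.0
    for _ in range(trials):
        x = mus[:, None] + rng.standard_normal((K, n))  # n samples per source
        per_source = x.mean(axis=1)                     # single-source MLEs
        pooled = x.mean()                               # shared multi-source MLE
        err_single += np.mean([tv_gauss(m, e) for m, e in zip(mus, per_source)])
        err_multi += np.mean([tv_gauss(m, pooled) for m in mus])
    print(f"delta={delta:.1f}  single={err_single/trials:.3f}"
          f"  multi={err_multi/trials:.3f}")
```

When delta is small the pooled fit wins by drawing on K times more data; as delta grows, its bias dominates and per-source fitting becomes preferable, mirroring the similarity condition in the theory.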
