Poster
Taming Diffusion for Dataset Distillation with High Representativeness
Lin Zhao · Yushu Wu · Xinru Jiang · Jianyang Gu · Yanzhi Wang · Xiaolin Xu · Pu Zhao · Xue Lin
East Exhibition Hall A-B #E-3312
How can training efficiency, in terms of both time and memory, be improved through data reduction? Many researchers have explored this question by generating a small subset to replace the full training dataset. The key challenge lies in ensuring that the generated subset accurately approximates the distribution of the full dataset.

Our paper addresses this challenge by leveraging the generative capabilities of large models, inspired by several prior works. We conduct a systematic analysis of existing large-model-based methods and find that the key to improving performance lies in finding a simpler distribution to approximate. Motivated by this insight, we propose an efficient method for constructing a simpler distribution that better approximates the original distribution of the full dataset.

Our method achieves the best performance across datasets of four different scales. To facilitate future research, we open-source the generated small datasets and code, aiming to support the community in improving training efficiency and developing more effective dataset compression methods under this new paradigm.
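As a point of reference, the evaluation protocol commonly used in this setting trains a fresh model only on the small generated subset and measures its accuracy on the original real test split. The sketch below illustrates that protocol; it is not the paper's actual pipeline, and the names `generate_synthetic_subset` and `ipc` (images per class) are illustrative placeholders standing in for the generative step described above.

```python
# Minimal sketch of the dataset-distillation evaluation protocol:
# a small generated subset replaces the full training set, and a fresh
# model trained on it is scored on the real test data.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset


def generate_synthetic_subset(num_classes: int, ipc: int, image_shape=(3, 32, 32)):
    """Placeholder for the generative step (e.g., sampling class-conditioned
    images from a pretrained diffusion model). Random tensors are returned
    here only so the sketch runs end to end."""
    images = torch.randn(num_classes * ipc, *image_shape)
    labels = torch.arange(num_classes).repeat_interleave(ipc)
    return TensorDataset(images, labels)


def evaluate_distilled_set(synthetic_ds, test_loader, model, epochs=20, lr=1e-2):
    """Train a fresh model only on the small synthetic set, then report
    accuracy on the real test split."""
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    loss_fn = nn.CrossEntropyLoss()
    train_loader = DataLoader(synthetic_ds, batch_size=64, shuffle=True)

    model.train()
    for _ in range(epochs):
        for x, y in train_loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()

    model.eval()
    correct = total = 0
    with torch.no_grad():
        for x, y in test_loader:
            correct += (model(x).argmax(dim=1) == y).sum().item()
            total += y.numel()
    return correct / total
```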