Poster
Contrastive Private Data Synthesis via Weighted Multi-PLM Fusion
Tianyuan Zou · Yang Liu · Peng Li · Yufei Xiong · Jianqing Zhang · Jingjing Liu · Xiaozhou Ye · Ye Ouyang · Ya-Qin Zhang
East Exhibition Hall A-B #E-1003
Substantial quantity and high quality are the golden rules for building a good training dataset, with sample privacy protection being equally important. Generating synthetic samples that resemble high-quality private data while ensuring Differential Privacy (DP), a formal privacy guarantee, promises scalability and practicality. However, existing methods relying on pre-trained models for data synthesis often struggle in data-deficient scenarios, suffering from limited sample size, inevitable generation noise, and inherent pre-trained model bias. To address these challenges, we propose a novel contrAstive private data Synthesis via Weighted multiple Pre-trained generative models framework, named WASP. WASP utilizes limited private samples for more accurate private data distribution estimation via a Top-Q voting mechanism, and leverages low-quality synthetic samples for contrastive generation via collaboration among multiple dynamically weighted pre-trained models. Extensive experiments on 6 well-developed datasets with 6 open-source and 3 closed-source PLMs demonstrate the superiority of WASP in improving model performance across diverse downstream tasks. Code is available at https://github.com/LindaLydia/WASP.
How can we create useful training data without risking anyone’s privacy? This was the question we set out to explore by studying how to generate synthetic data that mimics real private datasets containing limited samples, while revealing as little as possible about the individuals behind them.

Our work introduces a method called WASP, which combines the strengths of multiple AI models to produce realistic and privacy-preserving synthetic data. Unlike existing approaches that rely on a single model or large amounts of real data, WASP uses a collaborative strategy: it asks different models to generate data, scores the results using a limited set of private examples, and then learns to trust the best-performing models more in future rounds. It also learns not only from good examples but from bad ones as well, by contrasting them during training.

We found that this strategy leads to better synthetic data even when only limited real examples are available. This matters because many real-world applications, such as healthcare and finance, must work with small amounts of sensitive data that cannot be freely shared. Our results suggest a promising path forward for training AI models in data-scarce, privacy-sensitive environments.
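To make the generate-score-weight loop concrete, below is a minimal sketch of one such round. It is an illustration under assumptions, not the released implementation: the placeholder embedding function, the dot-product similarity score, the per-model Top-Q vote counting, and the simple weight normalization are all stand-ins for details in the paper and the repository, and the DP noise that a privacy-preserving voting step would require is deliberately omitted.

    import numpy as np

    rng = np.random.default_rng(0)

    def embed(texts):
        # Placeholder embedding: a real pipeline would use a sentence encoder here.
        return rng.normal(size=(len(texts), 16))

    def top_q_votes(private_emb, synthetic_emb, q=3):
        """Each private sample casts votes for its q most similar synthetic samples."""
        sims = private_emb @ synthetic_emb.T
        votes = np.zeros(synthetic_emb.shape[0])
        for row in sims:
            for idx in np.argsort(row)[-q:]:
                votes[idx] += 1
        # NOTE: the actual method would privatize these counts (e.g., add DP noise);
        # that step is omitted in this sketch.
        return votes

    def wasp_round(private_texts, model_outputs, q=3):
        """One illustrative round: model_outputs maps model name -> synthetic texts."""
        private_emb = embed(private_texts)
        weights, good, bad = {}, [], []
        for name, texts in model_outputs.items():
            votes = top_q_votes(private_emb, embed(texts), q=q)
            weights[name] = float(votes.sum())              # raw per-model score
            order = np.argsort(votes)
            good += [texts[i] for i in order[-q:]]          # high-vote samples: positives
            bad  += [texts[i] for i in order[:q]]           # low-vote samples: negatives
        total = sum(weights.values()) or 1.0
        weights = {k: v / total for k, v in weights.items()}  # normalized model weights
        return weights, good, bad

    # Toy usage with placeholder strings in place of real private data and PLM outputs.
    priv = ["private sample %d" % i for i in range(5)]
    outs = {"model_a": ["a%d" % i for i in range(8)],
            "model_b": ["b%d" % i for i in range(8)]}
    w, pos, neg = wasp_round(priv, outs)
    print(w)

In this toy loop, the normalized weights would steer how much each model contributes to the next round of generation, and the positive/negative samples would feed a contrastive objective; both steps are simplified here relative to the full method.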