

Oral in Workshop: DIG-BUGS: Data in Generative Models (The Bad, the Ugly, and the Greats)

Unlocking Post-hoc Dataset Inference with Synthetic Data

Bihe Zhao · Pratyush Maini · Franziska Boenisch · Adam Dziedzic

Keywords: [ large language models ] [ synthetic data ] [ dataset inference ]

Sat 19 Jul 2:15 p.m. PDT — 2:30 p.m. PDT

Abstract:

The remarkable capabilities of large language models stem from massive internet-scraped training datasets, often obtained without respecting data owners' intellectual property rights. Dataset Inference (DI) enables data owners to verify unauthorized data use by identifying whether a suspect dataset was used for training. However, current DI methods require private held-out data whose distribution closely matches the compromised dataset. Such held-out data are rarely available in practice, severely limiting the applicability of DI. In this work, we address this challenge by synthetically generating the required held-out set through two key contributions: (1) creating high-quality, diverse synthetic data via a data generator trained on a carefully designed suffix-based completion task, and (2) bridging the likelihood gap between real and synthetic data through post-hoc calibration. Extensive experiments on diverse text datasets show that using our generated data as a held-out set enables DI to detect the original training sets with high confidence, while maintaining a low false positive rate. This result empowers copyright owners to make legitimate claims on data usage and demonstrates our method's reliability for real-world litigation.
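
The sketch below illustrates, under stated assumptions, the two steps the abstract describes: producing a synthetic held-out set by prompting a generator with prefixes of the suspect texts (a simplified stand-in for the paper's suffix-based completion generator, which is trained for this task), and running the dataset-inference test, with subtraction of an independent reference model's scores standing in for the paper's post-hoc likelihood calibration. The model names, the per-example score (mean negative log-likelihood), and the calibration rule are illustrative assumptions, not the authors' implementation.

```python
# Minimal, self-contained sketch of the pipeline described in the abstract.
# Assumes Hugging Face `transformers`, `torch`, and `scipy` are installed.
import torch
from scipy.stats import ttest_ind
from transformers import AutoModelForCausalLM, AutoTokenizer


def nll_scores(model, tokenizer, texts, device):
    """Mean per-token negative log-likelihood of each text under `model`."""
    model.eval()
    scores = []
    with torch.no_grad():
        for text in texts:
            enc = tokenizer(text, return_tensors="pt", truncation=True).to(device)
            loss = model(**enc, labels=enc["input_ids"]).loss
            scores.append(loss.item())
    return torch.tensor(scores)


def generate_synthetic_heldout(generator, tokenizer, suspect_texts, device,
                               prefix_tokens=32, max_new_tokens=128):
    """Suffix-completion sketch: prompt the generator with the prefix of each
    suspect text and keep only the sampled continuation as synthetic held-out
    data. (The paper fine-tunes a dedicated generator on this task; here a
    pretrained LM is reused purely for illustration.)"""
    generator.eval()
    synthetic = []
    with torch.no_grad():
        for text in suspect_texts:
            ids = tokenizer(text, return_tensors="pt",
                            truncation=True).input_ids[:, :prefix_tokens].to(device)
            out = generator.generate(ids, do_sample=True, top_p=0.95,
                                     max_new_tokens=max_new_tokens,
                                     pad_token_id=tokenizer.eos_token_id)
            synthetic.append(tokenizer.decode(out[0, ids.shape[1]:],
                                              skip_special_tokens=True))
    return synthetic


def dataset_inference(suspect_texts, synthetic_heldout_texts,
                      target_name="gpt2-large", reference_name="gpt2",
                      alpha=0.01):
    """Test whether the target model was trained on `suspect_texts`, using the
    generated synthetic texts in place of real held-out data."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    tok_t = AutoTokenizer.from_pretrained(target_name)
    tok_r = AutoTokenizer.from_pretrained(reference_name)
    target = AutoModelForCausalLM.from_pretrained(target_name).to(device)
    reference = AutoModelForCausalLM.from_pretrained(reference_name).to(device)

    # Calibration stand-in: subtract scores under an independent reference
    # model so systematic likelihood gaps between real and synthetic text
    # cancel out before the two sets are compared.
    suspect = (nll_scores(target, tok_t, suspect_texts, device)
               - nll_scores(reference, tok_r, suspect_texts, device))
    heldout = (nll_scores(target, tok_t, synthetic_heldout_texts, device)
               - nll_scores(reference, tok_r, synthetic_heldout_texts, device))

    # If the suspect set was trained on, its calibrated NLL should be
    # systematically lower than the held-out set's; one-sided Welch t-test.
    _, p_value = ttest_ind(suspect.numpy(), heldout.numpy(),
                           equal_var=False, alternative="less")
    return p_value, p_value < alpha
```

In this sketch, a small p-value indicates that the suspect texts are scored significantly better than the calibrated synthetic held-out set, which is the signal DI aggregates into a training-data claim; the specific score, generator, and calibration used in the paper differ.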
