ICML Data Cartography for Detecting Memorization Hotspots and Guiding Data Interventions in Generative Models

Poster in Workshop: DIG-BUGS: Data in Generative Models (The Bad, the Ugly, and the Greats)

Data Cartography for Detecting Memorization Hotspots and Guiding Data Interventions in Generative Models

Laksh Patel · Neel Shanbhag

Keywords: [ Generative Models ] [ Forget Events ] [ Data Cartography ] [ Memorization Detection ] [ Influence Functions ] [ Data-Centric Interventions ] [ Uniform Stability ] [ Privacy Preservation ]

Sat 19 Jul 3 p.m. PDT — 3:45 p.m. PDT

Abstract:

Modern generative models risk overfitting and unintentionally memorizing rare training examples, which can be extracted by adversaries or inflate benchmark performance. We propose Generative Data Cartography (GenDataCarto), a data-centric framework that assigns each pretraining sample a difficulty score (early-epoch loss) and a memorization score (frequency of “forget events”), then partitions examples into four quadrants to guide targeted pruning and up-/down-weighting. We prove that our memorization score lower-bounds classical influence under smoothness assumptions and that down-weighting high-memorization hotspots provably decreases the generalization gap via uniform stability bounds. Empirically, GenDataCarto reduces synthetic canary extraction success by over 40% at just 10% data pruning, while increasing validation perplexity by less than 0.5%. These results demonstrate that principled data interventions can dramatically mitigate leakage with minimal cost to generative performance.
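
The scoring and partitioning steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several details are assumptions made for illustration only: the difficulty score is taken as the mean loss over the first `early_epochs` epochs, a forget event is taken to be a transition from a loss at or below a threshold `tau` to a loss above it, and the quadrant thresholds `d_thresh` and `m_thresh` and the `down_weight` factor are hypothetical parameters. The loss values are made-up toy numbers.

```python
from statistics import mean

def difficulty_scores(losses, early_epochs):
    """Mean per-sample loss over the first `early_epochs` epochs (assumed definition)."""
    n = len(losses[0])
    return [mean(losses[e][i] for e in range(early_epochs)) for i in range(n)]

def memorization_scores(losses, tau):
    """Fraction of epoch-to-epoch transitions that are forget events.

    Assumption: a forget event is a transition where the loss is at or
    below `tau` in one epoch and above `tau` in the next.
    """
    n = len(losses[0])
    transitions = len(losses) - 1
    scores = []
    for i in range(n):
        forgets = sum(
            1
            for e in range(transitions)
            if losses[e][i] <= tau < losses[e + 1][i]
        )
        scores.append(forgets / transitions)
    return scores

def quadrants(difficulty, memorization, d_thresh, m_thresh):
    """Label each sample as (high_difficulty, high_memorization)."""
    return [(d > d_thresh, m > m_thresh) for d, m in zip(difficulty, memorization)]

def sample_weights(quads, down_weight=0.5):
    """Down-weight samples in the two high-memorization quadrants (hypothetical rule)."""
    return [down_weight if high_mem else 1.0 for _, high_mem in quads]

# Toy data: losses[e][i] is the loss of sample i at epoch e.
losses = [
    [0.25, 2.0, 0.5, 2.5],
    [0.25, 1.5, 1.25, 0.25],
    [0.25, 1.5, 0.25, 1.75],
    [0.25, 1.25, 1.0, 0.25],
]

diff = difficulty_scores(losses, early_epochs=2)
mem = memorization_scores(losses, tau=0.5)
quads = quadrants(diff, mem, d_thresh=1.0, m_thresh=0.25)
weights = sample_weights(quads)

for i, (d, m, q, w) in enumerate(zip(diff, mem, quads, weights)):
    print(f"sample {i}: difficulty={d:.3f} memorization={m:.3f} quadrant={q} weight={w}")
```

In this toy run, samples 2 and 3 cross the loss threshold upward at least once and so land in the high-memorization quadrants, where they receive the reduced weight. Pruning could be sketched the same way by dropping the highest-scoring samples instead of re-weighting them.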
