Poster in Workshop: DIG-BUGS: Data in Generative Models (The Bad, the Ugly, and the Greats)
Data Cartography for Detecting Memorization Hotspots and Guiding Data Interventions in Generative Models
Laksh Patel · Neel Shanbhag
Keywords: [ Generative Models ] [ Forget Events ] [ Data Cartography ] [ Memorization Detection ] [ Influence Functions ] [ Data-Centric Interventions ] [ Uniform Stability ] [ Privacy Preservation ]
Modern generative models risk overfitting and unintentionally memorizing rare training examples, which can be extracted by adversaries or inflate benchmark performance. We propose Generative Data Cartography (GenDataCarto), a data-centric framework that assigns each pretraining sample a difficulty score (early-epoch loss) and a memorization score (frequency of “forget events”), then partitions examples into four quadrants to guide targeted pruning and up-/down-weighting. We prove that our memorization score lower-bounds classical influence under smoothness assumptions and that down-weighting high-memorization hotspots provably decreases the generalization gap via uniform stability bounds. Empirically, GenDataCarto reduces synthetic canary extraction success by over 40% at just 10% data pruning, while increasing validation perplexity by less than 0.5%. These results demonstrate that principled data interventions can dramatically mitigate leakage with minimal cost to generative performance.
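The two scores and the quadrant partition described above can be illustrated with a minimal sketch. It is not the authors' implementation, and several details are assumptions because the abstract does not specify them: a "forget event" is taken to be a transition where a sample's loss moves from at or below a fixed threshold to above it between consecutive epochs; the difficulty score is the mean loss over the first `early_epochs` epochs; the cut-offs `d_cut` and `m_cut`, the quadrant labels, and the `hotspot_weight` value are all hypothetical.

```python
from statistics import mean

def difficulty_score(losses, early_epochs=2):
    """Mean per-sample loss over the first `early_epochs` epochs (assumed definition)."""
    return mean(losses[:early_epochs])

def memorization_score(losses, loss_threshold):
    """Fraction of epoch-to-epoch transitions that are forget events.

    Assumed definition: the sample counted as learned at epoch t
    (loss <= threshold) and as not learned at epoch t+1 (loss > threshold).
    """
    transitions = len(losses) - 1
    if transitions <= 0:
        return 0.0
    forgets = sum(
        1 for prev, cur in zip(losses, losses[1:])
        if prev <= loss_threshold < cur
    )
    return forgets / transitions

def quadrant(difficulty, memorization, d_cut, m_cut):
    """Assign one of four quadrants; the labels are descriptive placeholders."""
    d_label = "high" if difficulty > d_cut else "low"
    m_label = "high" if memorization > m_cut else "low"
    return f"{d_label}-difficulty/{m_label}-memorization"

def build_map(loss_history, loss_threshold, d_cut, m_cut, early_epochs=2):
    """Map each sample id to (difficulty, memorization, quadrant)."""
    result = {}
    for sample_id, losses in loss_history.items():
        d = difficulty_score(losses, early_epochs)
        m = memorization_score(losses, loss_threshold)
        result[sample_id] = (d, m, quadrant(d, m, d_cut, m_cut))
    return result

def sample_weights(cart_map, hotspot_weight=0.5):
    """Down-weight high-memorization samples; all others keep weight 1.0."""
    return {
        sample_id: hotspot_weight if q.endswith("high-memorization") else 1.0
        for sample_id, (_, _, q) in cart_map.items()
    }

# Toy per-epoch losses for four hypothetical samples.
loss_history = {
    "a": [0.9, 0.4, 0.3, 0.3, 0.2],  # learned early, never forgotten
    "b": [3.0, 2.8, 2.5, 2.4, 2.2],  # hard throughout, never learned
    "c": [2.9, 0.5, 1.8, 0.4, 1.9],  # hard early, repeatedly forgotten
    "d": [0.8, 0.4, 1.6, 0.5, 1.7],  # easy early, repeatedly forgotten
}
cart_map = build_map(loss_history, loss_threshold=1.0, d_cut=1.0, m_cut=0.25)
weights = sample_weights(cart_map)
```

In this toy run, samples "c" and "d" land in the high-memorization quadrants and receive the reduced weight, while "a" and "b" keep weight 1.0. Pruning could be sketched the same way by dropping the high-memorization samples instead of re-weighting them; which quadrants to prune or up-weight is a design choice the abstract leaves open.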