

Poster in Workshop: DIG-BUGS: Data in Generative Models (The Bad, the Ugly, and the Greats)

A Data-Centric Safety Framework for Generative Models: Adversarial Fingerprint Detection and Attribution

Dong Liu · Yanxuan Yu

Keywords: [ Contrastive Learning ] [ Post-hoc Data Removal ] [ Memorization Attribution ] [ Contamination Detection ] [ Model Forensics ] [ Token-level Fingerprints ] [ Data-centric Safety ] [ Trustworthy AI ]

Sat 19 Jul 3 p.m. PDT — 3:45 p.m. PDT

Abstract: Generative models have revolutionized applications from text synthesis to image creation, yet their safety and trustworthiness are undermined by unintended memorization and data contamination. Existing detection methods—relying on output similarity or final-layer embeddings—either lack instance-level precision or fail to provide actionable attributions. To address these limitations, we propose \textbf{FPGuard}, a Data-Centric Safety Framework that performs Adversarial Fingerprint Detection and Attribution. FPGuard extracts token-level fingerprints from intermediate hidden states, constructs a scalable fingerprint bank from training data, and employs contrastive learning to enhance discriminability. At test time, FPGuard computes a contamination score by aggregating top-$k$ cosine similarities between test and banked fingerprints, and generates fine-grained attribution maps that identify the exact training instances responsible. Moreover, FPGuard enables post-hoc detoxification through targeted data removal, significantly reducing contamination effects. Experiments on LLaMA-2-7B and GPT-J under synthetic (SQuAD$\rightarrow$Pile) and natural (RedPajama$\rightarrow$TriviaQA) contamination settings show that FPGuard improves detection \textbf{Precision@10} by up to 25\%, enhances attribution precision by over 30--45\%, and lowers contamination scores by up to 43\% compared to prior baselines—all without retraining.
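To make the scoring step concrete, below is a minimal sketch of the top-$k$ cosine-similarity contamination score described in the abstract. It is not the authors' implementation: the pooling choice, tensor shapes, and all names (`build_fingerprints`, `fingerprint_bank`, `contamination_score`, `top_k`) are illustrative assumptions, and the contrastive refinement of fingerprints mentioned in the abstract is omitted.

```python
# Hedged sketch: aggregate top-k cosine similarities between test-time
# token fingerprints and a bank of training-data fingerprints.
# Shapes, names, and pooling are assumptions, not the paper's code.
import torch
import torch.nn.functional as F


def build_fingerprints(hidden_states: torch.Tensor) -> torch.Tensor:
    """Turn per-token intermediate hidden states (T x d) into unit-norm
    token-level fingerprints. Here we only L2-normalize each token vector;
    FPGuard additionally sharpens fingerprints with contrastive learning."""
    return F.normalize(hidden_states, dim=-1)


def contamination_score(test_fps: torch.Tensor,
                        fingerprint_bank: torch.Tensor,
                        top_k: int = 10):
    """Compute a scalar contamination score for a test sequence.

    test_fps:         (T, d) unit-norm fingerprints of the test sequence.
    fingerprint_bank: (N, d) unit-norm fingerprints of banked training data.
    Returns the mean of the top-k cosine similarities per token, plus the
    indices of those bank entries (a coarse token-level attribution map)."""
    sims = test_fps @ fingerprint_bank.T          # (T, N) cosine similarities
    top_vals, top_idx = sims.topk(top_k, dim=-1)  # k closest bank entries per token
    score = top_vals.mean()                       # aggregate into one score
    return score, top_idx


if __name__ == "__main__":
    torch.manual_seed(0)
    bank = F.normalize(torch.randn(1000, 64), dim=-1)  # toy fingerprint bank
    test = build_fingerprints(torch.randn(12, 64))     # toy 12-token sequence
    score, attribution = contamination_score(test, bank, top_k=10)
    print(f"contamination score: {score:.3f}")
    print("attributed bank rows for token 0:", attribution[0].tolist())
```

In this toy setup the score is low because both sides are random; a memorized or contaminated sequence would match banked fingerprints closely, raising the score, and the returned indices point to the training instances that could then be targeted for post-hoc removal.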
