Poster in Workshop: The Impact of Memorization on Trustworthy Foundation Models
ParaPO: Aligning Language Models to Reduce Verbatim Reproduction of Pre-training Data
Tong Chen · Faeze Brahman · Jiacheng Liu · Niloofar Mireshghallah · Weijia Shi · Pang Wei Koh · Luke Zettlemoyer · Hannaneh Hajishirzi
Language models (LMs) can memorize and reproduce segments from their pretraining data verbatim even in non-adversarial settings, raising concerns about copyright, plagiarism, privacy, and creativity. We introduce Paraphrase Preference Optimization (ParaPO), a post-training method that fine-tunes LMs to reduce unintentional regurgitation while preserving their overall utility. ParaPO trains LMs to prefer paraphrased versions of memorized segments over the original verbatim content from the pretraining data. To maintain the ability to recall famous quotations when appropriate, we develop a variant of ParaPO that uses system prompts to control regurgitation behavior. In our evaluation on Llama3.1-8B, ParaPO consistently reduces regurgitation across all tested datasets, achieving a 25.4% reduction in unintentional regurgitation in creative writing, whereas unlearning methods are less effective outside their unlearned domain (achieving only a 2.3% reduction). On the instruction-tuned Tulu3-8B model, ParaPO combined with system prompting preserves desirable quotation recall while reducing unintentional regurgitation by 27.5% in creative writing when instructed not to regurgitate. In contrast, without ParaPO tuning, prompting the model not to regurgitate produces only a marginal reduction.
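The abstract describes the training signal at a high level: for each memorized pretraining segment, a paraphrase is treated as the preferred completion and the verbatim segment as the dispreferred one. The sketch below is not the paper's implementation; it only illustrates how such paraphrase-vs-verbatim pairs could be plugged into a standard DPO-style preference loss. The function name, variable names, and the toy log-probability values are illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def paraphrase_preference_loss(policy_chosen_logps, policy_rejected_logps,
                               ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO-style loss over (paraphrase, verbatim) preference pairs.

    chosen   = paraphrased version of a memorized pretraining segment
    rejected = the original verbatim segment
    Each input is the summed token log-probability of that completion
    under the trainable policy or the frozen reference model.
    """
    policy_margin = policy_chosen_logps - policy_rejected_logps
    ref_margin = ref_chosen_logps - ref_rejected_logps
    # Push the policy to widen its preference for the paraphrase
    # relative to the reference model's preference.
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()


# Toy usage with made-up log-probabilities for a batch of two pairs.
policy_chosen = torch.tensor([-42.0, -37.5])
policy_rejected = torch.tensor([-40.0, -36.0])
ref_chosen = torch.tensor([-43.0, -38.0])
ref_rejected = torch.tensor([-39.5, -35.5])

loss = paraphrase_preference_loss(policy_chosen, policy_rejected,
                                  ref_chosen, ref_rejected)
print(loss.item())
```

In this framing, the system-prompted variant would differ only in the conditioning context: pairs constructed under a "do not regurgitate" style instruction, so that the preference for paraphrasing is learned as prompt-controllable rather than unconditional.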