Poster in Workshop: The Impact of Memorization on Trustworthy Foundation Models
Are Samples Extracted From Large Language Models Memorized?
Chawin Sitawarin · Karan Chadha · John Morris · Saeed Mahloujifar · Chuan Guo
Training large language models (LLMs) on diverse datasets, including news, books, and user data, enhances their capabilities but also raises significant privacy and copyright concerns due to their capacity to memorize training data. Current memorization measurements, primarily based on extraction attacks such as Discoverable Memorization, focus on an LLM’s ability to reproduce training data verbatim when prompted. While various extensions to these methods exist, allowing for different prompt forms and approximate matching, they introduce numerous parameters whose arbitrary selection significantly impacts reported memorization rates. This paper addresses the critical research question of how to compute the false positive rate (FPR) of these diverse memorization measurements. We propose a practical definition of FPR and ways to interpret it, offering a more principled approach to selecting an extraction attack and its parameters. Our findings reveal that while "stronger" extraction attacks often identify more memorized samples, they also tend to have higher FPRs. Notably, some computationally intensive methods exhibit lower extraction rates than simpler baselines when controlling for a fixed FPR.
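As a rough illustration of the kind of measurement the abstract describes, the sketch below shows a discoverable-memorization-style extraction check (prompt with a prefix, test whether greedy decoding reproduces the suffix verbatim) together with an FPR-style control obtained by running the same attack on held-out non-member samples. The model name, prefix/suffix lengths, exact-match criterion, and the use of non-members as negatives are illustrative assumptions, not the paper's actual definitions or parameters.

```python
# Minimal sketch: prefix-prompt extraction attack with an FPR-style control.
# All concrete choices (model, split sizes, exact matching) are assumptions
# made for illustration only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"              # hypothetical model choice
PREFIX_LEN, SUFFIX_LEN = 50, 50  # assumed prefix/suffix token lengths

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()


def is_extracted(sample_ids: torch.Tensor) -> bool:
    """Prompt with the first PREFIX_LEN tokens and check whether greedy
    decoding reproduces the next SUFFIX_LEN tokens verbatim."""
    prefix = sample_ids[:PREFIX_LEN].unsqueeze(0)
    target = sample_ids[PREFIX_LEN:PREFIX_LEN + SUFFIX_LEN]
    with torch.no_grad():
        out = model.generate(prefix, max_new_tokens=SUFFIX_LEN, do_sample=False)
    generated = out[0, PREFIX_LEN:PREFIX_LEN + SUFFIX_LEN]
    return torch.equal(generated, target)


def extraction_rate(samples: list[torch.Tensor]) -> float:
    """Fraction of samples the attack flags as extracted."""
    return sum(is_extracted(s) for s in samples) / max(len(samples), 1)


# Usage (illustrative): member_samples are tokenized sequences known to be in
# the training set; non_member_samples are held-out sequences the model never
# saw. The rate at which the attack "extracts" non-members serves as an
# FPR-style control for comparing attacks at a fixed false positive rate.
# member_rate = extraction_rate(member_samples)
# fpr_estimate = extraction_rate(non_member_samples)
```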