

Invited Talk in Workshop: The Impact of Memorization on Trustworthy Foundation Models

Invited Talk 2: A. Feder Cooper - What Copyright Can Learn From Memorization Measurements of Language Models

Sat 19 Jul 10:30 a.m. PDT — 11 a.m. PDT

Abstract:

Machine learning researchers often ground their work on memorization in ongoing debates about copyright infringement. But standard methods for measuring memorization—typically based on average, greedy-sampled extraction rates over a given dataset—don’t map well onto the kinds of questions that arise in copyright litigation, which tend to focus on nuanced issues pertaining to specific expressive works. This talk discusses an alternative approach designed to bridge that gap. Rather than asking how much open-weight LLMs memorize on average, we ask: how much do they memorize from specific books?
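For concreteness, the sketch below shows the kind of standard greedy-extraction check the abstract contrasts against: prompt the model with a prefix from the dataset and test whether greedy decoding reproduces the true continuation verbatim, then average over examples. This is a minimal illustration, not the speaker's code; the model name and exact-match criterion are illustrative assumptions.

```python
# Minimal sketch of a standard greedy-extraction check (illustrative, not the
# talk's exact setup). Assumes a Hugging Face causal LM; names are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B"  # illustrative choice
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

def greedy_extracted(prefix: str, target: str) -> bool:
    """True if greedy decoding of `prefix` reproduces `target` exactly."""
    prefix_ids = tok(prefix, return_tensors="pt").input_ids.to(model.device)
    target_ids = tok(target, add_special_tokens=False).input_ids
    out = model.generate(
        prefix_ids,
        max_new_tokens=len(target_ids),
        do_sample=False,  # greedy decoding
    )
    generated = out[0, prefix_ids.shape[1]:].tolist()
    return generated == target_ids

# A dataset-level "extraction rate" is then the fraction of (prefix, target)
# pairs for which greedy_extracted(...) returns True, averaged over the dataset.
```

Averaging a binary greedy check over a corpus is exactly the kind of aggregate statistic that says little about any one expressive work, which motivates the per-book, probabilistic approach described next.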

To investigate this, we use a recent probabilistic extraction technique, which is more sensitive than standard methods for detecting the extent to which specific content has been memorized. Our results are complicated and suggest that sweeping, opposing claims dramatically oversimplify the relationship between memorization and copyright. In our specific experiments, even the largest LLMs don't memorize most books—either in full or in part. But there are striking exceptions to this high-level observation: Llama 3.1 70B appears to memorize certain books, such as Harry Potter and 1984, almost entirely. These results have significant implications for copyright cases, though not ones that clearly favor either plaintiffs or defendants.
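A hedged sketch of what a probabilistic extraction measurement can look like (the talk's exact technique may differ): instead of checking a single greedy pass, score the model's probability of emitting the target passage verbatim, token by token, given the prefix. The sketch reuses `tok` and `model` from the example above.

```python
# Illustrative probabilistic scoring of a target continuation, not the
# speaker's exact method. Reuses `tok` and `model` from the previous sketch.
import torch
import torch.nn.functional as F

@torch.no_grad()
def target_logprob(prefix: str, target: str) -> float:
    """Log-probability that the model emits `target` verbatim after `prefix`."""
    prefix_ids = tok(prefix, return_tensors="pt").input_ids[0]
    target_ids = torch.tensor(tok(target, add_special_tokens=False).input_ids)
    ids = torch.cat([prefix_ids, target_ids]).unsqueeze(0).to(model.device)
    logits = model(ids).logits[0].float()
    # logits[i] predicts token i+1, so rows P-1 .. P+T-2 score the T target tokens.
    logprobs = F.log_softmax(logits[:-1], dim=-1)
    start = len(prefix_ids) - 1
    rows = torch.arange(start, start + len(target_ids), device=logprobs.device)
    token_lp = logprobs[rows, target_ids.to(logprobs.device)]
    return token_lp.sum().item()
```

With a per-example probability p = exp(target_logprob(...)), one common way to summarize sensitivity is the chance that at least one of n independent samples reproduces the passage, 1 - (1 - p)**n, which can flag memorized content that a single greedy pass would miss. How closely this matches the technique discussed in the talk is an assumption of this sketch.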
