Poster
in
Workshop: Workshop on Technical AI Governance

Exploring an Agenda on Memorization-based Copyright Verification

Harry Jiang · Aster Plotnik · Carlee Joe-Wong


Abstract:

The methods and systems through which developers of large language models (LLMs) acquire training data are nebulous and contentious. As a result, many data owners have concerns about whether their data is being used inappropriately. Thus, finding a way for data owners to independently verify whether their data has been used to train LLMs is an increasingly popular area of research. In such situations, a recourse for data owners is civil litigation. In many legal systems, such as that of the United States, a major hurdle for a civil suit is finding the evidence necessary to bring a case from pleadings to discovery; memorization of a text by an LLM would be helpful here as evidence that the text was present in the training data. Currently, there is a disconnect between the legal system, which demands evidence of verbatim memorization, and the realities of the technology, which can rarely produce it. From an analysis of existing legal cases in various jurisdictions and a review of memorization techniques and benchmarks, we propose in this paper a set of objectives for researchers to better align work on memorization and other defences for data owners with the practicalities of argumentation in the legal realm, namely: specificity, substantiality, and accessibility.
