Poster
in
Workshop: DataWorld: Unifying data curation frameworks across domains

Can Performant LLMs Be Ethical? Quantifying the Impact of Web Crawling Opt-Outs

Dongyang Fan · Vinko Sabolčec · Matin Ansaripour · Ayush Tarun · Martin Jaggi · Antoine Bosselut · Imanol Schlag

Keywords: [ Data Compliance ] [ AI ethics ] [ Data valuation ]


Abstract: The increasing adoption of web crawling opt-outs by copyright holders of online content raises critical questions about the impact of data compliance on large language model (LLM) performance. However, little is known about how these restrictions (and the resulting filtering of pretraining datasets) affect the capabilities of models trained on these corpora. In this work, we conceptualize this effect as the $\textit{data compliance gap}$ (DCG), which quantifies the performance difference between models trained on datasets that comply with web crawling opt-outs and those trained on datasets that do not. Our experiments with 1.5B-parameter models show that, as of January 2025, compliance with web data opt-outs does not degrade general knowledge acquisition (close to 0\% DCG). However, in specialized domains such as biomedical research, excluding major publishers leads to performance declines. We also show that compliant pretraining can reduce verbatim memorization of copyrighted articles. Our study provides empirical insights into the long-debated trade-off between data compliance and downstream model performance, informing future discussions on AI training practices and policy decisions.