Invited Talk in Workshop: The Impact of Memorization on Trustworthy Foundation Models
Invited Talk 3: Vitaly Feldman - Trade-offs in Data Memorization via Strong Data Processing Inequalities
Abstract:
Recent research has demonstrated that training large language models involves memorizing a significant fraction of the training data. Such memorization can lead to privacy violations when training on sensitive user data, which motivates the study of the role of data memorization in learning. In this work, we show that several simple and well-studied binary classification problems exhibit a trade-off between the number of samples available to a learning algorithm and the amount of information about the training data that the algorithm must memorize in order to be accurate.
In particular, on the order of $d$ bits of information about the training data must be memorized when only a single $d$-dimensional example is available, and the required amount decays as $\Theta(d/n)$ as the number of examples $n$ grows. Moreover, this rate is achieved, up to logarithmic factors, by simple learning algorithms. Our results build on the work of Brown et al. (2021) and establish a new framework for proving memorization lower bounds based on an approximate version of strong data processing inequalities.
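The stated trade-off admits a compact formalization; the following is an illustrative sketch, with notation assumed here rather than taken from the talk page: $S = (x_1, \dots, x_n)$ denotes the training set of $n$ $d$-dimensional examples, $A(S)$ denotes the learner's (possibly randomized) output, and memorization is measured by the mutual information $I(A(S); S)$, as in Brown et al. (2021):
\[
  % illustrative notation (assumed): the minimum ranges over learners A achieving near-optimal accuracy
  \min_{A \ \text{accurate}} I\bigl(A(S);\, S\bigr) \;=\; \widetilde{\Theta}\!\left(\frac{d}{n}\right),
\]
so at $n = 1$ roughly $d$ bits about the training data must be memorized, and the necessary amount decays linearly as the sample size grows.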
Joint work with Guy Kornowski (Tel Aviv University) and Xin Lyu (UC Berkeley)