

Poster

Textual Unlearning Gives a False Sense of Unlearning

Jiacheng Du · Zhibo Wang · Jie Zhang · Xiaoyi Pang · Jiahui Hu · Kui Ren

East Exhibition Hall A-B #E-1005
Tue 15 Jul 4:30 p.m. PDT — 7 p.m. PDT

Abstract:

Language Models (LMs) are prone to "memorizing" training data, including substantial sensitive user information. To mitigate privacy risks and safeguard the right to be forgotten, machine unlearning has emerged as a promising approach for enabling LMs to efficiently "forget" specific texts. However, despite these good intentions, is textual unlearning really as effective and reliable as expected? To address this concern, we first propose the Unlearning Likelihood Ratio Attack+ (U-LiRA+), a rigorous auditing method for textual unlearning, and find that unlearned texts can still be detected with very high confidence after unlearning. We then conduct an in-depth investigation into the privacy risks of textual unlearning mechanisms in deployment and present the Textual Unlearning Leakage Attack (TULA), along with variants for both black- and white-box scenarios. We show that textual unlearning mechanisms can instead reveal more about the unlearned texts, exposing them to significant membership inference and data reconstruction risks. Our findings highlight that existing textual unlearning actually gives a false sense of unlearning, underscoring the need for more robust and secure unlearning mechanisms.
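
To give a sense of the kind of audit the abstract refers to, below is a minimal, hedged sketch of a generic likelihood-ratio membership test applied to a supposedly unlearned text. It follows the standard LiRA idea (fit loss distributions from shadow models trained with and without the target text, then compare the audited model's loss under both hypotheses); it is illustrative only and does not reproduce the exact U-LiRA+ procedure from the paper. The loss values, shadow counts, and threshold-free scoring are all assumptions for the example.

```python
# Illustrative LiRA-style audit sketch (not the authors' exact U-LiRA+ method).
import numpy as np
from scipy.stats import norm

def lira_score(target_loss: float,
               in_shadow_losses: np.ndarray,
               out_shadow_losses: np.ndarray) -> float:
    """Log-likelihood ratio that the target text still behaves as 'seen'.

    Gaussians are fit to losses from shadow models that did ("in") and did not
    ("out") contain the text; the audited model's loss is scored under both.
    A higher score suggests the text was not truly forgotten.
    """
    mu_in, sigma_in = in_shadow_losses.mean(), in_shadow_losses.std() + 1e-8
    mu_out, sigma_out = out_shadow_losses.mean(), out_shadow_losses.std() + 1e-8
    return norm.logpdf(target_loss, mu_in, sigma_in) - \
           norm.logpdf(target_loss, mu_out, sigma_out)

# Toy usage with synthetic loss values (hypothetical numbers).
rng = np.random.default_rng(0)
in_losses = rng.normal(1.2, 0.1, size=64)   # shadows trained on the text
out_losses = rng.normal(2.0, 0.2, size=64)  # shadows never trained on it
unlearned_model_loss = 1.3                  # audited model after "unlearning"
print(lira_score(unlearned_model_loss, in_losses, out_losses))  # > 0: detected
```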

Lay Summary:

Machine unlearning is a method designed to make artificial intelligence (AI) models "forget" specific pieces of information. This is especially important for protecting sensitive data and complying with privacy laws. However, our research shows that unlearning does not work as well as it seems in language models (the AI behind tools like chatbots). We found that current unlearning techniques often fail to fully erase the targeted information: using a rigorous auditing approach, we were still able to detect traces of the supposedly forgotten data. Even more concerning, we discovered that trying to unlearn data can backfire; by comparing the models before and after unlearning, attackers can more easily infer, and even reconstruct, exactly what was meant to be forgotten. Our work highlights serious flaws in current machine unlearning practices and emphasizes the need for safer, more reliable methods to truly protect user privacy in AI systems.
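
The "comparing models before and after unlearning" idea can be illustrated with a short, hedged sketch: if both checkpoints are observable, a large per-text loss shift between them hints that the text was in the forget set. This is an approximation of the leakage scenario described above, not the paper's TULA attack itself; the model names are placeholders and the loss-difference scoring is an assumption for the example.

```python
# Illustrative before/after comparison sketch (not the authors' exact TULA attack).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def sequence_loss(model, tokenizer, text: str) -> float:
    """Average token-level cross-entropy of `text` under `model`."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    return out.loss.item()

# Hypothetical checkpoints released before and after unlearning (placeholder names).
tok = AutoTokenizer.from_pretrained("original-model")
before = AutoModelForCausalLM.from_pretrained("original-model")
after = AutoModelForCausalLM.from_pretrained("unlearned-model")

candidates = ["Alice's phone number is ...", "The weather is nice today."]
for text in candidates:
    # A large loss increase from `before` to `after` suggests the text was in
    # the forget set, leaking membership despite (or because of) unlearning.
    delta = sequence_loss(after, tok, text) - sequence_loss(before, tok, text)
    print(f"{delta:+.3f}  {text}")
```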
