

Poster in Workshop: The Impact of Memorization on Trustworthy Foundation Models

Knowledge‑Distilled Memory Editing for Plug‑and‑Play LLM Alignment

Haozheng Luo · Jiahao Yu · Wenxin Zhang · Jialong Li · Jerry Yao-Chieh Hu · Yan Chen · Binghui Wang · Xinyu Xing · Han Liu

Sat 19 Jul 8:30 a.m. PDT — 9:30 a.m. PDT

Abstract:

We introduce a low-resource safety enhancement method for aligning large language models (LLMs) without supervised fine-tuning (SFT) or reinforcement learning from human feedback (RLHF). Our main idea is to use knowledge distillation to extract alignment information from existing well-aligned LLMs and integrate it into unaligned LLMs in a plug-and-play fashion. Methodologically, we employ delta debugging to identify the critical components of knowledge necessary for effective distillation. On the harmful question dataset, our method improves the average defense success rate by approximately 14.41%, and by as much as 51.39%, across 17 unaligned pre-trained LLMs, without compromising performance.
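
The abstract does not say how delta debugging is applied to the distilled knowledge. As a rough illustration only, the sketch below shows the generic delta-debugging reduction (in the style of ddmin): given a list of candidate components and a predicate that reports whether a subset still preserves the desired behavior, it shrinks the list to a subset from which no single component can be dropped. The names `ddmin`, `components`, and `still_aligned` are hypothetical, and the toy predicate stands in for whatever alignment check the authors actually use.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def ddmin(components: Sequence[T], still_aligned: Callable[[List[T]], bool]) -> List[T]:
    """Generic delta-debugging reduction (a sketch, not the authors' code).

    Returns a subset of `components` for which `still_aligned` holds and
    from which no single element can be removed without breaking it.
    """
    current = list(components)
    if not still_aligned(current):
        raise ValueError("the full component set must satisfy the predicate")

    n = 2  # current granularity: number of chunks to split into
    while len(current) >= 2:
        chunk = max(1, len(current) // n)
        subsets = [current[i:i + chunk] for i in range(0, len(current), chunk)]
        reduced = False
        for i, subset in enumerate(subsets):
            # Everything except the i-th chunk.
            complement = [x for j, s in enumerate(subsets) if j != i for x in s]
            if still_aligned(subset):
                # One chunk alone suffices: restart at coarse granularity.
                current, n, reduced = subset, 2, True
                break
            if still_aligned(complement):
                # The i-th chunk is unnecessary: drop it.
                current, n, reduced = complement, max(n - 1, 2), True
                break
        if not reduced:
            if n >= len(current):
                break  # already at single-element granularity
            n = min(len(current), n * 2)  # try finer chunks
    return current


if __name__ == "__main__":
    # Toy stand-in: pretend components 2 and 5 are the critical ones.
    critical = {2, 5}
    result = ddmin(list(range(8)), lambda kept: critical <= set(kept))
    print(result)
```

In this toy run the predicate is a simple set-containment check; in practice each predicate call would be an expensive evaluation, so the number of calls is the cost that matters.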
