ICML When Compression Breaks Safety, Interpretability Reveals The Fix

Poster in Workshop: Actionable Interpretability

When Compression Breaks Safety, Interpretability Reveals The Fix

Vishnu Chhabra · Mahdi Khalili

Sat 19 Jul 10:40 a.m. PDT — 11:40 a.m. PDT

Abstract:

The rapid growth of large language models has spurred significant interest in model compression as a means to enhance their accessibility and practicality. While extensive research has explored model compression through the lens of safety, findings suggest that safety-aligned models often lose elements of trustworthiness post-compression. Simultaneously, the field of mechanistic interpretability has gained traction, with notable discoveries, such as the identification of a single direction in the residual stream mediating refusal behaviors across diverse model architectures. In this work, we investigate the safety of compressed models by examining the mechanisms of refusal, adopting a novel interpretability-driven perspective to evaluate model safety. Furthermore, leveraging insights from our interpretability analysis, we propose a lightweight, computationally efficient method to enhance the safety of compressed models without compromising their performance or utility.
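
For readers unfamiliar with the "single direction" finding mentioned in the abstract, the following is a minimal sketch of the difference-of-means construction commonly used in prior interpretability work to estimate a refusal direction, together with directional ablation (projecting that direction out of an activation). It is not the method proposed in this poster, which the abstract does not specify. The function names and the toy three-dimensional activations are hypothetical and chosen only for illustration; in practice the vectors would be residual-stream activations collected from a model on harmful and harmless prompts.

```python
import math


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def unit(v):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("zero vector has no direction")
    return [x / norm for x in v]


def refusal_direction(harmful_acts, harmless_acts):
    """Difference of mean activations (harmful minus harmless), normalized."""
    mu_harmful = mean_vector(harmful_acts)
    mu_harmless = mean_vector(harmless_acts)
    return unit([a - b for a, b in zip(mu_harmful, mu_harmless)])


def projection(act, direction):
    """Scalar component of an activation along a unit direction."""
    return sum(a * d for a, d in zip(act, direction))


def ablate(act, direction):
    """Directional ablation: remove the component along the direction."""
    p = projection(act, direction)
    return [a - p * d for a, d in zip(act, direction)]


# Toy activations (hypothetical): the first coordinate separates the two sets.
harmful = [[2.0, 0.1, 0.0], [2.2, -0.1, 0.0]]
harmless = [[0.0, 0.1, 0.0], [0.2, -0.1, 0.0]]

r_hat = refusal_direction(harmful, harmless)
ablated = ablate([3.0, 1.0, 2.0], r_hat)
```

Under this sketch, comparing `projection(act, r_hat)` for the same prompts before and after compression would be one way to check whether the direction is still expressed as strongly. That comparison is an assumption about how such an analysis could be set up, not a description of the authors' procedure.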
