Spotlight Poster
Model Immunization from a Condition Number Perspective
Amber Yijia Zheng · Cedar Site Bai · Brian Bullins · Raymond A. Yeh
East Exhibition Hall A-B #E-1604
Thu 17 Jul 10 a.m. PDT — 11 a.m. PDT
Model immunization aims to pre-train models that are difficult to fine-tune on harmful tasks while retaining their utility on other non-harmful tasks. Though prior work has shown empirical evidence for immunizing text-to-image models, a precise understanding of when immunization is possible and a formal definition of an immunized model remain lacking. In this work, we propose a framework, based on the condition number of a Hessian matrix, to analyze model immunization for linear models. Building on this framework, we design an algorithm with regularization terms to control the resulting condition numbers after pre-training. Empirical results on linear models and non-linear deep-nets demonstrate the effectiveness of the proposed algorithm on model immunization. The code is available at https://github.com/amberyzheng/model-immunization-cond-num.
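To make the condition-number idea concrete, here is a minimal, hypothetical sketch (not the authors' released implementation): for a linear model with squared loss, the Hessian with respect to the weights is the feature Gram matrix X^T X / n, and its condition number governs how quickly gradient descent can fine-tune on that data. The function and variable names below (`hessian_condition_number`, `immunization_regularizer`, `X_harm`, `X_safe`) are illustrative assumptions.

```python
import torch

def hessian_condition_number(X: torch.Tensor, eps: float = 1e-12) -> torch.Tensor:
    """Condition number of the squared-loss Hessian H = X^T X / n for features X."""
    H = X.T @ X / X.shape[0]
    eigvals = torch.linalg.eigvalsh(H)       # ascending order, differentiable
    return eigvals[-1] / (eigvals[0] + eps)  # lambda_max / lambda_min

def immunization_regularizer(X_harm: torch.Tensor, X_safe: torch.Tensor) -> torch.Tensor:
    """Illustrative pre-training penalty: push the harmful-task Hessian toward
    ill-conditioning (hard to fine-tune) while keeping the safe-task Hessian
    well-conditioned (utility preserved)."""
    cond_harm = hessian_condition_number(X_harm)
    cond_safe = hessian_condition_number(X_safe)
    # Maximize cond_harm, minimize cond_safe; the log keeps both terms on comparable scales.
    return -torch.log(cond_harm) + torch.log(cond_safe)

# Usage: X_harm / X_safe stand in for feature representations of harmful vs. benign data.
X_harm = torch.randn(128, 16, requires_grad=True)
X_safe = torch.randn(128, 16, requires_grad=True)
loss = immunization_regularizer(X_harm, X_safe)
loss.backward()
```

In this toy setup the regularizer would be added to the ordinary pre-training loss; a large condition number on the harmful task slows gradient-based fine-tuning there, while a small one on the benign task keeps it easy to adapt.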
When powerful AI models are open-sourced, there is a risk that they could be fine-tuned to produce harmful content. This paper addresses how to train these models in a way that makes misuse more difficult while still ensuring they remain useful for safe purposes. We explore this issue by examining how easily a model can be optimized after its initial training. By controlling the condition number during training, we improve the model's resistance to being fine-tuned on harmful data. Our method is supported both in theory and in practice, yielding promising results. We hope this approach contributes to making AI models safer for public release.