Poster in Workshop: The 2nd Workshop on Reliable and Responsible Foundation Models
Foundational Models Must Be Designed To Yield Safer Loss Landscapes That Resist Harmful Fine-Tuning
Karan Uppal · Pavan Kalyan Tankala
Keywords: [ tamper-resistance ] [ safety ] [ model robustness ] [ trustworthy AI ] [ loss landscape ] [ reliability ] [ pretraining ] [ alignment ]
This position paper argues that foundational models must be engineered during pretraining to develop inherent resistance to harmful fine-tuning, rather than relying on post-training interventions or inference-time guardrails. Recent work has shown that even minimal adversarial data can compromise the safety alignment of state-of-the-art models at remarkably low cost. We propose an integrated approach that combines loss landscape engineering, self-destructing model techniques, and constrained optimization to create models that naturally resist harmful adaptation while preserving beneficial fine-tuning capabilities. By proactively addressing this vulnerability through pretraining interventions rather than reactive measures, we can improve the safety and trustworthiness of AI systems as their capabilities continue to advance.
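To make the idea of pretraining-time resistance concrete, the sketch below shows one possible (hypothetical, not the paper's) formulation in PyTorch: a meta-learning-style objective that simulates an adversary taking a fine-tuning step on harmful data and then penalizes how well that step works, while keeping a standard benign loss. All names here (`tamper_resistance_loss`, `benign_batch`, `harmful_batch`, `inner_lr`, `tamper_weight`) are illustrative assumptions, and the single-step inner loop is a simplification of the kind of loss landscape shaping the abstract alludes to.

```python
# Hypothetical sketch of a tamper-resistance objective (not the authors' method):
# simulate one adversarial fine-tuning step on harmful data, then penalize how
# much that step helps, while preserving performance on benign data.
import torch
import torch.nn.functional as F


def tamper_resistance_loss(model, benign_batch, harmful_batch,
                           inner_lr=1e-2, tamper_weight=1.0):
    """Benign utility loss plus a resistance term against a simulated
    single-step harmful fine-tuning attack."""
    # 1) Standard benign objective: keep the model useful.
    benign_logits = model(benign_batch["inputs"])
    benign_loss = F.cross_entropy(benign_logits, benign_batch["labels"])

    # 2) Simulate the attacker: one SGD step on harmful data, done functionally
    #    so gradients can flow through the inner update (second-order).
    params = dict(model.named_parameters())
    harmful_logits = torch.func.functional_call(model, params, harmful_batch["inputs"])
    harmful_loss = F.cross_entropy(harmful_logits, harmful_batch["labels"])
    grads = torch.autograd.grad(harmful_loss, list(params.values()), create_graph=True)
    attacked = {name: p - inner_lr * g
                for (name, p), g in zip(params.items(), grads)}

    # 3) Resistance term: after the simulated attack, the harmful loss should
    #    remain high, so minimizing its negative maximizes post-attack loss.
    post_logits = torch.func.functional_call(model, attacked, harmful_batch["inputs"])
    post_attack_harmful_loss = F.cross_entropy(post_logits, harmful_batch["labels"])

    return benign_loss - tamper_weight * post_attack_harmful_loss


if __name__ == "__main__":
    # Tiny toy demo to show the objective is differentiable end to end.
    model = torch.nn.Linear(16, 4)
    benign = {"inputs": torch.randn(8, 16), "labels": torch.randint(0, 4, (8,))}
    harmful = {"inputs": torch.randn(8, 16), "labels": torch.randint(0, 4, (8,))}
    loss = tamper_resistance_loss(model, benign, harmful)
    loss.backward()  # gradients flow through the simulated attack step
```

A practical variant would likely use several inner steps, a bounded or clamped resistance term to avoid divergence, and a proxy distribution of harmful tasks for the simulated attacker; the single-step version above is only meant to show how second-order gradients can shape the loss landscape around the current parameters so that harmful adaptation becomes harder.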