Poster in Workshop: The 2nd Workshop on Reliable and Responsible Foundation Models
Foundational Models Must Be Designed To Yield Safer Loss Landscapes That Resist Harmful Fine-Tuning
Karan Uppal · Pavan Kalyan Tankala
Keywords: [ tamper-resistance ] [ safety ] [ model robustness ] [ trustworthy AI ] [ loss landscape ] [ reliability ] [ pretraining ] [ alignment ]
This position paper argues that foundational models must be engineered during pretraining to develop inherent resistance to harmful fine-tuning, rather than relying on post-training interventions or inference-time guardrails. Recent work has shown that even minimal adversarial data can compromise the safety alignment of state-of-the-art models at remarkably low cost. We propose an integrated approach that combines loss landscape engineering, self-destructing model techniques, and constrained optimization to create models that naturally resist harmful adaptation while preserving beneficial fine-tuning capabilities. By proactively addressing this vulnerability through pretraining interventions rather than reactive measures, we can improve the safety and trustworthiness of AI systems as their capabilities continue to advance.
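To make the idea of pretraining-time resistance concrete, the sketch below shows one possible (hypothetical, not the paper's) formulation in PyTorch: a meta-learning-style objective that simulates an adversary taking a fine-tuning step on harmful data and then penalizes how well that step works, while keeping a standard benign loss. All names here (`tamper_resistance_loss`, `benign_batch`, `harmful_batch`, `inner_lr`, `tamper_weight`) are illustrative assumptions, and the single-step inner loop is a simplification of the kind of loss landscape shaping the abstract alludes to.

```python
# Hypothetical sketch of a tamper-resistance objective (not the authors' method):
# simulate one adversarial fine-tuning step on harmful data, then penalize how
# much that step helps, while preserving performance on benign data.
import torch
import torch.nn.functional as F


def tamper_resistance_loss(model, benign_batch, harmful_batch,
                           inner_lr=1e-2, tamper_weight=1.0):
    """Benign utility loss plus a resistance term against a simulated
    single-step harmful fine-tuning attack."""
    # 1) Standard benign objective: keep the model useful.
    benign_logits = model(benign_batch["inputs"])
    benign_loss = F.cross_entropy(benign_logits, benign_batch["labels"])

    # 2) Simulate the attacker: one SGD step on harmful data, done functionally
    #    so gradients can flow through the inner update (second-order).
    params = dict(model.named_parameters())
    harmful_logits = torch.func.functional_call(model, params, harmful_batch["inputs"])
    harmful_loss = F.cross_entropy(harmful_logits, harmful_batch["labels"])
    grads = torch.autograd.grad(harmful_loss, list(params.values()), create_graph=True)
    attacked = {name: p - inner_lr * g
                for (name, p), g in zip(params.items(), grads)}

    # 3) Resistance term: after the simulated attack, the harmful loss should
    #    remain high, so minimizing its negative maximizes post-attack loss.
    post_logits = torch.func.functional_call(model, attacked, harmful_batch["inputs"])
    post_attack_harmful_loss = F.cross_entropy(post_logits, harmful_batch["labels"])

    return benign_loss - tamper_weight * post_attack_harmful_loss


if __name__ == "__main__":
    # Tiny toy demo to show the objective is differentiable end to end.
    model = torch.nn.Linear(16, 4)
    benign = {"inputs": torch.randn(8, 16), "labels": torch.randint(0, 4, (8,))}
    harmful = {"inputs": torch.randn(8, 16), "labels": torch.randint(0, 4, (8,))}
    loss = tamper_resistance_loss(model, benign, harmful)
    loss.backward()  # gradients flow through the simulated attack step
```

A practical variant would likely use several inner steps, a bounded or clamped resistance term to avoid divergence, and a proxy distribution of harmful tasks for the simulated attacker; the single-step version above is only meant to show how second-order gradients can shape the loss landscape around the current parameters so that harmful adaptation becomes harder.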