

Poster in Workshop: DIG-BUGS: Data in Generative Models (The Bad, the Ugly, and the Greats)

JailbreakLoRA: Your Downloaded LoRA from Sharing Platforms might be Unsafe

Fanjunduo Wei · Zhenheng Tang · Rongfei Zeng · Tongliang Liu · Chengqi Zhang · Xiaowen Chu · Bo Han

Keywords: [ Large Language Models ] [ Jailbreak ] [ LoRA ]

Sat 19 Jul 3 p.m. PDT — 3:45 p.m. PDT

Abstract:

Low-Rank Adaptation (LoRA) benefits from its plug-and-play nature, enabling large language models (LLMs) to achieve significant performance gains at low cost, which has driven the development of LoRA-sharing platforms. However, the jailbreak and backdoor risks associated with these platforms remain underexplored. Existing LoRA-based attacks focus primarily on achieving high attack success rates while neglecting the core reason users adopt LoRA in the first place: to gain downstream task capabilities. Achieving an effective attack while preserving strong multi-task performance remains challenging, because the largely unrelated objectives tend to interfere with each other during optimization. In this paper, we propose JailbreakLoRA, a multi-task jailbreak LoRA training method that balances task utility and attack capability; it resolves training interference by uncertainty-weighting the losses and mitigating gradient conflicts. Additionally, JailbreakLoRA is designed to generate an affirmative prefix upon trigger activation, exploiting inference-time hallucinations to enhance the effectiveness of the jailbreak. Experimental results demonstrate that our method outperforms SOTA LoRA-based attacks, achieving a 10% improvement in attack success rate while also improving performance on multiple downstream tasks by 20%.
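The abstract describes combining losses via uncertainty weighting and resolving gradient conflicts between the attack and utility objectives. The paper's exact formulation is not given here, so the following is only a minimal sketch of the two generic ingredients it names, assuming a Kendall-et-al.-style uncertainty weighting and a PCGrad-style gradient projection; all function and variable names are hypothetical and not taken from the paper.

```python
import torch

def uncertainty_weighted_loss(losses, log_vars):
    """Combine per-task losses using learnable log-variance weights
    (homoscedastic uncertainty weighting). `log_vars` is one learnable
    scalar tensor per task; tasks with high uncertainty get down-weighted,
    and the +log_var term keeps the weights from collapsing to zero."""
    total = torch.zeros(())
    for loss, log_var in zip(losses, log_vars):
        precision = torch.exp(-log_var)
        total = total + precision * loss + log_var
    return total

def project_conflicting_grad(grad_a, grad_b, eps=1e-12):
    """PCGrad-style conflict mitigation: if the flattened gradients of two
    tasks point in opposing directions (negative dot product), project
    grad_a onto the normal plane of grad_b so the update no longer
    directly harms the other task."""
    dot = torch.dot(grad_a, grad_b)
    if dot < 0:
        grad_a = grad_a - (dot / (grad_b.norm() ** 2 + eps)) * grad_b
    return grad_a

# Illustrative usage with dummy scalar losses (stand-ins for the attack
# and downstream-task objectives in the abstract):
log_vars = [torch.zeros((), requires_grad=True) for _ in range(2)]
attack_loss, utility_loss = torch.tensor(0.8), torch.tensor(1.3)
combined = uncertainty_weighted_loss([attack_loss, utility_loss], log_vars)
```

In such a setup, the uncertainty weights remove the need to hand-tune a trade-off coefficient between the two objectives, while the projection step is applied per parameter group whenever the two task gradients conflict.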
