Poster in Workshop: The 2nd Workshop on Reliable and Responsible Foundation Models

MixAT: Combining Continuous and Discrete Adversarial Training for LLMs

Csaba Dékány · Stefan Balauca · Robin Staab · Dimitar I. Dimitrov · Martin Vechev

Keywords: [ Adversarial Examples ] [ Adversarial Training ] [ Large Language Models ] [ Adversarial Robustness ] [ LLM ] [ Jailbreak Attacks ]


Abstract:

Despite recent efforts in safety and alignment, current adversarial attacks on frontier Large Language Models (LLMs) still consistently force harmful generations. Although adversarial training has been widely studied and shown to significantly improve the robustness of other models, its strengths and weaknesses in the context of LLMs are less understood. Specifically, while existing discrete adversarial attacks are effective at producing harmful content, training LLMs with concrete adversarial prompts is often too costly, leading to reliance on continuous relaxations. As these relaxations do not correspond to discrete input tokens, such latent training methods often leave models vulnerable to discrete attacks. In this work, we aim to bridge this gap by introducing MixAT, a novel method that combines stronger discrete and faster continuous attacks during training. We rigorously evaluate MixAT across a wide spectrum of attacks, proposing the At Least One Attack Success Rate (ALO-ASR) metric to capture the worst-case vulnerability of models. We show MixAT achieves substantially better robustness (ALO-ASR < 20%) compared to prior defenses (ALO-ASR > 50%), while maintaining a runtime comparable to methods based on continuous relaxations. We further analyze MixAT in realistic deployments, exploring how quantization, adapters, and temperature affect both adversarial training and evaluation, revealing blind spots in current methodologies. Our results demonstrate that MixAT's discrete-continuous defense offers a principled and superior robustness-accuracy tradeoff with minimal computational overhead, highlighting its promise for building safer LLMs.
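As a rough illustration of the core idea, a MixAT-style training step might alternate between a precomputed discrete adversarial prompt and a fast continuous (embedding-space) perturbation. The sketch below is a minimal PyTorch approximation under assumed names and hyperparameters (`cached_discrete_attack`, `p_discrete`, `eps`, `pgd_steps`, and the unmasked single-sequence loss are all illustrative); it is not the authors' implementation.

```python
import random

import torch


def mixat_style_step(model, tok, prompt, safe_response,
                     p_discrete=0.5, eps=0.05, pgd_steps=8):
    """One hypothetical training step mixing discrete and continuous attacks.

    With probability `p_discrete`, train on a precomputed discrete adversarial
    prompt (strong, but expensive to generate); otherwise craft a cheap PGD
    perturbation in embedding space and train on the perturbed input.
    Loss masking of the prompt tokens is omitted for brevity.
    """
    if random.random() < p_discrete:
        # Discrete branch: reuse a cached discrete attack, e.g. a GCG-style
        # adversarial rewrite of the prompt (hypothetical helper).
        adv_prompt = cached_discrete_attack(prompt)
        ids = tok(adv_prompt + safe_response, return_tensors="pt").input_ids
        return model(input_ids=ids, labels=ids).loss

    # Continuous branch: PGD in embedding space (a continuous relaxation,
    # so the perturbed input need not correspond to real tokens).
    ids = tok(prompt + safe_response, return_tensors="pt").input_ids
    emb = model.get_input_embeddings()(ids).detach()
    delta = torch.zeros_like(emb, requires_grad=True)
    for _ in range(pgd_steps):
        loss = model(inputs_embeds=emb + delta, labels=ids).loss
        (grad,) = torch.autograd.grad(loss, delta)
        # Ascent step: push the embeddings toward higher loss on the safe
        # response, clipped to an L-infinity ball of radius eps.
        delta = (delta + eps * grad.sign()).clamp(-eps, eps)
        delta = delta.detach().requires_grad_(True)
    # Outer objective: minimize the loss of the safe response under the
    # adversarial perturbation.
    return model(inputs_embeds=emb + delta.detach(), labels=ids).loss
```

The mixing probability controls the tradeoff the abstract describes: discrete prompts are strong but costly to generate, while embedding-space PGD is cheap but does not correspond to actual input tokens.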

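The ALO-ASR metric itself is straightforward to compute once each attack reports per-prompt success flags. The following sketch (the data layout is assumed, not taken from the paper) counts a prompt as compromised if at least one attack in the suite breaks it, capturing worst-case vulnerability:

```python
def alo_asr(results: dict[str, list[bool]]) -> float:
    """At Least One Attack Success Rate: the fraction of prompts on which
    at least one attack in the suite succeeded.

    `results` maps attack name -> per-prompt success flags (True = jailbreak),
    with every attack evaluated on the same ordered prompt set.
    """
    flags_per_attack = list(results.values())
    n_prompts = len(flags_per_attack[0])
    compromised = sum(
        any(flags[i] for flags in flags_per_attack) for i in range(n_prompts)
    )
    return compromised / n_prompts


# Two attacks, three prompts: prompts 0 and 2 are each broken by some attack.
print(alo_asr({"gcg": [True, False, False], "pair": [False, False, True]}))  # ~0.667
```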