

Poster in Workshop: The 2nd Workshop on Reliable and Responsible Foundation Models

When Meaning Doesn’t Matter: Exposing Guard Model Fragility via Paraphrasing

Cristina Pinneri · Christos Louizos

Keywords: [ robustness to paraphrases ] [ LLM safety ] [ AI alignment ] [ semantic robustness ] [ guard models ]


Abstract:

Guard models are increasingly used to evaluate the safety of large language model (LLM) outputs. These models are intended to assess the semantic content of responses, ensuring that outputs are judged based on meaning rather than superficial linguistic features. In this work, we reveal a critical failure mode: guard models often assign significantly different scores to semantically equivalent responses that differ only in phrasing. To systematically expose this fragility, we introduce a paraphrasing-based evaluation framework that generates meaning-preserving variants of LLM outputs and measures the variability in guard model scores. Our experiments show that even minor stylistic changes can lead to large fluctuations in scoring, indicating a reliance on spurious features rather than true semantic understanding. This behavior undermines the reliability of guard models in real-world applications. Our framework provides a model-agnostic diagnostic tool for assessing semantic robustness, offering a new lens through which to evaluate and improve the trustworthiness of LLM safety mechanisms.
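The abstract describes the framework only at a high level: generate meaning-preserving paraphrases of a response, score each variant with the guard model, and measure how much the scores fluctuate. The sketch below illustrates that loop under stated assumptions; it is not the authors' implementation. The functions `paraphrase` and `guard_score` are hypothetical placeholders to be replaced with an actual paraphrasing model and guard model.

```python
"""Minimal sketch of a paraphrasing-based robustness check, assuming
hypothetical `paraphrase` and `guard_score` functions that wrap a
paraphrasing model and a guard model, respectively."""

from statistics import mean, pstdev


def paraphrase(response: str, n_variants: int = 5) -> list[str]:
    """Placeholder: return meaning-preserving rewrites of `response`.

    In practice this would call a paraphrasing model, e.g. an LLM
    prompted to rewrite the text without changing its meaning.
    """
    raise NotImplementedError("plug in a paraphrasing model here")


def guard_score(text: str) -> float:
    """Placeholder: return the guard model's safety score for `text`."""
    raise NotImplementedError("plug in a guard model here")


def score_variability(response: str, n_variants: int = 5) -> dict:
    """Score the original response and its paraphrases, then summarize
    how much the guard model's judgment fluctuates across them."""
    variants = [response] + paraphrase(response, n_variants)
    scores = [guard_score(v) for v in variants]
    return {
        "mean": mean(scores),
        "std": pstdev(scores),               # spread across paraphrases
        "range": max(scores) - min(scores),  # worst-case fluctuation
        "scores": scores,
    }
```

A large `std` or `range` for a semantically equivalent set of responses would indicate the fragility the paper reports: the guard model's verdict depends on phrasing rather than meaning.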
