Poster
POROver: Improving Safety and Reducing Overrefusal in Large Language Models with Overgeneration and Preference Optimization
Batuhan K. Karaman · ishmam zabir · Alon Benhaim · Vishrav Chaudhary · Mert Sabuncu · Xia Song
East Exhibition Hall A-B #E-802
Achieving both high safety and high usefulness simultaneously in large language models has become a critical challenge in recent years. Models often exhibit unsafe behavior or adopt an overly cautious approach, leading to frequent overrefusal of benign prompts, which reduces their usefulness. A major factor underlying these behaviors is how the models are finetuned and aligned, particularly the nature and extent of the data used. In this work, we examine how overgenerating finetuning data with advanced teacher models (e.g., GPT-4o), covering both general-purpose and toxic prompts, affects safety and usefulness in instruction-following language models. Additionally, we present POROver, an alignment strategy designed for models that are highly safe but prone to overrefusal. POROver employs preference optimization algorithms and leverages completions from an advanced teacher model to reduce overrefusals while maintaining safety. Our results show that overgenerating completions for general-purpose prompts significantly boosts safety with only a minimal impact on usefulness. Specifically, the F1 score calculated between safety and usefulness increases from 74.4% to 91.8% because of a substantial rise in safety. Moreover, overgeneration for toxic prompts raises usefulness from 11.1% to 57.6% while preserving safety. Finally, applying POROver increases usefulness further, from 57.6% to 82.1%, while keeping safety at comparable levels.
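For a concrete picture of the two ideas above, the sketch below is a minimal illustration under stated assumptions, not the paper's pipeline: it assumes the safety/usefulness F1 is the harmonic mean of the two rates, and it builds preference pairs by contrasting a teacher completion (chosen) against the student model's refusal (rejected) on benign prompts the student overrefuses. The helpers `student_answer`, `teacher_answer`, and `is_refusal` are hypothetical placeholders.

```python
# Minimal sketch (assumptions, not the paper's exact method):
# (1) the safety/usefulness F1 is taken to be the harmonic mean of the two rates;
# (2) preference pairs contrast a teacher completion (chosen) with the student's
#     refusal (rejected) on benign prompts the student overrefuses.

def f1(safety: float, usefulness: float) -> float:
    """Harmonic mean of safety and usefulness rates, both in [0, 1]."""
    if safety + usefulness == 0:
        return 0.0
    return 2 * safety * usefulness / (safety + usefulness)

def build_preference_pairs(benign_prompts, student_answer, teacher_answer, is_refusal):
    """Pair teacher completions against student refusals on benign prompts.

    student_answer / teacher_answer: callables mapping a prompt to a completion.
    is_refusal: callable flagging refusals (e.g., a classifier or keyword check).
    """
    pairs = []
    for prompt in benign_prompts:
        student_out = student_answer(prompt)
        if is_refusal(student_out):  # overrefusal: the prompt is benign
            pairs.append({
                "prompt": prompt,
                "chosen": teacher_answer(prompt),  # helpful teacher completion
                "rejected": student_out,           # unnecessary refusal
            })
    return pairs

# Illustrative values only, not the paper's per-metric numbers:
print(round(f1(0.95, 0.888), 3))  # 0.918
```

Pairs in this format can then be fed to an off-the-shelf preference optimization algorithm such as DPO, which is one instance of the preference optimization algorithms the abstract refers to.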
Large language models are powerful but often face a tough tradeoff: they either say things they should not (unsafe) or refuse to answer even harmless questions (not useful). Our paper explores how to make these models both safer and more helpful. We found that giving models more examples from smarter AI systems, such as GPT-4o, improves their behavior. We also introduce a new strategy called POROver, which aligns cautious models to make them less likely to say "no" unnecessarily, while still keeping them safe. Our work offers a practical path toward building AI systems that are both trustworthy and actually helpful in real-world use.