

Poster in Workshop: 2nd Workshop on Test-Time Adaptation: Putting Updates to the Test (PUT)

Reasoning as an Adaptive Defense for Safety

Taeyoun Kim · Fahim Tajwar · Aditi Raghunathan · Aviral Kumar

[ Project Page ]
Fri 18 Jul 2:30 p.m. PDT — 3:15 p.m. PDT

Abstract: Reasoning methods that adaptively allocate test-time compute have advanced LLM performance in math and code. We study how this framework can be used to train models for safety. We build a recipe called $\textit{\textbf{TARS}}$ (Training Adaptive Reasoners for Safety), a reinforcement learning (RL) approach that trains models to reason about safety using chain-of-thought traces and a reward signal that balances safety with task completion. When building TARS, we identify three critical design choices: (1) a lightweight warmstart SFT stage, (2) a mix of harmful, harmless, and ambiguous prompts to prevent shortcut behaviors such as over-refusal, and (3) a reward function that discourages the model from skipping reasoning altogether. Models trained with TARS exhibit adaptive behavior by spending more compute on ambiguous queries, achieve better safety-refusal trade-offs, internally learn to better distinguish between safe and unsafe prompts, attain greater robustness to attacks, and preserve general reasoning capabilities. Overall, our work provides a principled and open recipe for training LLMs for safety through adaptive reasoning.
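The abstract states only that the TARS reward balances safety with task completion and discourages reasoning-free responses; it does not give the reward's exact form. The sketch below is a hypothetical illustration of such a reward, not the authors' implementation: the judge functions, weights, and the minimum-reasoning-length penalty are all assumptions made for clarity.

```python
def tars_style_reward(prompt, reasoning, answer, is_harmful_prompt,
                      safety_judge, helpfulness_judge,
                      w_safety=1.0, w_task=1.0, min_reasoning_tokens=16):
    """Illustrative scalar reward for one (prompt, CoT, answer) RL rollout.

    safety_judge and helpfulness_judge are assumed callables returning a
    score in [0, 1]; they stand in for whatever graders the recipe uses.
    """
    # Penalize rollouts that answer without a substantive reasoning trace,
    # so the policy cannot collapse to a no-reasoning shortcut.
    if len(reasoning.split()) < min_reasoning_tokens:
        return -1.0

    if is_harmful_prompt:
        # Harmful prompts: reward safe behavior (e.g., a well-judged refusal).
        return w_safety * safety_judge(prompt, answer)

    # Harmless or ambiguous prompts: reward completing the task,
    # which counteracts over-refusal shortcuts.
    return w_task * helpfulness_judge(prompt, answer)
```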
