ICML LLMs Can Covertly Sandbag On Capability Evaluations Against Chain-of-Thought Monitoring

Spotlight Talk
in
Workshop: Workshop on Technical AI Governance

LLMs Can Covertly Sandbag On Capability Evaluations Against Chain-of-Thought Monitoring

Chloe Li · Noah Siegel · Mary Phuong

[ Abstract ] [ Project Page ]

[ OpenReview]

Sat 19 Jul 10:50 a.m. PDT — 11 a.m. PDT

Abstract: Trustworthy evaluations of dangerous capabilities are increasingly crucial for determining whether an AI system is safe to deploy. One empirically demonstrated threat to this is ${\textit{sandbagging}}$ — the strategic underperformance on evaluations by AI models or their developers for their benefit. One promising avenue to defend against this is by monitoring a model's chain-of-thought (CoT) reasoning, as this could reveal its intentions and plans. In this work, we measure the ability of models to sandbag on dangerous capability evaluations against a CoT monitor by prompting them to sandbag while either being CoT monitoring-oblivious or monitoring-aware. We show that both frontier models and smaller open-sourced models can $\textit{covertly}$ sandbag and bypass CoT monitoring, with a trade-off between sandbagging efficacy and covertness. We qualitatively analyzed the uncaught CoT to understand $\textit{why}$ the monitor failed. We reveal a rich attack surface for CoT monitoring and contribute five covert sandbag attack policies discovered in the wild. These results help inform potential failure modes of CoT monitoring and build sandbagging model organisms to strengthen monitoring.

Chat is not available.

Spotlight Talk in Workshop: Workshop on Technical AI Governance

LLMs Can Covertly Sandbag On Capability Evaluations Against Chain-of-Thought Monitoring

Chloe Li · Noah Siegel · Mary Phuong

Spotlight Talk
in
Workshop: Workshop on Technical AI Governance