Spotlight Talk
in
Workshop: Workshop on Technical AI Governance
LLMs Can Covertly Sandbag On Capability Evaluations Against Chain-of-Thought Monitoring
Chloe Li · Noah Siegel · Mary Phuong
Abstract:
Trustworthy evaluations of dangerous capabilities are increasingly crucial for determining whether an AI system is safe to deploy. One empirically demonstrated threat to this is ${\textit{sandbagging}}$ — the strategic underperformance on evaluations by AI models or their developers for their benefit. One promising avenue to defend against this is by monitoring a model's chain-of-thought (CoT) reasoning, as this could reveal its intentions and plans. In this work, we measure the ability of models to sandbag on dangerous capability evaluations against a CoT monitor by prompting them to sandbag while either being CoT monitoring-oblivious or monitoring-aware. We show that both frontier models and smaller open-sourced models can $\textit{covertly}$ sandbag and bypass CoT monitoring, with a trade-off between sandbagging efficacy and covertness. We qualitatively analyzed the uncaught CoT to understand $\textit{why}$ the monitor failed. We reveal a rich attack surface for CoT monitoring and contribute five covert sandbag attack policies discovered in the wild. These results help inform potential failure modes of CoT monitoring and build sandbagging model organisms to strengthen monitoring.
Chat is not available.