

Poster in Workshop: 2nd Workshop on Test-Time Adaptation: Putting Updates to the Test (PUT)

Learning to Self-Correct through Chain-of-Thought Verification

Bradley Guo · Jingwen Gu · Jin Zhou · Wen Sun

[ Project Page ]
Fri 18 Jul 2:30 p.m. PDT — 3:15 p.m. PDT

Abstract:

Self-correction in Large Language Models (LLMs) has emerged as a promising approach to enhancing their reasoning capabilities at inference time. In this paper, we study how self-correction can enable an LLM to effectively perform search over multiple sequential turns on 3-SAT problems. We train a self-correcting model with reinforcement learning that verifies an initial solution through chain-of-thought reasoning and uses its own evaluation to provide a new solution. Despite being trained to self-correct only once, the model can revise its answers in a sequential loop at inference time, yielding multi-turn gains. Our experiments demonstrate that generating strong chain-of-thought evaluations of candidate solutions is essential: sequential scaling by refining an initial solution over k turns can surpass even the strongest oracle-guided parallel scaling methods (i.e., pass@k).
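The abstract describes an inference-time loop in which the model proposes a solution, verifies it with chain-of-thought reasoning, and revises it over k sequential turns. Below is a minimal sketch of that loop under stated assumptions: `llm_propose` and `llm_verify_and_revise` are hypothetical stand-ins for the trained model's two roles (initial proposal, then verification-driven revision) and are not part of the paper's code, and the ground-truth `satisfies` check is used only to score the final answer, not to guide the model.

```python
from typing import Callable, List, Tuple

Clause = Tuple[int, int, int]   # e.g. (1, -2, 3) encodes (x1 OR NOT x2 OR x3)
Assignment = List[bool]         # index i holds the truth value of variable i+1


def satisfies(formula: List[Clause], assignment: Assignment) -> bool:
    """Ground-truth 3-SAT check, used here only to score the final answer."""
    def lit_true(lit: int) -> bool:
        value = assignment[abs(lit) - 1]
        return value if lit > 0 else not value
    return all(any(lit_true(lit) for lit in clause) for clause in formula)


def self_correct(
    formula: List[Clause],
    llm_propose: Callable[[List[Clause]], Assignment],
    llm_verify_and_revise: Callable[[List[Clause], Assignment], Assignment],
    turns: int = 8,
) -> Tuple[Assignment, bool]:
    """Refine an initial proposal over `turns` sequential self-correction turns."""
    assignment = llm_propose(formula)
    for _ in range(turns - 1):
        # The model re-reads its own answer, verifies it with chain-of-thought
        # reasoning, and emits a revised assignment based on that evaluation.
        assignment = llm_verify_and_revise(formula, assignment)
    return assignment, satisfies(formula, assignment)
```

This sequential loop contrasts with parallel scaling (pass@k), which samples k independent solutions and keeps the best one; here each turn conditions on the previous answer and the model's own evaluation of it.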
