Poster
Improving Rationality in the Reasoning Process of Language Models through Self-playing Game
Pinzheng Wang · Juntao Li · Zecheng Tang · Haijia Gui · Min Zhang
East Exhibition Hall A-B #E-2512
Large language models (LLMs) can solve complex reasoning problems, but they often struggle to recognize when their own answers are wrong, even when they do know better. This limits their reliability in critical tasks.

We propose the Critic Discernment Game (CDG) to improve their self-evaluation skills. In this game, three roles interact: the Prover, who answers a question step by step; the Helpful Critic, who points out flaws in wrong answers without giving away the fix; and the Misleading Critic, who tries to trick the Prover into changing a correct answer by suggesting a fake error. The Prover must learn to accept helpful feedback while ignoring misleading suggestions, strengthening its ability to judge reasoning quality independently. Inspired by self-play training strategies like those used in AlphaGo, this game-like setup encourages deeper reflection.

Our results show that CDG helps models better detect their own reasoning mistakes, a step toward building AI systems that are more trustworthy, cautious, and robust in real-world decision-making.
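To make the game mechanics concrete, below is a minimal Python sketch of a single CDG round. Everything here is an illustrative assumption rather than the authors' implementation: the functions `prover_solve`, `helpful_critic`, `misleading_critic`, and `prover_revise` are hypothetical stand-ins for LLM calls, and the reward shape only mirrors the objective stated above (accept helpful critiques, resist misleading ones).

```python
# Minimal sketch of one Critic Discernment Game (CDG) round.
# All model-calling functions are hypothetical stand-ins for LLM calls,
# not the authors' implementation.
from dataclasses import dataclass

@dataclass
class Episode:
    question: str
    answer: str          # the Prover's initial step-by-step answer
    critique: str        # feedback from one of the two critics
    revised_answer: str  # the Prover's answer after seeing the critique
    critic_was_helpful: bool

def prover_solve(question: str) -> str:
    """Stand-in: the Prover produces a step-by-step answer."""
    return f"Step 1: ...; final answer to {question!r}"

def helpful_critic(answer: str) -> str:
    """Stand-in: flags a real flaw in a wrong answer without giving away the fix."""
    return "Step 1 conflicts with the problem statement; re-check it."

def misleading_critic(answer: str) -> str:
    """Stand-in: invents a plausible-sounding error in a correct answer."""
    return "Step 1 uses the wrong formula; you should change it."

def prover_revise(answer: str, critique: str) -> str:
    """Stand-in: the Prover decides whether to revise given the critique."""
    return answer  # a well-trained Prover revises only under genuine critiques

def play_round(question: str, gold: str) -> Episode:
    answer = prover_solve(question)
    if answer == gold:
        # Correct answers face the Misleading Critic's fake error.
        critique, helpful = misleading_critic(answer), False
    else:
        # Wrong answers get a genuine (but non-revealing) hint.
        critique, helpful = helpful_critic(answer), True
    revised = prover_revise(answer, critique)
    return Episode(question, answer, critique, revised, helpful)

def prover_reward(ep: Episode, gold: str) -> float:
    """Illustrative reward: accept helpful feedback, resist misleading feedback."""
    if ep.critic_was_helpful:
        return 1.0 if ep.revised_answer == gold else -1.0   # should fix the mistake
    return 1.0 if ep.revised_answer == ep.answer else -1.0  # should hold its ground
```

In this reading, the Prover never knows which critic it is facing, so maximizing reward requires actually judging whether the critique identifies a real flaw rather than deferring to, or reflexively rejecting, all feedback.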