

Spotlight Poster

rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking

Xinyu Guan · Li Lyna Zhang · Yifei Liu · Ning Shang · Youran Sun · Yi Zhu · Fan Yang · Mao Yang

East Exhibition Hall A-B #E-2407
Wed 16 Jul 11 a.m. PDT — 1:30 p.m. PDT
 
Oral presentation: Oral 3A Reasoning
Wed 16 Jul 10 a.m. PDT — 11 a.m. PDT

Abstract:

We present rStar-Math to demonstrate that small language models (SLMs) can rival or even surpass the math reasoning capability of OpenAI o1, without distillation from superior models. rStar-Math achieves this by exercising "deep thinking" through Monte Carlo Tree Search (MCTS), where a math policy SLM performs test-time search guided by an SLM-based process reward model. rStar-Math introduces three innovations to tackle the challenges in training the two SLMs: (1) a novel code-augmented CoT data synthesis method, which performs extensive MCTS rollouts to generate step-by-step verified reasoning trajectories used to train the policy SLM; (2) a novel process reward model training method that avoids naïve step-level score annotation, yielding a more effective process preference model (PPM); (3) a self-evolution recipe in which the policy SLM and PPM are built from scratch and iteratively evolved to improve reasoning capabilities. Through 4 rounds of self-evolution with millions of synthesized solutions for 747k math problems, rStar-Math boosts SLMs' math reasoning to state-of-the-art levels. On the MATH benchmark, it improves Qwen2.5-Math-7B from 58.8% to 90.0%, surpassing o1-preview by +4.5%. On the USA Math Olympiad (AIME), rStar-Math solves an average of 53.3% (8/15) of problems, ranking among the top 20% of the brightest high school math students. Code and data are available at https://github.com/microsoft/rStar.
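The test-time search described above pairs a policy model that proposes candidate reasoning steps with a process reward model that scores partial trajectories. A minimal MCTS sketch of that loop follows, with toy stand-ins (`policy_propose`, `ppm_score`) for the two SLMs; these stubs and all parameter choices are illustrative assumptions, not the paper's implementation:

```python
import math
import random

random.seed(0)

def policy_propose(state, k=3):
    """Toy stand-in for the policy SLM: propose k candidate next steps.
    (Hypothetical stub; the real system samples code-augmented CoT steps.)"""
    return [state + [f"step{len(state)}_{i}"] for i in range(k)]

def ppm_score(state):
    """Toy stand-in for the process preference model: score a partial trajectory."""
    return random.random()

def uct(child, parent_visits, c=1.4):
    # Mean PPM value plus an exploration bonus (standard UCT formula)
    mean = child["value"] / (child["visits"] or 1)
    bonus = c * math.sqrt(math.log(parent_visits + 1) / (child["visits"] + 1))
    return mean + bonus

def mcts_search(root_state, n_rollouts=50, max_depth=4):
    root = {"state": root_state, "visits": 0, "value": 0.0, "children": []}
    for _ in range(n_rollouts):
        node, path = root, [root]
        # Selection: descend by UCT until reaching an unexpanded node
        while node["children"]:
            parent_visits = node["visits"]
            node = max(node["children"], key=lambda ch: uct(ch, parent_visits))
            path.append(node)
        # Expansion: add candidate next steps proposed by the policy
        if len(node["state"]) < max_depth:
            node["children"] = [
                {"state": s, "visits": 0, "value": 0.0, "children": []}
                for s in policy_propose(node["state"])
            ]
        # Evaluation: the PPM score replaces random rollouts as the value estimate
        reward = ppm_score(node["state"])
        # Backpropagation: update visit counts and values along the path
        for n in path:
            n["visits"] += 1
            n["value"] += reward
    # Commit to the most-visited first step
    return max(root["children"], key=lambda ch: ch["visits"])["state"]

best_first_step = mcts_search([])
print(best_first_step)
```

The key design point the abstract highlights is that the leaf evaluation comes from a trained process reward model rather than from full random rollouts, which is what makes the search tractable for long reasoning chains.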

Lay Summary:

rStar-Math shows that small AI models can match or outperform larger ones like OpenAI’s o1 in solving challenging math problems, without learning from them directly. It uses a technique called Monte Carlo Tree Search (MCTS) to explore different solution paths before choosing the best one. rStar-Math trains two small models: one for generating step-by-step solutions and another for scoring them. It introduces three key innovations: high-quality data generation through code-augmented rollouts, a more effective scoring model trained without step-by-step labels, and a self-improvement loop where both models evolve together. After several rounds of this self-improvement process, rStar-Math reaches state-of-the-art results. On the difficult MATH benchmark, it reaches the same level as OpenAI’s o1. On real USA Math Olympiad (AIME) exams, it solves 8 out of 15 problems on average, placing it among the top 20% of high school competitors.
