

Spotlight Poster

VersaPRM: Multi-Domain Process Reward Model via Synthetic Reasoning Data

Thomas Zeng · Shuibai Zhang · Shutong Wu · Christian Classen · Daewon Chae · Ethan Ewer · Minjae Lee · Heeju Kim · Wonjun Kang · Jackson Kunde · Ying Fan · Jungtaek Kim · Hyung Il Koo · Kannan Ramchandran · Dimitris Papailiopoulos · Kangwook Lee

East Exhibition Hall A-B #E-2812
Wed 16 Jul 11 a.m. PDT — 1:30 p.m. PDT
 
Oral presentation: Oral 3A Reasoning
Wed 16 Jul 10 a.m. PDT — 11 a.m. PDT

Abstract:

Process Reward Models (PRMs) have proven effective at enhancing mathematical reasoning for Large Language Models (LLMs) by leveraging increased inference-time computation. However, they are predominantly trained on mathematical data, and their generalizability to non-mathematical domains has not been rigorously studied. In response, this work first shows that current PRMs perform poorly in other domains. To address this limitation, we introduce VersaPRM, a multi-domain PRM trained on synthetic reasoning data generated using our novel data generation and annotation method. VersaPRM achieves consistent performance gains across diverse domains. For instance, in the MMLU-Pro category of Law, VersaPRM, via weighted majority voting, achieves a 7.9% performance gain over the majority voting baseline, surpassing Qwen2.5-Math-PRM's gain of 1.3%. We further contribute to the community by open-sourcing all data, code, and models for VersaPRM.
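
The PRM-weighted majority voting referenced above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' released implementation: it assumes each sampled chain of thought carries per-step PRM scores in [0, 1], and it reduces a chain's step scores to a single weight via the minimum step score (one common aggregation choice; products or last-step scores are also used in the literature).

    # Minimal sketch of PRM-weighted majority voting (illustrative only).
    # Assumption: each candidate is a (final_answer, step_scores) pair, where
    # step_scores are per-step PRM correctness probabilities in [0, 1].
    from collections import defaultdict

    def weighted_majority_vote(candidates):
        """Return the answer accumulating the most PRM-weighted vote mass."""
        votes = defaultdict(float)
        for answer, step_scores in candidates:
            chain_weight = min(step_scores)  # weakest step bounds chain quality
            votes[answer] += chain_weight    # weight the vote, not just count it
        return max(votes, key=votes.get)

    # Example: three sampled solutions to the same question.
    samples = [
        ("B", [0.9, 0.8, 0.95]),  # coherent chain ending in B
        ("C", [0.4, 0.2, 0.6]),   # shaky chain ending in C
        ("C", [0.5, 0.3, 0.4]),   # another weak chain ending in C
    ]
    print(weighted_majority_vote(samples))  # -> "B"

Note how unweighted majority voting would pick C (two votes to one), while the PRM scores let the single high-quality chain ending in B outweigh the two weak chains, which is the mechanism behind the reported gains.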

Lay Summary:

Recent advances have shown that Large Language Models (like ChatGPT) solve math problems more effectively if they explain their thinking process step by step before giving a final answer. This ability can be further improved with Process Reward Models: models that check and grade each step of the reasoning process for correctness. However, previous work on Process Reward Models has mostly focused on math problems. We find that these models don't perform well on questions from other areas, such as Law or Biology. To address this, we introduce a new Process Reward Model called VersaPRM, which is trained on a more diverse set of reasoning tasks. As a result, VersaPRM can help Large Language Models reason better across a wider range of subjects, not just math.
