

Poster in Workshop: 2nd AI for Math Workshop @ ICML 2025

Not All Votes Count! Translated Program for Verification Improves Self-Consistency of Language Models for Math Reasoning

Vernon Yan Han Toh · Deepanway Ghosal · Soujanya Poria


Abstract:

Large language models (LLMs) have become increasingly capable of solving mathematical reasoning problems. However, many open-source LLMs still encounter issues with calculation errors and semantic misunderstandings during intermediate reasoning steps. In this work, we present Prove, a simple yet effective framework that leverages translated Python programs derived from natural language solutions as a verification mechanism. This verification mechanism helps identify and filter out potentially incorrect paths before final answers are aggregated. Unlike basic majority voting, our approach rejects solutions whose program outputs do not align with the generated solution, only aggregating those that pass the verification step. We conducted extensive experiments with 13 open-source LLMs of various model sizes, ranging from 0.5B to 13B parameters, across eight mathematical benchmarks. Our findings demonstrate that Prove consistently outperforms basic majority voting as a heuristic and other program-assisted reasoning baselines for solving mathematical reasoning tasks, achieving improvements of up to 18% on GSM8K and 8% on MATH-500.
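The abstract describes filtering sampled solutions by checking each one against a translated Python program before aggregating answers. Below is a minimal illustrative sketch of that verification-then-vote flow, not the authors' implementation: the callables sample_solution, translate_to_program, and run_program are hypothetical placeholders for LLM sampling, solution-to-program translation, and sandboxed execution, and the exact matching and fallback behavior is an assumption.

```python
from collections import Counter
from typing import Callable, Optional, Tuple

def prove_style_vote(
    question: str,
    sample_solution: Callable[[str], Tuple[str, str]],   # hypothetical: returns (reasoning, extracted answer)
    translate_to_program: Callable[[str, str], str],      # hypothetical: NL solution -> Python program text
    run_program: Callable[[str], Optional[str]],          # hypothetical: executes program, returns its printed output
    num_samples: int = 16,
) -> Optional[str]:
    """Sample solutions, verify each with its translated program,
    then majority-vote only over the solutions that pass verification."""
    verified_answers = []
    all_answers = []
    for _ in range(num_samples):
        reasoning, answer = sample_solution(question)
        all_answers.append(answer)
        program = translate_to_program(question, reasoning)
        program_output = run_program(program)
        # Keep the sample only if the program's output agrees with the
        # answer extracted from the natural-language solution.
        if program_output is not None and program_output == answer:
            verified_answers.append(answer)
    # Assumption: if no sample passes verification, fall back to
    # plain self-consistency (majority vote over all sampled answers).
    pool = verified_answers or all_answers
    if not pool:
        return None
    return Counter(pool).most_common(1)[0][0]
```

Compared with plain self-consistency, the only change is that disagreeing samples are dropped before the vote, so a few verified answers can outweigh a larger set of unverified ones.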
