

Poster in Affinity Workshop: New In ML

PerplexMATH: Steering LLMs Toward Mathematical Reasoning


Abstract: Large language models (LLMs) have made unprecedented progress in their reasoning abilities, significantly enhancing their capacity to tackle both mathematical and logical problems. However, as the difficulty of math problems increases, these models underperform in many areas. Furthermore, their reasoning becomes "brittle": performance drops when a problem's structure is slightly changed or additional information is introduced, which calls into question the validity of the reasoning an LLM actually performs. This study investigates whether LLMs $\textbf{balance memorization and reasoning}$ when solving hard math problems under perturbation. Through systematic evaluation of several LLMs, we find that some models $\textit{selectively}$ fall back on recall, while others reason better on certain problem types. Our results reveal a computation-reasoning trade-off: models solve computationally intensive problems faster but, in some models, with $50\%$ less reasoning engagement, suggesting heuristic shortcuts under load. Confidence scores remain high but unstable on challenging tasks, raising reliability concerns. DeepSeek exhibits the sharpest trade-off, Gemini 1.5 handles paradoxes best, and Gemini 2.0 provides the most stable confidence estimates. We introduce $\textbf{PerplexMATH}$, a dataset of hard math problems spanning several problem types, together with a reasoning framework for evaluating whether LLMs exhibit robust, human-like reasoning on math problems. Code is available at https://github.com/jerryfrancis-97/llmination-reasoning.
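The abstract does not specify how a perplexity-based signal enters the evaluation. As a rough illustration only, the following minimal sketch compares a causal LM's perplexity on an original math problem versus a perturbed variant; the model name, prompts, and the recall-versus-reasoning interpretation are all illustrative assumptions, not the paper's actual method.

```python
# Hedged sketch: perplexity of a causal LM on an original vs. perturbed
# math problem. Model name and example problems are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder; any HuggingFace causal LM works

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def perplexity(text: str) -> float:
    """Perplexity = exp(mean negative log-likelihood of the tokens)."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Passing labels makes the model return the mean cross-entropy loss.
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    return torch.exp(loss).item()

original = "If 3x + 5 = 20, then x = 5."
perturbed = "If 3x + 5 = 20 and y is a prime number, then x = 5."

# A large perplexity gap between two structurally similar problems is one
# possible signal that the model is recalling a memorized form of the
# original rather than reasoning about the perturbed variant.
print(f"original:  {perplexity(original):.2f}")
print(f"perturbed: {perplexity(perturbed):.2f}")
```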
