Poster
GSM-$\infty$: How Do Your LLMs Behave over Infinitely Increasing Reasoning Complexity and Context Length?
Yang Zhou · Hongyi Liu · Zhuoming Chen · Yuandong Tian · Beidi Chen
East Exhibition Hall A-B #E-2901
Do we really need LLMs to have long-context ability? Retrieval-Augmented Generation (RAG) is a cheap-to-build alternative that appears powerful on general long-context tasks, whereas training a long context window can cost even large companies millions of dollars. Through comprehensive evaluation, we find that RAG matches or even surpasses the performance of long-context LLMs on most existing long-context benchmarks. However, context-level methods alone are not sufficient for agents that will one day contribute to frontier scientific discovery. We clearly need a much more difficult long-context benchmark.

In the paper, we first develop a mapping between abstract computation graphs and natural-language problems. By randomly perturbing the generation of the computation graphs, we can map them to different natural-language math problems. The difficulty of a problem is defined as the number of essential steps required to solve it. In addition, the computation graph can be extended with unnecessary nodes; because the added noise has tight semantic connections to the essential core graph, it cannot simply be retrieved away, and we show empirically that RAG struggles to filter it out.

Using this generator, we produce a large quantity of problems that are guaranteed to be correctly labeled, with controllable reasoning complexity and context length. We name the resulting suite GSM-Infinite and evaluate LLMs on it comprehensively. We find that LLM performance decays following a sigmoid pattern as reasoning complexity increases, alongside other insights revealed by our studies.
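To make the generation recipe concrete, the following Python code is a minimal, illustrative sketch under our own assumptions: the entity vocabulary, sentence templates, and names such as `generate_problem` are hypothetical and do not reflect the authors' released generator. It samples a chain-shaped computation graph whose length sets the number of essential reasoning steps, pads it with noise nodes that reuse the same entities (so retrieval cannot dismiss them by topic), and renders the graph as a word problem with a known answer.

```python
# Hypothetical sketch of a GSM-Infinite-style generator: difficulty is the
# number of essential arithmetic steps; noise nodes lengthen the context
# without changing the answer. Names and templates are illustrative only.
import itertools
import random


def generate_problem(num_essential_steps, num_noise_nodes, seed=0):
    rng = random.Random(seed)

    # Each node gets a unique (owner, entity) pair so references are unambiguous.
    owners = ["Alice", "Bob", "Carol", "Dave", "Erin", "Frank", "Grace", "Heidi"]
    entities = ["apples", "oranges", "pencils", "books", "coins", "marbles"]
    pairs = list(itertools.product(owners, entities))
    rng.shuffle(pairs)
    assert num_essential_steps + num_noise_nodes + 1 <= len(pairs)

    nodes = []

    def add_node(parent_index):
        """Add a quantity: stated outright (root) or derived in one step from a parent."""
        owner, entity = pairs[len(nodes)]
        operand = rng.randint(2, 9)
        if parent_index is None:
            value = operand
            sentence = f"{owner} has {operand} {entity}."
        else:
            parent = nodes[parent_index]
            if rng.random() < 0.5:
                value = parent["value"] + operand
                sentence = (f"{owner} has {operand} more {entity} than "
                            f"{parent['owner']} has {parent['entity']}.")
            else:
                value = parent["value"] * operand
                sentence = (f"{owner} has {operand} times as many {entity} as "
                            f"{parent['owner']} has {parent['entity']}.")
        nodes.append({"owner": owner, "entity": entity,
                      "value": value, "sentence": sentence})
        return len(nodes) - 1

    # Essential chain: answering the final question requires exactly
    # num_essential_steps derivations starting from the root quantity.
    tail = add_node(None)
    for _ in range(num_essential_steps):
        tail = add_node(tail)

    # Noise nodes: they reference quantities from the core graph (tight semantic
    # overlap), but nothing on the path to the final question depends on them.
    essential_count = len(nodes)
    for _ in range(num_noise_nodes):
        add_node(rng.randrange(essential_count))

    # Interleave noise and essential facts, then ask about the chain's last node.
    sentences = [n["sentence"] for n in nodes]
    rng.shuffle(sentences)
    question = f"How many {nodes[tail]['entity']} does {nodes[tail]['owner']} have?"
    return " ".join(sentences) + " " + question, nodes[tail]["value"]


if __name__ == "__main__":
    problem, answer = generate_problem(num_essential_steps=4, num_noise_nodes=3)
    print(problem)
    print("Answer:", answer)
```

In this toy version, increasing `num_essential_steps` raises reasoning complexity while `num_noise_nodes` stretches the context without changing the answer, mirroring the two axes GSM-Infinite controls.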
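For concreteness, the sigmoid-shaped decay can be summarized with the illustrative form below; the parameterization and symbols ($k$, $d_0$) are our own assumption, not a fit reported in the paper:

$$\mathrm{Acc}(d) \;\approx\; \frac{1}{1 + e^{\,k (d - d_0)}}, \qquad k > 0,$$

where $d$ is the reasoning complexity (number of essential steps), $d_0$ marks the complexity at which accuracy falls to one half, and $k$ controls how sharply performance collapses beyond that point.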