Poster
GSM-$\infty$: How Do Your LLMs Behave over Infinitely Increasing Reasoning Complexity and Context Length?
Yang Zhou · Hongyi Liu · Zhuoming Chen · Yuandong Tian · Beidi Chen
East Exhibition Hall A-B #E-2901
Do we really need LLMs to have long-context ability? Retrieval-Augmented Generation (RAG) is a cheap-to-build alternative that appears powerful on general long-context tasks, whereas training a long context window can cost even large companies millions of dollars. Through comprehensive evaluation, we find that RAG matches or even surpasses the performance of long-context LLMs on most existing long-context benchmarks. However, context-level methods alone are not sufficient for agents that will one day contribute to frontier scientific discovery. We clearly need a much more difficult long-context benchmark.

In the paper, we first develop a mapping between abstract computation graphs and natural-language problems. By randomly perturbing the generation of the computation graphs, we can map them to different natural-language math problems. The difficulty of a problem is defined as the number of essential steps required to solve it. In addition, the computation graph can be extended with unnecessary nodes; because the added noise has tight semantic connections to the essential core graph, it cannot simply be retrieved away, and we show empirically that RAG struggles to filter it out.

Using this generator, we produce a large quantity of problems that are guaranteed to be correctly labeled, with controllable reasoning complexity and context length. We name the resulting suite GSM-Infinite and evaluate LLMs on it comprehensively. We find that LLM performance decays following a sigmoid pattern as reasoning complexity increases, alongside other insights revealed by our studies.
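To make the generation recipe concrete, the following Python code is a minimal, illustrative sketch under our own assumptions: the entity vocabulary, sentence templates, and names such as `generate_problem` are hypothetical and do not reflect the authors' released generator. It samples a chain-shaped computation graph whose length sets the number of essential reasoning steps, pads it with noise nodes that reuse the same entities (so retrieval cannot dismiss them by topic), and renders the graph as a word problem with a known answer.

```python
# Hypothetical sketch of a GSM-Infinite-style generator: difficulty is the
# number of essential arithmetic steps; noise nodes lengthen the context
# without changing the answer. Names and templates are illustrative only.
import itertools
import random


def generate_problem(num_essential_steps, num_noise_nodes, seed=0):
    rng = random.Random(seed)

    # Each node gets a unique (owner, entity) pair so references are unambiguous.
    owners = ["Alice", "Bob", "Carol", "Dave", "Erin", "Frank", "Grace", "Heidi"]
    entities = ["apples", "oranges", "pencils", "books", "coins", "marbles"]
    pairs = list(itertools.product(owners, entities))
    rng.shuffle(pairs)
    assert num_essential_steps + num_noise_nodes + 1 <= len(pairs)

    nodes = []

    def add_node(parent_index):
        """Add a quantity: stated outright (root) or derived in one step from a parent."""
        owner, entity = pairs[len(nodes)]
        operand = rng.randint(2, 9)
        if parent_index is None:
            value = operand
            sentence = f"{owner} has {operand} {entity}."
        else:
            parent = nodes[parent_index]
            if rng.random() < 0.5:
                value = parent["value"] + operand
                sentence = (f"{owner} has {operand} more {entity} than "
                            f"{parent['owner']} has {parent['entity']}.")
            else:
                value = parent["value"] * operand
                sentence = (f"{owner} has {operand} times as many {entity} as "
                            f"{parent['owner']} has {parent['entity']}.")
        nodes.append({"owner": owner, "entity": entity,
                      "value": value, "sentence": sentence})
        return len(nodes) - 1

    # Essential chain: answering the final question requires exactly
    # num_essential_steps derivations starting from the root quantity.
    tail = add_node(None)
    for _ in range(num_essential_steps):
        tail = add_node(tail)

    # Noise nodes: they reference quantities from the core graph (tight semantic
    # overlap), but nothing on the path to the final question depends on them.
    essential_count = len(nodes)
    for _ in range(num_noise_nodes):
        add_node(rng.randrange(essential_count))

    # Interleave noise and essential facts, then ask about the chain's last node.
    sentences = [n["sentence"] for n in nodes]
    rng.shuffle(sentences)
    question = f"How many {nodes[tail]['entity']} does {nodes[tail]['owner']} have?"
    return " ".join(sentences) + " " + question, nodes[tail]["value"]


if __name__ == "__main__":
    problem, answer = generate_problem(num_essential_steps=4, num_noise_nodes=3)
    print(problem)
    print("Answer:", answer)
```

In this toy version, increasing `num_essential_steps` raises reasoning complexity while `num_noise_nodes` stretches the context without changing the answer, mirroring the two axes GSM-Infinite controls.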
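For concreteness, the sigmoid-shaped decay can be summarized with the illustrative form below; the parameterization and symbols ($k$, $d_0$) are our own assumption, not a fit reported in the paper:

$$\mathrm{Acc}(d) \;\approx\; \frac{1}{1 + e^{\,k (d - d_0)}}, \qquad k > 0,$$

where $d$ is the reasoning complexity (number of essential steps), $d_0$ marks the complexity at which accuracy falls to one half, and $k$ controls how sharply performance collapses beyond that point.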