Poster
Putnam-AXIOM: A Functional & Static Benchmark for Measuring Higher Level Mathematical Reasoning in LLMs
Aryan Gulati · Brando Miranda · Eric Chen · Emily Xia · Kai Fronsdal · Bruno de Moraes Dumont · Sanmi Koyejo
East Exhibition Hall A-B #E-2502
Modern AI chatbots can already pass many school-level math tests, so researchers need harder ways to tell whether these systems are truly "thinking" or just repeating what they have seen online. We built a new test set called Putnam-AXIOM by collecting 522 challenging questions from the famous Putnam competition for university students. To make sure models have not simply memorized the answers from the internet, we also wrote a program that automatically rewrites each problem, changing its numbers, variable names, and wording while keeping the difficulty the same. This lets us generate an endless supply of fresh versions that the models have never encountered. When we ran twenty leading AI models on our test, every one of them scored lower on the rewritten versions than on the originals; the best model's score fell from 42% to 22%. This gap suggests that today's systems still rely heavily on memorization. Finally, we introduce a simple scoring method that checks how closely a model follows each step of a correct solution, not just whether it reaches the right final answer. All data, code, and evaluation tools are publicly released so others can track future progress in genuine mathematical reasoning.
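One way to picture the rewriting idea is as a library of parameterized problem templates: each template exposes its constants and variable names as parameters and recomputes the ground-truth answer programmatically for every draw, so every variant is fresh but equally difficult. The sketch below is only a toy illustration of that idea, not the released Putnam-AXIOM generator; the template, function name, and parameter ranges are all assumptions made for the example.

```python
import random

# Minimal sketch (assumed, not the released generator): a single problem template
# whose constants and variable name are randomized, with the answer recomputed
# for each draw so difficulty stays fixed while surface details change.
def make_variation(seed: int) -> dict:
    rng = random.Random(seed)
    a = rng.randint(2, 9)                      # swap in a new constant
    n = rng.randint(3, 7)                      # swap in another constant
    var = rng.choice(["x", "y", "t", "z"])     # rename the variable

    problem = f"Find the sum of the coefficients of ({var} + {a})^{n}."
    # Setting the variable to 1 gives the sum of coefficients: (1 + a)^n.
    answer = (1 + a) ** n
    return {"problem": problem, "answer": answer}

if __name__ == "__main__":
    for seed in range(3):
        variant = make_variation(seed)
        print(variant["problem"], "->", variant["answer"])
```

Because the answer is recomputed rather than copied, a model that merely memorized the original problem's solution gets no credit on the variant, which is exactly the effect the benchmark is designed to measure.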