Poster in Workshop: Exploration in AI Today (EXAIT)
e3: Learning to Explore Enables Extrapolation of Test-Time Compute for LLMs
Amrith Setlur · Matthew Yang · Charlie Snell · Jeremiah Greer · Ian Wu · Virginia Smith · Max Simchowitz · Aviral Kumar
Keywords: [ RL ] [ test-time compute ] [ reasoning ] [ LLM ] [ exploration ]
Sat 19 Jul 8:30 a.m. PDT — 5:15 p.m. PDT
Test-time scaling offers a promising path to improving LLM reasoning by using more compute at inference time; however, the true promise of this paradigm lies in extrapolation, i.e., performance that continues to improve on hard problems as the LLM keeps "thinking" for longer, far beyond the maximum token budget it was trained on. Surprisingly, we find that most existing reasoning models do not extrapolate. We show that one way to enable extrapolation is to train the LLM for in-context exploration: that is, training the LLM to spend its test-time budget effectively by chaining operations (such as generation, verification, and refinement) and testing multiple hypotheses before committing to an answer. To enable in-context exploration, we identify three key ingredients in our recipe e3: (1) chaining asymmetries in base LLM competence, e.g., chaining verification (easy) with generation (hard), as a way to implement in-context search; (2) leveraging negative gradients from incorrect traces to amplify exploration that chains additional asymmetries, resulting in longer search traces during RL; and (3) aligning task difficulty with the training token budget to structure in-context exploration. Our recipe e3 produces the best-known 1.7B model on AIME/HMMT'25 scores, and this model can also extrapolate test-time compute to 2.5x its training budget.
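As a rough illustration of what in-context exploration can look like at inference time, the sketch below chains generation (hard), verification (easy), and refinement under a fixed token budget, committing to an answer only once a hypothesis passes verification. This is an assumption-laden sketch, not the paper's implementation: the generate, verify, and refine helpers are hypothetical placeholders for LLM calls, and e3 trains this behavior with RL rather than hand-coding a loop.

    # Hypothetical sketch of an in-context exploration loop, not the e3 recipe itself.
    # `generate`, `verify`, and `refine` are assumed helpers that each return their
    # result plus the number of tokens they consumed.

    def in_context_explore(problem, budget_tokens, generate, verify, refine):
        """Test multiple hypotheses under a token budget before committing to an answer."""
        spent = 0
        fallback = None
        while spent < budget_tokens:
            # Generation (hard): propose a candidate solution.
            candidate, used = generate(problem)
            spent += used

            # Verification (easy): check the candidate before committing to it.
            ok, feedback, used = verify(problem, candidate)
            spent += used
            if ok:
                return candidate  # commit once a hypothesis passes verification

            # Refinement: use verifier feedback to shape the next hypothesis.
            candidate, used = refine(problem, candidate, feedback)
            spent += used
            fallback = candidate  # keep the latest refinement in case the budget runs out
        return fallback  # budget exhausted; return the best unverified attempt

Extrapolation in this picture simply means the loop keeps paying off as budget_tokens grows past the budget used during training, which is the behavior the abstract reports for models trained with e3.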