Poster in Workshop: 2nd Workshop on Test-Time Adaptation: Putting Updates to the Test (PUT)
e3: Learning to Explore Enables Extrapolation of Test-Time Compute for LLMs
Amrith Setlur · Matthew Yang · Charlie Snell · Jeremiah Greer · Ian Wu · Virginia Smith · Max Simchowitz · Aviral Kumar
Test-time scaling offers a promising path to improving LLM reasoning; however, the true promise of this paradigm lies in extrapolation (i.e., continuing to improve performance as LLMs "think" for longer). We show that one way to enable extrapolation is by training the LLM for in-context exploration, i.e., training the LLM to spend its test-time budget effectively by chaining operations such as generation, verification, and refinement. To enable in-context exploration, our recipe e3 combines three key ingredients: (1) chaining asymmetries in base LLM competence, e.g., chaining verification (easy) with generation (hard), as a way to implement in-context search; (2) leveraging negative gradients from incorrect traces to amplify exploration that chains additional asymmetries; and (3) aligning task difficulty with the training token budget to structure in-context exploration. Our recipe e3 produces the best-performing 1.7B model on AIME/HMMT'25 and extrapolates test-time compute to 2.5x the training budget.
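To make the notion of chained in-context exploration concrete, here is a minimal sketch of the generate → verify → refine loop the abstract alludes to. The helper callables `generate`, `verify`, and `refine` are hypothetical stand-ins for LLM calls and are not part of the e3 recipe itself; in e3 this chaining is learned and happens inside a single long chain of thought rather than through an external controller like this one.

```python
def solve_with_in_context_exploration(problem: str,
                                      generate, verify, refine,
                                      token_budget: int = 8192) -> str:
    """Chain asymmetric operations (hard generation, easier verification,
    targeted refinement) until the answer verifies or the budget runs out.

    `generate(problem)` -> (attempt, tokens_used)
    `verify(problem, attempt)` -> (is_correct, feedback, tokens_used)
    `refine(problem, attempt, feedback)` -> (attempt, tokens_used)
    These signatures are assumptions made for illustration only.
    """
    tokens_used = 0
    attempt, cost = generate(problem)          # hard: propose a full solution
    tokens_used += cost
    while tokens_used < token_budget:
        ok, feedback, cost = verify(problem, attempt)   # easy: check the attempt
        tokens_used += cost
        if ok:
            break
        attempt, cost = refine(problem, attempt, feedback)  # revise using feedback
        tokens_used += cost
    return attempt
```

Under this framing, the `token_budget` plays the role of the test-time compute budget that e3 scales and extrapolates beyond the budget seen during training.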