

Poster
in
Workshop: ES-FoMo III: 3rd Workshop on Efficient Systems for Foundation Models

SPECS: Faster Test-Time Scaling through Speculative Drafts

Mert Cemri · Nived Rajaraman · Rishabh Tiwari · Xiaoxuan Liu · Kurt Keutzer · Ion Stoica · Kannan Ramchandran · Ahmad Beirami · Ziteng Sun


Abstract:

Scaling test-time compute has driven the recent advances in the reasoning capabilities of large language models (LLMs), typically by allocating additional computation for more thorough exploration. However, increased compute often comes at the expense of higher user-facing latency, directly impacting the user experience. Current test-time scaling methods primarily optimize accuracy against total compute (FLOPs), often overlooking latency constraints. To address this gap, we propose SPECS, a latency-aware test-time scaling method inspired by speculative decoding. SPECS uses a smaller, faster model to generate candidate sequences efficiently, and evaluates these candidates using signals from both a larger target model and a dedicated reward model. We introduce new integration strategies, including reward-guided soft verification and a reward-based deferral mechanism. Empirical results on the MATH-500 and AMC23 datasets show that SPECS matches or surpasses beam search accuracy while reducing latency by up to 15.3%. Our theoretical analysis shows that our algorithm converges to the solution of a KL-regularized reinforcement learning objective as the beam width increases.
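The abstract's decoding loop (draft candidates with a small model, score them with the target and reward models, softly verify, and defer when no candidate scores well) can be sketched roughly as follows. This is a minimal toy illustration of our reading of the abstract, not the authors' implementation: the stand-in functions `draft_candidates`, `target_score`, and `reward_score`, the additive score combination, and the `defer_threshold` parameter are all hypothetical placeholders for the real draft model, target model, and reward model.

```python
import math
import random

random.seed(0)


def draft_candidates(prefix, num_candidates):
    # Hypothetical stand-in for the small draft model: propose several
    # candidate continuations (reasoning steps) of the current prefix.
    return [prefix + [f"step{len(prefix)}_{i}"] for i in range(num_candidates)]


def target_score(seq):
    # Stand-in for the larger target model's log-likelihood of a sequence.
    return -0.1 * len(seq) + random.random()


def reward_score(seq):
    # Stand-in for the dedicated reward model scoring a partial solution.
    return random.random()


def soft_verify(candidates, temperature=1.0):
    # Reward-guided soft verification (our reading of the abstract):
    # instead of a hard accept/reject rule, sample a candidate with
    # probability proportional to exp((target + reward score) / T).
    scores = [target_score(c) + reward_score(c) for c in candidates]
    m = max(scores)  # subtract max for numerical stability
    weights = [math.exp((s - m) / temperature) for s in scores]
    r, acc = random.random() * sum(weights), 0.0
    for cand, w, s in zip(candidates, weights, scores):
        acc += w
        if r <= acc:
            return cand, s
    return candidates[-1], scores[-1]


def specs_like_decode(steps=4, beam=4, defer_threshold=0.2):
    prefix, deferrals = [], 0
    for _ in range(steps):
        chosen, score = soft_verify(draft_candidates(prefix, beam))
        if score < defer_threshold:
            # Reward-based deferral: when no draft candidate scores well,
            # fall back to the target model (here only counted, since the
            # real models are not available in this sketch).
            deferrals += 1
        prefix = chosen
    return prefix, deferrals
```

The intuition is that soft verification keeps several plausible drafts in play (approaching beam search quality as the beam widens, per the theoretical claim), while deferral spends the expensive target model's compute only on the hard steps, which is where the latency savings come from.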
