Poster
Polybasic Speculative Decoding Through a Theoretical Perspective
Ruilin Wang · Huixia Li · Yuexiao Ma · Xiawu Zheng · Fei Chao · Xuefeng Xiao · Rongrong Ji
East Exhibition Hall A-B #E-2502
Large language models (LLMs) suffer from high inference latency, which hinders real-world deployment. Current speculative decoding methods, which use a draft model to propose tokens and a target model to verify them, are constrained by their two-model setup and lack rigorous theoretical guidance, capping their speedup potential. We propose polybasic speculative decoding, a framework that employs multiple interconnected draft models and is guided by a new theoretical analysis. We derive equations for the optimal inference time, establish conditions under which adding a draft model improves efficiency, and prove that speculative sampling stabilizes token acceptance. Experiments across major LLMs (e.g., LLaMA, Vicuna) show speedups of 3.16–4.43× without altering output quality. The theory enables systematic model selection and system design, advancing beyond heuristic approaches. This accelerates LLMs for applications such as translation, reasoning, and chatbots while preserving reliability, making high-quality AI more accessible.
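The sketch below illustrates the general idea with toy next-token distributions: a standard speculative-sampling accept/reject step (accept a drafted token with probability min(1, p_target/p_draft), otherwise resample from the residual), chained across several models from smallest to largest. The function names, the toy models, and the specific cascade wiring are illustrative assumptions, not the authors' exact algorithm or implementation.

```python
# Minimal, illustrative sketch of multi-draft ("polybasic") speculative sampling
# with toy next-token distributions. Cascade wiring and names are assumptions
# for illustration; no claims about the paper's exact method or its speedups.
import numpy as np

VOCAB = 8  # toy vocabulary size
rng = np.random.default_rng(0)


def toy_model(model_seed):
    """Return a toy 'model': a deterministic map context -> next-token distribution."""
    def next_token_probs(context):
        local = np.random.default_rng((model_seed, hash(context) & 0xFFFF))
        logits = local.normal(size=VOCAB)
        p = np.exp(logits - logits.max())
        return p / p.sum()
    return next_token_probs


def verify(tokens, draft, target, context):
    """Standard speculative-sampling verification: accept each drafted token with
    probability min(1, p_target/p_draft); on the first rejection, resample from the
    residual distribution max(p_target - p_draft, 0) and stop."""
    ctx, out = list(context), []
    for t in tokens:
        q = draft(tuple(ctx))
        p = target(tuple(ctx))
        if rng.random() < min(1.0, p[t] / q[t]):
            out.append(t)
            ctx.append(t)
        else:
            residual = np.maximum(p - q, 0.0)
            s = residual.sum()
            residual = residual / s if s > 0 else p
            out.append(int(rng.choice(VOCAB, p=residual)))
            break
    return out


def polybasic_step(models, context, k=4):
    """Chain of models ordered small -> large: the smallest model drafts k tokens,
    then each larger model verifies the tokens surviving the previous stage,
    treating its predecessor as the draft distribution."""
    ctx, tokens = list(context), []
    for _ in range(k):  # stage 0: draft with the smallest model
        q = models[0](tuple(ctx))
        t = int(rng.choice(VOCAB, p=q))
        tokens.append(t)
        ctx.append(t)
    for draft, verifier in zip(models[:-1], models[1:]):  # later stages: verify
        tokens = verify(tokens, draft, verifier, context)
    return tokens


if __name__ == "__main__":
    small, medium, large = toy_model(1), toy_model(2), toy_model(3)
    print("tokens accepted this step:",
          polybasic_step([small, medium, large], context=(0,), k=4))
```

Because speculative-sampling verification leaves each surviving token distributed exactly as a sample from the verifier, chaining verifiers from smallest to largest preserves the final model's output distribution, which is the sense in which such methods accelerate decoding without altering output quality.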