Poster
Speculative Prefill: Turbocharging TTFT with Lightweight and Training-Free Token Importance Estimation
Jingyu Liu · Beidi Chen · Ce Zhang
East Exhibition Hall A-B #E-2808
Improving the inference performance of Large Language Models (LLMs) has been one of the most important research directions because it directly determines the accessibility and applicability of LLMs across applications. One crucial problem is optimizing the time between the arrival of a query and the generation of the first token (time to first token, TTFT), which often gives users their first impression of a serving system's responsiveness. In contrast to the generation phase, prompt processing has drastically different bottlenecks, which makes it challenging to accelerate. We introduce Speculative Prefill, an easy-to-use method built on a simple intuition: not all of the information in the prompt is necessary for answering the query. To identify the essential information, we use a lightweight assistant model to estimate token importance; the selected tokens, along with other properties, are then sent to the target model for further processing. Our method has several advantages: 1) it is a plug-in method that requires no further training or adaptation; 2) it can be used together with speculative decoding, one of the most widely deployed decoding methods, sharing the same assistant model; 3) it reduces prompt-processing latency by up to 7 times with minimal quality loss, and even improves performance on certain tasks; and 4) it applies to the high-throughput regime. Overall, our method makes LLM inference more efficient and accessible while remaining easy to adapt to different serving systems.
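
The core mechanism lends itself to a short sketch. The snippet below is a minimal illustration under stated assumptions, not the authors' implementation: it assumes an attention-based importance heuristic, uses gpt2 / gpt2-large merely as stand-ins for the assistant and target models, and the KEEP_RATIO parameter and scoring code are illustrative. The actual method also forwards additional properties (e.g., positional information) alongside the selected tokens, which this sketch omits.

```python
# Minimal sketch (illustrative, not the authors' code): score prompt
# tokens with a small assistant model, keep the top fraction, and let
# the expensive target model prefill only those tokens.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

ASSISTANT = "gpt2"       # stand-in for a lightweight assistant model
TARGET = "gpt2-large"    # stand-in for the target model (shared vocab)
KEEP_RATIO = 0.3         # assumed fraction of prompt tokens to keep

tok = AutoTokenizer.from_pretrained(ASSISTANT)
# "eager" attention so per-head attention maps can be returned
assistant = AutoModelForCausalLM.from_pretrained(
    ASSISTANT, attn_implementation="eager"
)
target = AutoModelForCausalLM.from_pretrained(TARGET)

prompt = "A very long document ... followed by: what is the main finding?"
ids = tok(prompt, return_tensors="pt").input_ids        # [1, seq_len]

with torch.no_grad():
    out = assistant(ids, output_attentions=True)

# Heuristic importance score: attention each prompt token receives from
# the final (query-side) position, averaged over layers and heads.
att = torch.stack(out.attentions)            # [layers, 1, heads, seq, seq]
scores = att[:, 0, :, -1, :].mean(dim=(0, 1))           # [seq_len]

k = max(1, int(KEEP_RATIO * ids.shape[1]))
keep = scores.topk(k).indices.sort().values  # keep original token order

reduced_ids = ids[:, keep]
# The target model now prefills only ~KEEP_RATIO of the original tokens.
# The paper also sends positional information with the kept tokens; this
# sketch omits that and reuses token ids only.
with torch.no_grad():
    prefill = target(reduced_ids, use_cache=True)
past_kv = prefill.past_key_values  # KV cache ready for fast decoding
```

Because the assistant model plays the same role as the drafter in speculative decoding, a single small model can in principle be shared between the two techniques in one deployment, as the abstract notes.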