Poster
Speculative Prefill: Turbocharging TTFT with Lightweight and Training-Free Token Importance Estimation
Jingyu Liu · Beidi Chen · Ce Zhang
East Exhibition Hall A-B #E-2808
Improving the inference performance of Large Language Models (LLMs) has been one of the most important research directions because it directly determines the accessibility and applicability of LLMs across applications. One crucial problem is optimizing the time between the arrival of a query and the generation of the first token (time to first token, TTFT), which often gives users their first impression of a serving system's responsiveness. In contrast to the generation phase, prompt processing has drastically different bottlenecks, which makes it challenging to accelerate. We introduce Speculative Prefill, an easy-to-use method built on a simple intuition: not all of the information in the prompt is necessary for answering the query. To identify the essential information, we use a lightweight assistant model to estimate token importance; the selected tokens, along with other properties, are then sent to the target model for further processing. Our method has several advantages: 1) it is a plug-in method that requires no further training or adaptation; 2) it can be used together with speculative decoding, one of the most widely deployed decoding methods, sharing the same assistant model; 3) it reduces prompt-processing latency by up to 7 times with minimal quality loss, and even improves performance on certain tasks; and 4) it applies to the high-throughput regime. Overall, our method makes LLM inference more efficient and accessible while remaining easy to adapt to different serving systems.
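
The core mechanism lends itself to a short sketch. The snippet below is a minimal illustration under stated assumptions, not the authors' implementation: it assumes an attention-based importance heuristic, uses gpt2 / gpt2-large merely as stand-ins for the assistant and target models, and the KEEP_RATIO parameter and scoring code are illustrative. The actual method also forwards additional properties (e.g., positional information) alongside the selected tokens, which this sketch omits.

```python
# Minimal sketch (illustrative, not the authors' code): score prompt
# tokens with a small assistant model, keep the top fraction, and let
# the expensive target model prefill only those tokens.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

ASSISTANT = "gpt2"       # stand-in for a lightweight assistant model
TARGET = "gpt2-large"    # stand-in for the target model (shared vocab)
KEEP_RATIO = 0.3         # assumed fraction of prompt tokens to keep

tok = AutoTokenizer.from_pretrained(ASSISTANT)
# "eager" attention so per-head attention maps can be returned
assistant = AutoModelForCausalLM.from_pretrained(
    ASSISTANT, attn_implementation="eager"
)
target = AutoModelForCausalLM.from_pretrained(TARGET)

prompt = "A very long document ... followed by: what is the main finding?"
ids = tok(prompt, return_tensors="pt").input_ids        # [1, seq_len]

with torch.no_grad():
    out = assistant(ids, output_attentions=True)

# Heuristic importance score: attention each prompt token receives from
# the final (query-side) position, averaged over layers and heads.
att = torch.stack(out.attentions)            # [layers, 1, heads, seq, seq]
scores = att[:, 0, :, -1, :].mean(dim=(0, 1))           # [seq_len]

k = max(1, int(KEEP_RATIO * ids.shape[1]))
keep = scores.topk(k).indices.sort().values  # keep original token order

reduced_ids = ids[:, keep]
# The target model now prefills only ~KEEP_RATIO of the original tokens.
# The paper also sends positional information with the kept tokens; this
# sketch omits that and reuses token ids only.
with torch.no_grad():
    prefill = target(reduced_ids, use_cache=True)
past_kv = prefill.past_key_values  # KV cache ready for fast decoding
```

Because the assistant model plays the same role as the drafter in speculative decoding, a single small model can in principle be shared between the two techniques in one deployment, as the abstract notes.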