Poster
AdaDecode: Accelerating LLM Decoding with Adaptive Layer Parallelism
Zhepei Wei · Wei-Lin Chen · Xinyu Zhu · Yu Meng
East Exhibition Hall A-B #E-2605
Large language models (LLMs) are increasingly used to produce long, detailed texts such as Chain-of-Thought reasoning. However, generating this kind of content can be slow because LLMs typically produce one token at a time, and each token must be fully computed before the next one can begin. This sequential process limits how well modern hardware's parallel processing capabilities can be used.

In this work, we present AdaDecode, a method that speeds up text generation without changing the original model parameters or introducing extra auxiliary models. The idea is simple: when the model is confident about a token at an intermediate layer, we commit to that early prediction using only part of the model and immediately start computing the next token. The skipped layer computations are completed in parallel later, followed by a verification step that guarantees the output matches standard decoding. This approach makes better use of hardware and significantly reduces generation time. Our experiments show that AdaDecode achieves up to 1.73x faster generation while keeping the output identical to standard autoregressive decoding.
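The sketch below illustrates the general idea of early-exit decoding with deferred layer completion and verification, as described above. It is a minimal, schematic toy in plain NumPy, not the authors' implementation: the model is a stack of random linear "layers" with a shared output head, and all names (run_layers, predict, embed, exit_layer, threshold) are illustrative placeholders.

```python
# Schematic sketch of adaptive early-exit decoding with deferred layer
# completion and verification. Toy model, illustrative names only.
import numpy as np

rng = np.random.default_rng(0)
NUM_LAYERS, HIDDEN, VOCAB = 8, 16, 100
# Toy "transformer": each layer is a random linear map; one shared output head.
LAYERS = [rng.standard_normal((HIDDEN, HIDDEN)) / np.sqrt(HIDDEN)
          for _ in range(NUM_LAYERS)]
HEAD = rng.standard_normal((HIDDEN, VOCAB)) / np.sqrt(HIDDEN)

def run_layers(h, start, stop):
    """Run layers [start, stop) on hidden state h."""
    for W in LAYERS[start:stop]:
        h = np.tanh(h @ W)
    return h

def predict(h):
    """Project a hidden state to a token id and its softmax confidence."""
    logits = h @ HEAD
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    tok = int(probs.argmax())
    return tok, float(probs[tok])

def embed(tok):
    """Toy embedding: a deterministic pseudo-random vector per token id."""
    return np.random.default_rng(tok).standard_normal(HIDDEN)

def adaptive_decode(prompt_tok, max_new=5, exit_layer=4, threshold=0.05):
    out, h = [], embed(prompt_tok)
    pending = []  # (position, early hidden state, early token) awaiting the rest
    for _ in range(max_new):
        h_early = run_layers(h, 0, exit_layer)
        tok, conf = predict(h_early)
        if conf >= threshold:
            # Confident early prediction: defer the remaining layers and
            # move on to the next token immediately.
            pending.append((len(out), h_early, tok))
        else:
            # Not confident: run the remaining layers for this token now.
            tok, _ = predict(run_layers(h_early, exit_layer, NUM_LAYERS))
        out.append(tok)
        h = embed(tok)
    # Later (conceptually in parallel with newer tokens): finish the deferred
    # layers and verify that the full-model prediction matches the early one.
    for pos, h_early, early_tok in pending:
        final_tok, _ = predict(run_layers(h_early, exit_layer, NUM_LAYERS))
        if final_tok != early_tok:
            out[pos] = final_tok  # a real system would also re-decode what follows
    return out

print(adaptive_decode(prompt_tok=1))
```

In this toy, the deferred layers are simply run after the loop; the actual method batches them alongside the computation of subsequent tokens so that the hardware stays busy, and the verification step is what preserves output parity with standard decoding.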