

Spotlight Poster

Accelerating LLM Inference with Lossless Speculative Decoding Algorithms for Heterogeneous Vocabularies

Nadav Timor · Jonathan Mamou · Daniel Korat · Moshe Berchansky · Gaurav Jain · Oren Pereg · Moshe Wasserblat · David Harel

East Exhibition Hall A-B #E-2810
Tue 15 Jul 4:30 p.m. PDT — 7 p.m. PDT
 
Oral presentation: Oral 2D Efficient ML
Tue 15 Jul 3:30 p.m. PDT — 4:30 p.m. PDT

Abstract:

Accelerating the inference of large language models (LLMs) is a critical challenge in generative AI. Speculative decoding (SD) methods offer substantial efficiency gains by generating multiple tokens using a single target forward pass. However, existing SD approaches require the drafter and target models to share the same vocabulary, thus limiting the pool of possible drafters, often necessitating the training of a drafter from scratch. We present three new SD methods that remove this shared-vocabulary constraint. All three methods preserve the target distribution (i.e., they are lossless) and work with off-the-shelf models without requiring additional training or modifications. Empirically, on summarization, programming, and long-context tasks, our algorithms demonstrate significant speedups of up to 2.8x over standard autoregressive decoding. By enabling any off-the-shelf model to serve as a drafter and requiring no retraining, this work substantially broadens the applicability of the SD framework in practice.
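
For readers unfamiliar with the SD framework the paper builds on, the sketch below shows the standard lossless verification rule for the shared-vocabulary case (Leviathan et al., 2023; Chen et al., 2023): each drafted token is accepted with probability min(1, p/q), and on rejection a replacement is drawn from the renormalized residual max(0, p − q), so the output distribution matches target-only sampling exactly. Function and variable names here are illustrative, not from the paper, whose contribution is generalizing this setup to drafters with different vocabularies.

```python
import numpy as np

def verify_drafts(p_target, p_draft, draft_tokens, rng):
    """Lossless verification for standard (shared-vocabulary)
    speculative decoding. p_draft[i] / p_target[i] are the draft and
    target next-token distributions at draft position i; p_target has
    one extra row for the "bonus" token sampled when every draft is
    accepted. Returns the accepted tokens plus one target-sampled
    token, distributed exactly as autoregressive target sampling."""
    out = []
    for i, x in enumerate(draft_tokens):
        # Accept drafted token x with probability min(1, p(x) / q(x)).
        if rng.random() < min(1.0, p_target[i][x] / p_draft[i][x]):
            out.append(x)
        else:
            # On rejection, resample from the residual distribution
            # max(0, p - q), renormalized; this is what makes the
            # overall output distribution equal to the target's.
            residual = np.maximum(p_target[i] - p_draft[i], 0.0)
            out.append(int(rng.choice(len(residual), p=residual / residual.sum())))
            return out
    # All drafts accepted: sample one extra token from the target's
    # distribution at the next position.
    out.append(int(rng.choice(len(p_target[-1]), p=p_target[-1])))
    return out

# Toy usage with random distributions over a 5-token vocabulary.
rng = np.random.default_rng(0)
p_draft = rng.dirichlet(np.ones(5), size=3)
p_target = rng.dirichlet(np.ones(5), size=4)  # extra row for the bonus token
drafts = [int(rng.choice(5, p=q)) for q in p_draft]
print(verify_drafts(p_target, p_draft, drafts, rng))
```

Note that the acceptance test above implicitly assumes p_target and p_draft are defined over the same token IDs; removing that assumption is exactly what the paper's three algorithms address.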

Lay Summary:

Making large language models (like those that power chatbots) generate text faster is a big challenge. A technique called "speculative decoding" can speed things up by having a smaller, faster "drafter" model predict several words ahead, which the larger "target" model then checks at once. However, this usually only works if both models use the exact same dictionary of words (vocabulary), which is often not the case and can mean needing to build a new drafter model from scratch.

We've developed three new ways for speculative decoding to work even when the drafter and target models have different vocabularies. These methods don't change the quality of the text generated by the large model and can use existing, off-the-shelf models without any extra training.

Our experiments demonstrate that our techniques can make language models run up to 2.8 times faster on tasks like summarizing text, writing computer code, and understanding long documents. This makes it easier and more practical to use speculative decoding with a much wider variety of models, speeding up AI applications without needing to create specialized drafter models.
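
To see why the shared-vocabulary requirement is restrictive, the short snippet below illustrates the mismatch: different off-the-shelf tokenizers split the same text into different tokens with incompatible IDs, so a drafter's proposals cannot be checked directly against the target's distribution. The two tokenizers are chosen arbitrarily for illustration; they are not the models evaluated in the paper.

```python
# Illustration of the vocabulary mismatch between model families
# (example tokenizers chosen arbitrarily, not from the paper).
from transformers import AutoTokenizer

text = "Speculative decoding accelerates inference."
for name in ("gpt2", "bert-base-uncased"):
    tok = AutoTokenizer.from_pretrained(name)
    ids = tok.encode(text, add_special_tokens=False)
    print(f"{name}: {len(ids)} tokens, vocab size {tok.vocab_size}")
```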
