ICML Poster

Fast Large Language Model Collaborative Decoding via Speculation

Jiale Fu · Yuchu Jiang · Junkai Chen · Jiaming Fan · Xin Geng · Xu Yang

East Exhibition Hall A-B #E-2608


Wed 16 Jul 4:30 p.m. PDT — 7 p.m. PDT

Abstract:

Large Language Model (LLM) collaborative decoding techniques improve output quality by combining the outputs of multiple models at each generation step, but they incur high computational costs. In this paper, we introduce Collaborative decoding via Speculation (CoS), a novel framework that accelerates collaborative decoding without compromising performance. CoS is inspired by Speculative Decoding, in which a small proposal model generates tokens sequentially and a larger target model verifies them in parallel. Our approach builds on two key insights: (1) the verification distribution can be the combined distribution of both the proposal and target models, and (2) alternating each model as the proposer and verifier can further enhance efficiency. We generalize this method to collaboration among n models and theoretically prove that CoS is never slower than standard collaborative decoding and is typically faster. Extensive experiments demonstrate that CoS is 1.11x–2.23x faster than standard collaborative decoding without compromising generation quality. Our code is available at https://github.com/Kamichanw/CoS/.
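To make insight (1) concrete, the following is a minimal sketch of a single speculative-sampling check in which the verification distribution is a combination of the proposer's and the other model's next-token distributions. It is not the authors' implementation (see their repository for that). The function names `combine`, `verify_token`, and `sample_next`, the weighted-average combiner, and the toy three-token vocabulary are all assumptions made for illustration.

```python
import random


def combine(q, p, weight=0.5):
    """Weighted average of two next-token distributions.

    Assumed combiner for illustration; other collaborative decoding
    schemes would plug in a different combination rule here.
    """
    return [weight * qi + (1.0 - weight) * pi for qi, pi in zip(q, p)]


def verify_token(token, q, r, rng):
    """Standard speculative-sampling accept/reject step.

    token: index proposed by sampling from q.
    q: proposer distribution.
    r: verification distribution (here, the combined distribution).
    Returns (accepted, token). On rejection the returned token is a
    replacement drawn from the renormalized residual max(0, r - q).
    """
    accept_prob = min(1.0, r[token] / q[token])
    if rng.random() < accept_prob:
        return True, token
    # A rejection implies r[token] < q[token], so some other entry has
    # r > q and the residual has positive total mass.
    residual = [max(0.0, ri - qi) for ri, qi in zip(r, q)]
    replacement = rng.choices(range(len(r)), weights=residual, k=1)[0]
    return False, replacement


def sample_next(q, p, rng, weight=0.5):
    """Propose from q, then verify against the combined distribution."""
    r = combine(q, p, weight)
    proposed = rng.choices(range(len(q)), weights=q, k=1)[0]
    return verify_token(proposed, q, r, rng)


if __name__ == "__main__":
    rng = random.Random(0)
    q = [0.7, 0.2, 0.1]  # hypothetical proposer distribution
    p = [0.1, 0.3, 0.6]  # hypothetical second-model distribution
    print(sample_next(q, p, rng))
```

Tokens returned by `sample_next` are distributed according to the combined distribution `r`, which is the sense in which speculation leaves output quality unchanged. In this sketch the acceptance probability equals the sum over tokens of `min(q[i], r[i])`. Because `r` is pulled toward `q` by the averaging, a proposal is accepted more often than it would be when verified against `p` alone.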

Lay Summary:

Large language models (LLMs), like ChatGPT, generate responses by predicting one word (or token) at a time based on the input. A natural idea is that instead of using only one LLM to guess the next token, we can combine the guesses from several LLMs to get more accurate and reliable results. We refer to this class of methods as LLM collaborative decoding. However, because several models must be run for each token, generation with n models takes roughly n times longer, which makes these methods hard to use in practice.

To fix this problem, we propose a new framework: Collaborative Decoding via Speculation (CoS). CoS can speed up any type of collaborative decoding, such as model ensemble, contrastive decoding, or decoding-time realignment, while still keeping the same high-quality output.

CoS also does not need any training, added parameters, or extra computation. This means it can directly replace current ways of doing LLM collaborative decoding, which gives it strong potential and value for real-world use.
