ICML Higher Acceptance Rates for Speculative Decoding with Randomised Drafting

Poster in Workshop: Tiny Titans: The next wave of On-Device Learning for Foundation Models (TTODLer-FM)

Higher Acceptance Rates for Speculative Decoding with Randomised Drafting

William Toner · Martin Asenov · Rajkarn Singh · Artjom Joosen

Fri 18 Jul 1 p.m. PDT — 1:45 p.m. PDT

Abstract: Speculative decoding (SD) increases the Tokens Per Second (TPS) of a base LLM by using a smaller draft model to predict subsequent tokens. These draft tokens can be generated quickly, and their verification by the base model can occur in parallel with generating the next token. A key determinant of the impact of SD on TPS is the _acceptance rate_: how likely a draft token is to be accepted upon verification. This work explores *Randomised Drafting*, wherein a draft is only generated with some probability $a \leq 1$. By introducing this random component, we show that the acceptance rate can be boosted while preserving the distributional guarantees of SD. Despite sometimes using the base model directly, we show that Randomised Drafting can result in an overall boost in TPS. The improvement in TPS is minor but comes without cost.
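
To make the terms "acceptance rate" and "distributional guarantees" concrete, the following is a minimal sketch of the standard single-token speculative sampling rule, in which a token drafted from a draft distribution q is accepted with probability min(1, p/q) under the base distribution p, and otherwise a token is resampled from the normalised residual max(p - q, 0). It does not implement Randomised Drafting, whose details are not given in this abstract. The function names and the example distributions are hypothetical and chosen only for illustration.

```python
import numpy as np


def acceptance_rate(p, q):
    """Probability that a token drafted from q is accepted under the standard rule.

    Summing q(x) * min(1, p(x) / q(x)) over tokens x gives sum_x min(p(x), q(x)).
    """
    return float(np.minimum(p, q).sum())


def output_distribution(p, q):
    """Distribution of the emitted token: accepted draft, or a resample on rejection."""
    accept = np.minimum(p, q)            # mass emitted through accepted draft tokens
    residual = np.maximum(p - q, 0.0)    # unnormalised resampling distribution
    reject_prob = 1.0 - accept.sum()     # probability that the draft token is rejected
    total = residual.sum()
    if total > 0:
        residual = residual / total
    return accept + reject_prob * residual


# Hypothetical next-token distributions over a three-token vocabulary.
p = np.array([0.5, 0.3, 0.2])  # base model (assumed values)
q = np.array([0.7, 0.2, 0.1])  # draft model (assumed values)

rate = acceptance_rate(p, q)
emitted = output_distribution(p, q)
print(rate)
print(emitted)
```

In this sketch the emitted distribution coincides with the base distribution p, which is the guarantee the abstract refers to, while the acceptance rate depends only on the overlap between p and q.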
