Poster in Workshop: Tiny Titans: The next wave of On-Device Learning for Foundation Models (TTODLer-FM)
Higher Acceptance Rates for Speculative Decoding with Randomised Drafting
William Toner · Martin Asenov · Rajkarn Singh · Artjom Joosen
Abstract:
Speculative decoding (SD) is an approach for increasing the Tokens Per Second (TPS) of a base LLM by using a smaller draft model to predict subsequent tokens. These draft tokens can be generated quickly, and their verification by the base model can occur in parallel with generating the next token. A key determinant of the impact of SD on TPS is the _acceptance rate_: how likely a draft token is to be accepted upon verification. This work explores *Randomised Drafting*, in which a draft is only generated with some probability $a \leq 1$. By introducing this random component, we show that the acceptance rate can be boosted while preserving the distributional guarantees of SD. Despite sometimes using the base model directly, we show that Randomised Drafting can result in an overall boost in TPS. The improvement in TPS is minor but comes at no additional cost.
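To make the notion of acceptance rate concrete, the following is a minimal sketch of the standard accept/reject rule from the speculative sampling literature, not of the Randomised Drafting scheme itself, whose details are not given in this abstract. The distributions `p` (base model) and `q` (draft model) are made-up toy values over a three-token vocabulary, and all function names are hypothetical. Under this rule a draft token `x` drawn from `q` is accepted with probability `min(1, p(x)/q(x))`, so the expected acceptance rate is the sum over tokens of `min(p(x), q(x))`, and a rejected draft is replaced by a sample from the normalised positive part of `p - q`.

```python
import random


def acceptance_rate(p, q):
    """Expected acceptance probability for a token drafted from q: sum_x min(p(x), q(x))."""
    return sum(min(pi, qi) for pi, qi in zip(p, q))


def residual(p, q):
    """Normalised max(p - q, 0), the distribution used to resample after a rejection."""
    diff = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
    total = sum(diff)
    # If p == q the draft is never rejected; fall back to p to avoid dividing by zero.
    return [d / total for d in diff] if total > 0 else list(p)


def speculative_step(p, q, rng):
    """Draft one token from q, then accept it or resample. Returns (token, accepted)."""
    tokens = range(len(p))
    x = rng.choices(tokens, weights=q)[0]
    if rng.random() < min(1.0, p[x] / q[x]):
        return x, True
    return rng.choices(tokens, weights=residual(p, q))[0], False


# Toy distributions (assumed for illustration only).
p = [0.5, 0.3, 0.2]  # base model
q = [0.2, 0.3, 0.5]  # draft model

rng = random.Random(0)
n = 50_000
counts = [0] * len(p)
accepted = 0
for _ in range(n):
    token, ok = speculative_step(p, q, rng)
    counts[token] += 1
    accepted += ok

empirical_p = [c / n for c in counts]  # should be close to p
empirical_rate = accepted / n          # should be close to acceptance_rate(p, q)
```

For these toy vectors the analytic acceptance rate is 0.2 + 0.3 + 0.2 = 0.7, and the empirical token frequencies should match `p`, which illustrates the distributional guarantee that the abstract says Randomised Drafting preserves.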