The principle of speculative decoding for LLMs rests on an asymmetry in Transformer-based architectures between:
- decoding tokens one by one, resulting in individual full passes through the model
- verifying multiple tokens at once, which results in only one full pass on a slightly longer sequence
In speculative decoding, a large expensive model is augmented with a cheap draft model that generates draft tokens directly. The expensive model is then used to verify the draft tokens, accepting or rejecting them one by one. The more tokens are accepted, the larger the speedup of this technique.
In other words:
- Normal decoding: run the large model once → get 1 token
- Speculative decoding: run the draft model k times cheaply → get k candidate tokens → run the large model once on all k tokens in parallel → accept/reject
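The accept/reject step can be sketched as follows. This is a minimal toy in pure Python: `draft_model` and `target_model` are hypothetical stand-ins returning fixed distributions over a tiny vocabulary, and for clarity the target model is queried in a loop, whereas a real implementation scores all k drafted positions in a single batched forward pass. The acceptance rule (accept token x with probability min(1, p(x)/q(x)), and on rejection resample from the normalized residual max(0, p − q)) is the standard rejection-sampling scheme that preserves the target model's distribution.

```python
import random

VOCAB = 4  # toy vocabulary size

def draft_model(ctx):
    # Cheap draft model q: a fixed distribution (hypothetical stand-in).
    return [0.4, 0.3, 0.2, 0.1]

def target_model(ctx):
    # Expensive target model p: another fixed distribution (stand-in).
    return [0.25, 0.25, 0.25, 0.25]

def sample(dist):
    return random.choices(range(VOCAB), weights=dist, k=1)[0]

def speculative_step(ctx, k=4):
    """One round: draft k candidate tokens cheaply, then verify them
    against the target model, accepting a prefix of them."""
    drafted = []
    for _ in range(k):
        q = draft_model(ctx + drafted)
        drafted.append(sample(q))

    accepted = []
    for x in drafted:
        p = target_model(ctx + accepted)  # one batched pass in practice
        q = draft_model(ctx + accepted)
        # Accept draft token x with probability min(1, p(x)/q(x)).
        if random.random() < min(1.0, p[x] / q[x]):
            accepted.append(x)
        else:
            # On rejection, resample from the residual max(0, p - q),
            # renormalized, then stop this round.
            residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
            total = sum(residual)
            accepted.append(sample([r / total for r in residual]))
            break
    return accepted
```

Each round therefore yields between 1 and k tokens: at least one (the resampled token on rejection) and at most all k drafts.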
This paradigm pays off in memory-bandwidth-bound settings, which is exactly the regime of token-by-token inference: each decoding step is dominated by loading the model weights, so verifying k tokens in one pass costs little more than generating a single token.