The principle of speculative decoding for LLMs rests on an asymmetry in Transformer-based architectures between:
- decoding tokens one by one, resulting in individual full passes through the model
- verifying multiple tokens at once, which results in only one full pass on a slightly longer sequence
In speculative decoding, a large expensive model is augmented with a cheap draft model that generates draft tokens directly. The expensive model is then used to verify the draft tokens, accepting or rejecting them one by one. The more tokens are accepted, the larger the speedup of this technique.
In other words:
- Normal decoding: run the large model once → get 1 token
- Speculative decoding: run the draft model k times cheaply → get k candidate tokens → run the large model once on all k tokens in parallel → accept/reject
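The accept/reject step can be sketched as follows. This is a minimal toy in pure Python: `draft_model` and `target_model` are hypothetical stand-ins returning fixed distributions over a tiny vocabulary, and for clarity the target model is queried in a loop, whereas a real implementation scores all k drafted positions in a single batched forward pass. The acceptance rule (accept token x with probability min(1, p(x)/q(x)), and on rejection resample from the normalized residual max(0, p − q)) is the standard rejection-sampling scheme that preserves the target model's distribution.

```python
import random

VOCAB = 4  # toy vocabulary size

def draft_model(ctx):
    # Cheap draft model q: a fixed distribution (hypothetical stand-in).
    return [0.4, 0.3, 0.2, 0.1]

def target_model(ctx):
    # Expensive target model p: another fixed distribution (stand-in).
    return [0.25, 0.25, 0.25, 0.25]

def sample(dist):
    return random.choices(range(VOCAB), weights=dist, k=1)[0]

def speculative_step(ctx, k=4):
    """One round: draft k candidate tokens cheaply, then verify them
    against the target model, accepting a prefix of them."""
    drafted = []
    for _ in range(k):
        q = draft_model(ctx + drafted)
        drafted.append(sample(q))

    accepted = []
    for x in drafted:
        p = target_model(ctx + accepted)  # one batched pass in practice
        q = draft_model(ctx + accepted)
        # Accept draft token x with probability min(1, p(x)/q(x)).
        if random.random() < min(1.0, p[x] / q[x]):
            accepted.append(x)
        else:
            # On rejection, resample from the residual max(0, p - q),
            # renormalized, then stop this round.
            residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
            total = sum(residual)
            accepted.append(sample([r / total for r in residual]))
            break
    return accepted
```

Each round therefore yields between 1 and k tokens: at least one (the resampled token on rejection) and at most all k drafts.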
This paradigm pays off in memory-bandwidth-bound settings, which is exactly the regime of token-by-token inference: each decoding step is dominated by loading the model weights, so verifying k tokens in one pass costs little more than generating a single token.