- tags
- LLM, Diffusion models, Transformers, Test-time compute
- source
- (Chen et al. 2026)
Summary
DFlash is a speculative decoding framework that replaces the usual small autoregressive draft model with a lightweight block diffusion draft model. The draft model generates a whole block of tokens in a single forward pass, and the target LLM then verifies the block in parallel. The authors argue that the main bottleneck of prior speculative decoding methods (e.g. EAGLE-3) is the sequential nature of autoregressive drafting, which limits speedups to about 2-4x even with a very small draft model.
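To make the draft/verify split concrete, here is a minimal greedy-verification sketch. The function and token representations are hypothetical stand-ins, not the paper's interfaces: a drafted block is accepted up to the first position where the target would have chosen a different token.

```python
def verify_block(target_next_token, prefix, draft_block):
    """Accept the longest prefix of the drafted block that the target
    model would itself have produced (greedy verification)."""
    accepted = []
    context = list(prefix)
    for drafted in draft_block:
        # In a real system, all positions are scored in ONE parallel
        # target forward pass; this loop just mimics the outcome.
        expected = target_next_token(context)
        if drafted != expected:
            # First mismatch: keep the target's own token, discard the rest.
            accepted.append(expected)
            break
        accepted.append(drafted)
        context.append(drafted)
    return accepted

# Toy "target": the next token is always previous + 1.
target = lambda ctx: ctx[-1] + 1
print(verify_block(target, [0], [1, 2, 9, 4]))  # → [1, 2, 3]
```

The longer the accepted prefix per verification pass, the fewer target forward passes per generated token, which is where the end-to-end speedup comes from.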
The key insight is that "the target knows best": deep hidden representations from the target model already implicitly encode rich information about future tokens. A fixed set of hidden layers (sampled uniformly from shallow to deep) is extracted from the target's prefill pass, projected to form a compact "target context feature," and injected into the draft diffusion model as Key/Value entries in each of its layers. This lets a very small (3-5 layer) draft transformer reach high acceptance lengths while denoising an entire block of masked tokens in parallel.
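A single-head, single-layer NumPy sketch of the KV-injection idea (all weights are random stand-ins, and the projection/attention shapes are illustrative assumptions, not the paper's architecture): the projected target features are simply prepended to the draft block's own keys and values, so every draft token can attend to the target context directly.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_ctx, n_blk = 16, 3, 4   # hidden dim, context features, block tokens

# Hidden states harvested from a fixed set of target layers at prefill,
# projected to compact "target context features".
target_hiddens = rng.normal(size=(n_ctx, d))
W_proj = rng.normal(size=(d, d)) / np.sqrt(d)
ctx_feats = target_hiddens @ W_proj

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def draft_attention(x, Wq, Wk, Wv):
    """Self-attention over the draft block, with the target context
    features injected as extra Key/Value entries (KV injection)."""
    q = x @ Wq
    k = np.concatenate([ctx_feats @ Wk, x @ Wk])   # injected keys first
    v = np.concatenate([ctx_feats @ Wv, x @ Wv])
    return softmax(q @ k.T / np.sqrt(d)) @ v

block = rng.normal(size=(n_blk, d))  # noised/masked block token states
Wq, Wk, Wv = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
out = draft_attention(block, Wq, Wk, Wv)
print(out.shape)  # → (4, 16): each block token sees context + block
```

Because this injection happens in every draft layer (not just at the input), the target's guidance is not diluted as depth increases, which is what lets the draft model stay tiny.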
On Qwen3-4B/8B and LLaMA-3.1-8B, DFlash delivers roughly 4-6x end-to-end speedup and 4.9x acceptance-length improvement over EAGLE-3, even under reasoning-mode and high-concurrency (SGLang) serving conditions. Training uses random anchor sampling, exponentially decaying loss weighting to prioritize early-block tokens, and KV injection for tight target/draft alignment.
Key Ideas
- Reframe diffusion LLMs as drafters inside a speculative decoding loop, sidestepping the end-to-end quality gap of standalone diffusion LLMs by always verifying with the autoregressive target.
- KV injection of target hidden states into every draft layer (not just at the input) so contextual guidance does not get diluted with draft depth.
- Block-level parallel drafting: an entire block of masked tokens is denoised in a single forward pass, breaking the sequential drafting bottleneck.
- Random masked-block sampling and exponential early-token loss weighting to align training with speculative-decoding verification dynamics.
- Shared embedding + frozen LM head with the target, making the draft model a lightweight diffusion adapter of the target.
- Train-with-large-block / infer-with-small-block generalizes well, enabling dynamic block-size scheduling at inference.
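The exponential early-token loss weighting above can be sketched in a few lines. The decay value and normalization are illustrative assumptions (the note does not give the paper's hyperparameters); the point is that positions early in the block, which are most likely to survive verification, dominate the training loss.

```python
import numpy as np

def early_token_weights(block_size, decay=0.8):
    """Exponentially decaying per-position loss weights for a drafted
    block. `decay` is an illustrative value, not the paper's."""
    w = decay ** np.arange(block_size)
    return w / w.sum()  # normalize so weights sum to 1

w = early_token_weights(4)
print(w.round(3))  # monotonically decreasing across the block
```

In training, these weights would multiply the per-position denoising loss over the masked block, aligning the objective with the fact that verification accepts prefixes, not arbitrary subsets.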
Comments
Speculative decoding and diffusion language modeling have mostly been independent research lines; the clever move here is to use each for what it is good at (diffusion for parallelism, autoregressive for quality) rather than pitting them against each other. The “target knows best” framing is consistent with what EAGLE-style methods already exploit at the logit/hidden level, but DFlash pushes it further by making the draft model a KV-conditioned diffusion adapter.
Limitations worth flagging: the acceptance-length improvement depends on harvesting hidden features from 5 target-model layers at prefill, which costs memory that scales with block size; block-size scheduling is left as future work; and the evaluation is restricted to Qwen3 / LLaMA-3.1-8B with fairly short 2K-token generations, so behavior on very-long contexts remains open.
Connections
- Related to LLM because the whole motivation is to accelerate inference of large autoregressive language models.
- Related to Diffusion models because the draft model is a discrete block diffusion model over masked tokens.
- Related to Transformers because both target and draft are Transformer-based, and the technique is essentially a cross-model KV-injection scheme.
- Related to Test-time compute because speculative decoding is a core inference-time efficiency technique that changes the cost/quality trade-off at serving time.
Bibliography
- Jian Chen, Yesheng Liang, Zhijian Liu. 2026. "DFlash: Block Diffusion for Flash Speculative Decoding". https://arxiv.org/abs/2602.06036.