Speculative Decoding, Visualized
Large language models generate text one token at a time. Each token requires a full forward pass through billions of parameters. Speculative decoding makes this faster by using a small draft model to propose tokens, then verifying them all at once with the large target model.
Why Generation Is Slow
You can't predict token 5 without first knowing token 4. The process is inherently sequential. A 70-billion parameter model that takes 50ms per token needs 5 full seconds to generate 100 tokens, waiting for each step to complete before starting the next.
But here's the asymmetry that makes speculative decoding possible: verifying N tokens takes one forward pass, while generating N tokens takes N forward passes. A language model can score an entire sequence in parallel, computing the probability distribution at every position simultaneously. If we could guess the next several tokens correctly, we could verify all of them in a single pass rather than generating them one by one.
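To make the asymmetry concrete, here is a minimal sketch. It assumes a hypothetical `model(ids)` callable that returns one logit vector per position; generating N new tokens costs N forward passes, while scoring N already-guessed tokens costs one.

```python
import numpy as np

def generate(model, prompt_ids, n_new):
    """Standard autoregressive decoding: one forward pass per new token."""
    ids = list(prompt_ids)
    for _ in range(n_new):
        logits = model(ids)                       # full forward pass, every step
        ids.append(int(np.argmax(logits[-1])))    # greedy pick, for simplicity
    return ids

def verify(model, prompt_ids, guessed_ids):
    """Score a whole block of guessed tokens with a single forward pass."""
    logits = model(list(prompt_ids) + list(guessed_ids))   # one forward pass
    # logits[len(prompt_ids) + i - 1] is the distribution that predicts guessed_ids[i]
    return logits
```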
How It Works
The algorithm follows a draft-then-verify loop:
- Generate K draft tokens using the draft model
- Verify all K tokens in one target model forward pass
- Accept tokens until a mismatch, then resample and discard the rest
In the best case, all K tokens are accepted, plus a bonus token sampled by the target model. That's K+1 tokens from a single pass. Even when some are rejected, you're guaranteed at least one token of progress.
The draft model proposes several tokens. The target model scores them all in one forward pass, then we check each position in order. When a token is rejected, we resample a replacement and discard the remaining drafts.
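Here is a minimal sketch of one iteration of that loop, using greedy (argmax) verification to keep it short. It assumes hypothetical `target(ids)` and `draft(ids)` callables that return one logit vector per position; the acceptance rule that preserves the sampling distribution comes in the next section.

```python
import numpy as np

def speculative_step(target, draft, ids, k=5):
    """One draft-then-verify iteration (greedy variant)."""
    # 1. Draft k tokens with the cheap model, one at a time.
    draft_ids = list(ids)
    for _ in range(k):
        draft_ids.append(int(np.argmax(draft(draft_ids)[-1])))
    proposals = draft_ids[len(ids):]

    # 2. Score all k proposals with a single target forward pass.
    logits = target(draft_ids)

    # 3. Accept proposals in order until the first mismatch.
    out = list(ids)
    for i, tok in enumerate(proposals):
        target_tok = int(np.argmax(logits[len(ids) + i - 1]))
        if tok != target_tok:
            out.append(target_tok)    # replace the rejected draft, discard the rest
            return out
        out.append(tok)
    # All k accepted: the same pass also gives us a bonus (k+1)-th token.
    out.append(int(np.argmax(logits[-1])))
    return out
```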
Preserving the Distribution
Speculative decoding produces samples from the exact target distribution, not an approximation. This requires careful handling of both acceptance and rejection.
The animation above shows how this works. We compare target and draft probabilities at each token. The green region—the minimum of the two—represents guaranteed acceptance. The red excess above it is where the draft overestimated.
When we sample a token x from the draft, what's the probability we accept it? It's simply the green height divided by the total draft height: min(p_draft(x), p_target(x)) / p_draft(x), which is min(1, p_target(x) / p_draft(x)).
If the target is higher than the draft, there's no red—we always accept. If the draft overestimated, we accept proportionally less.
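Here is that picture with a toy 4-token vocabulary. The numbers are made up purely for illustration:

```python
import numpy as np

p_target = np.array([0.5, 0.2, 0.2, 0.1])    # target model's distribution
p_draft  = np.array([0.3, 0.4, 0.2, 0.1])    # draft model's distribution

green = np.minimum(p_target, p_draft)         # guaranteed-acceptance region
red   = p_draft - green                       # where the draft overestimated

accept_prob = green / p_draft                 # per token: min(1, p_target/p_draft)
print(accept_prob)    # [1.  0.5 1.  1. ]
print(green.sum())    # ~0.8: overall chance a sampled draft token is accepted
```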
When rejected, we can't simply resample from p_target: that would double-count tokens the draft already had a chance to propose. Instead, we sample from the residual distribution: the target minus the guaranteed acceptance, max(p_target(x) - p_draft(x), 0), normalized to form a valid distribution.
Accepted samples plus residual resamples combine to give exactly the target distribution. Same output, faster sampling.
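A quick check with the same toy numbers: the guaranteed-acceptance mass plus the rejection-then-residual mass adds back up to the target exactly.

```python
import numpy as np

p_target = np.array([0.5, 0.2, 0.2, 0.1])
p_draft  = np.array([0.3, 0.4, 0.2, 0.1])

accept_mass = np.minimum(p_target, p_draft)      # P(propose x and accept x)
p_reject = 1.0 - accept_mass.sum()               # P(the drafted token is rejected)
residual = np.maximum(p_target - p_draft, 0.0)   # leftover target mass
residual /= residual.sum()                       # normalized to a valid distribution

combined = accept_mass + p_reject * residual     # Path 1 plus Path 2 (see below)
print(np.allclose(combined, p_target))           # True
```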
Why does this preserve the target distribution?
Consider sampling a single token. The draft proposes token x with probability p_draft(x). What's the probability that x ends up in our final output?
Two paths lead to outputting x:
- Path 1: Draft proposes x and we accept it
- Path 2: Draft proposes some other token y, we reject it, and resample x from the residual
For Path 1: P(propose x) × P(accept x) = p_draft(x) × min(1, p_target(x) / p_draft(x))
When p_target(x) ≥ p_draft(x), the min is 1, so this equals p_draft(x). The remaining probability mass, p_target(x) - p_draft(x), comes from Path 2, which is exactly what the residual distribution provides.
When p_target(x) < p_draft(x), Path 1 contributes p_target(x) directly. The residual is zero for x, so Path 2 contributes nothing. Total: p_target(x).
In both cases, the final probability of outputting x equals p_target(x).
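If you prefer to see it empirically, here is a short Monte Carlo run of the single-token case with the same toy numbers; the output frequencies land on p_target, not p_draft.

```python
import numpy as np

rng = np.random.default_rng(0)
p_target = np.array([0.5, 0.2, 0.2, 0.1])
p_draft  = np.array([0.3, 0.4, 0.2, 0.1])
residual = np.maximum(p_target - p_draft, 0.0)
residual /= residual.sum()

counts = np.zeros(4)
for _ in range(200_000):
    x = rng.choice(4, p=p_draft)                          # draft proposes x
    if rng.random() < min(1.0, p_target[x] / p_draft[x]):
        counts[x] += 1                                    # Path 1: accept x
    else:
        counts[rng.choice(4, p=residual)] += 1            # Path 2: resample
print(counts / counts.sum())    # ~[0.5, 0.2, 0.2, 0.1], i.e. p_target
```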
When Does It Help?
Speedup depends on acceptance rate and the cost ratio between models. The technique works best when:
- The draft aligns well with the target (distilled versions, same-family smaller models)
- The size ratio is large (70B/7B gains more than 7B/1B)
- Generation is predictable (code, structured output, common phrases)
Expect a 2-3× speedup with a well-matched draft model. The worst case is slower than standard generation: you pay for both models in an iteration but gain only one token.
Calculating the speedup
Suppose:
- Target model: 50ms per forward pass
- Draft model: 5ms per forward pass
- Draft length K = 5 tokens
- Average acceptance: 3 tokens
Cost per iteration: 50ms (target) + 5×5ms (drafts) = 75ms
Tokens generated: 3 accepted + 1 resampled = 4 tokens
Effective rate: 75ms / 4 = 18.75ms per token
Compared to 50ms baseline: 2.67× speedup
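The same arithmetic in a few lines, with the numbers taken straight from the example above:

```python
t_target, t_draft, k = 50.0, 5.0, 5     # ms per forward pass, draft length
accepted = 3                            # average accepted draft tokens

cost_per_iter = t_target + k * t_draft              # 50 + 25 = 75 ms
tokens_per_iter = accepted + 1                      # 3 accepted + 1 resampled
ms_per_token = cost_per_iter / tokens_per_iter      # 18.75 ms
print(round(t_target / ms_per_token, 2))            # 2.67x vs. the 50 ms baseline
```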
The optimal K depends on the acceptance rate. Too few drafts waste the parallelism of verification; too many waste draft compute on tokens that will be rejected.
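To get a feel for that trade-off, here is a sketch that sweeps K under one simplifying assumption: each draft token is accepted independently with the same probability alpha. Real acceptance is context-dependent, so treat the output as illustrative rather than predictive.

```python
def tokens_per_iteration(alpha, k):
    """Expected tokens per iteration if each draft is accepted with prob. alpha."""
    # i drafts accepted, then a rejection -> i + 1 tokens (the +1 is the resample)
    expected = sum((alpha ** i) * (1 - alpha) * (i + 1) for i in range(k))
    # all k drafts accepted -> k + 1 tokens (the +1 is the bonus token)
    return expected + (alpha ** k) * (k + 1)

def speedup(alpha, k, t_target=50.0, t_draft=5.0):
    cost = t_target + k * t_draft                       # one verify + k draft passes
    return t_target / (cost / tokens_per_iteration(alpha, k))

for k in (2, 4, 6, 8, 12):
    print(k, round(speedup(alpha=0.8, k=k), 2))         # peaks around k = 6 here
```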
Summary
Generating tokens is slow. Verifying them is fast. A small draft model guesses, a large target model checks in parallel. 2-3× speedup, identical output.