Speculative Decoding, Visualized
Large language models generate text one token at a time. Each token requires a full forward pass through billions of parameters. Speculative decoding makes this faster by using a small draft model to propose tokens, then verifying them all at once with the large target model.
Why Generation Is Slow
You can't predict token 5 without first knowing token 4. The process is inherently sequential. A 70-billion parameter model that takes 50ms per token needs 5 full seconds to generate 100 tokens, waiting for each step to complete before starting the next.
But here's the asymmetry that makes speculative decoding possible: verifying N tokens takes one forward pass, while generating N tokens takes N forward passes. A language model can score an entire sequence in parallel, computing the probability distribution at every position simultaneously. If we could guess the next several tokens correctly, we could verify all of them in a single pass rather than generating them one by one.
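To make the asymmetry concrete, here is a minimal sketch. It assumes a hypothetical `model(ids)` callable that returns one logit vector per position; generating N new tokens costs N forward passes, while scoring N already-guessed tokens costs one.

```python
import numpy as np

def generate(model, prompt_ids, n_new):
    """Standard autoregressive decoding: one forward pass per new token."""
    ids = list(prompt_ids)
    for _ in range(n_new):
        logits = model(ids)                       # full forward pass, every step
        ids.append(int(np.argmax(logits[-1])))    # greedy pick, for simplicity
    return ids

def verify(model, prompt_ids, guessed_ids):
    """Score a whole block of guessed tokens with a single forward pass."""
    logits = model(list(prompt_ids) + list(guessed_ids))   # one forward pass
    # logits[len(prompt_ids) + i - 1] is the distribution that predicts guessed_ids[i]
    return logits
```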
How It Works
The algorithm follows a draft-then-verify loop:
- Generate K draft tokens using the draft model
- Verify all K tokens in one target model forward pass
- Accept tokens until a mismatch, then resample and discard the rest
In the best case, all K tokens are accepted, plus a bonus token sampled by the target model. That's K+1 tokens from a single pass. Even when some are rejected, you're guaranteed at least one token of progress.
The draft model proposes several tokens. The target model scores them all in one forward pass, then we check each position in order. When a token is rejected, we resample a replacement and discard the remaining drafts.
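Here is a minimal sketch of one iteration of that loop, using greedy (argmax) verification to keep it short. It assumes hypothetical `target(ids)` and `draft(ids)` callables that return one logit vector per position; the acceptance rule that preserves the sampling distribution comes in the next section.

```python
import numpy as np

def speculative_step(target, draft, ids, k=5):
    """One draft-then-verify iteration (greedy variant)."""
    # 1. Draft k tokens with the cheap model, one at a time.
    draft_ids = list(ids)
    for _ in range(k):
        draft_ids.append(int(np.argmax(draft(draft_ids)[-1])))
    proposals = draft_ids[len(ids):]

    # 2. Score all k proposals with a single target forward pass.
    logits = target(draft_ids)

    # 3. Accept proposals in order until the first mismatch.
    out = list(ids)
    for i, tok in enumerate(proposals):
        target_tok = int(np.argmax(logits[len(ids) + i - 1]))
        if tok != target_tok:
            out.append(target_tok)    # replace the rejected draft, discard the rest
            return out
        out.append(tok)
    # All k accepted: the same pass also gives us a bonus (k+1)-th token.
    out.append(int(np.argmax(logits[-1])))
    return out
```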
Preserving the Distribution
Speculative decoding produces samples from the exact target distribution, not an approximation. This requires careful handling of both acceptance and rejection.
The animation above shows how this works. We compare target and draft probabilities at each token. The green region—the minimum of the two—represents guaranteed acceptance. The red excess above it is where the draft overestimated.
When we sample a token x from the draft, what's the probability we accept it? It's simply the green height divided by the total draft height: min(p_draft(x), p_target(x)) / p_draft(x), which is min(1, p_target(x) / p_draft(x)).
If the target is higher than the draft, there's no red—we always accept. If the draft overestimated, we accept proportionally less.
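Here is that picture with a toy 4-token vocabulary. The numbers are made up purely for illustration:

```python
import numpy as np

p_target = np.array([0.5, 0.2, 0.2, 0.1])    # target model's distribution
p_draft  = np.array([0.3, 0.4, 0.2, 0.1])    # draft model's distribution

green = np.minimum(p_target, p_draft)         # guaranteed-acceptance region
red   = p_draft - green                       # where the draft overestimated

accept_prob = green / p_draft                 # per token: min(1, p_target/p_draft)
print(accept_prob)    # [1.  0.5 1.  1. ]
print(green.sum())    # ~0.8: overall chance a sampled draft token is accepted
```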
When rejected, we can't simply resample from p_target: that would double-count tokens the draft already had a chance to propose. Instead, we sample from the residual distribution: the target minus the guaranteed acceptance, max(p_target(x) - p_draft(x), 0), normalized to form a valid distribution.
Accepted samples plus residual resamples combine to give exactly the target distribution. Same output, faster sampling.
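A quick check with the same toy numbers: the guaranteed-acceptance mass plus the rejection-then-residual mass adds back up to the target exactly.

```python
import numpy as np

p_target = np.array([0.5, 0.2, 0.2, 0.1])
p_draft  = np.array([0.3, 0.4, 0.2, 0.1])

accept_mass = np.minimum(p_target, p_draft)      # P(propose x and accept x)
p_reject = 1.0 - accept_mass.sum()               # P(the drafted token is rejected)
residual = np.maximum(p_target - p_draft, 0.0)   # leftover target mass
residual /= residual.sum()                       # normalized to a valid distribution

combined = accept_mass + p_reject * residual     # Path 1 plus Path 2 (see below)
print(np.allclose(combined, p_target))           # True
```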
Why does this preserve the target distribution?
Consider sampling a single token. The draft proposes token x with probability p_draft(x). What's the probability that x ends up in our final output?
Two paths lead to outputting x:
- Path 1: Draft proposes x and we accept it
- Path 2: Draft proposes some other token y, we reject it, and resample x from the residual
For Path 1: P(propose x) × P(accept x) = p_draft(x) × min(1, p_target(x) / p_draft(x))
When p_target(x) ≥ p_draft(x), the min is 1, so this equals p_draft(x). The remaining probability mass, p_target(x) - p_draft(x), comes from Path 2, which is exactly what the residual distribution provides.
When p_target(x) < p_draft(x), Path 1 contributes p_target(x) directly. The residual is zero for x, so Path 2 contributes nothing. Total: p_target(x).
In both cases, the final probability of outputting x equals p_target(x).
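If you prefer to see it empirically, here is a short Monte Carlo run of the single-token case with the same toy numbers; the output frequencies land on p_target, not p_draft.

```python
import numpy as np

rng = np.random.default_rng(0)
p_target = np.array([0.5, 0.2, 0.2, 0.1])
p_draft  = np.array([0.3, 0.4, 0.2, 0.1])
residual = np.maximum(p_target - p_draft, 0.0)
residual /= residual.sum()

counts = np.zeros(4)
for _ in range(200_000):
    x = rng.choice(4, p=p_draft)                          # draft proposes x
    if rng.random() < min(1.0, p_target[x] / p_draft[x]):
        counts[x] += 1                                    # Path 1: accept x
    else:
        counts[rng.choice(4, p=residual)] += 1            # Path 2: resample
print(counts / counts.sum())    # ~[0.5, 0.2, 0.2, 0.1], i.e. p_target
```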
When Does It Help?
Speedup depends on acceptance rate and the cost ratio between models. The technique works best when:
- The draft aligns well with the target (distilled versions, same-family smaller models)
- The size ratio is large (70B/7B gains more than 7B/1B)
- Generation is predictable (code, structured output, common phrases)
Expect a 2-3× speedup with a well-matched draft model. The worst case is slower than standard generation: you pay for both models in an iteration but gain only one token.
Calculating the speedup
Suppose:
- Target model: 50ms per forward pass
- Draft model: 5ms per forward pass
- Draft length K = 5 tokens
- Average acceptance: 3 tokens
Cost per iteration: 50ms (target) + 5×5ms (drafts) = 75ms
Tokens generated: 3 accepted + 1 resampled = 4 tokens
Effective rate: 75ms / 4 = 18.75ms per token
Compared to 50ms baseline: 2.67× speedup
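The same arithmetic in a few lines, with the numbers taken straight from the example above:

```python
t_target, t_draft, k = 50.0, 5.0, 5     # ms per forward pass, draft length
accepted = 3                            # average accepted draft tokens

cost_per_iter = t_target + k * t_draft              # 50 + 25 = 75 ms
tokens_per_iter = accepted + 1                      # 3 accepted + 1 resampled
ms_per_token = cost_per_iter / tokens_per_iter      # 18.75 ms
print(round(t_target / ms_per_token, 2))            # 2.67x vs. the 50 ms baseline
```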
The optimal K depends on the acceptance rate. Too few drafts waste the parallelism of verification; too many waste draft compute on tokens that will be rejected.
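To get a feel for that trade-off, here is a sketch that sweeps K under one simplifying assumption: each draft token is accepted independently with the same probability alpha. Real acceptance is context-dependent, so treat the output as illustrative rather than predictive.

```python
def tokens_per_iteration(alpha, k):
    """Expected tokens per iteration if each draft is accepted with prob. alpha."""
    # i drafts accepted, then a rejection -> i + 1 tokens (the +1 is the resample)
    expected = sum((alpha ** i) * (1 - alpha) * (i + 1) for i in range(k))
    # all k drafts accepted -> k + 1 tokens (the +1 is the bonus token)
    return expected + (alpha ** k) * (k + 1)

def speedup(alpha, k, t_target=50.0, t_draft=5.0):
    cost = t_target + k * t_draft                       # one verify + k draft passes
    return t_target / (cost / tokens_per_iteration(alpha, k))

for k in (2, 4, 6, 8, 12):
    print(k, round(speedup(alpha=0.8, k=k), 2))         # peaks around k = 6 here
```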
Summary
Generating tokens is slow. Verifying them is fast. A small draft model guesses, a large target model checks in parallel. 2-3× speedup, identical output.