Speculative Decoding Explained: How AI Generates Text Faster

The inference optimization technique powering faster LLMs

Why generating text is slow

Large language models generate text autoregressively: they produce one token, append it to the input, and run the model again to produce the next token. Each step requires a forward pass through billions of parameters (caching earlier computation avoids reprocessing old tokens, but every weight is still used), and the passes happen strictly in sequence because each token depends on the ones before it. On modern hardware, small-batch generation is memory-bandwidth bound: most of the time is spent moving the model’s weights, not doing arithmetic, so a long answer means many slow, serial passes. Speculative decoding breaks the one-pass-per-token rule without changing what the model says.
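A rough back-of-envelope calculation shows why weight movement dominates. The sketch below is illustrative only: the model size, precision, and bandwidth figures are assumed round numbers, not measurements of any particular system, and the helper name is hypothetical. It also ignores cache reads and batching.

```python
def min_seconds_per_pass(n_params: float, bytes_per_param: float,
                         bandwidth_bytes_per_s: float) -> float:
    """Lower bound on one forward pass: every weight must be read once."""
    return n_params * bytes_per_param / bandwidth_bytes_per_s

# Assumed figures: 70 billion parameters, 2 bytes each, 2 TB/s memory bandwidth.
latency = min_seconds_per_pass(70e9, 2, 2e12)
tokens_per_second_ceiling = 1.0 / latency

print(f"at least {latency * 1000:.0f} ms per pass")
print(f"at most {tokens_per_second_ceiling:.1f} tokens per second")
```

Under these assumptions the floor is set by reading the weights, whether the pass processes one token or a handful. That is the slack speculative decoding exploits.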

The core idea: draft, then verify

Speculative decoding pairs two models: a small, fast draft model and the large, accurate target model you actually want output from. The process repeats in rounds:

  1. Draft. The small model quickly proposes the next several tokens — say four or five — guessing where the text is going.
  2. Verify. The large model processes all of those proposed tokens in a single parallel forward pass, scoring each one against its own probabilities.
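  3. Accept or correct. Every leading token the large model agrees with is accepted in bulk. At the first disagreement, the draft is truncated there and the large model supplies the correct token itself. If every drafted token is accepted, the same pass also yields one extra token from the large model at no additional cost.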

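Because the model’s weights have to be read from memory once per pass regardless of how many tokens are scored, verifying several tokens at once costs roughly the same as generating one. Every round can therefore advance the text by several tokens for the price of a single expensive pass plus a few cheap draft passes.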
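The three steps above can be sketched with toy models. This is a minimal illustration under stated assumptions, not a production implementation: it uses greedy decoding (each model deterministically picks one next token), the names `speculative_greedy`, `plain_greedy`, `toy_target`, and `toy_draft` are hypothetical, and the single parallel verification pass is simulated with ordinary function calls that are counted as one pass.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]  # hypothetical greedy next-token function


def plain_greedy(target: NextToken, prompt: List[Token], n_new: int) -> List[Token]:
    """Baseline: one target pass per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(target(out))
    return out


def speculative_greedy(target: NextToken, draft: NextToken, prompt: List[Token],
                       n_new: int, k: int = 4) -> Tuple[List[Token], int]:
    """Return (tokens, number of target passes) using draft-then-verify rounds."""
    out = list(prompt)
    passes = 0
    while len(out) - len(prompt) < n_new:
        # 1. Draft: the small model proposes k tokens, one after another.
        proposal: List[Token] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))
        # 2. Verify: a real target scores all k+1 positions in one batched pass.
        #    Here that pass is simulated with k+1 calls and counted once.
        passes += 1
        target_choice = [target(out + proposal[:i]) for i in range(k + 1)]
        # 3. Accept the longest agreeing prefix, then append the target's own
        #    token: a correction after a mismatch, or a bonus token otherwise.
        n_ok = 0
        while n_ok < k and proposal[n_ok] == target_choice[n_ok]:
            n_ok += 1
        out.extend(proposal[:n_ok])
        out.append(target_choice[n_ok])
    return out[:len(prompt) + n_new], passes


def toy_target(ctx: List[Token]) -> Token:
    """Toy 'large model': counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    """Toy 'small model': agrees with the target except right after a 6."""
    return 0 if ctx[-1] == 6 else (ctx[-1] + 1) % 10


tokens, passes = speculative_greedy(toy_target, toy_draft, prompt=[0], n_new=12)
baseline = plain_greedy(toy_target, prompt=[0], n_new=12)
print(tokens == baseline, passes)
```

In this toy run the speculative output matches the baseline token for token, while the target is invoked far fewer than 12 times. With real models the saving depends on how often the draft agrees with the target.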

Why the output stays identical

The crucial property is that speculative decoding is lossless. The acceptance rule compares the two models’ probabilities for each drafted token: if the draft model assigned the token probability q and the target model assigns it probability p, the token is kept with probability min(1, p/q). On a rejection, the replacement is sampled from the target’s distribution with the draft’s probability mass subtracted out and the remainder renormalised; with greedy decoding this reduces to taking the target’s own top token. This scheme (sometimes called speculative sampling) guarantees the final sequence is drawn from the same distribution as standard decoding. You are not approximating the big model; you are running it more efficiently. That is what makes the technique safe to ship in production, unlike quality-trading shortcuts such as aggressive quantisation.
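The losslessness claim can be checked on a tiny vocabulary. The sketch below assumes the standard speculative sampling rule: accept a drafted token x with probability min(1, p[x]/q[x]), and on rejection resample from the normalised positive part of p - q. Here p is the target distribution and q the draft distribution. All function names and the example probabilities are hypothetical. Exact fractions are used so the equality is not blurred by rounding.

```python
import random
from fractions import Fraction
from typing import List

Dist = List[Fraction]


def accept_prob(p: Dist, q: Dist, x: int) -> Fraction:
    """Probability of keeping drafted token x (requires q[x] > 0)."""
    return min(Fraction(1), p[x] / q[x])


def residual(p: Dist, q: Dist) -> Dist:
    """Normalised max(0, p - q): the distribution sampled after a rejection."""
    diff = [max(Fraction(0), pi - qi) for pi, qi in zip(p, q)]
    total = sum(diff)
    if total == 0:  # p == q, so a rejection can never happen
        return list(p)
    return [d / total for d in diff]


def speculative_sample(p: Dist, q: Dist, rng: random.Random) -> int:
    """Draw one token: draft from q, then accept or resample."""
    x = rng.choices(range(len(q)), weights=[float(w) for w in q])[0]
    if rng.random() < accept_prob(p, q, x):
        return x
    r = residual(p, q)
    return rng.choices(range(len(r)), weights=[float(w) for w in r])[0]


def output_distribution(p: Dist, q: Dist) -> Dist:
    """Exact distribution of the token returned by speculative_sample."""
    accepted = [q[x] * accept_prob(p, q, x) if q[x] > 0 else Fraction(0)
                for x in range(len(p))]
    p_reject = 1 - sum(accepted)
    r = residual(p, q)
    return [a + p_reject * ri for a, ri in zip(accepted, r)]


# Illustrative distributions over a three-token vocabulary.
p = [Fraction(1, 2), Fraction(3, 10), Fraction(1, 5)]   # target
q = [Fraction(1, 5), Fraction(1, 2), Fraction(3, 10)]   # draft

print(output_distribution(p, q) == p)
```

Even though the draft distribution here is quite different from the target, the combined accept-or-resample step reproduces the target distribution exactly. A poor draft costs speed through more rejections, not quality.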

Speedups, tradeoffs, and where it is used

In practice, speculative decoding is commonly reported to deliver roughly 2-3x faster generation, occasionally up to 4x. The gain depends mainly on the acceptance rate (how often the draft model agrees with the target) and on how cheap the draft model is relative to the target. Predictable, boilerplate-heavy text (code, structured output, common phrasing) yields high acceptance and big speedups; surprising or highly technical text rejects more guesses and gains less. The tradeoffs are modest extra memory to host the draft model and some engineering to tune the draft length: drafting too few tokens leaves speed unused, while drafting too many wastes work on tokens that get rejected.
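The relationship between acceptance rate and speedup can be estimated with a simplified model. The sketch below assumes each drafted token is accepted independently with the same probability `alpha`, that a round drafts `k` tokens, and that one draft pass costs `cost_ratio` times one target pass. These are modelling assumptions, the function names are hypothetical, and real acceptance varies token by token.

```python
def expected_tokens_per_round(alpha: float, k: int) -> float:
    """Expected tokens gained per round, counting the target's own final token.

    Sums the geometric series 1 + alpha + alpha**2 + ... + alpha**k.
    """
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, k: int, cost_ratio: float) -> float:
    """Tokens per round divided by round cost in units of one target pass."""
    return expected_tokens_per_round(alpha, k) / (k * cost_ratio + 1.0)


# Illustrative settings: draft 4 tokens, draft pass costs 5% of a target pass.
for alpha in (0.3, 0.6, 0.8, 0.9):
    print(alpha, round(estimated_speedup(alpha, k=4, cost_ratio=0.05), 2))
```

Under these assumptions, a high acceptance rate lands in the range quoted above, while a low one barely beats ordinary decoding. This is why the same setup can feel fast on boilerplate and slow on unpredictable text.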

The technique now underpins fast inference across the industry, with variants such as Medusa (predicting multiple tokens from extra model heads) and EAGLE (a lightweight learned drafter) pushing acceptance rates higher. Self-speculative methods even use early layers of the same model as the drafter, avoiding a separate model entirely. Whenever a chat model feels snappy despite its size, speculative decoding is often part of the reason.
