Question 1

What is speculative decoding?

Accepted Answer

Speculative decoding speeds up large language model generation by using a small, fast draft model to guess several upcoming tokens, then having the large target model verify all of them in a single parallel pass. Tokens the target model agrees with are accepted at once, so multiple tokens can be produced for roughly the cost of one large-model step.
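
As a minimal sketch of the draft-then-verify loop, the Python below uses two toy "models" that map a token prefix to a single greedy next token. The names `target_next`, `draft_next`, `speculative_step`, and `generate` are hypothetical and chosen for illustration only. A real system would compute all verification positions in one batched forward pass of the target model rather than in a Python loop.

```python
from typing import Callable, List

# Hypothetical interface: a "model" maps a token prefix to its greedy next token.
NextToken = Callable[[List[int]], int]

def target_next(prefix: List[int]) -> int:
    # Toy stand-in for the large model: count upward modulo 10.
    return (prefix[-1] + 1) % 10

def draft_next(prefix: List[int]) -> int:
    # Toy stand-in for the draft model: agrees with the target except after token 3.
    return 0 if prefix[-1] == 3 else (prefix[-1] + 1) % 10

def speculative_step(prefix: List[int], draft: NextToken, target: NextToken, k: int) -> List[int]:
    # 1) The draft model proposes k tokens one after another.
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))
    # 2) The target model scores all k + 1 positions.
    #    (Conceptually one parallel pass; written as a loop here for clarity.)
    verified = [target(prefix + proposed[:i]) for i in range(k + 1)]
    # 3) Accept the longest run of proposals the target agrees with.
    accepted: List[int] = []
    for i in range(k):
        if proposed[i] != verified[i]:
            break
        accepted.append(proposed[i])
    # 4) Always append the target's own token at the next position:
    #    a correction after a mismatch, or a bonus token if all k were accepted.
    accepted.append(verified[len(accepted)])
    return accepted

def generate(prefix: List[int], draft: NextToken, target: NextToken, k: int, n_new: int) -> List[int]:
    tokens = list(prefix)
    while len(tokens) - len(prefix) < n_new:
        tokens += speculative_step(tokens, draft, target, k)
    return tokens[: len(prefix) + n_new]

def greedy(prefix: List[int], target: NextToken, n_new: int) -> List[int]:
    # Baseline: the target model alone, one token per step.
    tokens = list(prefix)
    for _ in range(n_new):
        tokens.append(target(tokens))
    return tokens
```

In this toy setup, `generate([0], draft_next, target_next, 4, 12)` returns the same tokens as `greedy([0], target_next, 12)`, while needing fewer verification steps than the baseline needs target calls.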

Question 2

Does speculative decoding change the output?

Accepted Answer

No. When implemented correctly, the verification step guarantees that the final output follows exactly the distribution the large model would have produced on its own; with greedy decoding the tokens themselves are identical. The draft model only proposes candidates; the target model accepts or rejects them, and supplies its own token at the first rejection, so the resulting distribution is exactly the target model's. It is a pure speedup, not a quality trade-off.
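
To see why the distribution is preserved when sampling, here is a small sketch of the accept/resample rule commonly described for speculative sampling. It is an assumption that a given implementation uses exactly this rule. A drafted token x is accepted with probability min(1, p(x)/q(x)), where p is the target distribution and q is the draft distribution; on rejection, a replacement is drawn from the normalized residual max(0, p - q). The function name `speculative_output_distribution` is hypothetical. It computes the emitted-token distribution exactly, with no random sampling.

```python
from typing import List

def speculative_output_distribution(p: List[float], q: List[float]) -> List[float]:
    """Exact distribution of one emitted token under the accept/resample rule.

    p: target model's next-token probabilities
    q: draft model's next-token probabilities (same vocabulary)
    """
    n = len(p)
    # P(draft proposes x and it is accepted) = q(x) * min(1, p(x)/q(x)) = min(p(x), q(x)).
    accepted = [min(p[x], q[x]) for x in range(n)]
    reject_prob = 1.0 - sum(accepted)
    # On rejection, resample from the residual distribution proportional to max(0, p - q).
    residual = [max(0.0, p[x] - q[x]) for x in range(n)]
    z = sum(residual)
    if z == 0.0:
        # Draft and target agree everywhere, so nothing is ever rejected.
        return accepted
    return [accepted[x] + reject_prob * residual[x] / z for x in range(n)]

p = [0.5, 0.3, 0.2]   # illustrative target distribution (made-up numbers)
q = [0.2, 0.6, 0.2]   # illustrative draft distribution (made-up numbers)
out = speculative_output_distribution(p, q)
```

Because the rejection probability equals the total residual mass, the two terms add up to min(p, q) + max(0, p - q) = p for every token, so `out` matches `p` up to floating-point rounding regardless of how poor the draft distribution is.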

Question 3

What is the acceptance rate?

Accepted Answer

The acceptance rate is the fraction of the draft model's proposed tokens that the target model accepts. A higher acceptance rate means more tokens are produced per verification step, so the speedup is larger. It depends on how well the draft model approximates the target model and on how predictable the text is.
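
The link between acceptance rate and tokens per step can be made concrete with a simplified model. Assume, as an idealization, that each drafted token is accepted independently with the same probability `alpha`, and that `k` tokens are drafted per step. The expected number of tokens emitted per verification step, counting the target's own correction or bonus token, is then (1 - alpha^(k+1)) / (1 - alpha). The function names below are hypothetical.

```python
def acceptance_rate(accepted: int, proposed: int) -> float:
    # Measured acceptance rate: accepted draft tokens divided by proposed draft tokens.
    return accepted / proposed

def expected_tokens_per_step(alpha: float, k: int) -> float:
    # Expected tokens per verification step, assuming each drafted token is
    # accepted independently with probability alpha (a simplifying assumption).
    if alpha == 1.0:
        return float(k + 1)  # every draft token accepted, plus the bonus token
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)

low = expected_tokens_per_step(0.5, 2)   # 1 + 0.5 + 0.25
high = expected_tokens_per_step(0.9, 2)  # 1 + 0.9 + 0.81
```

Even with an acceptance rate of zero the step still yields one token, because the target model's own token is always emitted.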

Question 4

What makes a good draft model?

Accepted Answer

A good draft model is much faster than the target model yet agrees with it often. It is usually a smaller version from the same family, a distilled model, or even a few extra prediction heads on the target itself. The art is balancing speed against agreement so accepted tokens outpace the overhead of running two models.
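
The speed-versus-agreement trade-off can be sketched with a rough cost model. Assume one draft step costs a fraction `c` of one target step, the target verifies `k` drafted tokens for the cost of a single step, and each drafted token is accepted independently with probability `alpha`. Under those assumptions, the estimated speedup is the expected tokens per step divided by the step cost `k * c + 1`. The function names are hypothetical and the numbers are illustrative, not measurements.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    # Same idealized model as above: independent acceptance with probability alpha.
    if alpha == 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)

def estimated_speedup(alpha: float, k: int, c: float) -> float:
    # Cost of one speculative step, in units of one target step:
    # k draft steps at relative cost c each, plus one target verification pass.
    step_cost = k * c + 1.0
    return expected_tokens_per_step(alpha, k) / step_cost

def best_draft_length(alpha: float, c: float, max_k: int = 16) -> int:
    # Draft length in 1..max_k that maximizes the estimated speedup.
    return max(range(1, max_k + 1), key=lambda k: estimated_speedup(alpha, k, c))

fast_and_accurate = estimated_speedup(alpha=0.8, k=4, c=0.05)
slow_draft = estimated_speedup(alpha=0.5, k=4, c=1.0)
```

In this model a cheap draft with high agreement gives an estimated speedup well above 1, while a draft that costs as much as the target and agrees only half the time comes out below 1, meaning speculation would slow generation down.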

What Is Speculative Decoding? Faster LLM Inference With a Draft Model

The problem it solves

The draft-then-verify algorithm

Why the output is identical

Acceptance rate, the key metric

Choosing a draft model and the gains