The problem it solves
Large language models generate text one token at a time, and each token requires a full forward pass through a very large network. This sequential bottleneck makes generation slow and expensive: you cannot produce the tenth token until the ninth is done, and every step pays the cost of the whole model. Speculative decoding eases this bottleneck by letting the system emit several tokens per expensive target-model step, without changing the distribution of the text it ultimately generates.
The draft-then-verify algorithm
The technique uses two models: a small, fast draft model and the large target model whose output you actually want. The draft model quickly proposes a short run of candidate tokens, guessing what comes next. The target model then processes all of those candidates in a single parallel forward pass, checking each one. In the simplest (greedy) case, tokens that match the target model's own choice are accepted immediately; at the first disagreement, the target model's token is used instead and the remaining candidates are discarded. If every candidate is accepted, the same pass also yields one extra token from the target model. The draft model then resumes from the new end of the sequence. Because the target model verifies many tokens at once instead of generating them one by one, several tokens can emerge from roughly the cost of a single target-model step.
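The loop described above can be sketched in a few lines of Python. This is a minimal illustration of the greedy case only, not a production implementation: the "models" are hypothetical stand-in functions that map a token prefix to a single next token, and names such as `speculative_step`, `generate`, and `k` are chosen here for illustration.

```python
from typing import Callable, List

# Hypothetical stand-in for a model: maps a token prefix to its greedy next token.
NextToken = Callable[[List[int]], int]


def speculative_step(prefix: List[int], draft: NextToken,
                     target: NextToken, k: int) -> List[int]:
    """Run one draft-then-verify round and return the tokens to append."""
    # 1. The draft model proposes k tokens, one after another (cheap).
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))

    # 2. The target model scores all k + 1 positions. A real system does this
    #    in one batched forward pass; the list comprehension only simulates it.
    target_choice = [target(prefix + proposed[:i]) for i in range(k + 1)]

    # 3. Keep the longest run of proposals that match the target's choices.
    accepted: List[int] = []
    for i in range(k):
        if proposed[i] != target_choice[i]:
            break
        accepted.append(proposed[i])

    # 4. Append the target's own token: the correction at the first mismatch,
    #    or an extra token if every proposal was accepted.
    accepted.append(target_choice[len(accepted)])
    return accepted


def generate(prompt: List[int], draft: NextToken, target: NextToken,
             k: int, max_new_tokens: int) -> List[int]:
    """Repeat speculative rounds until max_new_tokens have been produced."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new_tokens:
        out.extend(speculative_step(out, draft, target, k))
    return out[: len(prompt) + max_new_tokens]


# Toy models for demonstration: the draft disagrees with the target
# only when the last token is 5.
def toy_target(prefix: List[int]) -> int:
    return (prefix[-1] * 3 + 1) % 7


def toy_draft(prefix: List[int]) -> int:
    return 0 if prefix[-1] == 5 else toy_target(prefix)
```

Every round appends at least one token, because the target model's own token is always added, so the loop makes progress even when the draft model is wrong on its first guess.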
Why the output is identical
A natural worry is that letting a weaker model propose tokens would degrade quality. Speculative decoding avoids this through a careful acceptance rule. Let p(x) and q(x) be the probabilities that the target and draft models assign to a candidate token x. The candidate is accepted with probability min(1, p(x) / q(x)); if it is rejected, it is replaced by a sample from a corrected distribution proportional to max(0, p(x) - q(x)). With greedy decoding, this reduces to checking whether the candidate equals the target model's top choice. The mathematics guarantees that the resulting tokens follow exactly the target model's distribution. The draft model only influences speed, never the final result, so the output is statistically indistinguishable from running the target model alone.
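The acceptance rule can be illustrated for a single token position with a short Python sketch. It assumes the standard rule for speculative sampling: accept a proposed token x with probability min(1, p(x) / q(x)), otherwise resample from the normalised residual max(0, p - q). The function names and the example distributions are hypothetical and chosen only for illustration.

```python
import random
from typing import List, Sequence


def accept_or_resample(token: int, p_target: Sequence[float],
                       q_draft: Sequence[float], rng: random.Random) -> int:
    """Return the token to emit, given one token proposed by the draft model."""
    # Accept with probability min(1, p(x) / q(x)). q_draft[token] is positive
    # because the token was sampled from q_draft in the first place.
    if rng.random() < min(1.0, p_target[token] / q_draft[token]):
        return token
    # On rejection, resample from the residual max(0, p - q). random.choices
    # normalises the weights, and the residual has positive mass whenever a
    # rejection can occur (that is, whenever p and q differ).
    residual = [max(0.0, p - q) for p, q in zip(p_target, q_draft)]
    return rng.choices(range(len(residual)), weights=residual, k=1)[0]


def empirical_output(p_target: Sequence[float], q_draft: Sequence[float],
                     n: int, seed: int = 0) -> List[float]:
    """Estimate the distribution of emitted tokens over n independent trials."""
    rng = random.Random(seed)
    counts = [0] * len(p_target)
    for _ in range(n):
        # The draft proposes a token by sampling from its own distribution.
        proposed = rng.choices(range(len(q_draft)), weights=q_draft, k=1)[0]
        counts[accept_or_resample(proposed, p_target, q_draft, rng)] += 1
    return [c / n for c in counts]


# Example distributions over a three-token vocabulary (illustrative values).
p = [0.6, 0.3, 0.1]   # target model
q = [0.2, 0.5, 0.3]   # draft model, deliberately a poor match
frequencies = empirical_output(p, q, n=50_000)
```

Even though the draft distribution here is a poor match for the target, the observed frequencies should approach `p` as `n` grows. The mismatch shows up only as a lower acceptance rate, not as a change in the output distribution.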
Acceptance rate, the key metric
The amount of speedup hinges on the acceptance rate: the fraction of proposed tokens the target model accepts. If the draft model is a good approximation and the text is predictable, acceptance is high and many tokens clear per step, giving large gains. If the draft model guesses poorly, most proposals are rejected, the system falls back toward one token per step, and the overhead of running two models can even slow things down. Tuning the number of tokens proposed per round trades off the benefit of accepting many at once against wasted work on rejected guesses.
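A rough back-of-the-envelope model makes this trade-off concrete. The sketch below rests on simplifying assumptions that are not part of any particular implementation: each proposed token is accepted independently with the same probability `alpha`, verifying `k` proposals costs the same as one ordinary target-model step, and one draft step costs a fixed fraction `draft_cost` of a target step. All names and parameters are hypothetical.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per verification pass.

    Counts accepted proposals plus the one token the target model always
    contributes: 1 + alpha + alpha**2 + ... + alpha**k.
    """
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, k: int, draft_cost: float) -> float:
    """Estimated speedup over plain decoding, under the assumptions above.

    One round costs k draft steps plus one target pass, measured in units
    of a single target-model step.
    """
    round_cost = k * draft_cost + 1.0
    return expected_tokens_per_step(alpha, k) / round_cost


def best_k(alpha: float, draft_cost: float, max_k: int = 16) -> int:
    """Number of proposed tokens per round that maximises the estimate."""
    return max(range(1, max_k + 1),
               key=lambda k: estimated_speedup(alpha, k, draft_cost))


# A well-matched, cheap draft model versus a poorly matched, costly one.
good = estimated_speedup(alpha=0.8, k=4, draft_cost=0.05)
poor = estimated_speedup(alpha=0.1, k=4, draft_cost=0.5)
```

Under this model, `good` comes out above 1 and `poor` below 1, matching the observation that a badly matched draft model can make generation slower. The best number of proposals per round also grows with the acceptance probability, so `k` is worth tuning per workload rather than fixing once.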
Choosing a draft model and the gains
A good draft model is dramatically faster than the target yet frequently agrees with it. Common choices include a smaller model from the same family, a distilled version of the target, or self-speculative methods that add lightweight prediction heads to the target model itself. In practice, well-matched setups are commonly reported to achieve roughly two to four times faster generation with no change in output quality. This makes speculative decoding one of the most widely adopted inference optimisations, especially for latency-sensitive, interactive applications where every millisecond of response time matters.