Question 1

What is speculative decoding in simple terms?

Accepted Answer

A small, fast draft model guesses several upcoming tokens, then the large target model checks all of those guesses in a single forward pass. Guesses are accepted up to the first one the target model disagrees with, and the rest are discarded, so the expensive model runs far fewer times. It is like a junior writer drafting a sentence that a senior editor approves at a glance.
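
The draft-then-verify loop can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a production implementation: it uses greedy decoding, and `target_next` and `draft_next` are hypothetical toy stand-ins for the two models. A real system would get all of the target model's predictions from one batched forward pass; the sketch loops over them for readability.

```python
from typing import Callable, List, Tuple

# A "model" here is just a function: token prefix -> next token (greedy decoding).
Model = Callable[[List[int]], int]


def target_next(prefix: List[int]) -> int:
    """Hypothetical stand-in for the large model: counts upward modulo 10."""
    return (prefix[-1] + 1) % 10


def draft_next(prefix: List[int]) -> int:
    """Hypothetical stand-in for the small model: agrees with the target
    everywhere except after a 4, where it guesses wrong."""
    last = prefix[-1]
    return 0 if last == 4 else (last + 1) % 10


def speculative_step(prefix: List[int], draft: Model, target: Model, k: int) -> List[int]:
    """Draft k tokens, then keep them up to the first disagreement."""
    # 1. The draft model guesses k tokens one after another.
    guesses: List[int] = []
    ctx = list(prefix)
    for _ in range(k):
        token = draft(ctx)
        guesses.append(token)
        ctx.append(token)

    # 2. The target model checks every guess (conceptually one forward pass).
    accepted: List[int] = []
    ctx = list(prefix)
    for guess in guesses:
        expected = target(ctx)
        if guess != expected:
            accepted.append(expected)  # replace the wrong guess and stop
            return accepted
        accepted.append(guess)
        ctx.append(guess)

    # All guesses matched: the same pass also yields one extra token.
    accepted.append(target(ctx))
    return accepted


def generate(prefix: List[int], draft: Model, target: Model,
             k: int, n_tokens: int) -> Tuple[List[int], int]:
    """Return the generated sequence and the number of verification passes."""
    out = list(prefix)
    passes = 0
    while len(out) - len(prefix) < n_tokens:
        out.extend(speculative_step(out, draft, target, k))
        passes += 1
    return out[:len(prefix) + n_tokens], passes


def greedy_generate(prefix: List[int], target: Model, n_tokens: int) -> List[int]:
    """Baseline: the target model alone, one pass per token."""
    out = list(prefix)
    for _ in range(n_tokens):
        out.append(target(out))
    return out
```

Calling `generate([0], draft_next, target_next, k=3, n_tokens=12)` produces the same tokens as `greedy_generate([0], target_next, 12)`, but needs only a handful of verification passes instead of twelve.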

Question 2

Does speculative decoding change the output?

Accepted Answer

No, and that is its key property. The verification step accepts or rejects each drafted token using the large model's own probabilities, so the output follows exactly the same distribution as sampling from the large model alone. With greedy decoding, the text is token-for-token the same, up to floating-point differences. You get the speed without a quality tradeoff, which is why it is widely deployed in production.
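
When tokens are sampled rather than chosen greedily, the guarantee is about the output distribution. The sketch below assumes the accept/reject rule commonly described for speculative sampling: accept a drafted token x with probability min(1, p(x)/q(x)), where p is the target model's distribution and q is the draft model's, and on rejection resample from the normalized leftover max(0, p - q). The function name and the example numbers are illustrative, not taken from any particular library. It computes the resulting distribution exactly and shows that it equals p.

```python
from typing import List


def output_distribution(p: List[float], q: List[float]) -> List[float]:
    """Exact distribution of one token emitted by draft-then-verify sampling.

    p: target model probabilities, q: draft model probabilities.
    """
    # Token x is proposed with probability q[x] and accepted with probability
    # min(1, p[x] / q[x]); the product simplifies to min(p[x], q[x]).
    accept = [min(pi, qi) for pi, qi in zip(p, q)]

    # Probability that the drafted token is rejected.
    reject_mass = 1.0 - sum(accept)

    # On rejection, resample from the normalized positive part of (p - q).
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(residual)
    if total == 0.0:
        # p and q are identical, so nothing is ever rejected.
        return accept

    return [a + reject_mass * r / total for a, r in zip(accept, residual)]


# Illustrative numbers: the draft model is badly wrong about the third token.
p = [0.5, 0.3, 0.2]
q = [0.2, 0.2, 0.6]
result = output_distribution(p, q)
```

Here `result` matches `p` up to floating-point rounding, even though `q` is a poor approximation. A worse draft model costs speed, because more guesses are rejected, but it does not change what is generated.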

Question 3

How much faster does it make generation?

Accepted Answer

Reported speedups are typically 2-3x, sometimes up to 4x, depending on how often the draft model guesses correctly and how cheap it is to run relative to the large model. Speedup is highest on predictable text where the small model agrees with the large one, and lower on hard, surprising content where more guesses are rejected.
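
A rough back-of-the-envelope estimate shows how acceptance rate drives the speedup. The sketch assumes a simplified model in which each drafted token is accepted independently with the same probability `alpha`, `gamma` tokens are drafted per round, and one draft-model step costs `cost_ratio` times one target-model pass. All three parameter names and the example values are assumptions for illustration, not measurements.

```python
def expected_tokens_per_pass(alpha: float, gamma: int) -> float:
    """Expected tokens produced per target-model pass.

    Under the independence assumption this is the geometric sum
    1 + alpha + alpha**2 + ... + alpha**gamma.
    """
    if alpha == 1.0:
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, gamma: int, cost_ratio: float) -> float:
    """Estimated wall-clock speedup over plain one-token-per-pass decoding.

    Each round costs one target pass plus gamma draft steps.
    """
    round_cost = 1.0 + gamma * cost_ratio
    return expected_tokens_per_pass(alpha, gamma) / round_cost


# Illustrative values: 80% acceptance, 4 drafted tokens, draft costs 5% of target.
speedup = estimated_speedup(alpha=0.8, gamma=4, cost_ratio=0.05)
```

With these example values the estimate comes out at roughly 2.8x. Dropping `alpha` to 0.5 with the same settings gives about 1.6x, which illustrates why hard, surprising content benefits less.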

Question 4

Why is generation slow without it?

Accepted Answer

LLMs generate text autoregressively: one token at a time, each requiring a full forward pass through billions of parameters. That serial process is memory-bandwidth-bound, because every pass has to read the model's weights from memory just to produce a single token, and that is the bottleneck. Speculative decoding breaks the strict one-pass-per-token limit by verifying several tokens per expensive pass.
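
The memory-bandwidth limit can be made concrete with simple arithmetic. The sketch below assumes that every decoding step must read all model weights from memory once and ignores everything else (attention caches, compute, batching). The model size, precision, and bandwidth figures are hypothetical round numbers, not specifications of any particular model or device.

```python
def min_seconds_per_token(n_params: float, bytes_per_param: float,
                          bandwidth_bytes_per_s: float) -> float:
    """Lower bound on per-token latency when each token needs one full
    read of the weights from memory."""
    weight_bytes = n_params * bytes_per_param
    return weight_bytes / bandwidth_bytes_per_s


# Hypothetical setup: 70 billion parameters, 2 bytes each, 2 TB/s of bandwidth.
latency = min_seconds_per_token(70e9, 2, 2e12)
tokens_per_second = 1.0 / latency
```

Under these assumptions the floor is about 0.07 seconds per token, or roughly 14 tokens per second, no matter how much compute is available. Verifying several drafted tokens in one pass reuses the same weight read for all of them, which is where the gain comes from.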

Speculative Decoding Explained: How AI Generates Text Faster

Why generating text is slow

The core idea: draft, then verify

Why the output stays identical

Speedups, tradeoffs, and where it is used