Definition
Top-k sampling is a decoding strategy that limits a language model’s next token to the K most probable candidates. At each generation step the model ranks every possible token by probability, keeps only the top K, throws away the rest, renormalises, and samples one token from that shortlist. It is a simple, direct way to inject controlled randomness while preventing the model from picking wildly unlikely tokens.
How it works step by step
- The model produces a probability for every token in its vocabulary.
- Those tokens are sorted from most to least likely.
- Only the top K are retained; everything below is set to zero probability.
- The retained probabilities are rescaled to sum to 1.
- The next token is sampled from this truncated distribution.
Because the cut-off is a fixed count, top-k always considers exactly K options (provided the vocabulary holds at least K tokens), regardless of the model’s confidence at that step.
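The steps above can be sketched in a few lines of plain Python. This is a minimal illustration rather than any particular library's implementation: the function names `top_k_filter` and `sample_top_k` and the five-token toy distribution are assumptions made up for the example.

```python
import random

def top_k_filter(probs, k):
    """Return a renormalised distribution that keeps only the k most probable tokens."""
    if k < 1:
        raise ValueError("k must be at least 1")
    # Rank token indices from most to least likely.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(ranked[:k])
    # Set everything below the cut-off to zero probability.
    truncated = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    # Rescale the survivors so they sum to 1.
    total = sum(truncated)
    return [p / total for p in truncated]

def sample_top_k(probs, k, rng=random):
    """Sample one token index from the top-k truncated distribution."""
    filtered = top_k_filter(probs, k)
    return rng.choices(range(len(filtered)), weights=filtered, k=1)[0]

# Toy five-token vocabulary (hypothetical probabilities, for illustration only).
probs = [0.5, 0.2, 0.15, 0.1, 0.05]
filtered = top_k_filter(probs, k=3)                      # last two entries become 0.0
token = sample_top_k(probs, k=3, rng=random.Random(0))   # always one of 0, 1, 2
```

Passing a seeded `random.Random` instance keeps the draw reproducible, which is handy when comparing settings.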
The role of K
The value of K trades focus against variety:
- K = 1 is identical to greedy decoding — always take the single most likely token. Output is deterministic but can be flat or repetitive.
- Small K (≈2–10) keeps output tight and on-topic, good for factual answers.
- Larger K (≈40–100) allows more diverse, creative continuations.
- Very large K approaches pure sampling from the full distribution; with K equal to the vocabulary size, nothing is truncated at all.
Common defaults sit around 40–50, balancing coherence with enough variety to avoid robotic repetition.
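One way to see how K trades focus against variety is to draw many samples from the same distribution and count the distinct tokens that come out. The sketch below assumes a made-up six-token distribution and a hypothetical helper `distinct_tokens`; it is a toy, not a benchmark.

```python
import random

def sample_top_k(probs, k, rng):
    """Draw one token index from the k most probable tokens."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    # random.choices accepts unnormalised weights, so no explicit rescaling is needed.
    weights = [probs[i] for i in ranked]
    return rng.choices(ranked, weights=weights, k=1)[0]

# Hypothetical six-token distribution, already sorted for readability.
probs = [0.40, 0.25, 0.15, 0.10, 0.06, 0.04]
rng = random.Random(42)

def distinct_tokens(k, draws=2000):
    """Set of token indices observed over repeated top-k draws."""
    return {sample_top_k(probs, k, rng) for _ in range(draws)}

greedy = distinct_tokens(1)            # K = 1: only the argmax, index 0
narrow = distinct_tokens(3)            # K = 3: never anything outside {0, 1, 2}
full = distinct_tokens(len(probs))     # K = vocabulary size: any token is possible
```

With K = 1 the result is the same token every time, which matches greedy decoding; raising K only ever widens the set of tokens that can appear.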
Top-k vs top-p
The main alternative is top-p (nucleus) sampling, which keeps the smallest set of tokens whose cumulative probability reaches a threshold p. The key difference is adaptivity: top-k uses a fixed count, so it can keep many junk tokens when the model is confident (the distribution is sharply peaked and only a few candidates are plausible) or cut off good ones when it is uncertain (the distribution is flat and many candidates are plausible). Top-p uses a dynamic count that shrinks when the model is confident and grows when it is uncertain, which is why many practitioners prefer it for general-purpose generation.
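The difference between a fixed count and a probability-mass threshold can be seen by counting how many tokens each rule keeps on a peaked distribution and on a flat one. The following is a minimal sketch; `top_k_indices`, `top_p_indices`, and both toy distributions are illustrative assumptions.

```python
def top_k_indices(probs, k):
    """Indices of the k most probable tokens."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return ranked[:k]

def top_p_indices(probs, p):
    """Smallest set of top-ranked tokens whose cumulative probability reaches p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return kept

# Hypothetical distributions: one confident step, one uncertain step.
peaked = [0.90, 0.04, 0.03, 0.02, 0.01]
flat = [0.22, 0.21, 0.20, 0.19, 0.18]

print(len(top_k_indices(peaked, 3)), len(top_k_indices(flat, 3)))        # 3 3
print(len(top_p_indices(peaked, 0.85)), len(top_p_indices(flat, 0.85)))  # 1 5
```

Top-k keeps three tokens in both cases, while top-p keeps one token on the peaked distribution and all five on the flat one.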
Combining with temperature
Top-k is usually applied alongside temperature, which reshapes the probability distribution before truncation. A typical pipeline divides the logits by the temperature, then trims to the top K, then renormalises and samples. The controls compound, so the practical advice is to pick one primary dial (temperature, top-k, or top-p) and leave the others near their defaults so that output behaviour stays easy to reason about and reproduce.
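A temperature-then-top-k pipeline might look like the sketch below, which starts from raw logits rather than probabilities. The function names, the example logits, and the chosen temperatures are all assumptions for illustration; real inference libraries differ in details such as tie handling and batching.

```python
import math
import random

def truncated_distribution(logits, temperature, k):
    """Divide logits by temperature, keep the top k, and softmax over the survivors."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    ranked = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:k]
    # Subtract the largest scaled logit before exponentiating for numerical stability.
    top = scaled[ranked[0]]
    weights = [math.exp(scaled[i] - top) for i in ranked]
    total = sum(weights)
    return {i: w / total for i, w in zip(ranked, weights)}

def sample_token(logits, temperature, k, rng):
    """Sample one token index from the temperature-scaled, top-k truncated distribution."""
    dist = truncated_distribution(logits, temperature, k)
    indices = list(dist)
    return rng.choices(indices, weights=[dist[i] for i in indices], k=1)[0]

# Hypothetical logits for a five-token vocabulary.
logits = [2.0, 1.0, 0.5, -1.0, -3.0]
cold = truncated_distribution(logits, temperature=0.5, k=3)  # sharper shortlist
warm = truncated_distribution(logits, temperature=1.5, k=3)  # flatter shortlist
token = sample_token(logits, temperature=0.7, k=3, rng=random.Random(0))
```

Temperature does not change which tokens make the shortlist, since dividing by a positive number preserves the ranking; it only changes how probability is shared among the K survivors. Here the top token's share is larger in `cold` than in `warm`.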