Top-P / Nucleus Sampling (AI Glossary)

Keep only the smallest set of tokens whose probabilities sum to at least p

Definition

Top-p sampling, also called nucleus sampling, is a decoding strategy that restricts a language model’s next token to the smallest set of candidates whose combined probability reaches a threshold p. The model ranks tokens by likelihood, adds them up until the running total crosses p (say 0.9), discards everything after that point, renormalises, and samples from the remaining “nucleus.” Unlike a fixed cut-off, the number of candidates changes automatically based on how confident the model is.
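For readers who prefer notation, the definition can be written compactly. This is a sketch using symbols that do not appear in the text above and are assumed here: V is the vocabulary, P(x | x_{<t}) is the model's next-token distribution given the preceding tokens, V^{(p)} is the nucleus, and P' is the renormalised distribution that is actually sampled.

```latex
\[
V^{(p)} \;=\; \text{the smallest } V' \subseteq V \text{ such that } \sum_{x \in V'} P(x \mid x_{<t}) \;\ge\; p
\]
\[
P'(x \mid x_{<t}) \;=\;
\begin{cases}
\dfrac{P(x \mid x_{<t})}{\sum_{x' \in V^{(p)}} P(x' \mid x_{<t})} & \text{if } x \in V^{(p)} \\[2ex]
0 & \text{otherwise}
\end{cases}
\]
```

In practice the smallest such set is found by taking tokens in descending order of probability, which is what the ranking step does.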

How it works step by step

  1. The model assigns a probability to every token in its vocabulary.
  2. Tokens are sorted from most to least likely.
  3. Their probabilities are accumulated until the sum first reaches p.
  4. That set — the nucleus — is kept; everything else is dropped.
  5. The probabilities are rescaled and the next token is sampled from the nucleus.

If the top token already carries most of the probability, the nucleus may be a single token; if probability is spread thinly, it may include dozens.
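The five steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not how any particular library implements it: the function names `top_p_filter` and `sample_top_p` are made up for this example, and the toy distribution stands in for the output of step 1, which a real model would supply.

```python
import random


def top_p_filter(probs, p):
    """Return the nucleus as {token: renormalised probability}.

    `probs` maps each token to its probability (step 1 is assumed done).
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")

    # Step 2: sort tokens from most to least likely.
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)

    # Steps 3 and 4: accumulate until the running total first reaches p,
    # keeping every token seen so far and dropping the rest.
    nucleus = []
    cumulative = 0.0
    for token, prob in ranked:
        nucleus.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break

    # Step 5 (first half): rescale so the kept probabilities sum to 1.
    total = sum(prob for _, prob in nucleus)
    return {token: prob / total for token, prob in nucleus}


def sample_top_p(probs, p, rng=random):
    """Step 5 (second half): draw one token from the renormalised nucleus."""
    nucleus = top_p_filter(probs, p)
    tokens = list(nucleus)
    weights = list(nucleus.values())
    return rng.choices(tokens, weights=weights, k=1)[0]


# Hypothetical next-token distribution over a four-word vocabulary.
toy_probs = {"the": 0.5, "a": 0.3, "cat": 0.15, "zebra": 0.05}
print(top_p_filter(toy_probs, 0.9))   # keeps "the", "a", "cat"
print(sample_top_p(toy_probs, 0.9))   # one token drawn from those three
```

With p = 0.9 the running total is 0.5, then 0.8, then 0.95, so the first three tokens form the nucleus and "zebra" is dropped. With p = 0.4 the top token alone already reaches the threshold, so the nucleus is a single token.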

Why adaptivity matters

This dynamic sizing is top-p’s key advantage over top-k. Top-k always keeps exactly k tokens: when the model is certain, the top k may include junk; when it is unsure, a fixed k may cut off perfectly good options. Top-p sidesteps both problems by following the shape of the distribution: it narrows where the model is confident and widens where it is uncertain. The result is usually more varied output with less risk of sampling from the unlikely tail.
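A small comparison makes the difference concrete. The sketch below uses two made-up distributions, one sharply peaked and one nearly flat, and the helper names `top_k` and `top_p` are illustrative only. Each helper returns the probabilities it would keep, before renormalisation.

```python
def top_k(probs, k):
    """Keep the k most likely probabilities, whatever the distribution's shape."""
    return sorted(probs, reverse=True)[:k]


def top_p(probs, p):
    """Keep the fewest most likely probabilities whose sum reaches p."""
    kept, cumulative = [], 0.0
    for prob in sorted(probs, reverse=True):
        kept.append(prob)
        cumulative += prob
        if cumulative >= p:
            break
    return kept


# Hypothetical distributions: a confident model and an unsure one.
peaked = [0.96, 0.01, 0.01, 0.01, 0.01]
flat = [0.12, 0.11, 0.11, 0.10, 0.10, 0.10, 0.09, 0.09, 0.09, 0.09]

for name, dist in (("peaked", peaked), ("flat", flat)):
    print(name, "top-k (k=3) keeps", len(top_k(dist, 3)),
          "| top-p (p=0.9) keeps", len(top_p(dist, 0.9)))
```

On the peaked distribution, top-k with k = 3 keeps two tokens that each carry only 1% of the probability, while top-p keeps just the dominant token. On the flat one, top-k discards seven plausible tokens holding about two thirds of the probability mass, while top-p keeps nine of the ten.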

Choosing a value

The threshold p ranges from 0 to 1:

  • 0.9 – 0.95 — a widely used default; keeps lively variety, trims the unlikely tail. Good for general chat and writing.
  • 0.5 – 0.8 — more conservative and focused; useful when you want fewer surprises but not full determinism.
  • 1.0 — effectively no truncation; the full distribution is in play.

For strictly factual or structured output, pair a lower top-p with a low temperature to keep results tight and consistent.
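To get a feel for these ranges, it can help to sweep p over a single distribution and watch the nucleus grow. The sketch below does that for a hypothetical eight-token distribution; `nucleus_size` is an illustrative helper, and the numbers depend entirely on the made-up probabilities.

```python
def nucleus_size(probs, p):
    """Count how many of the most likely tokens are needed to reach p."""
    cumulative = 0.0
    for count, prob in enumerate(sorted(probs, reverse=True), start=1):
        cumulative += prob
        if cumulative >= p:
            return count
    # Floating-point rounding can leave the total just under p (for example
    # when p = 1.0); in that case the whole distribution is kept.
    return len(probs)


# Hypothetical next-token probabilities, already sorted for readability.
probs = [0.40, 0.25, 0.15, 0.08, 0.05, 0.04, 0.02, 0.01]

sizes = {p: nucleus_size(probs, p) for p in (0.5, 0.7, 0.9, 0.95, 1.0)}
for p, size in sizes.items():
    print(f"p={p}: nucleus of {size} of {len(probs)} tokens")
```

For this distribution the nucleus holds 2 tokens at p = 0.5, 3 at 0.7, 5 at 0.9, 6 at 0.95, and all 8 at 1.0. Real vocabularies are far larger, so the long tail trimmed at 0.9 to 0.95 is correspondingly longer.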

Top-p with temperature

Top-p is typically applied after temperature has reshaped the distribution. Because both parameters govern randomness, their effects stack and can become hard to reason about together. A common recommendation is to tune one, not both: choose top-p or temperature as your main control and leave the other at its default, so that changes in output can be traced to a single setting.
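The interaction can be seen in a short sketch that applies temperature first and top-p second, the ordering described above. The logits are invented for illustration, and `softmax_with_temperature` and `nucleus_size` are hypothetical helper names, not any library's API.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then convert to probabilities."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nucleus_size(probs, p):
    """Count how many of the most likely tokens are needed to reach p."""
    cumulative = 0.0
    for count, prob in enumerate(sorted(probs, reverse=True), start=1):
        cumulative += prob
        if cumulative >= p:
            return count
    return len(probs)


# Hypothetical raw scores for a five-token vocabulary.
logits = [2.0, 1.0, 0.5, 0.0, -1.0]

# Same p each time; only the temperature changes.
sizes = {}
for temperature in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, temperature)
    sizes[temperature] = nucleus_size(probs, 0.8)
    print(f"temperature={temperature}: nucleus size {sizes[temperature]}")
```

With p fixed at 0.8, the nucleus holds 1 token at temperature 0.5, 3 at 1.0, and 4 at 2.0. A change in temperature therefore also changes what top-p keeps, which is why adjusting both at once makes the outcome harder to attribute to either setting.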
