What is top-k sampling?

Top-k sampling restricts the model to its k most likely next tokens, discards the rest, renormalises the probabilities of the tokens it kept, and samples from just that fixed-size pool. A small k keeps output safe and focused; a large k allows more variety. The pool size is constant regardless of how confident the model is.

What is top-p (nucleus) sampling?

Top-p sampling keeps the smallest set of top tokens whose probabilities add up to at least a threshold p (say 0.9), renormalises them, then samples from that set. Unlike top-k, the pool size adapts: when the model is confident it may keep just one or two tokens, and when it is uncertain it keeps many. This makes it more flexible than a fixed k.

Should I use top-p or top-k?

Top-p (nucleus) sampling is generally preferred because its candidate pool adapts to the model's confidence, which produces more natural results across varied contexts. Top-k is simpler and still effective. Many APIs default to top-p around 0.9–1.0, and you rarely need to set both at once.

How do these interact with temperature?

Temperature rescales the whole probability distribution before sampling, while top-p and top-k truncate which tokens are eligible. They are complementary: temperature changes how sharp the distribution is, and top-p/top-k decide how much of the tail to cut off. Tuning them all simultaneously is discouraged because their effects overlap and become hard to reason about.

Top-P and Top-K Sampling Explained

Why sampling strategies exist

When a language model picks the next token, it does not simply take its single most likely guess every time, because that would make output repetitive and dull. Instead it samples from the probability distribution over possible tokens. But sampling from the entire distribution is risky: the long tail of very unlikely tokens occasionally gets picked and derails the text. Top-k and top-p sampling are the two standard ways to trim that tail, keeping enough good candidates for variety while cutting off the junk that causes incoherence.

Top-k sampling: a fixed-size shortlist

Top-k sampling is the simpler idea. You choose a number k, and the model keeps only its k most likely next tokens, throws away everything else, renormalises the survivors so their probabilities sum to one, and samples from that shortlist. With k = 1 the model is fully greedy (always its top choice); with k = 40 it samples from a wider pool and produces more variety. The limitation is that the pool size is fixed. When the model is very confident, k = 40 may sweep in clearly wrong tokens; when it is uncertain, k = 40 may cut off good ones. The shortlist does not adapt to how sure the model is.
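The keep, renormalise, sample steps above can be sketched in a few lines of plain Python. This is a minimal illustration rather than how any particular library implements it: the function names (top_k_filter, sample) and the toy distribution are hypothetical, and a real model works over a vocabulary of many thousands of tokens.

```python
import random


def top_k_filter(probs, k):
    """Keep the k highest-probability tokens and renormalise them."""
    if k < 1:
        raise ValueError("k must be at least 1")
    # Sort tokens from most to least likely and keep the first k.
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept = ranked[:k]
    # Renormalise so the surviving probabilities sum to one.
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}


def sample(probs, rng=random):
    """Draw one token from a {token: probability} mapping."""
    tokens = list(probs)
    weights = [probs[token] for token in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Hypothetical next-token distribution, for illustration only.
next_token_probs = {"cat": 0.5, "dog": 0.3, "car": 0.15, "xylophone": 0.05}

# With k = 2 only "cat" and "dog" survive, at roughly 0.625 and 0.375.
shortlist = top_k_filter(next_token_probs, k=2)
next_token = sample(shortlist)
```

Note that the shortlist has two entries here no matter how the 0.5 / 0.3 / 0.15 / 0.05 split changes, which is exactly the fixed-size behaviour described above.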

Top-p (nucleus) sampling: an adaptive cutoff

Top-p, also called nucleus sampling, fixes that. Instead of a fixed count, you choose a probability mass p (commonly 0.9). The model sorts tokens by likelihood, keeps the smallest set whose probabilities add up to at least p, renormalises that set, and samples from it. The key advantage is that the pool adapts to confidence: when the model is sure, a single token might already cover 0.9 of the mass, so the pool stays tiny and output stays safe; when the model is uncertain, the mass spreads across many tokens, so the pool grows and allows more exploration. This context-sensitive behaviour is why top-p is usually preferred and why many APIs default to a top-p near 0.9 to 1.0.
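To make the adaptive pool concrete, here is a small sketch of the nucleus cutoff in plain Python. The function name top_p_filter and both toy distributions are assumptions made up for this example; production implementations operate on tensors of logits rather than dictionaries.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in the range (0, 1]")
    # Walk tokens from most to least likely, accumulating probability mass.
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break  # the nucleus now covers at least p of the mass
    # Renormalise the nucleus so it sums to one.
    return {token: prob / cumulative for token, prob in kept}


# Hypothetical distributions, for illustration only.
confident = {"Paris": 0.92, "Lyon": 0.04, "Nice": 0.03, "Rome": 0.01}
uncertain = {"red": 0.3, "blue": 0.25, "green": 0.2, "gold": 0.2, "grey": 0.05}

small_pool = top_p_filter(confident, p=0.9)  # one token: "Paris"
large_pool = top_p_filter(uncertain, p=0.9)  # four tokens: red, blue, green, gold
```

The same p = 0.9 yields a pool of one token in the confident case and four in the uncertain case, which a fixed k cannot do.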

Combining them with temperature — and with each other

These knobs are complementary, not interchangeable. Temperature rescales the whole distribution (sharpening or flattening it), while top-p and top-k decide which tokens survive the cut. A clean mental model: temperature sets how adventurous the model feels, and top-p/top-k set how far into the tail it is allowed to wander. In practice you rarely tune all three at once — pick one truncation method (usually top-p) and adjust temperature alongside it. Stacking a low top-k, a tight top-p, and a high temperature produces effects that interact in confusing ways. The reliable workflow is to keep top-p at a sensible default (~0.9), adjust temperature for the task, and only reach for top-k when you want a simple, hard cap on the candidate pool.
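The division of labour between temperature and truncation can be sketched as a small pipeline. This is an illustrative sketch under stated assumptions, not a reference implementation: the function names, the toy logits, and the ordering (temperature first, then an optional top-k cap, then an optional top-p cutoff measured on the temperature-scaled probabilities) are all choices made for this example, and real libraries may order or combine the steps differently.

```python
import math
import random


def apply_temperature(logits, temperature):
    """Divide logits by the temperature, then softmax into probabilities."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def truncate(probs, top_k=None, top_p=None):
    """Apply an optional top-k cap, then an optional top-p cutoff, and renormalise."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]
    if top_p is not None:
        kept, cumulative = [], 0.0
        for tok, prob in ranked:
            kept.append((tok, prob))
            cumulative += prob
            if cumulative >= top_p:
                break
        ranked = kept
    total = sum(prob for _, prob in ranked)
    return {tok: prob / total for tok, prob in ranked}


def sample_next_token(logits, temperature=1.0, top_k=None, top_p=0.9, rng=random):
    """Temperature reshapes the distribution; truncation trims its tail."""
    probs = truncate(apply_temperature(logits, temperature), top_k, top_p)
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]


# Hypothetical logits, for illustration only.
logits = {"the": 4.0, "a": 3.0, "one": 1.0, "zebra": -2.0}

sharp = apply_temperature(logits, temperature=0.5)  # low temperature: peaked
flat = apply_temperature(logits, temperature=2.0)   # high temperature: flatter

sharp_pool = truncate(sharp, top_p=0.9)
flat_pool = truncate(flat, top_p=0.9)
```

With these toy logits, the top-p = 0.9 nucleus holds two tokens at temperature 0.5 and three at temperature 2.0. That is the overlap mentioned above: raising the temperature also widens the top-p pool, so the two settings are not fully independent.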