Why sampling strategies exist
When a language model picks the next token, it does not simply take its single most likely guess every time — that would make output repetitive and dull. Instead it samples from the probability distribution over possible tokens. But sampling from the entire distribution is risky: the long tail of very unlikely tokens occasionally gets picked and derails the text. Top-k and top-p sampling are the two standard ways to trim that tail — keeping enough good candidates for variety while cutting off the junk that causes incoherence. The explorer below lets you see how each method decides which tokens stay in play.
Top-k sampling: a fixed-size shortlist
Top-k sampling is the simpler idea. You choose a number k, and the model keeps only its k most likely next tokens, throws away everything else, renormalises the survivors so their probabilities sum to one, and samples from that shortlist. With k = 1 the model is fully greedy (always its top choice); with k = 40 it samples from a wider pool and produces more variety. The limitation is that the pool size is fixed. When the model is very confident, k = 40 may sweep in clearly wrong tokens; when it is uncertain, k = 40 may cut off good ones. The shortlist does not adapt to how sure the model is.
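The keep, renormalise, and sample steps can be sketched in a few lines of plain Python. This is a minimal illustration, not any particular library's implementation: the function names `top_k_filter` and `sample` and the four-token distribution are invented for the example, and a real model would work over a vocabulary of many thousands of tokens.

```python
import random

def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalise them to sum to one."""
    if k < 1:
        raise ValueError("k must be at least 1")
    # Sort tokens by probability, highest first, and keep the first k.
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept = ranked[:k]
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}

def sample(probs, rng=random):
    """Draw one token in proportion to its probability."""
    tokens = list(probs)
    weights = [probs[token] for token in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Hypothetical next-token distribution, made up for illustration.
next_token_probs = {"cat": 0.5, "dog": 0.3, "car": 0.15, "xylophone": 0.05}

# With k = 2 only "cat" and "dog" survive, rescaled to roughly 0.625 and 0.375.
shortlist = top_k_filter(next_token_probs, k=2)
print(shortlist)
print(sample(shortlist))
```

Note that the shortlist has exactly k entries whatever the shape of the distribution, which is the fixed-size limitation described above.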
Top-p (nucleus) sampling: an adaptive cutoff
Top-p, also called nucleus sampling, fixes that. Instead of a fixed count, you choose a probability mass p (commonly 0.9). The model sorts tokens by likelihood, keeps the smallest set whose cumulative probability reaches at least p, renormalises that set so it sums to one, and samples from it. The key advantage is that the pool adapts to confidence: when the model is sure, a single token might already cover 0.9 of the mass, so the pool stays tiny and output stays safe; when the model is uncertain, the mass spreads across many tokens, so the pool grows and allows more exploration. This context-sensitive behaviour is why top-p is usually preferred and why many APIs default top-p to a value between about 0.9 and 1.0.
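The adaptive pool size is easiest to see by running the same cutoff on a confident and an uncertain distribution. The sketch below is an illustration under assumptions: `top_p_filter` is a hypothetical helper name, and both distributions are invented for the example rather than taken from a real model.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose mass reaches p, then renormalise."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in the range (0, 1]")
    # Walk the tokens from most to least likely, accumulating probability mass.
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept = []
    cumulative = 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break  # the nucleus now covers at least p of the mass
    return {token: prob / cumulative for token, prob in kept}

# Hypothetical distributions, made up for illustration.
confident = {"Paris": 0.92, "Lyon": 0.04, "Rome": 0.03, "banana": 0.01}
uncertain = {"red": 0.30, "blue": 0.25, "green": 0.20,
             "pink": 0.12, "teal": 0.08, "mauve": 0.05}

confident_pool = top_p_filter(confident, p=0.9)
uncertain_pool = top_p_filter(uncertain, p=0.9)

print(len(confident_pool))  # one token already covers the mass
print(len(uncertain_pool))  # five tokens are needed to reach 0.9
```

The same p yields a one-token pool in the confident case and a five-token pool in the uncertain case, with no change to the setting itself.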
Combining them with temperature — and with each other
These knobs are complementary, not interchangeable. Temperature rescales the whole distribution (values below 1 sharpen it, values above 1 flatten it), while top-p and top-k decide which tokens survive the cut. A clean mental model: temperature sets how adventurous the model feels, and top-p/top-k set how far into the tail it is allowed to wander. In practice you rarely tune all three at once: pick one truncation method (usually top-p) and adjust temperature alongside it. Stacking a low top-k, a tight top-p, and a high temperature makes it hard to tell which setting is responsible for a change in output, because each one reshapes the pool the next one sees. The reliable workflow is to keep top-p at a sensible default (~0.9), adjust temperature for the task, and only reach for top-k when you want a simple, hard cap on the candidate pool.
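To make the interaction concrete, here is a minimal sketch that chains the three knobs in one common ordering: temperature first, then top-k, then top-p. The ordering, the function names, and the logits are all assumptions made for illustration; real inference libraries may order or renormalise these steps differently.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; low temperature sharpens, high flattens."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = {token: score / temperature for token, score in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {token: math.exp(score - peak) for token, score in scaled.items()}
    total = sum(exps.values())
    return {token: value / total for token, value in exps.items()}

def truncate(probs, top_k=None, top_p=None):
    """Apply an optional top-k cap, then an optional top-p cutoff, then renormalise."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]
    total = sum(prob for _, prob in ranked)
    ranked = [(token, prob / total) for token, prob in ranked]
    if top_p is not None:
        kept, cumulative = [], 0.0
        for token, prob in ranked:
            kept.append((token, prob))
            cumulative += prob
            if cumulative >= top_p:
                break
        ranked = kept
    total = sum(prob for _, prob in ranked)
    return {token: prob / total for token, prob in ranked}

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=0.9, rng=random):
    """Temperature reshapes the distribution; truncation then limits the pool."""
    pool = truncate(softmax_with_temperature(logits, temperature), top_k, top_p)
    tokens = list(pool)
    return rng.choices(tokens, weights=[pool[t] for t in tokens], k=1)[0]

# Hypothetical logits, made up for illustration.
logits = {"cat": 3.0, "dog": 2.0, "car": 0.5, "xylophone": -1.0}

cool_pool = truncate(softmax_with_temperature(logits, 0.5), top_p=0.9)
warm_pool = truncate(softmax_with_temperature(logits, 2.0), top_p=0.9)
print(len(cool_pool), len(warm_pool))
print(sample_next_token(logits, temperature=0.8, top_p=0.9))
```

With these made-up logits, the same top-p of 0.9 admits more tokens at the higher temperature than at the lower one, which is one reason changing several settings together is hard to reason about.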