Two different ways to control randomness
When a language model generates text, it assigns a probability to every possible next token and then samples one. Temperature and top_p are two different knobs for controlling that sampling step, and confusing them is one of the most common mistakes in prompt and API tuning. The key insight is that they act at different stages. Temperature reshapes the whole probability distribution before sampling. Top_p truncates the distribution to a subset of the most likely candidates and then samples from that subset. Both ultimately make output more or less varied, but they get there by different mechanisms, which is why changing both at once produces results that are hard to predict.
How temperature works
Temperature is a divisor applied to the model’s raw scores (logits) before softmax turns them into probabilities. A low temperature (near 0) sharpens the distribution: the most likely token becomes overwhelmingly likely, so the model almost always picks the obvious choice and output becomes repetitive and nearly deterministic. A temperature of 1 leaves the model’s distribution unchanged. A high temperature (above 1) flattens the distribution: less likely tokens get a real chance of being selected, so output becomes more surprising, varied, and sometimes incoherent. At temperature 0 the division is undefined, so implementations treat it as greedy decoding: the single most probable token is picked every time. Think of temperature as a “creativity dial” that rescales every logit by the same factor, which changes the relative odds between tokens rather than shifting all probabilities proportionally.
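The sharpening and flattening described above can be seen in a few lines of Python. This is a minimal sketch, not any provider's actual implementation: the function name softmax_with_temperature and the three example logits are made up for illustration.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide each logit by the temperature, then apply softmax."""
    if temperature <= 0:
        # Division by zero is undefined; real samplers special-case 0 as greedy.
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical raw scores for three candidate tokens.
logits = [2.0, 1.0, 0.0]
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"temperature={t}: {[round(p, 3) for p in probs]}")
```

Running it shows the top token's share growing as the temperature drops and shrinking as it rises, while the ranking of the tokens never changes.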
How top_p (nucleus sampling) works
Top_p, also called nucleus sampling, works by cutting off the long tail of
improbable tokens. The sampler sorts candidate tokens by probability and keeps adding
them to a pool until their cumulative probability reaches the value p; everything
outside that “nucleus” is discarded, the surviving probabilities are rescaled to sum
to 1, and the next token is sampled only from what remains. So
top_p = 0.9 means “consider only the most likely tokens that together make up 90% of
the probability mass.” A low top_p (say 0.5) keeps the pool tight and output focused; a
top_p of 1.0 considers every token and effectively disables the truncation. Crucially,
the size of the pool changes per token: when the model is confident, the nucleus is
tiny; when it is uncertain, the nucleus is wide.
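The truncation step can be sketched as follows. The function name top_p_filter and the example distributions are hypothetical, and rescaling the kept probabilities so they sum to 1 is an assumption about how the final sampling is done, though it is the usual approach.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability
    reaches p, and return {token_index: renormalized_probability}."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    # Token indices ordered from most to least probable.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

# A confident step: one token dominates, so the nucleus is a single token.
confident = [0.97, 0.01, 0.01, 0.01]
# An uncertain step: mass is spread evenly, so the nucleus is wide.
uncertain = [0.25, 0.25, 0.25, 0.25]

print(len(top_p_filter(confident, 0.9)))
print(len(top_p_filter(uncertain, 0.9)))
```

The same p of 0.9 yields a pool of one token in the confident case and all four in the uncertain case, which is the per-token adaptivity described above.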
Practical guidance: pick one and set sensible defaults
The standard advice from OpenAI and others is to tune one parameter, not both.
The two interact in non-obvious ways, and adjusting both at once can compound or
cancel their effects unpredictably. Most people find temperature the more intuitive
knob, so a good habit is to leave top_p at its default of 1.0 and adjust temperature
for the task:
near 0 for factual extraction, classification, and code; around 0.7 for balanced
writing; 0.9 and above for creative brainstorming. If you prefer top_p, leave
temperature at 1.0 and lower top_p (0.5–0.9) to restrict the candidate pool. Finally,
do not assume temperature 0 gives bit-for-bit reproducibility: it gets close, but
floating-point rounding and batching effects mean tiny variation can remain. Choose the knob
that matches your mental model, set a default that fits the task, and only reach for
the second parameter if the first cannot get you where you need to be.
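To see how the two knobs combine in a single sampling step, here is a small self-contained sketch. It assumes temperature is applied first and top_p second, which matches common implementations but is not guaranteed for every provider. The names TASK_PRESETS and sample_token are hypothetical; the preset values simply restate the guidance above.

```python
import math
import random

# Hypothetical presets restating the guidance above; top_p stays at 1.0.
TASK_PRESETS = {
    "extraction": {"temperature": 0.0, "top_p": 1.0},
    "balanced_writing": {"temperature": 0.7, "top_p": 1.0},
    "brainstorming": {"temperature": 0.9, "top_p": 1.0},
}

def sample_token(logits, temperature=1.0, top_p=1.0, rng=random):
    """Pick one token index: temperature first, then top_p (assumed order)."""
    if temperature == 0:
        # Greedy decoding: always return the highest-scoring token.
        return max(range(len(logits)), key=lambda i: logits[i])
    # Step 1: temperature-scaled softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Step 2: keep the smallest top set whose cumulative probability reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    # Step 3: sample from the nucleus; random.choices normalizes the weights.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]

logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
rng = random.Random(0)    # seeded only so this toy run is repeatable
for task, params in TASK_PRESETS.items():
    draws = [sample_token(logits, rng=rng, **params) for _ in range(10)]
    print(task, draws)
```

Seeding the local generator makes this toy run repeatable, but that says nothing about hosted APIs, where the caveat about residual variation at temperature 0 still applies.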