What does temperature actually do?

Temperature rescales the model's logits (its raw token scores) before they are converted into probabilities and sampled. Low values (near 0) make the model pick the most likely token almost every time, which is predictable but can be repetitive. High values (toward 1.5–2) flatten the distribution so less likely tokens are picked more often, giving variety and creativity at the cost of coherence.

Should I change temperature or top-p?

Change one, not both. Temperature reshapes the whole distribution; top-p (nucleus sampling) keeps only the smallest set of tokens whose probabilities sum to p, then samples among those. Most teams pick a temperature and leave top-p at 1.0, or vice versa.
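The nucleus step described above can be sketched in a few lines. This is a minimal illustration, not any provider's implementation: the function name top_p_filter and the toy probabilities are assumptions made for the example.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose probabilities sum to at least p,
    then renormalise so the kept probabilities sum to 1."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    # Rank tokens from most to least likely.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept = {}
    cumulative = 0.0
    for token, prob in ranked:
        kept[token] = prob
        cumulative += prob
        if cumulative >= p:
            break  # the nucleus is complete
    total = sum(kept.values())
    return {token: prob / total for token, prob in kept.items()}


# Toy next-token distribution (made-up numbers).
probs = {"the": 0.5, "a": 0.3, "an": 0.15, "zebra": 0.05}
nucleus = top_p_filter(probs, 0.9)
print(sorted(nucleus))  # the low-probability tail token is dropped
```

With top-p at 1.0 every token stays in the pool, which is why leaving it at 1.0 effectively hands control to temperature alone.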

Why is temperature 0 recommended for code and extraction?

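For code, structured output, and data extraction you want the model's single most likely continuation every time, not creative variety. Near-zero temperature maximises determinism and reproducibility, which matters for tests and pipelines. Even at 0, some providers do not guarantee identical outputs across runs, so treat the result as near-deterministic rather than exact.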

Does a higher max-tokens cost more?

Not by itself. Max-tokens only caps how long the response can be; you pay for the tokens actually generated. Setting it generously is fine, but a tight cap protects you from runaway responses and surprise bills on completion-heavy tasks.
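The arithmetic is simple enough to sketch. The price below is a hypothetical placeholder, not a real rate, and the function name output_cost is an assumption made for the example.

```python
def output_cost(tokens_wanted, max_tokens, price_per_1k_output):
    """Return (tokens billed, cost) for one response.

    tokens_wanted is how long the model would run if nothing stopped it;
    the cap only matters when it is lower than that.
    """
    generated = min(tokens_wanted, max_tokens)
    return generated, generated * price_per_1k_output / 1000


PRICE = 0.01  # hypothetical price per 1,000 output tokens

# A short answer costs the same under a generous cap or a tight one.
short_generous = output_cost(300, 4096, PRICE)
short_tight = output_cost(300, 512, PRICE)

# A runaway response is where the cap earns its keep.
runaway_capped = output_cost(10_000, 1024, PRICE)
print(short_generous, short_tight, runaway_capped)
```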

What is the LLM Temperature & Parameter Guide?

An interactive reference that recommends temperature, top-p, max-tokens, and other sampling parameters per task type (creative writing, coding, summarisation, Q&A, classification), with reasoning and examples. It runs free in your browser on Gera Tools, with nothing uploaded.

LLM Temperature & Parameter Guide

Name: LLM Temperature & Parameter Guide
Creator: Gera Tools
License: https://creativecommons.org/licenses/by/4.0/

Sampling parameters quietly decide whether your LLM feels reliable or unhinged. Set temperature too high on a coding task and you get plausible-looking nonsense; set it too low on a brainstorm and you get the same three ideas every time. This guide maps the common task types to sensible starting parameters and explains the reasoning so you can adjust with intent instead of guessing.

What temperature actually does to the output distribution

Temperature rescales the raw logit scores the model produces before the softmax that converts them into probabilities. At temperature 1.0, the distribution is unchanged from what the model learned — the most likely token might be 30× more probable than a rare one. At temperature 0.1, that gap is amplified enormously; the top token becomes near-certain. At temperature 1.5, the gap is compressed; unlikely tokens get a meaningful share of the probability mass.
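A small sketch makes the rescaling concrete. It assumes the standard formulation (divide each logit by the temperature, then apply softmax) and uses a made-up two-token example in which the top token is 30× as likely as the rare one at temperature 1.0. The function name is illustrative.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    if temperature <= 0:
        # Dividing by zero is undefined; temperature 0 is usually handled
        # as a special case that just picks the highest-scoring token.
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max so exp() cannot overflow
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Toy logits chosen so the top token is 30x as likely as the rare one at T=1.0.
logits = [math.log(30.0), 0.0]

ratios = {}
for t in (0.1, 1.0, 1.5):
    top, rare = softmax_with_temperature(logits, t)
    ratios[t] = top / rare
    print(f"T={t}: top token is {ratios[t]:.3g}x as likely as the rare one")
```

For two tokens the ratio works out to 30^(1/T): roughly 5.9 × 10^14 at temperature 0.1, exactly 30 at 1.0, and about 9.7 at 1.5.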

The practical effect:

Low temperature (0–0.3): Predictable, consistent, closely follows the most likely continuation. Ideal when correctness matters more than variety. Repeated runs produce nearly identical outputs.
Medium temperature (0.4–0.7): Balanced. Good for summaries, explanations, and writing assistance, where you want the model to stay close to the most likely path while occasionally choosing a more natural word. Factual Q&A usually sits at or just below the bottom of this band.
High temperature (0.8–1.2): Creative and varied. Different runs produce meaningfully different outputs. Risk of incoherence rises as you approach and exceed 1.0 on complex tasks.
Above 1.2: Useful only for narrow creative tasks with short outputs; coherence breaks down on longer generations for most models.

How it works

Pick the task type and how deterministic you need the output to be. The tool returns a recommended temperature, top-p, and max-tokens, plus a short rationale. The logic follows the well-established trade-off: deterministic, correctness-critical tasks (code, extraction, classification) sit near temperature 0, balanced tasks (Q&A, summarisation) sit in the 0.2–0.5 band, and generative, divergent tasks (storytelling, brainstorming, marketing copy) climb toward 0.8–1.2. The determinism slider nudges the recommendation within the band so you can lean more consistent or more varied.
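The band-plus-slider logic described above could look something like the sketch below. It is not the tool's actual code: the names BANDS and recommend_temperature are hypothetical, the linear interpolation is an assumption, and the upper bound of 0.2 for the deterministic group is a guess, since the text only says "near temperature 0".

```python
# Temperature bands per task group, taken from the description above.
# The 0.2 upper bound for "deterministic" is an assumption.
BANDS = {
    "deterministic": (0.0, 0.2),  # code, extraction, classification
    "balanced": (0.2, 0.5),       # Q&A, summarisation
    "generative": (0.8, 1.2),     # storytelling, brainstorming, marketing copy
}


def recommend_temperature(task_group, determinism):
    """Pick a temperature inside the task group's band.

    determinism is a slider value in [0, 1]: 1.0 leans fully consistent
    (bottom of the band), 0.0 leans fully varied (top of the band).
    """
    if task_group not in BANDS:
        raise ValueError(f"unknown task group: {task_group!r}")
    if not 0.0 <= determinism <= 1.0:
        raise ValueError("determinism must be between 0 and 1")
    low, high = BANDS[task_group]
    # Linear interpolation from the top of the band down to the bottom.
    return round(high - determinism * (high - low), 2)


print(recommend_temperature("balanced", 1.0))
print(recommend_temperature("generative", 0.5))
```

Keeping the slider confined to the band means a user can nudge the result without accidentally pushing a coding task into creative-writing territory.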

Quick reference by task type

Task	Suggested temperature	Rationale
Code generation	0 – 0.1	Correctness is binary; variety is unwanted
Data extraction / classification	0 – 0.2	Schema compliance matters; creativity does not
Factual Q&A	0.2 – 0.4	Accurate answers with natural phrasing
Summarisation	0.3 – 0.5	Balanced fidelity and readability
Writing assistance / editing	0.5 – 0.7	Helpful suggestions without losing voice
Brainstorming / ideation	0.8 – 1.0	Variety is the point
Creative writing	0.9 – 1.2	Distinctive, surprising outputs

Tips and notes

Change temperature or top-p, not both; combining them makes results hard to reason about. For anything you will run repeatedly (tests, pipelines, eval suites) pin temperature to 0 so results are as reproducible as your provider allows. If outputs feel repetitive at moderate temperature, a small bump plus a frequency_penalty often helps more than a large temperature jump. And remember that the “right” setting is the one that passes your own evaluation on real inputs: treat these as informed starting points, then measure.
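To see why a frequency penalty targets repetition more directly than temperature does, here is a minimal sketch. It assumes the commonly documented form, in which each token's logit is reduced in proportion to how often that token has already appeared; providers may implement the details differently, and the function name and toy values are illustrative.

```python
from collections import Counter


def apply_frequency_penalty(logits, generated_tokens, frequency_penalty):
    """Lower each token's logit by (times already generated) * penalty.

    Tokens that have not appeared yet are left untouched, so only the
    repeated tokens lose ground; temperature, by contrast, reshapes
    every token's probability at once.
    """
    counts = Counter(generated_tokens)  # missing tokens count as 0
    return {
        token: logit - counts[token] * frequency_penalty
        for token, logit in logits.items()
    }


# Toy example: "cat" has already been generated twice.
logits = {"cat": 2.0, "dog": 1.5, "bird": 0.5}
history = ["cat", "sat", "cat"]
adjusted = apply_frequency_penalty(logits, history, 0.5)
print(adjusted)
```

In this toy case the repeated token drops below its nearest rival while the rest of the distribution is unchanged.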