Make your LLM calls as repeatable as possible
Flaky, non-reproducible model output makes testing and debugging miserable. This helper builds a provider-specific checklist and a ready-to-paste configuration snippet — covering temperature, top_p, seed and the often-missed details — so you remove every controllable source of randomness from your calls.
What actually controls determinism
Several things have to line up for an LLM to return the same answer twice:
- Temperature = 0 is the single biggest lever. It tells the model to pick its most likely token instead of sampling, which removes most run-to-run variation.
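- A fixed seed (OpenAI) asks the API to sample the same way across calls. OpenAI documents this as best-effort and returns a system_fingerprint value so you can tell when the backend has changed. Anthropic and some others do not expose a seed, so you rely on temperature 0 there.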
- Identical inputs — the exact same prompt, message order, and any tool or function definitions. A single changed character can change the output.
- A pinned model version. Provider model aliases (like latest) drift over time; pin a dated snapshot so an upgrade does not silently change your results.
- Unchanged decoding parameters: top_p, max_tokens, stop sequences and penalties all influence the outcome.
Even with all of this, distributed inference can introduce rare floating-point differences, so reproducibility is best-effort, not absolute.
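As a rough illustration of the checklist above, the sketch below assembles request parameters with every decoding setting stated explicitly. The function name build_request, the default seed and max_tokens values, and the two snapshot names are assumptions made for this example, not output of the helper itself; substitute the dated snapshot you actually use and check your provider's current API reference for the accepted parameters.

```python
from typing import Any

# Example snapshot names only (assumption); use the dated snapshot you have pinned.
OPENAI_MODEL = "gpt-4o-2024-08-06"
ANTHROPIC_MODEL = "claude-3-5-sonnet-20241022"


def build_request(
    provider: str,
    model: str,
    messages: list[dict[str, str]],
    seed: int = 1234,
    max_tokens: int = 512,
) -> dict[str, Any]:
    """Return request parameters with the decoding settings made explicit."""
    # A name ending in "latest" is treated as a moving alias and rejected.
    if model.endswith("latest"):
        raise ValueError(f"{model!r} looks like a moving alias; pin a dated snapshot")

    request: dict[str, Any] = {
        "model": model,
        "messages": messages,
        "temperature": 0,          # pick the most likely token instead of sampling
        "max_tokens": max_tokens,  # keep the length limit constant between runs
    }
    if provider == "openai":
        request["top_p"] = 1       # stated explicitly so it cannot change unnoticed
        request["seed"] = seed     # best-effort repeatable sampling
    elif provider == "anthropic":
        pass                       # no seed parameter; rely on temperature 0
    else:
        raise ValueError(f"unknown provider: {provider!r}")
    return request
```

The returned dictionary is meant to be passed to whichever client you use, for example as keyword arguments to an SDK call. Keeping the construction in one function makes it harder for a single call site to drift from the rest.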
Tips
- Pin the model snapshot rather than a moving alias — this is the most commonly missed source of drift.
- Log the seed and model version alongside each response so you can reproduce a specific output later.
- Hold the whole request constant when comparing prompts; change one variable at a time.
- For Anthropic, lean entirely on temperature 0 and stable inputs, since no seed is available.
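One possible way to act on the logging and "hold the request constant" tips is to store a hash of the full request next to each response. This is a minimal sketch using only the Python standard library; request_fingerprint, log_record and the record's field names are hypothetical names chosen for this example.

```python
import hashlib
import json
from typing import Any, Optional


def request_fingerprint(request: dict[str, Any]) -> str:
    """Hash the request so that any changed character yields a different value."""
    # sort_keys makes the hash independent of dictionary key order;
    # list order (such as message order) still affects the result.
    canonical = json.dumps(
        request, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def log_record(request: dict[str, Any], response_text: str) -> dict[str, Optional[Any]]:
    """Bundle what is needed to reproduce or compare a specific output later."""
    return {
        "model": request["model"],
        "seed": request.get("seed"),  # None when the provider exposes no seed
        "request_sha256": request_fingerprint(request),
        "response": response_text,
    }
```

Two runs with the same request_sha256 were sent identical inputs, so a difference in their responses points at the provider side rather than at your prompt or parameters.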