Run a prompt N times and keep the majority answer
Large language models are non-deterministic: ask the same question twice at a normal temperature and you can get two different answers. Self-consistency turns that variance into a reliability signal. This tool sends your prompt to OpenAI or Anthropic N times using your own API key, groups the responses by content, and surfaces the majority answer along with how strongly the runs agreed.
How it works
Each of the N requests is an independent API call that uses the model’s default sampling settings. Once all responses have returned, the tool normalizes every answer (trimming whitespace, lowercasing, and stripping wrapping punctuation) and then buckets identical normalized strings together. The bucket with the most votes is the majority answer, and the agreement percentage (winning votes ÷ N) tells you whether the model is confident (one dominant cluster) or unstable (many small clusters). High disagreement is itself useful: it means the task is ambiguous, the prompt is under-specified, or the answer is genuinely uncertain.
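The normalize-and-bucket step described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the tool's actual code: the function names normalize and majority_vote are hypothetical, and the exact normalization rules the tool applies may differ (for instance, this sketch would also strip a leading minus sign from a number).

```python
from collections import Counter
import string


def normalize(answer: str) -> str:
    """Trim whitespace, lowercase, and strip wrapping punctuation."""
    # Assumption: "wrapping punctuation" means ASCII punctuation at either end.
    return answer.strip().lower().strip(string.punctuation + string.whitespace)


def majority_vote(responses):
    """Return (majority answer, agreement fraction, bucket counts)."""
    if not responses:
        raise ValueError("need at least one response")
    # Identical normalized strings land in the same bucket.
    buckets = Counter(normalize(r) for r in responses)
    winner, votes = buckets.most_common(1)[0]
    # Agreement is winning votes divided by N.
    return winner, votes / len(responses), dict(buckets)


# Five hypothetical responses to the same prompt.
runs = ["Paris.", " paris", "PARIS!", "Lyon", '"Paris"']
answer, agreement, buckets = majority_vote(runs)
print(answer, agreement, buckets)
```

Here four of the five runs collapse into the same bucket after normalization, so the agreement is 4 ÷ 5. A result spread evenly over many buckets would instead point to an ambiguous task or an under-specified prompt.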
Tips and notes
Majority voting works best for short factual or classification-style answers, so frame the task so that the model returns a single token or short phrase (“yes/no”, a label, a number). Long free-form generations will rarely match exactly, so apply the vote to the final extracted answer rather than to the full prose. An odd N rules out ties only when there are two possible answers; with three or more distinct answers a tie is still possible (for example, a 2-2-1 split at N = 5). Five to seven runs is a practical balance of cost and stability; push to nine or more only for genuinely high-stakes decisions. Remember that every run is billed separately against your key.
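One way to vote on the final extracted answer rather than the full prose is to ask the model to end each response with a fixed marker and pull out only what follows it. The sketch below assumes a hypothetical "Answer: <label>" convention that you would add to your own prompt; the helper names extract_final_answer and vote_with_tie_check are illustrative, not part of the tool. It also flags ties, which can occur even with an odd N once three or more labels are in play.

```python
import re
from collections import Counter


def extract_final_answer(text):
    """Return the text after the last 'Answer:' marker, or None if absent."""
    # Assumption: the prompt told the model to finish with "Answer: <label>".
    matches = re.findall(r"answer:\s*(.+)", text, flags=re.IGNORECASE)
    return matches[-1].strip() if matches else None


def vote_with_tie_check(answers):
    """Return (top answer, tied), where tied is True if first place is shared."""
    counts = Counter(a.strip().lower() for a in answers if a is not None)
    ranked = counts.most_common()
    if not ranked:
        return None, False
    top_votes = ranked[0][1]
    tied = len(ranked) > 1 and ranked[1][1] == top_votes
    return ranked[0][0], tied


# Five hypothetical long-form responses to a sentiment prompt.
runs = [
    "The review praises the battery. Answer: positive",
    "Mixed tone overall.\nAnswer: Neutral",
    "Answer: positive",
    "Mostly complaints. Answer: negative",
    "Answer: neutral",
]
finals = [extract_final_answer(r) for r in runs]
winner, tied = vote_with_tie_check(finals)
print(finals, winner, tied)
```

In this example the five runs split 2-2-1 across three labels, so the tie flag is set and the "winner" should not be trusted; rerunning with a larger N or tightening the prompt is the safer response.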