What is self-consistency / majority voting?

Self-consistency is a prompting technique where you sample the same prompt several times at non-zero temperature and keep the answer the model produces most often. It typically improves accuracy on reasoning tasks versus a single sample.

How are answers grouped into votes?

Each response is normalized — trimmed, lowercased and stripped of surrounding punctuation and whitespace — then exact-matched. Responses that normalize to the same string share a vote cluster, so trivial formatting differences do not split the vote.
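The normalization step described above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual code: the function names `normalize` and `cluster` are hypothetical, and the set of punctuation stripped (ASCII `string.punctuation`) is an assumption, since the page does not list the exact characters.

```python
import string
from collections import defaultdict

# Assumption: "surrounding punctuation" means ASCII punctuation only.
# Curly quotes and other Unicode marks would need to be added explicitly.
_EDGE_CHARS = string.whitespace + string.punctuation

def normalize(answer: str) -> str:
    """Strip whitespace/punctuation from both ends, then lowercase.

    Interior characters are left alone, so "3.14" stays "3.14".
    """
    return answer.strip(_EDGE_CHARS).lower()

def cluster(responses: list[str]) -> dict[str, list[str]]:
    """Group raw responses under their normalized form (exact match)."""
    clusters: dict[str, list[str]] = defaultdict(list)
    for raw in responses:
        clusters[normalize(raw)].append(raw)
    return dict(clusters)

groups = cluster(["Yes.", " yes", "YES!", "No"])
for key, members in groups.items():
    print(f"{key!r}: {len(members)} vote(s)")
```

Here the three differently formatted "yes" responses land in one cluster and "No" in another, which is the behavior the answer above describes.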

Does this cost more than one call?

Yes. Running N samples makes N separate API calls, so you are billed for roughly N times the tokens of a single request (the prompt is sent N times, and each response is billed separately). The tool runs the calls in parallel against your own key, so you pay your provider directly.

Your key stays in the browser tab and is sent only to the provider's official endpoint in each request. It is never logged, stored or transmitted to Gera.

What is the Multi-Output Majority Voter (BYO-key)?

The Multi-Output Majority Voter sends the same prompt to your own OpenAI or Anthropic API key N times, then uses majority voting over the normalized answers to pick the most consistent response. It is a self-consistency tool for high-stakes LLM QA that runs free in your browser on Gera Tools: your key is never stored, and nothing is uploaded to Gera.

Name: Multi-Output Majority Voter (BYO-key)
Creator: Gera Tools
License: https://creativecommons.org/licenses/by/4.0/

Run a prompt N times and keep the majority answer

Large language models are non-deterministic: ask the same question twice at a normal temperature and you can get two different answers. Self-consistency turns that variance into a reliability signal. This tool sends your prompt to your own OpenAI or Anthropic key N times, groups the responses by content and surfaces the majority answer along with how strongly the runs agreed.

The self-consistency technique

Self-consistency was introduced as a prompting strategy for improving accuracy on reasoning tasks. Instead of trusting a single model sample, you collect multiple samples and keep the most common answer. The intuition is that correct reasoning paths tend to converge on the same answer, while mistakes scatter across many different wrong answers, so the correct answer should appear more often than any single incorrect one. Agreement across runs is evidence of reliability, though not proof, since a model can be consistently wrong; disagreement suggests that the prompt is ambiguous or the answer is genuinely uncertain.

How it works

Each of the N requests is an independent API call at the model’s default sampling temperature. When all responses return, the tool normalizes every answer — trimming whitespace, lowercasing, and stripping wrapping punctuation — and then buckets identical normalized strings together. The bucket with the most votes is the majority answer, and the agreement percentage (winning votes divided by N) tells you whether the model is confident (one dominant cluster) or unstable (many small clusters).
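As a rough sketch of the counting step, the snippet below takes answers that have already been normalized, finds the largest bucket, and computes the agreement ratio (winning votes divided by N). The function name `majority_vote` and the `tied` flag are illustrative assumptions; the page does not say how the tool itself reports ties.

```python
from collections import Counter

def majority_vote(answers: list[str]) -> tuple[str, float, bool]:
    """Return (winner, agreement, tied) for already-normalized answers.

    agreement is winning votes divided by N; tied is True when another
    bucket has the same number of votes as the winner.
    """
    if not answers:
        raise ValueError("need at least one answer")
    ranked = Counter(answers).most_common()  # sorted by votes, descending
    winner, votes = ranked[0]
    tied = len(ranked) > 1 and ranked[1][1] == votes
    return winner, votes / len(answers), tied

winner, agreement, tied = majority_vote(["42", "42", "41", "42", "24"])
print(f"{winner} with {agreement:.0%} agreement, tie={tied}")
# prints: 42 with 60% agreement, tie=False
```

Three of the five runs agree, so the agreement is 60%: a majority, but with two stray clusters that may be worth inspecting.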

High disagreement is itself useful information: it means the task is ambiguous, the prompt is under-specified, or the answer is genuinely uncertain — in which case a higher-quality or more constrained prompt may be needed before deploying.

Choosing N

N	Notes
3	Minimum for a majority; cheap, fast
5	Good default balance of cost and stability
7	More stable; still reasonable cost
9+	For high-stakes or genuinely uncertain tasks only

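Use odd numbers to reduce ties. An odd N rules out ties only when there are two possible answers (for example yes/no); with three or more distinct answers a tie is still possible, such as a 2-2-1 split at N = 5. Every run is billed separately against your key, so estimate cost before running large N on expensive models.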
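A back-of-the-envelope cost estimate can be done before running. The sketch below assumes per-token pricing quoted per million tokens, with separate input and output rates; the function name `estimate_cost`, the token counts, and the prices are placeholders for illustration, not real provider rates, so substitute the figures from your provider's pricing page.

```python
def estimate_cost(
    n: int,
    prompt_tokens: int,
    output_tokens: int,
    input_price_per_m: float,
    output_price_per_m: float,
) -> float:
    """Estimated total cost of n independent calls, in the price's currency.

    Each call pays for the full prompt again plus its own output.
    """
    per_call = (
        prompt_tokens * input_price_per_m
        + output_tokens * output_price_per_m
    ) / 1_000_000
    return n * per_call

# Placeholder numbers: 400-token prompt, ~150-token answers,
# 3.00 per million input tokens, 15.00 per million output tokens.
for n in (3, 5, 7, 9):
    print(n, f"{estimate_cost(n, 400, 150, 3.00, 15.00):.5f}")
```

Because output length varies between samples, treat the result as an estimate rather than an exact bill.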

Tips on prompt design for voting

Majority voting works best for short, factual, or classification-style answers, so design the prompt so that the model returns a single token or a short phrase (yes/no, a label, a number). For long free-form answers, exact-match clustering rarely converges; apply the voter to the final extracted answer rather than the full prose, and use a prompt that explicitly asks for a specific format (for example, “respond with only: agree or disagree”).
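If you let the model reason at length and vote only on its conclusion, one option is to ask for a fixed final line and extract it before voting. The sketch below assumes a hypothetical convention in which the prompt instructs the model to end with a line of the form `Answer: <value>`; that format and the helper name `extract_final_answer` are illustrative, not something the tool prescribes.

```python
import re
from typing import Optional

# Assumed convention: the prompt asks the model to finish with a line
# such as "Answer: 42". Matching is case-insensitive and per line.
ANSWER_LINE = re.compile(
    r"^\s*answer\s*:\s*(.+?)\s*$", re.IGNORECASE | re.MULTILINE
)

def extract_final_answer(response: str) -> Optional[str]:
    """Return the text of the last 'Answer:' line, or None if absent."""
    matches = ANSWER_LINE.findall(response)
    return matches[-1] if matches else None

reply = "12 plus 30 gives the total.\nAnswer: 42\n"
print(extract_final_answer(reply))
```

Responses with no matching line return `None`, and it is usually safer to count those as a separate "unparseable" cluster than to drop them silently, since a high share of them indicates the format instruction is not being followed.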

Interpreting disagreement

A vote with no clear majority is not a failure of the tool; it is meaningful information about the task. High disagreement usually means one of three things: the prompt is ambiguous and the model is interpreting it in multiple ways, the task is inherently uncertain and the model reflects that uncertainty, or the model lacks sufficient knowledge to answer reliably. In each case, refining the prompt or providing more context is more productive than simply increasing N and hoping a majority emerges.