How is the required sample size calculated?

It uses the standard two-proportion power formula. The required sample per variant is n = (z_alpha + z_beta)^2 × [p1(1-p1) + p2(1-p2)] / (p2 - p1)^2, where p1 is the baseline success rate, p2 is the rate you want to detect, z_alpha comes from your significance level, and z_beta comes from the desired power (default 80%). Because the effect (p2 - p1) is squared in the denominator, halving it roughly quadruples the required sample.

Why does a small effect cost so much more to detect?

Sample size scales with the inverse square of the effect size, so halving the effect you want to detect roughly quadruples the calls — and the cost. Detecting a 1% lift can cost 100x more than detecting a 10% lift, which is why deciding the minimum meaningful effect upfront is the most important budgeting step.

What significance and power should I use?

The conventional defaults are a 5% significance level (alpha = 0.05) and 80% power (beta = 0.20). Use a stricter alpha (0.01) for high-stakes decisions, and higher power (90%) when missing a real effect is costly. Both raise the required sample size and therefore the cost.
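
To see how much stricter settings cost, the factor (z_alpha + z_beta)^2 can be compared across settings, since sample size is proportional to it. The sketch below is illustrative only: it assumes a two-sided test and uses Python's standard-library NormalDist for the critical values, and the helper name z_multiplier is hypothetical.

```python
from statistics import NormalDist

def z_multiplier(alpha: float, power: float, two_sided: bool = True) -> float:
    """Return (z_alpha + z_beta)^2, the factor that scales the sample size."""
    tail = alpha / 2 if two_sided else alpha
    z_alpha = NormalDist().inv_cdf(1 - tail)   # critical value for the significance level
    z_beta = NormalDist().inv_cdf(power)       # critical value for the desired power
    return (z_alpha + z_beta) ** 2

default = z_multiplier(0.05, 0.80)
for alpha, power in [(0.05, 0.80), (0.01, 0.80), (0.05, 0.90), (0.01, 0.90)]:
    factor = z_multiplier(alpha, power)
    print(f"alpha={alpha:.2f} power={power:.0%}: "
          f"factor {factor:.2f} ({factor / default:.2f}x the default)")
```

Under these assumptions, alpha = 0.01 needs roughly 1.5 times the default sample, 90% power roughly 1.3 times, and both together roughly 1.9 times.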

Does each variant need the full sample?

Yes. The computed n is per variant, so a two-variant A/B test needs 2n calls total. The tool's total-cost figure accounts for both arms.

Is anything uploaded?

No. All statistics are computed locally in your browser. Nothing you enter is transmitted.

What is the Prompt A/B Test Cost Calculator?

Given a dollar budget, cost per call, expected effect size, and significance level, this tool computes the required sample size for a prompt A/B test and tells you whether your budget can afford statistical significance, or what budget you would need. It runs free in your browser on Gera Tools, with nothing uploaded.

Prompt A/B Test Cost Calculator

Name: Prompt A/B Test Cost Calculator
Creator: Gera Tools
License: https://creativecommons.org/licenses/by/4.0/

Prompt A/B test cost calculator

Running prompt A versus prompt B is easy; knowing whether the difference you saw is real — and whether your budget can even afford to find out — is the hard part. This tool turns a budget into a detectable effect: it computes how many calls a statistically valid test needs, what that costs, and whether your budget clears the bar.

How it works

For a two-proportion test the required sample size per variant is:

n = (z_α + z_β)² × [p₁(1−p₁) + p₂(1−p₂)] / (p₂ − p₁)²

where p₁ is the baseline success rate, p₂ = p₁ + effect, z_α is the critical value for your significance level (one- or two-sided), and z_β is the value for your power (default 80%). The tool computes n, doubles it for the two arms, multiplies by your cost-per-call, and compares the result to your budget. If the budget is short, it reports the smallest effect you can detect within budget and the budget required to detect your target effect.
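
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual source: the function names are hypothetical, the z-values come from the standard library rather than the tool's own approximation, n is rounded up to a whole call, and the effect that just fits the budget (the minimum detectable effect) is found by bisection, which is one reasonable choice among several.

```python
import math
from statistics import NormalDist

def sample_size_per_variant(p1, effect, alpha=0.05, power=0.80, two_sided=True):
    """Calls needed per variant to detect a change from p1 to p1 + effect."""
    p2 = p1 + effect
    tail = alpha / 2 if two_sided else alpha
    z_alpha = NormalDist().inv_cdf(1 - tail)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)

def total_cost(p1, effect, cost_per_call, **kwargs):
    """Cost of the whole test: n calls in each of the two arms."""
    return 2 * sample_size_per_variant(p1, effect, **kwargs) * cost_per_call

def smallest_detectable_effect(p1, budget, cost_per_call, **kwargs):
    """Bisect for the smallest effect whose test cost still fits the budget."""
    lo, hi = 1e-6, 1 - p1 - 1e-6
    if total_cost(p1, hi, cost_per_call, **kwargs) > budget:
        return None  # even the largest possible effect is unaffordable
    for _ in range(60):
        mid = (lo + hi) / 2
        if total_cost(p1, mid, cost_per_call, **kwargs) <= budget:
            hi = mid   # affordable, so try a smaller effect
        else:
            lo = mid   # too expensive, so the effect must be larger
    return hi

budget, cost_per_call = 5.00, 0.02
needed = total_cost(0.70, 0.10, cost_per_call)
if needed <= budget:
    print(f"Budget is sufficient: the test costs ${needed:.2f}")
else:
    mde = smallest_detectable_effect(0.70, budget, cost_per_call)
    print(f"Need ${needed:.2f}; within ${budget:.2f} the smallest detectable effect is {mde:.3f}")
```

With a 70% baseline and a 0.10 effect, this sketch returns 291 calls per variant because it rounds up; a $5 budget at $0.02 per call falls short and supports an effect of roughly 0.15 instead.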

The critical z-values are obtained by inverting the standard normal CDF using a rational approximation (Acklam’s algorithm), which has a relative error of about 1e-9. That is far more precision than experiment planning needs.
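
For readers who want to see what such an approximation looks like, here is a sketch of Acklam's rational approximation in Python. The coefficients are reproduced from the commonly circulated listing of the algorithm and should be checked against a trusted reference before reuse; the function name acklam_inv_cdf is hypothetical, and this is not claimed to be the tool's own code. The final loop compares the result with the standard library's NormalDist as a sanity check.

```python
import math
from statistics import NormalDist

# Coefficients as commonly published for Acklam's approximation (verify before reuse).
A = [-3.969683028665376e+01, 2.209460984245205e+02, -2.759285104469687e+02,
     1.383577518672690e+02, -3.066479806614716e+01, 2.506628277459239e+00]
B = [-5.447609879822406e+01, 1.615858368580409e+02, -1.556989798598866e+02,
     6.680131188771972e+01, -1.328068155288572e+01]
C = [-7.784894002430293e-03, -3.223964580411365e-01, -2.400758277161838e+00,
     -2.549732539343734e+00, 4.374664141464968e+00, 2.938163982698783e+00]
D = [7.784695709041462e-03, 3.224671290700398e-01, 2.445134137142996e+00,
     3.754408661907416e+00]
P_LOW = 0.02425  # boundary between the tail and central regions

def _tail(q: float) -> float:
    """Rational approximation used in both tails (returns a negative z)."""
    num = ((((C[0] * q + C[1]) * q + C[2]) * q + C[3]) * q + C[4]) * q + C[5]
    den = (((D[0] * q + D[1]) * q + D[2]) * q + D[3]) * q + 1.0
    return num / den

def acklam_inv_cdf(p: float) -> float:
    """Approximate the inverse standard normal CDF for 0 < p < 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    if p < P_LOW:                       # lower tail
        return _tail(math.sqrt(-2.0 * math.log(p)))
    if p > 1.0 - P_LOW:                 # upper tail, by symmetry
        return -_tail(math.sqrt(-2.0 * math.log(1.0 - p)))
    q = p - 0.5                         # central region
    r = q * q
    num = (((((A[0] * r + A[1]) * r + A[2]) * r + A[3]) * r + A[4]) * r + A[5]) * q
    den = ((((B[0] * r + B[1]) * r + B[2]) * r + B[3]) * r + B[4]) * r + 1.0
    return num / den

for p in (0.80, 0.90, 0.95, 0.975, 0.995):
    exact = NormalDist().inv_cdf(p)
    print(f"p={p}: approx={acklam_inv_cdf(p):.9f}  difference={acklam_inv_cdf(p) - exact:.2e}")
```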

Worked example

Suppose you are testing two summarisation prompts. The baseline produces correct summaries 70% of the time (p₁ = 0.70). You want to detect an improvement to 80% (effect = 0.10) with 80% power at a two-sided 5% significance level. Each API call costs $0.02.

Applying the formula gives approximately 290 calls per variant, or 580 total calls. Total cost: 580 × $0.02 = $11.60, so a realistic test for a 10-percentage-point improvement costs about $12.

Now consider detecting a 2-percentage-point improvement (effect = 0.02, so p₂ = 0.72):

n per variant ≈ 8,080; total ≈ 16,160 calls
Total cost: 16,160 × $0.02 ≈ $323

The same test costs about 28× more when the target effect is 5× smaller. The 1/effect² relationship alone accounts for 25×; the rest comes from the variance term, because p₂(1−p₂) is larger at 0.72 than at 0.80. This scaling is the most important budgeting insight for LLM experiments.
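
The scaling can be checked numerically. The sketch below assumes the same settings as the worked example (70% baseline, two-sided alpha of 0.05, 80% power, $0.02 per call) and compares the exact formula with the pure 1/effect² rule of thumb. The exact figure runs somewhat higher than the rule of thumb because the variance term also changes with p₂. The helper name n_per_variant is hypothetical.

```python
import math
from statistics import NormalDist

# z_alpha + z_beta for a two-sided 5% test with 80% power.
Z = NormalDist().inv_cdf(0.975) + NormalDist().inv_cdf(0.80)

def n_per_variant(p1: float, effect: float) -> int:
    """Two-proportion sample size per variant, rounded up to a whole call."""
    p2 = p1 + effect
    return math.ceil(Z ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / effect ** 2)

reference = n_per_variant(0.70, 0.10)
for effect in (0.10, 0.05, 0.02, 0.01):
    n = n_per_variant(0.70, effect)
    cost = 2 * n * 0.02                      # both arms at $0.02 per call
    rule_of_thumb = (0.10 / effect) ** 2     # pure 1/effect^2 prediction
    print(f"effect={effect:.2f}: n={n:>6}  cost=${cost:>8.2f}  "
          f"exact ratio={n / reference:5.1f}x  1/effect^2 ratio={rule_of_thumb:5.1f}x")
```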

Choosing realistic inputs

Baseline rate. Run a sample of 50–100 calls on the current prompt before setting up the test. A measured baseline is more accurate than a guess, and small errors near extremes (below 0.3 or above 0.7) can shift sample size significantly.

Effect size. Use the minimum effect that would actually change a decision, not the smallest you could imagine measuring. If a 3% improvement would not change which prompt you deploy, testing for 3% is wasted budget.

Cost per call. Include both variant costs if prompt lengths differ. A longer prompt costs more per call, which should be reflected in the per-call estimate for that arm.
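
Two of the points above lend themselves to a quick numerical check: how a mis-estimated baseline shifts the sample size, and how to price a test whose arms cost different amounts per call. The sketch below assumes a 0.05 effect, a two-sided alpha of 0.05, and 80% power; the function names and the per-arm prices are hypothetical.

```python
import math
from statistics import NormalDist

def n_per_variant(p1, effect, alpha=0.05, power=0.80):
    """Two-sided, two-proportion sample size per variant, rounded up."""
    p2 = p1 + effect
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return math.ceil(z ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / effect ** 2)

def cost_with_unequal_arms(n, cost_a, cost_b):
    """Each arm runs n calls, so the per-call costs simply add."""
    return n * (cost_a + cost_b)

# A 5-point baseline error matters little near 0.5 and a lot near the extremes.
for guess, actual in [(0.50, 0.55), (0.85, 0.90)]:
    n_guess = n_per_variant(guess, 0.05)
    n_actual = n_per_variant(actual, 0.05)
    change = (n_actual - n_guess) / n_guess
    print(f"baseline {guess:.2f} vs {actual:.2f}: n = {n_guess} vs {n_actual} ({change:+.0%})")

# Prompt B is longer, so assume it costs $0.03 per call against $0.02 for prompt A.
n = n_per_variant(0.70, 0.10)
print(f"unequal arms: ${cost_with_unequal_arms(n, 0.02, 0.03):.2f}")
```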

Tips and notes

The single biggest cost lever is the minimum effect size you decide is worth detecting. Picking a realistic, business-meaningful effect keeps the test affordable.
Sample size scales with 1/effect², so a tiny target effect explodes the budget — confirm the effect would actually change a decision before paying to detect it.
Cheaper calls (smaller model for the eval, cached system prompt) directly lower the test cost without changing the statistics.
Once you have run the test, validate the result with the companion A/B significance calculator, which computes the actual p-value and confidence interval from your observed counts.
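
As a rough idea of what that post-test check involves, here is a sketch of a pooled two-proportion z-test with a Wald confidence interval for the difference. This is one common textbook approach, shown under the assumption of a two-sided test; the companion calculator's exact method is not described here, and the function name and the counts in the usage lines are made up for illustration.

```python
import math
from statistics import NormalDist

def two_proportion_test(success_a, n_a, success_b, n_b, alpha=0.05):
    """Return (p_value, (ci_low, ci_high)) for the difference p_b - p_a."""
    p_a, p_b = success_a / n_a, success_b / n_b
    diff = p_b - p_a
    # Pooled standard error for the test of "no difference".
    pooled = (success_a + success_b) / (n_a + n_b)
    se_pooled = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = diff / se_pooled
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    # Unpooled standard error for the confidence interval.
    se_unpooled = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    margin = NormalDist().inv_cdf(1 - alpha / 2) * se_unpooled
    return p_value, (diff - margin, diff + margin)

# Illustrative counts: 203/290 successes for prompt A, 232/290 for prompt B.
p_value, (low, high) = two_proportion_test(203, 290, 232, 290)
print(f"p-value = {p_value:.4f}, 95% CI for the lift = [{low:.3f}, {high:.3f}]")
```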