Prompt A/B test cost calculator
Running prompt A versus prompt B is easy; knowing whether the difference you saw is real — and whether your budget can even afford to find out — is the hard part. This tool turns a budget into a detectable effect: it computes how many calls a statistically valid test needs, what that costs, and whether your budget clears the bar.
How it works
For a two-proportion test the required sample size per variant is:
n = (z_α + z_β)² × [p₁(1−p₁) + p₂(1−p₂)] / (p₂ − p₁)²
where p₁ is the baseline success rate, p₂ = p₁ + effect, z_α is the standard normal critical
value for your significance level α (the 1 − α/2 quantile for a two-sided test, the 1 − α quantile
for a one-sided test), and z_β is the quantile for your power (default 80%, so z_β ≈ 0.84). The
tool computes n, doubles it for the two arms, multiplies by your cost-per-call, and compares the
result to your budget. If the budget is short, it reports the smallest effect you can detect
within budget and the budget required to detect your target effect.
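The calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual code: the function names (`sample_size_per_variant`, `plan_test`, `detectable_effect_within_budget`) and the example numbers are hypothetical, and the z-values here come from the standard library's `statistics.NormalDist` instead of the tool's own approximation. The budget search assumes a simple bisection, which works because the required n falls as the effect grows.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(p1, effect, alpha=0.05, power=0.80, two_sided=True):
    """Calls needed per variant to detect a lift of `effect` over baseline `p1`."""
    p2 = p1 + effect
    if effect <= 0 or not (0 < p1 < 1 and 0 < p2 < 1):
        raise ValueError("need effect > 0 and both rates strictly inside (0, 1)")
    tail = alpha / 2 if two_sided else alpha
    z_alpha = NormalDist().inv_cdf(1 - tail)   # critical value for significance
    z_beta = NormalDist().inv_cdf(power)       # quantile for power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


def plan_test(p1, effect, cost_per_call, budget, **kwargs):
    """Sample size, total cost, and whether the budget covers the test."""
    n = sample_size_per_variant(p1, effect, **kwargs)
    total_calls = 2 * n                        # two arms: A and B
    cost = total_calls * cost_per_call
    return {"n_per_variant": n, "total_calls": total_calls,
            "cost": cost, "affordable": cost <= budget}


def detectable_effect_within_budget(p1, cost_per_call, budget, **kwargs):
    """Bisect for the effect whose required sample size just fits the budget."""
    max_n = budget / (2 * cost_per_call)       # calls per variant the budget buys
    lo, hi = 1e-6, 1 - p1 - 1e-6
    if sample_size_per_variant(p1, hi, **kwargs) > max_n:
        return None                            # budget too small for any effect
    for _ in range(60):
        mid = (lo + hi) / 2
        if sample_size_per_variant(p1, mid, **kwargs) <= max_n:
            hi = mid                           # fits: try a smaller effect
        else:
            lo = mid                           # too expensive: need a larger effect
    return hi


# Hypothetical inputs: 70% baseline, 5-point lift, $0.01 per call, $20 budget.
plan = plan_test(p1=0.70, effect=0.05, cost_per_call=0.01, budget=20.0)
print(plan)
print(detectable_effect_within_budget(p1=0.70, cost_per_call=0.01, budget=20.0))
```

With these example inputs the budget falls short of the target, so the second call reports the effect the budget can actually support.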
The critical z-values are obtained by inverting the standard normal CDF with a rational approximation (Acklam’s algorithm), whose relative error is about 1.15e-9, far more precision than experiment planning needs.
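For readers curious what such an inversion looks like, here is a sketch of Acklam's approximation in Python. The coefficients are the commonly published ones for this algorithm; the function and helper names (`inverse_normal_cdf`, `_tail`) are illustrative, and the tool's own implementation may differ in detail.

```python
import math

# Commonly published coefficients for Acklam's rational approximation.
_A = (-3.969683028665376e+01, 2.209460984245205e+02, -2.759285104469687e+02,
      1.383577518672690e+02, -3.066479806614716e+01, 2.506628277459239e+00)
_B = (-5.447609879822406e+01, 1.615858368580409e+02, -1.556989798598866e+02,
      6.680131188771972e+01, -1.328068155288572e+01)
_C = (-7.784894002430293e-03, -3.223964580411365e-01, -2.400758277161838e+00,
      -2.549732539343734e+00, 4.374664141464968e+00, 2.938163982698783e+00)
_D = (7.784695709041462e-03, 3.224671290700398e-01, 2.445134137142996e+00,
      3.754408661907416e+00)
_P_LOW = 0.02425  # boundary between the tail and central regions


def _tail(q):
    """Rational approximation used in the lower tail (returns a negative z)."""
    num = ((((_C[0] * q + _C[1]) * q + _C[2]) * q + _C[3]) * q + _C[4]) * q + _C[5]
    den = (((_D[0] * q + _D[1]) * q + _D[2]) * q + _D[3]) * q + 1.0
    return num / den


def inverse_normal_cdf(p):
    """Approximate z such that the standard normal CDF at z equals p."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    if p < _P_LOW:                                  # lower tail
        return _tail(math.sqrt(-2.0 * math.log(p)))
    if p > 1.0 - _P_LOW:                            # upper tail, by symmetry
        return -_tail(math.sqrt(-2.0 * math.log(1.0 - p)))
    q = p - 0.5                                     # central region
    r = q * q
    num = (((((_A[0] * r + _A[1]) * r + _A[2]) * r + _A[3]) * r + _A[4]) * r + _A[5]) * q
    den = ((((_B[0] * r + _B[1]) * r + _B[2]) * r + _B[3]) * r + _B[4]) * r + 1.0
    return num / den


# Two-sided 5% significance and 80% power, the defaults discussed above.
print(inverse_normal_cdf(0.975), inverse_normal_cdf(0.80))
```

The three-region split keeps each rational function within the range where it is accurate; the upper tail reuses the lower-tail formula through the symmetry of the normal distribution.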
Tips and notes
- The single biggest cost lever is the minimum effect size you decide is worth detecting. Picking a realistic, business-meaningful effect (not the smallest you can imagine) keeps the test affordable.
- Sample size scales with 1/effect², so a tiny target effect explodes the budget. Confirm the effect would actually change a decision before paying to detect it.
- Cheaper calls (smaller model for the eval, cached system prompt) directly lower the test cost without changing the statistics.
- Once you have run the test, validate the result with the companion A/B significance calculator, which computes the actual p-value and confidence interval from your observed counts.
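To make the 1/effect² scaling and the cost-per-call lever concrete, the short Python sketch below tabulates the per-variant sample size and total cost for a few target effects. Everything here is illustrative: the 70% baseline, the two per-call prices, and the helper name `calls_per_variant` are assumptions, and the z-values are hard-coded for a two-sided 5% test at 80% power.

```python
import math

# Assumed settings: two-sided alpha = 0.05 (z ≈ 1.96) and 80% power (z ≈ 0.8416).
Z_SUM_SQUARED = (1.959964 + 0.841621) ** 2


def calls_per_variant(p1, effect):
    """Per-variant sample size from the two-proportion formula."""
    p2 = p1 + effect
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil(Z_SUM_SQUARED * variance / effect ** 2)


baseline = 0.70                      # hypothetical baseline success rate
prices = {"large model": 0.010,      # hypothetical cost per call, in dollars
          "small model": 0.002}

for effect in (0.10, 0.05, 0.025):
    n = calls_per_variant(baseline, effect)
    costs = ", ".join(f"{name}: ${2 * n * price:,.2f}"
                      for name, price in prices.items())
    print(f"effect={effect:.3f}  n per variant={n:>5}  total cost -> {costs}")
```

Each halving of the target effect raises the required calls by roughly a factor of four (not exactly four, because the variance term shifts slightly with p₂), while a cheaper per-call price scales the cost down linearly and leaves n unchanged.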