
A/B Test Sample Size Calculator

Calculate the sample size needed for statistically significant A/B test results.

Dr. Marcus Chen, Ph.D. Statistics
Senior Analytics Engineer

Inputs

Baseline conversion rate: current conversion rate as a percentage (0-100%)

Minimum detectable effect: smallest improvement you want to detect, as a relative percentage (e.g., 10% for a 10% relative lift)

Significance level (alpha): probability of a Type I error; typically 0.05 for 95% confidence

Statistical power (1 - beta): probability of detecting a true effect; typically 80% or 90%

Number of groups: total groups including control (e.g., 2 for an A/B test, 3 for an A/B/C test)

Results

Sample Size Per Group
Number of visitors needed in each test group
Total Sample Size Required
Effect Size (Cohen's h)
Critical Z-Score
Formula
n = (Z_α + Z_β)² × p(1-p) / h² × k / (k-1), where h = 2 × arcsin(√p₁) - 2 × arcsin(√p₀)

A/B testing is fundamental to data-driven decision making, but running an underpowered test wastes resources and risks false conclusions. This calculator determines the exact number of visitors you need in each test group to detect a meaningful difference with statistical confidence. By inputting your baseline conversion rate, desired lift, and power requirements, you'll get precise sample size recommendations that balance statistical rigor with practical feasibility. Whether you're optimizing landing pages, email campaigns, or product features, understanding sample size requirements ensures your experiments deliver reliable insights.

How it works

The calculator uses the two-proportion z-test framework, converting conversion rates into effect sizes using Cohen's h statistic. This approach accounts for the non-normal distribution of proportion data. The formula integrates critical values from the standard normal distribution for both Type I error (alpha) and Type II error (beta). For multi-group tests, the calculation adjusts sample size requirements upward to maintain power across multiple comparisons. The significance level determines how confidently you reject the null hypothesis, with 0.05 (95% confidence) being industry standard. Statistical power represents your probability of detecting the true effect if it exists, typically set at 80% or 90%. The minimum detectable effect reflects the smallest practical improvement worth detecting given your business context.
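The effect-size step described here, the arcsine transformation behind Cohen's h, can be sketched in a few lines of Python (the function name is illustrative):

```python
import math

def cohens_h(p1: float, p0: float) -> float:
    """Cohen's h effect size for two proportions via the arcsine transform."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p0))

# A 5% baseline moving to 6% is a small effect on this scale.
print(round(cohens_h(0.06, 0.05), 4))  # 0.0439
```

The transform stabilizes the variance of proportions, which is why h, rather than the raw difference in rates, feeds into the sample-size formula.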

Formula
n = (Z_α + Z_β)² × p(1-p) / h² × k / (k-1), where h = 2 × arcsin(√p₁) - 2 × arcsin(√p₀)
n = sample size per group, Z_α and Z_β are critical values for significance and power, p is average conversion rate, h is Cohen's h effect size for proportions, k is number of groups
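The formula can be transcribed directly into Python. This is a sketch, not the calculator's source: it assumes a two-sided test (so the significance critical value is taken at 1 - alpha/2), and the live calculator's sidedness and rounding conventions may differ, so results can deviate from its output.

```python
import math
from statistics import NormalDist

def sample_size_per_group(p0: float, p1: float, alpha: float = 0.05,
                          power: float = 0.80, k: int = 2) -> int:
    """n = (Z_a + Z_b)^2 * p(1-p) / h^2 * k/(k-1), rounded up."""
    norm = NormalDist()
    z_alpha = norm.inv_cdf(1 - alpha / 2)   # two-sided critical value (assumption)
    z_beta = norm.inv_cdf(power)            # power quantile
    p_bar = (p0 + p1) / 2                   # average conversion rate
    h = 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p0))  # Cohen's h
    n = (z_alpha + z_beta) ** 2 * p_bar * (1 - p_bar) / h ** 2 * k / (k - 1)
    return math.ceil(n)
```

Note that h appears squared in the denominator: halving the detectable lift roughly quadruples the required sample per group.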

Worked example

An e-commerce site sees 5% conversion on its current checkout flow. Product managers want to test a new single-page checkout, requiring 80% power to detect a 20% relative lift to a 6% conversion rate. With standard 95% confidence (5% significance), the calculator returns 1,556 visitors per group, or 3,112 total. This means the site needs roughly 1,556 visitors in each variant to distinguish whether the new checkout genuinely outperforms the original.

Why Sample Size Matters in A/B Testing

Running an A/B test without sufficient sample size is like checking the weather with a broken thermometer. Undersized samples lead to high variance in results, making it difficult to distinguish signal from noise. You risk Type II errors (concluding no difference exists when a true effect does), resulting in missed opportunities. Conversely, oversized samples waste traffic and time. The sweet spot maximizes statistical power while respecting practical constraints. Professional data teams calculate the required sample size before launching tests, ensuring decisions rest on solid evidence rather than random fluctuation. This upfront investment in planning saves money by eliminating underpowered experiments and false discoveries.

Understanding Statistical Significance and Power

Statistical significance (alpha) measures the risk of falsely rejecting the null hypothesis when no true difference exists: this is a Type I error. Setting alpha to 0.05 means accepting a 5% chance of calling a random fluctuation significant. Statistical power (1-beta) measures your ability to detect a true effect when it exists, with 80% being a common minimum. These two parameters control your false discovery and false negative rates. Higher power requires larger samples, because more observations are necessary to reliably detect true effects. The trade-off between alpha and power reflects your tolerance for different error types. Conservative industries might demand 90% power, while experimental startups may accept 70% to move faster.
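The alpha and power settings map to the z-scores in the formula via the inverse normal CDF; Python's standard library reproduces the conventional values:

```python
from statistics import NormalDist

norm = NormalDist()
z_alpha = norm.inv_cdf(1 - 0.05 / 2)  # two-sided alpha = 0.05
z_beta = norm.inv_cdf(0.80)           # power = 80%
print(round(z_alpha, 2), round(z_beta, 2))  # 1.96 0.84
```

Tightening either parameter pushes its z-score up, and the sample size grows with the square of their sum.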

Interpreting Effect Size and Practical Significance

Effect size quantifies the magnitude of difference between variants, independent of sample size. Cohen's h converts proportion differences into a standardized metric: an effect size of 0.2 is small, 0.5 medium, 0.8 large. Larger effects require smaller samples to detect with confidence. Practical significance differs from statistical significance: a 0.1% conversion improvement might be statistically significant with huge samples but commercially irrelevant. Define your minimum detectable effect based on business impact: what lift justifies implementation costs? If your profit margin is 20%, a 1% conversion lift might be worthwhile; on a 2% margin, you would need lifts well above that to break even. Balancing statistical power with practical significance prevents wasting resources on statistically valid but business-irrelevant improvements.
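To see why conversion-rate tests demand large samples, compute Cohen's h for a few relative lifts from an illustrative 5% baseline:

```python
import math

def arcsine(p: float) -> float:
    """Variance-stabilizing arcsine transform of a proportion."""
    return 2 * math.asin(math.sqrt(p))

baseline = 0.05
for rel_lift in (0.10, 0.20, 0.50):
    target = baseline * (1 + rel_lift)
    h = arcsine(target) - arcsine(baseline)  # Cohen's h
    print(f"{rel_lift:.0%} lift -> h = {h:.3f}")
```

Even a 50% relative lift from a 5% baseline yields h near 0.1, well below Cohen's "small" threshold of 0.2, which is why low-baseline conversion experiments need such large samples.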

Multi-Group and Multivariate Testing

The calculator supports testing multiple variants simultaneously. A/B testing compares two groups; A/B/C testing adds a third. Multivariate testing explores combinations of factors. As the group count increases, the required sample size grows to maintain power across all comparisons. This accounts for the multiple comparison problem: with more groups, random chance produces more spurious significant results. A/B/C/D testing requires roughly twice the total sample of a two-group test to achieve equivalent power, since each added variant needs its own full group. Many practitioners start with two-group tests for speed and clarity, then layer multivariate experiments once foundational insights solidify. Budget tests strategically to balance exploration speed against statistical rigor.

Practical Implementation and Time Horizons

Sample size calculations determine visitor requirements, but actual test duration depends on traffic volume and seasonal patterns. A website attracting 10,000 daily visitors can collect a 3,112-visitor total sample in under a day; a low-traffic site might need 2-3 weeks. Account for traffic seasonality and day-of-week effects that introduce bias. Run tests for at least 1-2 business cycles to capture natural variation. Avoid 'peeking' at results mid-test, as this inflates false positive rates. Document baseline metrics, expected effect, and sample size targets before launching. After reaching the calculated sample size, analyze results, but resist the urge to extend the test if results approach significance: this introduces bias favoring borderline results over truly strong effects.

Frequently asked questions

What conversion rate baseline should I use?
Use your historical average for the metric you're testing. If the metric is new, benchmark against competitors or industry data. Be conservative if uncertain: using lower baselines increases the required sample size, providing a margin of safety. Update baselines periodically as your product evolves.
What's a realistic minimum detectable effect?
Define lift based on business impact and cost. E-commerce sites often target 5-20% relative lifts; SaaS platforms might target 10-30%. Specify relative lift (percentage improvement) rather than absolute points. A 5% baseline becoming 5.5% is 10% relative lift. Discuss with stakeholders to set realistic, actionable targets.
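The relative-versus-absolute distinction is a one-liner worth being explicit about:

```python
baseline = 0.05          # 5% baseline conversion rate
relative_lift = 0.10     # 10% relative improvement
target = baseline * (1 + relative_lift)
print(f"{target:.1%}")   # 5.5%
```

Stating the minimum detectable effect as a relative lift keeps targets comparable across metrics with very different baselines.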
Why does significance level matter?
Lower alpha (stricter significance) requires larger samples to detect effects with equivalent power. Standard 0.05 balances false positive risk against practical feasibility. Clinical trials use 0.01 for drug safety; tech companies often use 0.05 or 0.10. Choose based on cost of false positives in your context.
Should I always use 80% power?
80% power is conventional but negotiable. Higher power (90%+) detects smaller effects but requires larger samples. Lower power (70%) saves sample size but misses true effects 30% of the time. Discuss trade-offs with stakeholders, considering traffic constraints and business priorities.
How do I account for multiple testing corrections?
The calculator adjusts for multiple groups using the ratio k/(k-1), where k is group count. For sequential testing or multiple metrics, consult statistical resources on alpha spending and corrections like Bonferroni. Each additional hypothesis tested increases the probability of false discoveries.
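For reference, the Bonferroni correction mentioned here is simply a division of alpha across the number of comparisons (a conservative baseline; alpha-spending approaches for sequential testing are less strict):

```python
def bonferroni_alpha(alpha: float, m: int) -> float:
    """Per-comparison significance level for m simultaneous hypotheses."""
    return alpha / m

# Three variants compared against one control -> three hypotheses.
print(round(bonferroni_alpha(0.05, 3), 4))  # 0.0167
```

Each comparison is then judged against the stricter per-comparison threshold, keeping the family-wise error rate at or below the original alpha.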
What if I can't reach the calculated sample size?
If traffic limits sample size, either increase minimum detectable effect to realistic levels, reduce statistical power requirements, or extend test duration. Ensure you're not forced into detecting only large, obvious effects. Consider segmentation to increase sample rate within high-value audiences.
Can I stop testing early if results look significant?
No. Stopping early inflates false positive rates by exploiting random fluctuation. Run tests to predetermined sample sizes regardless of interim results. If you must peek for safety reasons, use sequential testing methods with pre-specified spending schedules, not traditional significance thresholds.