A/B testing is fundamental to data-driven decision making, but running an underpowered test wastes resources and risks false conclusions. This calculator estimates the minimum number of visitors you need in each test group to detect a meaningful difference with statistical confidence. By inputting your baseline conversion rate, desired lift, and power requirements, you'll get sample size recommendations that balance statistical rigor with practical feasibility. Whether you're optimizing landing pages, email campaigns, or product features, understanding sample size requirements ensures your experiments deliver reliable insights.
How it works
The calculator uses the two-proportion z-test framework, converting conversion rates into an effect size using Cohen's h statistic, an arcsine transformation that stabilizes the variance of proportion data. The per-group sample size follows the standard power formula n = 2 × (z₁₋α/₂ + z₁₋β)² / h², combining critical values from the standard normal distribution for both Type I error (alpha) and Type II error (beta). For multi-group tests, the calculation adjusts sample size requirements upward to maintain power across multiple comparisons. The significance level determines how confidently you reject the null hypothesis, with 0.05 (95% confidence) being the industry standard. Statistical power is your probability of detecting a true effect if one exists, typically set at 80% or 90%. The minimum detectable effect is the smallest practical improvement worth detecting given your business context.
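The method described above can be sketched in a few lines of Python. This is the standard Cohen's-h power calculation rather than the calculator's exact source, and the function name is illustrative:

```python
import math
from statistics import NormalDist

def sample_size_per_group(p1: float, p2: float,
                          alpha: float = 0.05, power: float = 0.80) -> int:
    """Visitors needed per group for a two-sided, two-proportion z-test."""
    # Cohen's h: arcsine-transformed difference between the two rates
    h = abs(2 * math.asin(math.sqrt(p2)) - 2 * math.asin(math.sqrt(p1)))
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided Type I critical value
    z_beta = NormalDist().inv_cdf(power)           # critical value for power (1 - beta)
    # The factor of 2 reflects two independent groups, which doubles
    # the variance of the estimated difference
    return math.ceil(2 * (z_alpha + z_beta) ** 2 / h ** 2)
```

Call it with the baseline and target rates, e.g. `sample_size_per_group(0.05, 0.06)` for the worked example below; raising the target rate (a larger effect) shrinks the requirement.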
Worked example
An e-commerce site sees 5% conversion on its current checkout flow. Product managers want to test a new single-page checkout, requiring 80% power to detect a 20% relative lift to a 6% conversion rate. With standard 95% confidence (5% significance, two-sided), the calculator returns roughly 8,143 visitors per group, or about 16,286 total. The site therefore needs to route roughly 8,143 visitors (not conversions) through each variant before it can distinguish whether the new checkout genuinely outperforms the original.
Why Sample Size Matters in A/B Testing
Running an A/B test without sufficient sample size is like checking the weather with a broken thermometer. Undersized samples produce high-variance results, making it difficult to distinguish signal from noise. You risk Type II errors, concluding no difference exists when a true effect does, and missing real opportunities. Conversely, over-large samples waste traffic and time. The sweet spot maximizes statistical power while respecting practical constraints. Professional data teams calculate required sample size before launching tests, ensuring decisions rest on solid evidence rather than random fluctuation. This upfront investment in planning saves money by eliminating underpowered experiments and false discoveries.
Understanding Statistical Significance and Power
Statistical significance (alpha) measures the risk of falsely rejecting the null hypothesis when no true difference exists (a Type I error). Setting alpha to 0.05 means accepting a 5% chance of calling a random fluctuation significant. Statistical power (1 - beta) measures your ability to detect a true effect when it exists, with 80% being a common minimum. Together these two parameters control your false positive and false negative rates. Higher power requires larger samples: more observations are needed to reliably detect true effects. The trade-off between alpha and power reflects your tolerance for different error types. Conservative industries might demand 90% power, while experimental startups may accept 70% to move faster.
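To see how the power setting alone drives sample size, you can compare the (z₁₋α/₂ + z₁₋β)² multiplier that scales every sample size formula above. A small illustrative sketch (function name is ours, not the calculator's):

```python
from statistics import NormalDist

def z_multiplier(power: float, alpha: float = 0.05) -> float:
    """The (z_alpha/2 + z_beta)^2 factor that scales required sample size."""
    z = NormalDist().inv_cdf
    return (z(1 - alpha / 2) + z(power)) ** 2

# Multiplier grows steeply as the power requirement rises
for p in (0.70, 0.80, 0.90):
    print(f"power {p:.0%}: multiplier {z_multiplier(p):.2f}")
```

Moving from 80% to 90% power inflates the required sample by roughly a third, regardless of the effect size being chased.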
Interpreting Effect Size and Practical Significance
Effect size quantifies the magnitude of difference between variants, independent of sample size. Cohen's h converts proportion differences into a standardized metric: by Cohen's conventions, 0.2 is small, 0.5 medium, 0.8 large. Larger effects require smaller samples to detect with confidence. Practical significance differs from statistical significance: a 0.1% conversion improvement might be statistically significant with huge samples but commercially irrelevant. Define your minimum detectable effect based on business impact: what lift justifies implementation costs? A high-margin business might profit from detecting even a 1% relative lift, while a thin-margin business may need a far larger lift before the change pays for itself. Balancing statistical power with practical significance prevents wasting resources on statistically valid but business-irrelevant improvements.
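Cohen's h itself is simple to compute. A quick sketch (the function name is illustrative):

```python
import math

def cohens_h(p1: float, p2: float) -> float:
    """Standardized effect size for the difference between two proportions."""
    return abs(2 * math.asin(math.sqrt(p2)) - 2 * math.asin(math.sqrt(p1)))

# The same one-point absolute difference yields a larger standardized
# effect at a low baseline than near 50%, because the arcsine transform
# reflects the lower variance of proportions near 0 or 1.
print(cohens_h(0.05, 0.06))
print(cohens_h(0.45, 0.46))
```

Both values land well below Cohen's "small" threshold of 0.2, which is why conversion-rate tests at realistic baselines demand thousands of visitors per group.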
Multi-Group and Multivariate Testing
The calculator supports testing multiple variants simultaneously. A/B testing compares two groups; A/B/C testing adds a third. Multivariate testing explores combinations of factors. As group count increases, required sample size grows to maintain power across all comparisons. This accounts for the multiple comparison problem: with more groups, random chance produces more spurious significant results. An A/B/C/D test needs well over twice the total sample of an A/B test to achieve equivalent power: four groups instead of two, each somewhat larger once the significance threshold is corrected for multiple comparisons. Many practitioners start with two-group tests for speed and clarity, then layer in multivariate experiments once foundational insights solidify. Budget tests strategically to balance exploration speed against statistical rigor.
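One common, conservative way to budget a multi-group test is a Bonferroni correction: divide alpha by the number of variant-versus-control comparisons before computing per-group size. A sketch under that assumption (the calculator's own adjustment may differ, and the function name is illustrative):

```python
import math
from statistics import NormalDist

def total_sample(p1: float, p2: float, groups: int = 2,
                 alpha: float = 0.05, power: float = 0.80) -> int:
    """Total visitors across all groups, Bonferroni-adjusting alpha
    for the (groups - 1) variant-vs-control comparisons."""
    adj_alpha = alpha / (groups - 1)
    h = abs(2 * math.asin(math.sqrt(p2)) - 2 * math.asin(math.sqrt(p1)))
    z = NormalDist().inv_cdf
    per_group = math.ceil(2 * (z(1 - adj_alpha / 2) + z(power)) ** 2 / h ** 2)
    return groups * per_group
```

Under this correction, a four-group test costs roughly 2.5 to 3 times the total sample of a two-group test: twice as many groups, each inflated by the stricter per-comparison significance threshold.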
Practical Implementation and Time Horizons
Sample size calculations determine visitor requirements, but actual test duration depends on traffic volume and seasonal patterns: a site attracting tens of thousands of daily visitors may hit its target within days, while a low-traffic site might need two to three weeks or more. Account for traffic seasonality and day-of-week effects that introduce bias, and run tests for at least one to two full business cycles to capture natural variation. Avoid 'peeking' at results mid-test, as repeated looks inflate the false positive rate. Document the baseline metric, expected effect, and sample size target before launching. After reaching the calculated sample size, analyze the results, and resist the urge to extend the test if results merely approach significance; extending introduces bias favoring borderline results over truly strong effects.