A/B Test Duration Calculator -- Sample Size Tool

Calculate how long to run your A/B test for statistical significance

Estimate A/B Test Duration

This A/B test duration calculator estimates the sample size and test length needed to reach statistical significance. Enter your traffic, current conversion rate, and the smallest improvement you want to detect.

Worked Examples

Three real-world scenarios showing how traffic, baseline rate, and MDE interact to determine test duration.

Example 1 — E-commerce Product Page

  • Daily visitors: 2,000 per variation
  • Baseline rate: 3.5%
  • MDE: 10% relative (detect 3.5% → 3.85%)
  • Result: ~45,300 visitors per variation
  • Duration: ~23 days (just over 3 weeks)

With decent traffic and a moderate baseline, this test finishes in just over three weeks.

Example 2 — SaaS Free Trial Page

  • Daily visitors: 500 per variation
  • Baseline rate: 2%
  • MDE: 20% relative (detect 2.0% → 2.4%)
  • Result: ~21,100 visitors per variation
  • Duration: ~43 days (about 6 weeks)

Lower traffic and a lower baseline mean the test needs about six weeks, even with a generous 20% MDE.

Example 3 — High-Traffic Landing Page

  • Daily visitors: 10,000 per variation
  • Baseline rate: 8%
  • MDE: 5% relative (detect 8.0% → 8.4%)
  • Result: ~73,800 visitors per variation
  • Duration: ~8 days → run 14 days (weekly pattern minimum)

Even though the math says about 8 days, running for 14 days captures weekly traffic patterns and avoids time-based bias.

How A/B Test Sample Size Is Calculated

The required sample size per variation is determined by four inputs:

  • Baseline conversion rate (p₁): Your current conversion rate before the test
  • Minimum Detectable Effect (MDE): The smallest relative improvement worth detecting. A 10% MDE on a 5% baseline means detecting the difference between 5.0% and 5.5%
  • Confidence level (1 − α): How certain you need to be that a detected difference is real, not random noise. 95% is standard (5% false positive risk)
  • Statistical power (1 − β): The probability of detecting a real effect when it exists. 80% is standard (20% false negative risk)

The formula:

n = [(Zα/2 + Zβ)² × (p₁(1−p₁) + p₂(1−p₂))] / δ²

Where:

  • n = required sample size per variation
  • Zα/2 = Z-score for the confidence level (1.96 for 95%)
  • Zβ = Z-score for statistical power (0.84 for 80%)
  • p₁ = baseline conversion rate
  • p₂ = expected new conversion rate (p₁ × (1 + MDE))
  • δ = absolute difference (p₂ − p₁)

Test duration is then: sample size ÷ daily visitors per variation (rounded up, minimum 14 days).
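As a concrete sketch, the formula and the 14-day floor can be written in a few lines of JavaScript (the function names here are illustrative, not the calculator's actual code):

```javascript
// Sample size per variation for a two-sided test.
// Z-scores default to 95% confidence (1.96) and 80% power (0.84).
function sampleSizePerVariation(baseline, relativeMde, zAlpha = 1.96, zBeta = 0.84) {
  const p1 = baseline;
  const p2 = baseline * (1 + relativeMde);          // expected new rate
  const delta = p2 - p1;                            // absolute difference
  const variance = p1 * (1 - p1) + p2 * (1 - p2);   // sum of per-group variances
  return Math.ceil(((zAlpha + zBeta) ** 2 * variance) / (delta * delta));
}

// Duration: sample size ÷ daily visitors per variation, rounded up, minimum 14 days.
function testDurationDays(sampleSize, dailyVisitorsPerVariation) {
  return Math.max(14, Math.ceil(sampleSize / dailyVisitorsPerVariation));
}

// 5% baseline, 10% relative MDE → roughly 31,000 visitors per variation
console.log(sampleSizePerVariation(0.05, 0.10));
```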

What Changes Test Duration Most?

These factors have the biggest impact on how long your test needs to run:

Factor          Change                      Effect on duration
Daily traffic   2× more visitors            ~2× shorter test
MDE             5% → 10% (more lenient)     ~4× shorter test
Baseline rate   2% → 5%                     ~2.5× shorter test
Confidence      95% → 90%                   ~20% shorter test
Power           80% → 90%                   ~30% longer test

Key insight: MDE has the largest impact because it's squared in the denominator. Relaxing your MDE from 5% to 10% cuts sample size by roughly 75%. If a test requires more time than you can afford, raising MDE is usually the most practical lever.
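The quadratic effect of MDE is easy to verify with the formula above. This sketch (illustrative names, z-scores fixed at 95% confidence and 80% power) compares a 5% and a 10% relative MDE on a 5% baseline:

```javascript
// Sample size per variation at 95% confidence (Z = 1.96) and 80% power (Z = 0.84).
function sampleSize(p1, mde) {
  const p2 = p1 * (1 + mde);
  const variance = p1 * (1 - p1) + p2 * (1 - p2);
  return ((1.96 + 0.84) ** 2 * variance) / (p2 - p1) ** 2;
}

// Halving the MDE roughly quadruples the required sample size.
const ratio = sampleSize(0.05, 0.05) / sampleSize(0.05, 0.10);
console.log(ratio); // just under 4
```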

Common A/B Testing Mistakes

  • Peeking at results: Checking results mid-test and stopping when you see significance is the most common A/B testing error. It inflates false positive rates from 5% to 25–50%. Decide your sample size upfront and commit to it.
  • Running tests too short: A test that runs for only a few days may hit "significance" by chance. Without 1–2 full business cycles (weekday + weekend), results are unreliable.
  • Testing too many variations: Each additional variation requires more total traffic. A test with 4 variations needs roughly 3× the traffic of a standard A/B test to maintain statistical power.
  • Changing the test mid-run: Modifying the control or variation during the test invalidates the results. If you need to make changes, restart the test with a new sample.
  • Ignoring novelty effect: Users may interact differently with new designs simply because they are new. Run tests long enough (3–4 weeks) for the novelty to wear off.
  • Not segmenting results: A test might show no overall lift but a significant lift for mobile users. If you plan to analyze segments, account for this in your sample size (more visitors needed).
  • Using the wrong metric: If your goal is revenue, optimize for revenue per visitor — not just conversion rate. A variation might convert more visitors but at a lower order value.

A/B Testing Best Practices

  • Calculate sample size before starting. Use this calculator. Write down the required visitors and commit to the duration.
  • Run for at least 2 full weeks to capture weekday and weekend traffic patterns, regardless of what the math says.
  • Test one change at a time for clear causation. If you change the headline, button, and layout simultaneously, you cannot attribute the result to any single element.
  • Use a 50/50 traffic split for maximum statistical efficiency. Uneven splits (90/10) require dramatically more visitors.
  • Avoid testing during unusual periods: holidays, sales events, product launches, or marketing campaigns can skew results.
  • Document everything: hypothesis, expected MDE, sample size, start date, end date, and decision. This builds institutional knowledge and prevents revisiting settled questions.
  • Accept null results gracefully. Most tests do not produce a winner. A null result is still valuable — it tells you the change does not meaningfully impact the metric.

When You Don't Have Enough Traffic

Many sites cannot run statistically rigorous A/B tests. If your calculator shows a test would take 6+ months, consider these alternatives:

  • Increase MDE. If you can only detect 30%+ lifts, focus on bold changes (complete redesigns, new value propositions) rather than minor tweaks.
  • Use a higher-traffic page. Test on your homepage or landing page instead of a low-traffic settings page.
  • Combine traffic from multiple pages. Some tools support site-wide A/B tests that aggregate traffic across pages.
  • Run pre/post tests. Compare metrics before and after a change (less rigorous, but better than guessing).
  • Use qualitative methods. User interviews, session recordings, and usability tests can reveal improvements without statistical testing.

Frequently Asked Questions

How long should I run an A/B test?

It depends on your daily traffic, baseline conversion rate, and the smallest improvement you want to detect. Most tests need 2–6 weeks. Always run for at least 14 days to capture weekly patterns. Use this calculator to get an exact estimate for your numbers.

How do I calculate A/B test sample size?

Use the formula: n = [(Zα/2 + Zβ)² × (p₁(1−p₁) + p₂(1−p₂))] / δ². At 95% confidence and 80% power, with a 5% baseline and 10% relative MDE, you need approximately 31,000 visitors per variation. This calculator handles the math for you.

What is Minimum Detectable Effect (MDE)?

MDE is the smallest relative improvement in conversion rate your test is designed to detect. A 10% MDE on a 5% baseline means detecting a lift from 5.0% to 5.5%. Smaller MDEs require larger sample sizes. Most teams use 5–20% MDE depending on traffic and expected impact.

What confidence level should I use?

95% is the industry standard (5% false positive risk). Use 90% for exploratory tests where speed matters. Use 99% for high-stakes decisions like pricing changes. Lower confidence means shorter tests but higher risk of declaring false winners.

What is statistical power in A/B testing?

Statistical power is the probability of detecting a real effect. At 80% power (standard), there is a 20% chance of missing a real improvement. Increasing power to 90% adds roughly 30% more required visitors but reduces false negatives.

Can I stop an A/B test early if results look significant?

No. "Peeking" — stopping early when results look good — inflates your false positive rate from 5% to as high as 25–50%. Decide your sample size before starting and commit to it. If you need flexibility to stop early, use sequential testing methods specifically designed for it.

Why is 14 days the minimum test duration?

User behavior differs between weekdays and weekends. Conversion rates, traffic volume, and visitor intent can shift significantly. A test running Tuesday through Friday misses weekend patterns entirely. Two full weeks captures at least two of each day.

What happens if my test runs too short?

An underpowered test produces unreliable results. You might declare a winner that is actually no better (false positive), or miss a real improvement (false negative). Both outcomes waste time and can lead to worse decisions than not testing at all.

How does baseline conversion rate affect duration?

Lower baselines require more visitors. A site converting at 1% needs roughly 5× more visitors than a site converting at 5% to detect the same relative improvement. Lower rates have higher variance relative to the effect size.
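A quick check of the rough 5× figure, using the same standard formula at 95% confidence and 80% power (an illustrative sketch):

```javascript
// Sample size per variation at 95% confidence (Z = 1.96) and 80% power (Z = 0.84).
function sampleSize(p1, mde) {
  const p2 = p1 * (1 + mde);
  const variance = p1 * (1 - p1) + p2 * (1 - p2);
  return ((1.96 + 0.84) ** 2 * variance) / (p2 - p1) ** 2;
}

// Same 10% relative MDE at a 1% baseline vs. a 5% baseline.
const ratio = sampleSize(0.01, 0.10) / sampleSize(0.05, 0.10);
console.log(ratio); // a bit over 5
```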

What if my test shows no significant result?

A null result means the change did not produce a detectable effect at your MDE level. This is valuable information — it means the change is unlikely to have a meaningful impact. Do not re-run the same test hoping for a different result. Move on to a higher-impact hypothesis.

How many variations should I test?

Start with two (control + one variation). Each additional variation multiplies the required total traffic. A test with 4 variations needs roughly 3× the traffic of a standard A/B test. Only add variations when you have enough traffic to support them.

Does this calculator send my data to a server?

No. All calculations run entirely in your browser using JavaScript. Nothing is transmitted, stored, or logged.

Learn More

For a deeper explanation of sample size calculations, statistical significance, and practical tips for running your first A/B test, read our guide: How Long Should You Run an A/B Test? A Practical Guide to Sample Size and Duration.

Privacy & Limitations

  • Client-side only. No data is sent to any server. No cookies, no tracking.
  • Two-tailed test. This calculator assumes a two-sided hypothesis test (detecting both positive and negative effects).
  • Two variations only. The formula assumes a standard A/B test with one control and one variation. For multivariate tests, sample size requirements increase.
  • Assumes equal traffic split. An uneven split (e.g., 90/10) requires a larger total sample size than a 50/50 split.
  • Does not account for multiple comparisons. If you plan to analyze multiple metrics or segments, consider applying a Bonferroni correction or using a lower significance threshold.
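On the multiple-comparisons point, a Bonferroni correction simply divides the significance threshold by the number of comparisons (a minimal sketch; the function name is illustrative):

```javascript
// Bonferroni correction: test each of m comparisons at alpha / m
// to keep the overall false positive rate near the intended alpha.
function bonferroniThreshold(alpha, comparisons) {
  return alpha / comparisons;
}

// Analyzing 4 metrics at an overall alpha of 0.05 → test each at 0.0125.
console.log(bonferroniThreshold(0.05, 4));
```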

Related Tools

View all tools
