How Long Should You Run an A/B Test? A Practical Guide

Learn how to calculate the right sample size and duration for A/B tests. Includes the formula, worked examples, and the most common mistakes that lead to unreliable results.

The Quick Answer

Most A/B tests need 2–6 weeks. The exact duration depends on three things: your daily traffic, your current conversion rate, and the smallest improvement you want to detect.

A page sending 2,000 daily visitors to each variation, converting at 3% and trying to detect a 10% relative lift, needs roughly 53,000 visitors per variation, or about 27 days. Lower traffic or smaller effects take longer. Higher traffic finishes faster, but always run for at least 14 days to capture weekly patterns.

Why Test Duration Matters

Running an A/B test for the wrong amount of time is the most common cause of unreliable results.

Too short and you get false positives — declaring a winner that is not actually better. Too long and you waste time and traffic that could be used on the next test.

The goal is to calculate the right sample size before starting the test, then commit to it. This is not optional — it is the foundation of valid A/B testing.

The Four Inputs That Determine Duration

1. Daily Traffic (per Variation)

This is how many visitors each version of your page receives per day. If your page gets 2,000 daily visitors and you split 50/50, each variation gets 1,000.

More traffic = shorter test. It is the most intuitive input.

2. Baseline Conversion Rate

Your current conversion rate before the test. Lower baselines require more visitors because the absolute difference you are trying to detect shrinks faster than the statistical noise does. A page converting at 1% needs roughly 5× more visitors than a page converting at 5% to detect the same relative change.

3. Minimum Detectable Effect (MDE)

The smallest relative improvement you want the test to be able to detect. This is the most important parameter and the one most people get wrong.

  • 5% MDE means detecting a lift from 5.0% to 5.25% — very precise but requires huge sample sizes
  • 10% MDE means detecting 5.0% to 5.5% — the most common choice
  • 20% MDE means detecting 5.0% to 6.0% — faster but only catches large effects

MDE has the largest impact on sample size because it is squared in the formula. Doubling MDE from 5% to 10% cuts required visitors by roughly 75%.

4. Confidence Level and Statistical Power

Confidence level (typically 95%) controls false positives — declaring a winner when there is no real difference. At 95% confidence, there is a 5% chance of this happening.

Statistical power (typically 80%) controls false negatives — missing a real improvement. At 80% power, there is a 20% chance of a true winner going undetected.

These defaults (95% confidence, 80% power) are standard in the industry. Changing them is rarely necessary.
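The 1.96 and 0.84 that these settings produce are just quantiles of the standard normal distribution. If you want to derive them yourself, or use nonstandard settings, Python's standard library is enough; a minimal sketch:

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1

confidence = 0.95
power = 0.80

# Two-sided test: the 5% false positive budget is split across both tails,
# so the critical value is the 97.5th percentile.
z_alpha = z.inv_cdf(1 - (1 - confidence) / 2)

# Power is one-sided: the 80th percentile.
z_beta = z.inv_cdf(power)

print(round(z_alpha, 2), round(z_beta, 2))  # 1.96 0.84
```

Raising confidence or power moves these quantiles up, which inflates the required sample size in the formula that follows.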

The Formula

The sample size per variation for a two-sided test is:

n = (Zα/2 + Zβ)² × (p₁(1−p₁) + p₂(1−p₂)) / δ²

Where:

  • n = required visitors per variation
  • Zα/2 = 1.96 (for 95% confidence)
  • Zβ = 0.84 (for 80% power)
  • p₁ = baseline conversion rate
  • p₂ = expected new rate (p₁ × (1 + MDE))
  • δ = absolute difference (p₂ − p₁)

Test duration = n ÷ daily visitors per variation, rounded up to the nearest whole day, with a minimum of 14 days.

Worked Examples

Example 1: Mid-Size E-commerce Site

Inputs:

  • Daily visitors: 3,000 total (1,500 per variation)
  • Baseline conversion rate: 3.2%
  • MDE: 10% relative
  • Confidence: 95%
  • Power: 80%

Calculation:

  • Expected new rate: 3.2% × 1.10 = 3.52%
  • Absolute difference: 0.32 percentage points
  • Sample size per variation: ~49,700 visitors
  • Duration: 49,700 ÷ 1,500 ≈ 34 days

A reasonable test, finishable in about five weeks.

Example 2: Low-Traffic SaaS Page

Inputs:

  • Daily visitors: 400 total (200 per variation)
  • Baseline conversion rate: 2.0%
  • MDE: 15% relative
  • Confidence: 95%
  • Power: 80%

Calculation:

  • Expected new rate: 2.0% × 1.15 = 2.3%
  • Absolute difference: 0.3 percentage points
  • Sample size per variation: ~36,600 visitors
  • Duration: 36,600 ÷ 200 = 183 days

At about six months, this test is impractical. Options: increase MDE to 25–30%, test on a higher-traffic page, or use qualitative methods instead.

Example 3: High-Traffic Landing Page

Inputs:

  • Daily visitors: 20,000 total (10,000 per variation)
  • Baseline conversion rate: 8.5%
  • MDE: 5% relative
  • Confidence: 95%
  • Power: 80%

Calculation:

  • Expected new rate: 8.5% × 1.05 = 8.925%
  • Absolute difference: 0.425 percentage points
  • Sample size per variation: ~69,000 visitors
  • Duration: 69,000 ÷ 10,000 = 7 days → 14 days (minimum)

The math says 7 days, but you should still run for 14 days to capture a full set of weekday and weekend patterns.

The 14-Day Minimum Rule

Even if your traffic lets you reach the required sample size in 3 days, run the test for at least two full weeks. Here is why:

  • Weekday vs. weekend behavior. Conversion rates often differ by 20–40% between weekdays and weekends. A test that runs only Tuesday to Friday would miss this.
  • Pay cycles. B2C purchases can spike around paydays (1st and 15th of the month).
  • Email campaigns. If you send a weekly newsletter, one variation might happen to coincide with the send while the other does not.
  • Ad schedule changes. Paid traffic composition may vary by day of week.

Running for 14+ days smooths these effects across both variations equally.

The Peeking Problem

The single most damaging mistake in A/B testing is checking results before the test is done and stopping early when you see significance.

This is called peeking, and it dramatically inflates your false positive rate.

At 95% confidence, your intended false positive rate is 5%. But if you check results every day and stop when you first see significance, research shows the actual false positive rate can climb to 25–50%. You are essentially rolling the dice every day and stopping when you get lucky.

The rule: Calculate your sample size. Start the test. Do not look at the results until the test is complete.

If you truly need the ability to stop early, use sequential testing methods (like SPRT or alpha-spending functions) that are specifically designed to control error rates under repeated analysis.
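You can see the inflation directly by simulating A/A tests, where both variations share the same true conversion rate, so every declared winner is a false positive. A rough sketch (the traffic numbers are illustrative, not from any specific study):

```python
import random
from statistics import NormalDist

Z_CRIT = NormalDist().inv_cdf(0.975)  # 1.96: 95% confidence, two-sided


def is_significant(conv_a, conv_b, n):
    """Two-proportion z-test with equal sample sizes and pooled variance."""
    p = (conv_a + conv_b) / (2 * n)
    se = (2 * p * (1 - p) / n) ** 0.5
    return se > 0 and abs(conv_a - conv_b) / n > Z_CRIT * se


def simulate_aa_tests(runs=500, days=20, daily=200, rate=0.05, seed=42):
    """Run A/A tests and compare peeking daily vs. one look at the end."""
    rng = random.Random(seed)
    peeking_hits = final_hits = 0
    for _ in range(runs):
        conv_a = conv_b = 0
        stopped_early = False
        for day in range(1, days + 1):
            conv_a += sum(rng.random() < rate for _ in range(daily))
            conv_b += sum(rng.random() < rate for _ in range(daily))
            if is_significant(conv_a, conv_b, day * daily):
                stopped_early = True  # a daily peeker would stop here
        peeking_hits += stopped_early
        final_hits += is_significant(conv_a, conv_b, days * daily)
    return peeking_hits / runs, final_hits / runs


peek_rate, final_rate = simulate_aa_tests()
print(f"daily peeking:       {peek_rate:.0%} false positives")
print(f"one look at the end: {final_rate:.0%} false positives")
```

With a single look at the end, roughly 5% of runs show a false winner, as designed; with a look every day, the rate is several times higher.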

When A/B Testing Is Not the Right Tool

A/B testing requires sufficient traffic to produce reliable results. If your calculator shows a test duration of 3+ months, consider alternatives:

Pre/post analysis. Implement the change and compare metrics to the same period before. Less rigorous, but practical for low-traffic sites.

User testing. Watch 5–10 people use both versions. Qualitative insights often reveal problems that no amount of statistical testing would identify.

Heuristic evaluation. Apply established UX principles (clarity, scannability, trust signals) without testing. Not every change needs an A/B test.

Focus on bigger changes. If you can only detect 30%+ lifts, test bold hypotheses: entirely different page layouts, new value propositions, or fundamental flow changes. Small copy tweaks are not worth testing at low traffic.

How to Choose Your MDE

Choosing MDE is a business decision, not a statistical one. Consider:

  1. What lift would be worth the development effort? If a change took 2 weeks of engineering, a 2% lift on a page that generates $10K/month adds $200/month. Is that worth it?

  2. What lift is realistic? Headline changes typically produce 5–15% lifts. Complete redesigns can produce 20–50%. Expect most tests to show no significant effect at all.

  3. What can you afford to detect? Use the calculator. If detecting a 5% lift requires 6 months, set MDE at 15% and focus on bolder changes.

A practical default: 10% MDE for most tests, 20% when traffic is limited, 5% only when you have high traffic and the decision is high-stakes.

Quick Reference Table

Approximate sample size per variation at 95% confidence and 80% power:

Baseline Rate    5% MDE    10% MDE    20% MDE
1%              636,000    163,000     42,600
2%              315,000     80,600     21,100
5%              122,000     31,200      8,100
10%              57,700     14,700      3,800
20%              25,600      6,500      1,700

These are estimates. Use the A/B test duration calculator for exact numbers with your inputs.
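A table like this can be generated directly from the sample size formula. A short sketch (exact outputs differ slightly from rounded figures, since the code keeps full precision in the z-values):

```python
import math
from statistics import NormalDist

Z = NormalDist()
FACTOR = (Z.inv_cdf(0.975) + Z.inv_cdf(0.80)) ** 2  # (1.96 + 0.84)^2, ~7.85


def n_per_variation(baseline, mde):
    """Approximate sample size per variation at 95% confidence / 80% power."""
    p2 = baseline * (1 + mde)
    variance = baseline * (1 - baseline) + p2 * (1 - p2)
    return math.ceil(FACTOR * variance / (baseline * mde) ** 2)


mdes = (0.05, 0.10, 0.20)
print(f"{'Baseline':>8}" + "".join(f"{m:>10.0%} MDE" for m in mdes))
for baseline in (0.01, 0.02, 0.05, 0.10, 0.20):
    cells = "".join(f"{n_per_variation(baseline, m):>14,}" for m in mdes)
    print(f"{baseline:>8.0%}{cells}")
```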

Checklist Before Starting an A/B Test

  1. Define the hypothesis. "Changing X will improve Y because Z."
  2. Choose one primary metric. Revenue per visitor, conversion rate, or another specific goal.
  3. Calculate sample size. Use a calculator. Write down the number.
  4. Set the MDE. Realistic, affordable, and meaningful for the business.
  5. Confirm traffic is sufficient. If the test requires more than 8 weeks, reconsider.
  6. Plan for at least 14 days. Even if math says less.
  7. Agree to not peek. Seriously. Do not check until it is done.
  8. Document start date and expected end date. Share with stakeholders.
  9. Verify tracking works. Run both variations for 24 hours and confirm data collection before counting toward results.
  10. Accept that most tests produce null results. That is normal and valuable.

Try It Yourself

A/B Test Duration Calculator

Enter your traffic, baseline rate, and MDE to see exactly how long your test needs to run.

Open Calculator

Frequently Asked Questions

How long should an A/B test run?

Most tests need 2–6 weeks. The exact duration depends on your daily traffic, baseline conversion rate, and minimum detectable effect. Always run for at least 14 days regardless of sample size to capture weekly behavior patterns.

What happens if I stop an A/B test early?

Stopping early based on intermediate results ("peeking") inflates your false positive rate from the intended 5% to as high as 25–50%. You may declare a winner that is not actually better. Always commit to the pre-calculated sample size.

How do I know if my traffic is high enough for A/B testing?

Use a sample size calculator. If the required test duration exceeds 8 weeks at your current traffic, rigorous A/B testing may not be practical. Consider qualitative research, pre/post analysis, or testing bolder changes with a higher MDE.

What is a good MDE to use?

10% relative MDE is a practical default. Use 20% for low-traffic sites or exploratory tests. Use 5% only when you have high traffic and the business decision is high-stakes (e.g., pricing changes).

Do I need to test everything?

No. A/B test when the decision is close, the stakes are meaningful, and you have enough traffic. Many improvements — fixing bugs, improving load speed, adding missing information — can be implemented without testing.

Can I run multiple A/B tests at the same time?

Yes, if the tests are on different pages or different user segments with no overlap. If they overlap, interactions between tests can confound results. Some platforms support multi-test environments with proper isolation.

What is the difference between one-tailed and two-tailed tests?

A two-tailed test checks for effects in both directions (improvement or regression). A one-tailed test only checks one direction. Two-tailed is standard and recommended — you always want to know if a change made things worse.

Should I use Bayesian or frequentist A/B testing?

Frequentist testing (what this guide covers) is the industry standard and well-understood. Bayesian methods offer different advantages (probability of being best, flexible stopping rules) but require more statistical knowledge. Both are valid when applied correctly.

How do I report A/B test results?

Report: the hypothesis, sample size achieved, conversion rates for each variation, absolute and relative lift, confidence level, p-value (or credible interval for Bayesian), and the decision made. Include null results — they prevent future teams from retesting the same idea.

What is the minimum number of conversions needed?

A common rule of thumb is at least 100 conversions per variation before results stabilize. With fewer conversions, even "significant" results may be unreliable. The formal answer is to use the sample size formula, which accounts for conversion rate.

Related Tools