Statistical Significance in A/B Testing: A Practical Guide
Why Statistical Significance Matters
You've been running an A/B test for a week. Variant B has a 3.2% conversion rate vs variant A's 2.8%. Is B actually better, or did you just get lucky? This is exactly the question statistical significance answers.
Without proper statistical analysis, you might make decisions based on random noise. A test with 100 visitors per variant could easily show a 15% difference due to chance alone. Statistical significance tells you how unlikely a difference as large as the one you observed would be if the variants actually performed identically.
P-Values Explained Simply
A p-value answers the question: "If there were truly no difference between the variants, what's the probability of seeing a difference this large (or larger) by chance?"
Convention uses p < 0.05 as the threshold (often described as 95% confidence): you declare a winner only when a difference this large would show up less than 5% of the time if the variants were actually identical. Some key points:
- A p-value of 0.03 means: if the variants were truly identical, you'd see a difference at least this large only 3% of the time
- A p-value of 0.15 means: you'd see a difference this large 15% of the time even with identical variants, which is not confident enough to declare a winner
- P-values do not tell you the probability that your hypothesis is true
- P-values do not tell you the magnitude of the effect
The Z-Test for Proportions
For A/B testing conversion rates, the most common test is the two-proportion z-test. Here's the math:
// Two-proportion z-test for conversion rates
function twoProportionZScore(conversionsA: number, visitorsA: number, conversionsB: number, visitorsB: number): number {
  const rateA = conversionsA / visitorsA;
  const rateB = conversionsB / visitorsB;
  // Pooled rate under the null hypothesis that A and B are identical
  const pooledRate = (conversionsA + conversionsB) / (visitorsA + visitorsB);
  const standardError = Math.sqrt(pooledRate * (1 - pooledRate) * (1 / visitorsA + 1 / visitorsB));
  return (rateB - rateA) / standardError;
}
// |z| > 1.96 corresponds to p < 0.05 (95% confidence, two-sided)
// |z| > 2.58 corresponds to p < 0.01 (99% confidence, two-sided)
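If you want the actual p-value rather than just a threshold check, a few more lines will do. The sketch below approximates the standard normal CDF with the Abramowitz and Stegun error-function formula; the helper name normalCdf and the 10,000-visitor counts (matching the 2.8% vs 3.2% example at the top of this guide) are illustrative assumptions, not Experiment Flow's implementation.

// Standard normal CDF via an error-function approximation (Abramowitz & Stegun 7.1.26),
// accurate to about 1e-7, which is plenty for significance testing.
function normalCdf(z: number): number {
  const x = Math.abs(z) / Math.SQRT2;
  const t = 1 / (1 + 0.3275911 * x);
  const poly = ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t
                - 0.284496736) * t + 0.254829592) * t;
  const erf = 1 - poly * Math.exp(-x * x);
  return z >= 0 ? 0.5 * (1 + erf) : 0.5 * (1 - erf);
}

// Hypothetical numbers: 2.8% vs 3.2% conversion with 10,000 visitors per variant
const z = twoProportionZScore(280, 10000, 320, 10000);
const pValue = 2 * (1 - normalCdf(Math.abs(z))); // two-sided
console.log(z.toFixed(2), pValue.toFixed(3));    // roughly z = 1.66, p = 0.097: not significant at 0.05

With 10,000 visitors per variant, the opening example's 0.4 percentage point lift is not yet statistically significant, which is exactly why sample size matters.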
Experiment Flow calculates this automatically for every experiment and displays it on your dashboard in real time.
Sample Size: How Long Should You Run Tests?
The most common question in A/B testing. The answer depends on three factors:
1. Baseline Conversion Rate
A 2% baseline conversion rate needs more samples to detect a change than a 20% baseline. This is because the signal-to-noise ratio is lower at low conversion rates.
2. Minimum Detectable Effect (MDE)
How small a difference do you want to detect? Detecting a 1% relative improvement requires many more samples than detecting a 20% relative improvement.
3. Statistical Power
Power is the probability of detecting a real effect. Convention uses 80% power, meaning if there truly is a difference, you'll detect it 80% of the time.
Rule of thumb for 95% confidence and 80% power, using the standard two-proportion formula (see the sketch after this list):
- Detecting a 5% relative change from a 3% baseline: ~208,000 visitors per variant
- Detecting a 10% relative change from a 3% baseline: ~53,000 visitors per variant
- Detecting a 20% relative change from a 3% baseline: ~14,000 visitors per variant
- Detecting a 10% relative change from a 10% baseline: ~14,700 visitors per variant
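If you'd rather compute these numbers yourself, here is a minimal sketch of the standard two-proportion formula, with z = 1.96 for 95% confidence (two-sided) and z = 0.84 for 80% power. The function name and defaults are illustrative, not an Experiment Flow API.

// Per-variant sample size for a two-proportion z-test
// zAlpha = 1.96 (95% confidence, two-sided), zBeta = 0.84 (80% power)
function sampleSizePerVariant(baselineRate: number, relativeMde: number,
                              zAlpha = 1.96, zBeta = 0.84): number {
  const p1 = baselineRate;
  const p2 = baselineRate * (1 + relativeMde);
  const variance = p1 * (1 - p1) + p2 * (1 - p2);
  const delta = p2 - p1;
  return Math.ceil(((zAlpha + zBeta) ** 2 * variance) / (delta * delta));
}

console.log(sampleSizePerVariant(0.03, 0.10)); // ~53,000: a 10% relative lift on a 3% baseline
console.log(sampleSizePerVariant(0.10, 0.10)); // ~14,700: the same relative lift on a 10% baseline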
If you don't have enough traffic to reach your required sample size within 2-4 weeks, consider testing bolder changes with larger expected effects, or use multi-armed bandits to reduce the cost of exploration.
Common Statistical Mistakes
Peeking Problem
The #1 mistake in A/B testing. If you check your results every day and stop the test the first time you see p < 0.05, your actual false positive rate can be as high as 30% (not 5%). This is because you're running multiple comparisons without correcting for it.
Solutions:
- Set a sample size in advance and don't stop early
- Use sequential testing methods that are designed for continuous monitoring
- Use Experiment Flow's auto-promote feature, which handles this correctly
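To see why peeking hurts, here is a rough A/A simulation: both variants share the same true 3% conversion rate, yet stopping at the first daily look where |z| > 1.96 declares a "winner" far more often than 5%. The traffic volume, day count, and function name are arbitrary illustration values, not a prescription.

// A/A simulation of the peeking problem: identical variants, one significance
// check per day, stopping as soon as |z| > 1.96.
function peekingFalsePositive(trueRate: number, visitorsPerDay: number, days: number): boolean {
  let convA = 0, convB = 0, n = 0; // n = visitors per variant so far
  for (let day = 1; day <= days; day++) {
    for (let i = 0; i < visitorsPerDay; i++) {
      n++;
      if (Math.random() < trueRate) convA++;
      if (Math.random() < trueRate) convB++;
    }
    const pooled = (convA + convB) / (2 * n);
    const se = Math.sqrt(pooled * (1 - pooled) * (2 / n));
    const z = (convB / n - convA / n) / se;
    if (Math.abs(z) > 1.96) return true; // an A/A "winner" is a false positive
  }
  return false;
}

let falsePositives = 0;
const runs = 2000;
for (let r = 0; r < runs; r++) {
  if (peekingFalsePositive(0.03, 1000, 20)) falsePositives++;
}
// With 20 daily looks, this typically comes out around 20-25%, not 5%.
console.log((100 * falsePositives / runs).toFixed(1) + "% false positive rate");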
Multiple Comparisons
Testing 20 variants against a control? At p < 0.05, you'd expect one to appear significant purely by chance. Use the Bonferroni correction: divide your significance threshold by the number of comparisons (0.05/20 = 0.0025).
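A minimal sketch of the adjustment (the helper name is illustrative):

// Bonferroni correction: a variant "wins" only if its p-value clears alpha / k
function bonferroniSignificant(pValues: number[], alpha = 0.05): boolean[] {
  const threshold = alpha / pValues.length; // e.g. 0.05 / 20 = 0.0025
  return pValues.map(p => p < threshold);
}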
Simpson's Paradox
A variant can appear to win overall while actually losing in every individual segment. This happens when traffic isn't evenly distributed across segments. Always segment your results by device, traffic source, and other key dimensions.
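Here is a made-up illustration (all numbers hypothetical): desktop converts far better than mobile, and variant B happens to receive most of the desktop traffic.

// Hypothetical Simpson's paradox: B loses in every segment but "wins" overall
// because it received far more high-converting desktop traffic.
const segments = [
  { name: "desktop", aConv: 110, aVisits: 1000, bConv: 900, bVisits: 9000 }, // A 11.0% vs B 10.0%
  { name: "mobile",  aConv: 198, aVisits: 9000, bConv: 20,  bVisits: 1000 }, // A  2.2% vs B  2.0%
];
const totalA = segments.reduce((s, x) => s + x.aConv, 0) / segments.reduce((s, x) => s + x.aVisits, 0);
const totalB = segments.reduce((s, x) => s + x.bConv, 0) / segments.reduce((s, x) => s + x.bVisits, 0);
console.log(totalA, totalB); // A 3.08% overall vs B 9.20% overall, despite B losing both segments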
Beyond P-Values: Confidence Intervals
P-values tell you if there's a difference; confidence intervals tell you how big the difference is. A 95% confidence interval of [0.5%, 4.2%] means the true difference is likely between 0.5% and 4.2%. This is often more useful for business decisions than a bare p-value.
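Here is a minimal sketch of that interval, using the standard unpooled (Wald) formula for a difference in proportions; the function name is illustrative.

// 95% confidence interval for the difference in conversion rates (B minus A),
// using the unpooled standard error (Wald interval).
function differenceConfidenceInterval(conversionsA: number, visitorsA: number,
                                      conversionsB: number, visitorsB: number,
                                      z = 1.96): [number, number] {
  const rateA = conversionsA / visitorsA;
  const rateB = conversionsB / visitorsB;
  const se = Math.sqrt(rateA * (1 - rateA) / visitorsA + rateB * (1 - rateB) / visitorsB);
  const diff = rateB - rateA;
  return [diff - z * se, diff + z * se];
}

// Hypothetical numbers from earlier: 2.8% vs 3.2% with 10,000 visitors per variant
console.log(differenceConfidenceInterval(280, 10000, 320, 10000));
// roughly [-0.0007, 0.0087], i.e. [-0.07, +0.87] percentage points: the interval includes zero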
When reporting results, always include:
- The point estimate (e.g., "+2.3% conversion rate")
- The confidence interval (e.g., "[+0.5%, +4.2%]")
- The sample size
- The test duration
Practical Recommendation
For most businesses, use 95% confidence (p < 0.05) as your threshold. If the cost of a wrong decision is high (e.g., a complete redesign), use 99% confidence. If you're running many low-stakes tests (e.g., headline variations), 90% confidence is acceptable.
And if statistical significance feels like too much overhead for your pace of experimentation, consider using multi-armed bandits instead. They automatically optimize traffic allocation without requiring you to calculate sample sizes or wait for predetermined thresholds.
Ready to optimize your site?
Start running experiments in minutes with Experiment Flow. Free plan available.
Start Free