The Peeking Problem: Why Your A/B Test Results Might Be Wrong
The Experiment That Looked Successful Until It Wasn't
Here's a scenario that plays out constantly in growth teams. You launch an A/B test on your pricing page. After 4 days, you check the results. The variant is at 3.2% conversion vs. the control's 2.7%—an 18% lift with a p-value of 0.03. It looks significant. You call the experiment, ship the variant, and move on.
Three months later, pricing page conversion has returned to baseline. What happened?
You fell victim to the peeking problem—the most common and costly statistical mistake in A/B testing.
What the Peeking Problem Actually Is
A p-value of 0.05 (the standard significance threshold) means: if there were no real difference between control and variant, you'd see results this extreme by chance 5% of the time. That sounds like a small risk. But it only applies when you test the data once, at a pre-determined sample size.
When you check your experiment results repeatedly while the experiment is running—daily, multiple times per day—you're running multiple statistical tests. Each check is another opportunity for a false positive, and those chances compound. If you check an experiment 10 times before reaching your planned sample size, your actual false positive rate is closer to 40% than 5% (treating the checks as independent; in practice sequential checks are correlated, so the true rate is somewhat lower, but still far above 5%).
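A quick simulation makes the inflation concrete. The sketch below (function and parameter names are illustrative, not from any library) runs repeated A/A tests—identical control and variant, so any "winner" is a false positive—and stops each one the first time a standard two-proportion z-test crosses p < 0.05:

```python
import random

def peeking_false_positive_rate(n_trials=1000, peeks=10,
                                visitors_per_peek=300, rate=0.03, seed=1):
    """Simulate A/A tests (no real difference between arms) and measure
    how often repeated peeking with a z-test at p < 0.05 declares a winner."""
    rng = random.Random(seed)
    z_crit = 1.96  # two-sided threshold for alpha = 0.05
    false_positives = 0
    for _ in range(n_trials):
        conv_a = conv_b = n = 0
        for _ in range(peeks):
            # another batch of visitors arrives in each arm
            for _ in range(visitors_per_peek):
                conv_a += rng.random() < rate
                conv_b += rng.random() < rate
            n += visitors_per_peek
            # pooled two-proportion z-test at this peek
            pooled = (conv_a + conv_b) / (2 * n)
            se = (2 * pooled * (1 - pooled) / n) ** 0.5
            if se > 0 and abs(conv_a - conv_b) / n / se > z_crit:
                false_positives += 1  # stopped early on pure noise
                break
    return false_positives / n_trials
```

Run with the defaults, this typically reports a false positive rate several times the nominal 5%, even though the two arms are identical by construction.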
This is the peeking problem. By repeatedly "peeking" at results and stopping early when results look good, you dramatically inflate your false positive rate. You convince yourself that changes are working when they're not.
Why It's So Hard to Resist
The peeking problem is hard to fix because it requires resisting a natural human impulse. You've launched an experiment. You care whether it's working. Checking the dashboard feels productive. And when you see p < 0.05, the brain releases a small dose of satisfaction that makes you want to act.
The problem is that early in an experiment, data is noisy. Small sample sizes mean that a 5-conversion difference between groups can appear statistically significant when it's actually just random variation. As more data accumulates, the noise averages out and the true effect (or lack of effect) becomes clear. But if you stop the experiment at the first significant result, you're measuring noise, not signal.
How to Calculate the Right Sample Size Upfront
The solution to peeking is to decide how long your experiment runs before you launch it—based on statistical power calculations, not impatience.
The inputs to a sample size calculation are:
- Baseline conversion rate: Your current conversion rate for the metric you're measuring
- Minimum detectable effect (MDE): The smallest improvement worth detecting. If your baseline is 3% and you only care about improvements of 0.5% or more, your MDE is 0.5 percentage points (or ~17% relative lift).
- Statistical power: Typically set at 80%—meaning you want an 80% chance of detecting a real effect of your MDE size
- Significance threshold: Typically 0.05 (95% confidence)
With these inputs, you can calculate the required sample size per variant. Many online calculators exist for this. The result tells you how many visitors each variant needs before you can trust the results.
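If you'd rather not rely on an online calculator, the standard two-proportion formula is short enough to carry in a script. This sketch uses only the Python standard library; the function name is mine:

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_variant(baseline, mde_abs, power=0.80, alpha=0.05):
    """Required visitors per variant for a two-sided two-proportion z-test.

    baseline: current conversion rate, e.g. 0.03
    mde_abs:  minimum detectable effect in absolute terms, e.g. 0.005
    """
    p1 = baseline
    p2 = baseline + mde_abs
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # ~0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

# The example from above: 3% baseline, 0.5 percentage point MDE
n = sample_size_per_variant(0.03, 0.005)
```

For that example the formula comes out to roughly 19,700 visitors per variant—a useful reality check on how long a small-MDE experiment actually takes at your traffic levels.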
Run your experiment until you hit this sample size. Don't stop early because the result looks significant. Don't extend it because the result isn't significant yet (extending has its own statistical problems). Decide the sample size upfront and stick to it.
What Happens When You Extend an Experiment
Extending an experiment after it hasn't reached significance is also problematic, and for the same underlying reason: it's peeking in reverse. If you keep adding data until the result crosses the threshold, you will eventually "find" an effect that isn't there, inflating your false positive rate just as early stopping does. If your experiment ran to its planned sample size and found no significant effect, extending it to "give it more time" is a sign that you're fishing for a result you want, not measuring a real effect.
Legitimate reasons to extend an experiment: if you discover that your sample size calculation was wrong (perhaps your baseline conversion rate was different than assumed), or if there's a genuine operational reason the experiment was disrupted (a major traffic spike, a site outage).
Sequential Testing: A Better Way
If you genuinely need to peek at results early—for operational reasons, or because a variant appears to be causing harm—sequential testing provides a statistically valid framework for doing so.
Sequential tests (also called always-valid p-values or sequential probability ratio tests) maintain the desired false positive rate regardless of when you look. Unlike standard p-values, which assume you test at a single point in time, sequential p-values account for the fact that you're testing continuously.
The tradeoff: sequential tests typically require larger sample sizes to reach the same confidence level as fixed-horizon tests, because they pay a statistical price for the flexibility of early stopping.
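To make the idea concrete, here is the classic one-sample Wald SPRT, which checks a single conversion stream against a known baseline after every observation while holding error rates near the chosen alpha and beta. This is a simplification—commercial platforms use two-sample or mixture variants—and the function name is mine:

```python
from math import log

def sprt(observations, p0=0.03, p1=0.035, alpha=0.05, beta=0.20):
    """Wald's sequential probability ratio test for a Bernoulli stream.

    Tests H0: rate = p0 against H1: rate = p1, checking after every
    observation. alpha is the false positive rate, beta the false
    negative rate; both hold no matter how often you look.
    """
    upper = log((1 - beta) / alpha)  # cross above: accept H1
    lower = log(beta / (1 - alpha))  # cross below: accept H0
    llr = 0.0  # running log-likelihood ratio
    for i, converted in enumerate(observations, 1):
        if converted:
            llr += log(p1 / p0)
        else:
            llr += log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "accept H1", i
        if llr <= lower:
            return "accept H0", i
    return "continue", len(observations)
```

Note that the test is allowed to return "continue": the price of valid early stopping is that some experiments simply need more data before any conclusion is safe.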
The Bayesian Approach
Bayesian A/B testing is an alternative framework that sidesteps much of the peeking problem because it doesn't use p-values at all. Instead of asking "is this result extreme enough to reject the null hypothesis?", Bayesian methods ask "given the data we've seen, what's the probability that Variant B is better than Variant A?"
Bayesian results are directly interpretable: "There's an 87% probability that Variant B increases conversion." You can update this probability as data accumulates, and you can stop the experiment when you're sufficiently confident without the same false positive inflation that plagues frequentist peeking.
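The core computation is small. With a uniform Beta(1, 1) prior, each variant's conversion rate has a Beta posterior, and P(B > A) can be estimated by Monte Carlo with nothing but the standard library (the function name is illustrative):

```python
import random

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=20000, seed=0):
    """Monte Carlo estimate of P(rate_B > rate_A) under Beta(1,1) priors.

    With a uniform prior, the posterior for each variant's rate is
    Beta(conversions + 1, non-conversions + 1).
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        a = rng.betavariate(conv_a + 1, n_a - conv_a + 1)
        b = rng.betavariate(conv_b + 1, n_b - conv_b + 1)
        wins += b > a
    return wins / draws
```

Feeding in the opening scenario's numbers at, say, 2,000 visitors per arm (54 vs. 64 conversions) yields a probability well short of the near-certainty a p-value of 0.03 seems to imply—a useful corrective to early enthusiasm.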
Multi-armed bandits, like Thompson Sampling used in Experiment Flow, are inherently Bayesian. They continuously update their beliefs about each variant's performance and allocate traffic accordingly—which means they're making statistically valid decisions throughout the experiment, not just at the end.
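The per-visitor decision in Thompson Sampling is simple enough to sketch in a few lines. This is a generic illustration of the technique, not Experiment Flow's implementation, and the names are mine:

```python
import random

def thompson_assign(stats, rng=random):
    """Pick which variant the next visitor sees via Thompson Sampling.

    stats: {variant_name: (conversions, visitors)}. Each variant's rate
    gets a Beta(conversions + 1, non-conversions + 1) posterior; we draw
    one sample per variant and serve the variant with the highest draw.
    """
    best, best_draw = None, -1.0
    for name, (conv, n) in stats.items():
        draw = rng.betavariate(conv + 1, n - conv + 1)
        if draw > best_draw:
            best, best_draw = name, draw
    return best
```

Because each assignment is a fresh draw from the current posteriors, a clearly better variant gets most of the traffic while an uncertain one still gets enough to resolve the uncertainty—exploration and exploitation in one step.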
Practical Rules to Avoid Peeking Problems
- Calculate sample size before launch. Use a sample size calculator and commit to running the experiment until you hit it.
- Set an experiment end date. Calculate when you'll hit your sample size based on current traffic, and put it in your calendar. Don't check results before that date.
- Use automated significance monitoring. Let the platform notify you when an experiment has reached significance, rather than checking manually. Experiment Flow can auto-promote winners when both sample size and significance thresholds are met.
- Document your hypotheses and sample sizes before launch. Pre-registration (writing down your prediction before running the experiment) makes it much harder to rationalize early stopping as "we already knew this would work."
- Treat a non-significant result as information, not failure. If your experiment ran to plan and found no significant effect, that's a valid result. The change doesn't help. Move on to the next idea.
Ready to optimize your site?
Start running experiments in minutes with Experiment Flow. Plans from $29/month.
Get Started