The Complete Guide to Running Successful Online Experiments
Why Most Experiments Fail
Studies from major experimentation platforms consistently report that 70–90% of A/B tests show no statistically significant result. That doesn't mean experimentation is broken—it means most teams are testing the wrong things, measuring the wrong outcomes, or running experiments incorrectly.
This guide covers the full lifecycle of a successful online experiment, from forming your initial hypothesis through analyzing results and iterating. Whether you're running your first test or your hundredth, these principles will help you extract more value from every experiment.
Step 1: Form a Strong Hypothesis
A good experiment starts with a good hypothesis, not just an idea. The difference matters:
- Weak: "Let's test a new homepage design."
- Strong: "Replacing the feature list with customer testimonials on the homepage will increase signup rate because visitors need social proof before committing."
A strong hypothesis has three components:
- The change: What exactly are you modifying?
- The expected outcome: What metric will move, and in which direction?
- The reasoning: Why do you believe this change will have this effect?
The reasoning is the most important part. It forces you to articulate a theory about user behavior that can be validated or invalidated. Without it, you're just randomly changing things and hoping for improvement.
Prioritizing Hypotheses
Most teams have more ideas than testing capacity. Use a simple prioritization framework:
- Impact: How many users does this affect? How much could the metric move?
- Confidence: How strong is the evidence that this change will work? (User research, heatmaps, competitor analysis, past experiments)
- Effort: How hard is this to implement and measure?
High-impact, high-confidence, low-effort experiments go first. This seems obvious, but many teams skip prioritization entirely and test whatever the most senior person suggests.
Step 2: Calculate Your Sample Size
Before launching, know how long the experiment will need to run. This prevents two common mistakes: stopping too early (unreliable results) and running too long (wasted time).
The key inputs are:
- Baseline conversion rate: Your current metric value
- Minimum detectable effect (MDE): The smallest improvement worth detecting
- Statistical power: Typically 80% (the probability of detecting a real effect)
- Significance level: Typically 5% (the false positive rate you'll tolerate)
For a more detailed treatment of the statistics, see our guide to statistical significance in A/B testing.
A common mistake: setting an MDE smaller than your traffic can support. Detecting a 1 percentage point lift on a 3% base rate takes over 5,000 visitors per variant; detecting a 1% relative lift (3.00% to 3.03%) takes millions. Be realistic about what effect size matters to your business.
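To make the inputs above concrete, here's a rough sketch of the standard two-proportion, normal-approximation sample size formula, with illustrative numbers (this is a back-of-the-envelope estimate, not Experiment Flow's internal calculation):

```python
import math
from statistics import NormalDist

def sample_size_per_variant(baseline, mde, power=0.8, alpha=0.05):
    """Approximate visitors needed per variant for a two-proportion test.

    baseline: current conversion rate (e.g. 0.03 for 3%)
    mde:      minimum detectable effect as an absolute lift (e.g. 0.01 for +1pp)
    """
    p1, p2 = baseline, baseline + mde
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for a 5% two-sided test
    z_beta = NormalDist().inv_cdf(power)           # ~0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / mde ** 2)

# Detecting a 1pp lift on a 3% baseline:
print(sample_size_per_variant(0.03, 0.01))  # ~5,300 visitors per variant
```

Note how the required sample grows as the MDE shrinks: halving the detectable effect roughly quadruples the traffic you need.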
Step 3: Choose the Right Metrics
Every experiment needs a primary metric (the one you'll use to make your decision) and ideally several guardrail metrics (metrics that should not degrade).
Primary Metric
- Should be directly influenced by your change
- Should be measurable within the experiment's timeframe
- Should matter to the business
- Should have a reasonable base rate (see our guide on resolving experiments faster)
Guardrail Metrics
- Page load time (did your change slow things down?)
- Error rates (did you introduce bugs?)
- Upstream and downstream metrics (if signups increase but activation drops, you may be attracting lower-quality users)
Avoid Vanity Metrics
Metrics like "time on page" or "pages per session" are often ambiguous. More time on page could mean users are engaged—or it could mean they're confused and can't find what they need. Choose metrics with clear directional interpretation.
Step 4: Design the Experiment
Good experimental design controls for confounding variables and ensures clean measurement:
Randomization
Assignment must be truly random at the user level (not session level). A visitor who sees variant A on Monday should see variant A on Tuesday. Experiment Flow handles this automatically using consistent hashing on visitor IDs.
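If you're curious what hash-based assignment looks like under the hood, here's an illustrative sketch (not Experiment Flow's actual code): hash the visitor ID together with the experiment key and map the result into a fixed bucket range.

```python
import hashlib

def assign_variant(visitor_id, experiment_key,
                   variants=(("control", 50), ("treatment", 50))):
    """Deterministically assign a visitor to a variant.

    The same visitor always lands in the same bucket for a given
    experiment, while different experiments hash independently.
    """
    digest = hashlib.sha256(f"{experiment_key}:{visitor_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 100  # stable bucket in 0-99
    cumulative = 0
    for name, weight in variants:
        cumulative += weight
        if bucket < cumulative:
            return name
    return variants[-1][0]
```

Because assignment depends only on the IDs, no database lookup is needed at decision time, and a visitor who returns on Tuesday deterministically sees Monday's variant.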
Isolation
Your experiment should be the only thing that changes for the user. If the marketing team launches a new campaign midway through your test, that external change adds noise to your results. Coordinate with other teams.
Duration
Run experiments for complete business cycles. For most websites, this means at least one full week to capture day-of-week effects. For B2B products with longer consideration cycles, you may need two to four weeks.
Traffic Allocation
A standard 50/50 split maximizes statistical power for a two-variant test. If you're risk-averse, start with a 90/10 split (10% to the new variant) and ramp up once you've confirmed no negative impact. For multiple variants, consider using bandit algorithms to allocate traffic dynamically.
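As an illustration of the bandit approach, here's a minimal Beta-Bernoulli Thompson Sampling sketch (not Experiment Flow's internal implementation): sample a plausible conversion rate from each variant's posterior and send the visitor to the highest draw.

```python
import random

def thompson_choose(stats, rng=random):
    """Pick a variant by sampling each one's Beta posterior.

    stats maps variant name -> (conversions, misses). Variants that are
    winning produce the highest draw most of the time, so they receive
    more traffic; laggards still get occasional exploration.
    """
    draws = {
        name: rng.betavariate(conversions + 1, misses + 1)
        for name, (conversions, misses) in stats.items()
    }
    return max(draws, key=draws.get)

# Example: "b" has a clearly higher observed rate, so it wins most draws.
stats = {"a": (30, 970), "b": (60, 940)}
```

The appeal over a fixed split is that traffic shifts toward the leader automatically as evidence accumulates, which reduces the cost of showing visitors a losing variant.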
Step 5: Avoid Common Pitfalls
Peeking (the #1 Mistake)
Checking results daily and stopping when you see significance is the most common error in A/B testing. Early results are noisy, and "significant" results often regress to the mean as more data accumulates. If you flipped a fair coin and checked the tally every 20 flips, stopping the moment the split looked lopsided, you'd wrongly declare the coin biased far more than 5% of the time.
Solutions:
- Set your sample size in advance and commit to it
- Use sequential testing methods that account for multiple looks
- Use Experiment Flow's auto-promote feature, which only promotes when the configured confidence threshold is genuinely reached
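The inflation from peeking is easy to demonstrate with a simulation. This rough sketch runs repeated A/A tests (no true difference between arms), applies a z-test at 20 interim looks, and stops at the first "significant" result:

```python
import math
import random

def peeking_false_positive_rate(n_sims=1000, visitors=2000, peeks=20,
                                base_rate=0.05, seed=42):
    """Fraction of A/A tests declared 'significant' when we test at
    every interim look and stop at the first hit."""
    rng = random.Random(seed)
    z_crit = 1.96  # nominal 5% two-sided threshold
    hits = 0
    for _ in range(n_sims):
        n = conv_a = conv_b = 0
        for _ in range(peeks):
            for _ in range(visitors // peeks):
                conv_a += rng.random() < base_rate
                conv_b += rng.random() < base_rate
                n += 1
            pooled = (conv_a + conv_b) / (2 * n)
            se = math.sqrt(pooled * (1 - pooled) * 2 / n)
            # Stop early the moment the observed gap clears the threshold
            if se > 0 and abs(conv_a - conv_b) / n / se > z_crit:
                hits += 1
                break
    return hits / n_sims
```

Even though each individual look uses a 5% threshold, stopping at the first significant look pushes the overall false positive rate well above 5%.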
Multiple Comparisons
If you test 5 variants, you're making 10 pairwise comparisons. At a 5% significance level, there's roughly a 40% chance of at least one false positive even with no real differences. Apply corrections like Bonferroni (divide your significance level by the number of comparisons) or use Bayesian methods that handle this more naturally.
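The arithmetic behind that family-wise error rate, and the Bonferroni fix, fit in a few lines:

```python
from math import comb

variants = 5
comparisons = comb(variants, 2)            # 10 pairwise tests
alpha = 0.05
fwer = 1 - (1 - alpha) ** comparisons      # chance of >= 1 false positive (~40%)
bonferroni_alpha = alpha / comparisons     # corrected per-test threshold (0.005)

print(f"comparisons={comparisons}, FWER={fwer:.0%}, "
      f"per-test alpha={bonferroni_alpha:.3f}")
```

The correction is conservative: with a per-test threshold of 0.005 you need much stronger evidence per comparison, which is another reason to limit the number of variants you test at once.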
Novelty and Primacy Effects
When you change something, returning users notice. A new button design might get more clicks simply because it's different—not because it's better. This "novelty effect" fades over time. Conversely, experienced users may initially do worse with any change simply because they're used to the old design (the "primacy effect").
To account for this:
- Segment results by new vs. returning visitors
- Let experiments run long enough for novelty to wear off (typically 2–3 weeks)
- Compare the trend over time, not just the aggregate
Selection Bias
If your experiment only triggers on a specific page, your results only apply to visitors who reach that page. Be careful about generalizing. A change that works for checkout-page visitors may not work when applied to all visitors.
Simpson's Paradox
A variant can appear to win overall while actually losing in every segment. This happens when traffic distribution shifts between segments during the test. Always check segment-level results before declaring a winner.
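A concrete example (with made-up numbers) makes the trap visible: here variant B loses on both desktop and mobile, yet wins overall, because B's traffic happened to skew toward desktop, where everyone converts at a higher rate.

```python
# (visitors, conversions) per segment -- illustrative numbers only
data = {
    "desktop": {"A": (200, 60), "B": (800, 224)},   # A 30.0% vs B 28.0%
    "mobile":  {"A": (800, 80), "B": (200, 18)},    # A 10.0% vs B  9.0%
}

def rate(visitors, conversions):
    return conversions / visitors

def overall(arm):
    visitors = sum(data[seg][arm][0] for seg in data)
    conversions = sum(data[seg][arm][1] for seg in data)
    return conversions / visitors

# A wins in every segment...
for segment, arms in data.items():
    print(f"{segment}: A {rate(*arms['A']):.1%} vs B {rate(*arms['B']):.1%}")

# ...yet B wins overall (24.2% vs 14.0%), because B received far more
# traffic in the high-converting desktop segment.
print(f"overall: A {overall('A'):.1%} vs B {overall('B'):.1%}")
```

This is why an aggregate win should never be declared without first checking that the traffic mix between segments stayed stable across variants.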
Step 6: Analyze Results
When your experiment reaches its target sample size, it's time to analyze:
- Check the primary metric first. Is the difference statistically significant? What's the confidence interval around the lift?
- Check guardrail metrics. Did anything degrade? A 5% lift in signups isn't worth it if error rates doubled.
- Look at segments. Does the effect hold across devices, traffic sources, and user types? Or is it driven entirely by one segment?
- Consider practical significance. A statistically significant 0.1% lift may not be worth the engineering cost to maintain the new variant.
- Document everything. Record the hypothesis, results, and learnings. This institutional knowledge compounds over time.
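For the first check, a normal-approximation confidence interval on the absolute lift is a reasonable starting point (a sketch with illustrative numbers; a real analysis tool may use more sophisticated methods):

```python
import math
from statistics import NormalDist

def lift_ci(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Confidence interval for the absolute lift p_b - p_a."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = NormalDist().inv_cdf((1 + confidence) / 2)
    lift = p_b - p_a
    return lift - z * se, lift + z * se

# 3.0% -> 3.6% conversion on 10,000 visitors per arm:
low, high = lift_ci(300, 10_000, 360, 10_000)
# If the interval excludes zero, the lift is significant at that level;
# its width tells you how precisely the effect is pinned down.
```

The interval is often more useful than the bare p-value: a lift of "+0.6pp, plausibly anywhere from +0.1pp to +1.1pp" frames the practical-significance question directly.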
Step 7: Iterate
A single experiment is a data point. An experimentation program is a learning system. The real value comes from iteration:
- If the experiment won: Can you amplify the effect? What does this tell you about user behavior that you can apply elsewhere?
- If the experiment lost: Why? Was the hypothesis wrong, or was the implementation flawed? What's the next hypothesis?
- If the result was flat: The change probably doesn't matter. Ship whichever version is simpler and move on to higher-impact tests.
The most successful experimentation programs run continuously, maintain a backlog of prioritized hypotheses, and treat flat results as valuable information rather than failures.
Building an Experimentation Culture
Tools are necessary but not sufficient. The teams that get the most from experimentation share certain traits:
- Decisions are data-informed: No one ships a major change without testing it, regardless of seniority.
- Failure is expected: Most tests don't win. That's normal. The team celebrates learning, not just wins.
- Results are shared broadly: Experiment results are visible to the whole company, not locked in an analytics team's dashboard. Experiment Flow's public share links make this easy.
- Testing velocity is tracked: The number of experiments completed per month is a leading indicator of optimization maturity.
Getting Started
If you're early in your experimentation journey, start simple:
- Pick one high-traffic page and one clear metric.
- Form a hypothesis based on user research or analytics data.
- Run a single A/B test with two variants.
- Analyze the results honestly, including what you got wrong.
- Repeat, expanding scope as you build confidence.
Experiment Flow is designed to make this process as frictionless as possible. Create an experiment, integrate the lightweight SDK, and start collecting data in minutes. With built-in Thompson Sampling, contextual personalization, and automatic winner promotion, you can focus on the hypotheses rather than the infrastructure.
Start experimenting for free or explore the pricing page to learn more.
Ready to optimize your site?
Start running experiments in minutes with Experiment Flow. Plans from $29/month.
Get Started