The Scientific Method of A/B Testing: From Hypothesis to Discovery
A/B Testing Is Applied Science
When you run an A/B test, you're doing exactly what a scientist does in a lab: forming a hypothesis, designing a controlled experiment, collecting data, and drawing conclusions. The only difference is that your lab is a website and your subjects are visitors.
This isn't an analogy. The statistical methods used in A/B testing—hypothesis testing, p-values, confidence intervals—were invented by scientists like Ronald Fisher and Jerzy Neyman for agricultural and medical experiments in the early 20th century. The web just gave us a way to run experiments at massive scale with near-zero marginal cost.
Step 1: Observation
Every experiment begins with an observation. In science, it might be "plants grow taller near the window." In CRO, it might be "our pricing page has a 70% bounce rate" or "mobile users convert at half the rate of desktop users."
Good observations come from data, not hunches. Before hypothesizing, examine:
- Analytics: Where do visitors drop off? Which pages underperform?
- Heatmaps and session recordings: What are visitors actually doing?
- User feedback: What do customers complain about or ask for?
- Competitor analysis: What patterns do successful competitors use?
The key discipline: separate observation from interpretation. "Our signup form has 8 fields and a 12% completion rate" is an observation. "Our form is too long" is already an interpretation.
Step 2: Hypothesis
A scientific hypothesis is a testable, falsifiable prediction. In A/B testing, a well-formed hypothesis follows this template:
If we [make specific change], then [measurable outcome] will happen, because [reasoning based on observation].
Examples of strong hypotheses:
- "If we reduce the signup form from 8 fields to 3, completion rate will increase by at least 20%, because analytics show 60% of users abandon after the 4th field."
- "If we add customer logos above the fold, trial signups will increase by 10%, because our post-purchase surveys show trust is the #1 barrier to conversion."
- "If we change the CTA from 'Submit' to 'Start Free Trial', click-through rate will increase by 15%, because 'Submit' implies effort while 'Start Free Trial' emphasizes the benefit."
What Makes a Hypothesis Testable?
A hypothesis must be falsifiable—there must be a possible outcome that would prove it wrong. "Making the site better will improve conversions" isn't testable because "better" isn't defined. "Changing the headline to emphasize speed will increase signups by 5% or more" is testable because we can measure the exact outcome and compare it to the prediction.
Step 3: Experiment Design
This is where A/B testing mirrors laboratory science most closely. A well-designed experiment requires:
Control and Treatment
The control (variant A) is your current experience. The treatment (variant B) is the change you're testing. By comparing them under identical conditions, you can isolate the effect of your change.
The critical rule: change only one variable. If you simultaneously change the headline, button color, and form length, you won't know which change caused the result. This is the principle of controlled experimentation—the same principle that makes double-blind clinical trials reliable.
Randomization
Visitors must be randomly assigned to control or treatment. This ensures the groups are statistically equivalent on all dimensions: device, time of day, traffic source, intent. Randomization is what allows you to attribute any difference in outcomes to your change rather than to pre-existing differences between the groups.
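In practice, random assignment is often implemented by hashing a stable user ID, so the same visitor always lands in the same variant across sessions. Here is a minimal sketch in Python (the function name and 50/50 split are illustrative, not any specific tool's API):

```python
import hashlib

def assign_variant(user_id: str, experiment: str) -> str:
    """Deterministically bucket a user: the same user always sees the same variant."""
    # Hash the experiment name together with the user ID so different
    # experiments produce independent splits.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100  # roughly uniform over 0-99
    return "treatment" if bucket < 50 else "control"
```

Because the assignment is a pure function of the ID, there's no per-user state to store, and the split stays consistent even across devices if the ID is stable.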
Sample Size
Before starting, calculate the sample size needed to detect your minimum detectable effect (MDE) at your desired confidence level. This is the experiment's power analysis. Running a test without a predetermined sample size is like running a clinical trial without knowing how many patients you need—you'll either stop too early (false positive) or run too long (wasted resources).
# Sample size per variant for a two-proportion z-test:
# n = (Z_α/2 + Z_β)² × (p₁(1-p₁) + p₂(1-p₂)) / (p₁ - p₂)²
from statistics import NormalDist
def sample_size(p1, p2, alpha=0.05, power=0.80):
    z = NormalDist().inv_cdf
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return (z(1 - alpha / 2) + z(power)) ** 2 * variance / (p1 - p2) ** 2
# For 95% confidence, 80% power, 5% baseline, 10% relative lift:
# sample_size(0.05, 0.055) -> about 31,000 per variant
Step 4: Data Collection
Once the experiment is running, the key discipline is don't peek. In science, this is called avoiding interim analysis bias. If you check results daily and stop the test the first time p < 0.05, your actual false positive rate can be 3-5x higher than you think.
Why? Because statistical fluctuations mean that at some point during almost any test, the results will cross the significance threshold by chance. If you stop at that moment, you're selecting for noise, not signal.
There are three valid approaches:
- Fixed-horizon: Set your sample size in advance and only look at results when it's reached
- Sequential testing: Use methods like the Sequential Probability Ratio Test (SPRT) that are mathematically designed for continuous monitoring
- Multi-armed bandits: Let the algorithm handle exploration-exploitation tradeoffs, removing the need for a fixed endpoint
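You can see the peeking problem directly by simulating A/A tests, where both variants are identical, and stopping the first time p < 0.05. A rough sketch (all parameter values here are made up for illustration; a real simulation would use your own baseline rate and check schedule):

```python
import random
from statistics import NormalDist

def peeking_false_positive_rate(n_tests=500, n_per_arm=2000, checks=10, p=0.05):
    """Simulate A/A tests (no true difference) with repeated significance checks,
    stopping early whenever the z-statistic crosses the 95% threshold."""
    z_crit = NormalDist().inv_cdf(0.975)
    step = n_per_arm // checks
    false_positives = 0
    for _ in range(n_tests):
        a = b = 0
        for check in range(1, checks + 1):
            n = check * step
            a += sum(random.random() < p for _ in range(step))
            b += sum(random.random() < p for _ in range(step))
            pa, pb = a / n, b / n
            pooled = (a + b) / (2 * n)
            se = (2 * pooled * (1 - pooled) / n) ** 0.5
            if se > 0 and abs(pa - pb) / se > z_crit:
                false_positives += 1  # "significant" despite no real effect
                break
    return false_positives / n_tests
```

Even though each individual check uses a 5% threshold, checking ten times and stopping at the first crossing typically yields a false-positive rate several times higher, which is exactly the inflation described above.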
Step 5: Analysis
When your experiment reaches its predetermined sample size, analyze the results rigorously:
Statistical Significance
Calculate the p-value. If p < 0.05 (or whatever threshold you predetermined), the result is statistically significant. This means that if there were truly no difference between the variants, you'd see a gap this large less than 5% of the time. (Note that this is not the same as "a 95% chance the variant is better.")
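Under the usual normal approximation, the two-sided p-value for a difference between two conversion rates can be computed with the standard library alone. A minimal sketch (the function name and argument order are my own):

```python
from statistics import NormalDist

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for the difference between two conversion rates,
    using a pooled-variance z-test."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = (p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))
```

For example, 500 conversions out of 10,000 in control versus 600 out of 10,000 in treatment yields a p-value well below 0.05, while 500 versus 510 does not.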
Effect Size
Significance tells you the effect is unlikely to be noise; effect size tells you whether it matters. A 0.1% improvement might be statistically significant with enough traffic, but is it worth the engineering effort to implement? Always report the confidence interval alongside the point estimate.
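A simple (Wald) confidence interval for the absolute lift can also be computed from raw counts. This sketch assumes the same normal approximation as the z-test; the function name is illustrative:

```python
from statistics import NormalDist

def lift_confidence_interval(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Wald confidence interval for the absolute difference in conversion
    rates (treatment minus control)."""
    pa, pb = conv_a / n_a, conv_b / n_b
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    se = (pa * (1 - pa) / n_a + pb * (1 - pb) / n_b) ** 0.5
    diff = pb - pa
    return diff - z * se, diff + z * se
```

If the interval excludes zero, the result is significant at that confidence level; more usefully, its width tells you how precisely the lift is pinned down. An interval of (0.4%, 1.6%) around a 1% point estimate is a very different business case than (0.05%, 1.95%).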
Segment Analysis
Check whether the effect is consistent across segments (mobile vs desktop, new vs returning, different traffic sources). Be cautious here—running many sub-group analyses increases the chance of spurious findings. Pre-register which segments you plan to analyze.
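One common guard against spurious sub-group findings is a Bonferroni correction: divide your significance threshold by the number of segments you test. A minimal sketch (the function name and segment labels are hypothetical):

```python
def significant_segments(p_values, alpha=0.05):
    """Bonferroni correction: only flag segments whose p-value clears
    alpha divided by the number of comparisons."""
    threshold = alpha / len(p_values)
    return {seg: p for seg, p in p_values.items() if p < threshold}
```

With three segments, the per-segment bar drops from 0.05 to about 0.017, so a desktop result at p = 0.04 would no longer count as a finding on its own. Bonferroni is conservative; the broader point is to predetermine the segments and adjust for how many you look at.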
Step 6: Conclusion and Replication
In science, a single experiment is never conclusive. The same is true in A/B testing. Before rolling out a big winner:
- Consider the effect size: Is it large enough to be practically meaningful?
- Check for novelty effects: New designs sometimes win initially because they're different, not because they're better. Run the test long enough to account for this.
- Document everything: Record the hypothesis, methodology, results, and conclusions. Build an institutional knowledge base of what works and why.
- Generate new hypotheses: Every experiment—win or lose—teaches you something about your visitors. Use that knowledge to design the next experiment.
The Experimenter's Mindset
The deepest lesson from the scientific method isn't about statistics or sample sizes. It's about intellectual honesty. Good experimenters:
- Seek to disprove their own hypotheses, not confirm them
- Accept that most tests won't produce winners (industry average is about 1 in 7)
- Value surprising results more than expected ones
- Never cherry-pick metrics after the fact to make a test "win"
- Share negative results as openly as positive ones
The goal of A/B testing isn't to prove you were right. It's to find out what's true.
This mindset, more than any tool or technique, is what separates teams that continuously improve from teams that just run tests.
Ready to optimize your site?
Start running experiments in minutes with Experiment Flow. Plans from $29/month.
Get Started