Experiment Velocity: How to Double the Number of Tests You Run
Volume Beats Perfection
The uncomfortable truth about A/B testing is that individual experiment results are often wrong. False positives, regression to the mean, novelty effects, and external confounders mean that any single experiment's result should be held with moderate confidence, not certainty.
What protects you from this is volume. Teams that run 50 experiments per quarter accumulate enough signal to build a reliable picture of what works. Teams that run 5 experiments per quarter are dangerously reliant on each individual result.
Moreover, the compounding math of experimentation is unforgiving of low velocity. A team running 50 experiments per quarter with a 30% win rate ships 15 improvements per quarter. A team running 10 experiments per quarter with a 50% win rate ships only 5. Volume wins, even with a lower win rate.
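The arithmetic behind that comparison is worth making explicit. A quick sketch, using the illustrative experiment counts and win rates from this section (not benchmarks):

```python
# Shipped improvements per quarter = experiments run x win rate.
# The counts and win rates below are the illustrative figures from the text.

def shipped_wins(experiments_per_quarter: int, win_rate: float) -> float:
    return experiments_per_quarter * win_rate

high_volume = shipped_wins(50, 0.30)  # high volume, modest win rate
low_volume = shipped_wins(10, 0.50)   # low volume, strong win rate

print(f"High-volume team ships {high_volume:.0f} wins/quarter")
print(f"Low-volume team ships {low_volume:.0f} wins/quarter")
```

The high-volume team ships three times as many improvements per quarter despite a lower win rate.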
The Five Bottlenecks to Experiment Velocity
When teams run fewer experiments than they should, the constraint is almost always one of five things:
- Idea generation: Not having enough testable hypotheses in the backlog
- Prioritization: Not knowing which ideas to run first, so nothing moves forward
- Implementation: Each experiment requires significant engineering time to build
- Measurement: Waiting too long for statistical significance before moving on
- Culture: Fear of shipping a losing variant, so experiments aren't run
Identify which bottleneck is yours before trying to solve it. Fixing the wrong bottleneck produces no improvement.
Building a High-Output Hypothesis Backlog
The best experimentation teams maintain a living hypothesis backlog with 50–100 testable ideas at any given time. Generating that volume of ideas requires multiple input channels:
- Qualitative research: User interviews, support ticket analysis, and session recordings surface friction points that are invisible in quantitative data. One round of 10 user interviews typically generates 10–20 testable hypotheses.
- Funnel analysis: Where are users dropping off? Each significant drop point is an experiment opportunity. Track funnel conversion rates at each step weekly.
- Competitive analysis: What do high-converting competitors do differently? Not every difference is worth copying, but they're all worth hypothesizing about.
- Heatmaps and scroll maps: Do users scroll far enough to see your main CTA? Are they clicking things that aren't links? Are they ignoring things you think are important? Each of these observations generates experiment ideas.
- Customer success insights: Your CS team talks to churning customers every week. What do churning customers say they wished the product did? Each answer is a hypothesis.
Prioritization: ICE and PIE Frameworks
With a large backlog, prioritization determines which ideas actually get tested. Two frameworks work well for most teams:
ICE Score
Rate each hypothesis on three dimensions, 1–10 each:
- Impact: If this wins, how much does it move the needle?
- Confidence: How confident are you that the variant will outperform the control?
- Ease: How easy is this to implement?
ICE Score = (Impact + Confidence + Ease) / 3. Run the highest-scoring ideas first.
PIE Score
Similar to ICE but with different dimensions:
- Potential: How much improvement is possible here?
- Importance: How much traffic or revenue does this affect?
- Ease: How easy is this to implement?
Both frameworks share a key insight: ease matters. A 7-Impact, 10-Ease idea beats a 9-Impact, 2-Ease idea in most situations because the first one ships in a day while the second waits for a sprint.
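Scoring a backlog by ICE is simple enough to automate. A minimal sketch, using the averaging formula given above (the backlog entries and their ratings are hypothetical examples, not from the text):

```python
# Score and rank a hypothesis backlog by ICE: the average of the three
# 1-10 ratings (Impact, Confidence, Ease). Entries below are hypothetical.

def ice_score(impact: int, confidence: int, ease: int) -> float:
    return (impact + confidence + ease) / 3

backlog = [
    {"idea": "Shorten signup form",   "impact": 7, "confidence": 6, "ease": 10},
    {"idea": "Rebuild pricing page",  "impact": 9, "confidence": 7, "ease": 2},
    {"idea": "New homepage headline", "impact": 5, "confidence": 5, "ease": 9},
]

ranked = sorted(
    backlog,
    key=lambda h: ice_score(h["impact"], h["confidence"], h["ease"]),
    reverse=True,
)
for h in ranked:
    score = ice_score(h["impact"], h["confidence"], h["ease"])
    print(f"{score:.2f}  {h['idea']}")
```

Note how the ranking illustrates the point about ease: the high-ease signup-form idea outranks the high-impact but hard-to-build pricing-page rebuild.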
Reducing Implementation Time
Engineering bottlenecks kill experiment velocity faster than any other factor. The solution is a combination of tooling, patterns, and organizational structure:
- Standardized experiment components: Build reusable UI components (CTA buttons, banners, modal dialogs, form fields) that can be A/B tested with a config change rather than a code change. Once the infrastructure is in place, many experiments take hours rather than days to implement.
- Server-side experiment infrastructure: Server-side experiments let you change any aspect of your product's behavior—not just the UI—without shipping new client-side code. This dramatically expands what's testable without new deployments.
- Non-engineering experiments: Many high-impact experiments don't require engineering at all. Email copy tests, ad creative tests, pricing page copy tests, and outreach sequence tests can all be run by marketers independently. Don't let engineering be the bottleneck for work that doesn't require engineering.
- Experiment templates: Document your most common experiment types (headline test, CTA test, pricing page test, onboarding step test) with standard implementation patterns. New experiments that fit a known pattern take 80% less time to implement.
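A pattern that commonly underlies config-driven experiments is deterministic bucketing: hash a stable user ID together with the experiment name, so each user always sees the same variant and traffic splits live in config rather than code. A minimal sketch, with hypothetical experiment names and splits:

```python
import hashlib

# Deterministically assign a user to a variant by hashing
# (experiment_name, user_id) and mapping the hash onto the configured
# traffic split. The experiment config below is a hypothetical example.

def assign_variant(experiment: str, user_id: str, variants: dict[str, float]) -> str:
    """variants maps variant name -> traffic fraction; fractions should sum to 1."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # pseudo-uniform value in [0, 1]
    cumulative = 0.0
    for name, fraction in variants.items():
        cumulative += fraction
        if bucket <= cumulative:
            return name
    return name  # guard against rounding when fractions sum to just under 1.0

# Changing the split is a config edit, not a deploy.
config = {"control": 0.5, "new_cta": 0.5}
print(assign_variant("cta-color-test", "user-123", config))
```

Because assignment depends only on the hash, the same user gets the same variant on every visit with no session state to store.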
Reaching Significance Faster
Low-traffic pages can bottleneck velocity because experiments run for weeks or months before reaching significance. Strategies for reaching significance faster:
- Concentrate on high-traffic pages first: A two-percentage-point lift (say, 5% to 7% conversion) on a page that gets 100K visits/month reaches significance in days. The same test on a page with 1K visits/month takes months. Run experiments on your highest-traffic surfaces first.
- Multi-armed bandits: Bandit algorithms adaptively allocate more traffic to better-performing variants as evidence accumulates. On the same traffic volume, they typically reach a confident conclusion 30–50% faster than a fixed-split A/B test.
- Fewer, larger experiments: Testing 5 variants simultaneously on one page splits traffic across arms, so each comparison accrues samples more slowly than a standalone A/B test, but the total calendar time is still far less than running 5 tests back to back. Multi-variant tests are more efficient when you have several competing ideas for the same page.
- Segment the right audience: If your hypothesis is specific to a user segment (mobile users, returning visitors, users on a specific plan), run the experiment only on that segment. The smaller, more homogeneous sample reaches significance faster.
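To see why page traffic dominates test duration, here is a rough per-arm sample-size estimate using the standard normal-approximation formula for comparing two proportions (alpha = 0.05, power = 0.8; the baseline rate, lift, and traffic figures are illustrative):

```python
import math

# Rough per-arm sample size needed to detect an absolute lift in conversion
# rate, via the normal-approximation formula for two proportions.
# Baseline, lift, and traffic figures below are illustrative.

Z_ALPHA = 1.96  # two-sided alpha = 0.05
Z_BETA = 0.84   # power = 0.80

def per_arm_sample_size(baseline: float, lift: float) -> int:
    p1, p2 = baseline, baseline + lift
    p_bar = (p1 + p2) / 2
    n = ((Z_ALPHA * math.sqrt(2 * p_bar * (1 - p_bar))
          + Z_BETA * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2) / lift ** 2
    return math.ceil(n)

n = per_arm_sample_size(0.05, 0.02)  # detect a 5% -> 7% conversion lift
for monthly_visits in (100_000, 1_000):
    days = 2 * n / monthly_visits * 30  # two arms share the page's traffic
    print(f"{monthly_visits:>7} visits/mo: ~{days:.0f} days to reach n={n} per arm")
```

With these numbers, the 100K-visit page finishes in about a day or two, while the 1K-visit page needs over four months for the identical test.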
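The bandit approach mentioned above can be sketched with Thompson sampling over Beta posteriors: each round, sample a plausible conversion rate for every arm from its posterior and serve the arm with the highest draw. The true conversion rates simulated here are made up, not real data:

```python
import random

# Thompson sampling for a two-variant test: sample each arm's conversion
# rate from its Beta posterior and serve the sampled-best arm, so traffic
# shifts toward the better performer as evidence accumulates.
# The "true" rates below are simulated for illustration.

random.seed(42)
true_rates = {"control": 0.05, "variant": 0.10}
successes = {arm: 1 for arm in true_rates}  # Beta(1, 1) uniform priors
failures = {arm: 1 for arm in true_rates}

for _ in range(10_000):
    draws = {arm: random.betavariate(successes[arm], failures[arm])
             for arm in true_rates}
    chosen = max(draws, key=draws.get)        # serve the sampled-best arm
    if random.random() < true_rates[chosen]:  # simulate the user converting
        successes[chosen] += 1
    else:
        failures[chosen] += 1

for arm in true_rates:
    shown = successes[arm] + failures[arm] - 2  # subtract the prior pseudo-counts
    print(f"{arm}: shown {shown} times")
```

Unlike a fixed 50/50 split, most of the 10,000 simulated visitors end up seeing the better variant, which is where the speed advantage comes from.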
Building an Experimentation Culture
The hardest bottleneck to fix is culture. In organizations that punish "failures," experiments don't get run—because any experiment that doesn't win is labeled a failure. This is exactly backwards.
A losing experiment is not a failure. It's evidence. It tells you what doesn't work, which is as valuable as knowing what does. Teams that run 50 experiments per quarter and win 30% of them have learned more than teams that run 5 experiments and win 80% of them—because the first team has evidence about 35 things that don't work, which eliminates those ideas from the roadmap forever.
To build a healthy experimentation culture:
- Celebrate learnings from losing experiments explicitly. "This didn't work, and here's what we learned" should be rewarded, not shamed.
- Define experiment success as reaching a confident conclusion (in either direction), not as "the variant won."
- Share experiment results widely—wins and losses. A shared knowledge base of what works and what doesn't is a competitive advantage.
- Set experiment velocity targets (e.g., "10 experiments shipped per quarter per team") alongside outcome targets.
Ready to optimize your site?
Start running experiments in minutes with Experiment Flow. Plans from $29/month.
Get Started