March 13, 2026

Game Theory Meets the Scientific Method: How Strategic Thinking Accelerates A/B Testing

Tags: game theory, scientific method, A/B testing, Thompson Sampling, experimentation

Science Isn't Just for Labs

The scientific method is humanity's most reliable engine for generating knowledge. You observe something, form a hypothesis, design a controlled experiment, collect data, and draw conclusions. Then you do it again. Each cycle tightens your understanding of how the world actually works.

A/B testing is the scientific method applied to products and businesses. But most teams stop there. They test a headline, measure the conversion rate, and call it a day. They're doing science, but they're not doing it strategically.

What if you could model why different groups of users behave differently? What if you could predict which variables matter most before you test them? What if you could reach statistical significance in half the time?

That's where game theory comes in.

Building Theories, Not Just Running Tests

The scientific method isn't just about individual experiments. It's about building theories—coherent models of how a system works that let you make predictions about experiments you haven't run yet.

In physics, Newton didn't just observe that apples fall. He built a theory of gravity that predicted planetary orbits, tidal patterns, and cannonball trajectories. The theory was more valuable than any single experiment because it compressed a vast space of possible observations into a small set of principles.

The same principle applies to your product. Every A/B test result is a data point, but a theory of your users is the thing that ties those data points together and tells you what to test next. A team with a good user theory can skip dozens of tests that a team running blind would need to run.

Here's what a user theory might look like:

"Enterprise buyers care most about security and compliance messaging. SMB buyers care most about speed-to-value. Free trial users are sensitive to friction, paid users are sensitive to capability gaps."

This theory is testable and falsifiable. It generates specific, prioritized hypotheses. And it gets stronger (or gets revised) with every experiment you run.

Game Theory: Modeling People as Strategic Agents

Game theory is the branch of mathematics that studies how rational agents make decisions when their outcomes depend on other agents' choices. It was formalized by John von Neumann and Oskar Morgenstern in the 1940s and later expanded by John Nash.

The key insight: people don't behave randomly. They respond to incentives. If you understand the incentive structure facing each group of users, you can predict their behavior—and design experiments that cut straight to the variables that matter.

Your Users Are Players in a Game

Every visitor to your site is a rational agent (approximately) making decisions based on their own objectives, constraints, and information:

  • The price-sensitive shopper is playing a cost-minimization game. They'll respond to pricing changes, free tiers, and discount messaging.
  • The enterprise evaluator is playing a risk-minimization game. They'll respond to security certifications, case studies, and SLA guarantees.
  • The developer is playing an effort-minimization game. They'll respond to API quality, documentation, and integration speed.
  • The executive sponsor is playing a status-maximization game. They'll respond to brand recognition, competitor adoption, and ROI narratives.

Each group has a different payoff function—the thing they're trying to maximize (or minimize). When you understand these payoff functions, you stop testing blindly and start testing strategically.

Incentive Mapping

Before designing your next experiment, map out the incentive structure:

  • Price-sensitive: minimize cost. Responds to free tiers, discounts, and ROI proof. Test pricing page layout and trial length.
  • Risk-averse enterprise: minimize risk. Responds to security, compliance, and uptime SLAs. Test trust badges and case study placement.
  • Technical evaluator: minimize effort. Responds to docs, API quality, and quick starts. Test onboarding flow and code examples.
  • Executive buyer: maximize status and ROI. Responds to brand logos, metrics, and analyst reports. Test social proof and an ROI calculator.

This isn't just segmentation. It's strategic segmentation—segmentation informed by a model of what each group is trying to achieve.
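The incentive map can be written down as a data structure that drives test selection. A minimal sketch in Python; the segment keys and entries simply mirror the table above and are illustrative, not measured:

```python
# Hypothetical incentive map for the four segments above.
# Entries restate the strategic model; each is a hypothesis to test,
# not a measurement.
incentive_map = {
    "price_sensitive": {
        "incentive": "minimize cost",
        "responds_to": ["free tiers", "discounts", "ROI proof"],
        "tests": ["pricing page layout", "trial length"],
    },
    "risk_averse_enterprise": {
        "incentive": "minimize risk",
        "responds_to": ["security", "compliance", "uptime SLA"],
        "tests": ["trust badges", "case study placement"],
    },
    "technical_evaluator": {
        "incentive": "minimize effort",
        "responds_to": ["docs", "API quality", "quick start"],
        "tests": ["onboarding flow", "code examples"],
    },
    "executive_buyer": {
        "incentive": "maximize status/ROI",
        "responds_to": ["brand logos", "metrics", "analyst reports"],
        "tests": ["social proof", "ROI calculator"],
    },
}

def next_tests(segment: str) -> list[str]:
    """Look up what to test next for a given segment."""
    return incentive_map[segment]["tests"]

print(next_tests("technical_evaluator"))
```

Keeping the model in one explicit structure makes it easy to review, challenge, and update after each experiment.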

Nash Equilibria and the "Why" Behind User Behavior

In game theory, a Nash equilibrium is a state where no player can improve their outcome by changing their strategy, given what everyone else is doing. Understanding this concept helps explain why some A/B tests produce surprising results.

Consider a pricing experiment. You test a lower price: conversions go up, but revenue goes down. You test a higher price: conversions drop, but revenue per customer increases. The optimal price isn't the one that maximizes either metric in isolation; it's the equilibrium point where the revenue from the marginal customer you'd gain by lowering the price no longer covers the revenue you'd give up from customers who would have paid more.

This kind of reasoning lets you predict the shape of results before you run the test, which means you can:

  • Set tighter bounds on your expected effect size
  • Choose more precise metrics that capture the real tradeoff
  • Design tests that need fewer samples because you're not exploring blindly
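This tradeoff can be made concrete with a toy revenue model. The linear demand curve below is an assumption chosen purely for illustration; the point is that neither the conversion-maximizing price nor the highest price maximizes revenue:

```python
# Toy pricing-equilibrium sketch. The demand curve is assumed
# (conversion falls linearly from 8% at $0 to 0% at $200) just to
# show that revenue peaks between the two extremes.

def conversion_rate(price: float) -> float:
    """Assumed linear demand curve."""
    return max(0.0, 0.08 * (1 - price / 200))

def revenue_per_visitor(price: float) -> float:
    return price * conversion_rate(price)

prices = range(10, 200, 10)
best = max(prices, key=revenue_per_visitor)
print(best)  # conversion is highest at the lowest price, but revenue peaks in between
```

Sketching the predicted shape of the curve before testing tells you roughly where to place your price variants and what effect direction to expect.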

Controlling for Variables: The Heart of Good Science

The most common failure mode in A/B testing isn't a wrong conclusion from good data—it's a right conclusion about the wrong question. You tested headline A vs headline B, and B won. But what you actually changed was the headline and the page layout and the load time, because the new headline was shorter and collapsed the hero section.

Controlling for variables means isolating the thing you actually want to measure.

Strategies for Better Variable Control

1. One variable at a time (OVAT). The classic approach: change exactly one thing. Simple, clean, unambiguous. But slow—if you have 5 hypotheses, that's 5 sequential tests.

2. Factorial design. Test multiple variables simultaneously using a structured design matrix. A 2x2 factorial tests two variables (each with two levels) in just 4 variants, revealing not only the effect of each variable but also their interaction effects. If headline and CTA button interact (the best headline depends on the button text), factorial design catches this while OVAT misses it.

3. Stratified randomization. Ensure that known confounding variables (device type, traffic source, time of day) are evenly distributed across variants. If 60% of your Monday traffic is mobile but only 30% on Saturday, a test that starts on Monday and ends on Friday may confuse a device effect with a variant effect. Stratification prevents this.

4. Pre/post analysis with holdout groups. For changes that can't be randomly assigned (like site-wide performance improvements), use a holdout group that doesn't receive the change and compare trends before and after.
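The factorial design in point 2 can be sketched numerically. The conversion rates below are made up to illustrate the case where the best headline depends on the CTA, which is exactly what OVAT misses:

```python
# 2x2 factorial sketch: headline (A/B) x CTA (X/Y), with made-up
# conversion rates. Main effects average over the other factor; the
# interaction measures how much the headline effect depends on the CTA.
rates = {
    ("A", "X"): 0.040, ("A", "Y"): 0.044,
    ("B", "X"): 0.050, ("B", "Y"): 0.041,
}

headline_effect = ((rates[("B", "X")] + rates[("B", "Y")]) -
                   (rates[("A", "X")] + rates[("A", "Y")])) / 2
cta_effect = ((rates[("A", "Y")] + rates[("B", "Y")]) -
              (rates[("A", "X")] + rates[("B", "X")])) / 2
interaction = ((rates[("B", "Y")] - rates[("A", "Y")]) -
               (rates[("B", "X")] - rates[("A", "X")])) / 2

print(round(headline_effect, 4), round(cta_effect, 4), round(interaction, 4))
```

Here headline B wins with CTA X (5.0% vs 4.0%) but loses with CTA Y (4.1% vs 4.4%): a large negative interaction. A one-variable test of the headline alone would report a misleading average effect.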

Game-Theoretic Variable Control

Here's where game theory adds something new: if you model user segments by their incentives, you can identify which variables matter for which segments and design more efficient experiments.

Example: you're testing a new pricing page. A game-theoretic analysis suggests that price-sensitive users respond to the price itself while enterprise users respond to how the price is framed (per-seat vs. flat rate, monthly vs. annual). Instead of running a single test, you run two targeted tests and reach significance faster because each test has a larger expected effect size for its target segment.

Thompson Sampling: The Bayesian Shortcut

Traditional A/B testing is frequentist: you predetermine a sample size, split traffic 50/50, wait until the test is done, then analyze. This is rigorous, but it has a cost—during the test, you're sending half your traffic to the losing variant.

Thompson Sampling takes a Bayesian approach. Instead of fixed traffic allocation, it maintains a probability distribution over each variant's true conversion rate and adapts traffic in real time.

How It Works

  1. Start with a prior. Before any data, assume each variant's conversion rate follows a Beta(1,1) distribution (uniform prior—you have no idea which is better).
  2. Sample from the posterior. For each incoming visitor, draw a random sample from each variant's current Beta distribution.
  3. Assign the winner. Send the visitor to whichever variant had the highest sampled value.
  4. Update the posterior. After observing the outcome, update the distribution of the variant the visitor was assigned to: add 1 to α on a conversion, or 1 to β on a non-conversion, so its posterior becomes Beta(α + conversions, β + non-conversions) over time.

The result: Thompson Sampling automatically explores early (when uncertainty is high) and exploits later (when one variant is clearly winning). It converges to the optimal variant while minimizing regret—the total cost of showing inferior variants to visitors.
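The four steps above fit in a few lines of standard-library Python. This is a simulation sketch: the true conversion rates are inputs to the simulator and hidden from the algorithm, which sees only the Bernoulli outcomes:

```python
import random

# Minimal Thompson Sampling simulation of the four steps above.
random.seed(42)
true_rates = [0.04, 0.08]   # control, treatment (hidden from the bandit)
alpha = [1.0, 1.0]          # step 1: Beta(1, 1) uniform priors
beta = [1.0, 1.0]

for _ in range(20_000):
    # step 2: sample from each variant's posterior; step 3: assign the winner
    samples = [random.betavariate(alpha[i], beta[i]) for i in range(2)]
    chosen = samples.index(max(samples))
    # simulate the visitor's outcome, then step 4: update that posterior
    if random.random() < true_rates[chosen]:
        alpha[chosen] += 1
    else:
        beta[chosen] += 1

traffic = [int(alpha[i] + beta[i] - 2) for i in range(2)]
print(traffic)  # most traffic ends up on the better variant
```

Early on the posteriors overlap heavily, so both variants get traffic (exploration); as evidence accumulates, the better variant's samples win the draw more often and it absorbs most visitors (exploitation).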

Why Thompson Sampling Resolves Faster

Thompson Sampling doesn't just reduce regret—it often reaches actionable conclusions faster than fixed-horizon tests:

  • Adaptive allocation. As evidence accumulates, traffic shifts toward the winning variant. You don't waste half your traffic on a clear loser for the entire test duration.
  • No peeking problem. Unlike frequentist tests where early peeking inflates false positive rates, Thompson Sampling's Bayesian framework handles continuous monitoring naturally.
  • Informative priors. If your game-theoretic model predicts that a change should have a large effect on a specific segment, you can encode that as an informative prior. This gives the algorithm a head start and reduces the samples needed to converge.
// Thompson Sampling with informative prior
// Game theory predicts enterprise users will respond strongly to security messaging
//
// Instead of: Beta(1, 1)  -- uniform, no information
// Start with: Beta(3, 7)  -- prior belief: ~30% conversion for control
//             Beta(5, 5)  -- prior belief: ~50% for security-focused variant
//
// This encodes your strategic model into the algorithm,
// letting it converge faster if the model is right
// while still adapting if it's wrong.

Combining Game Theory + Thompson Sampling: A Worked Example

Let's walk through how these ideas work together on a real problem.

Scenario

You're a SaaS company with a 14-day free trial. Your signup page converts at 4%. You want to improve it.

Step 1: Map the Incentive Structure

You analyze your traffic and identify three main segments by their referral source and behavior:

  • Organic search visitors (45%): They found you by searching for a solution. High intent, but comparison-shopping. Incentive: find the best tool for the job.
  • Paid ad visitors (35%): They clicked an ad promising a specific benefit. Moderate intent, skeptical. Incentive: verify the ad's promise quickly.
  • Referral visitors (20%): Someone they trust recommended you. High intent, pre-sold on value. Incentive: minimize signup friction.

Step 2: Generate Hypotheses from Incentives

Instead of guessing what to test, your incentive model generates targeted hypotheses:

  • Organic: "Adding a comparison table against competitors will increase conversion for organic visitors by 15%, because they're in evaluation mode."
  • Paid: "Matching the landing page headline to the ad copy will increase conversion for paid visitors by 20%, because it confirms they're in the right place."
  • Referral: "Reducing the signup form to email-only will increase conversion for referral visitors by 25%, because they're already convinced and just need less friction."

Step 3: Design Controlled Experiments

You set up three experiments, one per segment, using multi-armed bandits in Thompson Sampling mode:

  • Each experiment targets a specific segment and tests a specific variable
  • The control for each is the current page, unchanged
  • Variables are isolated: each test changes exactly one thing
  • Thompson Sampling allocates traffic adaptively within each experiment

Step 4: Set Informative Priors

Because your game-theoretic model predicts large effects for the targeted segments, you set informative priors that let Thompson Sampling converge faster:

  • Control: Beta(4, 96) — prior belief of ~4% conversion (your known baseline)
  • Treatment: Beta(6, 94) — prior belief of ~6% conversion (your hypothesis predicts a lift)

If your model is right, the algorithm confirms it quickly. If it's wrong, the data overrides the prior within a few hundred observations.
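The "data overrides the prior" claim is just Beta-posterior arithmetic. A quick sketch, assuming the hypothesis is wrong and the treatment actually converts at the 4% baseline:

```python
# The informative prior from Step 4 versus incoming data.
# Prior Beta(6, 94) predicts a 6% conversion rate; suppose the observed
# data runs at the 4% baseline instead.
alpha, beta = 6.0, 94.0            # informative prior, mean 6%
conversions, visitors = 16, 400    # observed: 16/400 = 4% conversion

prior_mean = alpha / (alpha + beta)
posterior_mean = (alpha + conversions) / (alpha + beta + visitors)
print(round(prior_mean, 3), round(posterior_mean, 3))
```

After 400 observations the posterior mean has already been pulled from 6% down to 4.4%; the prior's 100 pseudo-observations are simply outvoted as real data accumulates.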

Step 5: Analyze and Iterate

After running the tests, you find:

  • Organic: comparison table increased conversion by 18%. Hypothesis confirmed, theory strengthened.
  • Paid: headline matching increased conversion by 22%. Hypothesis confirmed, theory strengthened.
  • Referral: simplified form had no effect. Hypothesis rejected. The theory needs updating—maybe referral visitors want the full form because they need to enter team information.

The rejected hypothesis is just as valuable as the confirmed ones. It tells you that your model of referral users was wrong and needs refinement. This is the scientific method at work: theories get updated, not defended.

Deductions You Can Make Sooner Than You Think

When you combine strategic user modeling with adaptive algorithms, several things happen that let you reach conclusions faster:

1. Smaller Effect Sizes Become Detectable

By segmenting users based on incentives, you increase the expected effect size within each segment. A change that produces a 3% overall lift might produce a 15% lift for the specific segment it targets. Larger effect sizes need fewer observations to detect.
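The sample-size payoff can be quantified with the standard two-proportion formula. This is a rough normal-approximation sketch (two-sided α = 0.05, power = 0.80); the baseline and lifts are the numbers from the paragraph above:

```python
from math import ceil, sqrt

# Rough two-proportion sample-size estimate (normal approximation).
Z_A, Z_B = 1.96, 0.84  # z-values for alpha/2 = 0.025 and power = 0.80

def n_per_variant(p_base: float, relative_lift: float) -> int:
    p1, p2 = p_base, p_base * (1 + relative_lift)
    p_bar = (p1 + p2) / 2
    numerator = (Z_A * sqrt(2 * p_bar * (1 - p_bar)) +
                 Z_B * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p2 - p1) ** 2)

overall = n_per_variant(0.04, 0.03)   # 3% relative lift on a 4% baseline
segment = n_per_variant(0.04, 0.15)   # 15% relative lift on a 4% baseline
print(overall, segment)  # the segment test needs over 20x fewer visitors per variant
```

A 3% relative lift on a 4% baseline needs hundreds of thousands of visitors per variant; the 15% segment lift needs under twenty thousand. Targeting the segment where the effect concentrates is where most of the speedup comes from.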

2. Cross-Experiment Learning

When you have a theory linking your experiments, results from one experiment inform your interpretation of others. If enterprise users respond to security messaging on the pricing page, they'll probably respond to it on the features page too. You can test this with a smaller sample because your prior is already informed.

3. Elimination by Mechanism

Game theory doesn't just predict what will happen—it predicts why. If your model says price-sensitive users respond to discounts because of cost-minimization, and you test a discount and it doesn't work, you can make a deduction: either the users aren't actually price-sensitive (segment model is wrong) or they don't believe the discount is real (trust is the actual barrier). Each rejected hypothesis narrows the possibility space, letting you converge on the truth faster.

4. Equilibrium Predictions

Some things don't need to be tested at all. If your game-theoretic model predicts that two strategies lead to the same equilibrium outcome, you can skip the test and focus your experimentation budget elsewhere. This is the scientific equivalent of a theoretical prediction that's confirmed by the math before the experiment is run.

Practical Playbook: Running Your First Game-Theoretic Experiment

  1. Identify your user segments. Use analytics data to define 3-5 segments. Don't segment by demographics—segment by what they're trying to achieve (their incentive).
  2. Map each segment's payoff function. What is each segment trying to maximize or minimize? Cost? Time? Risk? Status? Write it down.
  3. Generate hypotheses from payoff functions. For each segment, ask: "If this segment is trying to maximize X, what change to my product would make X more obvious or easier to achieve?" Each answer is a testable hypothesis.
  4. Prioritize by expected impact. Rank hypotheses by (segment size) x (expected effect size). Test the highest-impact hypothesis first.
  5. Run with Thompson Sampling. Use a multi-armed bandit in Thompson Sampling mode to adaptively allocate traffic. Set informative priors based on your model's predictions.
  6. Update your theory. After each experiment, update your model of user incentives. Confirmed hypotheses strengthen the model. Rejected ones tell you where the model is wrong.
  7. Repeat. Each cycle makes your theory more accurate and your experiments more efficient. Over time, you'll reach conclusions that would take a purely empirical approach months longer to discover.
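Step 4's prioritization is a one-line computation. A sketch using the worked example's segment sizes and hypothesized effects (assumptions, not measurements):

```python
# Rank hypotheses by segment size x expected effect size (Step 4).
# Sizes and effects are the worked example's assumptions.
hypotheses = [
    ("organic: comparison table", 0.45, 0.15),
    ("paid: headline matching",   0.35, 0.20),
    ("referral: email-only form", 0.20, 0.25),
]

ranked = sorted(hypotheses, key=lambda h: h[1] * h[2], reverse=True)
for name, size, effect in ranked:
    print(f"{name}: score {size * effect:.3f}")
```

Note the ordering isn't obvious from either column alone: the paid segment is smaller than organic but its larger expected effect puts it first, while the referral hypothesis, despite the biggest per-user effect, ranks last.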

The Compounding Advantage

Teams that run A/B tests without a theory are doing empirical science—collecting facts one at a time. Teams that combine testing with game-theoretic reasoning are doing theoretical science—building predictive models that make each subsequent experiment more efficient.

The advantage compounds. After 10 experiments, the empirical team has 10 isolated results. The theory-driven team has 10 results plus a refined model that tells them what to test next, which segments to target, and what effect sizes to expect. They're running half as many tests and getting twice the insight.

This is exactly how science progresses. Darwin didn't just observe finches. He built a theory of evolution that predicted what he'd find on every subsequent island. Einstein didn't just measure light. He built a theory of relativity that predicted gravitational lensing decades before anyone observed it.

Your product is your laboratory. Your users are your subjects. Game theory gives you the theoretical framework. Thompson Sampling gives you the adaptive engine. And the scientific method ties it all together into a system that gets smarter with every experiment you run.

Get started with Experiment Flow — built-in Thompson Sampling, automatic winner promotion, and the tools you need to run theory-driven experiments. Learn more about Thompson Sampling or see how to resolve experiments faster.

Ready to optimize your site?

Start running experiments in minutes with Experiment Flow. Plans from $29/month.

Get Started