Online Personalization and Optimization with Machine Learning: A Technical Deep Dive
Beyond A/B Testing: The Case for Machine Learning
A/B testing is the gold standard for comparing two specific variants. You show version A to half your visitors, version B to the other half, and after enough data, you pick the winner. It's simple, statistically rigorous, and well-understood.
But A/B testing has a fundamental limitation: it treats every visitor the same. A 25-year-old developer browsing on a phone at midnight has different preferences than a 55-year-old executive on a desktop at 9 AM. Traditional A/B testing picks one winner for everyone. Machine learning personalization picks the best variant for each visitor, in real time.
In this post, we'll walk through the ML techniques that make online personalization work—from the algorithms to the engineering—and show how Experiment Flow implements them in production.
The Personalization Stack
A production personalization system has four layers:
- Context collection — gathering visitor features (device, location, time, behavior history)
- Model inference — predicting which content variant will perform best for this visitor
- Action selection — choosing what to show, balancing exploitation of known-good options with exploration of uncertain ones
- Online learning — updating the model from each new interaction, without retraining from scratch
Each layer involves distinct ML techniques. Let's walk through them.
Contextual Bandits: The Core Algorithm
The contextual bandit framework is the workhorse of online personalization. Unlike a standard multi-armed bandit (which ignores context), a contextual bandit uses visitor features to make decisions.
The setup:
- A visitor arrives with a context vector x (e.g., device type, browser, time of day, referrer, past interactions)
- The system has K actions (content variants, product recommendations, CTA styles)
- A model predicts the expected reward for each action given this context: f(x, a) → r
- The system selects an action, the visitor interacts, and the model learns from the outcome
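The loop above can be made concrete with a minimal simulation. Everything here is an illustrative sketch, not Experiment Flow's implementation: a simple linear per-action reward model, a greedy `choose` with a fixed 10% exploration rate, and simulated binary rewards.

```python
import numpy as np

rng = np.random.default_rng(0)
K, D = 3, 4                          # 3 content variants, 4 context features

theta = np.zeros((K, D))             # one linear reward model per action
lr = 0.1

def choose(x):
    """Predict f(x, a) for every action and pick the best (greedy, for brevity)."""
    return int(np.argmax(theta @ x))

def update(x, a, r):
    """Learn from the observed outcome: one SGD step on squared error."""
    theta[a] += lr * (r - theta[a] @ x) * x

# Simulated interaction loop: the best action genuinely depends on the context.
true_theta = rng.normal(size=(K, D))
for _ in range(2000):
    x = rng.normal(size=D)
    a = choose(x) if rng.random() > 0.1 else int(rng.integers(K))  # 10% exploration
    r = float(true_theta[a] @ x > 0)     # simulated binary reward (click / no click)
    update(x, a, r)
```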
This is fundamentally different from collaborative filtering ("users like you also liked...") because it optimizes for a specific business outcome (conversion, revenue, engagement) rather than similarity.
Why Not Just Use a Classifier?
You could train a classifier to predict conversions and use it to rank variants. The problem is the exploration-exploitation tradeoff. A classifier will always recommend what it thinks is best—but "what it thinks" might be wrong, especially early on or for unusual visitor segments. Without exploration, the model gets stuck in local optima, never discovering that a different variant might work better.
Contextual bandits solve this by explicitly balancing exploration (trying uncertain options) with exploitation (using known-good options). The most common strategies:
- Thompson Sampling: Sample from the posterior distribution of each action's reward. Actions with higher uncertainty get explored more naturally.
- Upper Confidence Bound (UCB): Pick the action with the highest upper confidence bound on its reward estimate. Optimistic in the face of uncertainty.
- Epsilon-greedy: Usually pick the best action, but with probability ε pick a random one. Simple but effective.
Experiment Flow supports all three, but defaults to Thompson Sampling because it adapts exploration rate automatically—exploring more when uncertain, exploiting more as confidence grows.
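To see why Thompson Sampling adapts its exploration automatically, here is a minimal context-free sketch using Beta posteriors over each variant's conversion rate. The conversion rates are invented for illustration; a contextual system replaces these simple Beta counts with per-context uncertainty estimates.

```python
import numpy as np

rng = np.random.default_rng(1)

true_rates = [0.04, 0.12, 0.07]        # unknown conversion rates (illustrative)
successes = np.ones(3)                 # Beta(1, 1) uniform prior per variant
failures = np.ones(3)

for _ in range(5000):
    # Sample one plausible conversion rate per variant from its posterior...
    samples = rng.beta(successes, failures)
    a = int(np.argmax(samples))        # ...and show the variant with the best draw
    reward = rng.random() < true_rates[a]
    successes[a] += reward
    failures[a] += 1 - reward

pulls = successes + failures - 2       # traffic each variant actually received
```

Early on, wide posteriors let every variant win some draws; as evidence accumulates, traffic concentrates on the truly best variant without a hand-tuned exploration rate.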
Neural Networks for Reward Prediction
The key ML component is the model that predicts rewards. Linear models work for simple cases, but real visitor behavior is nonlinear. A developer using Chrome on Linux might respond very differently to a technical CTA than to a casual one, but only during work hours. Capturing these interaction effects requires a function approximator with enough capacity.
We use a shallow neural network (2 hidden layers) as the reward predictor:
Input: context vector x (visitor features + action encoding)
        |
  Dense(64, ReLU)   -- first hidden layer
        |
  Dense(32, ReLU)   -- second hidden layer
        |
  Dense(1, Sigmoid) -- predicted reward in [0, 1]
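The diagram above maps to a few small matrix multiplications. Here's a hypothetical NumPy forward pass with those exact shapes; the weight initialization is illustrative, not our production code.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 40  # context + action encoding, within the 20-50 dimensional range

# He-style initialization for the ReLU layers; zeros for the biases.
W1, b1 = rng.normal(0, np.sqrt(2 / D), (D, 64)), np.zeros(64)
W2, b2 = rng.normal(0, np.sqrt(2 / 64), (64, 32)), np.zeros(32)
W3, b3 = rng.normal(0, np.sqrt(2 / 32), (32, 1)), np.zeros(1)

def predict_reward(x):
    """Forward pass: Dense(64, ReLU) -> Dense(32, ReLU) -> Dense(1, Sigmoid)."""
    h1 = np.maximum(0, x @ W1 + b1)
    h2 = np.maximum(0, h1 @ W2 + b2)
    return 1 / (1 + np.exp(-(h2 @ W3 + b3)))

r = predict_reward(rng.normal(size=D))   # predicted reward in [0, 1]
```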
Feature Engineering
The context vector is critical. Raw features go through several transformations:
- Categorical encoding: Device type, browser, OS, and country are encoded as learned embeddings rather than one-hot vectors. This lets the model discover that "iPhone Safari" and "iPad Safari" are similar contexts.
- Temporal features: Time of day and day of week are encoded as sine/cosine pairs to capture cyclical patterns (11 PM is close to midnight, not far from it).
- Behavioral features: Page views, session duration, and past conversion history are normalized and included. These capture intent signals that static demographics miss.
- Action encoding: Each variant is represented as an embedding vector, so the model can generalize across similar actions.
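The cyclical encoding in particular trips people up, so here is a small sketch. The `cyclical` helper name is ours; the point is that 11 PM lands next to midnight in feature space.

```python
import numpy as np

def cyclical(value, period):
    """Encode a cyclic quantity as a (sin, cos) pair on the unit circle."""
    angle = 2 * np.pi * value / period
    return np.array([np.sin(angle), np.cos(angle)])

h23 = cyclical(23, 24)   # 11 PM
h0 = cyclical(0, 24)     # midnight
h11 = cyclical(11, 24)   # 11 AM

# 11 PM is close to midnight and far from 11 AM -- exactly the property a raw
# hour-of-day integer (23 vs 0) gets backwards.
```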
Why Shallow Networks?
Deep networks (10+ layers) are overkill for this problem. Our context vectors are typically 20-50 dimensional, and we're predicting a single scalar reward. Two hidden layers with 64 and 32 units provide enough capacity to model nonlinear interactions without overfitting on the relatively small per-visitor data. Shallow networks also have a practical advantage: they're fast enough for real-time inference at <1ms latency.
Online Learning: Training Without Batches
The defining feature of our system is that models update continuously from each interaction, rather than being retrained periodically on batch data.
Why Online Learning?
- Freshness: A batch-trained model is always stale. If visitor behavior shifts (new marketing campaign, seasonal change, trending content), a batch model won't adapt until the next retraining cycle. Online learning adapts in real time.
- Cold start: New personalizers start with no data. Online learning produces useful personalization after tens of interactions, not thousands.
- Simplicity: No training pipelines, no scheduled retraining jobs, no model deployment. The production model is always the latest model.
The Challenge: Single-Sample Gradients
In batch training, you compute gradients over hundreds of examples and average them. The average is a stable, reliable direction for weight updates. In online learning, each gradient comes from a single observation—a single visitor clicking or not clicking. This signal is inherently noisy.
We use three techniques to make single-sample learning stable (covered in detail in our previous post):
- Gradient clipping (clip norm = 1.0) prevents catastrophic updates from outlier observations
- Weight decay (coefficient = 0.0001) prevents overfitting to recent examples and provides implicit forgetting
- Learning rate scheduling (warmup + cosine decay) matches update magnitude to model maturity
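Put together, a single online update looks roughly like this. The clip norm and weight decay constants come from the list above; the `warmup` and `total` horizons are illustrative assumptions.

```python
import numpy as np

def lr_schedule(step, base_lr=0.01, warmup=100, total=10_000):
    """Linear warmup followed by cosine decay."""
    if step < warmup:
        return base_lr * step / warmup
    progress = (step - warmup) / (total - warmup)
    return base_lr * 0.5 * (1 + np.cos(np.pi * min(progress, 1.0)))

def sgd_step(w, grad, step, clip_norm=1.0, weight_decay=1e-4):
    """One online update: clip the noisy single-sample gradient, add weight decay."""
    norm = np.linalg.norm(grad)
    if norm > clip_norm:
        grad = grad * (clip_norm / norm)   # outlier observation can't blow up weights
    lr = lr_schedule(step)
    return w - lr * (grad + weight_decay * w)
```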
Thompson Sampling with Neural Networks
Combining Thompson Sampling with neural reward prediction requires estimating uncertainty. With a linear model, you get uncertainty for free from the Bayesian posterior. With a neural network, you need approximations.
Our approach uses dropout-based uncertainty estimation. During inference, we run the network multiple times with different random dropout masks and measure the variance of predictions:
# Pseudo-code for Thompson Sampling with neural uncertainty
predictions = []
for i in range(num_samples):
    pred = model.forward_with_dropout(context, action)
    predictions.append(pred)
mean_reward = mean(predictions)
uncertainty = std(predictions)
# Thompson sample: draw from the approximate posterior
sampled_reward = normal(mean_reward, uncertainty)
High variance across dropout samples indicates the model is uncertain about this context-action pair, naturally driving more exploration. As the model sees more data for similar contexts, uncertainty decreases and exploitation takes over.
Scaling: Engineering for Real-Time Personalization
A personalization system that's accurate but slow is useless—the ranking decision has to happen before the page renders. Here's how we keep inference under 5ms at the 99th percentile.
Model Architecture Choices
- Small networks: 64→32→1 architecture. Matrix multiplications are tiny and cache-friendly.
- No attention layers: Attention mechanisms are powerful but add O(n²) complexity. For 20-50 dimensional inputs, simple MLPs are faster and equally effective.
- Pre-computed embeddings: Categorical embeddings are looked up, not computed. The embedding tables are small enough to fit in L1 cache.
Batched Ranking
When ranking K actions for a visitor, we don't run K separate forward passes. Instead, we batch all actions into a single matrix multiplication:
# Instead of K separate forward passes:
#   for each action: reward = model(context, action)
# Batch all actions into one pass:
context_matrix = repeat(context, K)                   # [K x context_dim]
action_matrix = stack(action_embeddings)              # [K x action_dim]
input_matrix = concat(context_matrix, action_matrix)  # [K x total_dim]
rewards = model.forward(input_matrix)                 # [K x 1] in one pass
This reduces K forward passes to one, leveraging SIMD/vectorized operations for a ~5x speedup when ranking 10+ actions.
Async Learning Updates
Model updates happen asynchronously after the response is sent. The ranking decision is served immediately; the learning step (forward pass, loss computation, backpropagation, weight update) happens in the background. This ensures that learning never adds latency to the user-facing request.
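A minimal sketch of the pattern, using a Python queue and a background thread; the `record` name and the commented-out model call are ours, not Experiment Flow's API.

```python
import queue
import threading

updates = queue.Queue()

def learner():
    """Background thread: drain logged interactions and apply weight updates."""
    while True:
        item = updates.get()
        if item is None:                    # shutdown sentinel
            updates.task_done()
            break
        context, action, reward = item
        # model.update(context, action, reward) would run here
        updates.task_done()

worker = threading.Thread(target=learner, daemon=True)
worker.start()

def record(context, action, reward):
    """Called after the response is sent; returns immediately."""
    updates.put((context, action, reward))
```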
Measuring Personalization Effectiveness
How do you know personalization is actually working? You can't just look at overall conversion rate, because it conflates the model's performance with traffic quality, seasonality, and other factors.
Counterfactual Evaluation
The gold standard is counterfactual evaluation: estimate what would have happened if you'd used a different policy. We use inverse propensity scoring (IPS):
# IPS estimator for policy value
V(new_policy) = (1/N) * sum(
    reward_i * new_policy(action_i | context_i) / old_policy(action_i | context_i)
)
This uses logged data from the current policy to estimate the value of any alternative policy, without actually deploying it. It's statistically unbiased (though high-variance), letting you evaluate model improvements offline before shipping them.
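A tiny worked example (the logged rewards and propensities are invented): each logged reward is reweighted by how much more, or less, likely the new policy was to take that action.

```python
import numpy as np

# Logged tuples: (reward, prob. the old policy chose this action, prob. the new one would)
logs = [
    (1.0, 0.5, 0.9),     # a conversion on an action the new policy favors counts extra
    (0.0, 0.5, 0.1),
    (1.0, 0.25, 0.5),
    (0.0, 0.25, 0.05),
]

def ips_value(logs):
    """Unbiased (but high-variance) estimate of the new policy's expected reward."""
    return float(np.mean([r * p_new / p_old for r, p_old, p_new in logs]))

v = ips_value(logs)      # (1.8 + 0 + 2.0 + 0) / 4 = 0.95
```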
Personalization Lift
We also measure personalization lift: the improvement from personalized ranking versus random uniform ranking. This isolates the model's contribution from baseline content quality. A healthy personalizer shows increasing lift as it sees more data, plateauing as it converges on the optimal policy.
Real-World Results
Across customers using Experiment Flow's personalization features, we consistently see:
- 10-30% conversion lift over the best static A/B test winner, because the "best" variant differs by visitor segment
- Cold start in ~50 interactions: The model produces measurably better-than-random rankings within the first 50 visitor interactions
- Continuous improvement: Models keep learning and adapting as visitor behavior evolves, without manual intervention
- <5ms inference latency at the 99th percentile, fast enough for any client-side or server-side integration
Getting Started
If you're currently running A/B tests, you already have the foundation for personalization. The transition path:
- Start with A/B testing to validate that your variants are meaningfully different
- Enable bandit mode on experiments where you want automatic traffic optimization
- Add context features via the Personalizer API for full contextual personalization
- Monitor lift through the dashboard to quantify the value personalization adds
Each step builds on the previous one, and you can mix approaches—running traditional A/B tests for some experiments and ML-powered personalization for others.
// Example: Create a personalizer and rank content
const response = await fetch('/api/personalizers', {
  method: 'POST',
  headers: {
    'Authorization': 'Bearer YOUR_API_KEY',
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    name: 'homepage-hero',
    actions: ['minimal', 'detailed', 'video', 'testimonial']
  })
});

// Rank actions for a visitor
const rankResponse = await fetch('/api/personalize/PERSONALIZER_ID/rank', {
  method: 'POST',
  headers: {
    'Authorization': 'Bearer YOUR_API_KEY',
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    context: {
      device: 'mobile',
      hour: 14,
      referrer: 'google',
      returning: true
    }
  })
});
const ranking = await rankResponse.json();
// Returns: { action: 'testimonial', action_id: '...', rankings: [...] }

// Record outcome
await fetch('/api/personalize/PERSONALIZER_ID/reward', {
  method: 'POST',
  headers: {
    'Authorization': 'Bearer YOUR_API_KEY',
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    action_id: ranking.action_id,
    reward: 1.0 // conversion happened
  })
});
The model learns from every interaction and starts personalizing immediately. No training data required upfront, no model deployment, no infrastructure to manage.
What's Next
We're actively working on several extensions to our ML personalization stack:
- Experience replay: Storing past interactions and replaying them during quiet periods to improve sample efficiency
- Multi-objective optimization: Optimizing for multiple rewards simultaneously (e.g., clicks and purchases) with configurable tradeoffs
- Transfer learning: Using knowledge from one personalizer to warm-start another, reducing cold-start time for new experiments
- Causal inference: Going beyond correlation to estimate the true causal effect of personalization on business outcomes
Machine learning personalization isn't just for tech giants with massive data science teams. With the right abstractions, any team can move beyond static A/B testing to dynamic, per-visitor optimization. Try it free or explore the API docs to get started.
Ready to optimize your site?
Start running experiments in minutes with Experiment Flow. Plans from $29/month.
Get Started