February 22, 2026 · 12 min read

Inside Our Neural Contextual Bandit Engine

engineering · neural networks · contextual bandits

Beyond Simple Bandits

Traditional multi-armed bandits treat every visitor the same. They learn that variant B has a higher overall conversion rate and send most traffic there. But what if variant A converts better for mobile users from organic search, while variant B wins for desktop users from paid ads?

This is the problem contextual bandits solve. And at the heart of Experiment Flow's personalizer feature is a neural contextual bandit engine that learns which content works best for each visitor based on their context.

This post is a technical deep-dive into how it works.

Architecture Overview

The system has three core components:

  1. Embedding layer: Converts visitor context and action features into dense vector representations
  2. Neural network: Predicts the expected reward for each context-action pair
  3. Exploration strategy: Balances exploiting the best-known action with exploring alternatives

Embeddings: Making Context Computable

Visitor context arrives as a dictionary of features: {"device": "mobile", "source": "organic", "country": "US"}. The neural network can't work with raw strings, so we convert everything into a 384-dimensional dense vector using a text embedding model.

Each action (content variant) is also embedded the same way from its feature dictionary. Then we concatenate the context embedding with each action embedding to create a 768-dimensional input vector for each context-action pair.

// Embed context and actions
contextEmb = embed("device: mobile; source: organic; country: US")  // 384D
actionEmb  = embed("cta_text: Start Free Trial; color: blue")       // 384D
combined   = concat(contextEmb, actionEmb)                          // 768D

The embedding approach has a powerful property: generalization. Contexts the model hasn't seen before still get useful embeddings because semantically similar features land near each other in embedding space. A visitor from "Germany" on "mobile" benefits from what the model learned about visitors from "France" on "mobile" because those contexts are close in the 384-dimensional space.

The Neural Network

The prediction model is a 4-layer feedforward neural network:

Input (768D) → Dense(256, ReLU) → Dense(128, ReLU) → Dense(64, ReLU) → Dense(1, Sigmoid)
                                                                              ↓
                                                                   Reward prediction [0, 1]

The architecture is deliberately simple. With online learning (one sample at a time), complex architectures like transformers would overfit catastrophically. The MLP strikes the right balance: expressive enough to capture non-linear context-action interactions, simple enough for stable online updates.
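A minimal NumPy sketch of this forward pass, with layer sizes taken from the diagram (the weight values and initialization here are illustrative, not the engine's actual parameters):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Layer sizes from the architecture above: 768 -> 256 -> 128 -> 64 -> 1
sizes = [768, 256, 128, 64, 1]
rng = np.random.default_rng(0)
# Xavier-style scaling by layer dimensions (illustrative random weights)
weights = [rng.standard_normal((m, n)) * np.sqrt(2.0 / (m + n))
           for m, n in zip(sizes, sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

def forward(x):
    # Hidden layers use ReLU; the output layer uses sigmoid.
    for W, b in zip(weights[:-1], biases[:-1]):
        x = relu(x @ W + b)
    return sigmoid(x @ weights[-1] + biases[-1])

pred = forward(rng.standard_normal(768))  # reward prediction in (0, 1)
```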

Why Xavier Initialization Matters

Weights are initialized using Xavier initialization: random values drawn from a distribution scaled by the layer dimensions. This prevents the activations from exploding or vanishing in the early forward passes, giving the network a stable starting point before any training data arrives.
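The standard Glorot/Xavier uniform formula scales the sampling range by the layer's fan-in and fan-out, which keeps activation variance roughly constant across layers. A sketch:

```python
import numpy as np

def xavier_uniform(fan_in, fan_out, rng):
    # Glorot/Xavier uniform: sample from U(-limit, limit) with
    # limit = sqrt(6 / (fan_in + fan_out)).
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

rng = np.random.default_rng(42)
W = xavier_uniform(768, 256, rng)
# The resulting variance is limit**2 / 3 = 2 / (fan_in + fan_out).
```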

Why Sigmoid Output

The output layer uses sigmoid activation to bound predictions to [0, 1], matching the reward range. This prevents the network from predicting impossible values (like negative rewards or rewards greater than 1) and keeps the loss gradients well-behaved.

Online Learning: Training One Sample at a Time

Unlike batch machine learning where you train on thousands of examples at once, our contextual bandit learns online—updating the model after every single reward signal. This is challenging because:

  • No replay buffer: Each sample is used exactly once for training
  • Non-stationary distribution: The data distribution changes as the model's exploration policy changes
  • Catastrophic forgetting: New updates can overwrite what the model learned from previous contexts

We address these with three techniques: the Adam optimizer for adaptive learning rates, gradient clipping to prevent large updates from destabilizing the model, and weight decay to keep the weights small and generalizable.
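Gradient clipping and weight decay can be sketched as follows (the `max_norm` and `decay` values are illustrative defaults, not the engine's actual settings):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=1.0):
    # If the combined L2 norm of all gradients exceeds max_norm,
    # scale every gradient down by the same factor.
    total = np.sqrt(sum(float(np.sum(g * g)) for g in grads))
    if total > max_norm:
        grads = [g * (max_norm / total) for g in grads]
    return grads

def apply_weight_decay(params, lr, decay=1e-4):
    # Decoupled weight decay: shrink each weight slightly toward zero.
    return [p * (1.0 - lr * decay) for p in params]

grads = [np.full(4, 10.0)]            # a deliberately huge gradient
clipped = clip_by_global_norm(grads)  # rescaled so the global norm is 1.0
```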

Adam Optimizer

Adam maintains per-parameter moving averages of the gradient (momentum) and squared gradient (velocity). This gives each weight its own effective learning rate that adapts to the gradient history:

// Adam update for each parameter (t = number of updates so far)
momentum = β₁ × momentum + (1 - β₁) × gradient
velocity = β₂ × velocity + (1 - β₂) × gradient²
momentum_corrected = momentum / (1 - β₁ᵗ)   // bias correction
velocity_corrected = velocity / (1 - β₂ᵗ)
param   -= lr × momentum_corrected / (√velocity_corrected + ε)

For online learning, Adam is particularly valuable because it smooths out the noisy single-sample gradients through the momentum term, and it automatically scales down the learning rate for parameters that receive frequent large updates.

Learning Rate Scheduling

The learning rate follows a warmup + cosine decay schedule. During warmup (the first 50 updates), the learning rate increases linearly from near-zero to the configured value. After warmup, it follows a cosine curve that gradually reduces the rate. Early training makes large updates to quickly learn basic patterns; later training makes smaller, more precise updates to fine-tune.
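The schedule can be sketched like this (the warmup length of 50 comes from the text; `base_lr`, `total_steps`, and `min_lr` are illustrative assumptions):

```python
import math

def lr_schedule(step, base_lr=1e-3, warmup_steps=50,
                total_steps=10_000, min_lr=1e-5):
    # Linear warmup from near-zero up to base_lr ...
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    # ... then cosine decay from base_lr down to min_lr.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * min(1.0, progress)))
```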

Exploration: Thompson Sampling with Neural Uncertainty

Every contextual bandit faces the explore-exploit dilemma: should we show the content we believe is best (exploit), or try something different to learn more (explore)?

We use a neural variant of Thompson Sampling. The idea: add Gaussian noise to the network's predictions, scaled by an uncertainty estimate. Actions with uncertain predictions sometimes get boosted above the "best" action, causing the model to explore them and reduce its uncertainty.

// Thompson Sampling with decaying noise
noiseScale = predictionNoise / √(totalUpdates + 1)
for each action:
    prediction = network.forward(context, action)
    noisyPred = prediction + gaussian(0, noiseScale)
select action with highest noisyPred

The noise scale decreases as 1/√n, meaning the model explores aggressively early (when it knows little) and becomes increasingly confident as it accumulates training data. This is a principled approximation of Bayesian uncertainty without the computational cost of maintaining a full posterior distribution over network weights.
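The decaying-noise selection can be made concrete in Python (the `prediction_noise` value and use of NumPy are assumptions for illustration):

```python
import numpy as np

def select_action(predictions, total_updates, prediction_noise=0.1, rng=None):
    # Thompson-style selection: perturb each prediction with Gaussian noise
    # whose scale decays as 1/sqrt(n), then pick the argmax.
    rng = rng or np.random.default_rng()
    noise_scale = prediction_noise / np.sqrt(total_updates + 1)
    noisy = np.asarray(predictions) + rng.normal(0.0, noise_scale, size=len(predictions))
    return int(np.argmax(noisy))

# Early on, the noise can flip a close ranking (exploration);
# after many updates the noise is tiny and the best action wins.
early = select_action([0.50, 0.51], total_updates=0, rng=np.random.default_rng(1))
late  = select_action([0.50, 0.51], total_updates=1_000_000, rng=np.random.default_rng(1))
```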

Ranking with Softmax Probabilities

When the API receives a rank request, it doesn't just return the top action. It returns all actions ranked with probabilities:

// Rank all actions and convert scores to probabilities
predictions = [network.forward(context, action) for action in actions]
sorted_actions = sort_by(predictions, descending)
probabilities = softmax(predictions)

Softmax converts the raw reward predictions into a probability distribution that sums to 1. Actions with similar predicted rewards get similar probabilities; actions with much higher predictions dominate the distribution. This gives API consumers the flexibility to either always pick the top action (exploit) or sample from the distribution (explore) based on their use case.
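A standard softmax implementation subtracts the maximum score before exponentiating for numerical stability; a sketch:

```python
import numpy as np

def softmax(scores):
    # Subtracting the max leaves the result unchanged but avoids overflow.
    z = np.asarray(scores, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

# Similar predicted rewards -> similar probabilities; the vector sums to 1.
probs = softmax([0.72, 0.68, 0.31])
```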

Performance in Practice

The entire rank operation—embedding, forward pass, sorting, and probability computation—completes in under 5 milliseconds for typical requests with 5-10 actions. This is achieved through:

  • Flat weight arrays: Row-major contiguous memory layout for cache-friendly matrix multiplication
  • Buffer pooling: Pre-allocated activation buffers reused across requests (zero allocation in the hot path)
  • Read-write locks: Multiple rank requests proceed concurrently; training takes an exclusive lock only briefly
  • Sparse activation skipping: ReLU zeros are detected and skipped during matrix multiplication

Training (the reward endpoint) takes slightly longer due to the backward pass, but still completes in under 10ms. Model weights are saved to the database every 100 updates, asynchronously, so the reward API response isn't blocked by persistence.

The Feedback Loop

The full personalization loop works like this:

  1. Visitor arrives → context features are captured
  2. Rank API embeds context, predicts rewards, returns ranked actions
  3. Your application shows the top-ranked content to the visitor
  4. If the visitor converts, you call the reward API with reward=1.0
  5. The model updates its weights based on this feedback
  6. The next visitor benefits from this updated model

Each iteration tightens the model's understanding of which content works for which visitors. Over hundreds or thousands of interactions, the personalization becomes increasingly precise.
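The shape of this loop can be illustrated with a toy stand-in for the engine, a running average reward per (context, action) pair rather than the real neural model; everything here (names, scoring) is for illustration only:

```python
from collections import defaultdict

# (context, action) -> [reward_sum, count]; 0.5 prior score when unseen.
stats = defaultdict(lambda: [0.0, 0])

def rank(context, actions):
    # Step 2: score each action for this context and return them ranked.
    def score(a):
        s, n = stats[(context, a)]
        return s / n if n else 0.5
    return sorted(actions, key=score, reverse=True)

def reward(context, action, value):
    # Steps 4-5: fold the observed reward back into the model.
    stats[(context, action)][0] += value
    stats[(context, action)][1] += 1

actions = ["blue_cta", "green_cta"]
reward("mobile", "green_cta", 1.0)   # a mobile visitor converted on green
best = rank("mobile", actions)[0]    # the next mobile visitor sees green first
```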

The beauty of online learning is that the model never stops improving. Every visitor interaction, whether a conversion or not, teaches the model something new about what works.
