What an Open-Source Trading Lab Teaches Us About Online Experimentation
The hardest problems in algorithmic trading and the hardest problems in conversion rate optimisation rhyme more than people expect. Both are sequential decision problems under partial information. Both punish stale beliefs ruthlessly. Both reward systems that update in production rather than systems that ship a new model once a quarter. And both, when done well, lean on a small set of techniques that have crossed over from one domain to the other for years: multi-armed bandits, gradient-boosted trees, and continual learning.
The reference point for this post is github.com/lee101/stock-prediction — an open-source monorepo of forecasting and reinforcement-learning experiments that runs against Alpaca and Binance. It is messy in the way real research code is messy: dozens of experiment scripts, a handful of competing model families, a marketsim that is the source of truth, and a deploy pipeline that promotes only models that beat a target on unseen data. The same patterns that make that pipeline work in markets are what make Experiment Flow work on websites.
The Trading Stack in One Page
Stripped to its essentials, the stock-prediction repo does four things on every cycle:
- Forecast next-period returns and volatility for each tradable symbol. The forecaster is a Chronos-style time-series transformer fine-tuned with LoRA adapters and warmed up by a fast tabular model: gradient-boosted trees serve as the fast, calibrated baseline that the heavier model has to beat to earn its compute, the same role XGBoost plays in most production pipelines.
- Score policies over those forecasts. A PufferLib C-CUDA market simulator runs binary-fill backtests at `decision_lag ≥ 2` with realistic fees, slippage, and overnight gaps. Soft-fill simulators are kept around for gradient flow; binary fills are the ground truth.
- Allocate capital across competing strategies with an exploration policy: in practice an ε-greedy or Thompson-sampling style mixer that keeps probing under-exploited strategies even when one looks dominant.
- Promote only models whose median monthly PnL on unseen evaluation windows clears a fixed bar (the repo’s rule is 27% per month worst-cell, and a fail-fast drawdown check kills the candidate in seconds if it bleeds early).
That last step is the most under-appreciated piece of any production ML system. It is also exactly what Experiment Flow does for product changes: nothing earns 100% of traffic until it has cleared a statistical bar against a control on unseen visitors.
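The shape of that gate is worth seeing in code. Here is a minimal Python sketch of a worst-cell promotion gate with a fail-fast drawdown check; the names (run_backtest, result.max_drawdown, the cell grid) are illustrative stand-ins, not the repo's actual interfaces:

```python
from statistics import median

MONTHLY_PNL_BAR = 0.27      # promotion bar: 27% median monthly PnL, worst cell
FAIL_FAST_DRAWDOWN = 0.10   # illustrative: kill a candidate that bleeds early

def evaluate_cell(candidate, cell, unseen_windows, run_backtest):
    """Median monthly PnL for one slippage/fee/lag cell, with fail-fast."""
    pnls = []
    for window in unseen_windows:
        result = run_backtest(candidate, window, **cell)
        if result.max_drawdown > FAIL_FAST_DRAWDOWN:
            return float("-inf")    # fail fast: no point finishing the grid
        pnls.append(result.monthly_pnl)
    return median(pnls)

def should_promote(candidate, grid_cells, unseen_windows, run_backtest):
    # Judge the candidate by its *worst* cell, not its average one.
    worst = min(
        evaluate_cell(candidate, cell, unseen_windows, run_backtest)
        for cell in grid_cells
    )
    return worst >= MONTHLY_PNL_BAR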
Multi-Armed Bandits: The Same Algorithm, Two Different Reward Streams
A multi-armed bandit problem is any setting where you must repeatedly choose between options whose payoffs you do not fully know, and where every pull both earns you a reward and teaches you something about the arm you pulled. Trading and experimentation are both bandits in the strictest textbook sense.
In a trading loop
The arms are strategies (or symbols, or position sizes). Each “pull” is a trade. The reward is realised PnL. Thompson sampling shines here because it naturally handles non-stationarity: when an arm starts losing, its posterior tightens around a worse mean and it gets pulled less, but it never gets dropped to zero, so the system can rediscover an old regime when it returns. The stock-prediction repo’s autoresearch loop is a discrete version of this: candidate hyperparameter cells get budgeted exploration, and cells that beat the target on unseen data get more capital.
In an experimentation loop
The arms are variants (button copy, layout, pricing, recommendation rankings). Each “pull” is a visitor assignment. The reward is conversion. Experiment Flow runs the same Thompson-sampling logic in bandit.go — Beta(α, β) posteriors per arm, sample, pick the arm with the highest sample, update α or β from the outcome. The mathematics is identical. What changes is the noise model: trading rewards are heavy-tailed and serially correlated, conversions are Bernoulli but suffer from confounders like time of day and traffic source.
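The whole algorithm fits in a few lines. The sketch below is a Python rendering of that Beta-Bernoulli logic, not a transcription of bandit.go; the arm names are invented for illustration:

```python
import random

class ThompsonBandit:
    """Beta-Bernoulli Thompson sampling: one Beta(α, β) posterior per arm."""

    def __init__(self, arms):
        # Beta(1, 1) is a uniform prior over each arm's conversion rate.
        self.posteriors = {arm: [1.0, 1.0] for arm in arms}

    def pick(self):
        # Sample a plausible conversion rate from each posterior,
        # then play the arm whose sample came out highest.
        samples = {
            arm: random.betavariate(a, b)
            for arm, (a, b) in self.posteriors.items()
        }
        return max(samples, key=samples.get)

    def update(self, arm, converted):
        # A success bumps α, a failure bumps β.
        if converted:
            self.posteriors[arm][0] += 1
        else:
            self.posteriors[arm][1] += 1

# Usage: assign a visitor, observe the outcome, feed it back.
bandit = ThompsonBandit(["control", "variant_a", "variant_b"])
arm = bandit.pick()
bandit.update(arm, converted=True)
```

Because every arm always retains a nonzero chance of producing the highest sample, a losing arm is throttled rather than killed, which is exactly the regime-rediscovery property the trading loop relies on.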
The transferable insight runs both ways. A trading desk learns from CRO that guardrail metrics matter: an arm that prints PnL while quietly increasing tail risk is the analogue of a variant that lifts signups while collapsing day-7 retention. A web team learns from trading that regime shifts are the rule, not the exception: traffic mixes change, and a winner in February can be a loser in April. Bandits handle this gracefully; fixed A/B splits do not.
XGBoost and Gradient Boosting: The Quiet Workhorse
XGBoost (and its cousins LightGBM and CatBoost) is the model people quietly reach for when they need calibrated probabilities from tabular features fast, and when they cannot afford the latency or the brittleness of a deep network. In the stock-prediction stack, gradient-boosted trees show up in two places:
- Feature filters. Before a heavier transformer forecast runs, a boosted-tree gate decides whether the symbol is even worth predicting on. The gate is trained on cheap engineered features — rolling vol, recent return autocorrelations, microstructure ratios — and rejects most of the universe most of the time. This is the same idea as a cascade detector in vision: cheap-and-fast first, expensive-and-slow only when needed.
- Calibration. A neural forecaster is excellent at ranking opportunities but often poorly calibrated in absolute terms. A small boosted-tree post-processor takes the neural model’s output as a feature, blends it with a handful of contextual features, and emits a probability that you can actually feed into a Kelly fraction or risk-budget without it blowing up.
In experimentation, the same pattern applies. Experiment Flow uses gradient-boosted models internally to score lift estimates against contextual features — device class, referrer, hour-of-day, prior session count — so that personalised experiences can be served without overfitting to whichever segment happened to convert most loudly last week. The boosted tree is not the headliner; it is the calibrator that lets the bandit and the neural ranker be honest about what they actually know.
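The two-stage pattern is easy to sketch. The Python below uses scikit-learn's gradient-boosted trees (XGBoost or LightGBM would slot in the same way) on synthetic data with invented feature names; it illustrates the cascade-and-calibrate idea, not any production code:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)

# Stage 1: a cheap gate on engineered features (invented here).
# Columns: rolling vol, return autocorrelation, a microstructure ratio.
X_cheap = rng.normal(size=(5000, 3))
worth_predicting = (X_cheap[:, 0] > 0.5).astype(int)    # synthetic label

gate = GradientBoostingClassifier(n_estimators=50, max_depth=3)
gate.fit(X_cheap, worth_predicting)

def should_run_heavy_model(features, threshold=0.8):
    # Reject most of the universe most of the time; only confident
    # candidates earn the expensive transformer forecast.
    return gate.predict_proba(features.reshape(1, -1))[0, 1] > threshold

# Stage 2: calibrate the heavy model's raw score into an honest probability.
raw_score = rng.normal(size=(5000, 1))      # stand-in for the neural output
context = rng.normal(size=(5000, 2))        # e.g. vol regime, spread
X_calib = np.hstack([raw_score, context])
outcome = (raw_score[:, 0] + 0.3 * context[:, 0] > 0).astype(int)

calibrator = GradientBoostingClassifier(n_estimators=100, max_depth=2)
calibrator.fit(X_calib, outcome)

def calibrated_probability(neural_score, ctx):
    x = np.hstack([[neural_score], ctx]).reshape(1, -1)
    return calibrator.predict_proba(x)[0, 1]    # safe to size or rank against
```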
Continual Learning: Why the Production Model Is Always Yesterday’s Model
Continual learning — sometimes called online learning, lifelong learning, or just “updating the model in production” — is the single biggest gap between research codebases and production systems. A research model is trained once on a static dataset and reported as a number. A production model is updated continuously as new data arrives, and the only number that matters is its rolling out-of-sample performance.
The stock-prediction repo treats this as a first-class concern. The Chronos2 forecaster is fine-tuned with LoRA adapters on rolling windows so that recent regime data actually shifts the predictions. The RL trading policies are retrained against a marketsim that itself rolls forward. The deploy script (scripts/deploy_live_trader.sh) is paranoid about exactly which version of the model is live and refuses to swap it in unless the new version has cleared the unseen-data bar. The reason for all that paranoia is simple: in markets, a model that is six months out of date is not a slightly worse model — it is a different model, fitted to a regime that no longer exists.
The same is true for websites. Visitor mix shifts. Competitors move. Seasonal effects rotate. A static A/B test winner from a Black Friday cohort will lose in February. Experiment Flow’s neural contextual bandit engine updates its embeddings and rankers continuously from event streams, with the same disciplines borrowed from production trading: gradient clipping to keep updates from blowing up under noisy reward signals, weight decay to keep the model from overfitting to recent unrepresentative traffic, and learning-rate schedules that decay during stable regimes and warm back up after a detected shift.
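Those three disciplines are concrete enough to show. Here is a PyTorch-flavoured sketch of an online update step; the model, loss, and shift detection are stand-ins rather than anyone's actual internals:

```python
import torch

model = torch.nn.Linear(32, 1)          # stand-in for the real ranker
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-3,
    weight_decay=1e-2,                  # resist overfitting to recent traffic
)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.999)

def online_step(features, reward):
    """One continual-learning update from a single batch of fresh events."""
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(features), reward)
    loss.backward()
    # Gradient clipping: one fat-tailed fill or bot session must not
    # blow up the whole model in a single update.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()                    # learning rate decays while stable

def on_regime_shift():
    # Warm the learning rate back up after a detected shift so the
    # model can actually chase the new regime.
    for group in optimizer.param_groups:
        group["lr"] = 1e-3
```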
What the Trading Repo Borrowed From Web Experimentation
The traffic flows in the other direction too. A few patterns the stock-prediction codebase imported from the world of online experimentation:
- Holdouts and decision lag. Backtests at `decision_lag ≥ 2` mirror the discipline of holding out a clean cohort in an A/B test: the model never gets to see the outcome of the trade it is currently scoring.
- Guardrails and worst-cell evaluation. Promoting on median monthly PnL across a grid of slippage / fee / lag cells is the trading version of guardrail metrics in CRO. A model that wins on average but loses badly in one cell is rejected, just as a variant that wins overall but tanks support-ticket volume is rejected.
- One-writer-at-a-time guarantees. The repo’s singleton lock around the live Alpaca writer is the same pattern as ensuring exactly one variant is bound to a given visitor for the duration of an experiment — consistency under concurrency is non-negotiable when money or trust is on the line.
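The singleton-lock pattern is small enough to sketch. A minimal Unix-only version using an advisory file lock follows; the path and function names are illustrative, not the repo's:

```python
import fcntl
import sys

LOCK_PATH = "/tmp/live_trader.lock"     # hypothetical lock location

def acquire_singleton_lock():
    lock_file = open(LOCK_PATH, "w")
    try:
        # Non-blocking exclusive lock: fail immediately if another
        # live writer already holds it. flock is advisory, so every
        # writer must pass through this same gate.
        fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        sys.exit("another live writer is running; refusing to start")
    return lock_file    # keep the handle alive for the process lifetime

lock = acquire_singleton_lock()
# ... run the live writer loop while holding the lock ...
```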
The Pattern, Stated Cleanly
Strip the domains away and the production loop is the same in both places:
```python
# pseudocode shared by both worlds
while True:
    context = observe()                  # market state | visitor context
    arms = candidate_actions(context)    # strategies | variants
    scores = forecaster(context, arms)   # XGBoost + transformer ensemble
    arm = bandit_pick(arms, scores)      # Thompson sampling / UCB
    reward = act(arm)                    # trade fill | conversion event
    update_online(arm, reward)           # continual learning step
    if rolling_eval(arm) < bar:
        retire(arm)                      # death-spiral guard | guardrail trip
```
That loop is what the stock-prediction repo runs against markets, and what Experiment Flow runs against your website. The differences are in the noise model, the latency budget, and the regulatory surface — not in the algorithm.
Where to Read More
- github.com/lee101/stock-prediction — the open-source repo this post is drawn from. The bandit-flavoured allocator, Chronos-style forecaster, and PufferLib marketsim are all there, along with the deploy and singleton-lock infrastructure.
- Multi-Armed Bandits vs A/B Testing: When to Use Each — the same algorithm in conversion-rate clothing.
- Inside Our Neural Contextual Bandit Engine — how Experiment Flow runs continual learning for personalisation.
- Personalization at Scale: Using Contextual Bandits for Dynamic Content — contextual bandits when the context vector is non-trivial.
If you are running a trading research loop, the production CRO playbook is worth borrowing from. If you are running an experimentation programme, the disciplines that keep a trading desk honest will keep your A/B testing honest too. The algorithms are the same; the rigour is the lesson.