Backtest Rigor

PBO Calculator

Probability of Backtest Overfitting (PBO) takes returns from N candidate strategies, splits the time axis into S disjoint subsets, and, across every way of dividing those subsets into an in-sample (IS) half and an out-of-sample (OOS) half, measures how often the IS winner falls below the OOS median. The method is combinatorially symmetric cross-validation (CSCV) from Bailey, Borwein, López de Prado, and Zhu (2014), run entirely in your browser.

Combinatorial CV · in-browser · no upload
By Keel Research Team · Updated May 17, 2026
Inputs

Paste a matrix of per-period returns with rows = time periods and columns = strategy variants. The first row is treated as a header if any of its tokens is non-numeric.
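The header rule above can be sketched in a few lines of Python. This is an illustration, not the calculator's own parser: the function name parse_returns_matrix, the comma delimiter, and the fallback column names are assumptions.

```python
def parse_returns_matrix(text):
    """Parse wide CSV text into (headers, rows).

    rows[t][n] is the return of strategy n in period t.
    Assumption: comma-delimited input; blank lines are skipped.
    """
    lines = [ln for ln in text.strip().splitlines() if ln.strip()]
    table = [[tok.strip() for tok in ln.split(",")] for ln in lines]

    def is_number(tok):
        try:
            float(tok)
            return True
        except ValueError:
            return False

    # First row is a header if any token fails to parse as a number.
    if any(not is_number(tok) for tok in table[0]):
        headers, body = table[0], table[1:]
    else:
        # Hypothetical default names when no header row is present.
        headers = [f"strategy_{i + 1}" for i in range(len(table[0]))]
        body = table

    rows = [[float(tok) for tok in row] for row in body]
    if any(len(row) != len(headers) for row in rows):
        raise ValueError("every row must have one value per strategy column")
    return headers, rows
```

Calling parse_returns_matrix on the text of an exported sweep returns the column names and a T × N list of floats that the later sketches on this page accept as input.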

Number of subsets S: must be even. Default 16, maximum 32.

How it works

Methodology

A single Sharpe number from your "best" strategy variant is uninformative if you searched over 200 candidates to find it. PBO answers a falsifiable question: across many ways of splitting the time axis into two halves, how often does the variant that wins on one half finish below the median on the other? If the answer is roughly 50%, your selection process is no better than a coin flip: the IS winner's edge was sample-path luck. The construction is non-parametric and does not assume normality of returns.
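To see why the best of many candidates is uninformative on its own, a small simulation helps. The sketch below draws 200 strategies of pure Gaussian noise (true Sharpe exactly zero) and reports the best and median per-period Sharpe. The function name, the 1% per-period volatility, and the seed are illustrative assumptions.

```python
import random
import statistics


def best_noise_sharpe(n_strategies=200, n_periods=250, seed=0):
    """Return (best, median) per-period Sharpe across pure-noise strategies."""
    rng = random.Random(seed)
    sharpes = []
    for _ in range(n_strategies):
        # Zero-mean returns: any nonzero Sharpe here is sample-path luck.
        r = [rng.gauss(0.0, 0.01) for _ in range(n_periods)]
        sharpes.append(statistics.mean(r) / statistics.stdev(r))
    return max(sharpes), statistics.median(sharpes)


best, median = best_noise_sharpe()
print(f"best per-period Sharpe:   {best:.3f}")
print(f"median per-period Sharpe: {median:.3f}")
```

The median sits near zero, as it should, while the maximum looks like a real edge. That gap is the selection bias PBO is designed to expose.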

partition T periods into S disjoint subsets
for each choice of S/2 subsets as IS_half (the remaining S/2 form OOS_half):
  score_IS[n]  = sharpe(strategy_n on IS_half)   for n in 1..N
  score_OOS[n] = sharpe(strategy_n on OOS_half)  for n in 1..N
  n*           = argmax(score_IS)
  r_OOS        = ascending rank of score_OOS[n*] among score_OOS  (1 = worst, N = best)
  rel          = r_OOS / (N + 1)
  lambda       = log(rel / (1 - rel))
PBO = fraction of splits with lambda < 0
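The pseudocode translates fairly directly into Python. The following is a minimal stdlib-only sketch, not the calculator's actual source. The names pbo_cscv and sharpe_from_sums are hypothetical. It assumes a per-period, non-annualized Sharpe, drops trailing rows when T is not divisible by S, and breaks ties by counting strictly lower OOS scores. Per-subset sums are cached so each split only adds up S/2 numbers per strategy.

```python
import itertools
import math


def sharpe_from_sums(n, s1, s2):
    """Per-period Sharpe (mean / sample std) from count, sum, sum of squares."""
    mean = s1 / n
    var = (s2 - n * mean * mean) / (n - 1)
    return mean / math.sqrt(var) if var > 0 else 0.0


def pbo_cscv(returns, n_subsets=16):
    """returns: T rows of N per-period returns. Gives (pbo, logits)."""
    if n_subsets % 2 or n_subsets < 2:
        raise ValueError("n_subsets must be a positive even number")
    n_strats = len(returns[0])
    size = len(returns) // n_subsets  # assumption: remainder rows are dropped
    if size < 1:
        raise ValueError("need at least one period per subset")

    # Cache the sum and sum of squares of each strategy within each subset.
    blocks = []
    for b in range(n_subsets):
        rows = returns[b * size:(b + 1) * size]
        sums = [sum(r[j] for r in rows) for j in range(n_strats)]
        sqs = [sum(r[j] ** 2 for r in rows) for j in range(n_strats)]
        blocks.append((sums, sqs))

    def scores(subset_ids):
        n = size * len(subset_ids)
        return [
            sharpe_from_sums(
                n,
                sum(blocks[b][0][j] for b in subset_ids),
                sum(blocks[b][1][j] for b in subset_ids),
            )
            for j in range(n_strats)
        ]

    logits = []
    all_ids = set(range(n_subsets))
    for is_ids in itertools.combinations(range(n_subsets), n_subsets // 2):
        oos_ids = sorted(all_ids - set(is_ids))
        is_scores, oos_scores = scores(is_ids), scores(oos_ids)
        best = max(range(n_strats), key=is_scores.__getitem__)
        # Ascending rank: 1 = worst OOS, N = best OOS.
        rank = 1 + sum(1 for s in oos_scores if s < oos_scores[best])
        rel = rank / (n_strats + 1)
        logits.append(math.log(rel / (1 - rel)))

    pbo = sum(1 for x in logits if x < 0) / len(logits)
    return pbo, logits
```

For large S the loop dominates the runtime, so this sketch is best tried first with a small n_subsets such as 6 or 8.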

Why combinatorial, not k-fold. Standard k-fold CV uses each fold as OOS exactly once. Symmetric CV uses all C(S, S/2) ways of choosing S/2 of the S subsets as IS, so each strategy is scored under many more IS/OOS configurations, and the IS and OOS halves always have the same length. With S = 16 you get 12,870 IS/OOS pairs instead of 16. The resulting distribution of logits, and with it the PBO estimate, is much less noisy than what k-fold gives. This is the construction proposed in the original Bailey, Borwein, López de Prado, Zhu paper.
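The difference in split counts is easy to check. The sketch below enumerates both schemes over subset indices. The generator names cscv_splits and kfold_splits are illustrative.

```python
import itertools
import math


def cscv_splits(n_subsets):
    """Yield (is_ids, oos_ids) for every way to pick S/2 subsets as IS."""
    ids = range(n_subsets)
    for is_ids in itertools.combinations(ids, n_subsets // 2):
        oos_ids = tuple(i for i in ids if i not in is_ids)
        yield is_ids, oos_ids


def kfold_splits(n_subsets):
    """Yield (is_ids, oos_ids) where each subset is OOS exactly once."""
    ids = range(n_subsets)
    for k in ids:
        yield tuple(i for i in ids if i != k), (k,)


print(math.comb(16, 8))                   # closed-form count of CSCV splits
print(sum(1 for _ in cscv_splits(16)))    # same count by enumeration
print(sum(1 for _ in kfold_splits(16)))   # k-fold count
```

Note that every CSCV split also appears mirrored, with IS and OOS swapped, which is where the "symmetric" in the name comes from.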

Keel context. Keel does not yet ship PBO as a built-in diagnostic alongside Sharpe and max drawdown — it is on the backtest-rigor roadmap, no committed ship date. Today, after running a parameter sweep in a workspace, export per-variant returns as a wide CSV and drop them here. Once native PBO ships in the metrics card you will get the same number directly from the optimize command. For the longer-form treatment of why combinatorial CV beats single-split OOS on short crypto samples, see the PBO methodology page.

FAQ

Calculator questions

What does PBO measure?

Probability of Backtest Overfitting (PBO) measures the probability that the strategy variant with the best in-sample (IS) performance underperforms the median variant out-of-sample (OOS). A PBO near 0.5 means picking the IS-best is no better than picking at random: the IS ranking does not carry over to OOS, which is the operational definition of selection bias from running many trials. The measure is defined in Bailey, Borwein, López de Prado, Zhu (2014).

How does the combinatorial split work?

The T time periods are partitioned into S equal disjoint subsets. For each of the C(S, S/2) ways to split the S subsets into an IS half and an OOS half, the calculator ranks all N strategies on IS, picks the IS-best, then records that same strategy's rank on OOS (1 = worst, N = best). The OOS rank is converted to a relative rank r = rank / (N + 1) in (0, 1) and then to a logit lambda = log(r / (1 - r)). PBO is the fraction of splits where lambda < 0, i.e. the IS-best fell below the OOS median.
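As a worked example of the rank-to-logit step for a single split, the sketch below uses a hypothetical helper oos_logit, with rank 1 taken as the worst OOS performer.

```python
import math


def oos_logit(oos_rank, n_strategies):
    """Logit of the relative OOS rank; oos_rank runs from 1 (worst) to N (best)."""
    rel = oos_rank / (n_strategies + 1)
    return math.log(rel / (1 - rel))


# IS-best variant ranks 5th from the bottom out of 20 on OOS: below the median.
print(round(oos_logit(5, 20), 3))
# The same variant ranking 16th from the bottom: above the median.
print(round(oos_logit(16, 20), 3))
```

With N = 20, an OOS rank of 5 gives r = 5/21 ≈ 0.238 and lambda = log(5/16) ≈ -1.163, so that split counts toward PBO. A rank of 16 gives the mirror value, about +1.163, and does not.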

How many strategy variants and periods do I need?

Minimum 6 variants, ideally 10+: PBO is only informative when the winner is selected from many candidates. Minimum 50 periods, so that the IS and OOS halves built from the S subsets each have enough samples to compute a stable Sharpe. The default S = 16 yields C(16, 8) = 12,870 splits, which runs in well under a second in-browser. For S > 20 the calculator samples 10K splits at random rather than enumerating all C(S, S/2); at S = 32 full enumeration would be about 601M splits.
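One way to implement the enumerate-or-sample rule is sketched below. The S > 20 cutoff and the 10K cap come from the text above. The function name choose_splits, the seed, and sampling without duplicates are assumptions, since the page does not say how the calculator draws its random splits.

```python
import itertools
import math
import random


def choose_splits(n_subsets, max_sampled=10_000, enumerate_up_to=20, seed=0):
    """Return a list of IS index tuples: all of them for small S, a sample otherwise."""
    half = n_subsets // 2
    if n_subsets <= enumerate_up_to:
        return list(itertools.combinations(range(n_subsets), half))

    # Assumption: draw distinct splits uniformly at random until the cap is hit.
    rng = random.Random(seed)
    seen = set()
    while len(seen) < max_sampled:
        seen.add(tuple(sorted(rng.sample(range(n_subsets), half))))
    return sorted(seen)


print(math.comb(32, 16))            # size of the full enumeration at S = 32
print(len(choose_splits(16)))       # enumerated
print(len(choose_splits(32)))       # sampled
```

Each returned tuple is the IS half. The OOS half is the set of remaining subset indices.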

How do I interpret the PBO score?

Lower is better. PBO < 0.1: the IS-best variant is robust — selection from this set is informative. 0.1–0.3: healthy, normal range for a well-designed candidate set with modest variant count. 0.3–0.5: concerning, the IS ranking is barely informative about OOS performance. > 0.5: severely overfit — picking the IS-best is worse than picking at random. The threshold to walk away depends on how many variants you tested, but anything above 0.5 means the selection process is destroying value.
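The bands above can be written as a small lookup. The function name interpret_pbo and the short labels are illustrative. Because the text leaves the boundary values open, the sketch assumes 0.1 and 0.3 fall into the higher band and exactly 0.5 counts as "concerning".

```python
def interpret_pbo(pbo):
    """Map a PBO value in [0, 1] to the interpretation band described above."""
    if not 0.0 <= pbo <= 1.0:
        raise ValueError("PBO must lie in [0, 1]")
    if pbo < 0.1:
        return "robust"
    if pbo < 0.3:
        return "healthy"
    if pbo <= 0.5:  # assumption: exactly 0.5 is still 'concerning'
        return "concerning"
    return "severely overfit"


for value in (0.05, 0.2, 0.4, 0.7):
    print(value, interpret_pbo(value))
```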

What does PBO not catch?

PBO catches selection bias from running N > 1 trials — picking the lucky-looking winner from a basket. It does not catch look-ahead bias (using future data in an indicator), survivorship bias (testing only assets that exist today), regime non-stationarity (the underlying process changed between IS and OOS in ways resampling cannot detect), or transaction-cost realism. Use PBO alongside Deflated Sharpe Ratio (single-strategy multiple-testing correction) and walk-forward out-of-sample tests, not instead of them.

Limits of PBO. The metric captures selection bias from running many parameter trials and picking the winner. It does not catch look-ahead bias (an indicator secretly using future data), survivorship bias (an asset universe curated post-hoc), regime non-stationarity (training in a high-vol regime and trading in a low-vol one), or unrealistic execution assumptions. Pair PBO with explicit walk-forward testing, fee and slippage modeling, and a real OOS holdout; it complements those checks and does not replace any of them. Reference: The Probability of Backtest Overfitting (Bailey, Borwein, López de Prado, Zhu, 2014).