Probability of Backtest Overfitting (PBO) takes returns from N candidate strategies, splits the time axis into S disjoint subsets, and, across every way of dividing those subsets into an in-sample (IS) half and an out-of-sample (OOS) half, measures how often the in-sample winner falls below the median out-of-sample. Method: combinatorial symmetric cross-validation (Bailey, Borwein, López de Prado, Zhu, 2014), run entirely in your browser.
Matrix with rows = time periods, columns = strategy variants. First row treated as header if any token is non-numeric.
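The header rule above can be sketched in a few lines of Python. This is a minimal illustration, not the calculator's actual parser: the function name `parse_returns` is hypothetical, and it assumes comma- or whitespace-separated tokens with no spaces inside column names.

```python
def parse_returns(text):
    """Parse pasted text into (header, rows); rows = periods, columns = strategies."""
    def is_number(tok):
        try:
            float(tok)
            return True
        except ValueError:
            return False

    lines = [ln for ln in text.strip().splitlines() if ln.strip()]
    # Assumption: commas and any whitespace both act as delimiters.
    table = [ln.replace(",", " ").split() for ln in lines]
    header = None
    # First row is a header if any of its tokens is non-numeric.
    if table and any(not is_number(tok) for tok in table[0]):
        header, table = table[0], table[1:]
    rows = [[float(tok) for tok in row] for row in table]
    width = len(rows[0]) if rows else 0
    if any(len(row) != width for row in rows):
        raise ValueError("every row must have the same number of columns")
    return header, rows
```

Calling `parse_returns("a,b\n0.1,0.2\n-0.1,0.3")` returns the header `["a", "b"]` and two numeric rows; input with an all-numeric first row returns `None` for the header.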
Number of subsets S: even. Default 16. Max 32.
Adjust inputs and click Compute PBO. Runs entirely in your browser — no upload.
A single Sharpe number from your "best" strategy variant is uninformative if you searched over 200 candidates to find it. PBO answers the falsifiable question: across many ways of splitting the time axis into halves, how often does the variant that wins on the first half lose on the second? If the answer is roughly 50%, your selection process is no better than a coin flip — the IS winner's edge was sample-path luck. The construction is non-parametric and does not assume normality of returns.
partition T periods into S disjoint subsets
for each split of S into IS_half | OOS_half:
    score_IS = sharpe(strategy_n on IS_half) for n in 1..N
    n* = argmax(score_IS)
    r_OOS = rank of strategy_n* on OOS_half (1 = worst, N = best)
    rel = r_OOS / (N + 1)
    lambda = log(rel / (1 - rel))
PBO = fraction of splits with lambda < 0

Why combinatorial, not k-fold. Standard k-fold CV uses each fold as OOS exactly once. Symmetric CV uses every one of the C(S, S/2) ways to divide the S subsets into two halves, so each strategy is scored under many more IS/OOS configurations. With S = 16 you get 12,870 IS/OOS pairs instead of 16. The resulting PBO distribution is much tighter than what k-fold gives and is the construction proposed in the original Bailey, Borwein, López de Prado, Zhu paper.
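The pseudocode above can be turned into a small runnable sketch. It is an illustration under stated assumptions, not the calculator's own code: the names `sharpe` and `pbo` are hypothetical, the Sharpe ratio is a plain per-period mean over sample standard deviation, trailing rows that do not fill a whole subset are dropped, and ties in the OOS ranking take the lowest rank. It uses only the standard library, so it is slow at S = 16; a vectorised version would be needed to match in-browser speed.

```python
import itertools
import math
import statistics


def sharpe(returns):
    """Per-period Sharpe (mean / sample std); 0.0 when the std is zero."""
    sd = statistics.stdev(returns)
    return statistics.mean(returns) / sd if sd > 0 else 0.0


def pbo(returns, n_subsets=16):
    """returns: T rows, each a list of N per-period strategy returns."""
    n_periods, n_strats = len(returns), len(returns[0])
    size = n_periods // n_subsets  # assumption: remainder rows are dropped
    blocks = [returns[i * size:(i + 1) * size] for i in range(n_subsets)]
    logits = []
    for is_ids in itertools.combinations(range(n_subsets), n_subsets // 2):
        is_set = set(is_ids)
        is_rows = [row for b in is_ids for row in blocks[b]]
        oos_rows = [row for b in range(n_subsets) if b not in is_set
                    for row in blocks[b]]
        score_is = [sharpe([row[n] for row in is_rows]) for n in range(n_strats)]
        score_oos = [sharpe([row[n] for row in oos_rows]) for n in range(n_strats)]
        best = max(range(n_strats), key=score_is.__getitem__)  # IS winner
        # Ascending rank of the IS winner on OOS: 1 = worst, N = best.
        rank = 1 + sum(s < score_oos[best] for s in score_oos)
        rel = rank / (n_strats + 1)
        logits.append(math.log(rel / (1 - rel)))
    # PBO is the share of splits where the IS winner fell below the OOS median.
    return sum(lam < 0 for lam in logits) / len(logits), logits
```

As a sanity check, a strategy that dominates in every subset should give a PBO of zero: with three strategies whose means are 0.02, 0.0 and -0.02 around the same alternating noise, `pbo(rows, n_subsets=4)` evaluates all 6 splits and the first strategy is both IS-best and OOS-best in each.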
Keel context. Keel does not yet ship PBO as a built-in diagnostic alongside Sharpe and max drawdown — it is on the backtest-rigor roadmap, no committed ship date. Today, after running a parameter sweep in a workspace, export per-variant returns as a wide CSV and drop them here. Once native PBO ships in the metrics card you will get the same number directly from the optimize command. For the longer-form treatment of why combinatorial CV beats single-split OOS on short crypto samples, see the PBO methodology page.
Keel is a Strategy OS for AI-assisted systematic trading on Hyperliquid. Build, backtest, and run live strategies with realistic fees, slippage, and funding modeled. Free to start — connect a Hyperliquid wallet when you’re ready to go live.
Probability of Backtest Overfitting (PBO) measures the probability that the strategy variant with the best in-sample (IS) performance underperforms the median variant out-of-sample (OOS). High PBO means picking the best is little better than picking at random: the IS ranking does not carry over to OOS, which is the operational definition of selection bias from running many trials. Defined in Bailey, Borwein, López de Prado, Zhu (2014).
The T time periods are partitioned into S equal disjoint subsets. For each of the C(S, S/2) ways to split S subsets into an IS half and an OOS half, the calculator ranks all N strategies on IS, picks the IS-best, then records that same strategy's rank on OOS. The OOS rank is converted to a relative rank in (0,1) and then a logit lambda = log(r/(1-r)). PBO is the fraction of splits where lambda < 0, i.e. the IS-best fell below the OOS median.
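A short worked example of the rank-to-logit step may help. The helper name `oos_logit` is hypothetical; it assumes ranks run from 1 (worst OOS) to N (best OOS), which is the direction implied by "lambda < 0 means below the OOS median". Dividing by N + 1 rather than N keeps the relative rank strictly inside (0, 1), so the logit is always finite.

```python
import math


def oos_logit(oos_rank, n_strategies):
    """Logit of the relative OOS rank; rank 1 = worst, n_strategies = best."""
    rel = oos_rank / (n_strategies + 1)  # strictly between 0 and 1
    return math.log(rel / (1 - rel))


# With N = 20: rank 5 gives rel = 5/21 and odds 5/16, so lambda = log(0.3125) < 0.
below_median = oos_logit(5, 20)
# Rank 16 is the mirror image: rel = 16/21, odds 16/5, so lambda > 0.
above_median = oos_logit(16, 20)
```

The two values are equal in magnitude and opposite in sign, which reflects the symmetry of the logit around the median rank. With an odd N, the exact median rank (for example rank 11 of 21) gives lambda = 0 and is not counted as overfit.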
Minimum 6 variants, ideally 10+: PBO is only informative when the winner is selected from many candidates. Minimum 50 periods, so that each of the S subsets has enough samples to compute a stable Sharpe. The default S = 16 yields C(16, 8) = 12,870 splits, which runs in well under a second in-browser. For S > 20 the calculator samples 10,000 splits at random rather than enumerating all C(S, S/2); at S = 32 that would be about 601 million splits.
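The split counts quoted above are easy to verify, and random sampling of splits can be sketched as follows. The function names are hypothetical, and the sampling scheme (independent draws, so duplicates are possible) is an assumption; the source does not say how the calculator draws its random splits.

```python
import math
import random


def split_count(n_subsets):
    """Number of ways to choose the IS half: C(S, S/2)."""
    return math.comb(n_subsets, n_subsets // 2)


def sample_splits(n_subsets, n_samples=10_000, seed=0):
    """Draw random IS halves as sorted tuples of subset indices."""
    rng = random.Random(seed)
    half = n_subsets // 2
    return [tuple(sorted(rng.sample(range(n_subsets), half)))
            for _ in range(n_samples)]


print(split_count(16))  # 12870
print(split_count(32))  # 601080390
```

Each sampled tuple lists the subsets assigned to IS; the remaining indices form the OOS half.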
Lower is better. PBO < 0.1: the IS-best variant is robust — selection from this set is informative. 0.1–0.3: healthy, normal range for a well-designed candidate set with modest variant count. 0.3–0.5: concerning, the IS ranking is barely informative about OOS performance. > 0.5: severely overfit — picking the IS-best is worse than picking at random. The threshold to walk away depends on how many variants you tested, but anything above 0.5 means the selection process is destroying value.
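The bands above can be expressed as a small lookup. This is an illustrative sketch with a hypothetical function name; the text does not say which band the exact boundary values 0.1 and 0.3 belong to, so assigning them to the higher band is an assumption.

```python
def interpret_pbo(pbo):
    """Map a PBO value in [0, 1] to the bands described above."""
    if not 0.0 <= pbo <= 1.0:
        raise ValueError("PBO must lie in [0, 1]")
    if pbo < 0.1:
        return "robust"
    if pbo < 0.3:      # assumption: 0.1 itself counts as healthy
        return "healthy"
    if pbo <= 0.5:     # assumption: 0.3 itself counts as concerning
        return "concerning"
    return "severely overfit"  # strictly above 0.5
```

For example, `interpret_pbo(0.42)` returns `"concerning"`.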
PBO catches selection bias from running N > 1 trials — picking the lucky-looking winner from a basket. It does not catch look-ahead bias (using future data in an indicator), survivorship bias (testing only assets that exist today), regime non-stationarity (the underlying process changed between IS and OOS in ways resampling cannot detect), or transaction-cost realism. Use PBO alongside Deflated Sharpe Ratio (single-strategy multiple-testing correction) and walk-forward out-of-sample tests, not instead of them.
Long-form methodology: why single-split OOS is misleading on short crypto samples, how combinatorial CV fixes it, and how to read PBO in production.
Sibling rigor tool. DSR corrects a single Sharpe for the N-trials multiple-testing problem. PBO and DSR answer different sides of the same selection-bias question — use both.
Composite scorer combining PBO, DSR, and parameter-stability into a single Overfit Score. Paste a tearsheet, get a single number with the breakdown.
Limits of PBO. The metric captures selection bias from running many parameter trials and picking the winner. It does not catch look-ahead bias (an indicator secretly using future data), survivorship bias (an asset universe curated post-hoc), regime non-stationarity (training in a high-vol regime and trading in a low-vol one), or unrealistic execution assumptions. Pair PBO with explicit walk-forward, fee/slippage modeling, and a real OOS holdout — not as a substitute for any of those. Reference: The Probability of Backtest Overfitting (Bailey, Borwein, López de Prado, Zhu, 2014).