PBO uses combinatorially-symmetric cross-validation to estimate the probability that your best in-sample strategy underperforms the median strategy out-of-sample. A PBO below 0.5 means the selection process beats chance. Introduced by Bailey, Borwein, López de Prado & Zhu (2014).
Backtest overfitting is the central failure mode of systematic research. Run enough parameter combinations and one will look great by luck alone. The standard defenses — out-of-sample holdouts, walk-forward — help but are themselves vulnerable to subtle abuse: re-run the OOS test enough times with slight variations and the OOS becomes IS by leakage. What is needed is a principled, population-level test that asks: given the full set of strategies I ran, is the one I picked as best actually better than the typical strategy out-of-sample?
That is exactly the question that the Probability of Backtest Overfitting (PBO) answers. Bailey, Borwein, López de Prado, and Zhu introduced it in their 2014 paper The Probability of Backtest Overfitting, later published in the Journal of Computational Finance (2017). The method is combinatorial, model-free, and rank-based: it does not assume normal returns, it does not require a champion-strategy tearsheet, and it is designed to be more robust to the time-series leakage that breaks naïve cross-validation.
A researcher runs 200 momentum strategies on a Hyperliquid universe. The champion posts a Sharpe of 2.5 and a 60% in-sample win rate. The runner-up is at 2.3 and the bottom decile is around 0.4. Looks like a clear signal. The researcher reserves the last 20% of history as out-of-sample, applies the champion's parameters, and gets a Sharpe of 1.4 — still positive, still publishable, still about to lose money in production.
The problem is selection. The 2.5 was the maximum of 200 noisy estimates; the OOS 1.4 is the same strategy's regression toward the mean. The 2.5/1.4 degradation looks like the strategy "didn't generalize quite as well as hoped." In fact, the champion may not have been distinguishable from the median strategy in the OOS sample — it just happened to land at rank 1 in-sample by luck. PBO measures exactly this: across many random IS/OOS splits, how often does the IS winner stay above the median in OOS? If the answer is "about half the time," the selection process had no signal — your champion was random.
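The selection effect is easy to reproduce. The sketch below is a toy simulation, not the researcher's actual setup: every strategy is pure Gaussian noise with a true Sharpe of zero, and the strategy count, sample lengths, and seed are illustrative assumptions. It picks the best in-sample Sharpe and then scores that same strategy out-of-sample.

```python
import random
import statistics


def sharpe(returns):
    """Per-period Sharpe ratio: mean divided by sample standard deviation."""
    sd = statistics.stdev(returns)
    return statistics.fmean(returns) / sd if sd > 0 else 0.0


def simulate_selection(n_strategies=200, n_is=400, n_oos=100, seed=7):
    """Select the best in-sample Sharpe among zero-skill strategies, then score it OOS."""
    rng = random.Random(seed)
    # Assumption: every strategy is independent noise, so no strategy has real edge.
    paths = [[rng.gauss(0.0, 0.01) for _ in range(n_is + n_oos)]
             for _ in range(n_strategies)]
    is_sharpes = [sharpe(p[:n_is]) for p in paths]
    oos_sharpes = [sharpe(p[n_is:]) for p in paths]
    champion = max(range(n_strategies), key=is_sharpes.__getitem__)
    return {
        "champion_is": is_sharpes[champion],
        "median_is": statistics.median(is_sharpes),
        "champion_oos": oos_sharpes[champion],
        "median_oos": statistics.median(oos_sharpes),
    }


result = simulate_selection()
for name, value in result.items():
    print(f"{name}: {value:+.3f}")
```

The in-sample champion sits well above the in-sample median even though nothing here has skill. Its out-of-sample Sharpe is just another draw from the same noise distribution, which is the pattern PBO is built to detect.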
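The engine is Combinatorially-Symmetric Cross-Validation (CSCV). The algorithm: arrange the returns as a matrix of T periods by N strategies and split its rows into S equal contiguous chunks, with S even. For each of the C(S, S/2) ways to choose half of the chunks as in-sample, with the remaining half as out-of-sample, rank all N strategies on both halves, take the in-sample winner, and record its relative out-of-sample rank ω (between 0 and 1) together with logit(ω) = ln(ω / (1 − ω)). PBO is the fraction of splits in which logit(ω) is at or below zero, that is, in which the in-sample winner lands at or below the out-of-sample median.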
Two properties make this work for time-series data. Symmetry: every observation appears in IS and OOS equally often across all splits, so the procedure is balanced. Rank-based: the test uses relative rank rather than absolute Sharpe, so a volatility-regime change between the IS and OOS halves that moves every strategy's Sharpe together does not change the result. A test on absolute Sharpe levels would conflate true overfitting with regime shifts.
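A minimal sketch of CSCV in plain Python follows, under a few stated assumptions: performance is measured by per-period Sharpe (other metrics could be substituted), ties are broken by counting strategies at or below the champion, trailing rows that do not fill a chunk are dropped, and the name `cscv_pbo` is hypothetical. It is written for readability, not speed; a vectorized version would be much faster at S = 16.

```python
import math
import random
import statistics
from itertools import combinations


def sharpe(returns):
    """Per-period Sharpe ratio (assumed performance metric for this sketch)."""
    sd = statistics.pstdev(returns)
    return statistics.fmean(returns) / sd if sd > 0 else 0.0


def cscv_pbo(returns, n_blocks=16):
    """returns: T rows, each a list of N strategy returns. Gives (pbo, logits)."""
    n_strats = len(returns[0])
    block_len = len(returns) // n_blocks
    if n_blocks % 2 or block_len < 2:
        raise ValueError("need an even n_blocks and at least 2 rows per block")
    # Equal contiguous blocks; any trailing remainder rows are dropped.
    blocks = [returns[b * block_len:(b + 1) * block_len] for b in range(n_blocks)]
    logits = []
    for is_ids in combinations(range(n_blocks), n_blocks // 2):
        chosen = set(is_ids)
        is_rows = [row for b in is_ids for row in blocks[b]]
        oos_rows = [row for b in range(n_blocks) if b not in chosen
                    for row in blocks[b]]
        is_perf = [sharpe([row[j] for row in is_rows]) for j in range(n_strats)]
        oos_perf = [sharpe([row[j] for row in oos_rows]) for j in range(n_strats)]
        best = max(range(n_strats), key=is_perf.__getitem__)
        # OOS rank of the IS winner: 1 = worst, N = best.
        rank = sum(1 for v in oos_perf if v <= oos_perf[best])
        omega = rank / (n_strats + 1)  # relative rank, strictly inside (0, 1)
        logits.append(math.log(omega / (1.0 - omega)))
    pbo = sum(1 for lam in logits if lam <= 0) / len(logits)
    return pbo, logits


# Usage on pure-noise strategies (illustrative sizes, fixed seed).
rng = random.Random(0)
noise = [[rng.gauss(0.0, 0.01) for _ in range(20)] for _ in range(160)]
noise_pbo, noise_logits = cscv_pbo(noise, n_blocks=8)
print(f"PBO on noise: {noise_pbo:.2f} over {len(noise_logits)} splits")
```

Enumerating every combination of S/2 chunks means each split's mirror image is also visited, which is what gives the C(S, S/2) count and the symmetry described above.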
PBO is a probability between 0 and 1.
Beyond the headline number, the full logit(ω) distribution is informative. A symmetric distribution centered at zero says the search is noise. A distribution skewed positive says the selection process has signal even where PBO does not cross a clean threshold. The PBO calculator visualizes the distribution alongside the headline probability.
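For reference, the quantities behind the distribution can be written out as follows. The notation is an assumption made here for illustration (the source does not fix symbols beyond ω): c indexes one IS/OOS split, N is the number of strategies, and r̄_c is the out-of-sample rank of the strategy that ranked first in-sample, with 1 the worst and N the best.

```latex
\bar{\omega}_c = \frac{\bar{r}_c}{N + 1},
\qquad
\lambda_c = \operatorname{logit}(\bar{\omega}_c) = \ln\frac{\bar{\omega}_c}{1 - \bar{\omega}_c},
\qquad
\mathrm{PBO} = \binom{S}{S/2}^{-1} \sum_{c} \mathbf{1}\{\lambda_c \le 0\}
```

A logit of zero corresponds to the in-sample winner landing exactly at the out-of-sample median, so the mass of the distribution at or below zero is the headline PBO.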
Three related diagnostics do three different jobs: parameter sensitivity, PBO, and the Deflated Sharpe Ratio (DSR).
For serious research, run all three. Parameter sensitivity first (cheap filter), then PBO if you have the data, then DSR on the champion as a final sanity check. They are complements, not substitutes.
Crypto research is unusually vulnerable to backtest overfitting for three reasons. Short histories: Hyperliquid has been live since 2023; many listed perps have only months of data. Small T means high variance in Sharpe estimates and a fatter null distribution under CSCV. Regime shifts: the 2024-2026 sample includes meme-coin euphoria, two major drawdowns, the Trump-era policy whiplash, and structural changes in market microstructure — strategies that worked in one regime routinely fail in the next. Survivorship: a "currently listed perp" universe systematically drops names that collapsed.
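To put a rough number on the short-history problem, the sketch below uses the textbook approximation SE(SR) ≈ sqrt((1 + SR²/2) / T) for a per-period Sharpe ratio. It assumes IID normal returns, daily observations, and a 365-day year; crypto returns are fat-tailed and autocorrelated, so this likely understates the true uncertainty. The function name is hypothetical.

```python
import math


def annualized_sharpe_se(annual_sharpe, n_obs, periods_per_year=365):
    """Approximate standard error of an annualized Sharpe estimate.

    Applies SE(SR) ~ sqrt((1 + SR^2 / 2) / T) to the per-period Sharpe
    (IID-normal assumption), then rescales by sqrt(periods_per_year).
    """
    sr = annual_sharpe / math.sqrt(periods_per_year)
    se = math.sqrt((1.0 + 0.5 * sr * sr) / n_obs)
    return se * math.sqrt(periods_per_year)


for months in (6, 16, 36):
    n_days = round(months * 365 / 12)
    se = annualized_sharpe_se(2.0, n_days)
    print(f"{months:>2} months of daily data: SE of annualized Sharpe ~ {se:.2f}")
```

Under these assumptions the standard error scales roughly as one over the square root of the number of years, so six months of data leaves an uncertainty of about 1.4 Sharpe units around any single estimate.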
PBO partly addresses regime shift through its combinatorial symmetry — every chunk pairs with every other — but it cannot recover lost data or eliminate survivorship bias. The honest setup for crypto PBO is to use a point-in-time universe (the perps that were listed at the start of the IS window) and chunk the data finely enough that each IS/OOS pair covers multiple regimes.
A practical setting for Hyperliquid 15-minute strategies: chunk size of one calendar month, S = 16 chunks (≈ 16 months), N = 50-200 strategies. Below 16 months of history, PBO is brittle and parameter sensitivity is the more reliable diagnostic. Above two years, S = 24 with monthly chunks gives a denser combinatorial estimate.
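One way to turn that setting into chunk boundaries is sketched below. CSCV works on equal-sized chunks, so this helper splits by row count rather than by calendar month, which only approximates monthly chunks; the helper name and the 487-day history length are illustrative assumptions.

```python
def split_into_blocks(n_rows, n_blocks):
    """Return (start, stop) row ranges for n_blocks equal contiguous blocks.

    Trailing rows that do not fill a whole block are dropped.
    """
    if n_blocks < 2 or n_blocks % 2:
        raise ValueError("n_blocks must be an even integer >= 2")
    block_len = n_rows // n_blocks
    if block_len == 0:
        raise ValueError("not enough rows for the requested number of blocks")
    return [(b * block_len, (b + 1) * block_len) for b in range(n_blocks)]


BARS_PER_DAY = 24 * 4             # 15-minute bars
n_rows = 487 * BARS_PER_DAY       # roughly 16 months of history (hypothetical)
blocks = split_into_blocks(n_rows, 16)
print(len(blocks), "blocks of", blocks[0][1] - blocks[0][0], "bars each")
```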
The PBO calculator accepts a CSV of strategy returns (rows = periods, columns = strategies) and runs the full CSCV procedure in the browser. Output: PBO probability, the full logit(ω) distribution, and the IS/OOS Sharpe scatter for the champion of each split.
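The exact file layout the calculator expects beyond "rows = periods, columns = strategies" is not specified here, so the sketch below assumes a hypothetical layout: a header row of strategy names followed by one row of per-period returns, with no date column. It shows how such a file could be parsed and validated before upload.

```python
import csv
import io


def load_returns_csv(handle):
    """Parse a returns CSV into (strategy_names, rows).

    Assumed layout: first line is a header of strategy names, each later
    line holds one period's return for every strategy.
    """
    reader = csv.reader(handle)
    names = next(reader)
    rows = []
    for line_no, record in enumerate(reader, start=2):
        if not record:
            continue  # skip blank lines
        if len(record) != len(names):
            raise ValueError(
                f"line {line_no}: expected {len(names)} values, got {len(record)}")
        rows.append([float(x) for x in record])
    return names, rows


# Hypothetical three-strategy, two-period sample.
sample = io.StringIO(
    "mom_fast,mom_slow,mom_mid\n"
    "0.0012,-0.0004,0.0003\n"
    "-0.0007,0.0009,0.0001\n"
)
names, rows = load_returns_csv(sample)
print(names, len(rows), "periods")
```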
Keel does not ship PBO as a built-in backtest diagnostic today. The platform's parameter-sensitivity tooling is the closest available defense. PBO may be added as a native diagnostic for multi-strategy grid runs in a future release, but there is no committed ship date.
PBO below 0.5 means your best in-sample strategy beats the median strategy more than half the time out-of-sample: better than chance, so the selection process has some signal. PBO at or above 0.5 means your IS champion is no better, and often worse, than the median OOS: the search is fitting noise. Bailey, Borwein, López de Prado, and Zhu argue 0.5 is the natural threshold because it is the random-selection baseline. Strong evidence of robust selection wants PBO below 0.3.
PBO needs at minimum about 10 strategies, ideally 50 or more. It is a population statistic: it needs enough variation in the strategy returns to estimate the overfitting probability. If you only ran two backtest variants, PBO is uninformative. A typical parameter sweep (say 50-200 grid combinations) is enough. The combinatorial cross-validation splits the return matrix into 16 chunks (S = 16 is the standard) and pairs each in-sample half with its complementary out-of-sample half, producing C(16, 8) = 12,870 IS/OOS splits from which the probability distribution is estimated.
Standard k-fold CV trains on k-1 folds and tests on 1. Combinatorially-symmetric CV (CSCV, the engine behind PBO) splits the data into S equal chunks and considers every way of partitioning those S chunks into two equal halves. Each half is in-sample for one trial and out-of-sample for its mirror. The symmetry matters because every observation appears equally often in IS and OOS partitions across the full combinatorial set, so the procedure does not privilege any time period. This removes the dependence on a single split point that plagues sequential train/test splits on time-series data.
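The symmetry claim can be checked directly on a small case. The sketch below enumerates every CSCV split for S = 4 chunks and counts how often each chunk lands in-sample versus out-of-sample; `cscv_splits` is a hypothetical helper name.

```python
from collections import Counter
from itertools import combinations


def cscv_splits(n_blocks):
    """List every (in_sample_ids, out_of_sample_ids) pair for n_blocks chunks."""
    ids = range(n_blocks)
    return [(is_ids, tuple(b for b in ids if b not in is_ids))
            for is_ids in combinations(ids, n_blocks // 2)]


splits = cscv_splits(4)
is_counts = Counter(b for is_ids, _ in splits for b in is_ids)
oos_counts = Counter(b for _, oos_ids in splits for b in oos_ids)

for is_ids, oos_ids in splits:
    print("IS", is_ids, "OOS", oos_ids)
print("IS appearances per chunk: ", dict(is_counts))
print("OOS appearances per chunk:", dict(oos_counts))
```

Every chunk appears in-sample and out-of-sample the same number of times, and each split has a mirror in which the two halves swap roles. A single sequential train/test split has neither property.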
PBO is cheap to compute. For S = 16, the procedure evaluates C(16, 8) = 12,870 IS/OOS pairings, but each one only requires computing and ranking each strategy's performance over a half-sample, which is trivial for a few hundred strategies. The dominant cost is having the strategy returns in the first place (you needed to have run them anyway). The PBO computation itself is sub-second for typical research setups. The combinatorial blow-up only matters at very large S; for S = 32 you have ~600M pairings, which starts to be slow.
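The split counts quoted above follow from the central binomial coefficient C(S, S/2) and can be reproduced with the standard library. The helper name `n_splits` is hypothetical.

```python
import math


def n_splits(n_blocks):
    """Number of IS/OOS splits CSCV evaluates for an even number of chunks."""
    if n_blocks < 2 or n_blocks % 2:
        raise ValueError("n_blocks must be an even integer >= 2")
    return math.comb(n_blocks, n_blocks // 2)


for s in (8, 16, 24, 32):
    print(f"S = {s:>2}: {n_splits(s):>11,} splits")
```

The count grows by a factor of a few hundred for every eight extra chunks, which is why S = 16 or S = 24 is comfortable while S = 32 is not.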
PBO and the Deflated Sharpe Ratio (DSR) are both corrections for backtest selection bias coauthored by Bailey and López de Prado, published the same year and aimed at the same problem from different angles. DSR adjusts a single Sharpe for the number of trials and the moments of the return distribution. PBO uses the full population of strategy returns to estimate the rank-stability of your best pick. DSR works when you only have the champion's tearsheet; PBO works when you have the whole grid. Reporting both is the rigorous default.