Probability of Backtest Overfitting (PBO)

PBO uses combinatorially-symmetric cross-validation to estimate the probability that your best in-sample strategy underperforms the median strategy out-of-sample. A PBO below 0.5 means the selection process beats chance. Introduced by Bailey, Borwein, López de Prado & Zhu (2014).

By Keel Research Team · Updated May 17, 2026

Backtest overfitting is the central failure mode of systematic research. Run enough parameter combinations and one will look great by luck alone. The standard defenses — out-of-sample holdouts, walk-forward — help but are themselves vulnerable to subtle abuse: re-run the OOS test enough times with slight variations and the OOS becomes IS by leakage. What is needed is a principled, population-level test that asks: given the full set of strategies I ran, is the one I picked as best actually better than the typical strategy out-of-sample?

That is the question the Probability of Backtest Overfitting (PBO) answers. Bailey, Borwein, López de Prado, and Zhu introduced it in their 2014 working paper The Probability of Backtest Overfitting, later published under the same title in the Journal of Computational Finance (2017). The method is combinatorial, model-free, and rank-based: it does not assume normal returns, it does not require a champion-strategy tearsheet, and it is designed to be more robust to the time-series leakage that breaks naïve cross-validation.

Why backtest overfitting is hard to detect by eye

A researcher runs 200 momentum strategies on a Hyperliquid universe. The champion posts a Sharpe of 2.5 and a 60% in-sample win rate. The runner-up is at 2.3 and the bottom decile is around 0.4. Looks like a clear signal. The researcher reserves the last 20% of history as out-of-sample, applies the champion's parameters, and gets a Sharpe of 1.4 — still positive, still publishable, still about to lose money in production.

The problem is selection. The 2.5 was the maximum of 200 noisy estimates; the OOS 1.4 is the same strategy regressing toward the mean. The drop from 2.5 to 1.4 looks like the strategy "didn't generalize quite as well as hoped." In fact, the champion may not have been distinguishable from the median strategy in the OOS sample; it may simply have landed at rank 1 in-sample by luck. PBO measures exactly this: across every combinatorial IS/OOS split of the history, how often does the IS winner stay above the median in OOS? If the answer is "about half the time," the selection process had no signal and the champion was effectively a random pick.
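
The selection effect is easy to reproduce with pure noise. The sketch below is an illustration, not part of the original analysis: it assumes 200 strategies with zero true edge, two years of daily returns (504 periods), Gaussian noise with 1% daily volatility, and a 252-period annualization factor. All of these are illustrative choices.

```python
import random
import statistics

def sharpe(returns, periods_per_year=252):
    """Annualized Sharpe ratio of per-period returns (risk-free rate assumed zero)."""
    sd = statistics.stdev(returns)
    return 0.0 if sd == 0 else statistics.mean(returns) / sd * periods_per_year ** 0.5

def best_of_n_noise(n_strategies=200, n_periods=504, seed=7):
    """Simulate zero-edge strategies; return (max, median) of their Sharpe ratios."""
    rng = random.Random(seed)
    sharpes = [
        sharpe([rng.gauss(0.0, 0.01) for _ in range(n_periods)])
        for _ in range(n_strategies)
    ]
    return max(sharpes), statistics.median(sharpes)

best, median = best_of_n_noise()
print(f"best of 200 zero-edge strategies: {best:.2f}, median: {median:.2f}")
```

Under these assumptions the best of the 200 Sharpe ratios typically lands well above 1 even though every strategy is noise, which is the gap PBO is designed to expose.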

The PBO method — combinatorial split, rank-based comparison

The engine is Combinatorially-Symmetric Cross-Validation (CSCV). The algorithm:

  1. Take your T × N matrix of strategy returns: T time periods (rows) by N strategies (columns).
  2. Split the T rows into S equal-length chunks. S = 16 is the standard.
  3. Enumerate every way to split S chunks into two equal halves. For S = 16, there are C(16, 8) = 12,870 such splits.
  4. For each split, designate one half "in-sample" and the other "out-of-sample." Compute each strategy's IS Sharpe and OOS Sharpe.
  5. Identify the strategy with the highest IS Sharpe — the IS champion for this split.
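  6. Find that champion's rank in the OOS Sharpe distribution, ranking ascending so that 1 is the worst and N is the best. Compute its relative rank: ω = rank_OOS / (N + 1). A relative rank near 1 means the champion was best OOS too; 0.5 means it was at the median; near 0 means it was the worst OOS. Dividing by N + 1 keeps ω strictly between 0 and 1, so the logit in the next step is always finite.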
  7. Transform: logit(ω) = log(ω / (1 − ω)). Negative logit means the OOS rank was below median (overfit).
  8. Repeat for all 12,870 splits. PBO is the fraction of splits where logit(ω) < 0 — i.e. where the IS champion ended up below the OOS median.
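
The eight steps can be sketched in plain Python as follows. This is a minimal illustration rather than a reference implementation: the function name cscv_pbo, the list-of-rows input layout, the per-period (non-annualized) Sharpe, the tie-breaking rule, and the decision to drop trailing rows that do not fill a chunk are all assumptions of this sketch.

```python
import itertools
import math
import random

def sharpe(xs):
    """Per-period Sharpe ratio (mean / sample std); 0.0 if the series is flat."""
    n = len(xs)
    mean = math.fsum(xs) / n
    var = math.fsum((x - mean) ** 2 for x in xs) / (n - 1)
    return mean / math.sqrt(var) if var > 0 else 0.0

def cscv_pbo(returns, n_chunks=16):
    """Estimate PBO from a T x N matrix given as a list of rows.

    Returns (pbo, logits). Ties get the lowest tied rank (a simplification).
    """
    n_strategies = len(returns[0])
    chunk_len = len(returns) // n_chunks      # trailing rows that do not fit are dropped
    chunks = [returns[i * chunk_len:(i + 1) * chunk_len] for i in range(n_chunks)]
    all_ids = set(range(n_chunks))
    logits = []
    for is_ids in itertools.combinations(range(n_chunks), n_chunks // 2):
        oos_ids = sorted(all_ids - set(is_ids))
        is_rows = [row for c in is_ids for row in chunks[c]]
        oos_rows = [row for c in oos_ids for row in chunks[c]]
        is_sr = [sharpe([row[j] for row in is_rows]) for j in range(n_strategies)]
        oos_sr = [sharpe([row[j] for row in oos_rows]) for j in range(n_strategies)]
        champion = max(range(n_strategies), key=is_sr.__getitem__)
        # Ascending rank: 1 = worst OOS Sharpe, N = best.
        rank = 1 + sum(1 for s in oos_sr if s < oos_sr[champion])
        omega = rank / (n_strategies + 1)
        logits.append(math.log(omega / (1 - omega)))
    pbo = sum(1 for x in logits if x < 0) / len(logits)
    return pbo, logits

# Demo on assumed pure-noise data: 20 zero-edge strategies, 240 periods, S = 8.
rng = random.Random(0)
noise = [[rng.gauss(0.0, 0.01) for _ in range(20)] for _ in range(240)]
pbo, logits = cscv_pbo(noise, n_chunks=8)
print(f"PBO on noise: {pbo:.2f} over {len(logits)} splits")
```

The demo uses S = 8 (70 splits) only to keep the run short. Rebuilding the half-samples row by row for every split is the simplest thing to write, not the fastest; it is adequate for a few hundred strategies at S = 16.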

Two properties make this work for time-series data. Symmetry: every observation appears in IS and OOS equally often across all splits, so the procedure is balanced. Rank-based: the test uses relative rank rather than absolute Sharpe, so a volatility regime change between the IS and OOS halves that affects all strategies alike leaves the result unchanged. A test on absolute Sharpe would conflate true overfitting with regime shifts.
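
The symmetry property can be checked directly by counting. The short sketch below (the helper name in_sample_counts is hypothetical) enumerates every equal-halves split for S = 16 and counts how often each chunk lands in the in-sample half.

```python
import itertools
from collections import Counter

def in_sample_counts(n_chunks):
    """Count how many equal-halves splits place each chunk in the in-sample half."""
    counts = Counter()
    n_splits = 0
    for is_ids in itertools.combinations(range(n_chunks), n_chunks // 2):
        counts.update(is_ids)
        n_splits += 1
    return n_splits, counts

n_splits, counts = in_sample_counts(16)
print(n_splits, set(counts.values()))  # prints: 12870 {6435}
```

Every chunk is in-sample in exactly 6,435 of the 12,870 splits and out-of-sample in the other 6,435, so no period is favored.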

Interpreting a PBO score

PBO is a probability between 0 and 1.

  • PBO ≈ 0.5 — random. Picking the IS best is no better than picking the IS median. The search process is fitting noise.
  • PBO < 0.5 — better than chance. The IS champion beats the OOS median more than half the time. Mild PBO (~0.3-0.5) is the typical outcome for honest research with disciplined feature engineering.
  • PBO < 0.1 — strong evidence of real signal. The IS ranking is highly informative for OOS ranking. Rare; typically requires either a very strong economic prior or a small parameter grid.
  • PBO > 0.5 — actively perverse. The IS champion underperforms the OOS median more than half the time. Common after extensive search on a sparse signal — the optimizer is reliably picking the worst future strategies. Walk away.

Beyond the headline number, the full logit(ω) distribution is informative. A symmetric distribution centered at zero says the search is noise. A distribution skewed positive says the selection process has signal even where PBO does not cross a clean threshold. The PBO calculator visualizes the distribution alongside the headline probability.
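
A rough way to look at the distribution without a plotting library is to summarize it and bin it. The sketch below is illustrative only: the function name summarize_logits, the bin range of -4 to 4, and the choice of eight bins are assumptions, not how the calculator does it.

```python
import statistics

def summarize_logits(logits, lo=-4.0, hi=4.0, n_bins=8):
    """Return PBO, median and mean logit, plus histogram counts over [lo, hi]."""
    counts = [0] * n_bins
    for x in logits:
        # Values outside [lo, hi] are clamped into the first or last bin.
        idx = int((x - lo) / (hi - lo) * n_bins)
        counts[min(n_bins - 1, max(0, idx))] += 1
    return {
        "pbo": sum(1 for x in logits if x < 0) / len(logits),
        "median_logit": statistics.median(logits),
        "mean_logit": statistics.fmean(logits),
        "histogram": counts,
    }

summary = summarize_logits([-1.0, 0.5, 1.0, 2.0])
print(summary)
```

A median logit near zero with mass spread evenly on both sides points to noise; a median clearly above zero points to a selection process with some signal.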

PBO vs DSR vs parameter sensitivity — when to use each

Three related diagnostics, three different jobs:

  • Parameter sensitivity — the cheapest. Vary each parameter ±1 step and check that Sharpe degrades smoothly. Catches narrow-spike overfits with a single afternoon of work. Pre-flight for PBO/DSR, not a replacement.
  • DSR (Deflated Sharpe Ratio) — best when you have the champion's tearsheet but not the full grid. Adjusts a single Sharpe for trial count and distribution shape. One number in, one probability out.
  • PBO — best when you have the full T × N matrix of strategy returns. Uses the full distribution, not just the champion, so it draws on more information than a single-Sharpe test. The downside is that it requires you to have actually saved all the per-strategy return series.

For serious research, run all three. Parameter sensitivity first (cheap filter), then PBO if you have the data, then DSR on the champion as a final sanity check. They are complements, not substitutes.
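
As a minimal sketch of the parameter-sensitivity pre-flight, assuming a one-dimensional grid and a positive champion Sharpe, the helper below (a hypothetical name, with made-up Sharpe values) reports the worst relative Sharpe drop one grid step away from the champion.

```python
def neighbor_sensitivity(sharpe_by_param, champion):
    """Worst relative Sharpe drop when moving one grid step from the champion.

    sharpe_by_param maps a 1-D parameter value to its backtest Sharpe.
    Assumes the champion's Sharpe is positive.
    """
    grid = sorted(sharpe_by_param)
    i = grid.index(champion)
    neighbors = [grid[k] for k in (i - 1, i + 1) if 0 <= k < len(grid)]
    base = sharpe_by_param[champion]
    return max((base - sharpe_by_param[p]) / base for p in neighbors)

# Hypothetical lookback -> Sharpe grids.
spiky = {10: 0.9, 20: 1.1, 30: 2.5, 40: 1.0, 50: 0.8}
smooth = {10: 1.8, 20: 2.2, 30: 2.5, 40: 2.3, 50: 2.0}
print(neighbor_sensitivity(spiky, 30), neighbor_sensitivity(smooth, 30))
```

In the spiky grid the champion loses about 60% of its Sharpe one step away; in the smooth grid it loses about 12%. What counts as an acceptable drop is a judgment call, not something this sketch decides.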

Crypto-specific notes

Crypto research is unusually vulnerable to backtest overfitting for three reasons. Short histories: Hyperliquid has been live since 2023; many listed perps have only months of data. Small T means high variance in Sharpe estimates and a fatter null distribution under CSCV. Regime shifts: the 2024-2026 sample includes meme-coin euphoria, two major drawdowns, the Trump-era policy whiplash, and structural changes in market microstructure — strategies that worked in one regime routinely fail in the next. Survivorship: a "currently listed perp" universe systematically drops names that collapsed.

PBO partly addresses regime shift through its combinatorial symmetry — every chunk pairs with every other — but it cannot recover lost data or eliminate survivorship bias. The honest setup for crypto PBO is to use a point-in-time universe (the perps that were listed at the start of the IS window) and chunk the data finely enough that each IS/OOS pair covers multiple regimes.

A practical setting for Hyperliquid 15-minute strategies: chunks of roughly one calendar month, trimmed so that every chunk holds the same number of bars, with S = 16 chunks (≈ 16 months) and N = 50-200 strategies. Below 16 months of history, PBO is brittle and parameter sensitivity is the more reliable diagnostic. Above two years, S = 24 with monthly chunks gives a denser combinatorial estimate: C(24, 12) = 2,704,156 splits.
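
Because CSCV needs equal-length chunks, the row count usually has to be trimmed before splitting. The sketch below shows one way to do that; the function name chunk_bounds and the choice to drop the oldest rows (keeping the most recent data) are assumptions of this example.

```python
def chunk_bounds(n_rows, n_chunks):
    """Return (start, end) row indices for n_chunks equal-length chunks.

    Rows that do not fit are dropped from the start of the history,
    so the most recent bar is always included.
    """
    chunk_len = n_rows // n_chunks
    if chunk_len < 2:
        raise ValueError("not enough rows for a Sharpe estimate in every chunk")
    offset = n_rows - chunk_len * n_chunks
    return [(offset + i * chunk_len, offset + (i + 1) * chunk_len)
            for i in range(n_chunks)]

# Hypothetical history of 46,700 fifteen-minute bars (about 16 months).
bounds = chunk_bounds(46_700, 16)
print(bounds[0], bounds[-1])
```

With 96 fifteen-minute bars per day, each of these chunks holds 2,918 bars, a little over 30 days.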

Try the calculator

The PBO calculator accepts a CSV of strategy returns (rows = periods, columns = strategies) and runs the full CSCV procedure in the browser. Output: PBO probability, the full logit(ω) distribution, and the IS/OOS Sharpe scatter for the champion of each split.

Keel does not ship PBO as a built-in backtest diagnostic today. The platform's parameter-sensitivity tooling is the closest available defense. PBO may be added as a native diagnostic for multi-strategy grid runs in a future release, but there is no committed ship date.

This article is educational. PBO measures rank stability between in-sample and out-of-sample halves; it cannot detect look-ahead bias, survivorship bias, or unrealistic cost assumptions. Reference: Bailey, D. H., Borwein, J. M., López de Prado, M., & Zhu, Q. J. (2017). The Probability of Backtest Overfitting. Journal of Computational Finance. First circulated as a working paper in 2014.
FAQ

PBO — questions

What is a 'good' PBO score?

PBO below 0.5 means your best in-sample strategy beats the median strategy more than half the time out-of-sample: better than chance, so the selection process has some signal. PBO above 0.5 means your IS champion is no better, and often worse, than the median OOS, so the search is fitting noise. In the framework of Bailey, Borwein, López de Prado, and Zhu, 0.5 is the natural reference point because it is the random-selection baseline. Strong evidence of robust selection wants PBO well below that: values between 0.3 and 0.5 are only mildly better than chance, and below about 0.1 the in-sample ranking is highly informative.

How many strategies do I need to compute PBO?

At minimum about 10 strategies, ideally 50 or more. PBO is a population statistic — it needs enough variation in the strategy returns to estimate the overfitting probability. If you only ran two backtest variants, PBO is uninformative. A typical parameter sweep (say 50-200 grid combinations) is enough. The combinatorial cross-validation splits the return matrix into 16 chunks (S=16 is the standard) and pairs each in-sample half with the corresponding out-of-sample half, producing C(16, 8) = 12,870 paired observations to estimate the probability distribution.

How is PBO different from a standard cross-validation?

Standard k-fold CV trains on k-1 folds and tests on 1. Combinatorially-symmetric CV (CSCV, the engine behind PBO) splits the data into S equal chunks and considers every way of partitioning the S chunks into two equal halves. Each half is in-sample for one trial and out-of-sample for its mirror. The symmetry matters because every observation appears equally in IS and OOS partitions across the full combinatorial set, so the procedure does not privilege any time period. This avoids the dependence on one particular split path that plagues sequential train/test splits on time-series data.

How computationally expensive is PBO?

Cheap. For S = 16, the procedure evaluates C(16, 8) = 12,870 IS/OOS pairings, and each one only requires computing every strategy's Sharpe over a half-sample and ranking the results, which is trivial for a few hundred strategies. The dominant cost is having the strategy returns in the first place (you needed to have run them anyway). The PBO computation itself typically takes seconds or less for typical research setups, particularly when per-chunk sums and sums of squares are computed once and reused across splits. The combinatorial blow-up only matters at very large S; for S = 32 you have ~600M pairings, which starts to be slow.
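
The growth in the number of splits is a one-liner to check. The helper name n_splits below is hypothetical; math.comb is the standard-library binomial coefficient.

```python
import math

def n_splits(n_chunks):
    """Number of ways to choose the in-sample half from n_chunks chunks."""
    return math.comb(n_chunks, n_chunks // 2)

for s in (8, 16, 24, 32):
    print(s, n_splits(s))
# prints:
# 8 70
# 16 12870
# 24 2704156
# 32 601080390
```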

How does PBO relate to DSR?

PBO and the Deflated Sharpe Ratio are both corrections for backtest selection bias co-authored by Bailey and López de Prado, developed around the same time and aimed at the same problem from different angles. DSR adjusts a single Sharpe for the number of trials and the moments of the return distribution. PBO uses the full population of strategy returns to estimate the rank-stability of your best pick. DSR works when you only have the champion's tearsheet; PBO works when you have the whole grid. Reporting both is the rigorous default.
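
For comparison, here is a minimal sketch of the DSR calculation. It follows the formula as it is commonly stated (a Probabilistic Sharpe Ratio evaluated against the expected maximum Sharpe of zero-skill trials) and should be checked against the original paper before being relied on. The function names and example inputs are hypothetical, the Sharpe ratio and its cross-trial variance are per-period rather than annualized, and kurtosis is raw (3 for a normal distribution).

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant

def expected_max_sharpe(n_trials, sharpe_variance):
    """Approximate expected maximum Sharpe across n_trials (>= 2) zero-skill trials."""
    z = NormalDist()
    return math.sqrt(sharpe_variance) * (
        (1 - EULER_GAMMA) * z.inv_cdf(1 - 1 / n_trials)
        + EULER_GAMMA * z.inv_cdf(1 - 1 / (n_trials * math.e))
    )

def deflated_sharpe(sr, n_obs, skew, kurt, n_trials, sharpe_variance):
    """Probability that the true Sharpe exceeds the best-of-noise benchmark."""
    sr0 = expected_max_sharpe(n_trials, sharpe_variance)
    denom = math.sqrt(1 - skew * sr + (kurt - 1) / 4 * sr ** 2)
    return NormalDist().cdf((sr - sr0) * math.sqrt(n_obs - 1) / denom)

# Hypothetical champion: per-period Sharpe 0.1 over 500 observations, normal-like returns.
dsr_10 = deflated_sharpe(0.1, 500, 0.0, 3.0, n_trials=10, sharpe_variance=0.002)
dsr_200 = deflated_sharpe(0.1, 500, 0.0, 3.0, n_trials=200, sharpe_variance=0.002)
print(f"DSR with 10 trials: {dsr_10:.2f}, with 200 trials: {dsr_200:.2f}")
```

The same champion Sharpe earns a lower DSR when it was selected from 200 trials than from 10, which is the trial-count adjustment described above. Unlike PBO, nothing here needs the other strategies' return series, only their count and the variance of their Sharpe ratios.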