A backtest curve that looks too clean usually is. Two failure modes hide behind the equity line: parameter mining (you tuned the strategy until the past fit) and selection bias (you ran many strategies and reported the best one). PBO and DSR diagnose each. A short checklist catches most of it.
Overfitting is the failure mode that destroys more strategies than any single market event. A backtest with Sharpe 3.5 looks like an opportunity until the live PnL arrives — at which point you discover the historical curve was fitted to a sample, not to a real edge, and the live result regresses to zero or worse. The defense is not more cleverness; it is statistical hygiene applied before going live.
The two failure modes are distinct but compound. Parameter mining is what happens inside one strategy when you tune until the past looks great. Selection bias is what happens across many strategies when you only deploy the one that won the in-sample contest. Both leave the same fingerprint — an in-sample number that does not generalize — but they require different defenses.
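Crypto compounds the general overfitting problem in three ways: short, regime-heavy price histories; wide-open parameter spaces; and survivorship and selection bias in the backtests that get shared.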
The result: equity curves that look stunningly clean over the in-sample window and fall apart immediately in live trading. The most expensive form of overfitting is the kind that survives a single train/test split — because that single split gives you false confidence to deploy.
Failure mode 1 — parameter mining. You build one strategy and tune its parameters (lookback, threshold, stop) on historical data. The optimization finds a combination that produces a great in-sample Sharpe. The combination fits noise, not signal. Tell-tale signs: the optimal parameters sit on a narrow spike rather than a wide plateau in the parameter-sensitivity surface; small parameter changes drastically alter performance.
Defense: walk-forward optimization (WFO: validate parameters across multiple rolling out-of-sample windows), parameter-sensitivity heatmaps (require robust plateaus, not spikes), and a smaller-cardinality parameter space (fewer knobs = less room to mine).
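A minimal sketch of how rolling walk-forward windows could be laid out, assuming bars are indexed from 0 and that you re-tune on each train window before scoring the test window that follows it. The function name and the window sizes are illustrative and not taken from any particular backtest engine.

```python
from typing import Iterator, Tuple

def walk_forward_windows(
    n_bars: int, train_size: int, test_size: int
) -> Iterator[Tuple[range, range]]:
    """Yield (train, test) index ranges for rolling walk-forward validation.

    Each test window starts where its train window ends. The pair then
    slides forward by test_size bars, so test windows never overlap.
    """
    start = 0
    while start + train_size + test_size <= n_bars:
        train = range(start, start + train_size)
        test = range(start + train_size, start + train_size + test_size)
        yield train, test
        start += test_size

# Hypothetical sizes: 1,000 bars, tune on 400, validate on the next 100.
folds = list(walk_forward_windows(n_bars=1000, train_size=400, test_size=100))
for train, test in folds:
    print(f"train [{train.start}, {train.stop})  test [{test.start}, {test.stop})")
```

Parameters chosen on a train range are only ever scored on the test range that follows it. The aggregate out-of-sample result is the concatenation of the test windows.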
Failure mode 2 — selection bias across strategies. You build 500 candidate strategies (or run a grid search that generates them implicitly). You report the best in-sample performer. Even if no individual strategy was overfit, the act of selecting the best of 500 produces a winner with an upward-biased Sharpe estimate — you cherry-picked from a noisy distribution.
Defense: report all candidates honestly, then apply PBO across the candidate set to estimate the probability the winner is noise, or apply DSR to the winning Sharpe with N = number of trials to get a noise-adjusted p-value. WFO does not fix this — WFO addresses parameter robustness within a single strategy, not selection across strategies.
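To see why the best of many candidates is biased upward, the simulation below generates 500 strategies with zero true edge and compares the median Sharpe with the best one. It is a sketch under stated assumptions: Gaussian daily returns, one year of data, 365 periods per year, and a zero risk-free rate.

```python
import math
import random
import statistics

def annualized_sharpe(returns, periods_per_year=365):
    """Annualized Sharpe of per-period returns (risk-free rate assumed zero)."""
    n = len(returns)
    mean = sum(returns) / n
    var = sum((r - mean) ** 2 for r in returns) / (n - 1)  # sample variance
    return mean / math.sqrt(var) * math.sqrt(periods_per_year)

rng = random.Random(7)
n_strategies, n_days = 500, 365

# Every candidate is pure noise: the true mean return is exactly zero.
sharpes = [
    annualized_sharpe([rng.gauss(0.0, 0.02) for _ in range(n_days)])
    for _ in range(n_strategies)
]

print(f"median Sharpe across candidates: {statistics.median(sharpes):+.2f}")
print(f"best Sharpe across candidates:   {max(sharpes):+.2f}")
```

The median sits near zero, as it should, while the best candidate typically shows an annualized Sharpe well above 2 from noise alone. That gap is the selection effect that PBO and DSR are meant to account for.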
Three complementary diagnostics. None alone is sufficient; together they cover most of the failure surface.
Probability of Backtest Overfitting (PBO). Bailey, Borwein, López de Prado & Zhu (2014). Input: a matrix of N candidate strategies’ per-bar returns over the same window. The algorithm splits the return matrix combinatorially into in-sample (IS) / out-of-sample (OOS) pairs, identifies the IS winner in each split, and counts the fraction of splits where that winner underperforms the OOS median. That fraction is PBO. PBO ≥ 0.5 means picking the IS winner does no better than picking a candidate at random: it lands in the bottom half out-of-sample at least as often as not. Under 0.1 is the conventional pass threshold. See the PBO explainer for the full mechanism.
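The counting procedure described above can be sketched in a few lines. This is a simplified illustration, not the reference implementation: it uses 8 contiguous time blocks, per-period Sharpe as the performance metric, and the plain "below the OOS median" count from the text, whereas the published procedure works with the logit of the winner's relative OOS rank. The function names and simulated inputs are assumptions.

```python
import itertools
import math
import random
import statistics

def sharpe(returns):
    """Per-period Sharpe (not annualized); 0.0 if the series has no variance."""
    n = len(returns)
    mean = sum(returns) / n
    var = sum((r - mean) ** 2 for r in returns) / n
    return mean / math.sqrt(var) if var > 0 else 0.0

def pbo(returns_by_strategy, n_blocks=8):
    """Share of IS/OOS splits in which the IS winner lands below the OOS median."""
    n_bars = len(returns_by_strategy[0])
    block_len = n_bars // n_blocks  # trailing remainder bars are dropped
    blocks = [range(b * block_len, (b + 1) * block_len) for b in range(n_blocks)]

    below_median = 0
    n_splits = 0
    # Every way of choosing half the blocks as IS; the other half is OOS.
    for is_ids in itertools.combinations(range(n_blocks), n_blocks // 2):
        is_idx = [t for b in is_ids for t in blocks[b]]
        oos_idx = [t for b in range(n_blocks) if b not in is_ids for t in blocks[b]]

        is_perf = [sharpe([r[t] for t in is_idx]) for r in returns_by_strategy]
        oos_perf = [sharpe([r[t] for t in oos_idx]) for r in returns_by_strategy]

        winner = max(range(len(is_perf)), key=is_perf.__getitem__)
        if oos_perf[winner] < statistics.median(oos_perf):
            below_median += 1
        n_splits += 1
    return below_median / n_splits

# Simulated inputs: 20 zero-edge strategies, then the same set with one real edge.
rng = random.Random(0)
noise = [[rng.gauss(0.0, 0.01) for _ in range(400)] for _ in range(20)]
with_edge = noise[:-1] + [[rng.gauss(0.004, 0.01) for _ in range(400)]]

pbo_noise = pbo(noise)
pbo_edge = pbo(with_edge)
print(f"PBO, 20 pure-noise strategies: {pbo_noise:.2f}")
print(f"PBO, 19 noise + 1 real edge:   {pbo_edge:.2f}")
```

When one candidate has a genuine edge, it wins in-sample and stays above the median out-of-sample, so PBO falls toward zero. A pure-noise set has no such anchor. Its score from a single simulated run can vary widely because the splits reuse the same data.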
Deflated Sharpe Ratio (DSR). Bailey & López de Prado (2014). Input: the winning Sharpe, the number of trials N, the variance of Sharpe across those trials, the skewness and kurtosis of the return series, and the sample size. Output: the probability that the true Sharpe exceeds the best Sharpe you would expect from N zero-edge trials. Its complement, 1 − DSR, is the p-value testing whether the Sharpe is statistically distinguishable from zero given the search effort; p < 0.05 (DSR > 0.95) is the conventional bar. See the DSR explainer.
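A compact sketch of the DSR calculation as commonly stated: first the Sharpe you would expect from the best of N zero-edge trials, then the probability that the observed Sharpe beats that benchmark after adjusting for sample length, skewness, and kurtosis. The function names are illustrative and the example inputs are hypothetical. Sharpe values are per-period, not annualized. The p-value discussed above corresponds to 1 − DSR.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329
_N = NormalDist()  # standard normal

def expected_max_sharpe(n_trials, sharpe_variance):
    """Expected best Sharpe among n_trials (>= 2) strategies with zero true edge."""
    a = _N.inv_cdf(1.0 - 1.0 / n_trials)
    b = _N.inv_cdf(1.0 - 1.0 / (n_trials * math.e))
    return math.sqrt(sharpe_variance) * ((1.0 - EULER_GAMMA) * a + EULER_GAMMA * b)

def deflated_sharpe(sharpe, n_obs, n_trials, sharpe_variance, skew=0.0, kurtosis=3.0):
    """Probability that the true Sharpe exceeds the best-of-noise benchmark.

    sharpe and sharpe_variance are per-period (not annualized).
    kurtosis is the raw fourth-moment ratio, so a normal series has 3.0.
    """
    benchmark = expected_max_sharpe(n_trials, sharpe_variance)
    denom = math.sqrt(1.0 - skew * sharpe + (kurtosis - 1.0) / 4.0 * sharpe ** 2)
    return _N.cdf((sharpe - benchmark) * math.sqrt(n_obs - 1) / denom)

# Hypothetical inputs: two years of daily returns, a per-day Sharpe of 0.15, and a
# cross-trial Sharpe variance of roughly what noise alone would produce (1 / n_obs).
n_obs = 730
dsr_few = deflated_sharpe(0.15, n_obs, n_trials=5, sharpe_variance=1 / n_obs)
dsr_many = deflated_sharpe(0.15, n_obs, n_trials=500, sharpe_variance=1 / n_obs)
print(f"DSR after 5 trials:   {dsr_few:.3f}")
print(f"DSR after 500 trials: {dsr_many:.3f}")
```

With these inputs the same Sharpe clears the 0.95 bar after a handful of trials and falls short after hundreds, which is why the trial count has to be reported honestly.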
Parameter sensitivity. Plot Sharpe (or any target metric) as a function of each parameter, holding others at the optimum. Real edges produce wide, flat-ish plateaus — the strategy works for a range of parameter values. Overfit strategies produce narrow spikes — Sharpe collapses one step away from the optimum. The plateau-vs-spike test catches most parameter mining without any statistical machinery.
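If you want a number to go with the plot, one rough heuristic is to compare the optimum with its immediate neighbors in a one-dimensional sweep. The `neighbor_retention` function and the two sweeps below are made up for illustration. The ratio is a convenience, not a standard statistic.

```python
def neighbor_retention(sharpe_by_param, width=2):
    """Mean Sharpe of the optimum's neighbors divided by the optimum's Sharpe.

    sharpe_by_param: Sharpe values ordered by parameter value (a 1-D sweep,
    other parameters held fixed). Near 1.0 suggests a plateau; near 0 suggests
    a spike. Returns 0.0 if the best Sharpe is not positive.
    """
    best = max(range(len(sharpe_by_param)), key=sharpe_by_param.__getitem__)
    lo = max(0, best - width)
    hi = min(len(sharpe_by_param), best + width + 1)
    neighbors = [s for i, s in enumerate(sharpe_by_param[lo:hi], start=lo) if i != best]
    if not neighbors or sharpe_by_param[best] <= 0:
        return 0.0
    return sum(neighbors) / len(neighbors) / sharpe_by_param[best]

# Hypothetical Sharpe values from sweeping a lookback parameter.
plateau = [0.9, 1.3, 1.5, 1.6, 1.5, 1.4, 1.0]
spike = [0.2, 0.1, 0.8, 3.2, 0.3, 0.0, 0.1]

print(f"plateau retention: {neighbor_retention(plateau):.2f}")  # 0.89
print(f"spike retention:   {neighbor_retention(spike):.2f}")    # 0.09
```

The spike has the higher headline Sharpe, yet its neighbors retain under a tenth of it. The plateau keeps most of its performance two steps either side of the optimum.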
Before deploying any backtest to live capital, every one of these should pass:
An equity curve with Sharpe 5+ on a single-asset crypto backtest is almost certainly overfit. Real cross-asset Sharpe at the strategy level (after fees and slippage on real Hyperliquid (HL) pairs) tops out around 2.5–3.5 even for well-known carry trades on long histories; single-asset directional strategies rarely sustain above 2.0 net. If your number is meaningfully outside that range, the prior is that something is wrong.
Concrete things to check, in order:
Survivors of this checklist are still candidates, not winners. The honest test is forward live capital at a small size for a meaningful window. Backtests are decision-making tools, not certificates of edge.
The Overfit Tearsheet Checker takes a paste of returns or metrics and produces a composite Overfit Score (0–100) built from PBO, DSR, and parameter-stability components. The score is a quick triage rather than a rigorous certification; the individual diagnostics (PBO and DSR calculators, parameter-sensitivity plots) are where the real interpretation happens.
The Keel backtest engine surfaces parameter-sensitivity views and walk-forward aggregation. Strategy-level PBO and DSR run in the lab-app widgets. Keel for the backtest plus lab-app for the rigor diagnostics is the rigorous workflow today.
Keel is a Strategy OS for AI-assisted systematic trading on Hyperliquid. Backtest, optimize, and run live strategies across single-stock perps, indices, and crypto majors — realistic fees, slippage, and funding modeled.
Crypto is especially prone to overfitting for three reasons. (1) Short, regime-heavy history: most HL perps have months to a couple of years of data, dominated by one or two regime shifts (the 2024 bull, 2025 retracement). A strategy that fits one regime perfectly has effectively been trained on n≈1 underlying environment. (2) Wide-open parameter spaces: crypto's strategy literature is permissive, and a grid search over momentum lookbacks × thresholds × stops can easily span 10,000+ combinations. (3) Survivorship and selection bias: practitioners only share backtest curves that look great, so the published distribution of strategies is already pre-selected for upside outliers. All three amplify the basic statistical problem.
Red flags to watch for: a Sharpe above 4 on a single-asset backtest with fewer than 200 trades; an equity curve with no losing months (real edges have losing months); optimal parameters that change drastically between slightly different time windows; sensitivity to small parameter changes, such as a Sharpe of 3.2 at lookback 23 but 0.8 at lookback 21; no out-of-sample test, or one that conveniently begins after a major regime shift; and performance that depends heavily on one or two outlier trades. Any of these on its own warrants skepticism; two or more is reason to throw the strategy out.
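Three of these red flags are mechanical enough to check in code. The sketch below is illustrative only: the function name and its inputs are hypothetical, and the 50% cutoff for "depends heavily on one or two outlier trades" is an assumed threshold, not one stated above.

```python
def red_flags(sharpe, n_trades, monthly_returns, trade_pnls, outlier_share=0.5):
    """Return a list of triggered red flags for a single backtest."""
    flags = []
    if sharpe > 4 and n_trades < 200:
        flags.append("Sharpe above 4 on fewer than 200 trades")
    if monthly_returns and all(m > 0 for m in monthly_returns):
        flags.append("no losing months")
    total = sum(trade_pnls)
    top_two = sum(sorted(trade_pnls, reverse=True)[:2])
    # outlier_share is an assumed cutoff: top two trades as a share of total PnL.
    if total > 0 and top_two / total > outlier_share:
        flags.append("PnL depends on one or two outlier trades")
    return flags

# Hypothetical backtests.
suspicious = red_flags(
    sharpe=4.6, n_trades=85,
    monthly_returns=[0.04, 0.06, 0.03, 0.05],
    trade_pnls=[900, 700] + [10] * 83,
)
plausible = red_flags(
    sharpe=1.4, n_trades=600,
    monthly_returns=[0.03, -0.02, 0.01, -0.01],
    trade_pnls=[5] * 600,
)
print(suspicious)
print(plausible)
```

The window-stability and out-of-sample flags need the backtest rerun on different data, so they are left out of this check.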
WFO mitigates the parameter-mining failure mode but does not address selection bias from running many strategies. If you grid-search 500 strategies and pick the best WFO performer, that selection itself overfits — the winning WFO score is biased upward. WFO is a necessary but insufficient defense. The complete stack is: WFO for parameter robustness, PBO for selection-bias correction across strategies, DSR for selection-bias correction on Sharpe specifically, and out-of-sample holdout for final validation.
Use both PBO and DSR, at different stages. PBO (Probability of Backtest Overfitting, Bailey et al. 2014) takes a matrix of N candidate strategies' returns and estimates the probability that the in-sample winner underperforms the median out-of-sample. Use it after a grid search to quantify how badly selection bias contaminated your pick. DSR (Deflated Sharpe Ratio, Bailey & López de Prado 2014) takes a single Sharpe estimate plus the count of trials searched, and produces a p-value asking whether the Sharpe is statistically distinguishable from zero given the search effort. PBO answers 'did I overfit?'; DSR answers 'is this Sharpe real?'.
Keel ships single-window parameter optimization and signal-level subsample diagnostics — early/late period IC, regime-conditional IC, parameter sensitivity — which mitigate the parameter-mining failure mode at the signal level. Native walk-forward, PBO, and DSR are on the roadmap. In the meantime the lab-app calculators cover that rigor stack: `/lab/pbo-calculator`, `/lab/deflated-sharpe-calculator`, `/lab/overfit-check`, and `/lab/walk-forward-visualizer` — feed in any backtest output and run PBO, DSR, bootstrap CIs, or fold-by-fold WFO in the browser.