Overfitting in crypto backtests

A backtest curve that looks too clean usually is. Two failure modes hide behind the equity line: parameter mining (you tuned the strategy until the past fit) and selection bias (you ran many strategies and reported the best one). Walk-forward testing and parameter-sensitivity checks target the first; PBO (Probability of Backtest Overfitting) and DSR (Deflated Sharpe Ratio) target the second. A short checklist catches most of both.

By Keel Research Team · Updated May 17, 2026

Overfitting is the failure mode that destroys more strategies than any single market event. A backtest with Sharpe 3.5 looks like an opportunity until the live PnL arrives — at which point you discover the historical curve was fitted to a sample, not to a real edge, and the live result regresses to zero or worse. The defense is not more cleverness; it is statistical hygiene applied before going live.

The two failure modes are distinct but compound. Parameter mining is what happens inside one strategy when you tune until the past looks great. Selection bias is what happens across many strategies when you only deploy the one that won the in-sample contest. Both leave the same fingerprint — an in-sample number that does not generalize — but they require different defenses.

What overfitting looks like in crypto specifically

Crypto compounds the general overfitting problem in three ways:

  • Short, regime-heavy history. Most Hyperliquid (HL) perps have 12–24 months of data, dominated by one or two clear regimes. A momentum strategy fitted to the 2024 bull will look spectacular and tell you almost nothing about whether momentum will work in the next regime. Effective sample size is much smaller than the number of bars suggests.
  • Wide-open parameter spaces. Lookback × threshold × stop × take-profit × sizing grid searches casually run 10,000+ combinations. With that many trials, you will find combinations with great in-sample Sharpe by chance even on random data — Bailey-Borwein showed PBO can exceed 50% on as few as 100 trials in some setups.
  • Selection across strategies. Public crypto strategy curves are pre-filtered for upside outliers — nobody publishes the 100 strategies that failed. Building on top of those means you are starting from an already-overfit distribution.

The result: equity curves that look stunningly clean over the in-sample window and fall apart immediately in live trading. The most expensive form of overfitting is the kind that survives a single train/test split — because that single split gives you false confidence to deploy.
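To make the "great Sharpe by chance" point concrete, here is a minimal sketch that treats each grid-search combination as an independent pure-noise return series and reports the best annualized Sharpe found. The function names, the 365-bar year, and the Gaussian noise model are illustrative assumptions, not a model of any real strategy or of Keel's optimizer.

```python
import math
import random

def sharpe(returns):
    """Per-bar Sharpe ratio: mean over sample standard deviation."""
    n = len(returns)
    mean = sum(returns) / n
    var = sum((r - mean) ** 2 for r in returns) / (n - 1)
    return mean / math.sqrt(var) if var > 0 else 0.0

def noise_grid_search(n_trials, n_bars, bars_per_year=365, seed=0):
    """Annualized Sharpe of n_trials zero-edge 'strategies' (assumed i.i.d. Gaussian)."""
    rng = random.Random(seed)
    scale = math.sqrt(bars_per_year)
    return [
        scale * sharpe([rng.gauss(0.0, 0.02) for _ in range(n_bars)])
        for _ in range(n_trials)
    ]

# 1,000 parameter combinations, one year of daily bars, no edge anywhere.
sharpes = sorted(noise_grid_search(n_trials=1000, n_bars=365))
best_annualized = sharpes[-1]
median_annualized = sharpes[len(sharpes) // 2]
print(f"best: {best_annualized:.2f}  median: {median_annualized:.2f}")
```

With one year of data, each annualized Sharpe estimate has a standard error of roughly 1, so the best of 1,000 noise trials should typically land around 3 even though the true edge is zero. The median stays near zero, which is why reporting only the winner is misleading.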

The two failure modes

Failure mode 1 — parameter mining. You build one strategy and tune its parameters (lookback, threshold, stop) on historical data. The optimization finds a combination that produces a great in-sample Sharpe. The combination fits noise, not signal. Tell-tale signs: the optimal parameters sit on a narrow spike rather than a wide plateau in the parameter-sensitivity surface; small parameter changes drastically alter performance.

Defense: walk-forward optimization (validate parameters across multiple rolling out-of-sample windows), parameter-sensitivity heatmaps (require robust plateaus, not spikes), and a smaller-cardinality parameter space (fewer knobs = less room to mine).
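The rolling out-of-sample windows behind walk-forward optimization can be sketched as a simple index generator. This is a minimal illustration with hypothetical names (`walk_forward_windows`, `train_size`, `test_size`); it assumes fixed-length windows that roll forward by one test window at a time, which is one common choice among several.

```python
def walk_forward_windows(n_bars, train_size, test_size):
    """Yield (train, test) index ranges for rolling walk-forward validation.

    Each test window starts exactly where its training window ends, and the
    pair rolls forward by test_size, so out-of-sample windows never overlap.
    """
    start = 0
    while start + train_size + test_size <= n_bars:
        train = range(start, start + train_size)
        test = range(start + train_size, start + train_size + test_size)
        yield train, test
        start += test_size

windows = list(walk_forward_windows(n_bars=1000, train_size=400, test_size=100))
for train, test in windows:
    print(f"optimize on bars {train.start}-{train.stop - 1}, "
          f"validate on bars {test.start}-{test.stop - 1}")
```

Parameters are re-optimized on each training range and scored only on the test range that follows it; the stitched-together test results form the aggregate out-of-sample record.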

Failure mode 2 — selection bias across strategies. You build 500 candidate strategies (or run a grid search that generates them implicitly). You report the best in-sample performer. Even if no individual strategy was overfit, the act of selecting the best of 500 produces a winner with an upward-biased Sharpe estimate — you cherry-picked from a noisy distribution.

Defense: report all candidates honestly, then apply PBO across the candidate set to estimate the probability the winner is noise, or apply DSR to the winning Sharpe with N = number of trials to get a noise-adjusted p-value. WFO does not fix this — WFO addresses parameter robustness within a single strategy, not selection across strategies.

Detection — PBO, DSR, parameter sensitivity

Three complementary diagnostics. None alone is sufficient; together they cover most of the failure surface.

Probability of Backtest Overfitting (PBO). Bailey, Borwein, López de Prado & Zhu (2014). Input: a matrix of N candidate strategies’ per-bar returns over the same window. The algorithm splits the return matrix combinatorially into IS / OOS pairs, identifies the in-sample winner in each split, and counts the fraction of splits where the IS winner underperforms the OOS median. That fraction is PBO. PBO ≥ 0.5 means the in-sample winner is no better than a coin flip out-of-sample; under 0.1 is the conventional pass threshold. See the PBO explainer for the full mechanism.
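The split-and-count procedure described above can be sketched in a few lines. This is a simplified illustration, not the reference implementation: it compares the in-sample winner's out-of-sample Sharpe directly against the out-of-sample median rather than computing the rank logit used in the paper, and the function name `pbo_cscv` and its arguments are hypothetical.

```python
import itertools
import math
import statistics

def sharpe(returns):
    """Per-bar Sharpe ratio: mean over sample standard deviation."""
    n = len(returns)
    mean = sum(returns) / n
    var = sum((r - mean) ** 2 for r in returns) / (n - 1)
    return mean / math.sqrt(var) if var > 0 else 0.0

def pbo_cscv(returns_by_strategy, n_blocks=8):
    """Fraction of IS/OOS splits where the IS winner lands below the OOS median.

    returns_by_strategy: one equal-length list of per-bar returns per strategy.
    n_blocks should be even; bars beyond n_blocks * block_len are ignored.
    """
    block_len = len(returns_by_strategy[0]) // n_blocks
    blocks = [range(b * block_len, (b + 1) * block_len) for b in range(n_blocks)]
    below, total = 0, 0
    # Every way of choosing half the blocks as in-sample; the rest are out-of-sample.
    for is_ids in itertools.combinations(range(n_blocks), n_blocks // 2):
        is_idx = [i for b in is_ids for i in blocks[b]]
        oos_idx = [i for b in range(n_blocks) if b not in is_ids for i in blocks[b]]
        is_sr = [sharpe([r[i] for i in is_idx]) for r in returns_by_strategy]
        oos_sr = [sharpe([r[i] for i in oos_idx]) for r in returns_by_strategy]
        winner = max(range(len(is_sr)), key=is_sr.__getitem__)
        if oos_sr[winner] < statistics.median(oos_sr):
            below += 1
        total += 1
    return below / total
```

Called on a list of candidate return series from a grid search, `pbo_cscv(candidates, n_blocks=8)` evaluates 70 splits. More blocks give more splits but shorter, noisier windows, so the block count is itself a judgment call.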

Deflated Sharpe Ratio (DSR). Bailey & López de Prado (2014). Input: the winning Sharpe, the number of trials N, the variance of the Sharpe estimates across those trials, the skewness and kurtosis of the return series, and the sample size. Output: the probability that the true Sharpe exceeds the best Sharpe you would expect from N trials with no skill, so 1 - DSR acts as a p-value for whether the winner is statistically distinguishable from zero given the search effort. DSR ≥ 0.95 (p < 0.05) is the conventional bar. See the DSR explainer.
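As a rough sketch of the calculation, the code below follows the formulas in Bailey & López de Prado (2014): an expected maximum Sharpe under the null, then a Probabilistic Sharpe Ratio evaluated against that benchmark. The function and argument names are illustrative. Note that the formula also needs the variance of the Sharpe estimates across trials, that all Sharpe values are per-bar (not annualized), and that `kurtosis` is the raw kurtosis (3.0 for a normal distribution).

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant

def expected_max_sharpe(n_trials, sharpe_variance):
    """Expected best Sharpe among n_trials (>= 2) trials with no true edge."""
    nd = NormalDist()
    return math.sqrt(sharpe_variance) * (
        (1 - EULER_GAMMA) * nd.inv_cdf(1 - 1 / n_trials)
        + EULER_GAMMA * nd.inv_cdf(1 - 1 / (n_trials * math.e))
    )

def deflated_sharpe(sharpe, n_trials, sharpe_variance, n_obs,
                    skew=0.0, kurtosis=3.0):
    """Probability that the true Sharpe exceeds the no-skill benchmark."""
    sr0 = expected_max_sharpe(n_trials, sharpe_variance)
    denom = math.sqrt(1 - skew * sharpe + (kurtosis - 1) / 4 * sharpe ** 2)
    z = (sharpe - sr0) * math.sqrt(n_obs - 1) / denom
    return NormalDist().cdf(z)

# Same per-bar Sharpe of 0.10 over 500 bars, judged against two search efforts.
for n in (10, 1000):
    dsr = deflated_sharpe(0.10, n_trials=n, sharpe_variance=0.0025, n_obs=500)
    print(f"N={n}: DSR={dsr:.3f}, p={1 - dsr:.3f}")
```

The same Sharpe looks far less convincing after 1,000 trials than after 10, which is why the honest trial count matters as much as the Sharpe itself.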

Parameter sensitivity. Plot Sharpe (or any target metric) as a function of each parameter, holding others at the optimum. Real edges produce wide, flat-ish plateaus — the strategy works for a range of parameter values. Overfit strategies produce narrow spikes — Sharpe collapses one step away from the optimum. The plateau-vs-spike test catches most parameter mining without any statistical machinery.
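The plateau-vs-spike idea can also be reduced to a single number for quick screening. The sketch below uses a hypothetical helper, `neighbor_retention`, that reports how much of the peak Sharpe survives at the adjacent grid points; it assumes a one-dimensional grid, at least two grid points, and a positive peak Sharpe. The Sharpe values are made up for illustration.

```python
def neighbor_retention(sharpe_by_param, width=1):
    """Worst Sharpe within `width` grid steps of the optimum, as a fraction of the peak.

    Values near 1.0 suggest a plateau; values near 0 suggest a spike.
    """
    params = sorted(sharpe_by_param)
    values = [sharpe_by_param[p] for p in params]
    peak = max(range(len(values)), key=values.__getitem__)
    lo, hi = max(0, peak - width), min(len(values), peak + width + 1)
    neighbors = [values[i] for i in range(lo, hi) if i != peak]
    return min(neighbors) / values[peak]

# Illustrative lookback -> Sharpe curves (hypothetical numbers).
spike = {19: 0.7, 21: 0.8, 23: 3.2, 25: 0.9, 27: 0.6}
plateau = {19: 1.4, 21: 1.6, 23: 1.7, 25: 1.6, 27: 1.5}

print(f"spike retention:   {neighbor_retention(spike):.2f}")
print(f"plateau retention: {neighbor_retention(plateau):.2f}")
```

Any cutoff on this ratio is a judgment call rather than a standard; the plot itself remains the better diagnostic, and the number is only a way to flag grids worth looking at.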

Practical checklist

Before deploying any backtest to live capital, every one of these should pass:

  1. Sample size. At least 200 trades, ideally 500+. Sharpe estimators converge slowly; below 100 trades the CI on Sharpe is wide enough that any reported number is mostly noise.
  2. Walk-forward (approximated today). Aggregate OOS Sharpe across rolling windows is within 30% of the IS Sharpe. Larger gap = parameter mining. Native WFO is on the roadmap; today, approximate it by running a series of single-window optimizations, rolling the dates manually, and inspecting the results in the walk-forward visualizer.
  3. Parameter sensitivity. Adjacent parameter values produce comparable performance — the optimum sits on a plateau, not a spike.
  4. PBO < 0.1. Run after any grid search or strategy-zoo selection. Catches the selection bias that WFO misses.
  5. DSR p < 0.05. The deflated Sharpe given the number of trials searched is still statistically positive.
  6. Final out-of-sample holdout. A piece of data the strategy has never touched in any optimization or selection step. Performance here is your honest estimate.
  7. No single-trade dependence. Remove the top and bottom 1% of trades; Sharpe should not collapse. If it does, the edge rests on outliers and may not repeat.
  8. Realistic costs. Fees, slippage, funding included. Strategies that look great gross often disappear net of HL taker fees + slippage on the actual book.
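The single-trade-dependence check (item 7) is simple to run on a list of per-trade PnLs. The following is a minimal sketch with hypothetical names; it computes a per-trade Sharpe, trims at least one trade from each tail, and uses a deliberately extreme made-up trade list in which one outlier carries the whole result.

```python
import math

def sharpe(values):
    """Per-trade Sharpe ratio: mean over sample standard deviation."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean / math.sqrt(var) if var > 0 else 0.0

def trimmed_sharpe(trade_pnls, pct=0.01):
    """Sharpe after dropping the top and bottom pct of trades (at least one each)."""
    k = max(1, int(len(trade_pnls) * pct))
    kept = sorted(trade_pnls)[k:-k]
    return sharpe(kept)

# Hypothetical trade list: 200 coin-flip trades plus one enormous winner.
trades = [1.0, -1.0] * 100 + [150.0]

full = sharpe(trades)
trimmed = trimmed_sharpe(trades)
print(f"full: {full:.4f}  trimmed: {trimmed:.4f}")
```

Here the trimmed Sharpe falls to a small fraction of the full-sample value, which is the collapse the checklist item warns about. A strategy with a broad edge should lose only a modest share of its Sharpe under the same trim.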

When your equity curve looks too good

An equity curve with Sharpe 5+ on a single-asset crypto backtest is almost certainly overfit. Real cross-asset Sharpe at the strategy level (after fees and slippage on real HL pairs) tops out around 2.5–3.5 even for well-known carry trades on long histories; single-asset directional strategies rarely sustain above 2.0 net. If your number is meaningfully outside that range, the prior is that something is wrong.

Concrete things to check, in order:

  1. Is there look-ahead bias? Make sure every feature at bar t uses only data available at the close of bar t-1.
  2. Are fees and slippage modeled? An apparent Sharpe 4 with zero costs may be Sharpe 1 net.
  3. Are funding payments included for perp strategies? They’re a real cost or income; ignoring them silently inflates the curve.
  4. How many trades? Below 100, the CI on Sharpe is wide enough that the reported number is mostly noise.
  5. What was the search space? A great in-sample number from a 10,000-trial grid search is the expected outcome even on random data — quantify with DSR.
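Look-ahead bias, the first item above, is often just an off-by-one in how signals are paired with returns. The sketch below uses a hypothetical `strategy_returns` helper and a toy signal that is built from the same bar's return, so the unlagged version is perfect foresight by construction.

```python
def strategy_returns(signals, asset_returns, lag=1):
    """Pair the signal known at bar t - lag with the return earned over bar t.

    lag=0 lets the position see the very return it is about to earn,
    which is look-ahead; lag=1 uses only the previous bar's signal.
    """
    return [signals[t - lag] * asset_returns[t]
            for t in range(lag, len(asset_returns))]

# Toy data: the "signal" is the sign of the same bar's return.
asset_returns = [0.01, -0.02, 0.015, -0.005, 0.02]
signals = [1 if r > 0 else -1 for r in asset_returns]

leaky = strategy_returns(signals, asset_returns, lag=0)
honest = strategy_returns(signals, asset_returns, lag=1)
print("with look-ahead:", leaky)
print("lagged one bar: ", honest)
```

With `lag=0` every bar is a winner; with `lag=1` the same signal loses on every bar of this toy series. Rerunning a suspiciously clean backtest with all features shifted one extra bar is a cheap way to see whether the edge depends on information that was not yet available.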

Survivors of this checklist are still candidates, not winners. The honest test is forward live capital at a small size for a meaningful window. Backtests are decision-making tools, not certificates of edge.

Try the diagnostic

The Overfit Tearsheet Checker takes a paste of returns or metrics and produces a composite Overfit Score (0–100) built from PBO, DSR, and parameter-stability components. The score is a quick triage rather than a rigorous certification; the individual diagnostics (PBO and DSR calculators, parameter-sensitivity plots) are where the real interpretation happens.

The Keel backtest engine surfaces parameter-sensitivity views and walk-forward aggregation. Strategy-level PBO and DSR run in the lab-app widgets. Keel for the backtest plus lab-app for the rigor diagnostics is the rigorous workflow today.

Further reading

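  • Bailey, D. H., Borwein, J. M., López de Prado, M., & Zhu, Q. J. (2014). Pseudo-Mathematics and Financial Charlatanism. SSRN 2308659. Covers backtest overfitting and minimum backtest length; PBO itself is formalized in the same authors' companion paper, The Probability of Backtest Overfitting.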
  • Bailey, D. H., & López de Prado, M. (2014). The Deflated Sharpe Ratio. SSRN 2460551.
  • Pardo, R. (2008). The Evaluation and Optimization of Trading Strategies (2nd ed.). The canonical reference on walk-forward methodology.
  • López de Prado, M. (2018). Advances in Financial Machine Learning. Ch. 11–12 cover backtesting under selection bias in detail.

This article is educational. Surviving every diagnostic listed here does not guarantee live profitability — only that you have done the work to rule out the most common statistical failures. Live deployment with small capital remains the final verification step. Keel does not ship native PBO or DSR diagnostics today; the linked widgets are standalone tools.
FAQ

Overfitting — questions

Why is crypto especially susceptible to overfitting?

Three reasons. (1) Short, regime-heavy history — most HL perps have months to a couple of years of data, dominated by one or two regime shifts (the 2024 bull, 2025 retracement). A strategy that fits one regime perfectly has effectively been trained on n≈1 underlying environment. (2) Wide-open parameter spaces — crypto's strategy literature is permissive, and a grid search over momentum lookbacks × thresholds × stops can easily span 10,000+ combinations. (3) Survivorship and selection bias — practitioners only share backtest curves that look great, so the published distribution of strategies is already pre-selected for upside outliers. All three amplify the basic statistical problem.

What are the common red flags in a backtest?

  • Sharpe above 4 on a single-asset backtest with fewer than 200 trades.
  • An equity curve with no losing months (real edges have losing months).
  • Optimal parameters that change drastically between slightly different time windows.
  • Sensitivity to small parameter changes: Sharpe of 3.2 at lookback 23 but Sharpe of 0.8 at lookback 21.
  • No out-of-sample test, or one that conveniently begins after a major regime shift.
  • Performance that depends heavily on one or two outlier trades.

Any of these on its own warrants skepticism; two or more is reason to throw the strategy out.

Can walk-forward optimization fix overfitting?

WFO mitigates the parameter-mining failure mode but does not address selection bias from running many strategies. If you grid-search 500 strategies and pick the best WFO performer, that selection itself overfits — the winning WFO score is biased upward. WFO is a necessary but insufficient defense. The complete stack is: WFO for parameter robustness, PBO for selection-bias correction across strategies, DSR for selection-bias correction on Sharpe specifically, and out-of-sample holdout for final validation.

PBO vs DSR — which should I use?

Both, at different stages. PBO (Probability of Backtest Overfitting, Bailey-Borwein 2014) takes a matrix of N candidate strategies' returns and estimates the probability that the in-sample winner underperforms the median out-of-sample. Use it after a grid search to quantify how badly selection bias contaminated your pick. DSR (Deflated Sharpe Ratio, Bailey-López de Prado 2014) takes a single Sharpe estimate plus the count of trials searched, and produces a p-value asking whether the Sharpe is statistically distinguishable from zero given the search effort. PBO answers 'did I overfit?'; DSR answers 'is this Sharpe real?'.

What does Keel do today to help with this?

Keel ships single-window parameter optimization and signal-level subsample diagnostics — early/late period IC, regime-conditional IC, parameter sensitivity — which mitigate the parameter-mining failure mode at the signal level. Native walk-forward, PBO, and DSR are on the roadmap. In the meantime the lab-app calculators cover that rigor stack: `/lab/pbo-calculator`, `/lab/deflated-sharpe-calculator`, `/lab/overfit-check`, and `/lab/walk-forward-visualizer` — feed in any backtest output and run PBO, DSR, bootstrap CIs, or fold-by-fold WFO in the browser.