Question 1

What is the Overfit Score actually combining?

Accepted Answer

Three diagnostics, weighted into a single 0-100 score. PBO (40%; Probability of Backtest Overfitting, from Bailey-Borwein-López de Prado-Zhu 2014): how often the best in-sample strategy underperforms the median out-of-sample. DSR (40%; Deflated Sharpe Ratio, from Bailey-López de Prado 2014): the observed Sharpe penalized for the number of trials, sample size, skew, and kurtosis. Parameter stability (20%): how dispersed Sharpe is across nearby parameter combinations. Mode A (single strategy) skips PBO and re-weights DSR to 70% and stability to 30%.
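To make the weighting concrete, here is a minimal sketch of how the three components could be blended. It assumes each component has already been normalized to the range 0 to 1 with higher meaning healthier, and that PBO enters as 1 - PBO because a lower overfitting probability is better. The answer above does not spell out either normalization, so both are assumptions, and the function name overfit_score is hypothetical.

```python
def overfit_score(dsr, stability, pbo=None):
    """Blend component scores into a single 0-100 number.

    Assumptions (not confirmed by the widget's documentation):
    - dsr and stability are already scaled to [0, 1], higher = healthier
    - pbo is a probability in [0, 1], lower = healthier, so it enters as 1 - pbo
    """
    for name, value in (("dsr", dsr), ("stability", stability), ("pbo", pbo)):
        if value is not None and not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")

    if pbo is None:
        # Mode A: no PBO available, so DSR carries 70% and stability 30%.
        blended = 0.70 * dsr + 0.30 * stability
    else:
        # Mode B: PBO 40%, DSR 40%, stability 20%.
        blended = 0.40 * (1.0 - pbo) + 0.40 * dsr + 0.20 * stability
    return round(100.0 * blended, 1)


mode_a = overfit_score(dsr=0.9, stability=0.5)
mode_b = overfit_score(dsr=0.8, stability=0.5, pbo=0.25)
```

The same inputs can score differently in the two modes because Mode B spends 40% of the weight on a diagnostic that Mode A cannot compute.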

Question 2

When should I use Mode A vs Mode B?

Accepted Answer

Mode A (single strategy): you have one tearsheet and want to know if it survives selection bias. You input the observed Sharpe, the sample size, and (critically) the number of parameter combinations you tried during research. Mode B (multi-strategy CSV): you have a parameter grid's worth of return series (e.g., 64 momentum lookback × threshold combinations), one column per combination. Upload the CSV and you get full PBO via the combinatorial split. Mode B is stricter and is the recommended path when you have the data.
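For readers who want to see what a combinatorial split involves, the sketch below follows the combinatorially symmetric cross-validation idea from the PBO paper. Rows are cut into equal blocks, every half-and-half assignment of blocks to in-sample and out-of-sample is tried, and PBO is the share of splits where the in-sample winner lands at or below the out-of-sample median. Whether the widget uses exactly this scheme, this block count, or this Sharpe definition is an assumption. The names pbo_cscv and sharpe are hypothetical.

```python
from itertools import combinations
from math import log
from statistics import mean, stdev


def sharpe(returns):
    """Per-period Sharpe (mean / sample std); 0.0 if returns are constant."""
    sd = stdev(returns)
    return mean(returns) / sd if sd > 0 else 0.0


def pbo_cscv(columns, n_blocks=8):
    """Estimate PBO from a list of return series (one list per strategy).

    Sketch only: contiguous equal-size blocks, trailing rows that do not
    fill a block are dropped, and n_blocks must be even.
    """
    if n_blocks % 2 or n_blocks < 2:
        raise ValueError("n_blocks must be an even number >= 2")
    block_len = len(columns[0]) // n_blocks
    if block_len * (n_blocks // 2) < 2:
        raise ValueError("not enough rows per split to compute a Sharpe")
    blocks = [range(b * block_len, (b + 1) * block_len) for b in range(n_blocks)]

    overfit_splits = 0
    total_splits = 0
    for is_blocks in combinations(range(n_blocks), n_blocks // 2):
        is_rows = [i for b in is_blocks for i in blocks[b]]
        oos_rows = [i for b in range(n_blocks) if b not in is_blocks
                    for i in blocks[b]]
        is_sharpe = [sharpe([col[i] for i in is_rows]) for col in columns]
        oos_sharpe = [sharpe([col[i] for i in oos_rows]) for col in columns]

        # Strategy that looked best in-sample.
        best = max(range(len(columns)), key=is_sharpe.__getitem__)
        # Its relative out-of-sample rank, kept strictly inside (0, 1).
        rank = sum(1 for s in oos_sharpe if s <= oos_sharpe[best])
        omega = rank / (len(columns) + 1)
        logit = log(omega / (1.0 - omega))

        # logit <= 0 means the winner is at or below the OOS median.
        if logit <= 0:
            overfit_splits += 1
        total_splits += 1
    return overfit_splits / total_splits
```

With 8 blocks there are 70 distinct splits, which is why this approach needs the full grid of return series rather than a single tearsheet.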

Question 3

Why does the number of trials matter so much?

Accepted Answer

Selection bias grows with trial count. Under the null (no edge), the expected maximum Sharpe across N independent trials grows roughly like sqrt(2 ln N) times the standard deviation of the Sharpe estimates: slowly, but without bound. If you tried 1000 parameter combinations and picked the top one, your observed Sharpe needs to clear a much higher bar to be statistically real. The DSR formula in this widget makes that bar explicit: a Sharpe of 2.5 from a single hypothesis is impressive; a Sharpe of 2.5 from picking the best of 1000 is roughly noise.
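The effect can be checked numerically. The sketch below follows the published Deflated Sharpe Ratio formula from Bailey and López de Prado, using only the Python standard library. It assumes the Sharpe ratio is expressed per period (not annualized) at the same frequency as n_obs, and it takes the standard deviation of Sharpe across trials (sharpe_std) as an explicit input. How the widget itself obtains that value is not stated, so treat the parameter and both function names as hypothetical.

```python
from math import e, sqrt
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329
_NORMAL = NormalDist()


def expected_max_sharpe(n_trials, sharpe_std):
    """Approximate expected best Sharpe among n_trials strategies with no edge."""
    if n_trials < 2:
        return 0.0  # a single hypothesis carries no selection penalty
    return sharpe_std * (
        (1.0 - EULER_GAMMA) * _NORMAL.inv_cdf(1.0 - 1.0 / n_trials)
        + EULER_GAMMA * _NORMAL.inv_cdf(1.0 - 1.0 / (n_trials * e))
    )


def deflated_sharpe(sharpe, n_obs, n_trials, sharpe_std, skew=0.0, kurt=3.0):
    """Probability that the true Sharpe exceeds the selection-bias benchmark.

    sharpe and sharpe_std are per-period values; kurt is raw kurtosis
    (3.0 for normally distributed returns).
    """
    benchmark = expected_max_sharpe(n_trials, sharpe_std)
    denom = sqrt(1.0 - skew * sharpe + (kurt - 1.0) / 4.0 * sharpe ** 2)
    z = (sharpe - benchmark) * sqrt(n_obs - 1) / denom
    return _NORMAL.cdf(z)


# Illustrative inputs: annualized Sharpe 2.5 on one year of daily returns,
# with an assumed per-period Sharpe dispersion of 0.05 across trials.
daily_sharpe = 2.5 / sqrt(252)
single = deflated_sharpe(daily_sharpe, n_obs=252, n_trials=1, sharpe_std=0.05)
best_of_1000 = deflated_sharpe(daily_sharpe, n_obs=252, n_trials=1000, sharpe_std=0.05)
```

With these illustrative inputs, single comes out near 0.99 while best_of_1000 falls to a little under 0.5, which is the "roughly noise" case described above. Different assumptions for sharpe_std or sample length shift both numbers.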

Question 4

What does the parameter-stability component measure?

Accepted Answer

In Mode A: coefficient of variation across the Sharpe values you paste for nearby parameter combinations (e.g., lookback in [18, 19, 20, 21, 22]). A robust strategy has stable performance across the local parameter neighborhood; an overfit one sits on an isolated peak that crashes one tick in any direction. In Mode B: dispersion of Sharpe across all CSV columns, since each column represents a different parameter configuration. High dispersion (CV > 0.6) means your chosen variant won the lottery.
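As a quick illustration of the stability check, the snippet below computes the coefficient of variation for two made-up neighborhoods of Sharpe values. It uses the population standard deviation divided by the absolute mean. Whether the widget uses the population or sample standard deviation is not stated, so that choice is an assumption, and sharpe_cv is a hypothetical name. The 0.6 threshold is the one quoted above.

```python
from statistics import mean, pstdev


def sharpe_cv(sharpes):
    """Coefficient of variation of Sharpe across parameter neighbors.

    Assumption: population std divided by |mean|. Returns infinity when
    the mean is zero, since dispersion relative to zero is unbounded.
    """
    avg = mean(sharpes)
    if avg == 0:
        return float("inf")
    return pstdev(sharpes) / abs(avg)


# Made-up Sharpe values for lookback in [18, 19, 20, 21, 22].
plateau = [1.9, 2.0, 2.1, 2.0, 1.8]        # performance holds across neighbors
isolated_peak = [0.2, 0.4, 2.5, 0.3, -0.1]  # only lookback 20 works

plateau_cv = sharpe_cv(plateau)
peak_cv = sharpe_cv(isolated_peak)
high_dispersion = peak_cv > 0.6  # threshold quoted in the answer above
```

The plateau case has a CV far below 0.6, while the isolated peak is well above it, even though both contain a neighbor with a Sharpe of 2 or more.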

Question 5

What does this tool NOT catch?

Accepted Answer

Look-ahead bias (your signal used data not available at decision time), survivorship bias (your universe excludes tokens that died), structural breaks (regime changed mid-backtest), data snooping outside the parameter grid (you also picked the asset universe, the rebalance frequency, the position sizing — those choices also have selection bias not captured here), and any form of incorrect P&L accounting. The Overfit Score is necessary but not sufficient. A 90/100 score on a backtest with look-ahead is still garbage.

Question 6

Is this shipped inside Keel as a built-in feature?

Accepted Answer

Not yet. Keel reports point-estimate metrics from the backtest engine. PBO, DSR, and stability composites are on the roadmap — no committed ship date. Use this widget in the meantime: paste any backtest output (from Keel, from a Python notebook, from a commercial tool) and run the composite locally. Once native overfit-detection ships, these three numbers will appear directly in the Keel backtest results card.

Overfit Tearsheet Checker
