Composite scoring that folds Probability of Backtest Overfitting (PBO), Deflated Sharpe Ratio (DSR), and parameter stability into a single Overfit Score from 0 to 100, with a per-component breakdown. For when you want one number that tells you whether your backtest is real. Paste a single tearsheet (DSR + stability) or load a multi-strategy CSV for the full PBO treatment. All computation runs in your browser; the CSV is read locally and nothing is uploaded.
Mode A scores DSR + parameter stability. PBO is undefined when there is only one strategy; the trial-count input N adjusts DSR for selection bias instead.
Observed Sharpe: annualized, from your single tearsheet.
Sample size T: number of return observations.
Skewness: default 0; negative for left-tailed returns.
Kurtosis: default 3 (normal); higher means fatter tails.
Number of trials N: critical for the selection-bias adjustment. Include EVERY variant you ran, not just the surviving ones.
Observation frequency: match your bar frequency.
Neighborhood Sharpe values: Sharpe across nearby parameter combinations (e.g., lookback 18..22). Leave blank for a neutral stability score.
The Overfit Score is a weighted blend of three diagnostics from the academic backtest-rigor literature, normalized to 0-100 and reported alongside the raw sub-scores. The weighting changes by mode: with a single tearsheet you get DSR + parameter-grid stability (PBO is undefined for a single strategy); with a multi-strategy CSV you get the full PBO + DSR + cross-variant stability fold.
# Deflated Sharpe Ratio (Bailey & López de Prado 2014)
SR_0 = sqrt(2*ln(N)) - (gamma + ln(ln(N))) / sqrt(2*ln(N))
Z = (SR - SR_0) * sqrt(T-1) / sqrt(1 - skew*SR + (kurt-1)/4 * SR^2)
DSR = Phi(Z) # gamma = Euler-Mascheroni ~ 0.5772
# Probability of Backtest Overfitting (Bailey-Borwein-LdP-Zhu 2014)
For each combinatorial split of T periods into IS/OOS halves:
    best_IS = argmax_strategy(IS_Sharpe)
    PBO_event = (OOS_rank(best_IS) < median)
PBO = mean(PBO_event)
# Parameter stability (coefficient of variation across local grid)
stability = 1 - clip(stdev(SR_grid) / mean(SR_grid), 0, 1)
How the sub-scores combine. DSR is already a probability in [0, 1], so multiplied by 100 it is a score directly. PBO is converted via (1 - PBO) × 100, since low PBO means low overfit risk. Stability is 1 - CV of Sharpe (with CV clipped to [0, 1]) across the local neighborhood (Mode A) or across CSV columns (Mode B), also × 100. In Mode A the composite is 70% DSR + 30% stability. In Mode B it is 40% PBO + 40% DSR + 20% stability. Bucket cutoffs (80+ robust, 60-80 healthy, 40-60 concerning, 20-40 likely overfit, under 20 almost certainly overfit) are calibrated against the regimes flagged in the source papers; they are a practitioner heuristic, not a formal hypothesis test.
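The weighting and bucketing described above can be sketched in a few lines of Python. This is an illustrative sketch, not the widget's source: the function names `overfit_score` and `bucket` are hypothetical, and treating each bucket's lower bound as inclusive (so exactly 60 counts as healthy) is an assumption, since the text does not say where boundary values fall.

```python
def overfit_score(dsr, stability, pbo=None):
    """Blend sub-scores (each in [0, 1]) into a 0-100 composite.

    pbo=None selects the Mode A weights (70% DSR, 30% stability);
    otherwise the Mode B weights apply (40% PBO, 40% DSR, 20% stability).
    """
    dsr_pts = 100.0 * dsr
    stab_pts = 100.0 * stability
    if pbo is None:
        return 0.7 * dsr_pts + 0.3 * stab_pts
    pbo_pts = 100.0 * (1.0 - pbo)  # low PBO means low overfit risk
    return 0.4 * pbo_pts + 0.4 * dsr_pts + 0.2 * stab_pts


def bucket(score):
    """Map a 0-100 score to its label (lower bounds inclusive: an assumption)."""
    if score >= 80:
        return "robust"
    if score >= 60:
        return "healthy"
    if score >= 40:
        return "concerning"
    if score >= 20:
        return "likely overfit"
    return "almost certainly overfit"


if __name__ == "__main__":
    mode_a = overfit_score(dsr=0.9, stability=0.8)
    mode_b = overfit_score(dsr=0.9, stability=0.8, pbo=0.25)
    print(mode_a, bucket(mode_a))
    print(mode_b, bucket(mode_b))
```

Keeping the sub-scores as separate arguments mirrors the per-component breakdown: the same three numbers can be reported next to the composite.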
Method limits. The Overfit Score is a screen, not a verdict. It does not catch look-ahead bias, survivorship bias, structural breaks, or selection bias outside the parameter grid you ran. It also assumes your reported Sharpe is computed honestly from realized P&L with realistic fees, slippage, and (for crypto) funding modeled. What the composite does is make the selection-bias trade-off explicit and visible: a single backtest with Sharpe 4.0, N=1 trial, T=252 should score very high; the same Sharpe with N=1000 trials should score much lower.
For the longer explainers see Overfitting in crypto backtests, Probability of Backtest Overfitting, and Deflated Sharpe Ratio. References: Bailey & López de Prado (2014); Bailey, Borwein, López de Prado & Zhu (2014); Harvey, Liu & Zhu (2016).
Keel is a Strategy OS for AI-assisted systematic trading on Hyperliquid. Build, backtest, and run live strategies with realistic fees, slippage, and funding modeled. Free to start — connect a Hyperliquid wallet when you’re ready to go live.
Three diagnostics, weighted into a single 0-100 score. PBO (40%; Probability of Backtest Overfitting, Bailey, Borwein, López de Prado & Zhu 2014): how often the best in-sample strategy underperforms the median out-of-sample. DSR (40%; Deflated Sharpe Ratio, Bailey & López de Prado 2014): observed Sharpe penalized for the number of trials, sample size, skew, and kurtosis. Parameter stability (20%): how little Sharpe varies across nearby parameter combinations. Mode A (single strategy) skips PBO and re-weights DSR to 70% and stability to 30%.
Mode A (single strategy): you have one tearsheet and want to know if it survives selection bias. You input observed Sharpe, sample size, and (critically) the number of parameter combinations you tried during research. Mode B (multi-strategy CSV): you have a parameter grid's worth of return series (e.g., 64 momentum lookback × threshold combinations). Load the CSV and you get full PBO via the combinatorial split. Mode B is stricter and is the recommended path when you have the data.
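A minimal sketch of the combinatorial split behind Mode B, following the pseudocode on this page rather than the full procedure in the paper (which works with the logit of the relative rank). The names `sharpe` and `pbo`, the `n_blocks` parameter, and the choice to cut the history into equal contiguous blocks are assumptions made for illustration; how the widget partitions periods is not stated.

```python
from itertools import combinations
from statistics import mean, median, pstdev


def sharpe(returns):
    """Per-period Sharpe (no annualization); 0.0 when volatility is zero."""
    sd = pstdev(returns)
    return mean(returns) / sd if sd > 0 else 0.0


def pbo(returns_by_strategy, n_blocks=8):
    """Share of IS/OOS splits where the in-sample winner lands below the
    out-of-sample median. Input: one equal-length return list per strategy."""
    n_obs = len(returns_by_strategy[0])
    size = n_obs // n_blocks  # trailing remainder observations are dropped
    blocks = [range(i * size, (i + 1) * size) for i in range(n_blocks)]
    events = []
    for is_ids in combinations(range(n_blocks), n_blocks // 2):
        is_idx = [t for b in is_ids for t in blocks[b]]
        oos_idx = [t for b in range(n_blocks) if b not in is_ids
                   for t in blocks[b]]
        is_sr = [sharpe([r[t] for t in is_idx]) for r in returns_by_strategy]
        oos_sr = [sharpe([r[t] for t in oos_idx]) for r in returns_by_strategy]
        best = max(range(len(is_sr)), key=is_sr.__getitem__)
        events.append(oos_sr[best] < median(oos_sr))
    return sum(events) / len(events)


if __name__ == "__main__":
    import random
    random.seed(0)
    # 10 pure-noise strategies, 64 observations each: no real edge anywhere.
    noise = [[random.gauss(0.0, 0.01) for _ in range(64)] for _ in range(10)]
    print(pbo(noise, n_blocks=4))
```

With `n_blocks=8` there are 70 splits; the count grows quickly with the number of blocks, so a browser implementation would likely cap it.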
Selection bias grows with trial count. The expected maximum Sharpe under the null (no edge) rises roughly as sqrt(2 ln N): slowly, but without bound. If you tried 1000 parameter combinations and picked the top one, your observed Sharpe needs to clear a much higher bar to be statistically real. The DSR formula in this widget makes that bar explicit: Sharpe of 2.5 from a single hypothesis is impressive; Sharpe of 2.5 from picking the best of 1000 is roughly noise.
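To see the bar move, here is a hedged Python sketch that evaluates the SR_0 and DSR formulas exactly as written on this page. Several choices are assumptions rather than documented widget behavior: the function names are hypothetical, SR_0 is set to 0 for N=1 (the formula is undefined there because ln(1) = 0), and the Sharpe is plugged in as entered, since the page does not say whether the widget rescales it by bar frequency first.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant


def expected_max_sharpe(n_trials):
    """SR_0 from the formula on this page; 0 for a single trial (assumption)."""
    if n_trials <= 1:
        return 0.0
    a = math.sqrt(2.0 * math.log(n_trials))
    return a - (EULER_GAMMA + math.log(math.log(n_trials))) / a


def deflated_sharpe(sr, n_obs, n_trials, skew=0.0, kurt=3.0):
    """DSR = Phi(Z), with Z as defined on this page.

    sr must be in the same units as SR_0; no rescaling is done here.
    """
    sr0 = expected_max_sharpe(n_trials)
    denom = math.sqrt(1.0 - skew * sr + (kurt - 1.0) / 4.0 * sr ** 2)
    z = (sr - sr0) * math.sqrt(n_obs - 1) / denom
    return NormalDist().cdf(z)


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, round(expected_max_sharpe(n), 2))  # N=1000 gives about 3.04
    print(deflated_sharpe(2.5, n_obs=252, n_trials=1))     # close to 1
    print(deflated_sharpe(2.5, n_obs=252, n_trials=1000))  # close to 0
```

Under these assumptions a Sharpe of 2.5 sits below the N=1000 threshold of about 3.04, which is why its DSR collapses while the single-trial case stays near 1.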
In Mode A, stability is based on the coefficient of variation (CV) of the Sharpe values you paste for nearby parameter combinations (e.g., lookback in [18, 19, 20, 21, 22]). A robust strategy has stable performance across the local parameter neighborhood; an overfit one sits on an isolated peak that crashes one tick in any direction. In Mode B, it is the dispersion of Sharpe across all CSV columns, since each column represents a different parameter configuration. High dispersion (CV > 0.6) suggests your chosen variant won the lottery.
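The stability sub-score can be illustrated with a short Python sketch of the `1 - clip(CV, 0, 1)` formula. The function name `stability_score` is hypothetical, and three details are assumptions because the page does not specify them: sample (n-1) standard deviation is used, a non-positive mean Sharpe is scored 0 because CV is not meaningful there, and fewer than two values returns `None` rather than a guessed "neutral" score.

```python
from statistics import mean, stdev


def stability_score(sr_grid):
    """1 - clip(CV, 0, 1) over Sharpe values from neighboring parameters."""
    if len(sr_grid) < 2:
        return None  # not enough neighbors to measure dispersion (assumption)
    m = mean(sr_grid)
    if m <= 0:
        return 0.0  # CV is not meaningful for a non-positive mean (assumption)
    cv = stdev(sr_grid) / m  # sample standard deviation (assumption)
    return 1.0 - min(max(cv, 0.0), 1.0)


if __name__ == "__main__":
    plateau = [1.8, 1.9, 2.0, 1.9, 1.8]  # lookback 18..22, smooth neighborhood
    spike = [0.2, 0.3, 2.5, 0.1, -0.1]   # isolated peak at the chosen setting
    print(stability_score(plateau))
    print(stability_score(spike))
```

The plateau scores close to 1, while the spike's CV exceeds 1 and is clipped, giving a stability of 0.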
The score does not catch look-ahead bias (your signal used data not available at decision time), survivorship bias (your universe excludes tokens that died), structural breaks (the regime changed mid-backtest), data snooping outside the parameter grid (you also picked the asset universe, the rebalance frequency, and the position sizing, and those choices carry selection bias not captured here), or any form of incorrect P&L accounting. A high Overfit Score is necessary but not sufficient. A 90/100 score on a backtest with look-ahead is still garbage.
Keel does not compute the Overfit Score natively yet. Keel reports point-estimate metrics from the backtest engine. PBO, DSR, and stability composites are on the roadmap, with no committed ship date. Use this widget in the meantime: paste any backtest output (from Keel, from a Python notebook, from a commercial tool) and run the composite locally. Once native overfit detection ships, these three numbers will appear directly in the Keel backtest results card.
Overfitting in crypto backtests: the from-scratch explainer on why crypto strategies overfit faster than equity ones (short histories, fewer regimes, more reflexive flow), and what to do about it.
Probability of Backtest Overfitting: standalone PBO via combinatorial split. Use it when you want the PBO number on its own, without the DSR + stability fold.
Deflated Sharpe Ratio: standalone DSR. This is the piece of the composite that handles single-strategy selection-bias adjustment when you do not have a full parameter grid.