Backtest Rigor

Deflated Sharpe Ratio Calculator

Adjust an observed Sharpe for the number of strategy variants tested, sample size, skew, and kurtosis of returns. Output is the probability the true Sharpe is positive after correcting for trial-selection bias and non-normality. Implements Bailey & López de Prado (2014).

Bailey & López de Prado 2014 · in-browser · no upload
By Keel Research Team · Updated May 17, 2026
Inputs

Observed Sharpe (SR): the best Sharpe you reported across all variants.

Number of trials (N): how many strategy variants you tested on the same data. Honest counts beat optimistic ones.

Sample size (T): number of bars used to compute the Sharpe, e.g. 252 daily or 8760 hourly.

Skewness: negative skew (crash risk) penalises Sharpe; 0 = symmetric.

Kurtosis: 3 = Gaussian. Crypto strategies typically fall between 5 and 10.

Periods per year: 252 (daily), 365 (calendar), 8760 (hourly). Reference only: both SR and SR₀ are quoted annualized, so this cancels in the formula.

Result
Deflated Sharpe Ratio: 0.018

SR₀ (expected max under null): 1.764. Across 20 trials, this is the Sharpe you would expect by luck alone.

z-statistic: −2.089

Variance term (denominator²): 4.000. Computed as 1 − skew·SR + (kurt − 1)/4 · SR²; inflated by negative skew or fat tails.

Likely overfit

DSR < 0.05: the observed Sharpe is below what selection alone would have produced under the null. Almost certainly an artifact of trial-selection bias, short sample, or fat-tailed returns. Do not deploy. Reduce N, extend T, or both — and re-test on a fresh sample.

DSR vs number of trials

Holding SR, T, skew, and kurt constant — watch DSR collapse as the trial count climbs. The dot marks your current N.
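
The curve described above can be approximated offline. The following is a minimal sketch, not the widget's own code: it assumes SR = 1.5, T = 252, skew = −0.5 and kurt = 5 as illustrative inputs, uses the closed-form SR₀ from the Methodology section, and treats N = 1 as SR₀ = 0. The function names are hypothetical.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329  # Euler–Mascheroni constant


def null_max_sharpe(n: int) -> float:
    """Closed-form expected max Sharpe under the null; taken as 0 when N = 1."""
    if n == 1:
        return 0.0
    root = math.sqrt(2.0 * math.log(n))
    return root - (EULER_GAMMA + math.log(math.log(n))) / root


def dsr_probability(sr: float, n: int, t: int, skew: float, kurt: float) -> float:
    """DSR = Phi(Z), with Z as defined in the Methodology section."""
    variance_term = 1.0 - skew * sr + (kurt - 1.0) / 4.0 * sr ** 2
    z = (sr - null_max_sharpe(n)) * math.sqrt(t - 1) / math.sqrt(variance_term)
    return NormalDist().cdf(z)


# Assumed inputs for illustration: SR = 1.5, T = 252, skew = -0.5, kurt = 5.
trial_counts = [1, 5, 10, 20, 50, 100]
curve = [(n, dsr_probability(1.5, n, 252, -0.5, 5.0)) for n in trial_counts]

for n, p in curve:
    print(f"N = {n:>3}  DSR = {p:.3f}")
```

Under these assumed inputs the probability falls monotonically as N grows, which is the collapse the chart illustrates.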

How it works

Methodology

The Deflated Sharpe Ratio (DSR) is a probability — specifically, the probability that the true Sharpe of a strategy is greater than zero, conditional on the observed Sharpe, the number of variants tested, the sample size, and the higher moments of the return distribution. Formally:

Z = (SR − SR₀) · √(T − 1) / √(1 − skew·SR + (kurt − 1)/4 · SR²)
DSR = Φ(Z)

Where SR₀ is the expected maximum Sharpe across N independent random strategies under the null of zero true edge:

SR₀ = √(2 ln N) − (γ + ln ln N) / √(2 ln N)
where γ ≈ 0.5772 (Euler–Mascheroni)
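
As a rough check of the two formulas, here is a minimal Python sketch using only the standard library. The function names are hypothetical, and the inputs SR = 1.5, T = 252, skew = −0.5 and kurt = 5 are assumptions: they are one combination consistent with the Result panel (N = 20, variance term 4.000), not values confirmed by the page. Because the closed form for SR₀ is undefined at N = 1, the sketch returns 0 in that case by convention.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329  # Euler–Mascheroni constant


def expected_max_sharpe(n_trials: int) -> float:
    """SR0 from the closed form above; 0 for a single trial (no selection)."""
    if n_trials < 1:
        raise ValueError("n_trials must be at least 1")
    if n_trials == 1:
        return 0.0  # the closed form needs N > 1 because ln(1) = 0
    root = math.sqrt(2.0 * math.log(n_trials))
    return root - (EULER_GAMMA + math.log(math.log(n_trials))) / root


def deflated_sharpe(sr, n_trials, t, skew=0.0, kurt=3.0):
    """Return (dsr, z, sr0, variance_term) for the formulas in this section."""
    sr0 = expected_max_sharpe(n_trials)
    variance_term = 1.0 - skew * sr + (kurt - 1.0) / 4.0 * sr ** 2
    if variance_term <= 0.0:
        raise ValueError("variance term must be positive")
    z = (sr - sr0) * math.sqrt(t - 1) / math.sqrt(variance_term)
    return NormalDist().cdf(z), z, sr0, variance_term


# Assumed inputs (see the lead-in): SR = 1.5, N = 20, T = 252, skew = -0.5, kurt = 5.
dsr, z, sr0, variance_term = deflated_sharpe(1.5, 20, 252, skew=-0.5, kurt=5.0)
print(f"SR0 = {sr0:.3f}, Z = {z:.3f}, variance term = {variance_term:.3f}, DSR = {dsr:.3f}")
```

With these assumed inputs the sketch gives SR₀ ≈ 1.764, Z ≈ −2.089 and DSR ≈ 0.018, in line with the Result panel.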

What the deflator actually does. Subtracting SR₀ penalises the observed Sharpe for the selection bias introduced by running N trials. The variance term in the denominator inflates noise when returns have negative skew or fat tails — both of which make a given Sharpe less trustworthy. The √(T − 1) scaling tightens the estimate with longer samples; with 252 days of data you can distinguish less than you might think.
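
To isolate the effect of the denominator, the short sketch below evaluates the variance term for the same Sharpe under two sets of moments. The helper name and the input values (SR = 1.5; skew = −0.5 and kurt = 5 for the fat-tailed case) are illustrative assumptions.

```python
import math


def variance_term(sr: float, skew: float, kurt: float) -> float:
    """Denominator squared: 1 - skew*SR + (kurt - 1)/4 * SR^2."""
    return 1.0 - skew * sr + (kurt - 1.0) / 4.0 * sr ** 2


gaussian = variance_term(1.5, skew=0.0, kurt=3.0)     # symmetric, Gaussian tails
fat_tailed = variance_term(1.5, skew=-0.5, kurt=5.0)  # negative skew, fat tails

# Z is divided by the square root of the variance term, so this ratio is the
# factor by which |Z| shrinks when moving from Gaussian to fat-tailed moments.
z_shrink = math.sqrt(gaussian / fat_tailed)
print(gaussian, fat_tailed, round(z_shrink, 3))
```

Here the variance term rises from 2.125 to 4.0, so the same gap between SR and SR₀ produces a Z that is roughly 27% smaller in magnitude.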

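When N = 1 there is no selection to correct for, so SR₀ is zero and DSR reduces to the standard probabilistic Sharpe ratio (PSR): the probability the true Sharpe exceeds zero given a single, pre-specified strategy. The closed form for SR₀ above is undefined at N = 1 (ln 1 = 0), so N = 1 has to be treated as a special case rather than read off the formula. For N > 1 the deflator does the work the t-test cannot.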

The widget runs the full computation in your browser; nothing is uploaded. For the background on trial bias, why crypto Sharpes are particularly vulnerable to it, and how to count trials honestly, see the DSR explainer.

Automate it

Trade systematically on Keel

Keel is a Strategy OS for AI-assisted systematic trading on Hyperliquid. Build, backtest, and run live strategies with realistic fees, slippage, and funding modeled. Free to start — connect a Hyperliquid wallet when you’re ready to go live.

What you can do
  • Backtest any strategy with realistic fees, slippage, and funding modeled.
  • Optimize across parameter grids — Sharpe, drawdown, hit rate.
  • Deploy live to Hyperliquid with stop-loss + position limits.
  • Iterate with AI — describe a thesis, get a tradeable pipeline.
FAQ

Calculator questions

What does the Deflated Sharpe Ratio actually deflate?

It deflates the observed Sharpe ratio for four things at once: (1) the number of strategy variants you tested (more trials mean a higher expected max Sharpe under the null, even with zero edge); (2) sample size T (short samples produce noisy Sharpes); (3) skewness (negative skew, i.e. rare crashes, makes a given Sharpe less reliable); (4) kurtosis (fat tails, meaning kurtosis above the Gaussian value of 3, make a given Sharpe less reliable). The output is the probability that the true Sharpe is greater than zero given everything you tried and the shape of your returns.

Why do I need to enter the number of trials?

Because the expected max Sharpe across N independent random strategies grows roughly like √(2 ln N), even when no strategy has any real edge. If you tested 100 parameter variants and reported the best Sharpe, your number is contaminated by selection bias. DSR corrects for that by comparing your observed Sharpe to SR₀, the expected best Sharpe under the null of zero true edge. If you only ran one strategy, set N = 1 and the trial-bias correction drops out (SR₀ = 0).

How do I count trials honestly?

Count every parameter combination you evaluated on the same data, including informal ones. If you swept lookback ∈ {10, 20, 50, 100} × threshold ∈ {0.01, 0.02, 0.05} × asset ∈ {BTC, ETH, SOL}, that is 36 trials, not one. Add manual variants ("I also tried with a 7-day filter") — those count too. If anything, undercounting is the bigger risk; most practitioners underestimate N by 5-10x. When unsure, sensitivity-test by entering both your best-guess N and 10×N.
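
The grid count in this answer can be reproduced mechanically. The sketch below enumerates the sweep with itertools.product and adds manual variants on top; the manual count of 1 (the "7-day filter" run) is a hypothetical placeholder that you would replace with your own tally.

```python
from itertools import product

lookbacks = [10, 20, 50, 100]
thresholds = [0.01, 0.02, 0.05]
assets = ["BTC", "ETH", "SOL"]

# Every combination evaluated on the same data is one trial.
grid = list(product(lookbacks, thresholds, assets))

manual_variants = 1  # hypothetical: e.g. one extra run with a 7-day filter
n_trials = len(grid) + manual_variants

# Sensitivity check suggested above: best-guess N and 10 x N.
n_candidates = [n_trials, 10 * n_trials]
print(len(grid), n_trials, n_candidates)
```

Entering both values from n_candidates into the calculator shows how much the result depends on the trial count.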

How is DSR different from a regular Sharpe t-test?

A regular Sharpe significance test asks "given a single sample, is the Sharpe distinguishable from zero?" DSR asks "given that I selected this Sharpe from N alternatives, is it distinguishable from what selection alone would have produced?" The two converge at N = 1 (no selection); they diverge sharply as N grows. A Sharpe of 1.5 over 252 days is wildly significant by t-test (~3-sigma) but can be statistical noise once you account for testing 30 variants.

Does Keel compute DSR natively on backtest results?

Not yet. Today Keel reports observed Sharpe, Sortino, max drawdown, and other point estimates from the backtest engine — DSR is not built in. This calculator is the bridge. Paste your observed Sharpe and an honest trial count from the optimizer history, and read off the deflated probability. Native DSR alongside the other metrics is on the roadmap; no committed ship date.

What does DSR not catch?

DSR is a strong diagnostic, but it has limits. It does not catch look-ahead bias (using data unavailable at signal time), structural breaks (the regime that produced your Sharpe is gone), or non-stationarity (the return distribution itself shifted during the sample). It assumes your trial count is honest and that the N strategies were drawn from a single, exchangeable population. Treat a high DSR as necessary, not sufficient — pair it with bootstrap CIs on the same returns and an out-of-sample period the model never saw.
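
One way to pair DSR with a bootstrap interval is sketched below. It is a plain percentile bootstrap that resamples observations independently, so it ignores serial dependence in returns; a block bootstrap may be more appropriate for autocorrelated series. The function names, the resample count and the synthetic returns are all assumptions for illustration, not part of the calculator.

```python
import random
import statistics


def sharpe(returns):
    """Per-period Sharpe: mean divided by sample standard deviation."""
    return statistics.fmean(returns) / statistics.stdev(returns)


def bootstrap_sharpe_ci(returns, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the per-period Sharpe.

    Observations are resampled independently with replacement, so any
    autocorrelation in the return series is not preserved.
    """
    rng = random.Random(seed)
    n = len(returns)
    draws = sorted(sharpe(rng.choices(returns, k=n)) for _ in range(n_resamples))
    lower = draws[int(alpha / 2 * n_resamples)]
    upper = draws[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Synthetic daily returns for illustration only (not real strategy data).
data_rng = random.Random(42)
daily_returns = [data_rng.gauss(0.001, 0.02) for _ in range(252)]

point = sharpe(daily_returns)
lower, upper = bootstrap_sharpe_ci(daily_returns)
print(f"Per-period Sharpe: {point:.3f}, 95% interval: [{lower:.3f}, {upper:.3f}]")
```

An interval that comfortably straddles zero is a warning sign even when the point estimate looks attractive.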