The Deflated Sharpe Ratio (DSR), introduced by Bailey & López de Prado (2014), adjusts a backtest Sharpe for the number of trials you ran, the sample length, and the skew and kurtosis of the return distribution. A Sharpe above 1.0 that survives DSR at the 0.95 level is meaningful; one that doesn't is more likely selection bias than edge.
The Sharpe ratio is the most-cited number in quantitative finance, and the most abused. Any researcher who has run a parameter grid knows the feeling — try enough combinations and one of them prints a Sharpe of 3. The problem is not the Sharpe formula; it is what happens when you pick the maximum out of a large set of trials. The maximum of many noisy estimates is biased upward, even when the underlying signals are pure noise. Selection inflates Sharpe. The Deflated Sharpe Ratio is the correction.
Bailey and López de Prado introduced DSR in their 2014 Journal of Portfolio Management paper, The Deflated Sharpe Ratio: Correcting for Selection Bias, Backtest Overfitting, and Non-Normality. The deflation does two jobs at once. First, it accounts for how many trials produced the reported Sharpe — the more strategies you tried, the higher the expected maximum Sharpe under the null hypothesis of zero edge. Second, it corrects for non-normal return distributions — fat tails and skew make the standard Sharpe inference too generous.
Suppose you run a grid of 1,000 random strategies on the same 5-year price history. None has any genuine edge; each is just noise. The estimated Sharpe ratios are approximately Gaussian around zero, with a standard error set by sample size: for an annualized Sharpe near zero measured over 5 years, roughly √(1/5) ≈ 0.45. The maximum of 1,000 independent draws from that distribution is expected to land about 3.2 standard errors above zero, so call it Sharpe ≈ 1.5 on annualized 5-year data. That number is the lottery winner, not a signal.
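The lottery-winner effect is easy to reproduce. The sketch below is a simulation under stated assumptions, not a reference implementation: 1,000 independent strategies, Gaussian daily returns with zero true mean, 252 trading days per year, and a fixed seed. The helper name `annualized_sharpe` is illustrative.

```python
import math
import random
import statistics


def annualized_sharpe(returns, periods_per_year=252):
    """Sample Sharpe ratio of a return series, scaled to annual units."""
    n = len(returns)
    mean = sum(returns) / n
    var = sum((r - mean) ** 2 for r in returns) / (n - 1)
    return mean / math.sqrt(var) * math.sqrt(periods_per_year)


rng = random.Random(42)   # fixed seed so the run is repeatable
n_strategies = 1_000
n_days = 5 * 252          # five years of daily bars (assumed calendar)

# Every strategy is pure noise: zero true mean, 1% daily volatility.
sharpes = [
    annualized_sharpe([rng.gauss(0.0, 0.01) for _ in range(n_days)])
    for _ in range(n_strategies)
]

best = max(sharpes)
typical = statistics.median(sharpes)
spread = statistics.stdev(sharpes)   # empirical standard error of the Sharpe

print(f"median Sharpe:  {typical:+.2f}")
print(f"std of Sharpes: {spread:.2f}")
print(f"best Sharpe:    {best:+.2f} ({best / spread:.1f} standard errors)")
```

The median sits near zero, as it should for noise, while the best of the batch lands well above 1. Real parameter grids produce correlated strategies, so the independent-draw assumption here overstates the effective number of trials for a grid of the same size.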
Report only the best strategy and you have implicitly conducted 1,000 tests while pretending you ran one. The tidy single-test p-value you would quote for that Sharpe of 1.5 understates the false-positive risk by orders of magnitude. The Bonferroni-style correction is to divide your significance threshold by 1,000, but most researchers don't even count their trials, let alone deflate for them. DSR makes the correction explicit and quantitative.
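The arithmetic behind that correction fits in a few lines. This is a minimal sketch that assumes the 1,000 trials are independent and the test is one-sided; correlated trials would make Bonferroni conservative.

```python
from statistics import NormalDist

alpha = 0.05       # per-test significance level
n_trials = 1_000   # strategies actually tried

# Chance that at least one of n_trials independent noise strategies
# clears the uncorrected threshold.
familywise_error = 1.0 - (1.0 - alpha) ** n_trials

# Bonferroni: shrink the per-test threshold so the whole family stays near alpha.
bonferroni_alpha = alpha / n_trials

# One-sided z-score a single strategy must reach under each threshold.
z_uncorrected = NormalDist().inv_cdf(1.0 - alpha)
z_bonferroni = NormalDist().inv_cdf(1.0 - bonferroni_alpha)

print(f"family-wise error rate, uncorrected: {familywise_error:.6f}")
print(f"Bonferroni per-test threshold:       {bonferroni_alpha:.5f}")
print(f"required z-score: {z_uncorrected:.2f} -> {z_bonferroni:.2f}")
```

With 1,000 uncorrected tests, a false positive somewhere in the batch is close to certain, which is why the per-test bar has to rise so sharply.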
The same dynamic shows up in crypto research more aggressively than in equities. A single Hyperliquid backtest sweep over 20 momentum lookbacks × 15 vol-target levels × 10 universes already has 3,000 trials before you've added a regime filter or a funding overlay. The reported Sharpe of the winner is almost guaranteed to look excellent; whether it generalizes is a separate question.
The DSR test statistic, due to Bailey & López de Prado (2014), is:
DSR = Φ( (SR_obs − SR_0) × √(T − 1) / √( 1 − γ₃·SR_obs + ((γ₄ − 1)/4)·SR_obs² ) )
Where Φ is the standard normal CDF, SR_obs is the observed Sharpe, T is the sample size (number of return observations), γ₃ is the skewness of returns, and γ₄ is the kurtosis of returns (raw, not excess, so 3 for a Gaussian). SR_obs and SR_0 must be expressed at the frequency of the T observations, so an annualized Sharpe is divided by the square root of the number of observations per year before it enters the formula. The deflation appears in SR_0, the expected maximum Sharpe under the null for N trials:
SR_0 = √V × ( (1 − γ) · Z⁻¹(1 − 1/N) + γ · Z⁻¹(1 − 1/(N·e)) )
Where V is the variance of Sharpe estimates across the N trials (or an assumed value if not measurable), γ ≈ 0.5772 is the Euler-Mascheroni constant (unrelated to γ₃ and γ₄), e is Euler's number, and Z⁻¹ is the inverse standard normal CDF. The whole machine reduces to: here is the Sharpe you would expect from pure luck if you had tried N strategies; deflate the observed Sharpe by that benchmark, then turn the t-statistic into a probability accounting for skew and kurtosis.
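The two formulas translate almost directly into code. The sketch below uses only the Python standard library; the function names are illustrative, the inputs in the usage section are made-up numbers, and returning SR_0 = 0 for a single trial is an assumption added here because the inverse CDF is undefined at N = 1. Sharpe values are per observation and kurtosis is raw (3 for a Gaussian).

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329   # Euler-Mascheroni constant
STD_NORMAL = NormalDist()


def expected_max_sharpe(n_trials, sharpe_variance):
    """SR_0: expected maximum Sharpe across n_trials under the zero-edge null.

    sharpe_variance is V, in the same (per-observation) units as the Sharpe.
    """
    if n_trials < 2:
        return 0.0   # assumption: one trial means nothing to deflate
    z_1 = STD_NORMAL.inv_cdf(1.0 - 1.0 / n_trials)
    z_2 = STD_NORMAL.inv_cdf(1.0 - 1.0 / (n_trials * math.e))
    return math.sqrt(sharpe_variance) * ((1.0 - EULER_GAMMA) * z_1 + EULER_GAMMA * z_2)


def deflated_sharpe(sr_obs, sr_0, n_obs, skew, kurtosis):
    """DSR for a per-observation Sharpe sr_obs against the null benchmark sr_0."""
    denom = math.sqrt(1.0 - skew * sr_obs + (kurtosis - 1.0) / 4.0 * sr_obs ** 2)
    z = (sr_obs - sr_0) * math.sqrt(n_obs - 1) / denom
    return STD_NORMAL.cdf(z)


# Usage with illustrative inputs: daily data, 250 observations per year.
periods_per_year = 250
sr_annual = 2.5    # annualized Sharpe of the selected strategy
v_annual = 0.5     # assumed variance of annualized Sharpes across trials

sr_obs = sr_annual / math.sqrt(periods_per_year)   # de-annualize the Sharpe
v = v_annual / periods_per_year                    # de-annualize the variance
sr_0 = expected_max_sharpe(n_trials=100, sharpe_variance=v)
dsr = deflated_sharpe(sr_obs, sr_0, n_obs=1250, skew=-3.0, kurtosis=10.0)

print(f"SR_0 per observation: {sr_0:.4f}")
print(f"DSR: {dsr:.4f}")
```

Keeping SR_0 as a separate function makes the cost of the search visible on its own: it depends only on N and V, not on the strategy that happened to win.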
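Five inputs go in: SR_obs (the Sharpe, quoted annualized and rescaled to the frequency of the T observations inside the formula), N (number of trials), γ₃ (skew), γ₄ (kurtosis), T (sample size), plus the cross-trial variance V, measured or assumed. One probability comes out. A DSR of 0.95 means there is a 95% probability the observed Sharpe is not the product of selection bias on noise.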
You ran a momentum sweep on 30 Hyperliquid perps with 15-minute bars over 18 months. The grid was 25 lookback periods × 8 holding periods × 5 vol-target levels = 1,000 strategies. The winner posted an annualized Sharpe of 2.1 with skew −0.4 and excess kurtosis 3.2 (raw kurtosis 6.2, the value the formula takes as γ₄) over T ≈ 52,500 fifteen-minute observations (96 bars per day on a market that trades around the clock).
Plugging in: with N = 1,000 trials, the expected-max Sharpe under the null sits near 1.3 (depending on the assumed Sharpe variance across trials). The observed Sharpe of 2.1 deflates to a test statistic that, given the negative skew and elevated kurtosis, lands at a DSR of roughly 0.78.
What that says: there is a 78% probability the strategy has real edge, accounting for the trials run and the non-normality. That is encouraging — but it falls short of the conventional 0.95 threshold. The honest read is "this is suggestive; do not stake material capital before walk-forward and out-of-sample validation." If the same Sharpe had been produced from a single hypothesis-driven backtest (N = 1), the DSR would jump to ~0.99 and the strategy would clear the bar comfortably. Same backtest output; different DSR depending on the search process behind it.
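How far the result moves with the search process can be seen by sweeping the null benchmark SR_0 while holding the backtest output fixed. The sketch below is not a reproduction of any calculator. It assumes 15-minute bars on a market that trades 24/7 (365 × 96 bars per year, so about 52,500 observations in 18 months), converts annualized Sharpe values to per-bar units, and treats SR_0 = 0 as the single-trial case. The candidate SR_0 values are illustrative, since the true figure depends on the cross-trial variance V.

```python
import math
from statistics import NormalDist


def dsr(sr_obs, sr_0, n_obs, skew, kurtosis):
    """DSR with per-observation Sharpe inputs and raw kurtosis."""
    denom = math.sqrt(1.0 - skew * sr_obs + (kurtosis - 1.0) / 4.0 * sr_obs ** 2)
    return NormalDist().cdf((sr_obs - sr_0) * math.sqrt(n_obs - 1) / denom)


BARS_PER_YEAR = 365 * 96                      # assumption: 15-minute bars, 24/7
n_obs = int(1.5 * BARS_PER_YEAR)              # about 18 months of bars
to_per_bar = 1.0 / math.sqrt(BARS_PER_YEAR)   # annualized -> per-bar Sharpe

sr_obs = 2.1 * to_per_bar
skew = -0.4
kurtosis = 3.2 + 3.0                          # raw kurtosis = excess + 3

results = {}
for sr_0_annual in (0.0, 1.1, 1.3, 1.5):      # 0.0 is the single-trial case
    results[sr_0_annual] = dsr(sr_obs, sr_0_annual * to_per_bar, n_obs, skew, kurtosis)
    print(f"SR_0 = {sr_0_annual:.1f} annualized -> DSR = {results[sr_0_annual]:.3f}")
```

Small changes in SR_0 move the DSR by several points, so any single figure from a grid search is only as good as the estimate of V behind it.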
Bailey, Borwein, López de Prado, and Zhu published the Probability of Backtest Overfitting in the same vein, also in 2014. The two metrics overlap but target different problems.
Both are worth running on any non-trivial grid search. If you have only the champion's tearsheet, DSR is the available tool. If you have all N strategy returns, PBO is more powerful — it uses the full distribution rather than just the maximum. Most rigorous research reports both.
DSR is a statistical correction. It does not catch every backtest pathology:
Treat DSR as one item in a robustness battery — alongside walk-forward, out-of-sample holdout, parameter sensitivity, and live-parity verification. The crypto strategy robustness checklist assembles the full set.
The DSR calculator runs the full deflation in the browser. Enter the five inputs (Sharpe, N, skew, kurtosis, sample size) and read the DSR plus the implied p-value. Useful as a sanity check on any backtest tearsheet before you trust it.
Keel does not ship DSR as a built-in diagnostic today — the calculator is the educational complement to the platform's parameter-sensitivity tooling. DSR may be added as a native diagnostic in future releases on backtests where the trial count is known; no committed ship date.
Keel is a Strategy OS for AI-assisted systematic trading on Hyperliquid. Backtest, optimize, and run live strategies across single-stock perps, indices, and crypto majors — realistic fees, slippage, and funding modeled.
DSR is interpreted as the probability that the observed Sharpe is statistically distinguishable from zero, after deflating for trials, skew, kurtosis, and sample size. A DSR above 0.95 is the conventional 5% significance bar, the same logic as a p-value of 0.05. A DSR of 0.50 means the observed Sharpe sits exactly at the expected maximum under the null, so the strategy looks no better than the luckiest of N noise strategies. In practice, a Sharpe above 1.0 on monthly data that survives DSR at the 0.95 level is meaningful; a Sharpe above 2.0 that fails DSR is the typical fingerprint of an overfit grid search.
Every distinct backtest you ran on this hypothesis — every parameter combination in your grid, every variant you discarded, every related idea you tested on overlapping data. The honest answer is usually much larger than the number of strategies you remember. Bailey and López de Prado argue that N grows non-trivially because related backtests on the same dataset are correlated trials. A working approximation: count the size of the parameter grid you searched. If you tried 100 lookback × 20 threshold combinations, N = 2,000 — not 1.
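Counting is easier when every sweep is logged as it is run. The sketch below tallies trials from a hypothetical research log; the sweep names and grid sizes are made up for illustration, apart from the 100 × 20 grid mentioned above.

```python
import math

# Hypothetical research log: each sweep run against the same hypothesis,
# including the ones that were later discarded.
sweeps = {
    "lookback x threshold grid": {"lookback": 100, "threshold": 20},
    "same grid rerun with a regime filter": {"lookback": 100, "threshold": 20},
    "abandoned vol-target variant": {"lookback": 10, "vol_target": 5},
}

# Trials in one sweep = product of its grid dimensions.
trials_per_sweep = {
    name: math.prod(dims.values()) for name, dims in sweeps.items()
}

# Trials behind the champion = sum over every sweep, kept or not.
n_trials = sum(trials_per_sweep.values())

for name, count in trials_per_sweep.items():
    print(f"{count:>6,}  {name}")
print(f"{n_trials:>6,}  total trials to feed into DSR")
```

This raw count treats every configuration as its own trial. Neighbouring parameter settings are usually highly correlated, so the number of effectively independent trials can be smaller than the tally.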
Both target the same problem (selection bias from running many trials) but answer different questions. DSR adjusts a single Sharpe for the multiplicity of trials and the shape of the return distribution. PBO estimates the probability that your best in-sample strategy will underperform the median strategy out-of-sample. Use DSR when you have one champion strategy and want to test whether its Sharpe is real. Use PBO when you have a population of strategies and want to test whether picking the best is meaningful. The two are complements, not substitutes — Bailey-López de Prado recommend reporting both.
DSR penalizes short samples through the standard error of the Sharpe estimator. The numerator of the test statistic includes √(T − 1) where T is the number of return observations. A Sharpe of 2.0 over 50 daily bars produces a much lower DSR than the same Sharpe over 5,000 bars, because the estimator is much noisier on the short sample. For crypto strategies on 15-minute bars, this is actually favorable — you get thousands of observations per year. For monthly-rebalance strategies on a five-year history, the sample correction is severe.
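The sample-size effect can be isolated by holding everything else fixed. The sketch below assumes daily bars on a 24/7 market (365 per year), Gaussian returns, and a hypothetical null benchmark of 1.0 annualized that is kept constant across sample lengths. That last assumption is a simplification: in practice a shorter sample also raises the cross-trial variance V, and with it SR_0.

```python
import math
from statistics import NormalDist


def dsr(sr_obs, sr_0, n_obs, skew=0.0, kurtosis=3.0):
    """DSR with per-observation Sharpe inputs; defaults describe Gaussian returns."""
    denom = math.sqrt(1.0 - skew * sr_obs + (kurtosis - 1.0) / 4.0 * sr_obs ** 2)
    return NormalDist().cdf((sr_obs - sr_0) * math.sqrt(n_obs - 1) / denom)


PERIODS_PER_YEAR = 365                        # assumption: daily bars, 24/7 market
to_per_bar = 1.0 / math.sqrt(PERIODS_PER_YEAR)

sr_obs = 2.0 * to_per_bar                     # annualized Sharpe of 2.0
sr_0 = 1.0 * to_per_bar                       # hypothetical benchmark, held fixed

by_length = {n_obs: dsr(sr_obs, sr_0, n_obs) for n_obs in (50, 500, 5000)}
for n_obs, value in by_length.items():
    print(f"T = {n_obs:>5} bars -> DSR = {value:.3f}")
```

The same annualized Sharpe goes from unconvincing to near-certain purely because √(T − 1) grows with the sample.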
Yes, the formula is closed-form. You need five inputs: the observed Sharpe (annualized), the number of independent trials N, the skewness of returns, the kurtosis of returns, and the sample size T. The formula's γ₄ is raw kurtosis (3 for a normal distribution), so add 3 to an excess-kurtosis figure first. The formula computes the expected maximum Sharpe under the null of zero true edge across N trials (via the standard maximum-of-Gaussians approximation), then plugs the observed Sharpe into a non-central-t-style test statistic that accounts for skew/kurtosis. The DSR calculator runs the whole computation in the browser; the worked example above shows the algebra.