The Deflated Sharpe Ratio

The Deflated Sharpe Ratio (DSR), introduced by Bailey & López de Prado (2014), adjusts a backtest Sharpe for the number of trials you ran and the higher moments of the return distribution. A Sharpe above 1.0 that survives DSR at the 0.95 level is meaningful; one that fails is more likely selection bias than edge.

By Keel Research Team · Updated May 17, 2026

The Sharpe ratio is the most-cited number in quantitative finance, and the most abused. Any researcher who has run a parameter grid knows the feeling — try enough combinations and one of them prints a Sharpe of 3. The problem is not the Sharpe formula; it is what happens when you pick the maximum out of a large set of trials. The maximum of many noisy estimates is biased upward, even when the underlying signals are pure noise. Selection inflates Sharpe. The Deflated Sharpe Ratio is the correction.

Bailey and López de Prado introduced DSR in their 2014 Journal of Portfolio Management paper, The Deflated Sharpe Ratio: Correcting for Selection Bias, Backtest Overfitting, and Non-Normality. The deflation does two jobs at once. First, it accounts for how many trials produced the reported Sharpe — the more strategies you tried, the higher the expected maximum Sharpe under the null hypothesis of zero edge. Second, it corrects for non-normal return distributions — fat tails and skew make the standard Sharpe inference too generous.

Why a raw Sharpe overstates edge

Suppose you run a grid of 1,000 random strategies on the same 5-year price history. None has any genuine edge; each is just noise. The Sharpe ratios are approximately Gaussian around zero, with a standard error determined by sample size: about 1/√5 ≈ 0.45 for an annualized Sharpe estimated on five years of data. The maximum of 1,000 draws from that distribution is expected to land roughly 3.2 standard errors above zero, or Sharpe ≈ 1.5 on annualized 5-year data. That number is the lottery winner, not a signal.
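The thought experiment above can be checked with a small simulation. This is a minimal sketch, assuming independent Gaussian daily returns, 252 trading days per year, and a 1% daily volatility; the function name and all parameters are illustrative choices, not part of the original analysis.

```python
import numpy as np

def max_sharpe_of_noise(n_strategies=1000, years=5, periods_per_year=252, seed=0):
    """Annualized Sharpe of the best of n zero-edge strategies, plus the cross-trial spread."""
    rng = np.random.default_rng(seed)
    n_obs = years * periods_per_year
    # One column per strategy; the mean is zero, so no strategy has true edge.
    returns = rng.normal(loc=0.0, scale=0.01, size=(n_obs, n_strategies))
    sharpe = returns.mean(axis=0) / returns.std(axis=0, ddof=1)
    annualized = sharpe * np.sqrt(periods_per_year)
    return annualized.max(), annualized.std(ddof=1)

best, spread = max_sharpe_of_noise()
print(f"best annualized Sharpe among noise strategies: {best:.2f}")
print(f"cross-trial std of Sharpe: {spread:.2f}")  # theory: about 1/sqrt(5)
```

Changing the seed moves the winner around, but it stays far above zero even though every strategy is noise by construction.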

Report only the best strategy and you have implicitly conducted 1,000 tests while pretending you ran one. The standard single-test p-value for Sharpe = 1.5 over five years (t ≈ 3.4, p well under 0.05) is wrong by orders of magnitude. The Bonferroni-style correction is to divide your significance threshold by 1,000, but most researchers don't even count their trials, let alone deflate for them. DSR makes the correction explicit and quantitative.
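A few lines of arithmetic show how badly a single-test threshold fails across many trials. The sketch below assumes the trials are independent, which real parameter grids are not, so treat the numbers as an upper-bound illustration; both function names are hypothetical.

```python
def familywise_false_positive_rate(alpha, n_trials):
    """P(at least one of n independent zero-edge tests passes at level alpha)."""
    return 1.0 - (1.0 - alpha) ** n_trials

def bonferroni_threshold(alpha, n_trials):
    """Per-test significance level that keeps the family-wise rate at or below alpha."""
    return alpha / n_trials

n_trials = 1000
naive = familywise_false_positive_rate(0.05, n_trials)
per_test = bonferroni_threshold(0.05, n_trials)
corrected = familywise_false_positive_rate(per_test, n_trials)

print(f"chance of a false 'discovery' at 0.05 per test: {naive:.6f}")
print(f"Bonferroni per-test threshold: {per_test:.0e}")
print(f"family-wise rate at that threshold: {corrected:.4f}")
```

With 1,000 trials, a false discovery at the uncorrected level is a near certainty, and the corrected per-test threshold is far stricter than 0.05.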

The same dynamic shows up in crypto research more aggressively than in equities. A single Hyperliquid backtest sweep over 20 momentum lookbacks × 15 vol-target levels × 10 universes already has 3,000 trials before you've added a regime filter or a funding overlay. The reported Sharpe of the winner is almost guaranteed to look excellent; whether it generalizes is a separate question.

The DSR formula

The DSR test statistic, due to Bailey & López de Prado (2014), is:

DSR = Φ( (SR_obs − SR_0) × √(T − 1) /
        √( 1 − γ₃·SR_obs + ((γ₄ − 1)/4)·SR_obs² ) )

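Where Φ is the standard normal CDF, SR_obs is the observed Sharpe ratio in per-observation units (not annualized: divide an annualized Sharpe by the square root of the number of observations per year before plugging it in), T is the sample size (number of return observations), γ₃ is the skewness of returns, and γ₄ is the kurtosis of returns (raw kurtosis, 3 for a Gaussian, so excess kurtosis plus 3). The deflation appears in SR_0, the expected maximum Sharpe under the null for N trials, in the same units as SR_obs: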

SR_0 = √V × ( (1 − γ) · Z⁻¹(1 − 1/N)
              + γ · Z⁻¹(1 − 1/(N·e)) )

Where V is the variance of Sharpe estimates across the N trials (or an assumed value if not measurable), γ ≈ 0.5772 is the Euler-Mascheroni constant (unrelated to the moments γ₃ and γ₄), and Z⁻¹ is the inverse of the standard normal CDF. The whole machine reduces to: here is the Sharpe you would expect from pure luck if you had tried N strategies; deflate the observed Sharpe by that benchmark, then turn the t-statistic into a probability accounting for skew and kurtosis.
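The SR_0 expression translates directly into code. This is a minimal sketch using only the Python standard library; the function name is hypothetical, and the guard for N < 2 is an added assumption, since the formula asks for the inverse normal CDF at zero when N = 1.

```python
import math
from statistics import NormalDist

EULER_MASCHERONI = 0.5772156649015329

def expected_max_sharpe(n_trials, sharpe_variance):
    """Expected maximum Sharpe across n_trials zero-edge strategies (SR_0).

    sharpe_variance is V, the variance of the Sharpe estimates across trials,
    in the same units (per-observation or annualized) as the result.
    """
    if n_trials < 2:
        raise ValueError("the approximation needs at least two trials")
    inv_cdf = NormalDist().inv_cdf
    g = EULER_MASCHERONI
    return math.sqrt(sharpe_variance) * (
        (1 - g) * inv_cdf(1 - 1 / n_trials)
        + g * inv_cdf(1 - 1 / (n_trials * math.e))
    )

for n in (10, 100, 1000, 10000):
    print(n, round(expected_max_sharpe(n, 1.0), 2))
```

With V = 1 the output is the expected maximum in units of cross-trial standard deviations, so it can be rescaled by any measured or assumed √V.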

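Five inputs go in: SR_obs (the Sharpe, usually quoted annualized and converted to per-observation units for the formula), N (number of trials), γ₃ (skew), γ₄ (kurtosis), T (sample size). The cross-trial variance V is a sixth input when you can measure it rather than assume it. One probability comes out. A DSR of 0.95 means the observed Sharpe beats the luck benchmark SR_0 at 95% confidence, so selection bias on noise is rejected at the conventional 5% level.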
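The test statistic itself is only a few lines. The sketch below takes SR_0 as a given input and assumes both Sharpe values are in per-observation units, as in the Bailey & López de Prado paper, and that kurtosis is raw (3 for a Gaussian); the function name is hypothetical.

```python
import math
from statistics import NormalDist

def deflated_sharpe(sr_obs, sr_0, n_obs, skew, kurtosis):
    """DSR = Phi of the deflated, moment-adjusted Sharpe statistic.

    sr_obs, sr_0: per-observation Sharpe ratios (not annualized).
    kurtosis: raw kurtosis, i.e. excess kurtosis + 3.
    """
    # Variance inflation of the Sharpe estimator from skew and fat tails.
    moment_term = 1 - skew * sr_obs + (kurtosis - 1) / 4 * sr_obs ** 2
    z = (sr_obs - sr_0) * math.sqrt(n_obs - 1) / math.sqrt(moment_term)
    return NormalDist().cdf(z)

# An observed Sharpe exactly at the luck benchmark gives DSR = 0.5.
print(deflated_sharpe(0.10, 0.10, 1000, skew=0.0, kurtosis=3.0))
# Negative skew and fat tails lower the DSR for the same Sharpe.
print(deflated_sharpe(0.10, 0.05, 500, skew=0.0, kurtosis=3.0))
print(deflated_sharpe(0.10, 0.05, 500, skew=-1.0, kurtosis=6.0))
```

Keeping SR_0 as an argument separates the two jobs of the deflation: the multiplicity correction lives in SR_0, the non-normality correction in the denominator.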

A worked example

You ran a momentum sweep on 30 Hyperliquid perps with 15-minute bars over 18 months. The grid was 25 lookback periods × 8 holding periods × 5 vol-target levels = 1,000 strategies. The winner posted an annualized Sharpe of 2.1 with skew −0.4 and excess kurtosis 3.2 (γ₄ = 6.2) over T ≈ 52,500 fifteen-minute observations (96 bars a day on a market that trades around the clock).

Plugging in: with N = 1,000 trials, the expected-max Sharpe under the null sits near 1.3 annualized (depending on the assumed Sharpe variance across trials). The observed Sharpe of 2.1 exceeds that benchmark by 0.8, which over 1.5 years is about 0.8 × √1.5 ≈ 0.98 standard errors. At per-bar Sharpe levels this small, the negative skew and elevated kurtosis shave only a little more off, and the statistic lands at a DSR of roughly 0.84.

What that says: there is an 84% probability the strategy has real edge, accounting for the trials run and the non-normality. That is encouraging, but it falls short of the conventional 0.95 threshold. The honest read is "this is suggestive; do not stake material capital before walk-forward and out-of-sample validation." If the same Sharpe had been produced from a single hypothesis-driven backtest (N = 1), the DSR would jump to ~0.99 and the strategy would clear the bar comfortably. Same backtest output; different DSR depending on the search process behind it.
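The worked example can be reproduced in a short script. This is a sketch under stated assumptions: the market trades 24/7 (96 fifteen-minute bars per day, 365 days per year), T is taken as 18 months of such bars, SR_0 is taken as 1.3 annualized, and the N = 1 case is treated as a benchmark of zero. The output shifts noticeably with the assumed SR_0 and with how T is counted.

```python
import math
from statistics import NormalDist

BARS_PER_YEAR = 365 * 96            # assumption: 24/7 market, 96 bars per day
N_OBS = round(1.5 * BARS_PER_YEAR)  # 18 months of 15-minute bars
SKEW = -0.4
KURTOSIS = 3.0 + 3.2                # raw kurtosis = 3 + excess kurtosis

def dsr_from_annualized(sr_annual, sr_0_annual):
    """DSR from annualized Sharpe figures, converted to per-bar units first."""
    scale = math.sqrt(BARS_PER_YEAR)
    sr, sr_0 = sr_annual / scale, sr_0_annual / scale
    moment_term = 1 - SKEW * sr + (KURTOSIS - 1) / 4 * sr ** 2
    z = (sr - sr_0) * math.sqrt(N_OBS - 1) / math.sqrt(moment_term)
    return NormalDist().cdf(z)

dsr_grid = dsr_from_annualized(2.1, 1.3)    # winner of the 1,000-trial sweep
dsr_single = dsr_from_annualized(2.1, 0.0)  # same Sharpe, single hypothesis

print(f"T = {N_OBS} bars")
print(f"DSR after a 1,000-trial search: {dsr_grid:.2f}")
print(f"DSR for a single backtest:      {dsr_single:.2f}")
```

The only difference between the two calls is the benchmark SR_0, which is the point of the example: the search process, not the tearsheet, drives the gap.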

When DSR vs PBO

Bailey, Borwein, López de Prado, and Zhu published the Probability of Backtest Overfitting in the same vein, also in 2014. The two metrics overlap but target different problems.

DSR answers: is this Sharpe statistically real after accounting for trials and distribution shape? One strategy, one number out.
PBO answers: is the strategy I picked as the best in-sample actually better than the median strategy out-of-sample? A population of strategies, one probability out.

Both are worth running on any non-trivial grid search. If you have only the champion's tearsheet, DSR is the available tool. If you have all N strategy returns, PBO is more powerful — it uses the full distribution rather than just the maximum. Most rigorous research reports both.
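For readers who do have all N return series, the following is a simplified sketch of combinatorially symmetric cross-validation (CSCV), the procedure the PBO paper uses. The function names, the choice of 8 blocks, and the use of Sharpe as the performance metric are illustrative assumptions rather than a reference implementation.

```python
import itertools
import numpy as np

def sharpe(returns):
    """Per-period Sharpe of each column (one column per strategy)."""
    return returns.mean(axis=0) / returns.std(axis=0, ddof=1)

def pbo_cscv(returns, n_blocks=8):
    """Share of in-sample/out-of-sample splits where the in-sample winner
    ranks at or below the out-of-sample median."""
    n_obs, n_strategies = returns.shape
    blocks = np.array_split(np.arange(n_obs), n_blocks)
    logits = []
    for in_ids in itertools.combinations(range(n_blocks), n_blocks // 2):
        out_ids = [b for b in range(n_blocks) if b not in in_ids]
        in_rows = np.concatenate([blocks[b] for b in in_ids])
        out_rows = np.concatenate([blocks[b] for b in out_ids])
        best = int(np.argmax(sharpe(returns[in_rows])))
        sharpe_out = sharpe(returns[out_rows])
        # Relative out-of-sample rank of the in-sample winner, strictly in (0, 1).
        rank = int((sharpe_out <= sharpe_out[best]).sum())
        omega = rank / (n_strategies + 1)
        logits.append(np.log(omega / (1 - omega)))
    return float(np.mean(np.array(logits) <= 0))

rng = np.random.default_rng(1)
noise = rng.normal(0.0, 0.01, size=(800, 20))  # 20 zero-edge strategies
print(f"PBO on pure noise: {pbo_cscv(noise):.2f}")
```

A strategy with genuine, stable edge wins both in-sample and out-of-sample, which pushes the estimate toward zero; a grid of noise leaves it high.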

DSR's limits — what it doesn't catch

DSR is a statistical correction. It does not catch every backtest pathology:

Look-ahead bias. If your indicator quietly uses future information, the resulting Sharpe is not noise — it is fake. DSR cannot tell the difference.
Survivorship bias. A universe of "currently listed Hyperliquid perps" excludes the names that were delisted after a 90% drawdown. DSR does not see the missing data.
Cost realism. A backtest with zero slippage and zero funding will print a Sharpe that DSR happily validates as real. The Sharpe was real in the model; the model is wrong.
Regime dependence. A strategy that worked in one volatility regime and does not generalize will pass DSR if N is small enough. DSR is a search-multiplicity correction, not a regime test — that is what walk-forward is for.
Correlated trials. If your N trials are highly correlated (small perturbations on the same idea), the effective N is lower and DSR can over-penalize. Conversely, if you forget related backtests you ran on the same data, the effective N is higher and DSR under-penalizes.

Treat DSR as one item in a robustness battery — alongside walk-forward, out-of-sample holdout, parameter sensitivity, and live-parity verification. The crypto strategy robustness checklist assembles the full set.

Try the calculator

The DSR calculator runs the full deflation in the browser. Enter the five inputs (Sharpe, N, skew, kurtosis, sample size) and read the DSR plus the implied p-value. Useful as a sanity check on any backtest tearsheet before you trust it.

Keel does not ship DSR as a built-in diagnostic today — the calculator is the educational complement to the platform's parameter-sensitivity tooling. DSR may be added as a native diagnostic in future releases on backtests where the trial count is known; no committed ship date.

This article is educational. The Deflated Sharpe Ratio is a statistical correction for selection bias; it cannot rule out look-ahead bias, survivorship bias, or cost-modeling errors. References: Bailey, D. H., & López de Prado, M. (2014). The Deflated Sharpe Ratio: Correcting for Selection Bias, Backtest Overfitting, and Non-Normality. Journal of Portfolio Management, 40(5).

Automate it

Trade systematically on Keel

Keel is a Strategy OS for AI-assisted systematic trading on Hyperliquid. Backtest, optimize, and run live strategies across single-stock perps, indices, and crypto majors — realistic fees, slippage, and funding modeled.

Free to start — connect a Hyperliquid wallet when you’re ready to go live.

What you can do

Backtest any strategy with realistic fees, slippage, and funding.
Optimize parameter grids by Sharpe, drawdown, hit rate.
Deploy live to HL with stops + position limits + funding-aware execution.
Iterate with AI — describe a thesis, get a tradeable pipeline.

FAQ

Deflated Sharpe — questions

What is a 'good' DSR?

DSR is interpreted as the probability that the strategy's true Sharpe is above zero, after deflating for trials, skew, kurtosis, and sample size. A DSR above 0.95 is the conventional 5% significance bar, the same logic as a p-value of 0.05. A DSR of 0.50 means the observed Sharpe sits exactly at the expected maximum under the null: the strategy is no better than a coin flip after adjusting for the search. In practice, a Sharpe above 1.0 on monthly data that survives DSR at the 0.95 level is meaningful; a Sharpe above 2.0 that fails DSR is the typical fingerprint of an overfit grid search.

How many trials should I count toward N?

Every distinct backtest you ran on this hypothesis — every parameter combination in your grid, every variant you discarded, every related idea you tested on overlapping data. The honest answer is usually much larger than the number of strategies you remember. Bailey and López de Prado argue that N grows non-trivially because related backtests on the same dataset are correlated trials. A working approximation: count the size of the parameter grid you searched. If you tried 100 lookback × 20 threshold combinations, N = 2,000 — not 1.
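Counting is easier when it is automated alongside the sweep. The helper below is a hypothetical sketch: it multiplies out each parameter grid and adds any ad hoc runs you logged by hand; the grid contents are placeholders matching the 100 × 20 example.

```python
import math

def grid_size(grid):
    """Number of parameter combinations in one grid (dict of name -> values)."""
    return math.prod(len(values) for values in grid.values())

def count_trials(grids, ad_hoc_runs=0):
    """Total N across every grid searched plus manually logged one-off backtests."""
    return sum(grid_size(grid) for grid in grids) + ad_hoc_runs

main_grid = {
    "lookback": range(100),   # 100 hypothetical lookback values
    "threshold": range(20),   # 20 hypothetical threshold values
}
n_trials = count_trials([main_grid])
print(n_trials)  # 2000
```

Passing every grid you ran on the dataset, including the ones you abandoned, keeps N from silently shrinking to the runs you remember.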

How does DSR relate to PBO?

Both target the same problem (selection bias from running many trials) but answer different questions. DSR adjusts a single Sharpe for the multiplicity of trials and the shape of the return distribution. PBO estimates the probability that your best in-sample strategy will underperform the median strategy out-of-sample. Use DSR when you have one champion strategy and want to test whether its Sharpe is real. Use PBO when you have a population of strategies and want to test whether picking the best is meaningful. The two are complements, not substitutes — Bailey-López de Prado recommend reporting both.

How does sample size affect DSR?

DSR penalizes short samples through the standard error of the Sharpe estimator. The numerator of the test statistic includes √(T − 1) where T is the number of return observations. A Sharpe of 2.0 over 50 daily bars produces a much lower DSR than the same Sharpe over 5,000 bars, because the estimator is much noisier on the short sample. For crypto strategies on 15-minute bars, this is actually favorable — you get thousands of observations per year. For monthly-rebalance strategies on a five-year history, the sample correction is severe.
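The sample-size effect is visible in the standard error of the Sharpe estimator alone. This sketch assumes daily bars with 252 observations per year and Gaussian returns; the function name is hypothetical, and the Sharpe is converted to per-observation units before use.

```python
import math

def sharpe_standard_error(sr, n_obs, skew=0.0, kurtosis=3.0):
    """Standard error of a per-observation Sharpe estimate over n_obs returns."""
    moment_term = 1 - skew * sr + (kurtosis - 1) / 4 * sr ** 2
    return math.sqrt(moment_term / (n_obs - 1))

daily_sr = 2.0 / math.sqrt(252)  # annualized Sharpe of 2.0 in daily units
z_short = daily_sr / sharpe_standard_error(daily_sr, 50)
z_long = daily_sr / sharpe_standard_error(daily_sr, 5000)

print(f"50 bars:   Sharpe is {z_short:.2f} standard errors from zero")
print(f"5000 bars: Sharpe is {z_long:.2f} standard errors from zero")
```

The same annualized Sharpe is statistically indistinguishable from zero on the short sample and many standard errors clear of it on the long one, before any deflation for trials.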

Can I compute DSR manually?

Yes, the formula is closed-form. You need five inputs: the observed Sharpe (converted from annualized to per-observation units), the number of independent trials N, the skewness of returns, the kurtosis of returns (excess kurtosis plus 3), and the sample size T. The formula computes the expected maximum Sharpe under the null of zero true edge across N trials (via the standard maximum-of-Gaussians approximation), then plugs the observed Sharpe into a test statistic that accounts for skew and kurtosis. The DSR calculator runs the whole computation in the browser; the worked example above shows the algebra.
