Crypto strategy robustness: a checklist

The rigor toolkit — walk-forward, Monte Carlo, out-of-sample, PBO, DSR, parameter sensitivity — distilled into a pre-deploy checklist for crypto strategies. Ten items to clear before live capital. A strategy that fails three or more should not deploy.

By Keel Research Team · Updated May 17, 2026

Every quant has a horror story about the backtest that printed Sharpe 3 and then bled out live. The literature on backtest robustness is mature — Bailey, López de Prado, Pardo, Harvey, Liu — and the techniques are well-understood. What is missing in most retail crypto research is the discipline to actually run them all, in order, before staking capital. This page is that discipline as a checklist.

The list is ten items. Each captures a specific failure mode that a single backtest cannot detect. Clearing the list does not guarantee profitability — markets change, regimes shift, and no amount of historical validation forecasts the next structural break. But a strategy that clears all ten has eliminated the obvious self-deception, which is most of what kills retail systematic books.

Why robustness matters more in crypto

Three structural features of crypto make robustness checks especially load-bearing:

  • Short histories: most Hyperliquid perps have less than two years of clean data; some major names have less than six months. Standard statistical inference assumes large T; crypto routinely violates the assumption.
  • Regime shifts: the 2024-2026 sample alone covers two complete macro cycles, two major drawdowns, the meme-coin era, and material market-microstructure changes (HL itself launched its perp DEX in 2023). A strategy fitted to any single regime is unlikely to survive the next.
  • Survivorship and listing churn: Hyperliquid lists and delists assets continuously; a "currently listed" universe excludes the names that collapsed.

The compound effect: a Hyperliquid backtest is roughly an order of magnitude more vulnerable to overfitting than the same study run on an equity index with fifty years of clean data. The robustness toolkit has to do more work, not less.

The 10-item checklist

1. Single-window backtest passes a meaningful bar.

The first cut. Set the bar ex ante, before looking at any result: typically Sharpe ≥ 1.0 with a Calmar ≥ 0.5. Then run the strategy once on the full sample, with fees, slippage, and funding modeled. If it does not clear the bar, discard it; nothing else on the list matters.
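As a rough illustration of the bar described above, the sketch below computes an annualized Sharpe and a Calmar ratio from a list of per-period net returns and checks both thresholds. The function names, the zero risk-free rate, and the 365-periods-per-year default (daily returns on a market that trades every day) are assumptions made for the example, not a description of any particular engine.

```python
import math
import statistics

def sharpe_ratio(returns, periods_per_year=365):
    """Annualized Sharpe of per-period returns, risk-free rate assumed zero."""
    sd = statistics.stdev(returns)
    if sd == 0:
        return 0.0
    return statistics.mean(returns) / sd * math.sqrt(periods_per_year)

def calmar_ratio(returns, periods_per_year=365):
    """Annualized compound return divided by the maximum drawdown."""
    equity, peak, max_dd = 1.0, 1.0, 0.0
    for r in returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        max_dd = max(max_dd, 1.0 - equity / peak)
    years = len(returns) / periods_per_year
    cagr = equity ** (1.0 / years) - 1.0
    return math.inf if max_dd == 0 else cagr / max_dd

def clears_bar(returns, min_sharpe=1.0, min_calmar=0.5):
    """True only if both pre-committed thresholds are met."""
    return (sharpe_ratio(returns) >= min_sharpe
            and calmar_ratio(returns) >= min_calmar)
```

The thresholds are arguments with defaults so that they can be fixed in code before the backtest is run, which is the point of setting the bar ex ante.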

2. OOS Sharpe is at least 0.5× IS Sharpe.

Holdout. Reserve the most recent 20-25% of the sample, fit and validate on the remaining 75-80%, then run a single forward test on the holdout. If IS Sharpe was 2.0 and OOS is 0.4, the ratio is 0.2 and the strategy is overfit. The 0.5× threshold is conventional; ratios closer to 1.0 indicate stronger generalization. See the OOS explainer.
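A minimal sketch of the holdout check, assuming the strategy's per-period net returns are already available as a chronological list. The `holdout_check` name and its defaults are illustrative; the per-period Sharpe is left unannualized because the annualization factor cancels in the OOS/IS ratio.

```python
import statistics

def sharpe(returns):
    """Per-period Sharpe; annualization cancels in the OOS/IS ratio."""
    sd = statistics.stdev(returns)
    return statistics.mean(returns) / sd if sd > 0 else 0.0

def holdout_check(returns, holdout_frac=0.2, min_ratio=0.5):
    """Split chronologically and compare OOS Sharpe with IS Sharpe."""
    n_holdout = max(2, round(len(returns) * holdout_frac))
    split = len(returns) - n_holdout
    is_sharpe = sharpe(returns[:split])    # older data: fit and validate here
    oos_sharpe = sharpe(returns[split:])   # most recent data: one forward test
    passed = is_sharpe > 0 and oos_sharpe >= min_ratio * is_sharpe
    return is_sharpe, oos_sharpe, passed
```

The split is by position, not by random sampling, so the holdout is always the most recent slice of the sample.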

3. Walk-forward degradation is below 30%.

Walk-forward optimization splits the sample into rolling IS/OOS chunks (typical setting: 6-month IS, 3-month OOS on a 24-month sample, anchored or rolling). Aggregate OOS Sharpe should be at least 70% of aggregate IS Sharpe — degradation of 30% or less. See walk-forward optimization.
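The window layout can be sketched as follows, with periods counted in whatever unit the sample uses (months in the typical setting above). `walk_forward_windows` and `degradation` are hypothetical helper names; the sketch assumes the window steps forward by one OOS length each time and that the aggregate IS and OOS Sharpe values are computed elsewhere.

```python
def walk_forward_windows(n_periods, is_len, oos_len, anchored=False):
    """Yield (is_start, is_end, oos_start, oos_end) as half-open index ranges."""
    start = 0
    while start + is_len + oos_len <= n_periods:
        # Anchored: IS always starts at 0 and grows. Rolling: IS slides forward.
        is_start = 0 if anchored else start
        is_end = start + is_len
        yield is_start, is_end, is_end, is_end + oos_len
        start += oos_len

def degradation(is_sharpe, oos_sharpe):
    """Fraction of in-sample Sharpe lost out of sample (0.30 means 30%)."""
    return 1.0 - oos_sharpe / is_sharpe

windows = list(walk_forward_windows(24, is_len=6, oos_len=3))
```

With a 24-month sample, 6-month IS, and 3-month OOS, this layout produces six IS/OOS pairs, the last one testing on months 21 to 24.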

4. PBO is below 0.5.

If you ran a parameter grid, compute the Probability of Backtest Overfitting over the full N × T strategy-return matrix. PBO estimates the probability that the configuration ranked best in-sample lands below the median out-of-sample. PBO above 0.5 means your IS champion is more likely than not to underperform the OOS median: the grid was fitting noise. Below 0.3 is strong evidence of real selection signal.
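One way to estimate PBO is combinatorially symmetric cross-validation (CSCV), the procedure from the Bailey, Borwein, López de Prado & Zhu reference. The sketch below is a plain-Python approximation of that procedure, assuming one equal-length return series per grid configuration and using per-period Sharpe as the performance metric; the `pbo_cscv` name and the eight-block default are choices made for the example.

```python
import itertools
import math
import statistics

def _sharpe(xs):
    sd = statistics.stdev(xs)
    return statistics.mean(xs) / sd if sd > 0 else 0.0

def pbo_cscv(returns_by_strategy, n_blocks=8):
    """Estimate PBO from N equal-length return series via CSCV."""
    n_obs = len(returns_by_strategy[0])
    block = n_obs // n_blocks          # trailing remainder rows are dropped
    blocks = [range(b * block, (b + 1) * block) for b in range(n_blocks)]
    n_strats = len(returns_by_strategy)
    logits = []
    # Every way of choosing half the blocks as IS; the other half is OOS.
    for is_ids in itertools.combinations(range(n_blocks), n_blocks // 2):
        is_rows = [t for b in is_ids for t in blocks[b]]
        oos_rows = [t for b in range(n_blocks) if b not in is_ids
                    for t in blocks[b]]
        is_perf = [_sharpe([s[t] for t in is_rows]) for s in returns_by_strategy]
        oos_perf = [_sharpe([s[t] for t in oos_rows]) for s in returns_by_strategy]
        champion = max(range(n_strats), key=is_perf.__getitem__)
        # OOS rank of the IS champion: 1 = worst, N = best.
        rank = sum(p <= oos_perf[champion] for p in oos_perf)
        omega = rank / (n_strats + 1)
        logits.append(math.log(omega / (1.0 - omega)))
    # Share of splits where the IS champion is at or below the OOS median.
    return sum(x <= 0 for x in logits) / len(logits)
```

The number of splits grows combinatorially with `n_blocks` (70 splits for 8 blocks, 12,870 for 16), so the block count trades resolution against run time.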

5. DSR is above 0.95.

The Deflated Sharpe Ratio adjusts the champion's Sharpe for the number of trials, skew, kurtosis, and sample size. DSR ≥ 0.95 corresponds to a 5% significance bar after the multiplicity correction. Strategies with raw Sharpe of 2.0 routinely fail DSR if the grid was large; strategies with raw Sharpe of 1.2 from a small hypothesis-driven grid often pass.
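The calculation can be sketched with the standard library alone, following the formula in the Bailey & López de Prado reference. The inputs are assumed to be precomputed: `sr` is the champion's per-period (not annualized) Sharpe, `kurt` is raw kurtosis (3 for a normal distribution), and `sharpe_variance` is the variance of the per-period Sharpe estimates across all trials. The function names are illustrative, and `n_trials` must be at least 2.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329

def expected_max_sharpe(n_trials, sharpe_variance):
    """Expected best Sharpe among n_trials strategies with no real edge."""
    nd = NormalDist()
    return math.sqrt(sharpe_variance) * (
        (1 - EULER_GAMMA) * nd.inv_cdf(1 - 1 / n_trials)
        + EULER_GAMMA * nd.inv_cdf(1 - 1 / (n_trials * math.e))
    )

def deflated_sharpe(sr, n_obs, skew, kurt, n_trials, sharpe_variance):
    """Probability that the true Sharpe exceeds the multiple-testing benchmark."""
    sr0 = expected_max_sharpe(n_trials, sharpe_variance)
    denom = math.sqrt(1 - skew * sr + (kurt - 1) / 4 * sr ** 2)
    return NormalDist().cdf((sr - sr0) * math.sqrt(n_obs - 1) / denom)

# Same champion, same data, different honest trial counts.
dsr_small_grid = deflated_sharpe(0.10, 730, 0.0, 3.0, 10, 0.0025)
dsr_large_grid = deflated_sharpe(0.10, 730, 0.0, 3.0, 1000, 0.0025)
```

Holding everything else fixed, the larger trial count raises the benchmark Sharpe and lowers the DSR, which is why a large grid can sink a strategy that looks strong on raw Sharpe.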

6. Parameter sensitivity is stable.

Vary each parameter by ±1 grid step. Sharpe should degrade smoothly, not collapse. A strategy whose Sharpe drops from 2.0 to 0.3 when the lookback shifts from 20 to 21 is fitting a spike, not a plateau. Robust strategies sit on wide profitable plateaus in parameter space.
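A simple version of the ±1 grid step check, assuming the grid results are held as a mapping from parameter value to Sharpe. The `neighbor_stability` name and the 30% maximum drop are assumptions for illustration; the source does not prescribe a numeric tolerance. The champion is assumed to have a positive Sharpe and at least one neighbor on the grid.

```python
def neighbor_stability(sharpe_by_param, champion, max_drop=0.3):
    """Return (worst relative Sharpe drop among adjacent grid points, passed)."""
    grid = sorted(sharpe_by_param)
    i = grid.index(champion)
    neighbors = [grid[j] for j in (i - 1, i + 1) if 0 <= j < len(grid)]
    best = sharpe_by_param[champion]
    worst_drop = max((best - sharpe_by_param[p]) / best for p in neighbors)
    return worst_drop, worst_drop <= max_drop

spike = {19: 0.4, 20: 2.0, 21: 0.3}      # collapses one step away
plateau = {19: 1.8, 20: 2.0, 21: 1.9}    # degrades smoothly
```

On the `spike` grid the lookback of 20 loses 85% of its Sharpe at 21 and fails; on the `plateau` grid the worst neighbor loses 10% and passes.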

7. Survivorship-bias-free universe.

The backtest universe must be point-in-time: only the assets that were tradeable on day t, not the assets that are tradeable today. For Hyperliquid this means using historical listing dates rather than a current `meta` snapshot. Survivorship bias on a crypto universe routinely adds 30-50% to Sharpe by silently excluding the names that collapsed.
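A point-in-time filter can be as small as the sketch below. It assumes the historical listing and delisting dates have already been collected into a dictionary; the symbols and dates are made up, and no exchange API call is shown.

```python
from datetime import date

def universe_on(day, listings):
    """Symbols tradeable on `day`, given symbol -> (listed_on, delisted_on or None)."""
    return sorted(
        sym for sym, (listed, delisted) in listings.items()
        if listed <= day and (delisted is None or day < delisted)
    )

# Hypothetical listing history, for illustration only.
listings = {
    "AAA": (date(2023, 6, 1), None),
    "BBB": (date(2024, 1, 15), date(2024, 9, 1)),   # later delisted
    "CCC": (date(2025, 2, 1), None),
}
```

A backtest that only sees today's snapshot would never trade `BBB`; the point-in-time filter keeps it in the universe for the months it was actually listed.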

8. Funding modeled.

For any perpetual-futures strategy, funding payments are first-order. A backtest that ignores funding can show a positive Sharpe on a strategy that loses money live to carry costs. Hyperliquid settles funding hourly; the backtest engine must accrue funding P&L on every open position. Keel does this natively.
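Hourly accrual can be sketched as below. The sketch assumes the usual perp convention that longs pay shorts when the funding rate is positive, takes signed position notional (positive for long) and the hourly rate as inputs, and ignores how the notional is marked. `funding_pnl` is a hypothetical name, not a Keel or Hyperliquid function.

```python
def funding_pnl(signed_notionals, hourly_rates):
    """Total funding cash flow; longs pay when the hourly rate is positive."""
    if len(signed_notionals) != len(hourly_rates):
        raise ValueError("need one notional and one rate per hour")
    return sum(-n * r for n, r in zip(signed_notionals, hourly_rates))

# A 10,000 USD long held for 24 hours at +0.00125% per hour.
long_carry = funding_pnl([10_000.0] * 24, [0.0000125] * 24)
short_carry = funding_pnl([-10_000.0] * 24, [0.0000125] * 24)
```

In this example the long pays about 3 USD over the day and the short receives the same amount; compounded over a year of holding, that carry is large enough to flip the sign of a thin edge.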

9. Slippage modeled.

Fills at midpoint are a fiction. Use spread-based slippage at minimum, ideally with size adjustment (the slippage at 1% of average daily volume is not the slippage at 10%). On Hyperliquid, the typical spread on a top-30 perp is 1-3 bps; on tail names it can be 20+ bps. Strategies that rebalance frequently are especially sensitive: at a cost of 5 bps per unit traded, 50% turnover per day costs about 9% per year (0.50 × 0.0005 × 365 ≈ 9.1%), while 5% turnover per day costs about 0.9%.
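The cost arithmetic is easy to check in a few lines. The sketch assumes the quoted bps figure is the all-in cost per unit of notional traded and that the strategy trades 365 days a year. The size-adjusted `slippage_bps` function uses a square-root impact term, which is a common modeling assumption and not something the checklist prescribes; its `impact_bps` coefficient is a placeholder.

```python
import math

def annual_cost_drag(daily_turnover, cost_bps, days_per_year=365):
    """Fraction of equity lost to trading costs per year."""
    return daily_turnover * cost_bps / 10_000 * days_per_year

def slippage_bps(half_spread_bps, participation, impact_bps=10.0):
    """Half-spread plus a square-root impact term (assumed model).

    participation is order size as a fraction of average daily volume.
    """
    return half_spread_bps + impact_bps * math.sqrt(participation)

drag_high_turnover = annual_cost_drag(0.50, 5)   # 50% of the book per day
drag_low_turnover = annual_cost_drag(0.05, 5)    # 5% of the book per day
```

At 5 bps, 50% daily turnover works out to roughly 9.1% per year and 5% daily turnover to roughly 0.9%, so the turnover assumption matters as much as the spread assumption.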

10. Live-parity verified.

Before sizing up, run the strategy live at minimum size for a calendar quarter. Compare live equity-curve to the backtest engine's prediction over the same period. If they diverge by more than the spread-and-slippage noise band, something is wrong — execution latency, missing fee terms, signal computation drift. Live-parity catches implementation bugs that no backtest can.
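A bare-bones parity comparison, assuming live and backtest returns are aligned period by period over the same dates. The `parity_report` name and the 2 bps per period noise band are placeholders; a real band would come from the strategy's own spread and slippage assumptions.

```python
def parity_report(live_returns, backtest_returns, band_bps_per_period=2.0):
    """Return (cumulative live-minus-backtest gap, within_band)."""
    if len(live_returns) != len(backtest_returns):
        raise ValueError("series must cover the same periods")
    diffs = [l - b for l, b in zip(live_returns, backtest_returns)]
    cum_gap = sum(diffs)
    allowed = band_bps_per_period / 10_000 * len(diffs)
    return cum_gap, abs(cum_gap) <= allowed

backtest = [0.002, -0.001, 0.003] * 30           # 90 daily returns
live_close = [r - 0.0001 for r in backtest]      # 1 bp/day shortfall
live_broken = [r - 0.0010 for r in backtest]     # 10 bp/day shortfall
```

A persistent one-sided gap like the one in `live_broken` is the signature of a missing cost term or a systematic execution problem, as opposed to noise, which tends to change sign.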

How to score yourself

For each item, score pass/fail. A working framework:

  • 10/10: Deploy at full intended size.
  • 8-9/10: Deploy at half size while you investigate the failing items. Common failures: DSR or PBO marginal (just short of the bar).
  • 6-7/10: Hold. The strategy may be salvageable with more data or a tighter parameter range, but it is not ready for capital.
  • < 6/10: Discard. The signal is either non-existent or so weak that the operational cost of running the strategy exceeds the expected edge.
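The scoring bands above translate directly into a lookup, sketched here with a hypothetical `deployment_action` function that takes the number of items passed.

```python
def deployment_action(passed_items):
    """Map a pass count out of 10 to the action in the scoring framework."""
    if not 0 <= passed_items <= 10:
        raise ValueError("passed_items must be between 0 and 10")
    if passed_items == 10:
        return "deploy at full size"
    if passed_items >= 8:
        return "deploy at half size"
    if passed_items >= 6:
        return "hold"
    return "discard"
```

Writing the rule down as code before scoring makes it harder to renegotiate the bands once a favorite strategy lands at 7/10.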

Resist the temptation to re-search a failing strategy until it clears the checklist on a different parameter set. That defeats the purpose: each re-search inflates the effective N for DSR/PBO and degrades the OOS holdout into IS by leakage. If a strategy fails the first checklist run, the next thing to try is a different hypothesis, not a different parameter on the same hypothesis.

Common pitfalls

  • Re-using the OOS holdout. Once you have looked at OOS performance and adjusted the strategy, the OOS is now IS by leakage. The holdout is a one-shot tool. Burn it on the final candidate.
  • Under-counting trials. N for DSR is not the size of the published grid; it is the size of every grid you tried on related data, including the ones you discarded. Honest accounting usually doubles or triples the N you remember.
  • Confusing in-sample and validation. A "validation set" used to tune meta-parameters (regime threshold, position size cap) is still part of IS. The OOS must be untouched by any decision made after looking at it.
  • Cherry-picking the universe. Including LINK because it had a clean uptrend in your sample period and excluding DOGE because it was choppy is overfitting by selection. The universe must be defined by a rule you can commit to in advance (top-30 by 30-day volume on day t, etc.).
  • Ignoring transaction cost interactions. A strategy that turns over 200% per day can look fine at 0 bps and still lose money at realistic costs. Always re-run the backtest with cost assumptions raised and lowered by 50% as a sensitivity check.
  • Mistaking backtest engine bugs for edge. A look-ahead error in feature computation will print incredible Sharpe and pass every robustness test (the bias is constant across all splits). The defense is code review and live-parity, not statistics.

Try the overfit-check tool

The overfit-check tool bundles PBO, DSR, and parameter-stability into a single composite score from a pasted backtest tearsheet or strategy-return CSV. Use it as a fast pre-flight before running the full ten-item checklist by hand.

For the underlying Hyperliquid backtests — items 1, 7, 8, 9, 10 on the list — see HL backtesting on Keel: ~220 perps, 15-minute bars, 1-hour funding modeled, point-in-time universe, realistic slippage. The platform handles the data-and-engine half of the checklist; the rigor calculators in the lab handle the statistics-and-validation half.

This article is educational. Clearing a robustness checklist reduces but does not eliminate the risk of live underperformance — markets shift, regimes change, and no historical validation guarantees future returns. References: Bailey & López de Prado (2014) for DSR; Bailey, Borwein, López de Prado & Zhu (2014) for PBO; Pardo (1992, 2008) for walk-forward; Harvey, Liu & Zhu (2016) on multiple-testing in finance.

FAQ

Robustness — questions

What's 'enough' robustness for a crypto strategy?

There is no universal threshold, but a working bar: the strategy clears at least 8 of the 10 checklist items, OOS Sharpe is at least 0.5× IS Sharpe, walk-forward degradation is below 30%, PBO is below 0.5, and DSR clears 0.95. A strategy that fails 3 or more items should not see size; one that fails 1-2 can deploy at a fraction of intended capital while the failing items are investigated.

In what order should I run the tests?

Cheapest-first. Parameter sensitivity is free — run it during the initial backtest. Survivorship, funding, and slippage are setup work — fix them once and they are correct for all subsequent tests. Single-window OOS holdout is the next-cheapest. Walk-forward and Monte Carlo come next. PBO and DSR are last — they require all the prior strategy returns to compute. Live-parity is the final gate and only matters once everything else is clear.

Which test catches what?

Parameter sensitivity catches narrow-spike overfits. OOS holdout catches sample-period luck. Walk-forward catches regime dependence. Monte Carlo catches path-dependence in maxDD. PBO catches selection bias across a grid. DSR catches selection bias on a champion. Survivorship/funding/slippage catch modeling bias. Live-parity catches implementation bugs. Each test has a specific failure mode; no single test substitutes for the others.
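The Monte Carlo entry is the least self-explanatory of these, so here is one simple variant as a sketch: reshuffle the order of the strategy's realized returns many times and look at the distribution of maximum drawdown. Shuffling assumes returns are exchangeable and ignores autocorrelation (a block bootstrap would be the usual refinement); the function names, path count, and quantile are choices made for the example.

```python
import random

def max_drawdown(returns):
    """Largest peak-to-trough loss of the compounded equity curve."""
    equity, peak, worst = 1.0, 1.0, 0.0
    for r in returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = max(worst, 1.0 - equity / peak)
    return worst

def mc_max_drawdown(returns, n_paths=1000, quantile=0.95, seed=0):
    """Quantile of max drawdown across random reorderings of the returns."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_paths):
        path = list(returns)       # copy so the input is not mutated
        rng.shuffle(path)
        draws.append(max_drawdown(path))
    draws.sort()
    return draws[min(len(draws) - 1, int(quantile * len(draws)))]
```

If the backtest's realized max drawdown sits far below the 95th-percentile figure, the historical path was a lucky ordering of the same trades, and position sizing should be based on the simulated figure.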

Can Keel do all of these today?

Not yet. Keel ships realistic backtests on roughly 220 Hyperliquid perps with 15-minute bars, 1-hour funding, fees and slippage modeled, single-window parameter optimization, and a 199-component pipeline library. Walk-forward, Monte Carlo, PBO, and DSR are on the roadmap as native diagnostics; the lab-app calculators cover them today. The platform handles items 1, 7, 8, 9, 10 on the checklist; the rigor calculators cover items 2-6.

How often should I re-run the checklist?

Re-run the full checklist any time you change the strategy materially — new parameter, new asset, new regime gate. Re-run live-parity and slippage continuously while the strategy is live (Keel's exec layer logs the differences). Re-run walk-forward and PBO quarterly on any production strategy as fresh data accumulates. A passing checklist at deploy is a snapshot, not a permanent certification — markets change.