Monte Carlo backtests for crypto strategies

Resample your trade-level returns to build a distribution around Sharpe and max-drawdown rather than a single point estimate. Block bootstrap fits autocorrelated crypto returns; plain bootstrap does not. The output is a confidence interval — how robust the reported Sharpe is to sampling noise.

By Keel Research Team · Updated May 17, 2026

A backtest reports one number for Sharpe and one number for max drawdown. Both are point estimates of a noisy quantity. The realized return path is one draw from a distribution; resample the path and the numbers shift. Monte Carlo on a backtest quantifies how much they shift — turning the point estimate into a 5th/50th/95th percentile band so you can tell a robust edge apart from a lucky window.

Two strategies can both report a Sharpe of 2.0. One might have a 90% confidence interval of [1.6, 2.4] and the other [0.3, 3.7]. The first is a tradeable edge with noise. The second is mostly noise with a flattering point estimate. You cannot distinguish them from the backtest number alone. Monte Carlo supplies the interval that tells them apart.

Cite: Bailey, Borwein, López de Prado, & Zhu (2014), Pseudo-Mathematics and Financial Charlatanism, lays out the selection-bias problem that this and adjacent techniques exist to address.

What Monte Carlo does for a backtest

The procedure resamples the realized series of returns (per-bar or per-trade) many times to construct synthetic equity curves that are statistically consistent with the observed one. For each synthetic curve, recompute Sharpe and max drawdown. Repeat 10,000 times. The empirical distribution of those metrics is the answer.

Concretely you end up with:

  • Sharpe CI — the 5th/50th/95th percentile of Sharpe across resamples. Tight CI: the edge is robust. Wide CI: the reported number is mostly sampling noise.
  • Max-drawdown CI — the same for max DD. Especially useful because realized max DD is famously unstable; a worse drawdown is hiding in the right tail of the resampled distribution.
  • Probability of negative Sharpe — the fraction of resampled paths with Sharpe < 0. A clean way to summarize downside risk on the metric itself.
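
The three outputs above can be sketched in a few dozen lines. This is a minimal illustration rather than the resampler's actual implementation: the function names, the default of 2,000 resamples (kept small so it runs quickly), the block length of 20, and the synthetic `demo_returns` are all assumptions made for the example. Sharpe here is per-period and not annualized.

```python
import math
import random


def sharpe(returns):
    """Per-period Sharpe: mean over sample standard deviation (not annualized)."""
    n = len(returns)
    mean = sum(returns) / n
    var = sum((r - mean) ** 2 for r in returns) / (n - 1)
    return mean / math.sqrt(var) if var > 0 else 0.0


def max_drawdown(returns):
    """Largest peak-to-trough loss of the compounded equity curve, as a positive fraction."""
    equity, peak, worst = 1.0, 1.0, 0.0
    for r in returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = max(worst, 1.0 - equity / peak)
    return worst


def block_resample(returns, block_len, rng):
    """One synthetic path built from contiguous blocks drawn with replacement."""
    n = len(returns)
    path = []
    while len(path) < n:
        start = rng.randrange(n - block_len + 1)
        path.extend(returns[start:start + block_len])
    return path[:n]


def percentile(sorted_values, q):
    """Simple order-statistic percentile of an already sorted list, q in [0, 1]."""
    return sorted_values[int(q * (len(sorted_values) - 1))]


def monte_carlo_summary(returns, n_resamples=2000, block_len=20, seed=0):
    rng = random.Random(seed)
    sharpes, drawdowns = [], []
    for _ in range(n_resamples):
        path = block_resample(returns, block_len, rng)
        sharpes.append(sharpe(path))
        drawdowns.append(max_drawdown(path))
    sharpes.sort()
    drawdowns.sort()
    return {
        "sharpe_p05_p50_p95": [percentile(sharpes, q) for q in (0.05, 0.50, 0.95)],
        "max_dd_p05_p50_p95": [percentile(drawdowns, q) for q in (0.05, 0.50, 0.95)],
        "prob_negative_sharpe": sum(s < 0 for s in sharpes) / len(sharpes),
    }


# Synthetic stand-in for exported backtest returns (not real data).
demo_rng = random.Random(42)
demo_returns = [demo_rng.gauss(0.0005, 0.01) for _ in range(1000)]
summary = monte_carlo_summary(demo_returns)
print(summary)
```

Sorting each metric once and reading percentiles off the sorted list keeps the summary step cheap. For a production run you would raise `n_resamples` toward the 10,000 used in this article.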

This is not a forecast of future performance. It is a sensitivity analysis on the in-sample number, holding the underlying distribution fixed. Live Sharpe can fall below even the low end of the CI, because regime shifts change the underlying distribution itself, and resampling the historical path cannot capture that.

Plain bootstrap vs block bootstrap

Plain (i.i.d.) bootstrap resamples individual returns independently. It assumes the underlying return series has no serial correlation — that yesterday’s return tells you nothing about today’s. For crypto, that assumption is flatly wrong:

  • Momentum and mean reversion introduce short-horizon autocorrelation in the sign of returns.
  • Volatility clustering means large moves cluster in time — the magnitude of returns is strongly autocorrelated even when the sign is not.
  • Funding-regime persistence means perp funding rates persist for hours or days, so carry P&L is naturally serially correlated.

Plain bootstrap destroys all of this structure. The synthetic series come out too smooth, and when the serial dependence is positive the Sharpe CIs come out too tight, making the strategy look more robust than it actually is.
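
A quick way to see the damage is to measure the lag-1 autocorrelation of absolute returns before and after a plain bootstrap. The sketch below uses a toy series with alternating calm and turbulent regimes as a stand-in for volatility clustering; the regime length, volatilities, and function name are assumptions chosen for illustration, not properties of any real market.

```python
import random


def lag1_autocorr(xs):
    """Sample lag-1 autocorrelation of a sequence."""
    n = len(xs)
    mean = sum(xs) / n
    num = sum((xs[i] - mean) * (xs[i + 1] - mean) for i in range(n - 1))
    den = sum((x - mean) ** 2 for x in xs)
    return num / den


rng = random.Random(7)

# Synthetic returns with volatility clustering: calm and turbulent regimes
# alternate every 50 bars (an illustrative toy, not real market data).
returns = []
for i in range(4000):
    vol = 0.03 if (i // 50) % 2 else 0.005
    returns.append(rng.gauss(0.0, vol))

# Plain (i.i.d.) bootstrap: draw single returns with replacement.
plain = [rng.choice(returns) for _ in returns]

abs_original = lag1_autocorr([abs(r) for r in returns])
abs_plain = lag1_autocorr([abs(r) for r in plain])
print(f"lag-1 autocorr of |r|, original:        {abs_original:.3f}")
print(f"lag-1 autocorr of |r|, plain bootstrap: {abs_plain:.3f}")
```

The original series shows clearly positive autocorrelation in magnitude, while the resampled one sits near zero: the clustering is gone, even though both series contain the same set of values.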

Block bootstrap (Künsch 1989) fixes this by resampling contiguous chunks of length L. Pick a block at random, glue it onto the synthetic series, and repeat until the series reaches the original length. The local autocorrelation structure inside each block is preserved; dependence is broken only at the joins between blocks. CIs widen to reflect the real noise.

The trade-off: longer blocks preserve more autocorrelation but reduce the effective sample size of the bootstrap (you have fewer independent blocks). Pick L to roughly match the autocorrelation horizon of the series and you capture the structure without unduly inflating CIs.
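
The pick-and-glue procedure is short enough to write out. The sketch below is one plausible moving-block variant (overlapping blocks, final block trimmed to fit), not a reference implementation. It compares the width of the Sharpe band at block length 1 and block length 20 on a toy AR(1) series; the series parameters and the 400 resamples are assumptions picked to keep the example fast.

```python
import math
import random


def moving_block_bootstrap(returns, block_len, rng):
    """Glue randomly chosen contiguous blocks together until the path
    matches the original length, then trim the overshoot."""
    n = len(returns)
    if not 1 <= block_len <= n:
        raise ValueError("block_len must be between 1 and len(returns)")
    path = []
    while len(path) < n:
        start = rng.randrange(n - block_len + 1)  # overlapping blocks allowed
        path.extend(returns[start:start + block_len])
    return path[:n]


def sharpe(returns):
    """Per-period Sharpe: mean over sample standard deviation."""
    n = len(returns)
    mean = sum(returns) / n
    var = sum((r - mean) ** 2 for r in returns) / (n - 1)
    return mean / math.sqrt(var) if var > 0 else 0.0


def sharpe_band_width(returns, block_len, n_resamples, rng):
    """95th minus 5th percentile of per-period Sharpe across resamples."""
    values = sorted(
        sharpe(moving_block_bootstrap(returns, block_len, rng))
        for _ in range(n_resamples)
    )
    hi = values[int(0.95 * (n_resamples - 1))]
    lo = values[int(0.05 * (n_resamples - 1))]
    return hi - lo


rng = random.Random(1)

# Toy AR(1) returns with positive serial correlation (phi = 0.7).
returns, prev = [], 0.0
for _ in range(2000):
    prev = 0.7 * prev + rng.gauss(0.0002, 0.01)
    returns.append(prev)

width_plain = sharpe_band_width(returns, block_len=1, n_resamples=400, rng=rng)
width_block = sharpe_band_width(returns, block_len=20, n_resamples=400, rng=rng)
print(f"band width, block_len=1:  {width_plain:.4f}")
print(f"band width, block_len=20: {width_block:.4f}")
```

With `block_len=1` the function degenerates to a plain bootstrap, so the same code doubles as the baseline. On positively autocorrelated input the block version should report a visibly wider band.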

Choosing block length for Hyperliquid (HL) data

For Hyperliquid 15-minute bars, the relevant time scales are: tick-level noise (sub-bar), intraday momentum (a few hours), funding cycle (hourly settlements but trends persist multi-day), and overnight regime shifts. Reasonable defaults:

signal type                 block length (15-min bars)   wall-clock
sub-hour mean reversion     8–12                         2–3 hours
intraday momentum           20–40                        5–10 hours
multi-day carry / trend     96                           1 day
weekly rebalancing          672                          1 week

In practice, sweep L across {10, 20, 40, 96} and check two things: (1) the median Sharpe should be stable across block lengths (if not, your dataset is too short for any of them), and (2) the CI width should grow with L up to the autocorrelation horizon and then plateau. The plateau identifies the right block.
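
A sweep like this is easy to script. The sketch below assumes 15-minute bars and uses a synthetic AR(1) series in place of real exported returns; `sweep_block_lengths` and its defaults (300 resamples per block length, chosen for speed) are hypothetical names and values for illustration.

```python
import math
import random

BAR_MINUTES = 15  # assumes 15-minute bars, as in the table above


def sharpe(returns):
    """Per-period Sharpe: mean over sample standard deviation."""
    n = len(returns)
    mean = sum(returns) / n
    var = sum((r - mean) ** 2 for r in returns) / (n - 1)
    return mean / math.sqrt(var) if var > 0 else 0.0


def block_resample(returns, block_len, rng):
    """One synthetic path built from contiguous blocks drawn with replacement."""
    n = len(returns)
    path = []
    while len(path) < n:
        start = rng.randrange(n - block_len + 1)
        path.extend(returns[start:start + block_len])
    return path[:n]


def sweep_block_lengths(returns, block_lens=(10, 20, 40, 96), n_resamples=300, seed=0):
    """Median Sharpe and 5th-to-95th percentile width for each block length."""
    rng = random.Random(seed)
    rows = {}
    for block_len in block_lens:
        values = sorted(
            sharpe(block_resample(returns, block_len, rng))
            for _ in range(n_resamples)
        )
        p05 = values[int(0.05 * (n_resamples - 1))]
        p50 = values[int(0.50 * (n_resamples - 1))]
        p95 = values[int(0.95 * (n_resamples - 1))]
        rows[block_len] = {"median": p50, "width": p95 - p05}
    return rows


# Synthetic AR(1) stand-in for exported per-bar returns (not real data).
demo_rng = random.Random(3)
returns, prev = [], 0.0
for _ in range(3000):
    prev = 0.3 * prev + demo_rng.gauss(0.0003, 0.01)
    returns.append(prev)

rows = sweep_block_lengths(returns)
for block_len, row in rows.items():
    hours = block_len * BAR_MINUTES / 60
    print(f"L={block_len:>3} ({hours:5.1f} h)  median={row['median']:+.4f}  width={row['width']:.4f}")
```

Read the printed table the way the paragraph above describes: the median column should barely move, and the width column should rise and then flatten. With only a few hundred resamples the widths are themselves noisy, so increase `n_resamples` before reading much into small differences.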

If you want a principled choice, the Politis-Romano (1994) stationary bootstrap draws each block length from a geometric distribution, which makes the result less sensitive to any single choice of L (you still choose the mean block length). Most practical implementations default to fixed-length blocks because the simplicity outweighs the slight bias.
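
For comparison, here is a minimal sketch of the stationary bootstrap idea: at every step the path either continues to the next observation or, with probability 1 / mean block length, restarts at a random position. Wrapping around the end of the series is part of the sketch. The function names and the mean block length of 20 are assumptions made for the example.

```python
import random


def stationary_bootstrap_indices(n, mean_block_len, rng):
    """Index sequence for one stationary-bootstrap path of length n.

    Block lengths are geometric with mean mean_block_len because each step
    restarts at a random position with probability 1 / mean_block_len."""
    if mean_block_len < 1:
        raise ValueError("mean_block_len must be at least 1")
    p_restart = 1.0 / mean_block_len
    indices = [rng.randrange(n)]
    while len(indices) < n:
        if rng.random() < p_restart:
            indices.append(rng.randrange(n))       # start a new block
        else:
            indices.append((indices[-1] + 1) % n)  # continue, wrapping at the end
    return indices


def stationary_bootstrap(returns, mean_block_len, rng):
    """One resampled path drawn with the index scheme above."""
    idx = stationary_bootstrap_indices(len(returns), mean_block_len, rng)
    return [returns[i] for i in idx]


rng = random.Random(5)
returns = [rng.gauss(0.0, 0.01) for _ in range(20000)]  # synthetic placeholder data

indices = stationary_bootstrap_indices(len(returns), mean_block_len=20, rng=rng)
restarts = sum(
    1 for a, b in zip(indices, indices[1:]) if b != (a + 1) % len(returns)
)
avg_block_len = len(indices) / (restarts + 1)
print(f"average block length: {avg_block_len:.1f}")
```

Separating index generation from value lookup makes it easy to check the realized block lengths, and lets the same indices be reused across several aligned series if needed.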

Interpreting confidence intervals on Sharpe

Once you have a Sharpe distribution, three numbers matter:

  1. Lower bound (5th percentile). The pessimistic case under the same return distribution. If the 5th percentile Sharpe is still positive, the strategy’s edge survives sampling noise. If it dips below zero, a substantial fraction of paths consistent with your data produce a losing strategy.
  2. Width (95th − 5th). The uncertainty band. A backtest with Sharpe 2.0 ± 0.4 is much more credible than 2.0 ± 1.7. Wide bands usually mean too few trades: the standard error of a Sharpe estimate shrinks only as O(1/√T) in the number of observations T.
  3. Skew of the distribution. Bootstrapped Sharpe is often right-skewed when the underlying return distribution has fat tails. The median Sharpe can be meaningfully below the mean — report the median.
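
Given a list of bootstrapped Sharpe values, the three numbers above take only a few lines to compute. The sketch below is illustrative: `interpret_sharpe_distribution` is a hypothetical helper, and the input is a right-skewed lognormal sample standing in for real bootstrap output so the example runs on its own.

```python
import math
import random


def percentile(sorted_values, q):
    """Simple order-statistic percentile of an already sorted list, q in [0, 1]."""
    return sorted_values[int(q * (len(sorted_values) - 1))]


def interpret_sharpe_distribution(sharpes):
    """Lower bound, band width, and skew diagnostics for bootstrapped Sharpes."""
    values = sorted(sharpes)
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    skewness = sum(((v - mean) / sd) ** 3 for v in values) / n if sd > 0 else 0.0
    p05, p50, p95 = (percentile(values, q) for q in (0.05, 0.50, 0.95))
    return {
        "lower_bound": p05,
        "median": p50,
        "width": p95 - p05,
        "mean_minus_median": mean - p50,  # positive when right-skewed
        "skewness": skewness,
        "edge_survives_noise": p05 > 0,
    }


# Stand-in for the Sharpe values a bootstrap would produce: a right-skewed
# lognormal sample, used here only so the example is self-contained.
rng = random.Random(11)
fake_sharpes = [rng.lognormvariate(0.0, 0.5) for _ in range(5000)]
report = interpret_sharpe_distribution(fake_sharpes)
for key, value in report.items():
    print(f"{key}: {value}")
```

A positive `mean_minus_median` is the practical signal that the mean is being pulled up by the right tail, which is the case for reporting the median.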

A heuristic from the Bailey & López de Prado (2014) DSR paper: if the lower bound is positive and the median exceeds the haircut for the number of strategy candidates you searched, the edge is plausibly real. For a single strategy you built from priors, the haircut is small. For one of 1,000 grid-searched parameter sets, the haircut is large enough that a median Sharpe of 1.5 may still collapse to zero on a deflated basis.
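
To get a feel for the size of that haircut, the sketch below evaluates the closed-form approximation for the expected maximum Sharpe among N skill-less trials that the Deflated Sharpe Ratio paper uses as its benchmark. It covers only that threshold, not the full DSR, which also adjusts for sample length, skewness, and kurtosis. The cross-trial Sharpe standard deviation of 0.5 is a made-up input for illustration; in practice it comes from your own grid of candidates, in the same units as the Sharpe you compare against.

```python
import math
from statistics import NormalDist

EULER_MASCHERONI = 0.5772156649015329


def expected_max_sharpe(n_trials, sharpe_std):
    """Approximate expected best Sharpe among n_trials strategies with no
    true edge, whose estimated Sharpes have standard deviation sharpe_std."""
    if n_trials < 2:
        return 0.0  # a single trial involves no selection, so no haircut
    z = NormalDist().inv_cdf
    g = EULER_MASCHERONI
    return sharpe_std * (
        (1 - g) * z(1 - 1 / n_trials) + g * z(1 - 1 / (n_trials * math.e))
    )


# Hypothetical dispersion of Sharpe across candidates (assumed, not measured).
ASSUMED_SHARPE_STD = 0.5

for n_trials in (1, 10, 100, 1000):
    haircut = expected_max_sharpe(n_trials, ASSUMED_SHARPE_STD)
    print(f"candidates={n_trials:>4}  haircut={haircut:.2f}")
```

Under these assumed inputs the threshold for 1,000 candidates lands above 1.5, which is the situation the paragraph above describes: a median Sharpe of 1.5 would not clear it.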

Common mistakes

  • Plain bootstrap on autocorrelated returns. CIs come out too tight; the strategy looks more robust than it is. Always use block bootstrap for crypto.
  • Block length 1. Functionally identical to plain bootstrap, same problem. If you see a tutorial that resamples individual bars of a 15-minute crypto series, ignore it.
  • Bootstrapping after parameter selection. Monte Carlo on the in-sample winner of a 1,000-strategy grid search measures nothing useful — the selection bias dominates the sampling noise. Run PBO or DSR first to deflate; then bootstrap the survivor.
  • Reporting only the median. The whole point is the spread. Report 5th/50th/95th together or you are throwing away the information you ran the bootstrap to get.
  • Confusing CI with forecast. A CI tells you uncertainty around the in-sample number, assuming the return distribution is stationary. Live performance will usually be worse because the distribution drifts. The CI is a lower bound on uncertainty, not a forecast.

Try it on your own returns

The Monte Carlo Backtest Resampler runs 10,000 block-bootstrap resamples in your browser. Paste a CSV of per-bar returns or per-trade P&L, pick a block length, and get Sharpe and max-DD distributions with 5th/50th/95th percentile bands. Nothing leaves your machine.

Keel itself reports point-estimate metrics from the backtest engine today; native bootstrap CIs on those metrics are on the roadmap but not shipped. The widget is the bridge — export your backtest returns, drop them into the resampler, get the CIs.

Further reading

  • Bailey, D. H., Borwein, J. M., López de Prado, M., & Zhu, Q. J. (2014). Pseudo-Mathematics and Financial Charlatanism: The Effects of Backtest Overfitting on Out-of-Sample Performance. SSRN 2308659.
  • Bailey, D. H., & López de Prado, M. (2014). The Deflated Sharpe Ratio: Correcting for Selection Bias, Backtest Overfitting, and Non-Normality. SSRN 2460551.
  • Künsch, H. R. (1989). The jackknife and the bootstrap for general stationary observations. Annals of Statistics 17(3): 1217–1241.
  • Politis, D. N., & Romano, J. P. (1994). The Stationary Bootstrap. Journal of the American Statistical Association 89(428): 1303–1313.

This article is educational. Monte Carlo confidence intervals quantify sampling noise around in-sample metrics; they do not forecast live performance. Strategies that survive Monte Carlo can still degrade live if the return distribution shifts outside the historical sample. Keel does not ship native Monte Carlo on backtest output today; the linked widget is a standalone bootstrap tool.

FAQ

Monte Carlo backtests — questions

Why put a confidence interval on Sharpe at all?

A single Sharpe number is one realization of a noisy estimator. Two strategies with the same true Sharpe will produce different sample Sharpe estimates over the same window. A Monte Carlo resample turns that point estimate into a distribution, and the 5th/95th percentile spread tells you how much of the reported number is signal versus the luck of which trades fell inside the sample. Without the CI, you cannot tell a Sharpe 2.0 strategy with tight CI [1.6, 2.4] apart from a Sharpe 2.0 strategy with CI [0.3, 3.7] — and the second is much more likely to disappoint live.

How do I pick a block length?

Block length should roughly match the autocorrelation horizon of the return series. For Hyperliquid 15-minute bars: a block length of 20–40 bars (5–10 hours) is a reasonable default for most signals; 96 bars (one day) suits carry strategies whose edge persists across funding settlements; 8–12 bars suits sub-hour mean reversion. When in doubt, sweep across 10/20/40/96 and check that the median is stable. The CI width should grow with block length up to the underlying autocorrelation horizon and then plateau; that plateau identifies the right block.

How many resamples do I need?

For 5th/95th percentile estimates, 10,000 resamples is enough for stable bounds run-to-run. Push to 20,000–50,000 if you want 99% intervals or are estimating extreme-tail statistics. Below 5,000 the percentile estimates wobble noticeably between runs; above 100,000 you rarely see anything change. The Monte Carlo error of the bootstrap shrinks as O(1/√N), so doubling the resample count only cuts the run-to-run wobble in the percentile estimates by about 30%. It does not narrow the CI itself, whose width is set by the data.
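
The 30% figure follows directly from the square-root scaling. A small check, with a hypothetical helper name:

```python
import math


def monte_carlo_error_ratio(n_before, n_after):
    """Ratio of Monte Carlo error after vs before changing the resample count,
    assuming the error scales as 1 / sqrt(N)."""
    return math.sqrt(n_before / n_after)


reduction_2x = 1 - monte_carlo_error_ratio(10_000, 20_000)
reduction_4x = 1 - monte_carlo_error_ratio(10_000, 40_000)
print(f"doubling resamples cuts Monte Carlo error by {reduction_2x:.1%}")
print(f"quadrupling resamples cuts it by {reduction_4x:.1%}")
```

Halving the wobble takes four times the resamples, which is why going far past 10,000 rarely pays off for 5th/95th percentile bands.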

How is a CI different from the point-estimate Sharpe?

The point estimate is what your backtest reports — a single number computed from the realized return path. The CI is the range that point estimate could plausibly take if the underlying return distribution stayed the same but the timing of trades varied. A wide CI means the point estimate is fragile to sampling noise. A tight CI means the strategy would produce similar Sharpe across many counterfactual realizations of the same edge.

How does Monte Carlo relate to PBO?

They diagnose different failure modes. Monte Carlo bootstraps the realized return path of a single strategy to estimate noise around its Sharpe. PBO (Probability of Backtest Overfitting; Bailey, Borwein, López de Prado, & Zhu 2014) takes many candidate strategies and asks how often the in-sample winner underperforms out-of-sample, which makes it a selection-bias diagnostic across the strategy space. Use Monte Carlo to estimate uncertainty on a single strategy you've chosen; use PBO to estimate the chance you chose the wrong strategy. Both belong in a rigorous workflow.