What Monte Carlo on a backtest actually means
A backtest is one realization. You ran the strategy once across one historical sample and got one equity curve, one Sharpe, one max drawdown number. That single result is a point estimate. The data-generating process — your alpha, the market regime, the funding cycle on HL — could easily have produced a noticeably different equity curve from the same underlying edge. The question Monte Carlo answers is: across many plausible realizations of the same return process, what is the distribution of outcomes?
The mechanic is straightforward. Take the per-bar (or per-trade) returns from your backtest. Resample them with replacement thousands of times to build synthetic equity curves. Compute Sharpe and max drawdown on each synthetic run. The 5th and 95th percentiles across those runs bound a 90% confidence interval, and the 50th percentile is the median outcome. You now know not just “Sharpe 2.1” but “Sharpe 90% CI [1.4, 2.7]”, a meaningfully different planning number.
Drawdown is where this matters most. Max DD is path-dependent, and the single observed max DD in a backtest is, almost by definition, an extreme order statistic of one path. The bootstrap distribution of max DD usually has an upper tail well beyond that point estimate, and its CI is the more honest planning input. Planning capital on the point-estimate max DD is how strategies get force-deleveraged in production.
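The mechanic described above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not Keel's implementation: the function names, the 35,040 bars-per-year annualization (15-minute bars, 365 days), and the compounding convention in `max_drawdown` are all assumptions. It is also the plain i.i.d. variant, which the next section argues is the wrong choice for autocorrelated returns.

```python
import math
import random
import statistics

def sharpe(returns, bars_per_year=35_040):
    """Annualized Sharpe from per-bar returns (assumes 15-minute bars, 365 days)."""
    sd = statistics.stdev(returns)
    if sd == 0:
        return 0.0
    return statistics.fmean(returns) / sd * math.sqrt(bars_per_year)

def max_drawdown(returns):
    """Largest peak-to-trough loss of the compounded equity curve, as a fraction."""
    equity, peak, worst = 1.0, 1.0, 0.0
    for r in returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = max(worst, 1.0 - equity / peak)
    return worst

def percentile(values, q):
    """Linearly interpolated percentile, q in [0, 100]."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100.0
    lo = math.floor(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)

def iid_bootstrap(returns, n_resamples=10_000, seed=0):
    """Resample single returns with replacement; return 5/50/95 percentiles."""
    rng = random.Random(seed)
    n = len(returns)
    sharpes, drawdowns = [], []
    for _ in range(n_resamples):
        sample = rng.choices(returns, k=n)  # draws with replacement
        sharpes.append(sharpe(sample))
        drawdowns.append(max_drawdown(sample))
    qs = (5, 50, 95)
    return ({q: percentile(sharpes, q) for q in qs},
            {q: percentile(drawdowns, q) for q in qs})
```

Calling `iid_bootstrap(returns)` returns two dictionaries keyed by percentile, one for Sharpe and one for max DD. Pure Python is slow at 10,000 resamples on long series; a vectorized array library would be the practical choice for real workloads.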
Bootstrap vs block-bootstrap
Plain (i.i.d.) bootstrap shuffles individual returns independently. That assumes returns are independent and identically distributed — no autocorrelation, no volatility clustering, no regime persistence. For real return series, and especially for high-frequency crypto perp returns, that assumption is false.
HL 15-minute returns have non-trivial autocorrelation: momentum and mean reversion both show up at various horizons, volatility clusters strongly (high-vol bars beget high-vol bars), and funding regimes persist across multi-day windows. Plain bootstrap breaks all of that, producing synthetic samples with implausibly low volatility-of-volatility and unrealistic drawdown profiles. You get CIs that are too tight on both Sharpe and max DD, which is the worst possible direction to be wrong because it understates risk.
The fix is block bootstrap. Instead of resampling one return at a time, resample contiguous blocks of length L. The block preserves the local autocorrelation and volatility-cluster structure; stitching many blocks together still randomizes the global sample. Two common variants — moving block (overlapping windows) and circular block (wraps around the end-of-sample) — both work for typical use; the widget uses moving block.
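A moving-block resample can be sketched as follows. This is an illustrative version under stated assumptions, not the widget's source: `moving_block_resample` is a hypothetical name, block starts are drawn uniformly, and the final block is trimmed so the synthetic series matches the original length.

```python
import random

def moving_block_resample(returns, block_len, rng):
    """One synthetic series of len(returns) built from random contiguous blocks."""
    n = len(returns)
    if not 1 <= block_len <= n:
        raise ValueError("block_len must be between 1 and len(returns)")
    last_start = n - block_len  # moving block: a block never runs past the end
    out = []
    while len(out) < n:
        start = rng.randint(0, last_start)  # randint is inclusive on both ends
        out.extend(returns[start:start + block_len])
    return out[:n]  # trim the last block back to the original length

rng = random.Random(42)
series = [i / 1000 for i in range(100)]
sample = moving_block_resample(series, block_len=20, rng=rng)
```

Within each block the original ordering survives, so local autocorrelation and volatility clustering carry into the synthetic series. A circular variant would instead draw the start from the whole sample and index modulo `n`.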
Choosing block length for HL 15-minute bars
Block length L is the single most consequential parameter and the one practitioners get wrong most often. Heuristic:
- L too small (e.g. 1-4 bars) collapses toward plain bootstrap — autocorrelation lost, CIs too tight.
- L too large (more than ~10% of the sample) under-randomizes: your “resamples” are mostly the original sample with minor reshuffles, CI widths collapse toward zero, and the number of effectively independent blocks is too small.
- L roughly matched to the autocorrelation horizon of your strategy is the right zone. For HL 15-minute bars and most signal classes that means 20-40 bars (5-10 hours). For multi-day carry signals push to 96 (one day) or 384 (4 days). For sub-hour mean reversion, 8-12 may be enough.
The honest answer is to run a sensitivity sweep over L = 10, 20, 40, 96. The point estimate (the median across resamples) should be stable; the CI width should grow with L up to about the autocorrelation horizon, then plateau. If the CI is still widening at the top of the sweep (L = 96 bars), your returns have longer-horizon dependence than you thought.
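The sweep can be sketched like this. It is a rough illustration under assumptions: the helper names are hypothetical, Sharpe is left per-bar (unannualized) because only the relative CI width across block lengths matters here, and percentiles use a simple nearest-rank index.

```python
import random
import statistics

def block_resample(returns, block_len, rng):
    """Moving-block resample; block_len must not exceed len(returns)."""
    n = len(returns)
    out = []
    while len(out) < n:
        start = rng.randint(0, n - block_len)
        out.extend(returns[start:start + block_len])
    return out[:n]

def per_bar_sharpe(returns):
    sd = statistics.stdev(returns)
    return statistics.fmean(returns) / sd if sd > 0 else 0.0

def sharpe_ci_by_block_len(returns, block_lens=(10, 20, 40, 96),
                           n_resamples=2_000, seed=0):
    """Median and 90% CI width of per-bar Sharpe for each block length."""
    rng = random.Random(seed)
    rows = {}
    for block_len in block_lens:
        stats = sorted(
            per_bar_sharpe(block_resample(returns, block_len, rng))
            for _ in range(n_resamples)
        )
        q05, q50, q95 = (stats[int(q * (n_resamples - 1))]
                         for q in (0.05, 0.50, 0.95))
        rows[block_len] = {"median": q50, "width": q95 - q05}
    return rows
```

Reading the result is a matter of comparing `rows[L]["width"]` across block lengths: a roughly flat median with a width that levels off suggests the block length has reached the dependence horizon.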
Interpreting CIs on Sharpe and max drawdown
Two strategies can have identical point-estimate Sharpe and very different CI widths. Consider:
- Strategy A: Sharpe 2.1, 90% CI [1.7, 2.5]. The lower quantile is still well above 1. The edge is robust to path reshuffling. You can plan capital against the lower bound.
- Strategy B: Sharpe 2.1, 90% CI [0.4, 3.8]. The lower bound sits below 0.5. There is a meaningful probability that the true Sharpe is closer to 0.5 and this backtest got lucky on the sequence of bars. Plan capital against the lower bound and size much smaller, or treat the strategy as not yet validated and demand more data.
Same rule for max DD. If the point-estimate max DD is 12% but the 95th-percentile bootstrap max DD is 28%, your live capital-planning number is 28%, not 12%. The single observed max DD in your backtest sample is one draw from a distribution whose right tail you should respect.
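To make the 12% versus 28% example concrete, here is a small sizing sketch. It rests on loud assumptions: the 15% drawdown budget is a made-up number, `size_multiplier` is a hypothetical helper, and drawdown is treated as scaling roughly linearly with position size, which is only a first-order approximation.

```python
def size_multiplier(drawdown_budget, planning_drawdown):
    """Scale factor on backtest sizing so planning_drawdown shrinks to about
    drawdown_budget. Assumes drawdown scales roughly linearly with size and
    never scales above the backtest's own sizing."""
    if planning_drawdown <= 0:
        raise ValueError("planning_drawdown must be positive")
    return min(1.0, drawdown_budget / planning_drawdown)

budget = 0.15  # hypothetical tolerance: 15% peak-to-trough
print(round(size_multiplier(budget, 0.12), 2))  # point-estimate max DD -> 1.0
print(round(size_multiplier(budget, 0.28), 2))  # 95th-percentile max DD -> 0.54
```

Planning on the point estimate says the backtest sizing fits the budget; planning on the 95th-percentile drawdown says to run at roughly half that size.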
Common mistakes
- Plain bootstrap on autocorrelated returns. The single most common error. CIs come back too tight and you conclude the strategy is more robust than it is. Always use block bootstrap for return series.
- Resampling fewer than 5,000 times. Quantile estimates of the 5th and 95th percentile need a meaningful count of samples in the tail. Below 5K the CI itself becomes noisy and varies run-to-run; default to 10K and only push higher when tail-of-tail statistics matter.
- Confusing block length with strategy horizon. Block length matches the autocorrelation horizon of the return series, not the holding period of the trade. A 1-day holding-period strategy on 15-minute bars can still have multi-day autocorrelation in its return stream if it sizes with persistence.
- Bootstrapping trade P&L of an overfit strategy. Monte Carlo gives you CIs on the sample distribution. It does not tell you the sample is representative. An overfit backtest will bootstrap to a tight CI around an inflated Sharpe, and that Sharpe is still fictitious. Use MC for sequence risk, not for overfit detection. For overfit detection, you want walk-forward testing, probability of backtest overfitting (PBO), and the deflated Sharpe ratio.
- Ignoring funding when bootstrapping HL strategies. On HL the per-bar return decomposes into price and funding components. If your bootstrap input is price-only returns, your CIs miss funding risk. Bootstrap the combined (price + funding) return stream for HL carry strategies.
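For the funding point in the last bullet, one way to build the bootstrap input is to sum the two components bar by bar before any resampling, so each block carries price and funding together. The sketch below assumes both series are already aligned to the same bars, that funding is expressed as a signed per-bar return (negative when paid, zero on bars with no settlement), and that simple addition is an acceptable approximation at 15-minute scale; `combined_returns` is a hypothetical helper.

```python
def combined_returns(price_returns, funding_returns):
    """Per-bar total return: price component plus signed funding component."""
    if len(price_returns) != len(funding_returns):
        raise ValueError("series must be aligned to the same bars")
    return [p + f for p, f in zip(price_returns, funding_returns)]

# Hypothetical four bars, with funding settling on the last one only.
price = [0.0010, -0.0004, 0.0007, -0.0012]
funding = [0.0, 0.0, 0.0, 0.0003]
total = combined_returns(price, funding)
```

Resampling `price` and `funding` separately would break the link between funding regimes and price moves, so the combined series is what goes into the block bootstrap.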
What Keel ships today
Keel's backtest engine ships point-estimate metrics: Sharpe, max DD, total return, win rate, decomposed funding P&L, per-asset attribution. Bootstrap confidence intervals on those metrics are not shipped today. Native Monte Carlo on backtest output is on the roadmap; the timing is not committed.
In the meantime, the widget below bridges the gap. Drop in trade P&L from any Keel backtest results.json or any returns series, pick a block length, run 10K resamples, and read off the 5/50/95 percentile CIs for Sharpe and max DD in-browser. No upload — the computation runs locally.
Try it
Block-bootstrap any returns series in-browser. Get the 5/50/95 percentile CIs on Sharpe and max DD with a histogram of the resampled distribution.