Monte Carlo simulation on Hyperliquid backtests

Monte Carlo on a backtest means block-bootstrapping returns thousands of times to put confidence intervals on Sharpe and max drawdown. HL 15-minute returns are autocorrelated, so block-bootstrap is the right method. The widget below is live; native Monte Carlo inside Keel is on the roadmap.

By Keel Research Team · Updated May 18, 2026

What Monte Carlo on a backtest actually means

A backtest is one realization. You ran the strategy once across one historical sample and got one equity curve, one Sharpe, one max drawdown number. That single result is a point estimate. The data-generating process — your alpha, the market regime, the funding cycle on HL — could easily have produced a noticeably different equity curve from the same underlying edge. The question Monte Carlo answers is: across many plausible realizations of the same return process, what is the distribution of outcomes?

The mechanic is straightforward. Take the per-bar (or per-trade) returns from your backtest. Resample them with replacement, many thousands of times, to build synthetic equity curves. Compute Sharpe and max drawdown on each synthetic run. The 5th and 95th percentiles across those runs bound a 90% confidence interval, and the 50th percentile is the median outcome. You now know not just “Sharpe 2.1” but “Sharpe 90% CI [1.4, 2.7]”, a meaningfully different planning number.
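As a rough illustration of that mechanic, here is a minimal sketch using only the Python standard library. It uses plain resampling with replacement purely to show the loop; the next section explains why contiguous blocks are the better choice for autocorrelated returns. The function names, the zero risk-free rate, and the annualization factor (96 bars per day, 365 days) are assumptions made for this example, not Keel's implementation.

```python
import math
import random
import statistics

BARS_PER_YEAR = 96 * 365  # assumption: 15-minute bars on a 24/7 market


def sharpe(returns, bars_per_year=BARS_PER_YEAR):
    """Annualized Sharpe from per-bar returns (risk-free rate assumed 0)."""
    sd = statistics.stdev(returns)
    if sd == 0:
        return 0.0
    return statistics.fmean(returns) / sd * math.sqrt(bars_per_year)


def max_drawdown(returns):
    """Largest peak-to-trough decline of the compounded equity curve (positive fraction)."""
    equity, peak, worst = 1.0, 1.0, 0.0
    for r in returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = max(worst, 1.0 - equity / peak)
    return worst


def percentile(values, q):
    """Linearly interpolated percentile, q in [0, 100]."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def bootstrap_ci(returns, n_resamples=10_000, seed=0):
    """5/50/95 percentiles of Sharpe and max DD across resampled return paths."""
    rng = random.Random(seed)
    n = len(returns)
    sharpes, drawdowns = [], []
    for _ in range(n_resamples):
        sample = rng.choices(returns, k=n)  # draw n returns with replacement
        sharpes.append(sharpe(sample))
        drawdowns.append(max_drawdown(sample))
    return {
        "sharpe": [percentile(sharpes, q) for q in (5, 50, 95)],
        "max_dd": [percentile(drawdowns, q) for q in (5, 50, 95)],
    }
```

Calling bootstrap_ci(returns) on a list of per-bar returns gives two [p5, p50, p95] triples. The first and last entries of each triple are the bounds of the 90% interval described above.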

Drawdown is where this matters most. Max DD is path-dependent and the single observed max DD in a backtest is, almost by definition, an extreme order statistic. The bootstrap distribution of max DD usually extends well beyond that single observed value, and its upper quantiles are a more honest planning number than the point estimate. Planning capital on the point-estimate max DD is how strategies get force-deleveraged in production.

Bootstrap vs block-bootstrap

Plain (i.i.d.) bootstrap resamples individual returns independently, with replacement. That assumes returns are independent and identically distributed: no autocorrelation, no volatility clustering, no regime persistence. For real return series, and especially for high-frequency crypto perp returns, that assumption is false.

HL 15-minute returns have non-trivial serial dependence: momentum and mean reversion both show up at various horizons; volatility clusters strongly (high-vol bars beget high-vol bars); funding regimes persist across multi-day windows. Plain bootstrap destroys all of that, producing synthetic samples with implausibly low volatility-of-volatility and unrealistic drawdown profiles. The resulting CIs are too tight on both Sharpe and max DD, which is the worst possible direction to be wrong because it overstates how robust the strategy is.
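One way to check how much of this dependence is present in your own series is to look at the sample autocorrelation of the returns (momentum or mean reversion) and of the squared returns (volatility clustering). The sketch below is a plain standard-library version; the function names and the default lag set are illustrative choices, not part of any Keel API.

```python
import statistics


def autocorr(series, lag):
    """Sample autocorrelation of a series at the given lag (lag >= 1)."""
    n = len(series)
    mean = statistics.fmean(series)
    var = sum((x - mean) ** 2 for x in series)
    if var == 0 or lag >= n:
        return 0.0
    cov = sum((series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n))
    return cov / var


def dependence_profile(returns, lags=(1, 4, 16, 96)):
    """Map each lag to (autocorr of returns, autocorr of squared returns)."""
    squared = [r * r for r in returns]
    return {lag: (autocorr(returns, lag), autocorr(squared, lag)) for lag in lags}
```

If the second number in each pair stays clearly above zero out to large lags, volatility is clustering over that horizon, and a resampling scheme that draws bars one at a time will not reproduce it.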

The fix is block bootstrap. Instead of resampling one return at a time, resample contiguous blocks of length L. The block preserves the local autocorrelation and volatility-cluster structure; stitching many blocks together still randomizes the global sample. Two common variants — moving block (overlapping windows) and circular block (wraps around the end-of-sample) — both work for typical use; the widget uses moving block.
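The two variants can be sketched in a few lines each. This is a minimal illustration under the description above, not the widget's source: function names are hypothetical, and the final block is simply trimmed so the synthetic series has the same length as the original.

```python
import random


def moving_block_resample(returns, block_len, rng):
    """One synthetic series built from overlapping contiguous blocks."""
    n = len(returns)
    if not 1 <= block_len <= n:
        raise ValueError("block_len must be between 1 and len(returns)")
    out = []
    while len(out) < n:
        # Any start index whose full block fits inside the sample.
        start = rng.randrange(n - block_len + 1)
        out.extend(returns[start:start + block_len])
    return out[:n]  # trim the last block so lengths match


def circular_block_resample(returns, block_len, rng):
    """Same idea, but blocks may wrap around the end of the sample."""
    n = len(returns)
    if not 1 <= block_len <= n:
        raise ValueError("block_len must be between 1 and len(returns)")
    out = []
    while len(out) < n:
        start = rng.randrange(n)  # every bar can start a block
        out.extend(returns[(start + i) % n] for i in range(block_len))
    return out[:n]
```

A call such as moving_block_resample(returns, 20, random.Random(0)) returns one resampled path; repeating it thousands of times and computing Sharpe and max DD on each path gives the distributions discussed above. In the moving-block version, bars near the edges of the sample appear in fewer candidate blocks than bars in the middle; the circular version removes that edge effect at the cost of joining the end of the sample to its start.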

Choosing block length for HL 15-minute bars

Block length L is the single most consequential parameter and the one practitioners get wrong most often. Heuristic:

  • L too small (e.g. 1-4 bars) collapses toward plain bootstrap — autocorrelation lost, CIs too tight.
  • L too large (more than ~10% of the sample) under-randomizes: your “resamples” are mostly the original sample with minor reshuffles, the number of independent blocks is too small, and the CI width collapses toward zero.
  • L roughly matched to the autocorrelation horizon of your strategy is the right zone. For HL 15-minute bars and most signal classes that means 20-40 bars (5-10 hours). For multi-day carry signals push to 96 (one day) or 384 (4 days). For sub-hour mean reversion, 8-12 may be enough.

The honest answer is to run a sensitivity sweep: 10, 20, 40, 96. The point estimate (the median across resamples) should be stable; the CI width should grow with L up to about the autocorrelation horizon, then plateau. If the CI is still widening at L = 96, the top of the sweep, your returns have longer-horizon dependence than you thought, and the sweep should be extended (for example to 384).
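A sweep like this could look roughly as follows. It is a self-contained sketch with assumed helper names, a moving-block resampler, a zero risk-free rate, and annualization for 15-minute bars on a 24/7 market; it reports the median and the 5th-to-95th percentile width of Sharpe for each block length.

```python
import math
import random
import statistics


def sharpe(returns, bars_per_year=96 * 365):
    """Annualized Sharpe; assumes 15-minute bars, 24/7 trading, zero risk-free rate."""
    sd = statistics.stdev(returns)
    return 0.0 if sd == 0 else statistics.fmean(returns) / sd * math.sqrt(bars_per_year)


def block_resample(returns, block_len, rng):
    """Moving-block resample: concatenate random contiguous blocks, trim to length."""
    n = len(returns)
    out = []
    while len(out) < n:
        start = rng.randrange(n - block_len + 1)
        out.extend(returns[start:start + block_len])
    return out[:n]


def quantile(sorted_values, q):
    """Simple rank-based quantile on pre-sorted values, q in [0, 1]."""
    return sorted_values[min(len(sorted_values) - 1, int(q * len(sorted_values)))]


def block_length_sweep(returns, block_lens=(10, 20, 40, 96), n_resamples=10_000, seed=0):
    """Median Sharpe and 90% CI width for each candidate block length."""
    rows = {}
    for block_len in block_lens:
        rng = random.Random(seed)
        sharpes = sorted(
            sharpe(block_resample(returns, block_len, rng)) for _ in range(n_resamples)
        )
        p5, p50, p95 = (quantile(sharpes, q) for q in (0.05, 0.50, 0.95))
        rows[block_len] = {"p5": p5, "median": p50, "p95": p95, "width": p95 - p5}
    return rows
```

Reading the output is a matter of comparing rows: the "median" column should barely move across block lengths, while "width" should rise and then flatten. Pure Python is slow at 10,000 resamples on long series, so a vectorized implementation is the practical choice for real sample sizes.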

Interpreting CIs on Sharpe and max drawdown

Two strategies can have identical point-estimate Sharpe and very different CI widths. Consider:

  • Strategy A: Sharpe 2.1, 90% CI [1.7, 2.5]. The lower quantile is still well above 1. The edge is robust to path reshuffling. You can plan capital against the lower bound.
  • Strategy B: Sharpe 2.1, 90% CI [0.4, 3.8]. The lower quantile sits below 0.5. There is a meaningful probability the true Sharpe is closer to 0.5 and this backtest got lucky on the sequence of bars. Plan capital against the lower bound and size much smaller, or treat the strategy as not yet validated and demand more data.

Same rule for max DD. If the point-estimate max DD is 12% but the 95th-percentile bootstrap max DD is 28%, your live capital-planning number is 28%, not 12%. The single observed max DD in your backtest sample is one draw from a distribution whose right tail you should respect.
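To make the sizing consequence concrete, here is a small sketch of one possible rule of thumb. It assumes drawdown scales roughly linearly with position size, which is only a first-order approximation, and the drawdown budget is a hypothetical input you would choose yourself; neither is a Keel feature.

```python
def size_multiplier(planning_dd, dd_budget):
    """Fraction of backtest size that keeps the planning drawdown within budget.

    planning_dd: the 95th-percentile bootstrap max DD (e.g. 0.28), not the
                 point estimate.
    dd_budget:   the largest drawdown you are willing to tolerate live
                 (hypothetical input, e.g. 0.14).
    Assumes drawdown scales about linearly with size (first-order approximation).
    """
    if planning_dd <= 0:
        raise ValueError("planning_dd must be positive")
    return min(1.0, dd_budget / planning_dd)
```

With the 28% figure from the example above and a hypothetical 14% budget, size_multiplier(0.28, 0.14) returns 0.5: run at half the backtested size. Feeding in the 12% point estimate instead would leave the size unchanged, which is exactly the mistake the bootstrap is meant to prevent.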

Common mistakes

  • Plain bootstrap on autocorrelated returns. The single most common error. CIs come back too tight and you conclude the strategy is more robust than it is. Always use block bootstrap for return series.
  • Resampling fewer than 5,000 times. Quantile estimates of the 5th and 95th percentile need a meaningful count of samples in the tail. Below 5K the CI itself becomes noisy and varies run-to-run; default to 10K and only push higher when tail-of-tail statistics matter.
  • Confusing block length with strategy horizon. Block length matches the autocorrelation horizon of the return series, not the holding period of the trade. A 1-day holding-period strategy on 15-minute bars can still have multi-day autocorrelation in its return stream if it sizes with persistence.
  • Bootstrapping trade P&L of an overfit strategy. Monte Carlo gives you CIs on the sample distribution. It does not tell you the sample is representative. An overfit backtest will bootstrap to a tight CI around an inflated Sharpe, and that Sharpe is still fictitious. Use MC for sequence risk, not for overfit detection. For overfit, you want walk-forward testing, probability of backtest overfitting (PBO), and the deflated Sharpe ratio.
  • Ignoring funding when bootstrapping HL strategies. On HL the per-bar return decomposes into price and funding components. If your bootstrap input is price-only returns, your CIs miss funding risk. Bootstrap the combined (price + funding) return stream for HL carry strategies.
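For the funding point in the last item, the safest approach is to merge funding into the per-bar return stream before any resampling, so that each block carries its price move and its funding together. The sketch below assumes a hypothetical input shape: a list of per-bar price returns plus a list of (bar_index, funding_return) pairs, with funding already signed from the strategy's point of view (positive when received, negative when paid).

```python
def combined_returns(price_returns, funding_events):
    """Per-bar total return = price return + any funding settled on that bar.

    price_returns:  list of per-bar price returns.
    funding_events: iterable of (bar_index, funding_return) pairs; the shape
                    and sign convention are assumptions for this example.
    """
    combined = list(price_returns)  # copy so the input is not mutated
    for bar_index, funding_return in funding_events:
        if not 0 <= bar_index < len(combined):
            raise IndexError(f"funding event at bar {bar_index} is outside the sample")
        combined[bar_index] += funding_return
    return combined
```

The combined series is then what goes into the block bootstrap. Resampling price and funding as two separate streams would break the link between funding regimes and the price moves that occurred alongside them.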

What Keel ships today

Keel's backtest engine ships point-estimate metrics: Sharpe, max DD, total return, win rate, decomposed funding P&L, per-asset attribution. Bootstrap confidence intervals on those metrics are not shipped today. Native Monte Carlo on backtest output is on the roadmap; the timing is not committed.

In the meantime, the widget below bridges the gap. Drop in trade P&L from any Keel backtest results.json or any returns series, pick a block length, run 10K resamples, and read off the 5/50/95 percentile CIs for Sharpe and max DD in-browser. No upload — the computation runs locally.

Try it

Block-bootstrap any returns series in-browser. Get the 5/50/95 percentile CIs on Sharpe and max DD with a histogram of the resampled distribution.

FAQ

Common questions

Why bootstrap a Hyperliquid backtest at all?

A single backtest produces one Sharpe and one max drawdown — a point estimate. Monte Carlo resampling answers the question your point estimate cannot: how much of that result is signal and how much is path luck? You get a distribution of plausible outcomes given the same underlying return process, which gives you a confidence interval on Sharpe, max DD, total return, and any other path-dependent metric. Wide CIs mean the strategy is fragile to sequence; tight CIs mean the edge is robust to reshuffling.

How do I pick block length for Hyperliquid 15-minute bars?

Block length should approximately match the autocorrelation horizon of your returns: short enough to mix the sample, long enough to preserve the local serial structure. For HL 15-minute bars, persistence over several hours is typical, so a block length of 20-40 bars (5-10 hours) is a reasonable default for most signal classes. For carry strategies with multi-day persistence, push to 96 bars (one day) or higher. For mean-reversion at sub-hour horizons, 8-12 may be enough. When in doubt, run a sensitivity sweep across 10, 20, 40, 96: the CI width may change but the point estimate should not.

How many resamples is enough?

For 90% or 95% CI quantiles you want at least 5,000 resamples; 10,000 is the common default and what the widget uses. The 99% quantile starts to wobble below 10K. Pushing above 50K rarely changes anything meaningful and the runtime cost is linear. If you are computing a CI on something rare (extreme drawdown, worst-week return), bump to 20-50K to stabilize the tail.
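You can check whether a given resample count is enough for your own data by repeating the whole bootstrap with different random seeds and seeing how much the 5th-percentile Sharpe moves between runs. The sketch below does that with assumed helper names, a moving-block resampler, and the same annualization assumption as elsewhere (15-minute bars, 24/7, zero risk-free rate).

```python
import math
import random
import statistics


def sharpe(returns, bars_per_year=96 * 365):
    """Annualized Sharpe; assumes 15-minute bars, 24/7 trading, zero risk-free rate."""
    sd = statistics.stdev(returns)
    return 0.0 if sd == 0 else statistics.fmean(returns) / sd * math.sqrt(bars_per_year)


def block_resample(returns, block_len, rng):
    """Moving-block resample: concatenate random contiguous blocks, trim to length."""
    n = len(returns)
    out = []
    while len(out) < n:
        start = rng.randrange(n - block_len + 1)
        out.extend(returns[start:start + block_len])
    return out[:n]


def p5_spread(returns, block_len, n_resamples, n_repeats=10):
    """Range (max - min) of the 5th-percentile Sharpe across repeated bootstrap runs."""
    estimates = []
    for seed in range(n_repeats):
        rng = random.Random(seed)  # a fresh, independent run per seed
        sharpes = sorted(
            sharpe(block_resample(returns, block_len, rng)) for _ in range(n_resamples)
        )
        estimates.append(sharpes[int(0.05 * n_resamples)])
    return max(estimates) - min(estimates)
```

Comparing p5_spread(returns, 20, 500) with p5_spread(returns, 20, 10_000) shows how much of the reported lower bound is Monte Carlo noise at each setting. If the spread is small relative to the CI width you care about, adding more resamples will not change the decision.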

What does a wide CI on Sharpe actually tell me?

It tells you the strategy result is more sample-path-dependent than your single backtest suggests. A point-estimate Sharpe of 2.1 with a 90% CI of [0.4, 3.8] is a different decision than a Sharpe of 2.1 with [1.7, 2.5]. The first one has a meaningful probability of being a Sharpe 0.5 strategy that got lucky in this backtest window; the second is robustly above 1.5. Use the lower quantile as your honest planning number, not the point estimate.

When will Keel ship native Monte Carlo on backtest output?

Today Keel reports point-estimate Sharpe, max DD, total return, win rate, and decomposed funding P&L from the backtest engine. Bootstrap confidence intervals on those metrics are on the roadmap — no committed ship date yet. The widget on this page is a bridge: drop in trade P&L or returns from any Keel backtest result (or any other source) and get the CIs in-browser. When native MC ships in Keel it will compute the same statistic on the returns the simulator produced, exposed through the metrics card.