Resample your trade-level returns to build a distribution around Sharpe and max-drawdown rather than a single point estimate. Block bootstrap fits autocorrelated crypto returns; plain bootstrap does not. The output is a confidence interval — how robust the reported Sharpe is to sampling noise.
A backtest reports one number for Sharpe and one number for max drawdown. Both are point estimates of a noisy quantity. The realized return path is one draw from a distribution; resample the path and the numbers shift. Monte Carlo on a backtest quantifies how much they shift — turning the point estimate into a 5th/50th/95th percentile band so you can tell a robust edge apart from a lucky window.
Two strategies can report an identical Sharpe of 2.0. One might have a 90% confidence interval of [1.6, 2.4] and the other [0.3, 3.7]. The first is a tradeable edge with noise. The second is mostly noise with a flattering point estimate. You cannot distinguish them from the backtest number alone. Monte Carlo supplies the missing piece: the confidence interval.
Cite: Bailey, Borwein, López de Prado, and Zhu (2014), "Pseudo-Mathematics and Financial Charlatanism," which lays out the selection-bias problem that this and adjacent techniques exist to address.
The procedure resamples the realized series of returns (per-bar or per-trade) many times to construct synthetic equity curves that are statistically consistent with the observed one. For each synthetic curve, recompute Sharpe and max drawdown. Repeat 10,000 times. The empirical distribution of those metrics is the answer.
Concretely you end up with:
This is not a forecast of future performance. It is a sensitivity analysis on the in-sample number, holding the underlying distribution fixed. Live Sharpe can still fall below the lower end of the CI, because regime shifts change the underlying distribution itself, and resampling the historical path cannot capture that.
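The resample-and-recompute loop described above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the resampler's actual implementation: the function names, the zero risk-free rate, and the annualization factor of 35,040 (15-minute bars in a 24/7 market) are all assumptions. The `resample` argument is pluggable; the draw-with-replacement sampler included here is only a placeholder, and a block resampler can be swapped in.

```python
import math
import random
import statistics

def sharpe(returns, periods_per_year=35040):
    """Annualized Sharpe of per-bar returns (risk-free rate assumed 0).
    35040 = 365 * 96 assumes 15-minute bars trading around the clock."""
    sd = statistics.stdev(returns)
    if sd == 0:
        return 0.0
    return statistics.fmean(returns) / sd * math.sqrt(periods_per_year)

def max_drawdown(returns):
    """Largest peak-to-trough loss of the compounded equity curve, as a fraction."""
    equity, peak, worst = 1.0, 1.0, 0.0
    for r in returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = max(worst, 1.0 - equity / peak)
    return worst

def percentile(sorted_values, q):
    """Nearest-index percentile on an already sorted list, q in [0, 100]."""
    idx = round(q / 100 * (len(sorted_values) - 1))
    return sorted_values[min(len(sorted_values) - 1, max(0, idx))]

def iid_resample(returns, rng):
    """Placeholder sampler: draw individual returns with replacement."""
    return rng.choices(returns, k=len(returns))

def monte_carlo_bands(returns, resample, n_resamples=10_000, seed=0):
    """Resample the series, recompute both metrics, return 5/50/95 bands."""
    rng = random.Random(seed)
    sharpes, drawdowns = [], []
    for _ in range(n_resamples):
        path = resample(returns, rng)
        sharpes.append(sharpe(path))
        drawdowns.append(max_drawdown(path))
    sharpes.sort()
    drawdowns.sort()
    return {
        "sharpe": tuple(percentile(sharpes, q) for q in (5, 50, 95)),
        "max_dd": tuple(percentile(drawdowns, q) for q in (5, 50, 95)),
    }
```

Calling `monte_carlo_bands(my_returns, iid_resample)` on a list of per-bar returns yields two 3-tuples: the 5th, 50th, and 95th percentiles of Sharpe and of max drawdown across the synthetic paths.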
Plain (i.i.d.) bootstrap resamples individual returns independently. It assumes the underlying return series has no serial correlation — that yesterday’s return tells you nothing about today’s. For crypto, that assumption is flatly wrong:
Plain bootstrap breaks all of these. It produces synthetic series that are too smooth, with Sharpe CIs that are too tight — making the strategy look more robust than it actually is.
Block bootstrap (Künsch 1989) fixes this by resampling contiguous chunks of length L. Pick a block, glue it onto the synthetic series, and repeat until you have a series of the original length. The local autocorrelation structure inside each block is preserved; dependence is broken only at the joins between blocks. CIs widen to reflect the real noise.
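A minimal sketch of the pick-a-block, glue-it-on procedure, assuming overlapping (moving) blocks with uniformly random start positions and a trimmed final block. The function name and signature are illustrative, not taken from any particular library.

```python
import random

def block_bootstrap(returns, block_len, rng):
    """Moving-block bootstrap: glue random contiguous blocks together until
    the synthetic series is as long as the original, then trim the overshoot."""
    n = len(returns)
    if not 1 <= block_len <= n:
        raise ValueError("block_len must be between 1 and len(returns)")
    last_start = n - block_len               # last index where a full block fits
    out = []
    while len(out) < n:
        start = rng.randint(0, last_start)   # randint is inclusive on both ends
        out.extend(returns[start:start + block_len])
    return out[:n]
```

With `block_len=1` this collapses to the plain bootstrap; with `block_len=len(returns)` it returns the original path unchanged, which is why very long blocks leave little to resample.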
The trade-off: longer blocks preserve more autocorrelation but reduce the effective sample size of the bootstrap (you have fewer independent blocks). Pick L to roughly match the autocorrelation horizon of the series and you capture the structure without unduly inflating CIs.
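One rough way to put a number on the autocorrelation horizon is to find the first lag at which the sample autocorrelation falls inside the approximate 95% white-noise band of ±1.96/√n. The sketch below is a heuristic under stated assumptions, not a prescribed method: the band threshold, the `max_lag` default, and the choice to test absolute returns (to pick up volatility clustering) are all illustrative.

```python
import math

def autocorr(xs, lag):
    """Sample autocorrelation of xs at a positive lag."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs)
    if var == 0:
        return 0.0
    cov = sum((xs[i] - mean) * (xs[i + lag] - mean) for i in range(n - lag))
    return cov / var

def autocorr_horizon(returns, max_lag=200, use_abs=True):
    """First lag whose autocorrelation sits inside the approximate 95%
    white-noise band (+/- 1.96 / sqrt(n)). With use_abs=True the test runs
    on |r|, which is sensitive to volatility clustering."""
    xs = [abs(r) for r in returns] if use_abs else list(returns)
    band = 1.96 / math.sqrt(len(xs))
    limit = min(max_lag, len(xs) - 1)
    for lag in range(1, limit + 1):
        if abs(autocorr(xs, lag)) < band:
            return lag
    return limit   # never dropped inside the band within the lags checked
```

The returned lag is a starting point for L rather than a final answer; a single noisy estimate can cross the band early, so it is worth cross-checking against a sweep over several block lengths.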
For Hyperliquid 15-minute bars, the relevant time scales are: tick-level noise (sub-bar), intraday momentum (a few hours), funding cycle (hourly settlements but trends persist multi-day), and overnight regime shifts. Reasonable defaults:
| signal type | block length (15-min bars) | wall-clock |
| --- | --- | --- |
| sub-hour mean reversion | 8–12 | 2–3 hours |
| intraday momentum | 20–40 | 5–10 hours |
| multi-day carry / trend | 96 | 1 day |
| weekly rebalancing | 672 | 1 week |

In practice, sweep L across {10, 20, 40, 96} and check two things: (1) the median Sharpe should be stable across block lengths (if not, your dataset is too short for any of them), and (2) the CI width should grow with L up to the autocorrelation horizon and then plateau. The plateau identifies the right block.
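The sweep can be sketched as a loop that reports, for each L, the median Sharpe and the width of the 5th to 95th percentile band. This is an illustrative, standard-library version: the function names are hypothetical, Sharpe is left per-bar (not annualized) to keep the code short, and block lengths longer than the series are skipped.

```python
import random
import statistics

def sharpe(returns):
    """Per-bar (non-annualized) Sharpe, risk-free rate assumed 0."""
    sd = statistics.stdev(returns)
    return statistics.fmean(returns) / sd if sd > 0 else 0.0

def block_resample(returns, block_len, rng):
    """Moving-block bootstrap of the same length as the input."""
    n, out = len(returns), []
    while len(out) < n:
        start = rng.randint(0, n - block_len)
        out.extend(returns[start:start + block_len])
    return out[:n]

def sweep_block_lengths(returns, block_lens=(10, 20, 40, 96),
                        n_resamples=2000, seed=0):
    """Map each block length L to (median Sharpe, 5th-95th percentile width)."""
    rng = random.Random(seed)
    rows = {}
    for L in block_lens:
        if L > len(returns):
            continue   # a block cannot be longer than the series
        s = sorted(sharpe(block_resample(returns, L, rng))
                   for _ in range(n_resamples))
        p5, p50, p95 = (s[round(q * (len(s) - 1))] for q in (0.05, 0.50, 0.95))
        rows[L] = (p50, p95 - p5)
    return rows
```

Reading the result row by row shows both checks at once: the first element of each tuple should barely move across L, and the second should rise and then flatten.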
If you want a principled choice, the Politis-Romano (1994) stationary bootstrap randomizes block length from a geometric distribution — less sensitive to a single L choice. Most practical implementations default to fixed-length blocks because the simplicity outweighs the slight bias.
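For comparison, a stationary bootstrap can be sketched as a walk along the series that, at each step, either continues to the next observation or jumps to a fresh random position with probability 1/mean block length. That makes block lengths geometric with the requested mean. The wrap-around at the end of the series and the function name are assumptions of this sketch.

```python
import random

def stationary_bootstrap(returns, mean_block_len, rng):
    """Stationary bootstrap sketch: geometric block lengths with the given
    mean, treating the series as circular so blocks can wrap past the end."""
    n = len(returns)
    if n == 0 or mean_block_len < 1:
        raise ValueError("need a non-empty series and mean_block_len >= 1")
    p_new = 1.0 / mean_block_len     # chance of starting a new block each step
    idx = rng.randrange(n)
    out = []
    for _ in range(n):
        out.append(returns[idx])
        if rng.random() < p_new:
            idx = rng.randrange(n)   # jump: start a new block
        else:
            idx = (idx + 1) % n      # continue the current block
    return out
```

Setting `mean_block_len=1` reproduces the plain bootstrap, since every step starts a new block.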
Once you have a Sharpe distribution, three numbers matter:
A heuristic based on the Bailey & López de Prado (2014) deflated Sharpe ratio (DSR) paper: if the lower bound of the CI is positive and the median exceeds the haircut for the number of strategy candidates you searched, the edge is plausibly real. For a single strategy you built from priors, the haircut is small. For the best of 1,000 grid-searched parameter sets, the haircut is large enough that a median Sharpe of 1.5 may still collapse to zero on a deflated basis.
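To get a feel for the size of the haircut, the sketch below uses the expected-maximum approximation associated with the deflated Sharpe ratio: the Sharpe you would expect from the best of N zero-edge candidates purely by chance. It assumes independent trials with normally distributed Sharpe estimates; the function name and the `sharpe_std` input (the standard deviation of Sharpe across the candidates you tried, in the same units as the Sharpe being judged) are illustrative.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329   # Euler-Mascheroni constant

def expected_max_sharpe(n_trials, sharpe_std):
    """Approximate expected maximum Sharpe among n_trials independent
    zero-edge candidates whose Sharpe estimates have std sharpe_std."""
    if n_trials < 2:
        return 0.0                 # a single trial carries no selection haircut
    z = NormalDist().inv_cdf       # inverse standard normal CDF
    return sharpe_std * (
        (1 - EULER_GAMMA) * z(1 - 1 / n_trials)
        + EULER_GAMMA * z(1 - 1 / (n_trials * math.e))
    )
```

Under these assumptions, `expected_max_sharpe(1000, 0.5)` comes out at roughly 1.6, so with a cross-candidate Sharpe dispersion of 0.5 (a made-up figure for illustration) a median of 1.5 would not clear the bar.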
The Monte Carlo Backtest Resampler runs 10,000 block-bootstrap resamples in your browser. Paste a CSV of per-bar returns or per-trade P&L, pick a block length, and get Sharpe and max-DD distributions with 5th/50th/95th percentile bands. Nothing leaves your machine.
Keel itself reports point-estimate metrics from the backtest engine today; native bootstrap CIs on those metrics are on the roadmap but not shipped. The widget is the bridge — export your backtest returns, drop them into the resampler, get the CIs.
A single Sharpe number is one realization of a noisy estimator. Two strategies with the same true Sharpe will produce different sample Sharpe estimates over the same window. A Monte Carlo resample turns that point estimate into a distribution, and the 5th/95th percentile spread tells you how much of the reported number is signal versus the luck of which trades fell inside the sample. Without the CI, you cannot tell a Sharpe 2.0 strategy with tight CI [1.6, 2.4] apart from a Sharpe 2.0 strategy with CI [0.3, 3.7] — and the second is much more likely to disappoint live.
Block length should roughly match the autocorrelation horizon of the return series. For Hyperliquid 15-minute bars: a block length of 20–40 bars (5–10 hours) is a reasonable default for most signals; 96 bars (one day) for carry strategies whose edge persists across funding settlements; 8–12 bars for sub-hour mean reversion. When in doubt, sweep across 10/20/40/96 and check that the median is stable. The CI width should grow with block length up to the underlying autocorrelation horizon and then plateau; that plateau identifies the right block.
For 5th/95th percentile estimates, 10,000 resamples is enough for stable bounds run-to-run. Push to 20,000–50,000 if you want 99% intervals or are estimating extreme-tail statistics. Below 5,000 the percentile estimates wobble noticeably between runs; above 100,000 you rarely see anything change. The Monte Carlo error in each percentile estimate shrinks as O(1/√N), so doubling the resample count cuts that error by only about 30%. It does not narrow the CI itself, whose width is set by the length and noisiness of the return series.
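The roughly 30% figure comes from the square-root scaling of the Monte Carlo error in the percentile estimates, where N is the number of resamples:

```latex
\varepsilon(N) \propto \frac{1}{\sqrt{N}}
\quad\Longrightarrow\quad
\frac{\varepsilon(2N)}{\varepsilon(N)} = \frac{1}{\sqrt{2}} \approx 0.707,
\qquad
\frac{\varepsilon(4N)}{\varepsilon(N)} = \frac{1}{2}
```

Doubling N removes about 29% of the run-to-run wobble, and halving it takes four times as many resamples.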
The point estimate is what your backtest reports — a single number computed from the realized return path. The CI is the range that point estimate could plausibly take if the underlying return distribution stayed the same but the timing of trades varied. A wide CI means the point estimate is fragile to sampling noise. A tight CI means the strategy would produce similar Sharpe across many counterfactual realizations of the same edge.
They diagnose different failure modes. Monte Carlo bootstraps the realized return path of a single strategy to estimate noise around its Sharpe. PBO (Probability of Backtest Overfitting, Bailey et al. 2014) takes many candidate strategies and asks how often the in-sample winner underperforms out-of-sample, which makes it a selection-bias diagnostic across the strategy space. Use Monte Carlo to estimate uncertainty on a single strategy you've chosen; use PBO to estimate the chance you chose the wrong strategy. Both belong in a rigorous workflow.