Out-of-sample testing for crypto backtests

Reserve a chunk of history the strategy never touches in optimization or selection. Train/test/holdout splits are the baseline; k-fold needs time-aware variants; walk-forward generalizes to many out-of-sample (OOS) windows. Short Hyperliquid (HL) histories on newer perps constrain how much data you can hold out, so plan accordingly.

By Keel Research Team · Updated May 17, 2026

Out-of-sample (OOS) testing is the cheapest defense against the most expensive kind of mistake: deploying a strategy whose backtest performance was an artifact of the optimization window. The basic idea is simple: set aside a piece of historical data that the strategy never sees during any parameter selection or signal construction step. Performance there is your honest estimate of what live trading would have looked like during that window.

The difficulty in crypto specifically is data-history length. Mature pairs (BTC, ETH) have years of high-quality data. Newer HL listings often have months. The amount of data you can reserve for OOS is constrained, and the OOS window itself may be dominated by a single regime. Both shape how to design the validation.

What out-of-sample means

OOS is defined by what it is not: data the strategy has not touched in any of the following steps:

Parameter selection (grid search, gradient descent, Bayesian optimization).
Feature engineering (no peeking at OOS to pick which features to include).
Strategy selection (no choosing among N candidates by OOS performance).
Hyperparameter tuning (no adjusting walk-forward window sizes by looking at OOS aggregate Sharpe).

Any contact between the strategy-building process and the OOS data leaks information from OOS into the strategy, and the verification stops being honest. This is harder to maintain than it sounds — the temptation to re-tune after a disappointing OOS is exactly what breaks the validation.

Three flavors: holdout, k-fold, walk-forward

Simple holdout. Split history into two contiguous pieces, typically 70–80% in-sample (IS) and 20–30% out-of-sample. Build and tune the strategy on IS. Run the frozen strategy once on OOS and compare performance. This is the simplest form, with the lowest computational cost. Limitation: you get one realization of OOS performance. If the OOS window happens to be regime-mismatched (a chop window for a trend strategy), the result is unfairly bad; if the window happens to favor the strategy, the result is unfairly good.
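A minimal sketch of the simple holdout split, assuming the bars are already sorted oldest to newest. The function name `chronological_holdout` and the 25% default are illustrative choices, not part of any platform API.

```python
from typing import Sequence, Tuple, TypeVar

T = TypeVar("T")


def chronological_holdout(
    bars: Sequence[T], oos_fraction: float = 0.25
) -> Tuple[Sequence[T], Sequence[T]]:
    """Split time-ordered bars into contiguous IS and OOS pieces (no shuffling)."""
    if not 0.0 < oos_fraction < 1.0:
        raise ValueError("oos_fraction must be strictly between 0 and 1")
    split = int(len(bars) * (1.0 - oos_fraction))
    if split == 0 or split == len(bars):
        raise ValueError("not enough bars to form both pieces")
    # Everything before `split` is in-sample; everything from `split` on is OOS.
    return bars[:split], bars[split:]


# Stand-in data: integers play the role of time-ordered bars.
bars = list(range(1000))
is_bars, oos_bars = chronological_holdout(bars, oos_fraction=0.25)
```

The split is by position, not by random sampling, so every OOS bar is strictly later than every IS bar.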

K-fold cross-validation (time-aware). Standard k-fold shuffles rows into folds, which leaks future information into training. For time-series data, you need time-aware variants:

Blocked k-fold: folds are contiguous time windows. Rotate through, training on the others and testing on the held-out fold each time.
Purged-and-embargoed k-fold (López de Prado): purges training samples whose feature lookbacks overlap the test fold, embargoes a buffer of training samples immediately after each test fold to prevent leakage through serial correlation.
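The two variants can be sketched as one index generator. This is an illustration under stated assumptions, not López de Prado's reference implementation: `lookback` is the number of bars a feature looks back, `embargo` is an extra buffer in bars, and both are hypothetical parameter names. Setting both to zero gives plain blocked k-fold. If labels span forward in time, a matching purge before each test fold would also be needed, which this sketch omits.

```python
from typing import Iterator, List, Tuple


def purged_blocked_kfold(
    n_bars: int, n_folds: int, lookback: int = 0, embargo: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for contiguous, time-ordered folds."""
    if n_folds < 2 or n_bars < n_folds:
        raise ValueError("need at least 2 folds and at least one bar per fold")
    fold_size = n_bars // n_folds
    for k in range(n_folds):
        test_start = k * fold_size
        # The last fold absorbs any remainder bars.
        test_end = n_bars if k == n_folds - 1 else test_start + fold_size
        # Bars right after the fold have lookback windows that reach into it
        # (purge), and we skip a further buffer for serial correlation (embargo).
        resume = min(test_end + lookback + embargo, n_bars)
        train = list(range(0, test_start)) + list(range(resume, n_bars))
        test = list(range(test_start, test_end))
        yield train, test


folds = list(purged_blocked_kfold(n_bars=100, n_folds=5, lookback=3, embargo=2))
```

With these example numbers, the first fold tests on bars 0–19 and training resumes at bar 25, because the five bars after the fold are purged or embargoed.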

Walk-forward. A constrained, chronological form of blocked k-fold. IS and OOS windows slide forward through history, and each OOS window is unseen at the time its parameters are selected. The aggregate OOS performance approximates what continuous re-optimization would have looked like live. Walk-forward is the gold standard for parameter-tuned strategies; see the WFO explainer for the full procedure.
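As a rough sketch of the window bookkeeping only (the optimization inside each IS window is left out), the generator below produces half-open index ranges. The function name and the `anchored` flag are illustrative assumptions: rolling windows keep a fixed IS length, anchored windows keep a fixed IS start.

```python
from typing import List, Tuple

Window = Tuple[Tuple[int, int], Tuple[int, int]]


def walk_forward_windows(
    n_bars: int, is_len: int, oos_len: int, anchored: bool = False
) -> List[Window]:
    """Return ((is_start, is_end), (oos_start, oos_end)) pairs, end-exclusive."""
    if is_len <= 0 or oos_len <= 0:
        raise ValueError("window lengths must be positive")
    windows: List[Window] = []
    oos_start = is_len
    while oos_start + oos_len <= n_bars:
        # Anchored: IS always starts at bar 0. Rolling: IS slides with the OOS.
        is_start = 0 if anchored else oos_start - is_len
        windows.append(((is_start, oos_start), (oos_start, oos_start + oos_len)))
        oos_start += oos_len  # step forward by one OOS window
    return windows


rolling = walk_forward_windows(n_bars=1000, is_len=400, oos_len=100)
anchored = walk_forward_windows(n_bars=1000, is_len=400, oos_len=100, anchored=True)
```

In both modes every OOS range starts exactly where its IS range ends, so no OOS bar is visible when that fold's parameters are chosen.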

HL-specific notes

Crypto data on Hyperliquid imposes constraints that don’t exist for equities. The validation design has to bend around them.

Short history on newer listings. A perp listed six months ago has at most six months of data. With 15-minute bars, that is ~17,000 bars: plenty for bar-level statistics, but a daily-to-weekly strategy that trades a couple of times per week produces only about 50 trades over that span. The OOS slice may have to shrink to 1–2 months to leave enough IS for meaningful tuning. Be honest about sample size before drawing conclusions.
Regime shifts in 15-minute bars. HL funding regimes shift over days to weeks; volatility regimes shift over hours; market-structure regimes (new listings, perp DEX competitor launches) shift over months. A single contiguous OOS window may be dominated by one regime and tell you nothing about the others. Walk-forward’s many OOS windows mitigate this — a single holdout cannot.
Cross-asset estimation. If you’re estimating cross-sectional signals across many HL pairs, the listings universe changes over time. Earlier in your sample, fewer pairs were listed; later, more pairs. The implicit asset selection is itself a form of look-ahead. Either restrict the universe to pairs available throughout the window, or model the listing dates explicitly.
Funding cycle alignment. Hyperliquid settles funding hourly. Long enough OOS windows (at least a few weeks) are needed for funding effects to average out; very short OOS windows can be dominated by transient funding spikes around a single news event.
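For the cross-asset point above, both options reduce to filtering by listing date. The sketch below is illustrative only: the symbols and listing dates are made-up placeholders, `universe_as_of` is a hypothetical helper, and delistings are ignored.

```python
from datetime import date
from typing import Dict, List

# Hypothetical symbols and listing dates, for illustration only.
LISTING_DATES: Dict[str, date] = {
    "AAA": date(2024, 1, 15),
    "BBB": date(2025, 3, 1),
    "CCC": date(2025, 9, 10),
}


def universe_as_of(listing_dates: Dict[str, date], as_of: date) -> List[str]:
    """Symbols already listed on `as_of` (delistings are not modeled here)."""
    return sorted(sym for sym, listed in listing_dates.items() if listed <= as_of)


# Option 1: restrict to pairs available throughout the window, i.e. listed
# on or before the window start.
window_start = date(2025, 4, 1)
fixed_universe = universe_as_of(LISTING_DATES, window_start)

# Option 2: model listing dates explicitly with a point-in-time universe
# recomputed at each rebalance date.
rebalance_dates = [date(2025, 4, 1), date(2025, 7, 1), date(2025, 10, 1)]
point_in_time = {d: universe_as_of(LISTING_DATES, d) for d in rebalance_dates}
```

Option 1 is simpler but discards newer listings entirely; option 2 keeps them while ensuring no signal is computed on a pair before it existed.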

Interpreting OOS Sharpe vs IS — degradation thresholds

The expected degradation from IS to OOS Sharpe is real and predictable:

OOS / IS ratio    interpretation
≥ 0.7             unusually robust; suspect insufficient IS optimization
0.5–0.7           healthy; the expected outcome for a well-validated strategy
0.3–0.5           degraded; edge survives but smaller than IS suggested
0.0–0.3           heavy overfit; the IS number was mostly noise
< 0               broken; the strategy has no real edge

A 50–70% retention is the realistic baseline. Practitioners new to OOS often expect retention near 100% and reject strategies that show 60% retention — but 60% is what success looks like. The strategies you should worry about are the ones showing 95% retention from a 10,000-trial grid search; that pattern usually means the OOS window leaked into the IS optimization somehow.
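The threshold table can be written as a small lookup. This is a sketch that assumes each band includes its lower bound, which the table does not specify, and `classify_retention` is an illustrative name.

```python
from typing import Tuple


def classify_retention(is_sharpe: float, oos_sharpe: float) -> Tuple[float, str]:
    """Map the OOS/IS Sharpe ratio onto the interpretation bands in the table."""
    if is_sharpe <= 0:
        raise ValueError("retention is only meaningful for a positive IS Sharpe")
    ratio = oos_sharpe / is_sharpe
    if ratio < 0:
        label = "broken"
    elif ratio < 0.3:
        label = "heavy overfit"
    elif ratio < 0.5:
        label = "degraded"
    elif ratio < 0.7:
        label = "healthy"
    else:
        label = "unusually robust"
    return ratio, label


ratio, label = classify_retention(is_sharpe=2.0, oos_sharpe=1.2)
```

Here an IS Sharpe of 2.0 with an OOS Sharpe of 1.2 lands in the healthy band, while an OOS Sharpe of 1.9 on the same IS number would fall in the band that warrants a leakage check.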

Equally important: a strategy that fails on a single OOS window may still be viable on others. Use walk-forward’s multiple OOS windows to estimate the distribution of OOS Sharpe, not just one realization.
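One way to look at the distribution rather than a single number is to compute a Sharpe per OOS window and summarize the spread. The sketch below assumes per-bar returns, a zero risk-free rate, and 15-minute bars on a market that trades every day (96 × 365 bars per year); the return values are toy numbers, not real results.

```python
import math
import statistics
from typing import Dict, List, Sequence


def sharpe(returns: Sequence[float], periods_per_year: int = 96 * 365) -> float:
    """Annualized Sharpe of per-bar returns, risk-free rate assumed zero."""
    if len(returns) < 2:
        raise ValueError("need at least two returns")
    sd = statistics.stdev(returns)
    if sd == 0:
        raise ValueError("Sharpe is undefined for constant returns")
    return statistics.mean(returns) / sd * math.sqrt(periods_per_year)


def oos_sharpe_summary(oos_windows: List[Sequence[float]]) -> Dict[str, float]:
    """Summarize per-window OOS Sharpe instead of relying on one realization."""
    sharpes = [sharpe(w) for w in oos_windows]
    return {
        "min": min(sharpes),
        "median": statistics.median(sharpes),
        "max": max(sharpes),
        "share_positive": sum(s > 0 for s in sharpes) / len(sharpes),
    }


# Toy per-bar returns for three OOS windows (illustrative values only).
windows = [
    [0.002, -0.001, 0.003, 0.000],
    [-0.002, 0.001, -0.003, 0.000],
    [0.001, 0.002, -0.001, 0.002],
]
summary = oos_sharpe_summary(windows)
```

A wide gap between the minimum and the median is a hint that the result depends on regime, which a single holdout would not reveal.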

Doing OOS today — concrete recipe

A practical workflow for an HL strategy with ~18 months of data:

Carve a final holdout. Reserve the most recent 3 months. Do not look at this data until the strategy is locked.
Run a series of single-window optimizations across the remaining 15 months, moving the window end dates manually. Use anchored windows: the IS start stays fixed while the IS length expands from 6 months to 14 months by the last fold, with 1-month OOS steps. You get 9 OOS windows for the strategy to demonstrate consistency on; /lab/walk-forward-visualizer then renders the per-fold IS/OOS comparison.
Aggregate walk-forward OOS. Compute aggregate OOS Sharpe across the windows. Confirm it is at least 50% of average IS Sharpe.
Freeze the strategy. Lock parameters using either the final IS window’s optimum or a consensus across walk-forward windows (more robust). Do not re-tune after seeing OOS aggregate.
Run the frozen strategy once on the holdout. This is the honest verification. If it passes, deploy at small size. If it fails, the strategy is invalidated — do not adjust and re-run on the same holdout, because doing so contaminates it permanently.
Start small live. Even after passing holdout, the actual live test is forward time at small capital. Backtests are decision tools; live data is the verdict.
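The calendar for this recipe can be laid out ahead of time so the holdout boundary is fixed before any optimization runs. The sketch below works in whole months counted from the start of history and takes the recipe's numbers at face value (18 months total, 3-month holdout, 6-month initial IS, 1-month OOS steps); with those inputs it yields 9 anchored folds, the last of which trains on months 0–14. Function names are illustrative, and nothing here calls a Keel API.

```python
from typing import Dict, List, Sequence, Tuple

Span = Tuple[int, int]  # (start_month, end_month), end-exclusive


def recipe_schedule(
    total_months: int = 18,
    holdout_months: int = 3,
    initial_is_months: int = 6,
    oos_step_months: int = 1,
) -> Dict[str, object]:
    """Anchored walk-forward folds plus a final untouched holdout span."""
    usable = total_months - holdout_months
    folds: List[Dict[str, Span]] = []
    is_end = initial_is_months
    while is_end + oos_step_months <= usable:
        # Anchored: IS always starts at month 0 and grows by one step per fold.
        folds.append({"is": (0, is_end), "oos": (is_end, is_end + oos_step_months)})
        is_end += oos_step_months
    return {"folds": folds, "holdout": (usable, total_months)}


def passes_retention_gate(
    is_sharpes: Sequence[float], aggregate_oos_sharpe: float, min_retention: float = 0.5
) -> bool:
    """Check that aggregate OOS Sharpe is at least 50% of the average IS Sharpe."""
    avg_is = sum(is_sharpes) / len(is_sharpes)
    return avg_is > 0 and aggregate_oos_sharpe >= min_retention * avg_is


schedule = recipe_schedule()
```

Writing the holdout span down first, and never passing it to the optimizer, is a lightweight substitute for the enforcement the platform does not yet provide.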

Keel today ships single-window optimization plus the walk-forward visualizer; rolling/anchored WFO and holdout enforcement are roadmap. The discipline of manually rolling windows and reserving an untouched holdout is on you, not on the platform.

Try the WFO visualizer

The Walk-Forward Visualizer takes a returns series and lets you inspect IS-vs-OOS Sharpe across a sequence of walk-forward windows. Useful for getting an intuition for how much of an in-sample Sharpe number typically survives out-of-sample, and how much the result swings between adjacent windows when underlying conditions shift.

Further reading: Pardo (2008), The Evaluation and Optimization of Trading Strategies (2nd ed.) is the canonical reference on walk-forward methodology. Bailey, Borwein, López de Prado & Zhu (2014), Pseudo-Mathematics and Financial Charlatanism formalizes the selection-bias problem that OOS testing exists to address.

This article is educational. Passing OOS testing does not guarantee live profitability; it only rules out the failure mode where the in-sample number was a fitting artifact. Live deployment at small capital remains the final verification. Native walk-forward and holdout enforcement are on the Keel roadmap; today both the rolling-window approximation and the final holdout discipline are manual professional norms rather than enforced platform constraints.

Automate it

Trade systematically on Keel

Keel is a Strategy OS for AI-assisted systematic trading on Hyperliquid. Backtest, optimize, and run live strategies across single-stock perps, indices, and crypto majors — realistic fees, slippage, and funding modeled.

Free to start — connect a Hyperliquid wallet when you’re ready to go live.

What you can do

Backtest any strategy with realistic fees, slippage, and funding.
Optimize parameter grids by Sharpe, drawdown, hit rate.
Deploy live to HL with stops + position limits + funding-aware execution.
Iterate with AI — describe a thesis, get a tradeable pipeline.

FAQ

Out-of-sample testing — questions

How much data should I reserve for out-of-sample?

Conventional rule of thumb: 20–30% of total history reserved as a final holdout that the strategy never touches during parameter selection or any optimization step. For short HL histories — say a perp with 18 months of data — that means roughly 4–5 months untouched at the end. On longer-history pairs (BTC, ETH back several years), you can afford a 30% holdout while still leaving enough in-sample for meaningful parameter estimates. The key constraint is trade count: in-sample needs at least 200 trades for parameter estimates to be statistically meaningful; OOS needs at least 50 for the verification to be credible.
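A quick feasibility check against those trade-count floors, as a sketch. The trade pace of two per week is an assumed example, months are converted to weeks at 52/12, and the helper names are illustrative.

```python
from typing import Dict

WEEKS_PER_MONTH = 52 / 12


def estimated_trades(months: float, trades_per_week: float) -> int:
    """Rough trade count for a window, rounded down."""
    return int(months * WEEKS_PER_MONTH * trades_per_week)


def sample_size_ok(
    is_trades: int, oos_trades: int, min_is: int = 200, min_oos: int = 50
) -> Dict[str, bool]:
    """Compare trade counts with the 200 (IS) and 50 (OOS) rule-of-thumb floors."""
    return {"is_ok": is_trades >= min_is, "oos_ok": oos_trades >= min_oos}


# 18 months of history with a 4-month holdout, trading about twice a week.
is_trades = estimated_trades(months=14, trades_per_week=2)
oos_trades = estimated_trades(months=4, trades_per_week=2)
check = sample_size_ok(is_trades, oos_trades)
```

At that pace neither window clears its floor (about 121 IS trades and 34 OOS trades), which is the constraint short HL histories impose on slower strategies.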

Does k-fold cross-validation work for crypto?

Not in the standard form. Shuffled k-fold assigns rows to folds at random, so the training folds contain bars from after each test bar, and a strategy fitted this way has effectively seen the future. For time-series data you have to use time-aware variants: blocked k-fold (folds are contiguous time windows), or purged-and-embargoed k-fold (purges training samples that overlap test windows in feature lookback, embargoes a buffer after each test fold). Walk-forward is functionally a constrained form of blocked k-fold where folds advance chronologically.

What is a good OOS Sharpe relative to IS?

The honest expectation is OOS Sharpe of 50–70% of IS Sharpe for a well-validated strategy. So an IS Sharpe of 2.5 that produces 1.25–1.75 OOS is doing what you should expect. If OOS Sharpe is within 30% of IS, you may have an unusually robust edge or your IS optimization was unusually well-constrained. If OOS is below 30% of IS, the in-sample number was mostly fitted noise. If OOS goes negative, you overfit hard and the strategy has no real edge. The 50–70% retention is the baseline: it's what good validation looks like, not a failure mode.

How does Keel handle OOS today?

Keel ships single-window parameter optimization today; native walk-forward and strategy-level holdout splits are on the roadmap. The shipped tools let you define a backtest window and run the strategy on a different window for verification, but they do not automatically partition data or block you from re-tuning on the holdout. The discipline of carving an untouched final window — and of approximating walk-forward by running a series of rolling single-window optimizations — is on you, the operator. The `/lab/walk-forward-visualizer` widget renders per-fold IS/OOS results once you have produced them.

How does OOS relate to walk-forward optimization?

Walk-forward generates many OOS windows by sliding IS/OOS pairs through history; each OOS window is unseen at the time its parameters were selected. The aggregate OOS performance across all walk-forward windows is your strategy's expected behavior under continuous re-optimization. A single train/test split is the simplest form of OOS — one IS window, one OOS window, no walking. Walk-forward generalizes this to multiple windows for stronger validation. Final holdout sits above both: a piece of data the strategy never touches in any optimization or walk-forward step, used once as a final sanity check.
