Out-of-sample testing for Hyperliquid strategies

Out-of-sample testing reserves a slice of history the strategy never saw during research. Train/test/holdout is the simplest form; walk-forward is the rolling version. Keel ships signal-level subsample diagnostics today — strategy-level holdout splits are on the roadmap.

By Keel Research Team · Updated May 18, 2026

What OOS means

An out-of-sample (OOS) test is a backtest run on data the strategy was not allowed to see during research. The point is brutally simple: if you fit parameters, picked components, or ran a grid search on the same bars you then evaluate on, your metrics are inflated. Hyperliquid lists roughly 220 perp markets with a few years of 15-minute history, so any reasonably expressive search will find something that posted a Sharpe of 3 in-sample. OOS asks the only question that matters: does the edge survive bars the model never saw?

The classical split is three windows: train (fit parameters), test (compare candidate strategies and pick one), holdout (run the chosen strategy once, never again, and report whatever number comes out). The holdout window is sacred — if you peek at it and retune, you have just contaminated it and need fresh data. Many researchers cut corners and merge test and holdout into a single OOS window. That is acceptable for a one-off prototype but understates overfit risk when you have tried more than a handful of variants.
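The three-window split can be sketched in a few lines. This is a minimal illustration, not Keel functionality: the function name `three_way_split` and the 60/20/20 default fractions are assumptions chosen for the example, and the bars are assumed to be sorted chronologically already.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def three_way_split(
    bars: Sequence[T],
    train_frac: float = 0.6,
    test_frac: float = 0.2,
) -> Tuple[List[T], List[T], List[T]]:
    """Split chronologically ordered bars into train, test, and holdout.

    The default fractions are illustrative only. Whatever is left after
    train and test becomes the holdout.
    """
    if not 0 < train_frac < 1 or not 0 < test_frac < 1:
        raise ValueError("fractions must be between 0 and 1")
    if train_frac + test_frac >= 1:
        raise ValueError("train_frac + test_frac must leave room for a holdout")

    n = len(bars)
    train_end = int(n * train_frac)
    test_end = int(n * (train_frac + test_frac))

    # Slices are contiguous and non-overlapping, so no bar lands in two windows.
    return (
        list(bars[:train_end]),
        list(bars[train_end:test_end]),
        list(bars[test_end:]),
    )


# Stand-in data: 100 bar indices instead of real candles.
train, test, holdout = three_way_split(list(range(100)))
print(len(train), len(test), len(holdout))
```

The split is by position, never by random sampling, so every holdout bar is later than every test bar, which in turn is later than every train bar.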

OOS vs walk-forward

Walk-forward optimization (WFO) is OOS done many times in a rolling window. Train on bars 0-N, evaluate on bars N+1 to N+M, slide forward, repeat. The output is a sequence of OOS Sharpes — one per window — not a single number. WFO is strictly more informative because you can see whether the OOS degradation is stable or trending downward. The cost: it requires enough history for many windows and is several times more compute-intensive.
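As a rough sketch of the rolling scheme, the generator below yields train and test index ranges. The name `walk_forward_windows` and the choice to slide by exactly one test window per step are assumptions for illustration; other step sizes are equally valid.

```python
from typing import Iterator, Tuple


def walk_forward_windows(
    n_bars: int, train_len: int, test_len: int
) -> Iterator[Tuple[range, range]]:
    """Yield (train, test) index ranges for a rolling walk-forward.

    Each step trains on `train_len` bars, evaluates on the next `test_len`
    bars, then slides forward by `test_len` so OOS windows never overlap.
    """
    if train_len <= 0 or test_len <= 0:
        raise ValueError("window lengths must be positive")
    start = 0
    while start + train_len + test_len <= n_bars:
        train = range(start, start + train_len)
        test = range(start + train_len, start + train_len + test_len)
        yield train, test
        start += test_len


windows = list(walk_forward_windows(n_bars=1000, train_len=600, test_len=100))
for train, test in windows:
    print(f"train {train.start}-{train.stop - 1}, test {test.start}-{test.stop - 1}")
```

Each yielded pair produces one OOS Sharpe, so the number of windows is the length of the degradation sequence you get to inspect.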

Practical rule: single-split OOS first as a fast sanity check, WFO before risking capital. If single-split OOS already collapses, you do not need to run WFO — the strategy is overfit and you stop. If single-split OOS holds, WFO tells you whether it holds across different regimes or just got lucky in one window. The two are complements, not substitutes. See walk-forward optimization on Hyperliquid for the full treatment.

HL-specific notes

Hyperliquid's data history breaks the textbook OOS recipe in three ways:

  • Newer listings have months, not years. The exchange went live in mid-2023; many of its 220-odd perp markets only have history from late 2024 onward. A 70/30 split on a market with ten months of data leaves a three-month holdout, which is too short to span a meaningful regime. Either drop the market from the OOS universe or accept that your holdout number on that asset is mostly noise.
  • Funding-rate cycles are short. A holdout window that spans only one funding regime (all-positive or all-negative across the universe) will look great for a carry strategy that happened to align with that regime and terrible for one that did not. Try to size the holdout to cover at least one full funding cycle (rough heuristic: six months on HL).
  • Regime shifts are violent. The 2024-Q4 meme cycle, the early-2025 alt rotation, and the 2026 consolidation are three distinct regimes inside an 18-month window. A single-split OOS landing entirely in one of them is non-representative — this is exactly why WFO is more defensible on HL than on a venue with a decade of mixed history.
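The arithmetic behind the first two notes is easy to automate. The sketch below computes the holdout length implied by a fractional split and drops markets whose holdout would fall under a minimum. The function names and tickers are hypothetical, and the six-month floor simply reuses the rough funding-cycle heuristic above.

```python
from typing import Dict, List


def holdout_months(history_months: float, holdout_frac: float = 0.3) -> float:
    """Length of the holdout, in months, for a fractional chronological split."""
    return history_months * holdout_frac


def markets_with_usable_holdout(
    history_by_market: Dict[str, float],
    holdout_frac: float = 0.3,
    min_holdout_months: float = 6.0,
) -> List[str]:
    """Keep markets whose holdout would span at least `min_holdout_months`."""
    return [
        market
        for market, months in history_by_market.items()
        if holdout_months(months, holdout_frac) >= min_holdout_months
    ]


# Hypothetical tickers and history lengths, for illustration only.
history = {"OLD-PERP": 30.0, "NEW-PERP": 10.0}
print(holdout_months(10.0))  # ten months at 70/30 leaves about three months
print(markets_with_usable_holdout(history))
```

A filter like this only addresses window length. It says nothing about which regime the holdout happens to land in, which is the part WFO is meant to cover.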

What Keel ships today

At the strategy level, Keel runs a single-window backtest with explicit start- and end-date pickers on the app's backtest screen — and the same backtest is available from the keel-trade CLI for terminal/AI-agent workflows (keel backtest run <strategy-id> --start-date ... --end-date ...). That is the primitive an honest single-split OOS is built on — two backtests over disjoint date ranges.

At the signal level, Keel exposes subsample diagnostics that answer a related but narrower question: does the information content of this signal hold up across different periods and different regimes?

  • Time split analysis — splits history chronologically (early vs late by default) and reports IC, t-statistic, observation count, and percent-positive IC per split. If a signal's IC is 0.05 in the early window and 0.00 in the late window, you have a decay problem the aggregate backtest will mask.
  • Regime split analysis — splits by volatility regime (high-vol vs low-vol bars) or any user-supplied regime label and reports the same IC breakdown per regime. Useful for catching signals that only work in one volatility environment.
  • Conditional IC — IC computed only on bars where a regime indicator is active. Cleanest measure of whether a regime gate would have helped.
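To make the time-split report concrete, here is a minimal sketch of the four statistics per split. It is not Keel's implementation: it assumes you already have one IC value per bar in chronological order, uses a plain half/half split, and computes the t-statistic as mean IC over its standard error.

```python
import math
import statistics
from typing import Dict, Sequence


def ic_summary(ics: Sequence[float]) -> Dict[str, float]:
    """Mean IC, t-statistic of the mean, observation count, share of positive ICs."""
    n = len(ics)
    if n < 2:
        raise ValueError("need at least two IC observations")
    mean = statistics.fmean(ics)
    stdev = statistics.stdev(ics)  # sample standard deviation
    # t-stat of the mean IC against zero; undefined when there is no dispersion.
    t_stat = mean / (stdev / math.sqrt(n)) if stdev > 0 else float("nan")
    pct_positive = sum(1 for x in ics if x > 0) / n
    return {"mean_ic": mean, "t_stat": t_stat, "n": n, "pct_positive": pct_positive}


def time_split_ic(ics: Sequence[float]) -> Dict[str, Dict[str, float]]:
    """Split a chronologically ordered IC series in half and summarize each half."""
    mid = len(ics) // 2
    return {"early": ic_summary(ics[:mid]), "late": ic_summary(ics[mid:])}


# Made-up per-bar ICs: clearly positive early, centered on zero late.
series = [0.06, 0.04, 0.07, 0.03, 0.05, 0.01, -0.01, 0.02, -0.02, 0.00]
report = time_split_ic(series)
for name, stats in report.items():
    print(name, {k: round(v, 3) for k, v in stats.items()})
```

A regime split works the same way, except the series is partitioned by a regime label rather than by position in time.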

What Keel does not ship today is a one-click strategy-level holdout that automatically reserves the last 30% of history, runs the strategy on the train portion, freezes the resulting weights or parameters, then evaluates on the held-out window and reports degradation. That is the right primitive for honest research and it is on the roadmap. Until it ships, do it manually: hold out the most recent 3-6 months by setting the backtest end-date earlier and running a second backtest against the holdout window, then compare the metrics side by side.
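If you script the manual version, the only fiddly part is keeping the two date ranges disjoint. The helper below is a hypothetical sketch that derives both ranges from a history window and a holdout length in days. It assumes start and end dates are inclusive, which you should verify against how the backtest date pickers actually behave.

```python
from datetime import date, timedelta
from typing import Tuple

DateRange = Tuple[date, date]


def manual_holdout_ranges(
    history_start: date, history_end: date, holdout_days: int
) -> Tuple[DateRange, DateRange]:
    """Return (research_range, holdout_range) as inclusive, disjoint date ranges."""
    if holdout_days <= 0:
        raise ValueError("holdout_days must be positive")
    holdout_start = history_end - timedelta(days=holdout_days - 1)
    research_end = holdout_start - timedelta(days=1)
    if research_end < history_start:
        raise ValueError("holdout must be shorter than the available history")
    return (history_start, research_end), (holdout_start, history_end)


# Illustrative dates only; 120 days is roughly four months.
research, holdout = manual_holdout_ranges(date(2024, 8, 15), date(2026, 2, 27), 120)
print("research:", research[0].isoformat(), "to", research[1].isoformat())
print("holdout: ", holdout[0].isoformat(), "to", holdout[1].isoformat())
```

Run the first backtest on the research range, freeze everything, then run the second on the holdout range exactly once.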

OOS Sharpe degradation — what counts as a red flag

Some OOS degradation is expected. A backtest fit on the same bars it is evaluated on overstates Sharpe because the fitting process exploited noise in those bars; the noise does not repeat out-of-sample. A reasonable working rule:

  • OOS Sharpe ≥ 0.8 × in-sample — strategy is probably real. Standard noise-driven degradation.
  • OOS Sharpe between 0.5× and 0.8× in-sample — borderline. Worth keeping but treat the in-sample number as inflated. Run WFO before sizing it meaningfully.
  • OOS Sharpe < 0.5 × in-sample — red flag. The strategy is largely a curve-fit. Either the parameter space was too rich for the available data or the signal does not generalize. Do not deploy capital against this.
  • OOS Sharpe negative while in-sample positive — diagnostic certainty. You overfit. The right response is to throw the strategy out, not retune it on the OOS window.
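The buckets above translate directly into a small classifier. The function name and the returned labels are illustrative, and the handling of a non-positive in-sample Sharpe is an added assumption, since the ratio has no meaning in that case.

```python
def classify_oos_degradation(is_sharpe: float, oos_sharpe: float) -> str:
    """Map an in-sample/OOS Sharpe pair onto the heuristic buckets."""
    if is_sharpe <= 0:
        # The ratio is meaningless without a positive in-sample Sharpe.
        return "undefined"
    if oos_sharpe < 0:
        # Positive in-sample, negative out-of-sample.
        return "overfit"
    ratio = oos_sharpe / is_sharpe
    if ratio >= 0.8:
        return "probably real"
    if ratio >= 0.5:
        return "borderline"
    return "red flag"


for is_s, oos_s in [(2.0, 1.8), (2.0, 1.2), (2.0, 0.6), (2.0, -0.3)]:
    print(is_s, oos_s, classify_oos_degradation(is_s, oos_s))
```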

These thresholds are heuristics, not statistical tests. For an actual estimate of overfit probability, see PBO (Probability of Backtest Overfitting; Bailey, Borwein, Lopez de Prado, and Zhu, 2014) and DSR (Deflated Sharpe Ratio; Bailey and Lopez de Prado, 2014). Both are on the Keel rigor roadmap; for now, the OOS-vs-IS Sharpe ratio is the cheap version of the same idea.

Try it

Single-split OOS is one extra backtest with a different date range. Walk-forward is the rolling version — open the visualizer to see how rolling IS/OOS windows behave on a series you upload.

FAQ

Common questions

How much out-of-sample data is enough?

Rule of thumb: hold back at least 20-30% of the available history, and never less than one full market regime. On Hyperliquid, where the average market has under two years of history, that often means a holdout of six to nine months. Less than that and OOS results are dominated by whatever regime happened to fall in the test window.

What about k-fold cross-validation?

Plain k-fold is the wrong tool for time-series strategies. Shuffled or not, it trains on bars that come after the test fold, which leaks future information into the past, and shuffling additionally destroys autocorrelation structure. If you want fold-style validation, use forward-chaining splits, purged k-fold with an embargo (per Lopez de Prado), or walk-forward. Never use plain shuffled k-fold.
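A minimal sketch of forward-chaining folds is below. It is deliberately simpler than purged k-fold: training always precedes testing, the training window expands with each fold, and a fixed `gap` of bars is dropped between the two as a crude stand-in for purging. The function name and parameters are assumptions for illustration.

```python
from typing import Iterator, Tuple


def forward_chaining_folds(
    n_bars: int, n_folds: int, gap: int = 0
) -> Iterator[Tuple[range, range]]:
    """Yield (train, test) index ranges where training always precedes testing.

    The series is cut into `n_folds + 1` equal blocks. Fold k trains on the
    first k blocks (minus `gap` bars at the end) and tests on block k + 1.
    """
    if n_folds < 1 or gap < 0:
        raise ValueError("n_folds must be >= 1 and gap must be >= 0")
    block = n_bars // (n_folds + 1)
    if block == 0 or gap >= block:
        raise ValueError("not enough bars for this many folds and this gap")
    for k in range(1, n_folds + 1):
        test_start = k * block
        # The last fold absorbs any remainder bars.
        test_stop = n_bars if k == n_folds else test_start + block
        yield range(0, test_start - gap), range(test_start, test_stop)


folds = list(forward_chaining_folds(n_bars=100, n_folds=4, gap=5))
for train, test in folds:
    print(f"train 0-{train.stop - 1}, test {test.start}-{test.stop - 1}")
```

The gap matters when labels or features look ahead or back over several bars; without it, the last training bars can overlap in information with the first test bars.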

OOS vs walk-forward — which should I run?

OOS is one split: train on the first 70%, test on the last 30%. Walk-forward is many splits done in a rolling window. OOS is faster and gives a single number; WFO is slower and gives a degradation curve. Use OOS first as a sanity check; use WFO before deploying capital. They answer related but different questions.

When will strategy-level OOS ship in Keel?

On the roadmap, alongside walk-forward optimization. Today Keel exposes time and regime subsample diagnostics at the signal level — early/late period IC consistency, high-volatility vs low-volatility regime IC, and conditional IC. A first-class strategy-level holdout split is planned but not yet shipped.

How do I do strategy-level OOS in Keel today?

Hold out the most recent 3-6 months by setting the backtest end-date earlier in the Keel app and running a second backtest against the holdout window. For example, research on 2024-08-15 to 2025-12-31, then run the same strategy unchanged on 2026-01-01 to 2026-02-27 and compare metrics (that example holdout is just under two months, so extend it toward the 3-6 month target when the history allows). Two backtests with explicit start/end dates give you an honest single-split OOS, and you can drive the same two runs from the keel-trade CLI if you script it. Crude but real.