What OOS means
An out-of-sample (OOS) test is a backtest run on data the strategy was not allowed to see during research. The point is brutally simple: if you fit parameters, picked components, or ran a grid search on the same bars you then evaluate on, your metrics are inflated. Hyperliquid lists roughly 220 perp markets with, at most, a few years of 15-minute history, so any reasonably expressive search will find something that posted a Sharpe of 3 in-sample. OOS asks the only question that matters: does the edge survive bars the model never saw?
The classical split is three windows: train (fit parameters), test (compare candidate strategies and pick one), holdout (run the chosen strategy once, never again, and report whatever number comes out). The holdout window is sacred — if you peek at it and retune, you have just contaminated it and need fresh data. Many researchers cut corners and merge test and holdout into a single OOS window. That is acceptable for a one-off prototype but understates overfit risk when you have tried more than a handful of variants.
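As a rough illustration of the three-window split, the sketch below cuts a chronologically ordered series into train, test, and holdout slices. The 60/20/20 proportions and the function name are assumptions made for the example, not a recommendation from the text above.

```python
from typing import Sequence, Tuple, TypeVar

T = TypeVar("T")


def three_way_split(
    bars: Sequence[T], train_frac: float = 0.6, test_frac: float = 0.2
) -> Tuple[Sequence[T], Sequence[T], Sequence[T]]:
    """Split chronologically ordered bars into train / test / holdout.

    The holdout is whatever remains after train and test, so it is always
    the most recent slice. Fractions here are illustrative defaults.
    """
    if train_frac <= 0 or test_frac <= 0 or train_frac + test_frac >= 1:
        raise ValueError("fractions must be positive and sum to less than 1")
    n = len(bars)
    train_end = round(n * train_frac)
    test_end = train_end + round(n * test_frac)
    return bars[:train_end], bars[train_end:test_end], bars[test_end:]


# Toy data: integers stand in for time-ordered bars.
bars = list(range(1000))
train, test, holdout = three_way_split(bars)
print(len(train), len(test), len(holdout))
```

The slices never overlap and never shuffle, which is the property that matters: every holdout bar comes strictly after every bar used for fitting or selection.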
OOS vs walk-forward
Walk-forward optimization (WFO) is OOS done many times in a rolling window. Train on bars 0-N, evaluate on bars N+1 to N+M, slide forward, repeat. The output is a sequence of OOS Sharpes — one per window — not a single number. WFO is strictly more informative because you can see whether the OOS degradation is stable or trending downward. The cost: it requires enough history for many windows and is several times more compute-intensive.
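The rolling scheme can be sketched as a generator of index windows. This is a minimal illustration under assumed names (Window, walk_forward_windows) and half-open index ranges; it produces only the window boundaries and leaves fitting and evaluation to the caller.

```python
from typing import Iterator, NamedTuple, Optional


class Window(NamedTuple):
    train_start: int
    train_end: int  # exclusive
    test_start: int
    test_end: int   # exclusive


def walk_forward_windows(
    n_bars: int, train_len: int, test_len: int, step: Optional[int] = None
) -> Iterator[Window]:
    """Yield rolling train/test index windows over n_bars of history.

    By default the window slides by test_len, so the OOS segments tile
    the history without overlapping each other.
    """
    if train_len <= 0 or test_len <= 0:
        raise ValueError("train_len and test_len must be positive")
    step = test_len if step is None else step
    if step <= 0:
        raise ValueError("step must be positive")
    start = 0
    while start + train_len + test_len <= n_bars:
        split = start + train_len
        yield Window(start, split, split, split + test_len)
        start += step


windows = list(walk_forward_windows(n_bars=1000, train_len=500, test_len=100))
for w in windows:
    print(w)
```

Each window would yield one OOS Sharpe, and the sequence of those values is what shows whether degradation is stable or trending.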
Practical rule: single-split OOS first as a fast sanity check, WFO before risking capital. If single-split OOS already collapses, you do not need to run WFO — the strategy is overfit and you stop. If single-split OOS holds, WFO tells you whether it holds across different regimes or just got lucky in one window. The two are complements, not substitutes. See walk-forward optimization on Hyperliquid for the full treatment.
HL-specific notes
Hyperliquid's data history breaks the textbook OOS recipe in three ways:
- Newer listings have months, not years. The exchange went live in mid-2023; many of its 220-odd perp markets only have history from late 2024 onward. A 70/30 split on a market with ten months of data leaves a three-month holdout, which is too short to span a meaningful regime. Either drop the market from the OOS universe or accept that your holdout number on that asset is mostly noise.
- Funding-rate cycles are short. A holdout window that spans only one funding regime (all-positive or all-negative across the universe) will look great for a carry strategy that happened to align with that regime and terrible for one that did not. Try to size the holdout to cover at least one full funding cycle (rough heuristic: six months on HL).
- Regime shifts are violent. The 2024-Q4 meme cycle, the early-2025 alt rotation, and the 2026 consolidation are three distinct regimes inside an 18-month window. A single-split OOS landing entirely in one of them is non-representative — this is exactly why WFO is more defensible on HL than on a venue with a decade of mixed history.
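One way to act on the first two notes is to screen the universe by holdout length before running anything. The sketch below is an assumption-laden illustration: the market names are hypothetical, the 30% holdout fraction comes from the 70/30 example, and the six-month minimum reuses the rough funding-cycle heuristic.

```python
from typing import Dict, List, Tuple


def split_universe(
    history_months: Dict[str, float],
    holdout_frac: float = 0.30,
    min_holdout_months: float = 6.0,
) -> Tuple[List[str], List[str]]:
    """Partition markets by whether their holdout window is long enough.

    Returns (keep, drop). A market is dropped when holdout_frac of its
    history is shorter than min_holdout_months.
    """
    keep: List[str] = []
    drop: List[str] = []
    for market, months in sorted(history_months.items()):
        holdout = months * holdout_frac
        (keep if holdout >= min_holdout_months else drop).append(market)
    return keep, drop


# Hypothetical markets: one with two years of history, one with ten months.
history = {"MKT_OLD": 24.0, "MKT_NEW": 10.0}
keep, drop = split_universe(history)
print("keep:", keep)
print("drop:", drop)
```

The ten-month market falls out because its holdout would be about three months, the case described in the first bullet.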
What Keel ships today
At the strategy level, Keel runs a single-window backtest with explicit start- and end-date pickers on the app's backtest screen — and the same backtest is available from the keel-trade CLI for terminal/AI-agent workflows (keel backtest run <strategy-id> --start-date ... --end-date ...). That is the primitive an honest single-split OOS is built on — two backtests over disjoint date ranges.
At the signal level, Keel exposes subsample diagnostics that answer a related but narrower question: does the information content of this signal hold up across different periods and different regimes?
- Time split analysis — splits history chronologically (early vs late by default) and reports IC, t-statistic, observation count, and percent-positive IC per split. If a signal's IC is 0.05 in the early window and 0.00 in the late window, you have a decay problem the aggregate backtest will mask.
- Regime split analysis — splits by volatility regime (high-vol vs low-vol bars) or any user-supplied regime label and reports the same IC breakdown per regime. Useful for catching signals that only work in one volatility environment.
- Conditional IC — IC computed only on bars where a regime indicator is active. Cleanest measure of whether a regime gate would have helped.
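To make the time-split report concrete, here is a minimal sketch of the per-split summary. It is not Keel's implementation. It assumes you already have one IC value per period (for example a cross-sectional rank correlation between signal and forward return), and the function names and the toy series are invented for illustration.

```python
import math
from statistics import mean, stdev
from typing import Dict, Sequence


def summarize_ic(ics: Sequence[float]) -> Dict[str, float]:
    """Mean IC, t-statistic of the mean, observation count, share of positive ICs."""
    n = len(ics)
    if n < 2:
        raise ValueError("need at least two IC observations")
    m = mean(ics)
    s = stdev(ics)  # sample standard deviation
    t_stat = m / (s / math.sqrt(n)) if s > 0 else float("nan")
    pct_positive = sum(1 for x in ics if x > 0) / n
    return {"mean_ic": m, "t_stat": t_stat, "n_obs": n, "pct_positive": pct_positive}


def time_split_ic(ics: Sequence[float], split_frac: float = 0.5) -> Dict[str, Dict[str, float]]:
    """Split a chronological IC series into early and late parts and summarize each."""
    cut = round(len(ics) * split_frac)
    return {"early": summarize_ic(ics[:cut]), "late": summarize_ic(ics[cut:])}


# Toy series: the signal works early and decays to nothing late.
ic_series = [0.08, 0.03, 0.06, 0.04, 0.05, 0.04, 0.02, -0.02, 0.01, -0.01, 0.00, 0.00]
report = time_split_ic(ic_series)
for name, stats in report.items():
    print(name, stats)
```

On this toy series the early half averages an IC of about 0.05 and the late half about 0.00, the decay pattern that an aggregate number would hide.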
What Keel does not ship today is a one-click strategy-level holdout that automatically reserves the last 30% of history, runs the strategy on the train portion, freezes the resulting weights or parameters, then evaluates on the held-out window and reports degradation. That is the right primitive for honest research and it is on the roadmap. Until it ships, do it manually: hold out the most recent 3-6 months by setting the backtest end-date earlier and running a second backtest against the holdout window, then compare the metrics side by side.
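The manual procedure comes down to choosing two disjoint date ranges. The sketch below computes them under stated assumptions: end dates are treated as inclusive, 180 days stands in for a six-month holdout, and the example dates are made up. The date format that the Keel pickers or CLI expect is not specified here, so the printed ISO dates are only a convenience.

```python
from datetime import date, timedelta
from typing import Tuple

DateRange = Tuple[date, date]


def holdout_ranges(
    history_start: date, history_end: date, holdout_days: int = 180
) -> Tuple[DateRange, DateRange]:
    """Return (in_sample, holdout) date ranges that do not overlap.

    Both ranges use inclusive endpoints (an assumption). The holdout is
    the most recent holdout_days of history.
    """
    if holdout_days <= 0:
        raise ValueError("holdout_days must be positive")
    holdout_start = history_end - timedelta(days=holdout_days - 1)
    if holdout_start <= history_start:
        raise ValueError("history is too short for the requested holdout")
    in_sample = (history_start, holdout_start - timedelta(days=1))
    holdout = (holdout_start, history_end)
    return in_sample, holdout


# Illustrative dates only.
in_sample, holdout = holdout_ranges(date(2024, 1, 1), date(2025, 6, 30))
print("in-sample:", in_sample[0].isoformat(), "to", in_sample[1].isoformat())
print("holdout:  ", holdout[0].isoformat(), "to", holdout[1].isoformat())
```

Run the research backtest over the first range, freeze the parameters, then run once over the second range and compare the two sets of metrics.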
OOS Sharpe degradation — what counts as a red flag
Some OOS degradation is expected. A backtest fit on the same bars it is evaluated on overstates Sharpe because the fitting process exploited noise in those bars; the noise does not repeat out-of-sample. A reasonable working rule:
- OOS Sharpe ≥ 0.8 × in-sample — strategy is probably real. Standard noise-driven degradation.
- OOS Sharpe between 0.5× and 0.8× in-sample — borderline. Worth keeping but treat the in-sample number as inflated. Run WFO before sizing it meaningfully.
- OOS Sharpe < 0.5 × in-sample — red flag. The strategy is largely a curve-fit. Either the parameter space was too rich for the available data or the signal does not generalize. Do not deploy capital against this.
- OOS Sharpe negative while in-sample positive — diagnostic certainty. You overfit. The right response is to throw the strategy out, not retune it on the OOS window.
These thresholds are heuristics, not statistical tests. For a real overfit probability number, see PBO (Probability of Backtest Overfitting, Bailey, Borwein, López de Prado and Zhu 2014) and DSR (Deflated Sharpe Ratio, Bailey and López de Prado 2014). Both are on the Keel rigor roadmap; for now, the ratio of OOS Sharpe to in-sample Sharpe is the cheap version of the same idea.
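The working rule above is easy to encode. The sketch below is a direct transcription of the four bands into a function; the function name and the label strings are invented for the example, and it assumes a positive in-sample Sharpe, since the ratio is meaningless otherwise.

```python
def classify_oos(is_sharpe: float, oos_sharpe: float) -> str:
    """Map an (in-sample, out-of-sample) Sharpe pair to the heuristic bands."""
    if is_sharpe <= 0:
        raise ValueError("the rule assumes a positive in-sample Sharpe")
    if oos_sharpe < 0:
        return "overfit"        # negative OOS against positive in-sample
    ratio = oos_sharpe / is_sharpe
    if ratio >= 0.8:
        return "probably real"  # standard noise-driven degradation
    if ratio >= 0.5:
        return "borderline"     # run WFO before sizing meaningfully
    return "red flag"           # largely a curve-fit


# Illustrative pairs, one per band.
for is_s, oos_s in [(2.0, 1.8), (2.0, 1.2), (2.0, 0.6), (2.0, -0.5)]:
    print(is_s, oos_s, classify_oos(is_s, oos_s))
```

The cut points are the same heuristics as in the list, so the output deserves the same caution: it is a triage label, not a significance test.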
Try it
Single-split OOS is one extra backtest with a different date range. Walk-forward is the rolling version — open the visualizer to see how rolling IS/OOS windows behave on a series you upload.