
How to Backtest a Trading Strategy

A backtest simulates a trading strategy on historical data and reports the total return, Sharpe ratio, max drawdown, and win rate the strategy would have produced had it been live. Done honestly, it filters out the worst ideas before they cost real money. Done poorly, it gives false confidence in strategies that will fail out-of-sample.

By Keel Research Team · Updated May 13, 2026

Backtesting is the simulation of a strategy on historical data: run the strategy across past prices, record what it would have done, and compute performance metrics. The output is a hypothesis: "if the future resembles the past, this strategy should produce roughly these returns at roughly this risk level."

The word hypothesis is doing work in that sentence. Backtests aren't predictions; they're filters. A strategy that fails in backtest will almost certainly fail live. A strategy that succeeds in backtest might succeed live — but only if the backtest was constructed honestly and the future actually resembles the past. Both are non-trivial conditions.

What a backtest actually measures

The metrics that matter, in priority order:

  • Max drawdown — the worst peak-to-trough loss. The only metric that captures the path-dependent experience of holding the strategy through bad periods. See /learn/max-drawdown.
  • Sharpe / Sortino ratio — risk-adjusted return on average. Sharpe punishes all volatility; Sortino punishes only downside. Pick based on strategy shape. See /learn/sortino-vs-sharpe.
  • Total return — the headline number. Useful for context but easily misleading without risk-adjustment.
  • Win rate + win/loss ratio — the inputs for sizing analysis (Kelly criterion). Two strategies can have the same Sharpe with very different win/loss profiles.
  • Number of trades — sample size for statistical reliability. Under 100 trades makes most metrics unreliable.
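As a rough illustration of how the metrics above can be computed, here is a minimal Python sketch. It assumes simple per-period returns, a zero risk-free rate, and daily bars annualized over 365 days (crypto trades every day). All function names are illustrative and are not part of any Keel API.

```python
import math


def total_return(returns):
    """Compound simple per-period returns into one cumulative return."""
    equity = 1.0
    for r in returns:
        equity *= 1.0 + r
    return equity - 1.0


def max_drawdown(returns):
    """Worst peak-to-trough loss of the compounded equity curve (positive fraction)."""
    equity = peak = 1.0
    worst = 0.0
    for r in returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = max(worst, 1.0 - equity / peak)
    return worst


def sharpe(returns, periods_per_year=365):
    """Annualized mean return over sample standard deviation (risk-free rate assumed 0)."""
    n = len(returns)
    mean = sum(returns) / n
    variance = sum((r - mean) ** 2 for r in returns) / (n - 1)
    return mean / math.sqrt(variance) * math.sqrt(periods_per_year)


def sortino(returns, periods_per_year=365):
    """Like Sharpe, but only returns below zero contribute to the deviation."""
    n = len(returns)
    mean = sum(returns) / n
    downside = math.sqrt(sum(min(r, 0.0) ** 2 for r in returns) / n)
    if downside == 0.0:
        return math.inf  # no losing periods in the sample
    return mean / downside * math.sqrt(periods_per_year)


def win_stats(trade_pnls):
    """Return (win rate, average win / average loss). Needs at least one win and one loss."""
    wins = [p for p in trade_pnls if p > 0]
    losses = [-p for p in trade_pnls if p < 0]
    win_rate = len(wins) / len(trade_pnls)
    win_loss_ratio = (sum(wins) / len(wins)) / (sum(losses) / len(losses))
    return win_rate, win_loss_ratio


def kelly_fraction(win_rate, win_loss_ratio):
    """Kelly criterion for a binary bet: f = p - (1 - p) / b."""
    return win_rate - (1.0 - win_rate) / win_loss_ratio


# Toy data, not real results.
returns = [0.10, -0.20, 0.05, 0.10]
print("total return:", round(total_return(returns), 4))
print("max drawdown:", round(max_drawdown(returns), 4))
print("sharpe:", round(sharpe(returns), 2), "sortino:", round(sortino(returns), 2))

win_rate, ratio = win_stats([2.0, -1.0, 3.0, -1.0])
print("win rate:", win_rate, "win/loss:", ratio, "kelly:", round(kelly_fraction(win_rate, ratio), 2))
```

In the toy series the equity curve peaks at 1.10 and falls to 0.88, so the max drawdown is 20% even though the total return ends slightly positive. That gap is why drawdown sits ahead of total return in the list.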

Beyond the headline metrics, the equity curve itself matters. A smooth equity curve with occasional small drawdowns is psychologically very different from a curve with two big drawdowns separated by long flat periods — even if both reach the same total return.

The four biggest backtest mistakes

  1. Look-ahead bias. Using information that wasn't available at decision time. Classic example: using the close-of-bar price to enter mid-bar. A subtler example: ranking assets by a return that was not yet known at ranking time. Any look-ahead in your code makes backtest performance fake.
  2. Survivorship bias. Testing only on assets that exist today, missing the ones that delisted or failed. Crypto is brutal here — many altcoins from 2018 don't exist anymore. A strategy that ignored them in backtest will overestimate live performance.
  3. Overfitting. Tuning parameters until the backtest looks great. The strategy is now optimized for the specific historical sample, not for the underlying market dynamics. Out-of-sample performance degrades sharply. The cure: walk-forward optimization, parameter-sensitivity analysis, and limiting the parameters you actually optimize.
  4. Ignoring costs. Real trading pays fees, slippage, and funding (on perps). Backtests that omit these can inflate returns by 5-30% or more, depending on strategy frequency. Always model realistic costs.
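Two of these mistakes, look-ahead bias and ignored costs, are easy to see in a few lines of code. The sketch below is a deliberately tiny bar-by-bar backtest in plain Python. The `backtest` function, its `lag` switch, and the flat `fee_rate` per unit of turnover are assumptions made for illustration; slippage and funding are left out, and this is not how any particular platform models costs.

```python
def backtest(closes, signals, fee_rate=0.0005, lag=1):
    """Per-bar strategy returns for a target-position signal.

    closes[i]  : close price of bar i
    signals[i] : target position (-1, 0, +1) computed from data up to close i
    lag=1      : the return from close i-1 to close i is earned with signals[i-1],
                 which was already known when the bar started (honest)
    lag=0      : the same return is earned with signals[i], which needs close i
                 itself (look-ahead)
    fee_rate   : cost charged per unit of position change (simplified cost model)
    """
    strategy_returns = []
    prev_pos = 0.0
    for i in range(1, len(closes)):
        bar_return = closes[i] / closes[i - 1] - 1.0
        j = i - lag
        pos = float(signals[j]) if j >= 0 else 0.0
        cost = fee_rate * abs(pos - prev_pos)
        strategy_returns.append(pos * bar_return - cost)
        prev_pos = pos
    return strategy_returns


# Toy prices. The signal is the sign of each bar's own move, so it is only
# known once that bar has closed.
closes = [100.0, 102.0, 101.0, 104.0, 103.0]
signals = [0, 1, -1, 1, -1]

cheating = backtest(closes, signals, fee_rate=0.0, lag=0)
honest = backtest(closes, signals, fee_rate=0.0, lag=1)
honest_with_fees = backtest(closes, signals, fee_rate=0.001, lag=1)

print("look-ahead:", round(sum(cheating), 4))
print("honest, no costs:", round(sum(honest), 4))
print("honest, with fees:", round(sum(honest_with_fees), 4))
```

With look-ahead the toy strategy wins on every bar, because it is trading on the move it is supposed to predict. Lagging the signal by one bar turns the same rule into a loser, and charging for turnover lowers the result further.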

In-sample vs out-of-sample

The single most important defense against overfitting is splitting your data.

  • In-sample (IS): the data you used to develop the strategy. You chose signals, tuned parameters, and picked the universe on this data.
  • Out-of-sample (OOS): data the strategy never saw during development. Apply the frozen strategy to this data and compare its performance to IS.

If OOS performance is comparable to IS, that is evidence the strategy generalizes. If OOS performance degrades substantially, the strategy is probably overfit. A common split is 70% IS and 30% OOS, though the right ratio depends on sample size.
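A minimal sketch of such a split, assuming the bars are already sorted oldest to newest. For time series the split should be chronological (no shuffling), so that the OOS segment lies entirely after the IS segment. The helper names and the `degradation` measure are illustrative choices, not a standard definition.

```python
def chronological_split(rows, is_fraction=0.7):
    """Split time-ordered rows into (in_sample, out_of_sample) without shuffling."""
    if not 0.0 < is_fraction < 1.0:
        raise ValueError("is_fraction must be between 0 and 1")
    cut = round(len(rows) * is_fraction)
    return rows[:cut], rows[cut:]


def degradation(is_metric, oos_metric):
    """Fraction of the in-sample metric lost out-of-sample (assumes is_metric > 0)."""
    return 1.0 - oos_metric / is_metric


# Toy example: ten bars labeled by their time index.
bars = list(range(10))
in_sample, out_of_sample = chronological_split(bars)
print(in_sample, out_of_sample)

# Hypothetical Sharpe ratios, for illustration only.
print(degradation(is_metric=2.0, oos_metric=1.5))
```

How much degradation counts as "substantial" is a judgment call. The useful habit is to freeze the strategy before looking at the OOS segment and to look at it once.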

Walk-forward optimization extends this: instead of one fixed split, walk through time in chunks (e.g. 6-month optimization → 3-month OOS validation → roll forward → repeat). Strategies that survive walk-forward have demonstrated robustness across multiple regime shifts. See /learn/walk-forward-optimization.
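The rolling scheme can be sketched as a window generator plus a loop. This is a simplified illustration: the units are whatever the series is indexed in (months in the example), and `fit` and `evaluate` are hypothetical callables standing in for your own optimizer and scorer.

```python
def walk_forward_windows(n_bars, train_len, test_len):
    """Yield (train_start, train_end, test_start, test_end), end-exclusive.

    Each test window starts where its training window ends, and the whole
    scheme rolls forward by test_len so test windows never overlap.
    """
    start = 0
    while start + train_len + test_len <= n_bars:
        train_end = start + train_len
        yield start, train_end, train_end, train_end + test_len
        start += test_len


def walk_forward(series, train_len, test_len, fit, evaluate):
    """Optimize on each training window, then score the frozen result on the next test window."""
    results = []
    for tr_start, tr_end, te_start, te_end in walk_forward_windows(
        len(series), train_len, test_len
    ):
        params = fit(series[tr_start:tr_end])
        results.append(evaluate(series[te_start:te_end], params))
    return results


# 18 months of data, 6-month optimization, 3-month validation.
windows = list(walk_forward_windows(18, train_len=6, test_len=3))
print(windows)

# Toy check that every test window lies strictly after its training window:
# "fit" remembers the latest month it saw, "evaluate" compares against it.
months = list(range(18))
no_overlap = walk_forward(
    months, 6, 3,
    fit=lambda train: max(train),
    evaluate=lambda test, latest_seen: min(test) > latest_seen,
)
print(no_overlap)
```

The per-window results are then read together: consistent OOS scores across windows are more convincing than one strong window and several weak ones.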

Backtest properly on Keel

Keel handles the structural mistakes automatically:

  • Component system makes look-ahead errors hard to write — you can't access future data in a signal definition.
  • Universe selection includes delisted assets in historical mode (no survivorship bias).
  • Fees, slippage, and funding are modeled by default (configurable).
  • Every backtest returns standard errors on key metrics plus a per-regime sub-period performance breakdown.
  • Walk-forward optimization is built-in for parameter tuning.

To get started: open the lab to find candidate assets via a screen, click "Backtest in Keel" to take that state into a workspace, then configure entry/exit logic. Or fork one of the documented templates in /strategies as a starting point — the funding-carry template has a real 20-month backtest with Sharpe 2.17 you can inspect and modify.

This article is educational. Backtest results are not predictions of future performance. Always paper-trade or run small-size live before scaling capital.

Automate it

Trade systematically on Keel

Keel is a Strategy OS for AI-assisted systematic trading on Hyperliquid. Backtest, optimize, and run live strategies across single-stock perps, indices, and crypto majors — realistic fees, slippage, and funding modeled.

Free to start — connect a Hyperliquid wallet when you’re ready to go live.

What you can do

  • Backtest any strategy with realistic fees, slippage, and funding.
  • Optimize parameter grids by Sharpe, drawdown, hit rate.
  • Deploy live to Hyperliquid with stops, position limits, and funding-aware execution.
  • Iterate with AI — describe a thesis, get a tradeable pipeline.

FAQ

Backtesting — questions

What is a backtest?

A backtest is a simulation of a trading strategy on historical data. It returns metrics such as total return, Sharpe ratio, max drawdown, and win rate: the performance the strategy would have had if it had been live across the test period. Backtests are not predictions; they're sanity checks. A strategy that performs poorly in backtest will almost certainly perform poorly live; the converse is not guaranteed.

What are the biggest mistakes that produce fake backtest performance?

Four killers. (1) Look-ahead bias — using information that wasn't available at the time of the trade (e.g. using closing price to decide to enter intraday). (2) Survivorship bias — backtesting only on assets that exist today, missing delisted/failed ones. (3) Overfitting — tuning parameters until the historical sample looks great; the strategy fails out-of-sample. (4) Ignoring costs — leaving out fees, slippage, or funding produces inflated returns that vanish live.

How long should a backtest sample be?

Long enough to include multiple market regimes. For crypto, use at least 1 year covering at least one trend, range, and drawdown cycle; 2-3 years is better. Lower-frequency strategies (multi-day holds) need longer calendar samples, because roughly 100 trades is the minimum for parameter estimates to be reliable.
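One way to see why the trade count matters is the standard error of a win rate. The sketch below uses the normal approximation for a binomial proportion and assumes trades are independent, which real trades often are not, so treat the intervals as optimistic. The function name is illustrative.

```python
import math


def win_rate_interval(wins, trades, z=1.96):
    """Return (win rate, standard error, approximate 95% interval).

    Uses se = sqrt(p * (1 - p) / n), the normal approximation for a
    binomial proportion, assuming independent trades.
    """
    p = wins / trades
    se = math.sqrt(p * (1.0 - p) / trades)
    return p, se, (p - z * se, p + z * se)


for wins, trades in [(55, 100), (550, 1000)]:
    p, se, (low, high) = win_rate_interval(wins, trades)
    print(f"{trades} trades: win rate {p:.2f}, se {se:.3f}, interval {low:.3f} to {high:.3f}")
```

At 100 trades, a measured 55% win rate has an interval that still includes 50%, so it cannot be told apart from a coin flip. At 1,000 trades the same win rate clears 50%.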

What's the difference between in-sample and out-of-sample testing?

In-sample is the data you used to develop the strategy (chose parameters, picked signals). Out-of-sample is fresh data the strategy never saw. A strategy that works in-sample but fails out-of-sample is overfit. Good practice: split your data 70/30, develop on 70%, validate on 30%. Walk-forward optimization is an extension that re-validates as you walk through time.

How do I check if my strategy is overfit?

Three tests. (1) Run on multiple in-sample sub-periods — if performance varies wildly, the strategy is regime-dependent. (2) Try the strategy on out-of-sample data (or paper-trade it forward) — substantial degradation is the overfit signature. (3) Parameter sensitivity — vary each parameter slightly; if performance collapses, you're at a fragile peak in the parameter surface. Robust strategies have wide profitable plateaus, not narrow spikes.
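The third test, parameter sensitivity, can be sketched as a one-at-a-time perturbation. Everything here is hypothetical: `score_fn` stands in for "run the backtest with these parameters and return a metric such as Sharpe", and the two toy score functions only illustrate a plateau versus a spike.

```python
def sensitivity(score_fn, params, bump=0.10):
    """Worst score after nudging each parameter by +/- bump, relative to the base score.

    Values near 1.0 suggest a plateau; values near 0 suggest a fragile peak.
    Assumes the base score is positive. Nudged values are floats, so integer
    parameters (such as lookbacks) would need rounding in a real backtest.
    """
    base = score_fn(**params)
    worst = base
    for name, value in params.items():
        for factor in (1.0 - bump, 1.0 + bump):
            trial = dict(params)
            trial[name] = value * factor
            worst = min(worst, score_fn(**trial))
    return worst / base


def plateau(lookback, threshold):
    """Toy score that changes slowly around the chosen parameters."""
    return 1.5 - 0.001 * abs(lookback - 20) - 0.1 * abs(threshold - 1.0)


def spike(lookback, threshold):
    """Toy score that only works at one exact lookback."""
    return 2.0 if abs(lookback - 20) < 0.5 else 0.1


chosen = {"lookback": 20, "threshold": 1.0}
print("plateau:", round(sensitivity(plateau, chosen), 3))
print("spike:", round(sensitivity(spike, chosen), 3))
```

The spike has the higher headline score at the chosen parameters, yet it keeps only a small fraction of it after a 10% nudge. The plateau keeps almost all of its score, which is the profile to prefer.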

Does Keel handle these properly?

Yes. Keel backtests model fees, slippage, and funding by default. The component system makes look-ahead errors hard to write (you can't access future data). Walk-forward optimization is available for the parameter-tuning workflow. The strategy registry includes win-rate standard error and per-regime sub-period breakdowns. Every backtest run gets a permanent share URL for reproducibility.