LLM strategy generation: idea → validated backtest

LLMs are great at proposing trading strategies and terrible at validating them. The workflow splits the labor: LLM does ideation, composition, and summarization; a deterministic engine does the math, the funding accrual, the fees, the slippage. This page is that workflow, end-to-end.

By Keel Research Team · Updated May 20, 2026

The seduction and the trap

An LLM can propose a trading strategy in seconds. "Momentum top-30 by 30-day return, vol-targeted at 10% annualized, rebalanced weekly." It will produce a thesis paragraph that reads like a CIO memo and a Python file that looks like a backtest. The numbers it prints — Sharpe 2.3, +94% return, max DD -11% — feel real. The strategy feels real.

The trap is that almost none of those numbers were actually computed. The Python often uses a Sharpe formula that drops the risk-free term or annualizes wrong. It assumes midpoint fills. It silently uses fees the LLM made up. It ignores funding payments, even on a perpetual-futures strategy where funding is first-order P&L. It loads a CSV the LLM has never seen and computes returns over a window whose outcome the LLM already knew from training data when it wrote the script. The strategy reads as alpha; it is a vibe.

This is not a knock on LLMs — it is a knock on using them as the engine. LLMs are language models. They produce plausible-looking text, including plausible-looking code and plausible-looking numbers. They are not deterministic mathematical engines, and asking them to be one is asking the wrong question.

The split of labor

Keel is a quantitative trading platform for Hyperliquid. The architecture splits the work between agent and engine — agents handle composition, summarization, and code; the engine handles the deterministic math, funding accrual, and execution.

LLM does: ideation, hypothesis framing, library search, graph composition, metric interpretation, iteration proposals, summary writing. Anything that benefits from broad knowledge, natural-language reasoning, and judgment about what to try next. The LLM proposes a structure and reads back results.

Engine does: deterministic math, real Hyperliquid price history, hourly funding accrual on every open position, exchange-accurate fee schedules, modeled slippage, point-in-time universe, bit-for-bit parity between the compiled backtest artifact and the artifact that runs live. Anything where a number has to be exact and reproducible.
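
Hourly funding accrual is plain arithmetic, which is exactly why it belongs in the engine rather than in generated prose. A minimal sketch, assuming the usual perp convention that positive funding means longs pay shorts, and that each payment is signed position size × price × hourly rate. FundingTick and funding_pnl are hypothetical names for illustration, not Keel's API, and the settlement price the real engine uses is not specified here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FundingTick:
    """One hourly funding settlement for a single perp (hypothetical shape)."""
    price: float         # price used to value the position at settlement (assumed)
    funding_rate: float  # hourly rate as a fraction, e.g. 0.0001 = 1 bp


def funding_pnl(position_size: float, ticks: list[FundingTick]) -> float:
    """Sum funding cash flows for a constant position over hourly ticks.

    position_size is signed: positive = long, negative = short.
    Assumed convention: a positive funding rate means longs pay shorts.
    """
    pnl = 0.0
    for tick in ticks:
        notional = position_size * tick.price
        pnl -= notional * tick.funding_rate  # longs pay when the rate is positive
    return pnl


# 24 hourly settlements at a flat rate and flat price, 2 units of exposure.
day = [FundingTick(price=100.0, funding_rate=0.0001)] * 24
long_cost = funding_pnl(2.0, day)     # the long pays funding all day
short_gain = funding_pnl(-2.0, day)   # the short receives the same amount
print(round(long_cost, 6), round(short_gain, 6))
```

A backtest that skips this loop reports the same price P&L for both sides and misses that one of them is paying carry every hour.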

The interface between the two halves is MCP — Anthropic’s Model Context Protocol. The LLM calls typed tools (keel_components_search, keel_strategy_compose, keel_backtest_run) the way it would call any other function. The tools execute against the real engine and return structured results the LLM reads back. Composition stays in the LLM; computation stays in the engine.
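
For a sense of what a typed tool call looks like on the wire: MCP is built on JSON-RPC 2.0, and a tool invocation is a tools/call request carrying the tool name and an arguments object. The sketch below only builds that request. The argument names (strategy_id, start, end) are assumptions for illustration; the real schema is whatever the Keel server advertises in its tool listing.

```python
import json
from itertools import count

_request_ids = count(1)


def tool_call_request(tool_name: str, arguments: dict) -> dict:
    """Build a JSON-RPC 2.0 request for MCP's tools/call method."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


# Argument names are hypothetical; consult the server's tool schema for the real ones.
request = tool_call_request(
    "keel_backtest_run",
    {"strategy_id": "example-strategy", "start": "2024-08-15", "end": "2026-04-30"},
)
print(json.dumps(request, indent=2))
```

The point of the shape: the LLM fills in names and arguments, and everything numeric comes back in the structured result.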

The five-step LLM-strategy-generation workflow

Step 1 — Thesis. The user states a problem or asks for an idea. The LLM proposes a hypothesis it can defend in one paragraph: "Funding-payers (positive funding) tend to mean-revert as the carry pressure unwinds; funding-receivers (negative funding) persist because the carry is paying the trade." A good thesis is concrete enough to compose against — it names the signal direction and the universe.

Step 2 — Compose. The LLM calls keel_components_search to find the typed components that implement the thesis — a funding-level signal, a normalization, a regime gate, a sizing rule, an execution component. It assembles them into a graph: data load → signal → normalization → sizing → portfolio rules → execution. The graph is typed end-to-end — type errors surface at compose time, not at runtime. The library has 182 typed components covering most factor-style use cases.
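
What "type errors surface at compose time" means in practice can be shown with a toy model. This is a sketch under stated assumptions, not Keel's DSL: Component, compose, and the type labels are all hypothetical, and a real graph allows more than a linear chain.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    """A pipeline stage with a declared input and output type (hypothetical)."""
    name: str
    consumes: str
    produces: str


def compose(stages: list[Component]) -> list[Component]:
    """Check that each stage's output type matches the next stage's input type."""
    for upstream, downstream in zip(stages, stages[1:]):
        if upstream.produces != downstream.consumes:
            raise TypeError(
                f"{upstream.name} produces {upstream.produces!r}, "
                f"but {downstream.name} consumes {downstream.consumes!r}"
            )
    return stages


load = Component("load_funding", consumes="none", produces="panel")
signal = Component("carry_signal", consumes="panel", produces="scores")
sizing = Component("vol_target", consumes="scores", produces="weights")
execute = Component("execution", consumes="weights", produces="orders")

graph = compose([load, signal, sizing, execute])  # every edge type-checks

try:
    compose([load, sizing])  # a panel cannot feed a stage that expects scores
    mismatch_caught = False
except TypeError as err:
    mismatch_caught = True
    print(f"compose-time error: {err}")
```

The bad graph fails before any data is loaded, so a mis-wired strategy never gets as far as producing a plausible-looking number.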

Step 3 — Backtest. The LLM calls keel_backtest_run with a start date, end date, and the compiled graph. The deterministic engine runs the strategy against real Hyperliquid history — 15-minute bars, 1-hour funding settlement, exchange fee schedule, modeled slippage. The engine returns a tearsheet: Sharpe, Sortino, max DD, Calmar, hit rate, turnover, average funding P&L per day. The numbers come from the engine, not the LLM.
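
Two of the tearsheet metrics are easy to pin down exactly, which makes them a useful check on anything an LLM reports. A minimal sketch of max drawdown and Calmar computed from an equity curve, assuming Calmar is defined as annualized compound return over absolute max drawdown. The engine's exact definitions may differ; the function names are illustrative.

```python
def max_drawdown(equity: list[float]) -> float:
    """Largest peak-to-trough decline, as a negative fraction of the running peak."""
    peak = equity[0]
    worst = 0.0
    for value in equity:
        peak = max(peak, value)
        worst = min(worst, value / peak - 1.0)
    return worst


def calmar(equity: list[float], years: float) -> float:
    """Annualized compound return divided by the absolute max drawdown."""
    cagr = (equity[-1] / equity[0]) ** (1.0 / years) - 1.0
    dd = max_drawdown(equity)
    return float("inf") if dd == 0 else cagr / abs(dd)


curve = [100.0, 120.0, 90.0, 110.0, 150.0]
print(max_drawdown(curve))        # worst decline is 120 -> 90, i.e. -25%
print(calmar(curve, years=1.0))   # +50% return over a 25% drawdown
```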

Step 4 — Interpret. The LLM reads the tearsheet and proposes iterations. "Sharpe is 1.4 but max DD is -22% in the Feb 2025 funding-regime shift — the strategy is too exposed during fast funding inversions. Want to try gating with a 7-day funding-volatility filter?" The user accepts or steers. Each iteration is another keel_strategy_compose + keel_backtest_run — minutes per loop.

Step 5 — Validate before deploying. Before any live capital, manually carve out a holdout period (the most recent 20–25% of the sample) and run a single forward test on it. Check that the out-of-sample (OOS) Sharpe is at least 0.5× the in-sample (IS) Sharpe. Sanity-check position sizing: at full intended capital, what does max DD look like in dollars? Deploy at a fraction of intended capital first and observe live parity for at least a calendar quarter. Keel does not yet ship walk-forward, Monte Carlo, probability of backtest overfitting (PBO), or deflated Sharpe ratio (DSR) as native diagnostics; the rigor calculators in the lab cover those today. See the robustness checklist for the full pre-deploy bar.
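
The holdout check is mechanical enough to write down. A minimal sketch, assuming you have exported a per-period return series from the tearsheet: it reserves the most recent 25% as out-of-sample, compares the two Sharpe ratios against the 0.5× bar, and converts a drawdown fraction into dollars. All function names are hypothetical, the risk-free rate is taken as zero for brevity, and the sample series are synthetic.

```python
import math
import statistics


def sharpe(returns: list[float], periods_per_year: int) -> float:
    """Annualized Sharpe of per-period returns (risk-free rate taken as zero)."""
    return statistics.fmean(returns) / statistics.stdev(returns) * math.sqrt(periods_per_year)


def split_holdout(returns: list[float], holdout_frac: float = 0.25):
    """Reserve the most recent fraction of the sample as out-of-sample."""
    cut = int(len(returns) * (1.0 - holdout_frac))
    return returns[:cut], returns[cut:]


def passes_holdout(returns, periods_per_year, min_ratio=0.5, holdout_frac=0.25) -> bool:
    """True when OOS Sharpe is at least min_ratio times a positive IS Sharpe."""
    in_sample, out_of_sample = split_holdout(returns, holdout_frac)
    is_sharpe = sharpe(in_sample, periods_per_year)
    oos_sharpe = sharpe(out_of_sample, periods_per_year)
    return is_sharpe > 0 and oos_sharpe >= min_ratio * is_sharpe


def drawdown_in_dollars(max_dd: float, capital: float) -> float:
    """Translate a fractional max drawdown into a dollar loss at a given capital."""
    return abs(max_dd) * capital


# Synthetic daily returns: one series holds up, one reverses in the last quarter.
steady = [0.002, -0.001, 0.003, -0.002] * 25
decayed = [0.002, -0.001, 0.003, -0.002] * 30 + [-0.002, 0.001, -0.003, 0.002] * 10

print(passes_holdout(steady, periods_per_year=365))    # True
print(passes_holdout(decayed, periods_per_year=365))   # False
print(drawdown_in_dollars(max_dd=-0.097, capital=50_000))
```

Run the holdout once. Re-running it after each tweak turns the holdout into more in-sample data.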

What an LLM hallucinates (the failure-mode list)

If you ask an LLM to backtest a strategy with no engine behind it, here are the failures you should expect — in roughly the order of frequency.

  • Wrong Sharpe formula. The LLM picks an annualization factor at random (252, 365, or 8,760, when 15-minute bars on a market that trades around the clock call for 35,040). It often drops the risk-free term. It runs the formula on a return series that was itself computed incorrectly. The number is plausible and wrong.
  • Ignored funding payments. On a perpetuals strategy, funding is paid every hour. A backtest that ignores it can show positive Sharpe on a strategy that loses money live to carry. LLMs default to ignoring it because the prompt didn’t mention it.
  • Peeked dates. The LLM read about Hyperliquid’s 2025 volatility regime in its training data. It "tested" a strategy that conveniently switches risk-off during that period, precisely because it knew the answer before writing the test.
  • Gut-feel position sizing. "We’ll size at 2% per name" — without computing what the resulting portfolio vol or DD will look like. Vol-targeting requires a feedback loop the LLM doesn’t maintain across the script.
  • Fabricated fees and slippage. The LLM picks "5 bps" because it sounds reasonable. Actual Hyperliquid maker/taker fees and the spread profile across the 200+ listed perps are not what the LLM remembers, if it remembers at all.
  • No live parity. The Python script that "passed" is not the artifact that would deploy. There is no executable bridge between the notebook and a real exchange — even if the strategy worked in the notebook, the deployment is from scratch and will diverge.
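
The first failure on the list is easy to make concrete. A minimal sketch of an annualized Sharpe that takes the bar size explicitly, assuming a market that trades around the clock every day (so 365 × 24 × 4 = 35,040 fifteen-minute bars per year) and a simple, non-compounded per-bar risk-free rate. The names and the sample returns are illustrative.

```python
import math
import statistics

# Bars per year for a market that trades around the clock, every day.
PERIODS_PER_YEAR = {
    "1d": 365,
    "1h": 365 * 24,       # 8,760
    "15m": 365 * 24 * 4,  # 35,040
}


def annualized_sharpe(returns, bar: str, annual_risk_free: float = 0.0) -> float:
    """Sharpe ratio of per-bar returns, annualized by sqrt(bars per year)."""
    n = PERIODS_PER_YEAR[bar]
    rf_per_bar = annual_risk_free / n  # simple de-annualization, an assumption
    excess = [r - rf_per_bar for r in returns]
    return statistics.fmean(excess) / statistics.stdev(excess) * math.sqrt(n)


bar_returns = [0.0004, -0.0002, 0.0003, -0.0001, 0.0002, 0.0001]  # made-up 15m returns
right = annualized_sharpe(bar_returns, "15m")
wrong = annualized_sharpe(bar_returns, "1h")  # hourly factor applied to 15-minute bars
print(right / wrong)  # sqrt(35040 / 8760), i.e. about 2
```

Same returns, same formula, and the reported Sharpe moves by a factor of two on the annualization constant alone.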

The mitigation is structural, not behavioral: route every quantitative claim through the engine. The LLM never produces a Sharpe number directly; it calls keel_backtest_run and reads back the engine’s Sharpe. The LLM never decides slippage; the engine’s slippage model decides. The LLM never assumes parity; the same compiled artifact runs in backtest and live.

A worked example

Concrete loop, with the actual tool calls. The user says: "Build a funding-carry strategy on Hyperliquid and backtest it from 2024 to now."

Claude proposes the thesis (funding-receivers persist, funding-payers mean-revert, gate by funding-volatility regime). It calls keel_components_search with query "funding carry" and finds the funding-data loader, the funding-level regime detector, and the volatility-target weight converter in the typed library. It composes a graph: load price + funding → carry signal → cross-sectional normalization → regime gate → vol-target sizing → portfolio rules → execution. It calls keel_strategy_compose to author the DSL graph and validate types.

Then it calls keel_backtest_run --wait with start 2024-08-15, end 2026-04-30. The deterministic engine runs the strategy against real Hyperliquid history. It returns: Sharpe 2.17, total return +79.6%, max DD -9.7%, sample 2024-08-15 → 2026-04-30. The share URL is app.usekeel.io/share/gDXjURKqWPs8CZ4eXdqAI, a live, inspectable tearsheet anyone can pull up.

The LLM did not compute any of those numbers. It proposed the thesis, composed the graph, called the engine, and read the result back. The Sharpe 2.17, the +79.6%, the -9.7% max DD — all engine numbers, computed against real Hyperliquid data with real fees, real funding accrual, and modeled slippage. The artifact that produced those numbers is the same one that would deploy live (bit-for-bit parity is the engine’s job, not the LLM’s).

Validating an LLM-proposed strategy

A single backtest is not a robustness proof. Before live capital, run the standard rigor checks any quant uses to separate real edge from in-sample fitting.

  • Walk-forward optimization: splits the sample into rolling IS/OOS chunks to catch regime dependence. Catches strategies that worked in one window and not the next.
  • Monte Carlo: resamples the strategy-return series to give confidence bands around max DD. Catches path-dependence: a reported "max DD -10%" might still leave a 25% chance of a -20% drawdown across resampled paths.
  • PBO: when an LLM iterates 20 times on a strategy, the champion is selected from 20 trials. PBO quantifies how much of the IS Sharpe is selection bias.
  • DSR: adjusts the champion’s Sharpe for trial count, skew, kurtosis, sample length. Tells you whether the champion is "really 2.0" or "really 0.6 after the multiplicity correction."
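
The rolling IS/OOS split behind walk-forward testing can be sketched in a few lines. This is one common variant, assuming fixed-length windows that advance by the out-of-sample length so that OOS segments never overlap; anchored (expanding) windows are an equally valid choice. The function name and the window sizes are illustrative.

```python
def walk_forward_windows(n_obs: int, is_len: int, oos_len: int):
    """Yield (in_sample, out_of_sample) index ranges that roll forward by oos_len."""
    start = 0
    while start + is_len + oos_len <= n_obs:
        is_range = range(start, start + is_len)
        oos_range = range(start + is_len, start + is_len + oos_len)
        yield is_range, oos_range
        start += oos_len


windows = list(walk_forward_windows(n_obs=600, is_len=300, oos_len=100))
for is_range, oos_range in windows:
    print(
        f"fit on [{is_range.start}, {is_range.stop}) "
        f"-> test on [{oos_range.start}, {oos_range.stop})"
    )
```

Each OOS range starts where the previous one ended, so stitching the OOS results together gives one continuous out-of-sample track record.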

Each diagnostic catches a different failure mode. Each one has its own page that walks through the math and ships a calculator that takes a pasted tearsheet or return series and produces the diagnostic.
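
To see what the Monte Carlo diagnostic is doing under the hood, here is a minimal sketch of a max-drawdown bootstrap. It assumes a simple i.i.d. resample with replacement, which discards any autocorrelation in the returns; a block bootstrap is the usual refinement. The input series is synthetic, and none of this is the calculator's actual implementation.

```python
import random


def max_drawdown_from_returns(returns) -> float:
    """Max drawdown (negative fraction) of the equity path implied by the returns."""
    equity, peak, worst = 1.0, 1.0, 0.0
    for r in returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = min(worst, equity / peak - 1.0)
    return worst


def bootstrap_max_dd(returns, n_paths=2000, seed=0):
    """Resample returns with replacement; return each path's max DD, sorted ascending."""
    rng = random.Random(seed)
    n = len(returns)
    return sorted(
        max_drawdown_from_returns(rng.choices(returns, k=n)) for _ in range(n_paths)
    )


def percentile(sorted_values, q: float) -> float:
    """Simple rank lookup on an already sorted list, q in [0, 1]."""
    index = min(len(sorted_values) - 1, int(q * len(sorted_values)))
    return sorted_values[index]


sample_rng = random.Random(1)
daily_returns = [sample_rng.gauss(0.001, 0.02) for _ in range(365)]  # synthetic data
dds = bootstrap_max_dd(daily_returns, n_paths=500)
tail_prob = sum(dd <= -0.20 for dd in dds) / len(dds)

print(f"median max DD: {percentile(dds, 0.5):.1%}")
print(f"5th-percentile max DD: {percentile(dds, 0.05):.1%}")
print(f"share of paths with max DD of -20% or worse: {tail_prob:.1%}")
```

The last line is the number that matters for sizing: how often a reshuffled version of the same returns would have drawn down past your pain threshold.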

How to run this workflow yourself

Two commands. The first installs the Keel MCP server as a Python package. The second registers it with Claude Code.

pipx install keel-trade
claude mcp add keel -- keel mcp serve

Restart Claude. The Keel tools appear in the tool list. First prompt to run:

Build a funding-carry strategy on Hyperliquid
and backtest it from 2024 to 2026.

Claude will call keel_status first, see you are not authenticated, and call keel_auth_login — which opens a browser for OAuth 2.1 + PKCE and captures tokens. Then it will search the component library, compose a graph, run a backtest, and read back the tearsheet. Total time from install to first backtest result: a few minutes.

This page is educational. Backtest results — including the worked example — do not forecast live performance. A backtest that clears Sharpe 2 over a 1.5-year sample can still fail in the next regime. Run the robustness checklist before any live capital. Keel is non-custodial — execution requires your wallet and your explicit local arming; the LLM does not have keys.
FAQ

LLM strategy generation — questions

Can an LLM actually generate a profitable trading strategy?

An LLM can propose a thesis ("funding-receivers persist; funding-payers mean-revert") and compose a typed strategy graph from a component library. It cannot tell you whether the thesis is profitable — only a deterministic backtest against real market history can do that. The working pattern is: LLM proposes, engine validates. Treat any number that an LLM produces without routing through a real engine as decorative.

Why not just ask Claude to write a Python backtest script?

Three reasons. First, drift: the LLM will use a Sharpe formula that looks right and is subtly wrong, or skip funding accrual entirely. Second, no parity: the script that "passed" in a notebook is not the artifact that runs live, so its backtest results don't carry over to live trading. Third, fabricated optimism: LLMs default to peeking at dates and assuming midpoint fills. A real engine applies real fees, real funding history, modeled slippage, and runs the same compiled artifact in backtest and live.

What kinds of strategies does this workflow suit?

Composable factor-style strategies on liquid universes — carry, momentum, mean-reversion, regime-gated, vol-targeted. The Keel component library covers signal generation, normalization, sizing, portfolio rules, and execution as typed phases an LLM can assemble. Strategies that require custom Numba kernels, novel ML models, or order-book microstructure logic are not yet a fit — they would need new components.

Does the LLM see my account or trade my money?

No. The MCP server exposes composition, search, and backtest tools by default. Live trading tools (deploy, monitor, control) require opting into the `live` scope at OAuth time AND a separate local arming step. The compiled strategy artifact is the same in both modes, but the LLM does not have wallet access. Non-custodial execution stays in your hands — Keel never holds the keys.

How long does the loop take?

Minutes per iteration. A typical loop: Claude proposes a thesis (seconds), searches the component library and composes the graph (seconds), kicks off a backtest (10–60s on a 1.5-year sample at 15-minute bars), reads the metrics, proposes a refinement. A research session that would have taken a day of hand-coding compresses to a half-hour of conversation — and the artifact is deploy-ready, not throwaway notebook code.

What if the LLM proposes a bad strategy?

The engine tells you. A bad thesis backtests to negative Sharpe or a Calmar that doesn't clear the bar; the LLM reads the metrics and iterates. That is the point of routing every quantitative claim through the engine instead of trusting the LLM: bad ideas die in backtest, not in production. The cost of a wrong proposal is a 30-second backtest, not a blown-up account.