LLMs are great at proposing trading strategies and terrible at validating them. The workflow splits the labor: the LLM does ideation, composition, and summarization; a deterministic engine does the math, the funding accrual, the fees, and the slippage. This page walks through that workflow end to end.
An LLM can propose a trading strategy in seconds. "Momentum top-30 by 30-day return, vol-targeted at 10% annualized, rebalanced weekly." It will produce a thesis paragraph that reads like a CIO memo and a Python file that looks like a backtest. The numbers it prints — Sharpe 2.3, +94% return, max DD -11% — feel real. The strategy feels real.
The trap is that almost none of those numbers were actually computed. The Python often uses a Sharpe formula that drops the risk-free term or annualizes incorrectly. It assumes midpoint fills. It silently uses fees the LLM made up. It ignores funding payments, even on a perpetual-futures strategy where funding is first-order P&L. It loads a CSV the LLM has never actually seen and computes returns on a window the LLM peeked at while writing the script. The strategy reads as alpha; it is a vibe.
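To make the Sharpe pitfall concrete, here is a minimal sketch of an annualized Sharpe ratio that keeps the risk-free term and scales by the square root of the number of periods per year. The function name, the sample returns, and the 365-periods-per-year convention for daily returns on a market that trades every day are assumptions for illustration, not Keel's implementation.

```python
import math
from statistics import mean, stdev


def annualized_sharpe(returns, risk_free_annual=0.0, periods_per_year=365):
    """Annualized Sharpe from per-period simple returns (illustrative sketch)."""
    if len(returns) < 2:
        raise ValueError("need at least two returns")
    # Convert the annual risk-free rate to a per-period rate (simple division).
    rf_per_period = risk_free_annual / periods_per_year
    excess = [r - rf_per_period for r in returns]
    vol = stdev(excess)  # sample standard deviation (n - 1 denominator)
    if vol == 0:
        raise ValueError("zero volatility; Sharpe is undefined")
    # The mean scales linearly with time and volatility with sqrt(time),
    # so the ratio scales by sqrt(periods_per_year), not periods_per_year.
    return mean(excess) / vol * math.sqrt(periods_per_year)


# Made-up daily returns, for illustration only.
daily = [0.01, -0.005, 0.007, 0.002, -0.003]
sharpe_no_rf = annualized_sharpe(daily)
sharpe_with_rf = annualized_sharpe(daily, risk_free_annual=0.05)
```

Dropping the risk-free term inflates the ratio whenever the rate is positive, and multiplying by the period count instead of its square root inflates it far more; both mistakes produce numbers that still look plausible.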
This is not a knock on LLMs — it is a knock on using them as the engine. LLMs are language models. They produce plausible-looking text, including plausible-looking code and plausible-looking numbers. They are not deterministic mathematical engines, and asking them to be one is asking the wrong question.
Keel is a quantitative trading platform for Hyperliquid. The architecture splits the work between agent and engine — agents handle composition, summarization, and code; the engine handles the deterministic math, funding accrual, and execution.
LLM does: ideation, hypothesis framing, library search, graph composition, metric interpretation, iteration proposals, summary writing. Anything that benefits from broad knowledge, natural-language reasoning, and judgment about what to try next. The LLM proposes a structure and reads back results.
Engine does: deterministic math, real Hyperliquid price history, hourly funding accrual on every open position, exchange-accurate fee schedules, modeled slippage, point-in-time universe, bit-for-bit parity between the compiled backtest artifact and the artifact that runs live. Anything where a number has to be exact and reproducible.
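As an illustration of why funding accrual belongs on the engine side, the sketch below accumulates funding on a constant position across hourly settlements. It assumes the usual perpetual-futures convention that a positive rate means longs pay shorts; the function name, the choice of valuation price, and the sample numbers are hypothetical and are not taken from Keel's engine.

```python
def funding_pnl(position_size, prices, hourly_rates):
    """Cumulative funding P&L for a constant position (illustrative sketch).

    position_size: signed contracts (positive = long, negative = short).
    prices: price used to value the position at each hourly settlement.
    hourly_rates: funding rate per settlement (positive = longs pay shorts).
    """
    if len(prices) != len(hourly_rates):
        raise ValueError("prices and hourly_rates must align one-to-one")
    total = 0.0
    for price, rate in zip(prices, hourly_rates):
        notional = position_size * price  # signed notional at settlement
        # A long (positive notional) pays when the rate is positive,
        # so the P&L contribution is the negative of notional * rate.
        total -= notional * rate
    return total


# Made-up prices and rates, for illustration only.
prices = [100.0, 101.0, 99.0]
rates = [0.0001, 0.0001, -0.0002]
long_pnl = funding_pnl(2.0, prices, rates)
short_pnl = funding_pnl(-2.0, prices, rates)
```

The long and the short see equal and opposite funding, which is why a carry strategy that skips this term is missing the P&L it was designed to capture.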
The interface between the two halves is MCP — Anthropic’s Model Context Protocol. The LLM calls typed tools (keel_components_search, keel_strategy_compose, keel_backtest_run) the way it would call any other function. The tools execute against the real engine and return structured results the LLM reads back. Composition stays in the LLM; computation stays in the engine.
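For readers who have not seen MCP on the wire, a tool call is a JSON-RPC 2.0 request with the method tools/call. The sketch below builds one for keel_backtest_run. The envelope follows the MCP specification, but the argument names (start, end, strategy) and the placeholder strategy value are assumptions; the real schema is whatever the Keel server advertises in its tool listing.

```python
import json


def build_tool_call(request_id, tool_name, arguments):
    """Build an MCP tools/call request as a JSON-RPC 2.0 message."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


# Argument names and values below are hypothetical placeholders.
request = build_tool_call(
    1,
    "keel_backtest_run",
    {"start": "2024-08-15", "end": "2026-04-30", "strategy": "<compiled-graph-id>"},
)
wire = json.dumps(request)  # the serialized message an MCP client would send
```

In practice the MCP client inside Claude builds and sends this message; the point is that the LLM only fills in typed arguments and never touches the computation.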
Step 1 — Thesis. The user states a problem or asks for an idea. The LLM proposes a hypothesis it can defend in one paragraph: "Funding-payers (positive funding) tend to mean-revert as the carry pressure unwinds; funding-receivers (negative funding) persist because the carry is paying the trade." A good thesis is concrete enough to compose against — it names the signal direction and the universe.
Step 2 — Compose. The LLM calls keel_components_search to find the typed components that implement the thesis — a funding-level signal, a normalization, a regime gate, a sizing rule, an execution component. It assembles them into a graph: data load → signal → normalization → sizing → portfolio rules → execution. The graph is typed end-to-end — type errors surface at compose time, not at runtime. The library has 182 typed components covering most factor-style use cases.
Step 3 — Backtest. The LLM calls keel_backtest_run with a start date, end date, and the compiled graph. The deterministic engine runs the strategy against real Hyperliquid history — 15-minute bars, 1-hour funding settlement, exchange fee schedule, modeled slippage. The engine returns a tearsheet: Sharpe, Sortino, max DD, Calmar, hit rate, turnover, average funding P&L per day. The numbers come from the engine, not the LLM.
Step 4 — Interpret. The LLM reads the tearsheet and proposes iterations. "Sharpe is 1.4 but max DD is -22% in the Feb 2025 funding-regime shift — the strategy is too exposed during fast funding inversions. Want to try gating with a 7-day funding-volatility filter?" The user accepts or steers. Each iteration is another keel_strategy_compose + keel_backtest_run — minutes per loop.
Step 5 — Validate before deploying. Before any live capital, manually carve out a holdout period (the most recent 20–25% of the sample) and run a single forward test on it. Check that OOS Sharpe is at least 0.5× IS Sharpe. Sanity-check position sizing — at full intended capital, what does max DD look like in dollars? Deploy at a fraction of intended capital first; observe live parity for at least a calendar quarter. Keel does not ship walk-forward, Monte Carlo, PBO, or DSR as native diagnostics yet — the rigor calculators in the lab cover those today. See the robustness checklist for the full pre-deploy bar.
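The Step 5 checks are simple enough to write down. The sketch below carves out the most recent 25% of a chronological return series as a holdout, applies the 0.5× out-of-sample bar, and converts a max drawdown into dollars. All function names, the placeholder return series, and the 50,000 capital figure are hypothetical; the Sharpe values are illustrative inputs, not computed results.

```python
def split_holdout(returns, holdout_fraction=0.25):
    """Split a chronological series into (in_sample, holdout), newest data last."""
    if not 0.0 < holdout_fraction < 1.0:
        raise ValueError("holdout_fraction must be between 0 and 1")
    cut = int(len(returns) * (1.0 - holdout_fraction))
    return returns[:cut], returns[cut:]


def passes_oos_bar(is_sharpe, oos_sharpe, min_ratio=0.5):
    """True when OOS Sharpe is at least min_ratio times a positive IS Sharpe."""
    return is_sharpe > 0 and oos_sharpe >= min_ratio * is_sharpe


def max_dd_dollars(max_dd_fraction, capital):
    """Translate a fractional max drawdown (e.g. -0.22) into a dollar loss."""
    return abs(max_dd_fraction) * capital


# Placeholder series: only its length matters for the split.
returns = [0.0] * 100
in_sample, holdout = split_holdout(returns)

ok = passes_oos_bar(is_sharpe=1.4, oos_sharpe=0.8)         # 0.8 >= 0.7
too_weak = passes_oos_bar(is_sharpe=1.4, oos_sharpe=0.6)   # 0.6 < 0.7
dollars_at_risk = max_dd_dollars(-0.22, capital=50_000)
```

The holdout is only meaningful if it is run once; re-tuning the strategy after looking at the holdout result turns it back into in-sample data.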
If you ask an LLM to backtest a strategy with no engine behind it, here are the failures you should expect — in roughly the order of frequency.
The mitigation is structural, not behavioral: route every quantitative claim through the engine. The LLM never produces a Sharpe number directly; it calls keel_backtest_run and reads back the engine’s Sharpe. The LLM never decides slippage; the engine’s slippage model decides. The LLM never assumes parity; the same compiled artifact runs in backtest and live.
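A small arithmetic sketch shows why the midpoint-fill assumption matters. It subtracts a per-side fee and slippage from a gross trade return using a linear approximation. The basis-point values are placeholders chosen for illustration, not Hyperliquid's fee schedule or Keel's slippage model.

```python
def net_return_after_costs(gross_return, fee_bps, slippage_bps, round_trips=1):
    """Subtract per-side fee and slippage from a gross return (linear sketch)."""
    per_side = (fee_bps + slippage_bps) / 10_000.0
    # Each round trip pays costs twice: once on entry, once on exit.
    return gross_return - 2 * per_side * round_trips


# Midpoint fills with no fees: the number an unchecked script tends to report.
midpoint = net_return_after_costs(0.004, fee_bps=0.0, slippage_bps=0.0)

# Same trade with placeholder costs of 4.5 bps fee and 3 bps slippage per side.
realistic = net_return_after_costs(0.004, fee_bps=4.5, slippage_bps=3.0)
```

With these placeholder costs a 0.40% gross trade keeps 0.25%, and a weekly-rebalanced strategy pays that haircut on every turnover, which is why the cost model has to live in the engine.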
Concrete loop, with the actual tool calls. The user says: "Build a funding-carry strategy on Hyperliquid and backtest it from 2024 to now."
Claude proposes the thesis (funding-receivers persist, funding-payers mean-revert, gate by funding-volatility regime). It calls keel_components_search with query "funding carry" and finds the funding-data loader, the funding-level regime detector, and the volatility-target weight converter in the typed library. It composes a graph: load price + funding → carry signal → cross-sectional normalization → regime gate → vol-target sizing → portfolio rules → execution. It calls keel_strategy_compose to author the DSL graph and validate types.
Then it calls keel_backtest_run --wait with start 2024-08-15, end 2026-04-30. The deterministic engine runs the strategy against real Hyperliquid history. It returns: Sharpe 2.17, total return +79.6%, max DD -9.7%, sample 2024-08-15 → 2026-04-30. The share URL is app.usekeel.io/share/gDXjURKqWPs8CZ4eXdqAI, a live, inspectable tearsheet anyone can pull up.
The LLM did not compute any of those numbers. It proposed the thesis, composed the graph, called the engine, and read the result back. The Sharpe 2.17, the +79.6%, the -9.7% max DD — all engine numbers, computed against real Hyperliquid data with real fees, real funding accrual, and modeled slippage. The artifact that produced those numbers is the same one that would deploy live (bit-for-bit parity is the engine’s job, not the LLM’s).
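The compose-time type check in the loop above can be pictured with a toy model: each component declares an input and an output type, and composition fails before anything runs if two neighbors do not match. The Component class, the type names, and the component names here are invented for illustration and do not reflect Keel's DSL or its component library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    name: str
    input_type: str
    output_type: str


def compose(components):
    """Check that each component's output type feeds the next one's input."""
    for upstream, downstream in zip(components, components[1:]):
        if upstream.output_type != downstream.input_type:
            raise TypeError(
                f"{upstream.name} outputs {upstream.output_type}, "
                f"but {downstream.name} expects {downstream.input_type}"
            )
    return [c.name for c in components]


# Hypothetical linear pipeline mirroring the phases described in the text.
pipeline = [
    Component("load_funding", "None", "FundingSeries"),
    Component("carry_signal", "FundingSeries", "Signal"),
    Component("normalize", "Signal", "Signal"),
    Component("vol_target", "Signal", "Weights"),
    Component("execute", "Weights", "Orders"),
]
order = compose(pipeline)
```

Wiring the loader straight into the sizing step would raise a TypeError at composition, so a malformed graph never reaches the backtest.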
A single backtest is not a robustness proof. Before live capital, run the standard rigor checks any quant uses to separate real edge from in-sample fitting.
Each diagnostic catches a different failure mode. Each link above runs you through the math and ships a calculator that takes a pasted tearsheet or return series and produces the diagnostic.
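Walk-forward, one of the checks named earlier on this page, is the easiest to sketch: fit on a training window, evaluate on the block that follows, then roll forward. The helper below only generates the index windows. Its name, its parameters, and the rolling (rather than expanding) window choice are assumptions for illustration.

```python
def walk_forward_windows(n_obs, train_size, test_size):
    """Return ((train_start, train_end), (test_start, test_end)) index pairs.

    End indices are exclusive, so each pair can be used directly as slices.
    """
    if train_size <= 0 or test_size <= 0:
        raise ValueError("train_size and test_size must be positive")
    windows = []
    start = 0
    while start + train_size + test_size <= n_obs:
        train = (start, start + train_size)
        test = (start + train_size, start + train_size + test_size)
        windows.append((train, test))
        start += test_size  # roll forward by one test block
    return windows


windows = walk_forward_windows(n_obs=10, train_size=4, test_size=2)
```

Each test block starts exactly where its training window ends, so no evaluation observation is ever visible during fitting.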
Two commands. The first installs the Keel MCP server as a Python package. The second registers it with Claude Code.
pipx install keel-trade
claude mcp add keel -- keel mcp serve
Restart Claude. The Keel tools appear in the tool list. First prompt to run:
Build a funding-carry strategy on Hyperliquid
and backtest it from 2024 to 2026.
Claude will call keel_status first, see you are not authenticated, and call keel_auth_login, which opens a browser for OAuth 2.1 + PKCE and captures tokens. Then it will search the component library, compose a graph, run a backtest, and read back the tearsheet. Total time from install to first backtest result: a few minutes.
Keel is a Strategy OS for AI-assisted systematic trading on Hyperliquid. Backtest, optimize, and run live strategies across single-stock perps, indices, and crypto majors — realistic fees, slippage, and funding modeled.
An LLM can propose a thesis ("funding-receivers persist; funding-payers mean-revert") and compose a typed strategy graph from a component library. It cannot tell you whether the thesis is profitable — only a deterministic backtest against real market history can do that. The working pattern is: LLM proposes, engine validates. Treat any number that an LLM produces without routing through a real engine as decorative.
Three reasons. First, drift: the LLM will use a Sharpe formula that looks right and is subtly wrong, or skip funding accrual entirely. Second, no parity: the script that "passed" in a notebook is not the artifact that runs live, so backtest results don't generalize. Third, fabricated optimism: LLMs default to peeking at dates and assuming midpoint fills. A real engine applies real fees, real funding history, modeled slippage, and runs the same compiled artifact in backtest and live.
The workflow fits composable factor-style strategies on liquid universes: carry, momentum, mean-reversion, regime-gated, vol-targeted. The Keel component library covers signal generation, normalization, sizing, portfolio rules, and execution as typed phases an LLM can assemble. Strategies that require custom Numba kernels, novel ML models, or order-book microstructure logic are not yet a fit; they would need new components.
The LLM cannot place live trades out of the box. The MCP server exposes composition, search, and backtest tools by default. Live trading tools (deploy, monitor, control) require opting into the `live` scope at OAuth time and a separate local arming step. The compiled strategy artifact is the same in both modes, but the LLM does not have wallet access. Non-custodial execution stays in your hands: Keel never holds the keys.
Minutes per iteration. A typical loop: Claude proposes a thesis (seconds), searches the component library and composes the graph (seconds), kicks off a backtest (10–60s on a 1.5-year sample at 15-minute bars), reads the metrics, proposes a refinement. A research session that would have taken a day of hand-coding compresses to a half-hour of conversation — and the artifact is deploy-ready, not throwaway notebook code.
When the LLM proposes a bad thesis, the engine tells you. A bad thesis backtests to negative Sharpe or a Calmar that doesn't clear the bar; the LLM reads the metrics and iterates. That is the point of routing every quantitative claim through the engine instead of trusting the LLM: bad ideas die in backtest, not in production. The cost of a wrong proposal is a 30-second backtest, not a blown-up account.