Use Claude where it actually helps. Use the engine for the rest.
Keel pairs Claude or Cursor with a deterministic Hyperliquid backtest engine. LLMs propose, compose, summarize, refactor — where they're sharp. The engine runs the backtest, models funding and fees, and deploys the compiled artifact live — where LLMs hallucinate.
Three frames for LLMs in systematic work. The third is the only one that survives.
Most pitches for LLMs in crypto fall into two camps: an autonomous wallet bot, or a one-off Python script. Neither matches what an LLM is actually good at. Here, LLMs compose; the engine computes; the compiled artifact trades.
Proposing a thesis from priors ("carry plus a funding-regime gate"). Searching a 182-component library for matching signals. Composing them into a typed DSL graph that compiles. Summarizing a backtest tearsheet in one paragraph. Refactoring a strategy when you want to swap the regime detector. Explaining a vol-of-vol filter to a teammate who has not seen it before.
These are composition and summarization tasks. They scale with model capability, they tolerate one-shot errors (the engine catches them), and they save real time. This is where Claude is sharp.
LLMs are weak at running deterministic math in their heads. Predicting a funding-payment trajectory across 200 perps over 20 months. Sizing a position against realized volatility. Telling you whether a Sharpe of 2.17 is statistically distinguishable from luck given the sample size. Computing drawdown from a price series. The LLM will produce a plausible number, and it will be wrong in ways you can't easily audit.
Don't ask Claude to compute drawdown; ask it to compose a strategy whose drawdown the engine computes. Runtime math stays in the engine, where it's deterministic and auditable.
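To make the split concrete, here is a minimal sketch of the kind of deterministic calculation that belongs in code rather than in a model's head. The function name and the sample equity curve are illustrative assumptions, not Keel's engine API.

```python
from typing import Sequence


def max_drawdown(equity: Sequence[float]) -> float:
    """Largest peak-to-trough decline, as a fraction of the running peak.

    Assumes a non-empty series of strictly positive equity values.
    """
    if not equity:
        raise ValueError("equity series is empty")
    peak = equity[0]
    worst = 0.0
    for value in equity:
        peak = max(peak, value)                    # running high-water mark
        worst = max(worst, (peak - value) / peak)  # decline from that mark
    return worst


# Hypothetical equity curve: peaks at 120, troughs at 90.
curve = [100.0, 120.0, 90.0, 110.0, 130.0]
print(f"max drawdown: {max_drawdown(curve):.2%}")  # max drawdown: 25.00%
```

The same input always yields the same number, which is what makes the result auditable.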
The LLM authors the strategy graph through the Keel MCP server. The engine runs deterministic math against real Hyperliquid funding and price data. The compiled artifact deploys live with bit-for-bit parity to the backtest. The agent is offline at execution time; the deterministic compiled strategy is what trades.
The agent helps you build the strategy; the engine runs it. The architecture treats LLMs as useful but unreliable, isolating that unreliability to the composition step where you can review every output before the engine commits.
Under the hood
Structured strategy engine
Strategies are composable pipelines of typed components. The system validates every connection at edit time — errors caught before you backtest, not after you deploy.
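As a rough illustration of edit-time validation, the sketch below rejects a mismatched connection the moment it is added to a pipeline. The `Component` and `Pipeline` classes, the type labels, and the component names are all hypothetical; Keel's actual DSL is not shown in this page.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Component:
    # Hypothetical component with one input type and one output type.
    name: str
    input_type: str
    output_type: str


@dataclass
class Pipeline:
    steps: list[Component] = field(default_factory=list)

    def append(self, component: Component) -> "Pipeline":
        # Check the connection when the graph is edited, not when it runs.
        if self.steps and self.steps[-1].output_type != component.input_type:
            prev = self.steps[-1]
            raise TypeError(
                f"{prev.name} emits {prev.output_type}, "
                f"but {component.name} expects {component.input_type}"
            )
        self.steps.append(component)
        return self


funding_signal = Component("funding_carry", "MarketData", "Signal")
regime_gate = Component("funding_regime_gate", "Signal", "Signal")
sizer = Component("vol_target_sizer", "Signal", "TargetWeights")

# A valid chain: every output type matches the next input type.
pipeline = Pipeline().append(funding_signal).append(regime_gate).append(sizer)

try:
    # Invalid: the sizer emits TargetWeights, the signal expects MarketData.
    Pipeline().append(sizer).append(funding_signal)
except TypeError as err:
    print(f"rejected at edit time: {err}")
```

The point of the design is that an invalid graph never reaches the backtester, whether a person or an agent assembled it.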
AI built on the same system
AI doesn’t generate code — it composes from the same components you use. It understands valid connections, constraints, and trade-offs. Every strategy it builds is structurally valid.
Detailed backtest reports
Sharpe, Sortino, max drawdown, win rate, trade-by-trade logs. Compare runs side by side. Real fee and slippage modeling.
Version control for strategies
Every edit creates a new version. Compare any two versions side by side. Tag releases, restore previous versions, fork strategies. Your full history, always recoverable.
Auditable execution logs
Every live run is logged — what the strategy calculated, what orders executed, what filled. Full transparency.
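One way such a log could be used is to measure realized slippage per fill. The sketch below assumes a hypothetical log row with a side, an expected price, and a fill price; the real log schema is not described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FillRecord:
    # Hypothetical log row; field names are illustrative only.
    side: str              # "buy" or "sell"
    expected_price: float  # price the strategy calculated against
    fill_price: float      # price the order actually filled at


def slippage_bps(record: FillRecord) -> float:
    """Realized slippage in basis points; positive means worse than expected."""
    direction = 1.0 if record.side == "buy" else -1.0
    diff = (record.fill_price - record.expected_price) / record.expected_price
    return direction * diff * 10_000


log = [
    FillRecord("buy", 100.00, 100.05),   # paid more than expected
    FillRecord("sell", 200.00, 199.90),  # received less than expected
]
for row in log:
    print(row.side, round(slippage_bps(row), 2))
```

Signing by side keeps the convention uniform: a positive number always means the fill cost you money relative to the expectation.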
Non-custodial by design
Your keys never touch our servers. Keel uses Hyperliquid’s native delegation. Sign once, revoke anytime.
One run, one share URL, no hand-waving.
A deterministic backtest against real Hyperliquid history, with bit-for-bit parity to live. The share URL holds the full tearsheet — equity curve, decomposed P&L, every trade.
Funding-carry on Hyperliquid perps
A deterministic single-window backtest with bit-for-bit parity to live. Click through for the full tearsheet — equity curve, decomposed P&L, trade-by-trade log. Period: 2024-08-15 → 2026-04-30 (20 mo).
Verified Keel backtest. Past performance is not indicative of future returns.
Common questions
Why use an LLM for systematic work at all?
Because composition and summarization are LLM-shaped problems. Searching a 182-component typed library for "vol-of-vol regime detectors", wiring them into a valid DSL graph, and writing a one-paragraph summary of a backtest tearsheet are tasks where Claude is genuinely strong. None of that asks the LLM to compute drawdown, simulate funding accrual, or decide whether a Sharpe is statistically significant; those stay with the engine and with the human reading the numbers. The LLM is the composition and summarization layer over a typed component graph, not a runtime decision engine.
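For the significance question, a common back-of-the-envelope check is the t-statistic implied by an annualized Sharpe ratio. The sketch below assumes independent, identically distributed returns, which real strategy returns often violate, and it ignores multiple-testing effects. It is a rough first pass, not Keel's method. The inputs reuse the illustrative figures from earlier on this page (Sharpe 2.17, 20 months).

```python
import math


def sharpe_t_stat(annual_sharpe: float, years: float) -> float:
    """t-statistic of the mean return implied by an annualized Sharpe ratio.

    Assumes i.i.d. returns, so t = Sharpe_annual * sqrt(years).
    """
    return annual_sharpe * math.sqrt(years)


t = sharpe_t_stat(2.17, 20 / 12)
print(f"t-stat ~ {t:.2f}")  # t-stat ~ 2.80
```

A short sample shrinks the statistic quickly: the same Sharpe over six months would give roughly 1.5, which is why the sample length matters as much as the headline ratio.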
What's the engine actually doing during a backtest?
Keel ingests Hyperliquid perpetual-market data as 15-minute bars and 1-hour funding rates, from the same cache the live execution path reads. The simulator applies the live Hyperliquid fee schedule per fill, models per-asset slippage in basis points that you set explicitly, accrues funding into the equity curve on every 1-hour boundary, and decomposes P&L into price return, funding return, and combined return. A buffered rebalancer respects exposure caps and volatility targeting at the portfolio level. The output is a share URL with the full tearsheet, and the run is deterministic: same inputs, same outputs, every time.
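The accounting described above can be sketched for a single held position. This is a simplified illustration, not the engine's code: the function names are hypothetical, the fee rate is a placeholder rather than the live schedule, hourly marks stand in for the 15-minute bars, and the mark price stands in for whatever reference price the venue uses at settlement.

```python
from dataclasses import dataclass


@dataclass
class PnL:
    price: float = 0.0
    funding: float = 0.0
    fees: float = 0.0

    @property
    def combined(self) -> float:
        return self.price + self.funding - self.fees


def fill_price(mid: float, qty: float, slippage_bps: float) -> float:
    # Buys fill above mid and sells below it, by the configured basis points.
    sign = 1.0 if qty > 0 else -1.0
    return mid * (1.0 + sign * slippage_bps / 10_000)


def simulate(hourly_marks, hourly_funding, qty, fee_rate, slippage_bps):
    """Hold `qty` (signed, in coins) from the first mark to the last one."""
    pnl = PnL()
    entry = fill_price(hourly_marks[0], qty, slippage_bps)
    pnl.fees += abs(qty) * entry * fee_rate
    # Funding settles on each 1-hour boundary: positive rate, longs pay shorts.
    for mark, rate in zip(hourly_marks[1:], hourly_funding):
        pnl.funding += -qty * mark * rate
    exit_price = fill_price(hourly_marks[-1], -qty, slippage_bps)
    pnl.fees += abs(qty) * exit_price * fee_rate
    pnl.price = qty * (exit_price - entry)
    return pnl


# Short 1 coin for two hours at a flat price with positive funding.
result = simulate(
    hourly_marks=[100.0, 100.0, 100.0],
    hourly_funding=[0.0001, 0.0001],
    qty=-1.0,
    fee_rate=0.0005,   # placeholder, not the live fee schedule
    slippage_bps=0.0,
)
print(result.price, result.funding, result.fees, result.combined)
```

Keeping price, funding, and fees in separate buckets is what lets a tearsheet show whether a carry strategy earned its return from funding or from directional drift.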
How is backtest-to-live parity verified?
The compiled pipeline artifact (the serialized DSL graph plus its component parameters) is the same object that runs in backtest and in live execution. The same data cache feeds both paths. Live results can still diverge from the backtest because of real-world market impact, realized slippage, and regime drift after the cutoff date, but not because of implementation drift. Execution logs let you compare expected versus actual, fill by fill. The plumbing under "parity" is that there is one engine, not two.
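One way to check that two paths are running the same artifact is to compare a digest of its canonical serialization. The sketch below is an assumption about how that could be done, not a description of Keel's verification; the graph layout, component IDs, and parameters are invented for the example.

```python
import hashlib
import json


def artifact_digest(graph: dict) -> str:
    """SHA-256 over a canonical JSON encoding of the strategy graph."""
    canonical = json.dumps(graph, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


backtested = {
    "components": [
        {"id": "funding_carry", "params": {"lookback_hours": 24}},
        {"id": "funding_regime_gate", "params": {"threshold": 0.0001}},
    ],
    "edges": [["funding_carry", "funding_regime_gate"]],
}
# The same graph with keys in a different order, as a loader might produce.
deployed = {
    "edges": [["funding_carry", "funding_regime_gate"]],
    "components": [
        {"params": {"lookback_hours": 24}, "id": "funding_carry"},
        {"params": {"threshold": 0.0001}, "id": "funding_regime_gate"},
    ],
}

# Key order does not matter; the content is identical, so digests match.
assert artifact_digest(backtested) == artifact_digest(deployed)

# Any parameter change produces a different digest.
tampered = json.loads(json.dumps(deployed))
tampered["components"][0]["params"]["lookback_hours"] = 48
print(artifact_digest(tampered) != artifact_digest(backtested))  # True
```

A matching digest shows the two paths loaded identical graph content; it says nothing about fills, which is why the execution logs remain the place to compare expected and actual results.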
How does this compare to QuantConnect or quantpylib for Hyperliquid?
Neither runs on Hyperliquid natively. QuantConnect is an equities-and-crypto research platform with strong walk-forward optimization (WFO) and rigor diagnostics but no Hyperliquid execution path. quantpylib is a Python library for systematic crypto research, again without Hyperliquid-native execution. Keel's differentiation is venue-native data and execution, agent-driven composition, and bit-for-bit backtest-to-live parity. It is not the rigor-diagnostic surface, where QuantConnect is ahead today. If your workflow is "research in QuantConnect, deploy somewhere else", you keep the rigor surface and lose parity. If your workflow is "compose in Claude, deploy on Hyperliquid", Keel is the path.
What's the data depth?
Hyperliquid 15-minute bars and 1-hour funding rates, with open interest where the venue publishes it. History depth varies per asset: BTC, ETH, and SOL go back to 2024-08-15, while newer listings like HYPE start at their listing date. The cache is stored as Parquet files on local disk, and the live execution path reads the same files. Sub-15-minute backtesting is not supported; latency-sensitive strategies belong in a different tool.
Is this for crypto only?
Yes. Keel is Hyperliquid-native by design — funding accrual, native delegated signing, per-asset slippage tuning, and the full execution path are all built around HL perps.
Keep exploring
The Keel MCP server
Product page for the MCP server: tools, 182 typed components, stdio install, browser OAuth.
Claude × Hyperliquid
The Claude-specific walkthrough. Compose, backtest, deploy entirely from a Claude Code session.
Hyperliquid backtesting (the engine)
The deterministic engine your agent drives. Real fees, funding, slippage. Funding-decomposed P&L.