Design the Agent Interface
Treat quant tools, datasets, permissions, and feedback as the interface your agent must operate through.
Failure pattern
The quant agent is given powerful research tools, but the interface hides what each tool assumes, what it changes, what evidence it returns, and which outputs require review.
The agent can pull data, run backtests, simulate portfolios, and draft memos. That sounds useful until the tools are raw internal functions with vague names and loose outputs. A capable model can still misuse a tool if the interface does not make safe use obvious.
Incident: raw backtest and memo tools
Agent task
A senior analyst asks:
Test whether earnings-revision momentum still works in semiconductors and prepare a short IC memo.
The agent has the right general capability. The unsafe part is the interface.
Available surface
The agent sees these tools:
| Tool | Description shown to agent |
|---|---|
| pullData(query) | Pulls market data |
| runBacktest(config) | Runs a backtest |
| simulatePortfolio(payload) | Simulates positions |
| writeMemo(text) | Writes a memo |
| publishIdea(id) | Publishes idea to research board |
The tools expose mechanics, not research semantics. runBacktest does not say whether it uses point-in-time data, transaction costs, delisting returns, borrow constraints, or survivorship-bias controls. publishIdea sounds like writing to a board, but analysts treat the board as pre-committee circulation.
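To make the gap concrete, here is a rough sketch of what that surface amounts to, written as a hypothetical tool registry (the names come from the table above; the schema shape is an assumption for illustration):

```typescript
// Hypothetical shape of the raw tool surface as the agent sees it: a name,
// a one-line description, and an untyped payload. Nothing states the
// universe, cost model, point-in-time status, or review implications.
type RawTool = {
  name: string;
  description: string;
  run: (input: Record<string, unknown>) => Promise<unknown>;
};

const rawTools: RawTool[] = [
  { name: "pullData", description: "Pulls market data", run: async () => ({}) },
  { name: "runBacktest", description: "Runs a backtest", run: async () => ({ sharpe: 1.8 }) },
  { name: "simulatePortfolio", description: "Simulates positions", run: async () => ({}) },
  { name: "writeMemo", description: "Writes a memo", run: async () => ({}) },
  { name: "publishIdea", description: "Publishes idea to research board", run: async () => ({}) },
];
```

Every payload is an untyped bag; the agent has nothing to reason against except the one-line descriptions.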
Bad run
The agent calls:
pullData("semiconductors earnings revision 5y")
runBacktest({ factor: "eps_revision_3m", sector: "semis", years: 5 })
writeMemo("Earnings revision momentum remains attractive...")
publishIdea("semi_revision_momentum")
The memo says the signal has a 1.8 Sharpe. Review finds the backtest used the current semiconductor universe across the full history, ignored transaction costs, and excluded delisted names. The published idea reaches the research board before risk review.
Why the harness failed
The interface exposed power without meaning.
| Interface gap | Consequence |
|---|---|
| Read surface vague | Agent did not know which dataset snapshot or universe was used |
| Action surface broad | Raw backtest accepted incomplete configuration |
| Feedback weak | Result returned Sharpe without warnings or assumptions |
| Policy hidden | Publishing did not require review status |
| Output schema loose | Memo omitted costs, data lineage, and caveats |
The tools worked. The interface failed.
Why it happens
Internal research tools are often built for humans who already know the desk’s conventions. A human understands that a quick backtest is exploratory, that a published idea implies review, and that a Sharpe without costs is not enough. An agent reads the interface literally.
Harness engineering for a quant agent means designing the agent’s operating surface: what it can read, what it can do, what feedback it receives, and what policies are enforced at the interface.
Harness principle
The agent interface has four surfaces:
| Surface | Quant version |
|---|---|
| Read surface | Datasets, factor definitions, risk model, constraints, prior memos |
| Action surface | Screens, bounded backtests, simulations, memo drafts, review requests |
| Feedback surface | Assumptions, warnings, diagnostics, validation checks, result metadata |
| Policy surface | Approval gates, prohibited outputs, data-use rules, compliance constraints |
flowchart LR
    A["Read surface"] --> E["Agent decision"]
    B["Policy surface"] --> E
    E --> C["Action surface"]
    C --> D["Feedback surface"]
    D --> E
    B --> C
The best interface does not rely on the agent remembering every rule. It shapes the path of action.
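One way to make the four surfaces concrete is a single harness object. This is a minimal sketch, with type and field names assumed for illustration rather than taken from any real desk schema:

```typescript
// Minimal sketch of the four surfaces as one harness object. Type and field
// names are assumptions made for this example, not a prescribed schema.

// Feedback surface: every action returns an envelope, never a bare metric.
interface ResultEnvelope<T> {
  status: "completed" | "completed_with_warnings" | "refused";
  assumptions: Record<string, string>;
  warnings: string[];
  result?: T;
}

// Policy surface: gates enforced in the action path, not only documented.
interface PolicySurface {
  requiresReview(action: string): boolean;
  prohibitedOutputs: string[];
}

// Read and action surfaces: what the agent can inspect and what it can do.
interface QuantHarness {
  read: {
    listDatasets(): Promise<string[]>;
    getFactorDefinition(name: string, version: string): Promise<string>;
  };
  actions: {
    runBoundedBacktest(spec: object): Promise<ResultEnvelope<{ sharpe: number }>>;
    draftResearchMemo(evidence: object): Promise<ResultEnvelope<{ memoId: string }>>;
    requestIdeaReview(memoId: string): Promise<ResultEnvelope<{ reviewer: string }>>;
  };
  policy: PolicySurface;
}
```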
Operating practice
Wrap raw tools in research-specific tools:
| Raw tool | Agent-facing tool |
|---|---|
| pullData(query) | loadResearchDataset(universe, snapshot, requiredFields) |
| runBacktest(config) | runBoundedBacktest(strategySpec, costModel, biasControls) |
| simulatePortfolio(payload) | simulateRiskConstrainedPortfolio(candidateSet, constraints) |
| writeMemo(text) | draftResearchMemo(evidencePacket, caveats) |
| publishIdea(id) | requestIdeaReview(memoId, evidenceChecklist) |
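On the input side, the wrapped backtest can turn its assumptions into required arguments. This is a sketch; the parameter and field names are assumptions for illustration, not the desk's actual API:

```typescript
// Illustrative input types for the wrapped backtest tool; the parameter and
// field names are assumptions, not the desk's actual API.
interface StrategySpec {
  signal: string;        // e.g. "eps_revision_3m"
  universe: string;      // named, point-in-time universe definition
  startDate: string;
  endDate: string;
}

interface CostModel {
  oneWayBps: number;
  includeBorrowCosts: boolean;
}

interface BiasControls {
  pointInTimeConstituents: boolean;
  includeDelistingReturns: boolean;
  survivorshipAdjusted: boolean;
}

// Returns a structured result rather than a bare metric; a typed version of
// that result is sketched after the example output below.
declare function runBoundedBacktest(
  spec: StrategySpec,
  costs: CostModel,
  bias: BiasControls
): Promise<unknown>;
```

An incomplete configuration now fails at the call site instead of silently producing a clean-looking Sharpe.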
The harnessed backtest tool should return:
status: completed_with_warnings
data_snapshot: 2026-05-16
universe: US semis, point-in-time constituents
cost_model: 20 bps one-way
bias_controls: survivorship adjusted, delisting returns included
warnings:
- borrow cost unavailable for 4 short candidates
- corporate-action adjustment pending for one name
Now the agent can reason from tool feedback instead of treating a single performance metric as truth.
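A typed sketch of that feedback, plus one way a harness might consume it (the field names mirror the example output above; the guard logic is an assumption about how the feedback could be used):

```typescript
// Typed sketch of the example backtest output above; field names mirror that
// example, and the guard logic is an assumption about how a harness might
// use the feedback.
interface BacktestResult {
  status: "completed" | "completed_with_warnings" | "failed";
  dataSnapshot: string;
  universe: string;
  costModel: string;
  biasControls: string[];
  warnings: string[];
  metrics: { sharpe: number };
}

// Branch on the feedback instead of treating the Sharpe ratio as the answer.
function readyForMemo(result: BacktestResult): { ok: boolean; blockers: string[] } {
  const blockers: string[] = [];
  if (result.status === "failed") blockers.push("backtest failed");
  blockers.push(...result.warnings); // unresolved warnings block memo drafting
  return { ok: blockers.length === 0, blockers };
}
```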
The interface should also make review state explicit. A memo draft, a review request, and an approved idea are different product states. The agent should be able to create the first two under the right conditions, but not silently move an idea into final approval. That boundary belongs in the action surface, not only in written policy.
For example, requestIdeaReview should require an evidence checklist. If data lineage, cost assumptions, and risk output are missing, the tool should refuse the request and return the missing fields. This turns a compliance rule into operational feedback the agent can use during the run.
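A minimal sketch of that refusal behavior, assuming hypothetical checklist fields that follow the text above:

```typescript
// Hypothetical evidence checklist and refusal behavior; the required fields
// follow the text above, everything else is illustrative.
interface EvidenceChecklist {
  dataLineage?: string;      // snapshot IDs and vendors used
  costAssumptions?: string;  // e.g. "20 bps one-way, borrow missing for 4 names"
  riskOutput?: string;       // exposures and constraint breaches
}

type ReviewRequest =
  | { status: "submitted"; memoId: string }
  | { status: "refused"; missingFields: string[] };

// Refuse the request and return the missing fields rather than silently
// forwarding an under-evidenced idea toward the board.
function requestIdeaReview(memoId: string, checklist: EvidenceChecklist): ReviewRequest {
  const required = ["dataLineage", "costAssumptions", "riskOutput"] as const;
  const missing = required.filter((field) => !checklist[field]);
  if (missing.length > 0) {
    return { status: "refused", missingFields: [...missing] };
  }
  return { status: "submitted", memoId };
}
```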
Product-agent example
Design each research tool with input constraints and output evidence:
| Tool | Required input | Required output |
|---|---|---|
| Screen | universe, date, factor version | ranked names, missing data, timestamp |
| Backtest | signal, horizon, costs, universe | metrics, assumptions, warnings |
| Risk simulation | candidate weights, constraints | exposures, breaches, sensitivities |
| Memo draft | evidence packet | thesis, caveats, open questions |
| Review request | memo and checklist | reviewer, status, blockers |
The interface should make the safe path easier than the shortcut.
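For instance, the screen row in the table above might translate into a signature like this (the names are illustrative):

```typescript
// Screen tool sketched from the table row above: required inputs become
// explicit parameters, required outputs become explicit fields. Names are
// illustrative.
interface ScreenResult {
  rankedNames: { ticker: string; score: number }[];
  missingData: string[];   // names excluded because of incomplete fields
  asOfTimestamp: string;   // when the underlying data was snapped
}

declare function screenUniverse(
  universe: string,
  asOf: string,
  factorVersion: string
): Promise<ScreenResult>;
```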
The same pattern applies to read tools. loadResearchDataset should not simply return rows. It should return the snapshot ID, vendor, timestamp, point-in-time status, missing fields, and corporate-action adjustment status. Those metadata fields become part of the agent’s reasoning surface. Without them, the agent may treat any returned data as equally trustworthy.
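A sketch of that kind of dataset envelope, with field names mirroring the list above and the overall shape assumed for illustration:

```typescript
// Dataset envelope sketched from the metadata fields listed above; field
// names are illustrative, not a vendor schema.
interface ResearchDatasetResult {
  snapshotId: string;
  vendor: string;
  asOfTimestamp: string;
  pointInTime: boolean;                  // constituents as of each date, not today
  missingFields: string[];               // fields requested but unavailable
  corporateActionsAdjusted: "complete" | "pending" | "none";
  rows: Record<string, unknown>[];
}

declare function loadResearchDataset(
  universe: string,
  snapshot: string,
  requiredFields: string[]
): Promise<ResearchDatasetResult>;
```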
Common mistakes
The first mistake is exposing raw internal tools directly. A human may already know a tool's assumptions; an agent needs those assumptions stated in the output.
The second mistake is returning metrics without diagnostics. A Sharpe without data lineage, costs, and warnings invites overconfidence.
The third mistake is allowing publication verbs too early. A quant agent should request review before an idea is treated as committee-ready.
The fourth mistake is treating policy as documentation only. Approval gates should be enforced in the action path.
Practical exercise
Choose one quant research tool and redesign it as an agent-facing interface. Write the tool name, safe inputs, refused inputs, structured outputs, warnings, and approval behavior.
Then ask: what incorrect action does the current interface invite? The answer tells you where the harness is weak.
Key takeaways
- The agent’s interface is the research environment, not the chat box.
- Quant tools need semantics, assumptions, and warnings.
- Feedback is part of the interface.
- Publication or review-state changes should be gated.
- Safe tools make good research behavior the default path.
Further reading / source notes
- Model Context Protocol architecture overview for a useful separation of tools, resources, prompts, and lifecycle concepts.
- OpenAI, “Harness engineering: leveraging Codex in an agent-first world” for designing agent environments and feedback loops.
- Anthropic, “Effective harnesses for long-running agents” for examples of making agent tooling and verification explicit.