Define the Work Surface
Turn vague quant-agent requests into bounded research work that can be completed and reviewed.
Failure pattern
A portfolio manager gives the agent a broad market objective, and the agent turns it into a confident trade thesis without a defined universe, horizon, risk boundary, or evidence standard.
The failure does not begin when the thesis is wrong. It begins when the agent is allowed to decide what the work is. In quant research, a request like “find an idea” can mean factor screen, event study, portfolio hedge, risk review, catalyst research, or memo draft. If the harness does not define the surface, the model supplies its own.
Incident: semiconductor long/short idea
Agent task
A portfolio manager asks the Quant Analyst AI Agent:
Find a long/short idea in semiconductors for next week’s research meeting.
That sounds clear to a human who knows the desk. It is not clear enough for an agent.
Available surface
The agent can inspect and use:
| Surface | What it contains |
|---|---|
| Equity universe | US, Europe, and Asia semiconductor names |
| Factor library | Momentum, earnings revision, quality, valuation, short interest |
| Market data | Prices, volume, fundamentals, estimates, corporate actions |
| Backtest engine | Historical factor screens and pair simulations |
| Risk model | Sector, beta, currency, style, and factor exposure |
| Research archive | Prior investment committee notes and rejected ideas |
| Memo tool | Drafts advisory research memos |
The agent is not allowed to place orders, but the task does not explicitly say whether it may produce a trade recommendation, a hypothesis, or only a research shortlist.
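The ambiguity is visible if you write the surface down as the harness sees it. A minimal sketch, with illustrative names; the point is the field that is left undefined:

```python
# Sketch of the tool surface as the harness exposes it. Note what is absent:
# nothing states whether the run may end in a trade recommendation,
# a hypothesis, or only a research shortlist. Names are illustrative.
tools = {
    "equity_universe": "US, Europe, and Asia semiconductor names",
    "factor_library":  "momentum, revisions, quality, valuation, short interest",
    "market_data":     "prices, volume, fundamentals, estimates, corporate actions",
    "backtest_engine": "historical factor screens and pair simulations",
    "risk_model":      "sector, beta, currency, style, factor exposure",
    "memo_tool":       "drafts advisory research memos",
}

order_entry = False      # the one boundary that IS explicit
output_authority = None  # undefined: the agent will decide for itself
```

Everything the agent can touch is enumerated, but the authority of the output is not, so the model fills the gap.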
Bad run
The agent screens global semiconductor names, picks one long and one short, runs a quick three-year backtest, and produces:
Advisory idea:
Long NVDA / Short INTC
Horizon: 4-6 weeks
Rationale: stronger momentum, positive revisions, better margin profile.
Expected return: +7.4 percent based on historical pair behavior.
Recommendation: add to Monday IC agenda as actionable trade.
The memo reads well. It is also not reviewable. The agent invented the horizon, ignored the desk's market-neutral risk limits, mixed mega-cap and legacy semiconductor exposures without checking factor crowding, and relied on a three-year backtest with no transaction-cost assumptions.
Why the harness failed
The agent was given a theme, not a work surface.
| Missing boundary | Consequence |
|---|---|
| Universe | Agent mixed global semiconductor names without liquidity or region rules |
| Output type | Agent produced an advisory trade instead of a research shortlist |
| Horizon | Agent invented a 4-6 week holding period |
| Evidence standard | Backtest, risk, costs, and catalyst checks were not required |
| Stop condition | No rule forced escalation when risk constraints were missing |
The model optimized for a useful-looking answer. The harness did not define what “useful” meant in this research context.
Why it happens
Quant research requests are compressed. A human analyst hears “long/short idea in semiconductors” and may infer the desk’s liquidity floor, factor-neutrality preference, benchmark, holding period, and committee format. The agent does not know which of those conventions are binding unless the harness makes them visible.
The model also tends to complete the pattern of investment research. If the prompt sounds like a trade request, it may produce a trade-shaped answer. That is not the same as satisfying a research process. A harnessed quant agent should know when it is exploring, when it is drafting, when it is recommending, and when it must defer to human approval.
Harness principle
A work surface is the bounded research object the agent may operate on for one run.
For a quant analyst agent, it should define:
- Research object: theme, universe, benchmark, account, factor, strategy, or specific question.
- Permitted movement: read data, screen, backtest, simulate, draft, recommend, or escalate.
- Evidence standard: required data freshness, risk checks, backtest assumptions, costs, and citations.
- Stop conditions: missing data, invalid assumptions, risk-limit breach, or approval boundary.
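The four elements above can be sketched as a per-run contract. This is a minimal illustration, not a fixed schema; the field names and example values are assumptions:

```python
from dataclasses import dataclass

# Illustrative work-surface contract for one agent run.
@dataclass(frozen=True)
class WorkSurface:
    research_object: str          # theme, universe, factor, or question
    permitted_movement: frozenset # e.g. {"read", "screen", "backtest", "draft"}
    evidence_standard: dict       # required checks: freshness, risk, costs
    stop_conditions: frozenset    # conditions that force escalation

    def allows(self, action: str) -> bool:
        """An action outside permitted_movement must escalate, not proceed."""
        return action in self.permitted_movement

surface = WorkSurface(
    research_object="US semiconductor long/short candidates",
    permitted_movement=frozenset({"read", "screen", "backtest", "draft"}),
    evidence_standard={"data_freshness_days": 1, "require_cost_model": True},
    stop_conditions=frozenset({"missing_risk_model", "stale_estimates"}),
)
```

With this in place, `surface.allows("draft")` is true while `surface.allows("submit_order")` is false, so an out-of-scope action is a detectable boundary event rather than a silent choice.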
```mermaid
flowchart LR
    A["Broad market request"] --> B["Universe and horizon"]
    B --> C["Allowed outputs"]
    C --> D["Evidence standard"]
    D --> E["Risk and approval boundaries"]
    E --> F["Reviewable research artifact"]
```
The work surface does not make the agent less useful. It makes the output auditable.
Operating practice
Write a research brief before the agent begins. For the semiconductor case:
| Field | Harnessed brief |
|---|---|
| Situation | Prepare semiconductor long/short candidates for next week’s research meeting. |
| Universe | US-listed semiconductor equities, market cap above $5B, median daily value traded above $50M. |
| Horizon | 1-3 month research horizon; no intraday or execution plan. |
| Allowed actions | Screen, compare factors, run bounded backtests, draft advisory memo. |
| Disallowed actions | Do not mark as trade-ready, do not prepare orders, do not bypass human review. |
| Required evidence | Factor snapshot, risk exposure, transaction-cost assumption, data timestamp, prior-idea check. |
| Escalation triggers | Missing risk model, stale estimates, factor crowding, unavailable corporate-action adjustment. |
| Completion evidence | Shortlist of 3 candidate pairs, evidence table, caveats, and review questions. |
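The brief above can double as a machine-checkable precondition: if any boundary is undefined, the run does not start. A sketch, assuming the harness stores the brief as structured data (keys are illustrative):

```python
# Hypothetical structured form of the research brief in the table above.
brief = {
    "situation": "Semiconductor long/short candidates for next week's meeting",
    "universe": {"region": "US", "min_mcap_usd": 5e9, "min_adv_usd": 50e6},
    "horizon": "1-3 month research horizon",
    "allowed_actions": ["screen", "compare_factors", "backtest", "draft_memo"],
    "disallowed_actions": ["mark_trade_ready", "prepare_orders"],
    "required_evidence": ["factor_snapshot", "risk_exposure",
                          "cost_assumption", "data_timestamp",
                          "prior_idea_check"],
    "escalation_triggers": ["missing_risk_model", "stale_estimates"],
    "completion_evidence": "3 candidate pairs, evidence table, caveats",
}

REQUIRED_FIELDS = {"situation", "universe", "horizon", "allowed_actions",
                   "disallowed_actions", "required_evidence",
                   "escalation_triggers", "completion_evidence"}

def brief_is_runnable(b: dict) -> bool:
    """Refuse to start the run if any boundary field is undefined."""
    return REQUIRED_FIELDS <= b.keys()
```

A brief missing, say, `escalation_triggers` fails the check, which is exactly the gap that produced the bad run.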
Harnessed run
The agent now returns:
Output type: research shortlist, not trade approval.
Universe: US semiconductors, >$5B market cap, >$50M ADV.
Candidates:
1. Long A / Short B: revision spread improving, valuation risk elevated.
2. Long C / Short D: quality spread strong, momentum decelerating.
3. Long E / Short F: catalyst pending, needs event-risk review.
Missing evidence:
- Risk model unavailable for two names after latest rebalance.
Next action:
- Human analyst should review pair 1 and approve deeper backtest.
The harnessed output is less theatrical than the bad run. It is more useful because it is explicit about scope, evidence, and approval.
Product-agent example
For quant research, a work-surface contract should distinguish output modes:
| Mode | Agent may do | Agent may not do |
|---|---|---|
| Exploration | Generate hypotheses and screens | Recommend action |
| Advisory memo | Draft thesis with caveats | Mark final approval |
| Review packet | Assemble evidence for committee | Hide missing checks |
| Execution-adjacent | Prepare scenario analysis | Submit order or final decision |
The same agent can support all modes, but the harness must name the mode before the run.
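One way to make the mode binding concrete is a small lookup that every proposed action passes through. A sketch under the assumption that the harness mediates each action; mode and action names follow the table, the helper itself is illustrative:

```python
# Mode contract: what each named mode permits and forbids.
MODES = {
    "exploration":        {"may": {"hypothesize", "screen"},
                           "may_not": {"recommend"}},
    "advisory_memo":      {"may": {"draft_thesis"},
                           "may_not": {"mark_final_approval"}},
    "review_packet":      {"may": {"assemble_evidence"},
                           "may_not": {"omit_checks"}},
    "execution_adjacent": {"may": {"scenario_analysis"},
                           "may_not": {"submit_order", "final_decision"}},
}

def check_action(mode: str, action: str) -> str:
    """Gate one proposed action against the named mode's contract."""
    contract = MODES[mode]
    if action in contract["may_not"]:
        return "escalate"   # hard boundary: never silently proceed
    if action in contract["may"]:
        return "proceed"
    return "escalate"       # unknown actions default to escalation, not permission
```

The default-to-escalate branch matters: anything the contract does not explicitly allow goes to a human rather than through.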
Common mistakes
The first mistake is letting market themes define the task. “Semiconductors” is a theme, not a work surface.
The second mistake is treating backtest output as completion evidence without defining assumptions. A backtest without costs, data lineage, and risk checks is not enough.
The third mistake is blurring advisory and approval language. “Attractive candidate” is different from “trade-ready.”
The fourth mistake is failing to define stop conditions. A missing risk model should stop or downgrade the output, not disappear into prose.
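The fourth mistake in particular is mechanizable: a missing check should downgrade the artifact's label, not vanish into prose. A minimal sketch, with assumed check names:

```python
# Illustrative stop-condition handling: missing evidence downgrades the
# output and surfaces the gap explicitly in the artifact's label.
def classify_output(evidence: dict, required: set) -> str:
    present = {k for k, v in evidence.items() if v}
    missing = required - present
    if missing:
        return f"research candidate (incomplete: {sorted(missing)})"
    return "advisory memo, ready for human review"

required = {"risk_model", "cost_assumption", "data_timestamp"}
evidence = {"risk_model": False,       # unavailable after latest rebalance
            "cost_assumption": True,
            "data_timestamp": True}
```

Here `classify_output(evidence, required)` returns a downgraded label naming the missing risk model, so a reviewer can see the gap without reading every line of the memo.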
Practical exercise
Take one real quant-agent request and rewrite it as a work-surface brief.
Include universe, horizon, allowed outputs, disallowed outputs, evidence requirements, risk checks, approval boundary, and completion evidence. Then ask whether a reviewer could detect an authority violation from the artifact alone.
Key takeaways
- Quant-agent work needs a bounded research surface before analysis begins.
- A theme is not a task.
- Advisory output must remain separate from final investment approval.
- Evidence standards should be defined before the model produces a thesis.
- A useful agent can say “research candidate, not trade-ready.”
Further reading / source notes
- OpenAI, “Harness engineering: leveraging Codex in an agent-first world” for the broader shift toward specifying intent and designing feedback loops around agents.
- Anthropic, “Effective harnesses for long-running agents” for examples of turning failure modes into harness structure.
- NIST AI Risk Management Framework for evaluation and risk-management framing around AI systems.