Separate Doing from Judging
Do not let the research agent be the only judge of investment-readiness.
Failure pattern
The agent drafts an investment memo and marks the idea ready for review, but the evidence packet is incomplete.
The danger is not that the agent wrote a bad memo. The danger is that the same agent that built the thesis also judged the thesis ready. In investment workflows, the worker and the judge need different responsibilities.
Incident: memo marked ready too early
Agent task
A portfolio manager asks:
Prepare the semiconductor revision-momentum idea for investment committee review.
The intended output is an advisory memo, not final investment approval.
Available surface
The agent can inspect and use:
| Surface | Contents |
|---|---|
| Research memo template | Thesis, data, backtest, risk, caveats |
| Backtest results | Signal performance and attribution |
| Risk model | Factor, beta, sector, and single-name exposures |
| Cost model | Transaction costs, borrow estimates, slippage |
| Review checklist | Required evidence before committee |
| Idea board | Draft, review requested, ready, rejected |
The agent can set idea status to ready.
Bad run
The agent drafts a persuasive memo:
Idea: Long high-revision semis / short low-revision semis.
Evidence: positive revision spread, strong five-year backtest, favorable quality tilt.
Status: ready for IC.
Independent review finds two missing checks:
- The short basket has concentrated exposure to one high-borrow name.
- The backtest used post-event estimate revisions for part of the sample.
The memo was fluent. The evidence gate failed.
Why the harness failed
The worker agent judged its own completion.
| Missing gate | Consequence |
|---|---|
| Risk exposure check | Concentrated short exposure was not flagged |
| Backtest validity check | Lookahead risk was not caught |
| Cost and borrow review | Short-side feasibility was under-specified |
| Status permission | Agent marked idea ready without evaluator approval |
| Evidence packet | Reviewer had to reconstruct missing assumptions |
The agent completed a memo, not a committee-ready research packet.
Why it happens
Agents tend to evaluate the artifact they intended to produce. If the task is “prepare a memo,” the agent may treat a complete memo as a complete task. But investment review cares about evidence quality, not prose completeness.
Human research teams already separate roles: analyst prepares, PM challenges, risk reviews, compliance checks. A quant-agent harness should encode the same separation.
Harness principle
Separate doing from judging.
The research agent may gather evidence, run analysis, and draft the memo. A completion gate decides whether the idea can move to the next status.
flowchart LR A["Research brief"] --> B["Worker agent"] B --> C["Evidence packet"] C --> D["Evaluator gate"] D -->|"Pass"| E["Review requested"] D -->|"Fail"| F["Missing evidence"]
Completion means evidence passed. It does not mean the worker says the memo is persuasive.
Operating practice
Create an evidence gate before the run:
| Requirement | Pass condition |
|---|---|
| Data lineage | Dataset snapshot and factor version named |
| Backtest validity | Point-in-time data, costs, no lookahead, benchmark defined |
| Risk exposure | Beta, sector, style, single-name, and factor exposure checked |
| Feasibility | Liquidity, borrow, turnover, and capacity considered |
| Approval boundary | Memo marked review requested, not trade approved |
| Caveats | Missing evidence and open questions listed |
Harnessed run
The worker returns:
Recommendation: request review, not approval.
Evidence packet:
- Dataset snapshot: 2026-05-16
- Factor version: revisions_v4
- Backtest: cost-adjusted, point-in-time, 2018-2026
- Risk: beta neutral within tolerance; one short name high borrow
Missing:
- Borrow estimate for Name D requires confirmation.
- Lookahead audit pending for estimate timestamp mapping.
Status requested: review_requested
The evaluator then rejects readiness:
Gate result: fail
Reason:
- Borrow estimate incomplete.
- Lookahead audit pending.
Allowed next status: draft_with_blockers
That rejection is a good outcome. It prevents weak evidence from becoming committee readiness.
The status language matters. The agent should not collapse draft, review_requested, committee_ready, and approved into one optimistic phrase. Each state should have a gate. A draft can be incomplete. A review request needs an evidence packet. Committee-ready requires evaluator pass. Approval belongs to humans with the authority to accept risk.
This keeps the agent useful without pretending it is the decision maker. It can accelerate research, assemble evidence, and make reviewers faster. It should not erase the accountability boundary.
Product-agent example
Use separate roles:
| Role | Responsibility |
|---|---|
| Research worker | Build evidence packet and draft memo |
| Risk evaluator | Check exposures, constraints, and feasibility |
| Methodology evaluator | Check backtest assumptions and data lineage |
| Human approver | Decide whether advisory idea advances |
The evaluator does not need to rewrite the memo. It checks whether the packet satisfies the gate.
An evaluator can be implemented as a human checklist, deterministic validation, another agent role, or a combination. For example, deterministic checks can confirm that a risk snapshot ID exists and that net-cost metrics are present. A human reviewer can then judge whether the caveats are acceptable. The harness should use the cheapest reliable judge for each requirement.
Common mistakes
The first mistake is reviewing only the prose. Review the evidence packet.
The second mistake is allowing the worker to choose the completion criteria after the work is done.
The third mistake is treating missing evidence as “low confidence” rather than a failed gate.
The fourth mistake is using final approval language in agent outputs. The agent can propose, request review, or flag readiness blockers. It should not approve the investment decision.
The fifth mistake is letting urgency lower the gate. If a portfolio manager wants the idea for a meeting in one hour, the output can become “draft with blockers” or “exploratory note.” The gate should not silently downgrade because the calendar is tight.
The sixth mistake is letting the evaluator become a copy editor. If the evaluator spends its effort improving prose, it may miss the evidence failure. Keep evaluation focused on gate criteria first.
Practical exercise
Take one quant memo and define the worker output and evaluator gate separately.
Include data lineage, backtest validity, risk exposure, costs, caveats, and approval status. Then create a test memo that sounds convincing but misses one required item. The evaluator should fail it.
Key takeaways
- A persuasive memo is not completion evidence.
- The research worker should not be the only judge of readiness.
- Investment readiness requires evidence gates.
- Missing required evidence should block advancement.
- Human approval remains the final decision boundary.
Further reading / source notes
- NIST AI Risk Management Framework for evaluation and monitoring practices around AI risk.
- OpenAI, “Harness engineering: leveraging Codex in an agent-first world” for feedback-loop framing.
- Anthropic, “Effective harnesses for long-running agents” for explicit verification before completion.