Verify 40 min

Separate Doing from Judging

Do not let the research agent be the only judge of investment-readiness.

Failure pattern

The agent drafts an investment memo and marks the idea ready for review, but the evidence packet is incomplete.

The danger is not that the agent wrote a bad memo. The danger is that the same agent that built the thesis also judged the thesis ready. In investment workflows, the worker and the judge need different responsibilities.

Incident: memo marked ready too early

Agent task

A portfolio manager asks:

Prepare the semiconductor revision-momentum idea for investment committee review.

The intended output is an advisory memo, not final investment approval.

Available surface

The agent can inspect and use:

| Surface | Contents |
| --- | --- |
| Research memo template | Thesis, data, backtest, risk, caveats |
| Backtest results | Signal performance and attribution |
| Risk model | Factor, beta, sector, and single-name exposures |
| Cost model | Transaction costs, borrow estimates, slippage |
| Review checklist | Required evidence before committee |
| Idea board | Draft, review requested, ready, rejected |

The agent can set idea status to ready.

Bad run

The agent drafts a persuasive memo:

Idea: Long high-revision semis / short low-revision semis.
Evidence: positive revision spread, strong five-year backtest, favorable quality tilt.
Status: ready for IC.

Independent review finds two missing checks:

  • The short basket has concentrated exposure to one high-borrow name.
  • The backtest used post-event estimate revisions for part of the sample.

The memo was fluent. The evidence gate failed.

Why the harness failed

The worker agent judged its own completion.

| Missing gate | Consequence |
| --- | --- |
| Risk exposure check | Concentrated short exposure was not flagged |
| Backtest validity check | Lookahead risk was not caught |
| Cost and borrow review | Short-side feasibility was under-specified |
| Status permission | Agent marked idea ready without evaluator approval |
| Evidence packet | Reviewer had to reconstruct missing assumptions |

The agent completed a memo, not a committee-ready research packet.
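The lookahead gap is the kind of failure a deterministic check can catch before any reviewer reads the memo. The following is a minimal sketch, assuming each backtest row records both the date the signal was formed and the date the estimate revision was actually published. The field names `signal_date` and `revision_published` and the tickers are hypothetical, not part of any real dataset.

```python
from datetime import date

def find_lookahead_rows(rows):
    """Return rows whose estimate revision was published after the signal date.

    Any such row means the backtest used information that was not yet
    available when the position would have been taken.
    """
    return [r for r in rows if r["revision_published"] > r["signal_date"]]

# Hypothetical sample: the second row uses a revision published three days late.
sample = [
    {"ticker": "AAA", "signal_date": date(2024, 3, 1), "revision_published": date(2024, 2, 27)},
    {"ticker": "BBB", "signal_date": date(2024, 3, 1), "revision_published": date(2024, 3, 4)},
]

violations = find_lookahead_rows(sample)
print([r["ticker"] for r in violations])  # ['BBB']
```

A non-empty result would block the backtest validity requirement regardless of how strong the reported performance looks.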

Why it happens

Agents tend to evaluate the artifact they intended to produce. If the task is “prepare a memo,” the agent may treat a complete memo as a complete task. But investment review cares about evidence quality, not prose completeness.

Human research teams already separate roles: analyst prepares, PM challenges, risk reviews, compliance checks. A quant-agent harness should encode the same separation.

Harness principle

Separate doing from judging.

The research agent may gather evidence, run analysis, and draft the memo. A completion gate decides whether the idea can move to the next status.

flowchart LR
  A["Research brief"] --> B["Worker agent"]
  B --> C["Evidence packet"]
  C --> D["Evaluator gate"]
  D -->|"Pass"| E["Review requested"]
  D -->|"Fail"| F["Missing evidence"]

Figure: The research worker prepares evidence; the evaluator decides whether the idea can advance.

Completion means evidence passed. It does not mean the worker says the memo is persuasive.

Operating practice

Create an evidence gate before the run:

| Requirement | Pass condition |
| --- | --- |
| Data lineage | Dataset snapshot and factor version named |
| Backtest validity | Point-in-time data, costs, no lookahead, benchmark defined |
| Risk exposure | Beta, sector, style, single-name, and factor exposure checked |
| Feasibility | Liquidity, borrow, turnover, and capacity considered |
| Approval boundary | Memo marked review requested, not trade approved |
| Caveats | Missing evidence and open questions listed |
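One way to fix the gate before the run is to declare it as data, so the worker cannot redefine it afterwards. The sketch below is illustrative only: the packet keys (`dataset_snapshot`, `lookahead_audit_passed`, `borrow_estimate`, and so on) and the benchmark value are assumptions, and a real harness would use whatever fields its own evidence packet carries.

```python
def present(*keys):
    """Pass when every key exists in the packet with a non-None value."""
    return lambda packet: all(packet.get(k) is not None for k in keys)

def confirmed(*keys):
    """Pass only when every key is explicitly True; pending or missing fails."""
    return lambda packet: all(packet.get(k) is True for k in keys)

# One check per requirement; key names are hypothetical.
EVIDENCE_GATE = {
    "Data lineage": present("dataset_snapshot", "factor_version"),
    "Backtest validity": lambda p: (
        confirmed("point_in_time", "net_of_costs", "lookahead_audit_passed")(p)
        and present("benchmark")(p)
    ),
    "Risk exposure": confirmed("exposures_checked"),
    "Feasibility": present("liquidity", "borrow_estimate", "turnover", "capacity"),
    "Approval boundary": lambda p: p.get("requested_status") == "review_requested",
    "Caveats": present("open_questions"),
}

def evaluate(packet):
    """Run every requirement and report which ones failed."""
    failed = [name for name, check in EVIDENCE_GATE.items() if not check(packet)]
    return {"result": "fail" if failed else "pass", "failed": failed}

packet = {
    "dataset_snapshot": "2026-05-16",
    "factor_version": "revisions_v4",
    "point_in_time": True,
    "net_of_costs": True,
    "lookahead_audit_passed": None,      # audit still pending
    "benchmark": "sector_index",         # hypothetical placeholder
    "exposures_checked": True,
    "liquidity": "ok", "turnover": "ok", "capacity": "ok",
    "borrow_estimate": None,             # one short name unconfirmed
    "requested_status": "review_requested",
    "open_questions": ["Borrow confirmation", "Estimate timestamp mapping"],
}

print(evaluate(packet))
# {'result': 'fail', 'failed': ['Backtest validity', 'Feasibility']}
```

Treating a pending audit as a failure rather than as a softer confidence score is the design choice that matters here: the gate reports what is missing and leaves the status unchanged.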

Harnessed run

The worker returns:

Recommendation: request review, not approval.
Evidence packet:
- Dataset snapshot: 2026-05-16
- Factor version: revisions_v4
- Backtest: cost-adjusted, point-in-time, 2018-2026
- Risk: beta neutral within tolerance; one short name high borrow
Missing:
- Borrow estimate for Name D requires confirmation.
- Lookahead audit pending for estimate timestamp mapping.
Status requested: review_requested

The evaluator then rejects readiness:

Gate result: fail
Reason:
- Borrow estimate incomplete.
- Lookahead audit pending.
Allowed next status: draft_with_blockers

That rejection is a good outcome. It prevents weak evidence from becoming committee readiness.

The status language matters. The agent should not collapse draft, review_requested, committee_ready, and approved into one optimistic phrase. Each state should have a gate. A draft can be incomplete. A review request needs an evidence packet. Committee-ready requires evaluator pass. Approval belongs to humans with the authority to accept risk.
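The status rules above can be sketched as a small transition table in which each move has exactly one owning role. This is an assumed design, not a prescribed one: the role names, the `advance` function, and the exact set of allowed transitions are hypothetical choices made for illustration.

```python
from enum import Enum

class Status(Enum):
    DRAFT = "draft"
    DRAFT_WITH_BLOCKERS = "draft_with_blockers"
    REVIEW_REQUESTED = "review_requested"
    COMMITTEE_READY = "committee_ready"
    APPROVED = "approved"

# (current, target) -> the only role allowed to make that move (assumed mapping).
TRANSITIONS = {
    (Status.DRAFT, Status.REVIEW_REQUESTED): "worker",
    (Status.DRAFT_WITH_BLOCKERS, Status.REVIEW_REQUESTED): "worker",
    (Status.REVIEW_REQUESTED, Status.DRAFT_WITH_BLOCKERS): "evaluator",
    (Status.REVIEW_REQUESTED, Status.COMMITTEE_READY): "evaluator",
    (Status.COMMITTEE_READY, Status.APPROVED): "human_approver",
}

class TransitionError(Exception):
    """Raised when a status change is not permitted."""

def advance(current, target, role, *, has_packet=False, gate_passed=False):
    """Return the new status, or raise if the move breaks a gate."""
    owner = TRANSITIONS.get((current, target))
    if owner is None:
        raise TransitionError(f"{current.value} -> {target.value} is not a defined step")
    if role != owner:
        raise TransitionError(f"{role} may not move an idea to {target.value}")
    if target is Status.REVIEW_REQUESTED and not has_packet:
        raise TransitionError("a review request needs an evidence packet")
    if target is Status.COMMITTEE_READY and not gate_passed:
        raise TransitionError("committee-ready requires an evaluator pass")
    return target

# The worker can request review once a packet exists...
status = advance(Status.DRAFT, Status.REVIEW_REQUESTED, "worker", has_packet=True)

# ...but cannot jump straight to committee-ready.
try:
    advance(Status.DRAFT, Status.COMMITTEE_READY, "worker")
except TransitionError as err:
    print(f"blocked: {err}")
```

Because the worker role never appears as the owner of the committee-ready or approved transitions, the bad run described earlier cannot happen by construction.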

This keeps the agent useful without pretending it is the decision maker. It can accelerate research, assemble evidence, and make reviewers faster. It should not erase the accountability boundary.

Product-agent example

Use separate roles:

| Role | Responsibility |
| --- | --- |
| Research worker | Build evidence packet and draft memo |
| Risk evaluator | Check exposures, constraints, and feasibility |
| Methodology evaluator | Check backtest assumptions and data lineage |
| Human approver | Decide whether advisory idea advances |

The evaluator does not need to rewrite the memo. It checks whether the packet satisfies the gate.

An evaluator can be implemented as a human checklist, deterministic validation, another agent role, or a combination. For example, deterministic checks can confirm that a risk snapshot ID exists and that net-cost metrics are present. A human reviewer can then judge whether the caveats are acceptable. The harness should use the cheapest reliable judge for each requirement.
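A layered evaluator along these lines might run the mechanical checks first and hand only the judgment calls to a person. The sketch below assumes a packet with a `risk_snapshot_id` field and a `backtest_metrics` mapping; those names, and the metric names `net_return` and `net_sharpe`, are hypothetical.

```python
def deterministic_failures(packet):
    """Cheap mechanical checks: they confirm presence, not acceptability."""
    failures = []
    if not packet.get("risk_snapshot_id"):
        failures.append("risk snapshot ID missing")
    metrics = packet.get("backtest_metrics") or {}
    for name in ("net_return", "net_sharpe"):  # assumed net-cost metric names
        if name not in metrics:
            failures.append(f"net-cost metric missing: {name}")
    return failures

def route(packet):
    """Fail mechanically incomplete packets; send the rest to a human."""
    failures = deterministic_failures(packet)
    if failures:
        # No reviewer time is spent on a packet that is mechanically incomplete.
        return {"result": "fail", "reasons": failures}
    # Whether the caveats are acceptable is a judgment call for a person.
    return {"result": "needs_human_review", "caveats": packet.get("caveats", [])}

print(route({"risk_snapshot_id": "snap-001", "backtest_metrics": {"net_return": 0.08}}))
# {'result': 'fail', 'reasons': ['net-cost metric missing: net_sharpe']}
```

Note that a clean deterministic pass does not produce a pass result here. It only earns the packet a human look, which keeps the mechanical judge from overreaching.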

Common mistakes

The first mistake is reviewing only the prose. Review the evidence packet.

The second mistake is allowing the worker to choose the completion criteria after the work is done.

The third mistake is treating missing evidence as “low confidence” rather than a failed gate.

The fourth mistake is using final approval language in agent outputs. The agent can propose, request review, or flag readiness blockers. It should not approve the investment decision.

The fifth mistake is letting urgency lower the gate. If a portfolio manager wants the idea for a meeting in one hour, the output can be relabeled “draft with blockers” or “exploratory note.” The gate itself should not be silently lowered because the calendar is tight.

The sixth mistake is letting the evaluator become a copy editor. If the evaluator spends its effort improving prose, it may miss the evidence failure. Keep evaluation focused on gate criteria first.

Practical exercise

Take one quant memo and define the worker output and evaluator gate separately.

Include data lineage, backtest validity, risk exposure, costs, caveats, and approval status. Then create a test memo that sounds convincing but misses one required item. The evaluator should fail it.
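One possible shape for this exercise is a test that starts from a complete packet, removes each required item in turn, and confirms the gate fails every time. This is a minimal sketch: the item keys and the `gate` function are hypothetical stand-ins for whatever evaluator is being tested.

```python
# The six items named in the exercise, as hypothetical packet keys.
REQUIRED_ITEMS = [
    "data_lineage",
    "backtest_validity",
    "risk_exposure",
    "costs",
    "caveats",
    "approval_status",
]

def gate(packet):
    """Fail when any required item is absent or empty; ignore the prose."""
    missing = [item for item in REQUIRED_ITEMS if not packet.get(item)]
    return ("fail", missing) if missing else ("pass", [])

# A complete packet with a convincing memo attached.
complete = {item: "documented" for item in REQUIRED_ITEMS}
complete["memo_text"] = "A fluent, persuasive thesis."  # never inspected by the gate

assert gate(complete) == ("pass", [])

# Drop one required item at a time; the memo text stays equally persuasive.
for item in REQUIRED_ITEMS:
    broken = {k: v for k, v in complete.items() if k != item}
    assert gate(broken) == ("fail", [item])

print("gate fails on every single missing item")
```

If any of the single-item removals passes, the evaluator is reading the prose rather than the evidence.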

Key takeaways

  • A persuasive memo is not completion evidence.
  • The research worker should not be the only judge of readiness.
  • Investment readiness requires evidence gates.
  • Missing required evidence should block advancement.
  • Human approval remains the final decision boundary.

Further reading / source notes