Separate Doing from Judging
Do not let the research agent be the only judge of investment-readiness.
Failure pattern
The agent drafts an investment memo and marks the idea ready for review, but the evidence packet is incomplete.
The danger is not that the agent wrote a bad memo. The danger is that the same agent that built the thesis also judged the thesis ready. In investment workflows, the worker and the judge need different responsibilities.
Incident: memo marked ready too early
Agent task
A portfolio manager asks:
Prepare the semiconductor revision-momentum idea for investment committee review.
The intended output is an advisory memo, not final investment approval.
Available surface
The agent can inspect and use:
| Surface | Contents |
|---|---|
| Research memo template | Thesis, data, backtest, risk, caveats |
| Backtest results | Signal performance and attribution |
| Risk model | Factor, beta, sector, and single-name exposures |
| Cost model | Transaction costs, borrow estimates, slippage |
| Review checklist | Required evidence before committee |
| Idea board | Draft, review requested, ready, rejected |
The agent can set idea status to ready.
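That last capability is where the incident below begins. A minimal sketch of the fix is to make status changes role-gated, so the worker agent can request review but cannot mark its own idea ready. Role and status names here are illustrative assumptions, not from a real idea-board system:

```python
# Illustrative sketch: restrict which roles may set which idea-board
# statuses, so the worker agent cannot mark its own idea "ready".
# Role names and status names are assumptions for this example.

ALLOWED_STATUS_SETS = {
    "worker": {"draft", "review_requested"},
    "evaluator": {"ready", "draft_with_blockers"},
    "human_approver": {"approved", "rejected"},
}

def set_status(role: str, new_status: str) -> str:
    """Apply a status change only if the role is permitted to make it."""
    allowed = ALLOWED_STATUS_SETS.get(role, set())
    if new_status not in allowed:
        raise PermissionError(f"{role!r} may not set status {new_status!r}")
    return new_status
```

Under this rule, the bad run below becomes impossible at the harness level: the worker's attempt to set "ready" raises an error instead of silently advancing the idea.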
Bad run
The agent drafts a persuasive memo:
Idea: Long high-revision semis / short low-revision semis.
Evidence: positive revision spread, strong five-year backtest, favorable quality tilt.
Status: ready for IC.
Independent review finds two missing checks:
- The short basket has concentrated exposure to one high-borrow name.
- The backtest used post-event estimate revisions for part of the sample.
The memo was fluent. The evidence gate failed.
Why the harness failed
The worker agent judged its own completion.
| Missing gate | Consequence |
|---|---|
| Risk exposure check | Concentrated short exposure was not flagged |
| Backtest validity check | Lookahead risk was not caught |
| Cost and borrow review | Short-side feasibility was under-specified |
| Status permission | Agent marked idea ready without evaluator approval |
| Evidence packet | Reviewer had to reconstruct missing assumptions |
The agent completed a memo, not a committee-ready research packet.
Why it happens
Agents tend to evaluate the artifact they intended to produce. If the task is “prepare a memo,” the agent may treat a complete memo as a complete task. But investment review cares about evidence quality, not prose completeness.
Human research teams already separate roles: analyst prepares, PM challenges, risk reviews, compliance checks. A quant-agent harness should encode the same separation.
Harness principle
Separate doing from judging.
The research agent may gather evidence, run analysis, and draft the memo. A completion gate decides whether the idea can move to the next status.
```mermaid
flowchart LR
    A["Research brief"] --> B["Worker agent"]
    B --> C["Evidence packet"]
    C --> D["Evaluator gate"]
    D -->|"Pass"| E["Review requested"]
    D -->|"Fail"| F["Missing evidence"]
Completion means evidence passed. It does not mean the worker says the memo is persuasive.
Operating practice
Create an evidence gate before the run:
| Requirement | Pass condition |
|---|---|
| Data lineage | Dataset snapshot and factor version named |
| Backtest validity | Point-in-time data, costs, no lookahead, benchmark defined |
| Risk exposure | Beta, sector, style, single-name, and factor exposure checked |
| Feasibility | Liquidity, borrow, turnover, and capacity considered |
| Approval boundary | Memo marked review requested, not trade approved |
| Caveats | Missing evidence and open questions listed |
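The table above can be encoded as a deterministic gate: each requirement is a predicate over the evidence packet, and the gate passes only if every predicate holds. The packet field names (`snapshot_id`, `borrow_confirmed`, and so on) are illustrative assumptions:

```python
# Sketch of the evidence gate: each requirement from the table becomes
# a predicate over the evidence packet. Field names are assumptions.

REQUIREMENTS = {
    "data_lineage": lambda p: bool(p.get("snapshot_id")) and bool(p.get("factor_version")),
    "backtest_validity": lambda p: bool(p.get("point_in_time"))
        and bool(p.get("cost_adjusted"))
        and bool(p.get("lookahead_audit_passed")),
    "risk_exposure": lambda p: bool(p.get("risk_checks_complete")),
    "feasibility": lambda p: bool(p.get("borrow_confirmed"))
        and bool(p.get("liquidity_checked")),
    "caveats": lambda p: "open_questions" in p,
}

def evaluate_gate(packet: dict) -> tuple[bool, list[str]]:
    """Return (passed, names of failed requirements)."""
    failed = [name for name, check in REQUIREMENTS.items() if not check(packet)]
    return (not failed, failed)
```

The important design choice is that the gate returns the named failures, not just a boolean: the worker gets a concrete list of blockers, which is what the harnessed run below reports back.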
Harnessed run
The worker returns:
Recommendation: request review, not approval.
Evidence packet:
- Dataset snapshot: 2026-05-16
- Factor version: revisions_v4
- Backtest: cost-adjusted, point-in-time, 2018-2026
- Risk: beta neutral within tolerance; one short name high borrow
Missing:
- Borrow estimate for Name D requires confirmation.
- Lookahead audit pending for estimate timestamp mapping.
Status requested: review_requested
The evaluator then rejects readiness:
Gate result: fail
Reason:
- Borrow estimate incomplete.
- Lookahead audit pending.
Allowed next status: draft_with_blockers
That rejection is a good outcome. It prevents weak evidence from becoming committee readiness.
The status language matters. The agent should not collapse draft, review_requested, committee_ready, and approved into one optimistic phrase. Each state should have a gate. A draft can be incomplete. A review request needs an evidence packet. Committee-ready requires evaluator pass. Approval belongs to humans with the authority to accept risk.
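The per-state gates described above can be sketched as a small transition table: each advancement names the gate that must have passed first. Status and gate names follow the ones used in this section and are otherwise assumptions:

```python
# Sketch of the status state machine: each transition requires a named
# gate to have passed. Gate and status names are illustrative.

TRANSITIONS = {
    ("draft", "review_requested"): "evidence_packet_complete",
    ("review_requested", "committee_ready"): "evaluator_pass",
    ("committee_ready", "approved"): "human_approval",
}

def advance(status: str, target: str, gates_passed: set[str]) -> str:
    """Advance to the target status only if its gate has passed."""
    gate = TRANSITIONS.get((status, target))
    if gate is None:
        raise ValueError(f"no transition {status!r} -> {target!r}")
    if gate not in gates_passed:
        raise ValueError(f"gate {gate!r} not passed; stay at {status!r}")
    return target
```

Note that "approved" is reachable only through "human_approval", which no agent role can grant. That is the accountability boundary expressed as code rather than as a convention the agent is asked to remember.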
This keeps the agent useful without pretending it is the decision maker. It can accelerate research, assemble evidence, and make reviewers faster. It should not erase the accountability boundary.
Role separation example
Use separate roles:
| Role | Responsibility |
|---|---|
| Research worker | Build evidence packet and draft memo |
| Risk evaluator | Check exposures, constraints, and feasibility |
| Methodology evaluator | Check backtest assumptions and data lineage |
| Human approver | Decide whether advisory idea advances |
The evaluator does not need to rewrite the memo. It checks whether the packet satisfies the gate.
An evaluator can be implemented as a human checklist, deterministic validation, another agent role, or a combination. For example, deterministic checks can confirm that a risk snapshot ID exists and that net-cost metrics are present. A human reviewer can then judge whether the caveats are acceptable. The harness should use the cheapest reliable judge for each requirement.
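One way to sketch "cheapest reliable judge per requirement": run the deterministic checks first, and route anything they cannot decide to a human rather than silently passing it. The field names here (`risk_snapshot_id`, `net_cost_bps`) are illustrative assumptions:

```python
# Sketch of a mixed evaluator: deterministic checks run first; judgment
# calls escalate to a human reviewer. Field names are assumptions.

def deterministic_checks(packet: dict) -> list[str]:
    """Cheap, machine-verifiable requirements."""
    failures = []
    if not packet.get("risk_snapshot_id"):
        failures.append("risk snapshot ID missing")
    if "net_cost_bps" not in packet:
        failures.append("net-cost metric missing")
    return failures

def route_packet(packet: dict) -> str:
    failures = deterministic_checks(packet)
    if failures:
        # No human time spent on packets that fail mechanical checks.
        return "fail: " + "; ".join(failures)
    # Remaining question (are the caveats acceptable?) is a judgment
    # call, so it escalates rather than auto-passing.
    return "escalate: human review of caveats"
```

The ordering matters: human attention is the expensive judge, so it is spent only on packets that have already cleared the mechanical gates.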
Common mistakes
The first mistake is reviewing only the prose. Review the evidence packet.
The second mistake is allowing the worker to choose the completion criteria after the work is done.
The third mistake is treating missing evidence as “low confidence” rather than a failed gate.
The fourth mistake is using final approval language in agent outputs. The agent can propose, request review, or flag readiness blockers. It should not approve the investment decision.
The fifth mistake is letting urgency lower the gate. If a portfolio manager wants the idea for a meeting in one hour, the output can become “draft with blockers” or “exploratory note.” The gate should not silently downgrade because the calendar is tight.
The sixth mistake is letting the evaluator become a copy editor. If the evaluator spends its effort improving prose, it may miss the evidence failure. Keep evaluation focused on gate criteria first.
Practical exercise
Take one quant memo and define the worker output and evaluator gate separately.
Include data lineage, backtest validity, risk exposure, costs, caveats, and approval status. Then create a test memo that sounds convincing but misses one required item. The evaluator should fail it.
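The exercise's failing case can be sketched directly: a packet that reads convincingly but omits one required item should fail the gate. The required-field list and packet contents below are illustrative assumptions:

```python
# Sketch of the exercise: a convincing packet missing one required item
# (the borrow confirmation) must fail the gate. Fields are assumptions.

REQUIRED_FIELDS = [
    "snapshot_id",
    "factor_version",
    "lookahead_audit_passed",
    "borrow_confirmed",
]

convincing_but_incomplete = {
    "snapshot_id": "2026-05-16",
    "factor_version": "revisions_v4",
    "lookahead_audit_passed": True,
    # "borrow_confirmed" deliberately omitted: the memo prose can still
    # sound complete, but the gate must catch the gap.
}

def gate_passes(packet: dict) -> bool:
    return all(packet.get(field) for field in REQUIRED_FIELDS)
```

If the evaluator passes this packet, the gate is reviewing prose, not evidence, and the test has done its job by exposing that.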
Key takeaways
- A persuasive memo is not completion evidence.
- The research worker should not be the only judge of readiness.
- Investment readiness requires evidence gates.
- Missing required evidence should block advancement.
- Human approval remains the final decision boundary.
Further reading / source notes
- NIST AI Risk Management Framework for evaluation and monitoring practices around AI risk.
- OpenAI, “Harness engineering: leveraging Codex in an agent-first world” for feedback-loop framing.
- Anthropic, “Effective harnesses for long-running agents” for explicit verification before completion.