Verify

Separate Doing from Judging

Do not let the implementing coding agent be the only judge of completion.

Failure pattern

The agent implements a patch, runs a narrow check, and declares the issue done. Later, CI or a human reviewer catches a broken real flow.

The agent may not be lying. It judged from the evidence available to it. The harness allowed weak evidence to count as completion.

Incident: PR marked done too early

Agent task

The agent is asked:

Fix invite acceptance for required-SSO workspaces and open the PR.

The agent patches the redirect guard and adds a unit test.

Available surface

The repo has:

| Surface | Evidence type |
| --- | --- |
| Unit tests | Guard behavior |
| API tests | Invite acceptance endpoint |
| E2E tests | Real invite email to first login |
| Typecheck | Cross-module type safety |
| CI | Full verification |
| Review checklist | Product behavior, migrations, security, tests |

The agent runs only the unit test.

Bad run

It reports:

Fixed invite acceptance.
Tested redirect guard.
Ready for review.

CI fails the e2e suite. The real flow still breaks because the invite email link uses a token format the unit test never covered. The patch fixed the isolated guard but not the product behavior.

Why the harness failed

The implementer judged completion.

| Missing gate | Consequence |
| --- | --- |
| Behavior-level evidence | Unit test passed but real flow failed |
| Independent reviewer | No role checked user-visible path |
| CI gate | PR readiness did not wait for required checks |
| Missing-evidence rule | “Not run” did not block completion |
| Review rubric | Agent treated code change as product fix |

The task needed external judgment.

Why it happens

Coding agents often optimize for the nearest test. If a unit test passes, the patch feels complete. But user-visible behavior may depend on routing, email tokens, API state, background jobs, and browser flow.

Humans separate doing and judging with code review, CI, QA, and acceptance criteria. A harness should do the same.

Harness principle

Completion is evidence passing, not agent confidence.

```mermaid
flowchart LR
  A["Work surface"] --> B["Implementing agent"]
  B --> C["Patch and evidence"]
  C --> D["Completion gate"]
  D -->|"Pass"| E["Ready"]
  D -->|"Fail"| F["Fix or block"]
```

Implementation and judgment should be separate steps.

The worker can produce evidence. A gate decides whether the evidence is enough.
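As a sketch, the separation can be as small as two data shapes and one function: the implementer returns a patch plus evidence, and a gate grades it. The TypeScript below is illustrative; the type names and check list are assumptions, not any specific framework's API.

```ts
// Illustrative harness types; names are invented for this sketch.
type CheckId = "unit" | "api" | "e2e" | "typecheck";

interface Evidence {
  check: CheckId;
  ran: boolean;    // false means the check was skipped or could not run
  passed: boolean; // only meaningful when ran is true
  log: string;
}

interface Submission {
  patch: string; // the diff produced by the implementing agent
  evidence: Evidence[];
}

type Verdict = "ready" | "needs-work" | "blocked";

// The gate, not the implementer, decides completion.
function completionGate(sub: Submission, required: CheckId[]): Verdict {
  for (const check of required) {
    const e = sub.evidence.find((x) => x.check === check);
    if (!e || !e.ran) return "blocked"; // missing evidence blocks "done"
    if (!e.passed) return "needs-work";
  }
  return "ready";
}
```

The implementer can call `completionGate` to predict the verdict, but only the gate's output counts as the grade.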

Operating practice

Define a completion gate:

| Requirement | Pass condition |
| --- | --- |
| Reproduction | Original bug fails before fix or is described from issue |
| Unit | Redirect guard test passes |
| API | Invite acceptance endpoint test passes |
| E2E | Email invite to first login works |
| Typecheck | Project typecheck passes |
| Missing evidence | Any skipped required check blocks “done” |
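The table can live in the harness as data. A minimal sketch, with hypothetical names, showing the one rule that matters most: a requirement that was never run counts as a block, not a pass.

```ts
// The completion gate for the invite fix, expressed as data.
interface GateRequirement {
  name: string;
  passCondition: string;
  satisfied: boolean | null; // null means "not run"
}

const inviteFixGate: GateRequirement[] = [
  { name: "Reproduction", passCondition: "Original bug reproduced from the issue", satisfied: true },
  { name: "Unit", passCondition: "Redirect guard test passes", satisfied: true },
  { name: "API", passCondition: "Invite acceptance endpoint test passes", satisfied: true },
  { name: "E2E", passCondition: "Email invite to first login works", satisfied: null },
  { name: "Typecheck", passCondition: "Project typecheck passes", satisfied: true },
];

// Missing evidence (null) is failure, not neutral.
// This is false here because the E2E check never ran.
const done = inviteFixGate.every((r) => r.satisfied === true);
```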

Harnessed run

The agent returns:

Status: not ready.
Implemented redirect guard fix and unit test.
Passed: unit guard test, API invite acceptance test.
Missing: E2E invite flow not run because local email preview service is down.
Next: restore preview service or ask reviewer for CI-based verification.

This is more honest than a premature “done.”
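A harness can also ask for that report in structured form, so the gate parses fields instead of trusting prose. The field names here are invented for illustration:

```ts
// The same "not ready" report as structured output a gate can check.
const status = {
  state: "not-ready",
  implemented: ["redirect guard fix", "unit test"],
  passed: ["unit guard test", "API invite acceptance test"],
  missing: [
    { check: "E2E invite flow", reason: "local email preview service is down" },
  ],
  next: "restore preview service or ask reviewer for CI-based verification",
};
```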

Coding-agent example

Use reviewer roles:

| Role | Checks |
| --- | --- |
| Implementer | Produces patch and first evidence |
| CI | Runs deterministic test suite |
| Reviewer | Checks scope, behavior, security, maintainability |
| Product verifier | Confirms user-visible scenario when needed |

The harness should not let the implementer skip all other roles.

Review artifact

The useful artifact is an evidence packet that a judge can inspect without trusting the implementing agent’s confidence.

| Gate | Required evidence | Result |
| --- | --- | --- |
| Behavior gate | E2E invite acceptance passes from email link to workspace setup | Missing |
| Regression gate | Expired-token and already-accepted-token paths still pass | Present |
| Code gate | Patch limited to invite acceptance route and tests | Present |
| Risk gate | Auth redirect and membership side effects explained | Partial |
| Human gate | Reviewer approves after seeing evidence packet | Not started |

This packet makes “ready” a judged state, not a feeling. The implementer can say the patch is ready for evaluation, but it cannot mark the behavior complete until the gates pass.
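One way to make the packet concrete is a small type the judge can iterate over. Again a sketch with invented names:

```ts
// An evidence packet as data a judge can inspect.
type GateResult = "present" | "partial" | "missing" | "not-started";

interface GateEntry {
  gate: string; // e.g. "Behavior gate"
  requiredEvidence: string;
  result: GateResult;
}

interface EvidencePacket {
  diff: string;
  gates: GateEntry[];
}

// "Ready" is a judged state: every gate must hold, not just most of them.
const readyForHumanReview = (packet: EvidencePacket): boolean =>
  packet.gates.every((g) => g.result === "present");
```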

A coding harness can express this with two prompts or two agents:

Implementer:
- make the smallest patch for the active behavior
- collect evidence
- do not declare final completion

Judge:
- inspect diff and evidence
- rerun or challenge key checks
- decide Ready, Needs Work, or Blocked

The judge does not need to be hostile. It needs a different job. The implementer optimizes for producing a solution; the judge optimizes for detecting unsupported claims. This separation matters because language models are good at writing convincing completion summaries even when evidence is thin.
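Wired together, the two roles become a short loop: the implementer submits, the judge grades, and a precise reason flows back on failure. The `runImplementer` and `runJudge` calls below are hypothetical stand-ins for whatever agent runtime you use:

```ts
type Verdict = "ready" | "needs-work" | "blocked";

interface Submission {
  patch: string;
  evidenceSummary: string;
}

// Hypothetical calls into an agent runtime; substitute your own.
declare function runImplementer(task: string, feedback?: string): Promise<Submission>;
declare function runJudge(sub: Submission): Promise<{ verdict: Verdict; reason: string }>;

async function harness(task: string, maxRounds = 3): Promise<Verdict> {
  let feedback: string | undefined;
  for (let round = 0; round < maxRounds; round++) {
    const sub = await runImplementer(task, feedback); // does the work
    const { verdict, reason } = await runJudge(sub);  // grades the work
    if (verdict !== "needs-work") return verdict;     // "ready" or "blocked"
    feedback = reason; // the precise reason tells the implementer what to prove next
  }
  return "blocked"; // escalate to a human after repeated failed rounds
}
```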

In the PR incident, the implementer had a unit test and a plausible explanation. The judge would have asked for the user journey evidence: can a newly invited user click the link and land in the workspace? That single gate catches the missing e2e path.

Harnessed version

The harnessed run has two moments of completion. First, the implementing agent reaches “ready for judgment” and submits a patch with evidence. Second, an evaluator reaches “ready for human review” after inspecting the evidence against gates. These states should not collapse into one.

This is not bureaucracy. It is a guard against fluent self-assessment. Coding agents can explain why their patch should work, but the harness should prefer independent evidence over explanation. The evaluator can be another agent, a deterministic CI gate, a human reviewer, or a combination. What matters is that the implementer does not own the final grade.

When the judge finds a missing gate, the task should return with a precise reason: “behavior evidence missing for invited-user journey.” That is more useful than “needs more testing” because the implementer knows exactly which proof to produce.
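The difference is easy to encode. Compare a vague verdict with an actionable one; only the second tells the implementer which proof to produce:

```ts
// A vague rejection versus an actionable one.
const vague = { verdict: "needs-work", reason: "needs more testing" };

const precise = {
  verdict: "needs-work",
  reason:
    "behavior evidence missing for invited-user journey: run the E2E invite flow from email link to workspace",
};
```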

Common mistakes

The first mistake is equating changed code with fixed behavior.

The second mistake is accepting a narrow test when the task described an end-to-end flow.

The third mistake is treating skipped tests as neutral. Missing evidence should block completion.

The fourth mistake is letting the agent review its own summary instead of the actual diff and test output.

Practical exercise

Pick one recent patch. Ignore the summary. List only evidence proving the user-visible behavior works.

If the evidence would not convince a skeptical reviewer, design a stronger completion gate.

Key takeaways

  • Agent confidence is not completion evidence.
  • User-visible bugs need user-visible verification.
  • Missing evidence should block “done.”
  • CI and review are harness components.
  • The implementing agent should not be the only judge.

Further reading / source notes