Separate Doing from Judging
Do not let the implementing coding agent be the only judge of completion.
Failure pattern
The agent implements a patch, runs a narrow check, and declares the issue done. Later, CI or a human reviewer catches a broken real flow.
The agent may not be lying. It judged from the evidence available to it. The harness allowed weak evidence to count as completion.
Incident: PR marked done too early
Agent task
The agent is asked:
Fix invite acceptance for required-SSO workspaces and open the PR.
The agent patches the redirect guard and adds a unit test.
Available surface
The repo has:
| Surface | Evidence type |
|---|---|
| Unit tests | Guard behavior |
| API tests | Invite acceptance endpoint |
| E2E tests | Real invite email to first login |
| Typecheck | Cross-module type safety |
| CI | Full verification |
| Review checklist | Product behavior, migrations, security, tests |
The agent runs only the unit test.
Bad run
It reports:
Fixed invite acceptance.
Tested redirect guard.
Ready for review.
CI fails the e2e suite. The real flow still breaks because the invite email link carries a token format the unit test never covered. The patch fixed the isolated guard, not the product behavior.
Why the harness failed
The implementer judged completion.
| Missing gate | Consequence |
|---|---|
| Behavior-level evidence | Unit test passed but real flow failed |
| Independent reviewer | No role checked user-visible path |
| CI gate | PR readiness did not wait for required checks |
| Missing-evidence rule | “Not run” did not block completion |
| Review rubric | Agent treated code change as product fix |
The task needed external judgment.
Why it happens
Coding agents often optimize for the nearest test. If a unit test passes, the patch feels complete. But user-visible behavior may depend on routing, email tokens, API state, background jobs, and browser flow.
Humans separate doing and judging with code review, CI, QA, and acceptance criteria. A harness should do the same.
Harness principle
Completion is evidence passing, not agent confidence.
```mermaid
flowchart LR
    A["Work surface"] --> B["Implementing agent"]
    B --> C["Patch and evidence"]
    C --> D["Completion gate"]
    D -->|"Pass"| E["Ready"]
    D -->|"Fail"| F["Fix or block"]
```
The worker can produce evidence. A gate decides whether the evidence is enough.
Operating practice
Define a completion gate:
| Requirement | Pass condition |
|---|---|
| Reproduction | Original bug reproduces before the fix, or is reproduced from the issue description |
| Unit | Redirect guard test passes |
| API | Invite acceptance endpoint test passes |
| E2E | Email invite to first login works |
| Typecheck | Project typecheck passes |
| Missing evidence | Any skipped required check blocks “done” |
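A gate like this can be expressed as a small deterministic check. The sketch below is illustrative, not a real harness API: the gate names mirror the table, and the function name and status strings are assumptions. The key property is that missing evidence blocks exactly like failing evidence.

```python
# Hypothetical completion gate: any required check that is missing or
# failing blocks "done". Names and statuses are illustrative.

PASS, FAIL, MISSING = "pass", "fail", "missing"

REQUIRED_GATES = ["reproduction", "unit", "api", "e2e", "typecheck"]

def judge_completion(evidence):
    """Return ("ready" | "blocked", reasons). Skipped checks are not neutral."""
    reasons = []
    for gate in REQUIRED_GATES:
        result = evidence.get(gate, MISSING)
        if result == MISSING:
            reasons.append(f"{gate}: evidence missing")
        elif result == FAIL:
            reasons.append(f"{gate}: check failed")
    return ("ready" if not reasons else "blocked", reasons)

# The bad run from the incident: only the unit test was executed.
status, reasons = judge_completion({"unit": PASS})
# status == "blocked"; the other four gates report missing evidence
```

Note the default: a check that never ran resolves to `MISSING`, so the agent cannot reach “ready” by simply not running the e2e suite.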
Harnessed run
The agent returns:
Status: not ready.
Implemented redirect guard fix and unit test.
Passed: unit guard test, API invite acceptance test.
Missing: E2E invite flow not run because local email preview service is down.
Next: restore preview service or ask reviewer for CI-based verification.
This is more honest than a premature “done.”
Coding-agent example
Use reviewer roles:
| Role | Checks |
|---|---|
| Implementer | Produces patch and first evidence |
| CI | Runs deterministic test suite |
| Reviewer | Checks scope, behavior, security, maintainability |
| Product verifier | Confirms user-visible scenario when needed |
The harness should not let the implementer skip all other roles.
Review artifact
The useful artifact is an evidence packet that a judge can inspect without trusting the implementing agent’s confidence.
| Gate | Required evidence | Result |
|---|---|---|
| Behavior gate | E2E invite acceptance passes from email link to workspace setup | Missing |
| Regression gate | Expired-token and already-accepted-token paths still pass | Present |
| Code gate | Patch limited to invite acceptance route and tests | Present |
| Risk gate | Auth redirect and membership side effects explained | Partial |
| Human gate | Reviewer approves after seeing evidence packet | Not started |
This packet makes “ready” a judged state, not a feeling. The implementer can say the patch is ready for evaluation, but it cannot mark the behavior complete until the gates pass.
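The evidence packet can be carried as a plain data structure the judge inspects. This is a sketch under assumptions: the gate names follow the table above, but the field names and schema are invented for illustration, not taken from any specific harness.

```python
# Illustrative evidence packet; schema is an assumption, not a real format.
from dataclasses import dataclass

@dataclass
class Gate:
    name: str
    required_evidence: str
    result: str  # "present", "partial", "missing", or "not started"

def packet_ready(gates):
    """'Ready' is a judged state: every gate must hold present evidence."""
    return all(g.result == "present" for g in gates)

packet = [
    Gate("behavior", "E2E invite acceptance from email link", "missing"),
    Gate("regression", "expired and already-accepted token paths pass", "present"),
    Gate("code", "patch limited to invite acceptance route and tests", "present"),
    Gate("risk", "auth redirect and membership side effects explained", "partial"),
    Gate("human", "reviewer approves after seeing packet", "not started"),
]

packet_ready(packet)  # False: behavior, risk, and human gates are not satisfied
```

Because the packet is data, a CI step, a second agent, or a human can all apply the same readiness rule without trusting the implementer’s summary.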
A coding harness can express this with two prompts or two agents:
Implementer:
- make the smallest patch for the active behavior
- collect evidence
- do not declare final completion
Judge:
- inspect diff and evidence
- rerun or challenge key checks
- decide Ready, Needs Work, or Blocked
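The judge role above can be sketched as a function that maps evidence to one of the three verdicts, always with a precise reason. The verdict logic and names here are assumptions for illustration; a real judge might also rerun checks rather than trust reported results.

```python
# Hypothetical judge: maps evidence to Ready / Needs Work / Blocked.
# In this sketch, None means a check was never run, False means it failed.

def judge(evidence):
    missing = [name for name, result in evidence.items() if result is None]
    failed = [name for name, result in evidence.items() if result is False]
    if missing:
        return "Blocked", f"evidence missing: {', '.join(missing)}"
    if failed:
        return "Needs Work", f"checks failing: {', '.join(failed)}"
    return "Ready", "all gates hold passing evidence"

verdict, reason = judge({"unit": True, "api": True, "e2e": None})
# verdict == "Blocked"; the reason names the exact missing proof
```

Returning the missing gate by name is what makes the rejection actionable: the implementer learns which proof to produce, not just that more testing is needed.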
The judge does not need to be hostile. It needs a different job. The implementer optimizes for producing a solution; the judge optimizes for detecting unsupported claims. This separation matters because language models are good at writing convincing completion summaries even when evidence is thin.
In the PR incident, the implementer had a unit test and a plausible explanation. The judge would have asked for the user journey evidence: can a newly invited user click the link and land in the workspace? That single gate catches the missing e2e path.
Harnessed version
The harnessed run has two moments of completion. First, the implementing agent reaches “ready for judgment” and submits a patch with evidence. Second, an evaluator reaches “ready for human review” after inspecting the evidence against gates. These states should not collapse into one.
This is not bureaucracy. It is a guard against fluent self-assessment. Coding agents can explain why their patch should work, but the harness should prefer independent evidence over explanation. The evaluator can be another agent, a deterministic CI gate, a human reviewer, or a combination. What matters is that the implementer does not own the final grade.
When the judge finds a missing gate, the task should return with a precise reason: “behavior evidence missing for invited-user journey.” That is more useful than “needs more testing” because the implementer knows exactly which proof to produce.
Common mistakes
The first mistake is equating changed code with fixed behavior.
The second mistake is accepting a narrow test when the task described an end-to-end flow.
The third mistake is treating skipped tests as neutral. Missing evidence should block completion.
The fourth mistake is letting the agent review its own summary instead of the actual diff and test output.
Practical exercise
Pick one recent patch. Ignore the summary. List only evidence proving the user-visible behavior works.
If the evidence would not convince a skeptical reviewer, design a stronger completion gate.
Key takeaways
- Agent confidence is not completion evidence.
- User-visible bugs need user-visible verification.
- Missing evidence should block “done.”
- CI and review are harness components.
- The implementing agent should not be the only judge.
Further reading / source notes
- OpenAI, “Harness engineering” for feedback-loop framing.
- Anthropic, “Effective harnesses for long-running agents” for explicit verification before completion.