Separate Doing from Judging
Do not let the implementing coding agent be the only judge of completion.
Failure pattern
The agent implements a patch, runs a narrow check, and declares the issue done. Later, CI or a human reviewer catches a broken real flow.
The agent may not be lying. It judged from the evidence available to it. The harness allowed weak evidence to count as completion.
Incident: PR marked done too early
Agent task
The agent is asked:
Fix invite acceptance for required-SSO workspaces and open the PR.
The agent patches the redirect guard and adds a unit test.
Available surface
The repo has:
| Surface | Evidence type |
|---|---|
| Unit tests | Guard behavior |
| API tests | Invite acceptance endpoint |
| E2E tests | Real invite email to first login |
| Typecheck | Cross-module type safety |
| CI | Full verification |
| Review checklist | Product behavior, migrations, security, tests |
The agent runs only the unit test.
Bad run
It reports:
Fixed invite acceptance.
Tested redirect guard.
Ready for review.
CI fails the e2e suite. The real flow still breaks because the invite email link carries a token format the unit test never covered. The patch fixed the isolated guard, not the product behavior.
Why the harness failed
The implementer judged completion.
| Missing gate | Consequence |
|---|---|
| Behavior-level evidence | Unit test passed but real flow failed |
| Independent reviewer | No role checked user-visible path |
| CI gate | PR readiness did not wait for required checks |
| Missing-evidence rule | “Not run” did not block completion |
| Review rubric | Agent treated code change as product fix |
The task needed external judgment.
Why it happens
Coding agents often optimize for the nearest test. If a unit test passes, the patch feels complete. But user-visible behavior may depend on routing, email tokens, API state, background jobs, and browser flow.
Humans separate doing and judging with code review, CI, QA, and acceptance criteria. A harness should do the same.
Harness principle
Completion is evidence passing, not agent confidence.
```mermaid
flowchart LR
    A["Work surface"] --> B["Implementing agent"]
    B --> C["Patch and evidence"]
    C --> D["Completion gate"]
    D -->|"Pass"| E["Ready"]
    D -->|"Fail"| F["Fix or block"]
```
The worker can produce evidence. A gate decides whether the evidence is enough.
Operating practice
Define a completion gate:
| Requirement | Pass condition |
|---|---|
| Reproduction | Original bug reproduces before the fix, or is reproduced from the issue description |
| Unit | Redirect guard test passes |
| API | Invite acceptance endpoint test passes |
| E2E | Email invite to first login works |
| Typecheck | Project typecheck passes |
| Missing evidence | Any skipped required check blocks “done” |
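A gate like this can be expressed as a small deterministic check. The sketch below is illustrative, not a real harness API: the gate names mirror the table, and the function name and status strings are assumptions. The key property is that missing evidence blocks exactly like failing evidence.

```python
# Hypothetical completion gate: any required check that is missing or
# failing blocks "done". Names and statuses are illustrative.

PASS, FAIL, MISSING = "pass", "fail", "missing"

REQUIRED_GATES = ["reproduction", "unit", "api", "e2e", "typecheck"]

def judge_completion(evidence):
    """Return ("ready" | "blocked", reasons). Skipped checks are not neutral."""
    reasons = []
    for gate in REQUIRED_GATES:
        result = evidence.get(gate, MISSING)
        if result == MISSING:
            reasons.append(f"{gate}: evidence missing")
        elif result == FAIL:
            reasons.append(f"{gate}: check failed")
    return ("ready" if not reasons else "blocked", reasons)

# The bad run from the incident: only the unit test was executed.
status, reasons = judge_completion({"unit": PASS})
# status == "blocked"; the other four gates report missing evidence
```

Note the default: a check that never ran resolves to `MISSING`, so the agent cannot reach “ready” by simply not running the e2e suite.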
Harnessed run
The agent returns:
Status: not ready.
Implemented redirect guard fix and unit test.
Passed: unit guard test, API invite acceptance test.
Missing: E2E invite flow not run because local email preview service is down.
Next: restore preview service or ask reviewer for CI-based verification.
This is more honest than a premature “done.”
Coding-agent example
Use reviewer roles:
| Role | Checks |
|---|---|
| Implementer | Produces patch and first evidence |
| CI | Runs deterministic test suite |
| Reviewer | Checks scope, behavior, security, maintainability |
| Product verifier | Confirms user-visible scenario when needed |
The harness should not let the implementer skip all other roles.
Review artifact
The useful artifact is an evidence packet that a judge can inspect without trusting the implementing agent’s confidence.
| Gate | Required evidence | Result |
|---|---|---|
| Behavior gate | E2E invite acceptance passes from email link to workspace setup | Missing |
| Regression gate | Expired-token and already-accepted-token paths still pass | Present |
| Code gate | Patch limited to invite acceptance route and tests | Present |
| Risk gate | Auth redirect and membership side effects explained | Partial |
| Human gate | Reviewer approves after seeing evidence packet | Not started |
This packet makes “ready” a judged state, not a feeling. The implementer can say the patch is ready for evaluation, but it cannot mark the behavior complete until the gates pass.
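The evidence packet can be carried as a plain data structure the judge inspects. This is a sketch under assumptions: the gate names follow the table above, but the field names and schema are invented for illustration, not taken from any specific harness.

```python
# Illustrative evidence packet; schema is an assumption, not a real format.
from dataclasses import dataclass

@dataclass
class Gate:
    name: str
    required_evidence: str
    result: str  # "present", "partial", "missing", or "not started"

def packet_ready(gates):
    """'Ready' is a judged state: every gate must hold present evidence."""
    return all(g.result == "present" for g in gates)

packet = [
    Gate("behavior", "E2E invite acceptance from email link", "missing"),
    Gate("regression", "expired and already-accepted token paths pass", "present"),
    Gate("code", "patch limited to invite acceptance route and tests", "present"),
    Gate("risk", "auth redirect and membership side effects explained", "partial"),
    Gate("human", "reviewer approves after seeing packet", "not started"),
]

packet_ready(packet)  # False: behavior, risk, and human gates are not satisfied
```

Because the packet is data, a CI step, a second agent, or a human can all apply the same readiness rule without trusting the implementer’s summary.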
A coding harness can express this with two prompts or two agents:
Implementer:
- make the smallest patch for the active behavior
- collect evidence
- do not declare final completion
Judge:
- inspect diff and evidence
- rerun or challenge key checks
- decide Ready, Needs Work, or Blocked
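The judge role above can be sketched as a function that maps evidence to one of the three verdicts, always with a precise reason. The verdict logic and names here are assumptions for illustration; a real judge might also rerun checks rather than trust reported results.

```python
# Hypothetical judge: maps evidence to Ready / Needs Work / Blocked.
# In this sketch, None means a check was never run, False means it failed.

def judge(evidence):
    missing = [name for name, result in evidence.items() if result is None]
    failed = [name for name, result in evidence.items() if result is False]
    if missing:
        return "Blocked", f"evidence missing: {', '.join(missing)}"
    if failed:
        return "Needs Work", f"checks failing: {', '.join(failed)}"
    return "Ready", "all gates hold passing evidence"

verdict, reason = judge({"unit": True, "api": True, "e2e": None})
# verdict == "Blocked"; the reason names the exact missing proof
```

Returning the missing gate by name is what makes the rejection actionable: the implementer learns which proof to produce, not just that more testing is needed.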
The judge does not need to be hostile. It needs a different job. The implementer optimizes for producing a solution; the judge optimizes for detecting unsupported claims. This separation matters because language models are good at writing convincing completion summaries even when evidence is thin.
In the PR incident, the implementer had a unit test and a plausible explanation. The judge would have asked for the user journey evidence: can a newly invited user click the link and land in the workspace? That single gate catches the missing e2e path.
Harnessed version
The harnessed run has two moments of completion. First, the implementing agent reaches “ready for judgment” and submits a patch with evidence. Second, an evaluator reaches “ready for human review” after inspecting the evidence against gates. These states should not collapse into one.
This is not bureaucracy. It is a guard against fluent self-assessment. Coding agents can explain why their patch should work, but the harness should prefer independent evidence over explanation. The evaluator can be another agent, a deterministic CI gate, a human reviewer, or a combination. What matters is that the implementer does not own the final grade.
When the judge finds a missing gate, the task should return with a precise reason: “behavior evidence missing for invited-user journey.” That is more useful than “needs more testing” because the implementer knows exactly which proof to produce.
Common mistakes
The first mistake is equating changed code with fixed behavior.
The second mistake is accepting a narrow test when the task described an end-to-end flow.
The third mistake is treating skipped tests as neutral. Missing evidence should block completion.
The fourth mistake is letting the agent review its own summary instead of the actual diff and test output.
Practical exercise
Pick one recent patch. Ignore the summary. List only evidence proving the user-visible behavior works.
If the evidence would not convince a skeptical reviewer, design a stronger completion gate.
Key takeaways
- Agent confidence is not completion evidence.
- User-visible bugs need user-visible verification.
- Missing evidence should block “done.”
- CI and review are harness components.
- The implementing agent should not be the only judge.
Further reading / source notes
- OpenAI, “Harness engineering” for feedback-loop framing.
- Anthropic, “Effective harnesses for long-running agents” for explicit verification before completion.