
Close the Feedback Loop

Use recurring coding-agent failures to strengthen the harness.

Failure pattern

The same review comments return across agent patches: wrong test command, missed edge case, broad refactor, unsafe migration, stale docs. Humans fix the PR, but the harness does not improve.

A repeated review comment is harness data. It says the environment made the wrong behavior easy.

Incident: repeated migration review failure

Agent task

Across several issues, the coding agent adds database columns for workspace features.

Review repeatedly says:

Migration is unsafe for existing tenants. Add a backfill plan and avoid locking writes.

Available surface

The repo has:

| Surface | Contents |
| --- | --- |
| Migration examples | Mixed safe and unsafe patterns |
| DB docs | Backfill guidance, partially outdated |
| Review comments | Several corrected PRs |
| CI | Schema checks but no migration safety check |
| Runbook | Production deploy rules |

The agent keeps copying a simple migration pattern.

Bad run

It creates:

```sql
ALTER TABLE workspace ADD COLUMN require_sso BOOLEAN NOT NULL DEFAULT false;
```

Review flags table-lock risk and missing backfill plan again.

Why the harness failed

The failure was corrected but not converted into a harness fix.

| Repeated failure | Harness layer |
| --- | --- |
| Unsafe migration copied | Context examples weak |
| Backfill missing | Completion gate incomplete |
| Review comments ignored | Feedback not persisted |
| CI passed | Automated checks missing |
| Same issue repeated | No regression case |

The team patched the output, not the system.

Why it happens

Coding agents learn from the context and feedback they can see. If prior review comments live only in PR threads, the next run may not see them. If unsafe examples remain near safe examples, the agent may copy the shorter pattern.

Closing the loop means turning human correction into future harness behavior.

Harness principle

Every repeated failure should produce a small harness change and a comparable rerun.

```mermaid
flowchart LR
  A["Failed PR"] --> B["Attribute layer"]
  B --> C["Small harness fix"]
  C --> D["Comparable coding task"]
  D --> E{"Failure prevented?"}
  E -->|"Yes"| F["Keep regression"]
  E -->|"No"| B
```
Review comments become harness fixes and regression checks.

The goal is not to write a longer prompt. The goal is to make the same mistake harder to repeat.

Operating practice

Use a failure log:

| Failure | Layer | Harness fix | Evidence |
| --- | --- | --- | --- |
| Unsafe `NOT NULL DEFAULT` migration | Context | Mark unsafe examples as legacy; add safe migration guide | Next migration uses expand/backfill/contract |
| Backfill plan missing | Completion gate | Add migration checklist to PR readiness | Agent outputs backfill step |
| CI misses migration risk | Verification | Add lint/check or reviewer checklist item | PR blocked before review |
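The verification-layer fix can start very small. A minimal sketch of a migration lint, assuming migrations are plain SQL scripts (the `lint_migration` helper and its regex are illustrative, not from the repo; whether this pattern actually locks or rewrites the table depends on the database and version):

```python
import re

# Hypothetical CI check: flag ALTER TABLE ... ADD COLUMN ... NOT NULL DEFAULT,
# which can rewrite the table or hold a long write lock on some databases.
UNSAFE = re.compile(
    r"ALTER\s+TABLE\s+\S+\s+ADD\s+COLUMN\s+\S+\s+\S+\s+NOT\s+NULL\s+DEFAULT",
    re.IGNORECASE,
)

def lint_migration(sql: str) -> list[str]:
    """Return one warning per unsafe statement in a migration script."""
    return [
        f"unsafe migration: {stmt.strip()}"
        for stmt in sql.split(";")
        if UNSAFE.search(stmt)
    ]
```

Wired into CI, this blocks the bad pattern before a human reviewer ever sees it, which is exactly the "PR blocked before review" evidence in the table above.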

Then add a regression case:

Task: add workspace boolean setting.
Expected:
- nullable column or expand/backfill/contract plan
- no long write lock
- rollback notes
- verification command
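The expected expand/backfill/contract shape can be spelled out next to the regression case. A sketch of the three ordered steps for the same column (the SQL strings are illustrative; exact statements and batching strategy depend on the database):

```python
# Illustrative expand/backfill/contract plan for the require_sso column.
SAFE_MIGRATION = [
    # Expand: add the column as nullable, avoiding a blocking rewrite.
    "ALTER TABLE workspace ADD COLUMN require_sso BOOLEAN;",
    # Backfill: populate existing rows (in small batches, in a real system).
    "UPDATE workspace SET require_sso = false WHERE require_sso IS NULL;",
    # Contract: enforce the constraint only after the backfill completes.
    "ALTER TABLE workspace ALTER COLUMN require_sso SET NOT NULL;",
]
```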

Coding-agent example

Failure attribution rubric:

| Question | Likely layer |
| --- | --- |
| Was the task vague? | Work surface |
| Did the agent copy a stale pattern? | Context |
| Did a command or tool invite risk? | Interface |
| Was the baseline unknown? | Runway |
| Did the patch sprawl? | Active work |
| Did state vanish? | Progress |
| Did weak evidence pass? | Judging |
| Could the path not be reconstructed? | Instrumentation |
| Was the final state unclear? | Handoff |
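To keep attribution consistent across a failure log, the rubric can live in tooling rather than memory. A sketch, with shorthand keys that are an assumption of this example (the layer names come from the rubric above):

```python
# The attribution rubric as a lookup table, so failure-log entries
# are tagged with consistent layer names. Keys are shorthand questions.
ATTRIBUTION_RUBRIC = {
    "task_vague": "Work surface",
    "stale_pattern_copied": "Context",
    "risky_command": "Interface",
    "baseline_unknown": "Runway",
    "patch_sprawled": "Active work",
    "state_vanished": "Progress",
    "weak_evidence_passed": "Judging",
    "path_not_reconstructable": "Instrumentation",
    "final_state_unclear": "Handoff",
}
```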

Review artifact

Feedback should become a harness change, not a repeated comment.

| Review comment | Attribution | Harness change |
| --- | --- | --- |
| "Migration locks the table." | Interface | Add migration planner output before apply |
| "Backfill has no rollback." | Work surface | Require rollback note for data changes |
| "Test only covers happy path." | Verification | Add negative-path regression template |
| "Agent changed unrelated cleanup." | Active work | Require queued/rejected state table |
| "Reviewer cannot reproduce result." | Instrumentation | Store run command and fixture version |

This table turns review pain into system improvement. If the same comment appears three times, the harness is failing. The answer is rarely “try harder.” The answer is usually a sharper task brief, safer tool surface, better context routing, stronger evaluation gate, or better run record.

The loop should be small:

```mermaid
flowchart LR
  A["Observed failure"] --> B["Attribute to harness layer"]
  B --> C["Change one harness rule"]
  C --> D["Rerun comparable case"]
  D --> E["Keep, adjust, or remove rule"]
```

The comparable case is important. If the agent failed on a migration because it ignored lock risk, rerun another migration-like task after adding the planner requirement. Do not wait for a future production incident to learn whether the harness improved.

For coding agents, feedback loops should live in durable assets: task templates, command wrappers, review checklists, eval cases, and examples. A reviewer comment inside one PR helps that PR. A changed harness helps the next ten PRs.

Harnessed version

The harnessed run treats the repeated migration issue as a system signal. The team adds a migration-planning requirement, creates two regression prompts that resemble previous bad migrations, and reruns them after the rule changes. If the agent now produces rollback notes and lock-risk evidence, the harness improved. If it still misses the risk, the fix was in the wrong layer.
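The completion-gate half of that rule can be a few lines of checking. A minimal sketch, assuming the agent's PR description is plain text and the required section names are as chosen here (both are assumptions of this example):

```python
# Hypothetical PR-readiness gate: a data-change PR description must name
# a backfill plan, rollback notes, and a verification command.
REQUIRED_SECTIONS = ("backfill", "rollback", "verification")

def readiness_gaps(pr_description: str) -> list[str]:
    """Return the required sections the PR description never mentions."""
    text = pr_description.lower()
    return [section for section in REQUIRED_SECTIONS if section not in text]
```

A CI job would fail the check whenever the returned list is nonempty, so a migration PR without rollback notes never reaches a human reviewer.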

This is the core of feedback-loop work: separate the agent’s one-time mistake from the harness weakness that allowed it. A stronger prompt might help for one task. A better command contract, review gate, or eval case changes future behavior.

The loop should stay intentionally small. Do not rewrite the entire harness because one patch had one review comment. Attribute the failure, change one thing, rerun a comparable case, and keep the change only if it improves the result.

That discipline keeps the harness learnable. Engineers can understand why a new rule exists because it points back to a real failure.

Common mistakes

The first mistake is repeating review comments without changing the harness.

The second mistake is fixing everything at once. Small fixes are easier to evaluate.

The third mistake is adding only negative instructions. It is better to add safe examples and gates.

The fourth mistake is letting corrected PRs disappear instead of becoming examples.

Practical exercise

Review five agent PR comments. Group repeated comments by harness layer. Pick one repeated failure and design the smallest harness change that would have prevented it.

Then test the change on a comparable task.

Key takeaways

  • Repeated review comments are harness data.
  • Fixing a PR is not the same as fixing the harness.
  • Safe examples beat vague warnings.
  • Regression cases keep old failures from returning.
  • Feedback loops should be small and testable.

Further reading / source notes