Close the Feedback Loop
Use recurring coding-agent failures to strengthen the harness.
Failure pattern
The same review comments recur across agent patches: wrong test command, missed edge case, overly broad refactor, unsafe migration, stale docs. Humans fix the PR, but the harness does not improve.
A repeated review comment is harness data. It says the environment made the wrong behavior easy.
Incident: repeated migration review failure
Agent task
Across several issues, the coding agent adds database columns for workspace features.
Review repeatedly says:
Migration is unsafe for existing tenants. Add a backfill plan and avoid locking writes.
Available surface
The repo has:
| Surface | Contents |
|---|---|
| Migration examples | Mixed safe and unsafe patterns |
| DB docs | Backfill guidance, partially outdated |
| Review comments | Several corrected PRs |
| CI | Schema checks but no migration safety check |
| Runbook | Production deploy rules |
The agent keeps copying a simple migration pattern.
Bad run
It creates:
ALTER TABLE workspace ADD COLUMN require_sso BOOLEAN NOT NULL DEFAULT false;
Review flags table-lock risk and missing backfill plan again.
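What the review keeps asking for is the expand/backfill/contract sequence instead of the single-step migration. A minimal sketch of that sequence, with the SQL held as strings: the table and column names mirror the example above, while the batch size and PostgreSQL-style semantics are assumptions.

```python
# Safer three-phase alternative to the single NOT NULL DEFAULT migration.
# Phase 1 (expand): add the column as nullable, which avoids a table
# rewrite and long write lock.
EXPAND = "ALTER TABLE workspace ADD COLUMN require_sso BOOLEAN;"

# Phase 2 (backfill): fill existing rows in small batches; repeat until
# no rows remain. The batch size of 1000 is an assumption.
BACKFILL = """
UPDATE workspace SET require_sso = false
WHERE id IN (
    SELECT id FROM workspace WHERE require_sso IS NULL LIMIT 1000
);
"""

# Phase 3 (contract): only after the backfill completes, add the default
# and the constraint.
CONTRACT = [
    "ALTER TABLE workspace ALTER COLUMN require_sso SET DEFAULT false;",
    "ALTER TABLE workspace ALTER COLUMN require_sso SET NOT NULL;",
]
```

The ordering is the point: the constraint arrives last, when every row already satisfies it.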
Why the harness failed
The failure was corrected but not converted into a harness fix.
| Repeated failure | Harness layer |
|---|---|
| Unsafe migration copied | Context examples weak |
| Backfill missing | Completion gate incomplete |
| Review comments ignored | Feedback not persisted |
| CI passed | Automated checks missing |
| Same issue repeated | No regression case |
The team patched the output, not the system.
Why it happens
Coding agents learn from the context and feedback they can see. If prior review comments live only in PR threads, the next run may not see them. If unsafe examples remain near safe examples, the agent may copy the shorter pattern.
Closing the loop means turning human correction into future harness behavior.
Harness principle
Every repeated failure should produce a small harness change and a comparable rerun.
flowchart LR
A["Failed PR"] --> B["Attribute layer"]
B --> C["Small harness fix"]
C --> D["Comparable coding task"]
D --> E{"Failure prevented?"}
E -->|"Yes"| F["Keep regression"]
E -->|"No"| B
The goal is not to write a longer prompt. The goal is to make the same mistake harder to repeat.
Operating practice
Use a failure log:
| Failure | Layer | Harness fix | Evidence |
|---|---|---|---|
| Unsafe NOT NULL DEFAULT migration | Context | Mark unsafe examples as legacy; add safe migration guide | Next migration uses expand/backfill/contract |
| Backfill plan missing | Completion gate | Add migration checklist to PR readiness | Agent outputs backfill step |
| CI misses migration risk | Verification | Add lint/check or reviewer checklist item | PR blocked before review |
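The "CI misses migration risk" row can become an automated check rather than a reviewer's memory. A minimal sketch of such a lint, assuming a hypothetical CI hook; the pattern list covers only the failure seen in this incident, where real tools would cover far more.

```python
import re

# Hypothetical migration-safety lint. Each pattern names one way a
# migration can lock or rewrite a large table; the list here is an
# assumption covering only this chapter's incident.
UNSAFE_PATTERNS = [
    # Adding a NOT NULL column with a DEFAULT can rewrite and lock the
    # table on some database versions.
    re.compile(r"ADD\s+COLUMN\s+\w+\s+\w+\s+NOT\s+NULL\s+DEFAULT", re.IGNORECASE),
    # Changing a column type forces a full table rewrite.
    re.compile(r"ALTER\s+COLUMN\s+\w+\s+TYPE\s", re.IGNORECASE),
]

def lint_migration(sql: str) -> list[str]:
    """Return the unsafe patterns found in a migration script."""
    return [p.pattern for p in UNSAFE_PATTERNS if p.search(sql)]

# The bad run from the incident is flagged; the nullable variant passes.
bad = "ALTER TABLE workspace ADD COLUMN require_sso BOOLEAN NOT NULL DEFAULT false;"
safe = "ALTER TABLE workspace ADD COLUMN require_sso BOOLEAN;"
```

Blocking the PR before human review is what moves this failure from the review thread into the harness.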
Then add a regression case:
Task: add workspace boolean setting.
Expected:
- nullable column or expand/backfill/contract plan
- no long write lock
- rollback notes
- verification command
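The expected items above can double as a completion gate: the agent's PR text must show each one before the task counts as done. A minimal sketch, where the marker strings are assumptions standing in for a real migration checklist.

```python
# Hypothetical completion gate for the regression case above. Each
# expected element maps to marker strings that must appear in the PR
# text; the markers themselves are assumptions.
REQUIRED = {
    "safe schema change": ("nullable", "expand/backfill/contract"),
    "lock risk addressed": ("lock",),
    "rollback notes": ("rollback",),
    "verification command": ("verify",),
}

def missing_evidence(pr_text: str) -> list[str]:
    """Return the names of expected elements absent from the PR text."""
    text = pr_text.lower()
    return [name for name, markers in REQUIRED.items()
            if not any(marker in text for marker in markers)]
```

A gate like this is crude, but it converts the reviewer's recurring comment into a check the agent sees on every run.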
Coding-agent example
Failure attribution rubric:
| Question | Likely layer |
|---|---|
| Was task vague? | Work surface |
| Did agent copy stale pattern? | Context |
| Did command/tool invite risk? | Interface |
| Was baseline unknown? | Runway |
| Did patch sprawl? | Active work |
| Did state vanish? | Progress |
| Did weak evidence pass? | Judging |
| Could path not be reconstructed? | Instrumentation |
| Was final state unclear? | Handoff |
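Kept as a lookup, the rubric gives failure-log entries a consistent layer label. The question-to-layer mapping below is taken directly from the table; only the helper function is an added convenience.

```python
# The attribution rubric as data, so log entries use the same layer
# names every time.
RUBRIC = {
    "Was task vague?": "Work surface",
    "Did agent copy stale pattern?": "Context",
    "Did command/tool invite risk?": "Interface",
    "Was baseline unknown?": "Runway",
    "Did patch sprawl?": "Active work",
    "Did state vanish?": "Progress",
    "Did weak evidence pass?": "Judging",
    "Could path not be reconstructed?": "Instrumentation",
    "Was final state unclear?": "Handoff",
}

def attribute(answers: dict[str, bool]) -> list[str]:
    """Map yes-answers to the harness layers they implicate."""
    return [RUBRIC[q] for q, yes in answers.items() if yes and q in RUBRIC]
```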
Review artifact
Feedback should become a harness change, not a repeated comment.
| Review comment | Attribution | Harness change |
|---|---|---|
| “Migration locks the table.” | Interface | Add migration planner output before apply |
| “Backfill has no rollback.” | Work surface | Require rollback note for data changes |
| “Test only covers happy path.” | Verification | Add negative-path regression template |
| “Agent changed unrelated cleanup.” | Active work | Require queued/rejected state table |
| “Reviewer cannot reproduce result.” | Instrumentation | Store run command and fixture version |
This table turns review pain into system improvement. If the same comment appears three times, the harness is failing. The answer is rarely “try harder.” The answer is usually a sharper task brief, safer tool surface, better context routing, stronger evaluation gate, or better run record.
The loop should be small:
flowchart LR
A["Observed failure"] --> B["Attribute to harness layer"]
B --> C["Change one harness rule"]
C --> D["Rerun comparable case"]
D --> E["Keep, adjust, or remove rule"]
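The loop in the diagram is small enough to express as a function. A minimal sketch, where all three callables are placeholders for real harness hooks, not an existing API.

```python
from typing import Callable

def close_loop(apply_rule: Callable[[], None],
               revert_rule: Callable[[], None],
               run_case: Callable[[], bool]) -> bool:
    """Apply one harness rule, rerun the comparable case, keep or revert."""
    apply_rule()       # change one harness rule
    if run_case():     # rerun comparable case: failure prevented?
        return True    # keep the rule and the regression case
    revert_rule()      # otherwise revert and re-attribute the failure
    return False
```

One rule per pass keeps the result attributable: if the rerun improves, you know which change did it.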
The comparable case is important. If the agent failed on a migration because it ignored lock risk, rerun another migration-like task after adding the planner requirement. Do not wait for a future production incident to learn whether the harness improved.
For coding agents, feedback loops should live in durable assets: task templates, command wrappers, review checklists, eval cases, and examples. A reviewer comment inside one PR helps that PR. A changed harness helps the next ten PRs.
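One way to make the failure log a durable asset rather than a PR comment is a small record type checked into the repo. A sketch: the field names mirror the log's columns, and the example entry comes from the incident above; keeping entries in code is an assumption about storage.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FailureLogEntry:
    failure: str      # what review kept flagging
    layer: str        # harness layer from the attribution rubric
    harness_fix: str  # the one change made
    evidence: str     # how the rerun showed improvement

ENTRY = FailureLogEntry(
    failure="Unsafe NOT NULL DEFAULT migration",
    layer="Context",
    harness_fix="Mark unsafe examples as legacy; add safe migration guide",
    evidence="Next migration uses expand/backfill/contract",
)
```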
Harnessed version
The harnessed run treats the repeated migration issue as a system signal. The team adds a migration-planning requirement, creates two regression prompts that resemble previous bad migrations, and reruns them after the rule changes. If the agent now produces rollback notes and lock-risk evidence, the harness improved. If it still misses the risk, the fix was in the wrong layer.
This is the core of feedback-loop work: separate the agent’s one-time mistake from the harness weakness that allowed it. A stronger prompt might help for one task. A better command contract, review gate, or eval case changes future behavior.
The loop should stay intentionally small. Do not rewrite the entire harness because one patch had one review comment. Attribute the failure, change one thing, rerun a comparable case, and keep the change only if it improves the result.
That discipline keeps the harness learnable. Engineers can understand why a new rule exists because it points back to a real failure.
Common mistakes
The first mistake is repeating review comments without changing the harness.
The second mistake is fixing everything at once. Small fixes are easier to evaluate.
The third mistake is adding only negative instructions. Better to add safe examples and gates.
The fourth mistake is letting corrected PRs disappear instead of becoming examples.
Practical exercise
Review five agent PR comments. Group repeated comments by harness layer. Pick one repeated failure and design the smallest harness change that would have prevented it.
Then test the change on a comparable task.
Key takeaways
- Repeated review comments are harness data.
- Fixing a PR is not the same as fixing the harness.
- Safe examples beat vague warnings.
- Regression cases keep old failures from returning.
- Feedback loops should be small and testable.
Further reading / source notes
- Google SRE, “Postmortem Culture” for learning from repeated failures.
- Anthropic, “Effective harnesses for long-running agents” for improving harnesses from observed failures.