Persist Progress
Make long-running coding work restartable without relying on chat history.
Failure pattern
A bug investigation spans multiple sessions, but the findings live only in the conversation. The next agent repeats work, trusts a stale summary, or contradicts the prior diagnosis.
Coding work often pauses at the worst time: after reproducing a bug, before the fix is known. Without durable progress, a continuation starts from memory instead of evidence.
Incident: intermittent invite failure
Agent task
The agent investigates:
Some invited users get a 403 on first login. Find the cause and prepare a fix.
The issue is intermittent and tied to workspace membership hydration.
Available surface
The agent can inspect:
| Surface | Contents |
|---|---|
| Issue thread | User reports and timestamps |
| Logs | auth callback, invite acceptance, membership creation |
| Tests | Playwright invite flow and API tests |
| Database | invitations, memberships, sessions |
| Recent PRs | auth callback refactor and queue change |
| Local notes | scratch findings from prior run |
The first session reproduces the failure but stops before fixing it.
Bad run
The first agent writes in chat:
Looks like a race between invite acceptance and session creation. Need to inspect membership hydration.
The next agent starts fresh, sees a 403, and blames the permission middleware. It edits the middleware, which makes the bug less frequent but does not fix it.
The prior reproduction details are gone.
Why the harness failed
Progress was not persisted as repo state.
| Missing record | Consequence |
|---|---|
| Reproduction steps | Second agent used a different scenario |
| Verified findings | Race condition evidence was lost |
| Rejected hypotheses | Middleware theory was reopened |
| Open blocker | Queue timing was not tracked |
| Next action | Continuation chose a different path |
The conversation had clues. The repo did not.
Why it happens
Long bug hunts create partial knowledge. Some hypotheses are verified, some rejected, and some open. A chat summary compresses that state into vague language. A fresh agent cannot tell what is trustworthy.
Coding harnesses need restart artifacts: short, current, and operational.
Harness principle
Persist progress outside chat.
flowchart LR
    A["Session starts"] --> B["Read progress note"]
    B --> C["Continue active hypothesis"]
    C --> D["Verify or reject"]
    D --> E["Update progress note"]
    E --> F["Next session resumes"]
Progress records should distinguish verified findings from guesses.
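The read, resolve, update cycle can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the JSON layout, the function names (`load_note`, `save_note`, `resolve`), and the idea of storing the note as a single file are all assumptions made for the example.

```python
import json
from pathlib import Path


def load_note(path: Path) -> dict:
    """Return the saved progress note, or an empty skeleton on the first run."""
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {key: [] for key in ("verified", "rejected", "open")}


def save_note(path: Path, note: dict) -> None:
    """Write the note back to disk so the next session can read it."""
    path.write_text(json.dumps(note, indent=2), encoding="utf-8")


def resolve(note: dict, hypothesis: str, confirmed: bool, evidence: str) -> dict:
    """Move an open hypothesis to verified or rejected, keeping its evidence ID."""
    if hypothesis not in note["open"]:
        raise ValueError(f"not an open hypothesis: {hypothesis}")
    note["open"].remove(hypothesis)
    bucket = "verified" if confirmed else "rejected"
    note[bucket].append({"claim": hypothesis, "evidence": evidence})
    return note
```

A session would call `load_note` first, work on one entry from `open`, call `resolve` with the trace or run ID that settled it, and finish with `save_note`. Guesses stay in `open` until evidence moves them.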
Operating practice
Write a restart note:
| Field | Example |
|---|---|
| Objective | Fix intermittent 403 after invite acceptance |
| Reproduction | Create invite, accept email, first auth callback returns 403 in 3/10 runs |
| Verified | Membership row sometimes created after session permission check |
| Rejected | Static role middleware is not the primary cause |
| Open | Queue job timing after auth callback needs inspection |
| Evidence | Log trace INV-403-2026-05-18-A, failing test branch |
| Next action | Add a focused test that checks whether the session waits for membership hydration |
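If the note is generated by tooling rather than typed by hand, the same fields can be held in a small data structure. The sketch below assumes a hypothetical `RestartNote` class whose attribute names mirror the table; nothing about it is required by any particular harness.

```python
from dataclasses import dataclass, field


@dataclass
class RestartNote:
    """Restart note with the same fields as the table above (names are illustrative)."""
    objective: str
    reproduction: str
    verified: list = field(default_factory=list)
    rejected: list = field(default_factory=list)
    open_questions: list = field(default_factory=list)
    evidence: list = field(default_factory=list)
    next_action: str = ""

    def to_markdown(self) -> str:
        """Render the note as a two-column table a person or agent can read."""
        rows = [
            ("Objective", self.objective),
            ("Reproduction", self.reproduction),
            ("Verified", self.verified),
            ("Rejected", self.rejected),
            ("Open", self.open_questions),
            ("Evidence", self.evidence),
            ("Next action", self.next_action),
        ]
        lines = ["| Field | Value |", "|---|---|"]
        for name, value in rows:
            # List-valued fields are joined so each field stays on one row.
            text = "; ".join(value) if isinstance(value, list) else value
            lines.append(f"| {name} | {text} |")
        return "\n".join(lines)
```

Rendering to a table keeps the note diffable in a PR body or task file while the structured form stays available to scripts.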
Harnessed run
The next agent reads the note and returns:
Continuation:
- Reused reproduction from INV-403-2026-05-18-A.
- Confirmed middleware hypothesis was already rejected.
- Added a failing test for a session created before membership hydration completes.
Status:
- Root cause verified.
- Fix not implemented yet.
The second run advances instead of restarting.
Coding-agent example
Use state labels:
| State | Meaning |
|---|---|
| verified | Confirmed with evidence |
| rejected | Tested and ruled out |
| open | Known question |
| blocked | Waiting on access or decision |
| stale | Needs rerun after code changes |
These labels prevent “looked into” from meaning five different things.
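One way to make the labels enforceable is to encode them as an enumeration and let a helper downgrade tested findings when the code they depend on changes. The following is a sketch under assumptions: the `Finding` class, the `depends_on` field, and the file paths are hypothetical and not part of any existing harness.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    VERIFIED = "verified"
    REJECTED = "rejected"
    OPEN = "open"
    BLOCKED = "blocked"
    STALE = "stale"


@dataclass
class Finding:
    claim: str
    state: State
    # Hypothetical: source files whose behavior the finding was tested against.
    depends_on: frozenset = frozenset()


def mark_stale(findings, changed_files):
    """Flag verified or rejected findings as stale when a file they depend on changed."""
    changed = set(changed_files)
    for finding in findings:
        was_tested = finding.state in (State.VERIFIED, State.REJECTED)
        if was_tested and finding.depends_on & changed:
            finding.state = State.STALE
    return findings
```

Open and blocked findings are left alone because there is no earlier test result to invalidate. Only conclusions that were reached against the old code need a rerun.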
Review artifact
A durable progress record should survive a new session, a new agent, or a reviewer joining late.
Research note: intermittent 403 after invite acceptance
Verified:
- Token is valid when acceptance request begins.
- Membership row exists after acceptance mutation.
- 403 happens on first workspace route load, not during token exchange.
Rejected:
- Expired-token branch is not involved.
- Email link encoding is not the cause.
- Browser cache does not reproduce the issue.
Open:
- Session refresh may happen before membership event propagation completes.
- Workspace role cache may be stale after acceptance.
Evidence:
- run-2026-05-18-a: failing e2e trace
- run-2026-05-18-b: membership row inspection
- log-acceptance-17: session refresh timing
Next action:
- Add focused trace around role cache read after acceptance redirect.
This note is more valuable than a chat summary because it separates verified findings from guesses. The next agent can continue from the open question instead of re-running the first hour of investigation.
Persistence also changes how contradiction is handled. If the next agent believes the role cache is not stale, it should add evidence under Rejected, not overwrite the previous note with a new story. The harness should make knowledge additive and auditable.
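An append-only log is one simple way to get that additive behavior. In the sketch below, every name (`Entry`, `record`, `current_state`) is an assumption for illustration, and the second evidence ID is made up. A contradiction is recorded as a new entry, and the earlier entry stays in the history.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Entry:
    claim: str
    state: str
    evidence: str


def record(log: List[Entry], claim: str, state: str, evidence: str) -> List[Entry]:
    """Return a new log with one more entry; existing entries are never edited."""
    if not evidence:
        raise ValueError("every new claim needs an evidence ID")
    return log + [Entry(claim, state, evidence)]


def current_state(log: List[Entry], claim: str) -> Optional[str]:
    """Return the most recent state recorded for a claim, or None if never recorded."""
    for entry in reversed(log):
        if entry.claim == claim:
            return entry.state
    return None


claim = "Workspace role cache may be stale after acceptance"
log = record([], claim, "open", "run-2026-05-18-a")
# "run-hypothetical-c" is an invented ID standing in for the new evidence.
log = record(log, claim, "rejected", "run-hypothetical-c")
```

After the second call, `current_state(log, claim)` reports the claim as rejected, while the original open entry and its evidence remain available for audit.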
For longer coding tasks, store this record close to the work: in a run artifact, issue comment, PR body, or structured task file. Chat history is not enough because it is hard to search, hard to diff, and easy to lose when the session changes. A coding harness should assume restarts are normal.
Harnessed version
In the harnessed run, the first agent does not end with “I think it is session timing.” It ends with a progress record that names the reproduction, verified facts, rejected hypotheses, and next action. The second agent starts by reading that record, then either continues the next action or challenges a finding with new evidence.
That changes the team dynamic. A restart no longer means the investigation returns to zero. A contradiction no longer becomes two competing stories in chat. The durable record becomes the shared memory of the task, and every new claim must attach itself to evidence.
For coding work, this is especially valuable when the bug is intermittent. The agent may need multiple test traces, logs, or fixture variations before the cause is clear. Persisting progress keeps those attempts useful even when the final fix has not been found.
Common mistakes
The first mistake is saving only a final summary. Bug hunts need reproduction and rejected hypotheses.
The second mistake is hiding the progress note somewhere the next agent will not read.
The third mistake is failing to mark stale findings after code changes.
The fourth mistake is omitting command output or trace IDs.
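Some of these mistakes can be caught mechanically before a session ends. The check below is a rough heuristic, assuming the note is a dictionary with hypothetical keys matching the restart-note fields. It cannot tell whether a note is in a place the next agent will read, and an empty `rejected` list may be legitimate early in an investigation.

```python
REQUIRED_FIELDS = (
    "objective",
    "reproduction",
    "verified",
    "rejected",
    "open",
    "evidence",
    "next_action",
)


def lint_note(note: dict) -> list:
    """Return a list of problems; an empty list means no required field is missing."""
    problems = []
    for name in REQUIRED_FIELDS:
        # Treat absent keys, empty strings, and empty lists the same way.
        if not note.get(name):
            problems.append(f"missing or empty field: {name}")
    return problems
```

A harness could run a check like this when a session closes and refuse to mark the task paused until the list is empty.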
Practical exercise
Take one unfinished bug investigation. Write a restart note with objective, reproduction, verified facts, rejected hypotheses, open questions, evidence, and next action.
Start a fresh session from only that note. If it repeats work, the note is not operational enough.
Key takeaways
- Chat history is not durable progress.
- Bug investigations need restart notes.
- Rejected hypotheses are as important as verified ones.
- Evidence IDs make findings inspectable.
- The next action should be explicit.
Further reading / source notes
- Anthropic, “Effective harnesses for long-running agents” for restartable progress practices.
- OpenAI, “Harness engineering” for agent environment and feedback-loop design.