Instrument the Work
Make coding-agent behavior debuggable with command logs, diffs, run records, and decision artifacts.
Failure pattern
A bad patch reaches review, but the team cannot reconstruct which files the agent inspected, which tests failed, which commands were rerun, or why it chose the final approach.
Without instrumentation, debugging becomes a debate about the final diff. That is too late.
Incident: hidden test failure path
Agent task
The agent is asked:
Fix SSO invite acceptance and update tests.
The final patch changes auth guards and tests.
Available surface
The run includes:
| Surface | Signal |
|---|---|
| Files read | auth guard, invite route, membership helper |
| Commands | unit tests, typecheck, e2e |
| Failures | initial unit failure, later e2e timeout |
| Decisions | chose to wait for membership hydration |
| Diff | code and test changes |
| Handoff | final status and missing checks |
Only the final summary is saved.
Bad run
Review sees a patch that looks reasonable, but CI fails e2e. The agent summary says:
Tests passed locally.
No one knows which tests passed. The e2e command may not have run. The agent may have seen a timeout and ignored it. The path is invisible.
Why the harness failed
The harness recorded the output but not the run.
| Missing signal | Consequence |
|---|---|
| Commands run | Cannot verify local evidence |
| Failed commands | Cannot see ignored failures |
| Files inspected | Cannot tell whether agent read relevant code |
| Decision artifact | Cannot inspect why approach was chosen |
| Verification status | “Tests passed” is too vague |
The patch is not debuggable.
Why it happens
Software teams instrument applications, but often do not instrument agent work. Agent work needs process signal: commands, file reads, failures, decisions, verification, and handoff.
You do not need every token. You need enough evidence to reconstruct important decisions.
Harness principle
Instrument the coding run.
```mermaid
flowchart TD
    A["Task contract"] --> B["Files inspected"]
    B --> C["Commands run"]
    C --> D["Failures observed"]
    D --> E["Decision artifact"]
    E --> F["Verification evidence"]
    F --> G["Final patch"]
    B --> H["Run record"]
    C --> H
    E --> H
    F --> H
```
The run record makes the patch reviewable.
Operating practice
Create a minimal run record (a sketch of a matching record type follows the table):
| Field | Example |
|---|---|
| Run ID | CODING-INVITE-403-2026-05-18 |
| Objective | Fix intermittent 403 after invite acceptance |
| Files inspected | invite route, auth callback, membership helper |
| Commands run | unit guard test, API invite test, typecheck |
| Failures | e2e not run: preview email service unavailable |
| Decision | wait for membership hydration before redirect |
| Verification | unit and API passed; e2e missing |
| Final status | patch with blocker |
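As a sketch, the table maps onto a small record type, assuming TypeScript; the field and type names here are illustrative, not a prescribed schema:

```typescript
// Hypothetical shape for the minimal run record above.
// Field names are illustrative; adapt them to your harness.
interface RunRecord {
  runId: string;              // e.g. "CODING-INVITE-403-2026-05-18"
  objective: string;          // what the agent was asked to do
  filesInspected: string[];   // paths the agent actually read
  commandsRun: CommandRun[];  // every command, including failures
  failures: string[];         // observed failures and skipped checks, with reasons
  decision: string;           // chosen approach, in one line
  verification: string;       // which checks passed, which are missing
  finalStatus: "complete" | "patch-with-blocker" | "abandoned";
}

interface CommandRun {
  command: string;            // exact invocation, e.g. "pnpm test invite-api.spec.ts"
  result: "pass" | "fail" | "not-run";
  reason?: string;            // required when result is "not-run"
}
```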
Harnessed run
Verification:
- pnpm test auth-guard.spec.ts: pass
- pnpm test invite-api.spec.ts: pass
- pnpm playwright test invite-flow.spec.ts: not run, preview service unavailable
Status:
- Not complete until e2e or CI equivalent passes.
The reviewer now knows exactly what evidence exists.
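One way to make that evidence automatic is to route every command through a thin wrapper that appends to the run record. A minimal sketch, assuming Node and TypeScript; the helper names are hypothetical:

```typescript
import { execSync } from "node:child_process";

// Reuses the CommandRun shape from the run-record sketch above.
type CommandRun = { command: string; result: "pass" | "fail" | "not-run"; reason?: string };

const commandLog: CommandRun[] = [];

// Run a command and record the outcome either way, so failed
// attempts are never silently dropped from the record.
function runAndRecord(command: string): boolean {
  try {
    execSync(command, { stdio: "inherit" });
    commandLog.push({ command, result: "pass" });
    return true;
  } catch {
    commandLog.push({ command, result: "fail" });
    return false;
  }
}

// Record checks that could not run at all, with the reason, e.g.
// recordSkipped("pnpm playwright test invite-flow.spec.ts",
//               "preview email service unavailable").
function recordSkipped(command: string, reason: string): void {
  commandLog.push({ command, result: "not-run", reason });
}
```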
Coding-agent example
Useful event types:
| Event | Required fields |
|---|---|
| task_started | objective, scope, exclusions |
| file_read | path, reason |
| command_run | command, result, duration |
| failure_seen | command, error, suspected cause |
| decision_made | option chosen, alternatives rejected |
| verification_ran | check, pass/fail, evidence |
| handoff_written | status, blockers, next action |
Structured records beat giant transcripts.
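In TypeScript, the table above maps naturally onto a discriminated union; this is a sketch whose names mirror the table, not a fixed schema:

```typescript
// One event variant per row of the table; "type" discriminates.
type RunEvent =
  | { type: "task_started"; objective: string; scope: string[]; exclusions: string[] }
  | { type: "file_read"; path: string; reason: string }
  | { type: "command_run"; command: string; result: "pass" | "fail"; durationMs: number }
  | { type: "failure_seen"; command: string; error: string; suspectedCause: string }
  | { type: "decision_made"; chosen: string; rejected: string[] }
  | { type: "verification_ran"; check: string; passed: boolean; evidence: string }
  | { type: "handoff_written"; status: string; blockers: string[]; nextAction: string };

// A run trace is then just an ordered list of events.
type RunTrace = RunEvent[];
```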
Review artifact
A run record should let someone reconstruct the path from request to claim.
```json
{
  "run_id": "coding-2026-05-18-0042",
  "task": "Fix invite acceptance redirect after SSO enforcement",
  "commit_base": "8f2c19a",
  "active_behavior": "valid invite lands in workspace setup",
  "context_sources": [
    "ADR-029-membership-events",
    "tests/e2e/invite-acceptance.spec.ts"
  ],
  "tool_calls": [
    {"name": "pnpm test invite", "result": "failed before patch"},
    {"name": "pnpm test invite", "result": "passed after patch"},
    {"name": "pnpm lint", "result": "passed"}
  ],
  "changed_files": [
    "src/routes/invite/[token].ts",
    "tests/e2e/invite-acceptance.spec.ts"
  ],
  "open_risks": ["first-login analytics event not verified"]
}
```
This is not meant to replace logs. It is the index to the logs. A reviewer can ask: which context did the agent use, which behavior was active, which commands ran, and what risk remains?
Instrumentation should capture decision points, not only final commands. If the agent chooses ADR-029 over ADR-014, that choice should appear in the trace. If it rejects a dashboard change as out of scope, that rejection should appear too. These events explain the shape of the patch.
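For example, the ADR choice and the scope rejection from the running example could each land in the trace as a decision_made event. A sketch reusing the RunEvent union above; the rejection texts are illustrative:

```typescript
const decisions: RunEvent[] = [
  {
    type: "decision_made",
    chosen: "follow ADR-029 membership events for hydration",
    rejected: ["ADR-014 session-refresh approach: superseded for invite flows"],
  },
  {
    type: "decision_made",
    chosen: "leave dashboard unchanged",
    rejected: ["dashboard membership badge update: out of scope for this task"],
  },
];
```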
The harnessed version of the hidden-failure incident would have shown that the agent never ran the cross-browser e2e path. That does not automatically mean the patch is wrong, but it changes the review state from “done” to “missing required evidence.” Good instrumentation makes uncertainty visible early.
Harnessed version
The harnessed run produces a trace that can answer three questions: what did the agent know, what did it do, and why did it believe the task was complete? The answer should not require reading every token of the conversation. A compact run record, linked command outputs, and named decision events are enough.
For example, if the invite fix fails only in Firefox, the trace should show whether Firefox was in the required evidence list, whether the agent skipped it, whether the tool was unavailable, or whether the failure appeared after handoff. Each explanation points to a different harness fix. Missing evidence points to verification. Unavailable tooling points to runway. A skipped required command points to judging. A post-handoff regression points to release process.
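That triage can even be made mechanical. A sketch of the mapping, with hypothetical finding labels:

```typescript
// Hypothetical trace findings mapped to the harness area that owns the fix.
type TraceFinding =
  | "evidence-missing-from-required-list"
  | "tool-unavailable"
  | "required-command-skipped"
  | "regression-after-handoff";

function harnessAreaFor(finding: TraceFinding): string {
  switch (finding) {
    case "evidence-missing-from-required-list": return "verification";
    case "tool-unavailable":                    return "runway";
    case "required-command-skipped":            return "judging";
    case "regression-after-handoff":            return "release process";
  }
}
```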
Instrumentation is therefore not only observability for debugging code. It is observability for debugging the harness itself. Without it, teams argue from memory. With it, they can inspect the run.
At minimum, record the base commit, active behavior, context sources, commands run, changed files, skipped checks, and open risks. That small record is usually enough to decide whether a failure came from implementation, verification, context, or handoff.
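A small completeness check turns that minimum into something testable; a sketch, with field names matching the run-record example above:

```typescript
// Fields a run record must carry before a failure can be triaged.
const requiredFields = [
  "commit_base",
  "active_behavior",
  "context_sources",
  "tool_calls",
  "changed_files",
  "open_risks",
] as const;

// Returns the fields still missing; empty means the record can support diagnosis.
function missingEvidence(record: Record<string, unknown>): string[] {
  return requiredFields.filter((field) => record[field] == null);
}
```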
If the record cannot support that diagnosis, the harness is still operating on trust instead of durable evidence.
Common mistakes
The first mistake is logging only final summaries. Summaries hide failed paths.
The second mistake is omitting failed commands. Failures are the most valuable signal.
The third mistake is saying “tests pass” without command names.
The fourth mistake is recording decisions without rejected alternatives.
Practical exercise
Design a run record for one coding-agent task. Include files inspected, commands run, failures, decisions, verification, and final status.
Then ask whether a reviewer could audit the patch without rerunning everything.
Key takeaways
- Coding-agent observability needs process signal.
- Commands and failures should be recorded explicitly.
- “Tests passed” is not enough.
- Decision artifacts make patches easier to review.
- Run records support debugging and handoff.
Further reading / source notes
- OpenTelemetry Signals for trace and event vocabulary.
- Honeycomb, “Observability Engineering” for debugging complex systems from rich events.