Instrument the Work
Make coding-agent behavior debuggable with command logs, diffs, run records, and decision artifacts.
Failure pattern
A bad patch reaches review, but the team cannot reconstruct which files the agent inspected, which tests failed, which commands were rerun, or why it chose the final approach.
Without instrumentation, debugging becomes a debate about the final diff. That is too late.
Incident: hidden test failure path
Agent task
The agent is asked:
Fix SSO invite acceptance and update tests.
The final patch changes auth guards and tests.
Available surface
The run includes:
| Surface | Signal |
|---|---|
| Files read | auth guard, invite route, membership helper |
| Commands | unit tests, typecheck, e2e |
| Failures | initial unit failure, later e2e timeout |
| Decisions | chose to wait for membership hydration |
| Diff | code and test changes |
| Handoff | final status and missing checks |
Only the final summary is saved.
Bad run
Review sees a patch that looks reasonable, but CI fails e2e. The agent summary says:
Tests passed locally.
No one knows which tests passed. The e2e command may not have run. The agent may have seen a timeout and ignored it. The path is invisible.
Why the harness failed
The harness recorded the output but not the run.
| Missing signal | Consequence |
|---|---|
| Commands run | Cannot verify local evidence |
| Failed commands | Cannot see ignored failures |
| Files inspected | Cannot tell whether agent read relevant code |
| Decision artifact | Cannot inspect why approach was chosen |
| Verification status | “Tests passed” is too vague |
The patch is not debuggable.
Why it happens
Software teams instrument applications, but often do not instrument agent work. Agent work needs process signal: commands, file reads, failures, decisions, verification, and handoff.
You do not need every token. You need enough evidence to reconstruct important decisions.
Harness principle
Instrument the coding run.
```mermaid
flowchart TD
    A["Task contract"] --> B["Files inspected"]
    B --> C["Commands run"]
    C --> D["Failures observed"]
    D --> E["Decision artifact"]
    E --> F["Verification evidence"]
    F --> G["Final patch"]
    B --> H["Run record"]
    C --> H
    E --> H
    F --> H
```
The run record makes the patch reviewable.
Operating practice
Create a minimal run record (a sketch of a matching record type follows the table):
| Field | Example |
|---|---|
| Run ID | CODING-INVITE-403-2026-05-18 |
| Objective | Fix intermittent 403 after invite acceptance |
| Files inspected | invite route, auth callback, membership helper |
| Commands run | unit guard test, API invite test, typecheck |
| Failures | e2e not run: preview email service unavailable |
| Decision | wait for membership hydration before redirect |
| Verification | unit and API passed; e2e missing |
| Final status | patch with blocker |
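As a sketch, the table maps onto a small record type, assuming TypeScript; the field and type names here are illustrative, not a prescribed schema:

```typescript
// Hypothetical shape for the minimal run record above.
// Field names are illustrative; adapt them to your harness.
interface RunRecord {
  runId: string;              // e.g. "CODING-INVITE-403-2026-05-18"
  objective: string;          // what the agent was asked to do
  filesInspected: string[];   // paths the agent actually read
  commandsRun: CommandRun[];  // every command, including failures
  failures: string[];         // observed failures and skipped checks, with reasons
  decision: string;           // chosen approach, in one line
  verification: string;       // which checks passed, which are missing
  finalStatus: "complete" | "patch-with-blocker" | "abandoned";
}

interface CommandRun {
  command: string;            // exact invocation, e.g. "pnpm test invite-api.spec.ts"
  result: "pass" | "fail" | "not-run";
  reason?: string;            // required when result is "not-run"
}
```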
Harnessed run
Verification:
- pnpm test auth-guard.spec.ts: pass
- pnpm test invite-api.spec.ts: pass
- pnpm playwright test invite-flow.spec.ts: not run, preview service unavailable
Status:
- Not complete until e2e or CI equivalent passes.
The reviewer now knows exactly what evidence exists.
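One way to make that evidence automatic is to route every command through a thin wrapper that appends to the run record. A minimal sketch, assuming Node and TypeScript; the helper names are hypothetical:

```typescript
import { execSync } from "node:child_process";

// Reuses the CommandRun shape from the run-record sketch above.
type CommandRun = { command: string; result: "pass" | "fail" | "not-run"; reason?: string };

const commandLog: CommandRun[] = [];

// Run a command and record the outcome either way, so failed
// attempts are never silently dropped from the record.
function runAndRecord(command: string): boolean {
  try {
    execSync(command, { stdio: "inherit" });
    commandLog.push({ command, result: "pass" });
    return true;
  } catch {
    commandLog.push({ command, result: "fail" });
    return false;
  }
}

// Record checks that could not run at all, with the reason, e.g.
// recordSkipped("pnpm playwright test invite-flow.spec.ts",
//               "preview email service unavailable").
function recordSkipped(command: string, reason: string): void {
  commandLog.push({ command, result: "not-run", reason });
}
```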
Coding-agent example
Useful event types:
| Event | Required fields |
|---|---|
| task_started | objective, scope, exclusions |
| file_read | path, reason |
| command_run | command, result, duration |
| failure_seen | command, error, suspected cause |
| decision_made | option chosen, alternatives rejected |
| verification_ran | check, pass/fail, evidence |
| handoff_written | status, blockers, next action |
Structured records beat giant transcripts.
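In TypeScript, the table above maps naturally onto a discriminated union; this is a sketch whose names mirror the table, not a fixed schema:

```typescript
// One event variant per row of the table; "type" discriminates.
type RunEvent =
  | { type: "task_started"; objective: string; scope: string[]; exclusions: string[] }
  | { type: "file_read"; path: string; reason: string }
  | { type: "command_run"; command: string; result: "pass" | "fail"; durationMs: number }
  | { type: "failure_seen"; command: string; error: string; suspectedCause: string }
  | { type: "decision_made"; chosen: string; rejected: string[] }
  | { type: "verification_ran"; check: string; passed: boolean; evidence: string }
  | { type: "handoff_written"; status: string; blockers: string[]; nextAction: string };

// A run trace is then just an ordered list of events.
type RunTrace = RunEvent[];
```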
Review artifact
A run record should let someone reconstruct the path from request to claim.
```json
{
  "run_id": "coding-2026-05-18-0042",
  "task": "Fix invite acceptance redirect after SSO enforcement",
  "commit_base": "8f2c19a",
  "active_behavior": "valid invite lands in workspace setup",
  "context_sources": [
    "ADR-029-membership-events",
    "tests/e2e/invite-acceptance.spec.ts"
  ],
  "tool_calls": [
    {"name": "pnpm test invite", "result": "failed before patch"},
    {"name": "pnpm test invite", "result": "passed after patch"},
    {"name": "pnpm lint", "result": "passed"}
  ],
  "changed_files": [
    "src/routes/invite/[token].ts",
    "tests/e2e/invite-acceptance.spec.ts"
  ],
  "open_risks": ["first-login analytics event not verified"]
}
```
This is not meant to replace logs. It is the index to the logs. A reviewer can ask: which context did the agent use, which behavior was active, which commands ran, and what risk remains?
Instrumentation should capture decision points, not only final commands. If the agent chooses ADR-029 over ADR-014, that choice should appear in the trace. If it rejects a dashboard change as out of scope, that rejection should appear too. These events explain the shape of the patch.
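For example, the ADR choice and the scope rejection from the running example could each land in the trace as a decision_made event. A sketch reusing the RunEvent union above; the rejection texts are illustrative:

```typescript
const decisions: RunEvent[] = [
  {
    type: "decision_made",
    chosen: "follow ADR-029 membership events for hydration",
    rejected: ["ADR-014 session-refresh approach: superseded for invite flows"],
  },
  {
    type: "decision_made",
    chosen: "leave dashboard unchanged",
    rejected: ["dashboard membership badge update: out of scope for this task"],
  },
];
```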
The harnessed version of the hidden-failure incident would have shown that the agent never ran the cross-browser e2e path. That does not automatically mean the patch is wrong, but it changes the review state from “done” to “missing required evidence.” Good instrumentation makes uncertainty visible early.
Harnessed version
The harnessed run produces a trace that can answer three questions: what did the agent know, what did it do, and why did it believe the task was complete? The answer should not require reading every token of the conversation. A compact run record, linked command outputs, and named decision events are enough.
For example, if the invite fix fails only in Firefox, the trace should show whether Firefox was in the required evidence list, whether the agent skipped it, whether the tool was unavailable, or whether the failure appeared after handoff. Each explanation points to a different harness fix. Missing evidence points to verification. Unavailable tooling points to runway. A skipped required command points to judging. A post-handoff regression points to release process.
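That triage can even be made mechanical. A sketch of the mapping, with hypothetical finding labels:

```typescript
// Hypothetical trace findings mapped to the harness area that owns the fix.
type TraceFinding =
  | "evidence-missing-from-required-list"
  | "tool-unavailable"
  | "required-command-skipped"
  | "regression-after-handoff";

function harnessAreaFor(finding: TraceFinding): string {
  switch (finding) {
    case "evidence-missing-from-required-list": return "verification";
    case "tool-unavailable":                    return "runway";
    case "required-command-skipped":            return "judging";
    case "regression-after-handoff":            return "release process";
  }
}
```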
Instrumentation is therefore not only observability for debugging code. It is observability for debugging the harness itself. Without it, teams argue from memory. With it, they can inspect the run.
At minimum, record the base commit, active behavior, context sources, commands run, changed files, skipped checks, and open risks. That small record is usually enough to decide whether a failure came from implementation, verification, context, or handoff.
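A small completeness check turns that minimum into something testable; a sketch, with field names matching the run-record example above:

```typescript
// Fields a run record must carry before a failure can be triaged.
const requiredFields = [
  "commit_base",
  "active_behavior",
  "context_sources",
  "tool_calls",
  "changed_files",
  "open_risks",
] as const;

// Returns the fields still missing; empty means the record can support diagnosis.
function missingEvidence(record: Record<string, unknown>): string[] {
  return requiredFields.filter((field) => record[field] == null);
}
```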
If the record cannot support that diagnosis, the harness is still operating on trust instead of durable evidence.
Common mistakes
The first mistake is logging only final summaries. Summaries hide failed paths.
The second mistake is omitting failed commands. Failures are the most valuable signal.
The third mistake is saying “tests pass” without command names.
The fourth mistake is recording decisions without rejected alternatives.
Practical exercise
Design a run record for one coding-agent task. Include files inspected, commands run, failures, decisions, verification, and final status.
Then ask whether a reviewer could audit the patch without rerunning everything.
Key takeaways
- Coding-agent observability needs process signal.
- Commands and failures should be recorded explicitly.
- “Tests passed” is not enough.
- Decision artifacts make patches easier to review.
- Run records support debugging and handoff.
Further reading / source notes
- OpenTelemetry Signals for trace and event vocabulary.
- Honeycomb, “Observability Engineering” for debugging complex systems from rich events.