
Instrument the Work

Make coding-agent behavior debuggable with command logs, diffs, run records, and decision artifacts.

Failure pattern

A bad patch reaches review, but the team cannot reconstruct which files the agent inspected, which tests failed, which commands were rerun, or why it chose the final approach.

Without instrumentation, debugging becomes a debate about the final diff. That is too late.

Incident: hidden test failure path

Agent task

The agent is asked:

Fix SSO invite acceptance and update tests.

The final patch changes auth guards and tests.

Available surface

The run includes:

| Surface | Signal |
| --- | --- |
| Files read | auth guard, invite route, membership helper |
| Commands | unit tests, typecheck, e2e |
| Failures | initial unit failure, later e2e timeout |
| Decisions | chose to wait for membership hydration |
| Diff | code and test changes |
| Handoff | final status and missing checks |

Only the final summary is saved.

Bad run

Reviewers see a patch that looks reasonable, but CI fails the e2e suite. The agent summary says:

Tests passed locally.

No one knows which tests passed. The e2e command may not have run. The agent may have seen a timeout and ignored it. The path is invisible.

Why the harness failed

The harness recorded the output but not the run.

| Missing signal | Consequence |
| --- | --- |
| Commands run | Cannot verify local evidence |
| Failed commands | Cannot see ignored failures |
| Files inspected | Cannot tell whether the agent read relevant code |
| Decision artifact | Cannot inspect why the approach was chosen |
| Verification status | "Tests passed" is too vague |

The patch is not debuggable.

Why it happens

Software teams instrument applications, but often do not instrument agent work. Agent work needs process signal: commands, file reads, failures, decisions, verification, and handoff.

You do not need every token. You need enough evidence to reconstruct important decisions.
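One lightweight way to capture that evidence is an append-only event log, one JSON object per line. This is a sketch, not a fixed format; the names `toEventLine` and `recordEvent` are illustrative:

```typescript
import { appendFileSync } from "node:fs";

// One structured event in the run log.
interface RunEvent {
  type: string;              // e.g. "command_run", "failure_seen"
  at: string;                // ISO timestamp
  [field: string]: unknown;  // event-specific fields
}

// Serialize an event as a single JSONL line so the log stays greppable.
function toEventLine(type: string, fields: Record<string, unknown>): string {
  const event: RunEvent = { type, at: new Date().toISOString(), ...fields };
  return JSON.stringify(event);
}

// Append the event to the run log for later reconstruction.
function recordEvent(
  logPath: string,
  type: string,
  fields: Record<string, unknown>,
): void {
  appendFileSync(logPath, toEventLine(type, fields) + "\n");
}
```

Because each line is independent, the log can be filtered with ordinary tools without parsing the whole transcript.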

Harness principle

Instrument the coding run.

flowchart TD
  A["Task contract"] --> B["Files inspected"]
  B --> C["Commands run"]
  C --> D["Failures observed"]
  D --> E["Decision artifact"]
  E --> F["Verification evidence"]
  F --> G["Final patch"]
  B --> H["Run record"]
  C --> H
  E --> H
  F --> H
A coding run record connects task, files, commands, decisions, verification, and handoff.

The run record makes the patch reviewable.

Operating practice

Create a minimal run record:

| Field | Example |
| --- | --- |
| Run ID | CODING-INVITE-403-2026-05-18 |
| Objective | Fix intermittent 403 after invite acceptance |
| Files inspected | invite route, auth callback, membership helper |
| Commands run | unit guard test, API invite test, typecheck |
| Failures | e2e not run: preview email service unavailable |
| Decision | wait for membership hydration before redirect |
| Verification | unit and API passed; e2e missing |
| Final status | patch with blocker |
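The same record can be expressed as a typed structure. This is a hypothetical shape whose field names mirror the table, not a fixed schema:

```typescript
// Minimal run record for one coding-agent task.
interface RunRecord {
  runId: string;
  objective: string;
  filesInspected: string[];
  commandsRun: string[];
  failures: string[];      // including "not run" reasons
  decision: string;
  verification: string;
  finalStatus: string;
}

// The example row from the table, as a record the harness could emit.
const record: RunRecord = {
  runId: "CODING-INVITE-403-2026-05-18",
  objective: "Fix intermittent 403 after invite acceptance",
  filesInspected: ["invite route", "auth callback", "membership helper"],
  commandsRun: ["unit guard test", "API invite test", "typecheck"],
  failures: ["e2e not run: preview email service unavailable"],
  decision: "wait for membership hydration before redirect",
  verification: "unit and API passed; e2e missing",
  finalStatus: "patch with blocker",
};
```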

Harnessed run

Verification:
- pnpm test auth-guard.spec.ts: pass
- pnpm test invite-api.spec.ts: pass
- pnpm playwright invite-flow.spec.ts: not run, preview service unavailable

Status:
- Not complete until e2e or CI equivalent passes.

The reviewer now knows exactly what evidence exists.
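The status line itself can be derived from the verification list instead of asserted. A sketch, with names chosen for illustration:

```typescript
type CheckResult = "pass" | "fail" | "not_run";

interface VerificationEntry {
  command: string;  // e.g. "pnpm test auth-guard.spec.ts"
  result: CheckResult;
  note?: string;    // e.g. "preview service unavailable"
}

// The run is complete only when every required check actually passed;
// a failed or missing check downgrades it to "patch with blocker".
function deriveStatus(checks: VerificationEntry[]): string {
  return checks.every((c) => c.result === "pass")
    ? "complete"
    : "patch with blocker";
}
```

Deriving the status this way makes "Tests passed" impossible to claim without naming the commands that passed.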

Coding-agent example

Useful event types:

| Event | Required fields |
| --- | --- |
| task_started | objective, scope, exclusions |
| file_read | path, reason |
| command_run | command, result, duration |
| failure_seen | command, error, suspected cause |
| decision_made | option chosen, alternatives rejected |
| verification_ran | check, pass/fail, evidence |
| handoff_written | status, blockers, next action |

Structured records beat giant transcripts.
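In a typed language, the event table can become a discriminated union so the compiler enforces the required fields for each event type. This is a sketch under the names in the table:

```typescript
// Each variant carries exactly the fields the table requires.
type AgentEvent =
  | { type: "task_started"; objective: string; scope: string; exclusions: string[] }
  | { type: "file_read"; path: string; reason: string }
  | { type: "command_run"; command: string; result: string; durationMs: number }
  | { type: "failure_seen"; command: string; error: string; suspectedCause: string }
  | { type: "decision_made"; chosen: string; rejected: string[] }
  | { type: "verification_ran"; check: string; passed: boolean; evidence: string }
  | { type: "handoff_written"; status: string; blockers: string[]; nextAction: string };

// Example query: list every command the agent actually ran in a trace.
function commandsRun(trace: AgentEvent[]): string[] {
  return trace.flatMap((e) => (e.type === "command_run" ? [e.command] : []));
}
```

Queries like `commandsRun` are what make a structured record more useful than a transcript: the reviewer asks a question and gets an answer, not a wall of text.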

Review artifact

A run record should let someone reconstruct the path from request to claim.

{
  "run_id": "coding-2026-05-18-0042",
  "task": "Fix invite acceptance redirect after SSO enforcement",
  "commit_base": "8f2c19a",
  "active_behavior": "valid invite lands in workspace setup",
  "context_sources": [
    "ADR-029-membership-events",
    "tests/e2e/invite-acceptance.spec.ts"
  ],
  "tool_calls": [
    {"name": "pnpm test invite", "result": "failed before patch"},
    {"name": "pnpm test invite", "result": "passed after patch"},
    {"name": "pnpm lint", "result": "passed"}
  ],
  "changed_files": [
    "src/routes/invite/[token].ts",
    "tests/e2e/invite-acceptance.spec.ts"
  ],
  "open_risks": ["first-login analytics event not verified"]
}

This is not meant to replace logs. It is the index to the logs. A reviewer can ask: which context did the agent use, which behavior was active, which commands ran, and what risk remains?

Instrumentation should capture decision points, not only final commands. If the agent chooses ADR-029 over ADR-014, that choice should appear in the trace. If it rejects a dashboard change as out of scope, that rejection should appear too. These events explain the shape of the patch.

The harnessed version of the hidden-failure incident would have shown that the agent never ran the cross-browser e2e path. That does not automatically mean the patch is wrong, but it changes the review state from “done” to “missing required evidence.” Good instrumentation makes uncertainty visible early.

Harnessed version

The harnessed run produces a trace that can answer three questions: what did the agent know, what did it do, and why did it believe the task was complete? The answer should not require reading every token of the conversation. A compact run record, linked command outputs, and named decision events are enough.

For example, if the invite fix fails only in Firefox, the trace should show whether Firefox was in the required evidence list, whether the agent skipped it, whether the tool was unavailable, or whether the failure appeared after handoff. Each explanation points to a different harness fix. Missing evidence points to verification. Unavailable tooling points to runway. A skipped required command points to judging. A post-handoff regression points to release process.

Instrumentation is therefore not only observability for debugging code. It is observability for debugging the harness itself. Without it, teams argue from memory. With it, they can inspect the run.

At minimum, record the base commit, active behavior, context sources, commands run, changed files, skipped checks, and open risks. That small record is usually enough to decide whether a failure came from implementation, verification, context, or handoff.
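That minimum can be checked mechanically before the handoff is accepted. A sketch, assuming the field names used in this chapter:

```typescript
// Optional fields so an incomplete record can still be inspected.
interface MinimumRecord {
  commitBase?: string;
  activeBehavior?: string;
  contextSources?: string[];
  commandsRun?: string[];
  changedFiles?: string[];
  skippedChecks?: string[];
  openRisks?: string[];
}

const REQUIRED: (keyof MinimumRecord)[] = [
  "commitBase", "activeBehavior", "contextSources",
  "commandsRun", "changedFiles", "skippedChecks", "openRisks",
];

// Return the fields the record is missing, so the reviewer knows
// which diagnoses the record cannot support.
function missingFields(record: MinimumRecord): string[] {
  return REQUIRED.filter((k) => record[k] === undefined);
}
```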

If the record cannot support that diagnosis, the harness is still operating on trust instead of durable evidence.

Common mistakes

The first mistake is logging only final summaries. Summaries hide failed paths.

The second mistake is omitting failed commands. Failures are the most valuable signal.

The third mistake is saying “tests pass” without command names.

The fourth mistake is recording decisions without rejected alternatives.

Practical exercise

Design a run record for one coding-agent task. Include files inspected, commands run, failures, decisions, verification, and final status.

Then ask whether a reviewer could audit the patch without rerunning everything.

Key takeaways

  • Coding-agent observability needs process signal.
  • Commands and failures should be recorded explicitly.
  • “Tests passed” is not enough.
  • Decision artifacts make patches easier to review.
  • Run records support debugging and handoff.

Further reading / source notes