Define the Work Surface
Turn vague coding-agent requests into bounded code changes that can be implemented and reviewed.
Failure pattern
A developer asks the coding agent to “fix onboarding,” and the agent turns a vague product complaint into a broad code change across auth, UI, email, analytics, and tests.
The failure begins before the first file is edited. The agent is not given a precise behavior, affected surface, exclusion list, verification condition, or stop rule. It tries to be helpful by expanding the task into everything that seems related.
Incident: onboarding setup bug
Agent task
A product engineer writes:
New users get stuck during onboarding. Fix the onboarding bug before tomorrow’s demo.
The SaaS app has a team-setup flow: create workspace, invite teammate, connect billing trial, and land in the dashboard.
Available surface
The agent can read and edit:
| Surface | Examples |
|---|---|
| Frontend route | app/onboarding/*, setup checklist, dashboard empty state |
| API | workspace creation, invite endpoint, trial endpoint |
| Database | workspace, invitation, billing trial tables |
| Email templates | invite email and welcome email |
| Tests | unit tests, Playwright onboarding flow, API integration tests |
| Issue tracker | bug report and demo checklist |
The bug report says only that users “get stuck.” It does not define which step fails.
Bad run
The agent edits:
- onboarding checklist state
- invitation API response
- welcome email link
- dashboard empty-state copy
- trial-start side effect
- one Playwright test
It then reports:
Fixed onboarding end to end and improved the demo flow.
Review finds the original bug was only this: invited users who accepted from email landed on /dashboard before their workspace membership was hydrated. The agent changed unrelated billing behavior and introduced a trial-start regression.
Why the harness failed
The harness let a symptom become a work surface.
| Missing boundary | Consequence |
|---|---|
| User-visible behavior | Agent did not know which onboarding step was broken |
| Allowed files | Agent edited API, email, billing, and dashboard together |
| Excluded work | Demo polish became implementation scope |
| Verification | No single reproduction was defined before edits |
| Stop condition | Billing side effects were changed without approval |
The result was not simply too much code. It was unreviewable code.
Why it happens
Coding agents follow semantic proximity. If onboarding is broken, dashboard state is related. So are invites, emails, billing trials, analytics, and copy. A human engineer narrows by asking “which behavior fails?” A harness should force that narrowing before implementation starts.
The work surface protects both the agent and the reviewer. It tells the agent what to change and tells the reviewer what should not have changed.
Harness principle
A coding work surface is the bounded code behavior the agent may change in one run.
It defines:
- Behavior: the exact user or system behavior to fix.
- Allowed surface: files, modules, commands, and tests likely in scope.
- Excluded surface: related areas that must not change.
- Evidence: reproduction, test, or manual scenario proving the fix.
- Stop rules: conditions requiring human decision.
```mermaid
flowchart LR
  A["Vague issue"] --> B["Reproduction"]
  B --> C["Allowed code surface"]
  C --> D["Excluded surface"]
  D --> E["Verification command"]
  E --> F["Reviewable patch"]
```
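Some teams encode this brief as data so tooling can check it. A minimal sketch, assuming a hypothetical `WorkSurfaceBrief` shape; the field names mirror the list above and are not from any existing tool:

```ts
// Hypothetical shape for a work-surface brief; field names are illustrative.
interface WorkSurfaceBrief {
  behavior: string;          // the exact user or system behavior to fix
  allowedSurface: string[];  // files, modules, commands, and tests likely in scope
  excludedSurface: string[]; // related areas that must not change
  evidence: string[];        // reproduction, test, or manual scenario proving the fix
  stopRules: string[];       // conditions that require a human decision
}
```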
Operating practice
Rewrite the task before coding:
| Field | Harnessed brief |
|---|---|
| Behavior | An invited user accepting from the email link should land in onboarding until workspace membership is loaded |
| Reproduction | Create invite, accept from email link, observe dashboard redirect |
| Allowed surface | Invite acceptance route, membership hydration, onboarding redirect test |
| Excluded surface | Billing trial start, email template copy, dashboard empty state |
| Verification | Playwright invite-acceptance flow plus related unit test |
| Stop rules | Ask before touching billing, auth schema, or migration files |
| Completion evidence | Failing reproduction passes and unrelated onboarding tests still pass |
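The verification row can be pinned down as an executable reproduction. A minimal Playwright sketch, assuming a hypothetical invite fixture and route names; the real URLs will differ:

```ts
import { test, expect } from "@playwright/test";

test("invited user lands in onboarding until membership is hydrated", async ({ page }) => {
  // Assumption: a test fixture has already created a pending invite at this URL.
  await page.goto("/invite/test-token");

  // The bug sent the user to /dashboard before membership was hydrated;
  // the fixed flow should hold them in onboarding instead.
  await expect(page).toHaveURL(/\/onboarding/);
});
```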
Harnessed run
The agent now returns:
Changed:
- invite acceptance route waits for membership hydration
- redirect guard sends incomplete users to /onboarding
- added Playwright coverage for invite acceptance
Not changed:
- billing trial logic
- email template copy
- dashboard empty state
Verification:
- pnpm test onboarding-redirect
- pnpm playwright test onboarding-invite.spec.ts
The patch is smaller, easier to review, and directly tied to the bug.
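To make the fix concrete, here is a minimal sketch of the redirect guard, with `acceptInvite` and `getMembership` as hypothetical stand-ins for whatever the real codebase calls them:

```ts
type Invite = { userId: string; workspaceId: string };
type Membership = { role: string };

// Placeholder helpers; real code validates the token and reads the membership row.
async function acceptInvite(token: string): Promise<Invite | null> {
  return token ? { userId: "u_1", workspaceId: "w_1" } : null;
}
async function getMembership(userId: string, workspaceId: string): Promise<Membership | null> {
  return { role: "member" };
}

// The guard: resolve where an accepted invite should land.
export async function inviteRedirectTarget(token: string): Promise<string> {
  const invite = await acceptInvite(token);
  if (!invite) return "/invite/expired"; // expired-token path stays unchanged

  // The original bug redirected to /dashboard before this read completed.
  const membership = await getMembership(invite.userId, invite.workspaceId);
  return membership ? "/dashboard" : "/onboarding";
}
```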
Coding-agent example
For coding agents, output modes matter:
| Mode | Agent may do | Agent may not do |
|---|---|---|
| Bug fix | Patch one failing behavior | Refactor adjacent systems |
| Investigation | Inspect, reproduce, report cause | Change production code |
| Test addition | Add missing coverage | Change implementation unless asked |
| Refactor | Preserve behavior with evidence | Add new product behavior |
The harness should name the mode.
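Naming the mode can be as simple as a declared field the harness checks before accepting a patch. A sketch, assuming a hypothetical policy table rather than any existing tool:

```ts
type Mode = "bug-fix" | "investigation" | "test-addition" | "refactor";

// Hypothetical policy table mirroring the modes above.
const modePolicy: Record<Mode, { may: string; mayNot: string }> = {
  "bug-fix":       { may: "patch one failing behavior",       mayNot: "refactor adjacent systems" },
  "investigation": { may: "inspect, reproduce, report cause", mayNot: "change production code" },
  "test-addition": { may: "add missing coverage",             mayNot: "change implementation unless asked" },
  "refactor":      { may: "preserve behavior with evidence",  mayNot: "add new product behavior" },
};
```

With the mode declared up front, an investigation run that touches production files can be rejected mechanically instead of debated in review.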
Review artifact
A work-surface brief should be short enough to fit at the top of the task, but precise enough that a reviewer can reject scope drift without debating intent.
| Field | Example |
|---|---|
| User-visible behavior | Invited users land in workspace setup after accepting a valid invite |
| Entry point | /invite/:token acceptance flow |
| In scope | Token validation, membership creation, redirect target, acceptance test |
| Out of scope | Email copy, billing trials, dashboard redesign, onboarding checklist logic |
| Constraints | No production data migration, no auth provider change, no copy rewrite |
| Evidence | Failing test reproduced, patch applied, acceptance path passes, no regression in expired-token path |
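Using the hypothetical `WorkSurfaceBrief` shape sketched earlier, the same artifact can be written as data; paths and names here are illustrative:

```ts
const inviteBugBrief: WorkSurfaceBrief = {
  behavior: "Invited users land in workspace setup after accepting a valid invite",
  allowedSurface: ["invite token validation", "membership creation", "redirect target", "acceptance test"],
  excludedSurface: ["email copy", "billing trials", "dashboard redesign", "onboarding checklist logic"],
  evidence: ["failing test reproduced", "acceptance path passes", "no regression in expired-token path"],
  stopRules: ["no production data migration", "no auth provider change", "no copy rewrite"],
};
```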
The brief also needs a refusal rule. If the agent discovers that the real issue is an upstream auth callback bug, it should stop and report the new work surface instead of silently widening the task. That one rule prevents a large class of agent failures: the agent tries to be helpful, finds a larger problem, and returns a much bigger patch than the team can safely review.
For coding agents, the work surface is not only a prompt document. It is enforced by route-specific tests, allowed commands, branch policy, and review gates. A good brief therefore pairs intent with a narrow verification path:
Required evidence:
- show failing invite acceptance test before patch
- show passing invite acceptance test after patch
- run existing auth redirect regression tests
- list files changed and explain why each file belongs to invite acceptance
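One way to make that evidence mechanical is a small gate script that runs the named checks and prints the diff for review. A sketch, assuming the pnpm scripts named in the brief exist; the regression-suite name is an assumption:

```ts
// Hypothetical verification gate; command names are taken from the brief above.
import { execSync } from "node:child_process";

const requiredChecks = [
  "pnpm test onboarding-redirect",
  "pnpm playwright test onboarding-invite.spec.ts",
  "pnpm test auth-redirect-regressions", // assumed name for the existing regression suite
];

for (const cmd of requiredChecks) {
  execSync(cmd, { stdio: "inherit" }); // a non-zero exit throws, failing the gate
}

// List changed files so the reviewer can match each one to the allowed surface.
execSync("git diff --name-only main", { stdio: "inherit" });
```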
This is the difference between asking an agent to “fix onboarding” and giving it a bounded engineering assignment. The model may still reason broadly, but the harness makes completion narrow.
Common mistakes
The first mistake is accepting product nouns as scope. “Onboarding” is not a task. “Invite acceptance redirects too early” is.
The second mistake is letting the agent improve nearby code. Reviewers need a clean reason for every changed file.
The third mistake is defining completion as “tests pass” without naming the failing scenario. Generic green tests may miss the bug.
The fourth mistake is failing to stop at dangerous boundaries. Billing, auth, migrations, and permissions usually deserve explicit approval.
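The fourth mistake is also the easiest to automate away. A sketch of a path-based stop rule, with the protected patterns as assumptions about a typical repo layout:

```ts
// Hypothetical stop-rule check: flag changed files that cross dangerous boundaries.
const protectedPaths = [/(^|\/)billing\//, /(^|\/)auth\//, /(^|\/)migrations\//, /(^|\/)permissions\//];

export function stopRuleViolations(changedFiles: string[]): string[] {
  return changedFiles.filter((file) =>
    protectedPaths.some((pattern) => pattern.test(file))
  );
}

// If this returns anything, the agent pauses and asks for approval
// instead of shipping the patch.
```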
Practical exercise
Take one vague issue from a repo and write a work-surface brief with behavior, reproduction, allowed files, excluded files, verification, and stop rules.
Then ask whether a reviewer could reject an unrelated file change from the brief alone. If not, the surface is still too broad.
Key takeaways
- Coding agents need bounded work before code changes.
- A bug report is not automatically a work surface.
- Exclusions are as important as allowed files.
- Verification should be defined before implementation.
- Smaller patches are not just cleaner; they are more auditable.
Further reading / source notes
- OpenAI, “Harness engineering: leveraging Codex in an agent-first world” for the shift toward specifying intent and designing feedback loops.
- Anthropic, “Effective harnesses for long-running agents” for setup, task tracking, and verification practices.