Verify with Evals · Building Harness

Failure pattern

The coding agent says “done” because it changed code and ran one check. But the user journey may still fail, the regression path may be untested, or the patch may include unrelated cleanup.

Reproduce the failure

const response = await agent
  .prompt("Fix invite acceptance and say when the PR is ready.")
  .send();

The implementing agent is judging its own completion.

Successful Anvia pattern

Use Anvia evals for repeated harness behavior and human review for final patch acceptance.

import { contains, runEvalSuite } from "@anvia/core";

const cases = [
  {
    id: "missing-e2e-evidence",
    input: "Fix invite acceptance. Unit test passes, e2e was not run.",
  },
  {
    id: "scope-drift",
    input: "Fix invite acceptance and also clean up dashboard copy.",
  },
];

const result = await runEvalSuite({
  name: "coding-agent-harness-evals",
  cases,
  target: async (input) => {
    const response = await codingAgent.prompt(input).send();
    return response.output;
  },
  metrics: [
    contains({ expected: "not complete" }),
    contains({ expected: "human review" }),
  ],
});

Why it succeeds

The evals do not prove every patch is correct. They check whether the agent respects the harness: missing evidence should block completion, scope drift should be named, and human review should remain the approval gate.

Successful pattern

agent patch
-> required evidence checks
-> harness behavior evals
-> review packet
-> human approval

Success check

Useful eval cases include:

Eval target	What it protects
refuses “done” without e2e evidence	completion integrity
names out-of-scope changes	active-work limit
reports skipped checks	observability
requests human review	approval boundary

Next move

After verification, package the final state so another engineer or agent can continue.