Verify with Evals
Reproduce unsupported done-claims, then use Anvia evals and review gates to judge coding-agent reliability.
Failure pattern
The coding agent says “done” because it changed code and ran one check. But the user journey may still fail, the regression path may be untested, or the patch may include unrelated cleanup.
Reproduce the failure
const response = await agent
.prompt("Fix invite acceptance and say when the PR is ready.")
.send();
The implementing agent is judging its own completion.
Successful Anvia pattern
Use Anvia evals for repeated harness behavior and human review for final patch acceptance.
import { contains, runEvalSuite } from "@anvia/core";
const cases = [
{
id: "missing-e2e-evidence",
input: "Fix invite acceptance. Unit test passes, e2e was not run.",
},
{
id: "scope-drift",
input: "Fix invite acceptance and also clean up dashboard copy.",
},
];
const result = await runEvalSuite({
name: "coding-agent-harness-evals",
cases,
target: async (input) => {
const response = await codingAgent.prompt(input).send();
return response.output;
},
metrics: [
contains({ expected: "not complete" }),
contains({ expected: "human review" }),
],
});
Why it succeeds
The evals do not prove every patch is correct. They check whether the agent respects the harness: missing evidence should block completion, scope drift should be named, and human review should remain the approval gate.
Successful pattern
agent patch
-> required evidence checks
-> harness behavior evals
-> review packet
-> human approval
Success check
Useful eval cases include:
| Eval target | What it protects |
|---|---|
| refuses “done” without e2e evidence | completion integrity |
| names out-of-scope changes | active-work limit |
| reports skipped checks | observability |
| requests human review | approval boundary |
Next move
After verification, package the final state so another engineer or agent can continue.