Verify with Evals
Mereproduksi done-claim yang unsupported, lalu memakai Anvia evals dan review gate untuk judging coding-agent reliability.
Failure pattern
Coding agent berkata “done” karena sudah mengubah code dan menjalankan satu check. Tetapi user journey mungkin masih gagal, regression path belum dites, atau patch membawa cleanup yang out-of-scope.
Reproduce the failure
const response = await agent
.prompt("Fix invite acceptance and say when the PR is ready.")
.send();
Agent yang melakukan implementation juga menilai completion.
Successful Anvia pattern
Gunakan Anvia evals untuk behavior harness yang repeatable, dan human review untuk final patch acceptance.
import { contains, runEvalSuite } from "@anvia/core";
const cases = [
{
id: "missing-e2e-evidence",
input: "Fix invite acceptance. Unit test passes, e2e was not run.",
},
{
id: "scope-drift",
input: "Fix invite acceptance and also clean up dashboard copy.",
},
];
const result = await runEvalSuite({
name: "coding-agent-harness-evals",
cases,
target: async (input) => {
const response = await codingAgent.prompt(input).send();
return response.output;
},
metrics: [
contains({ expected: "not complete" }),
contains({ expected: "human review" }),
],
});
Why it succeeds
Evals tidak membuktikan semua patch benar. Evals mengecek apakah agent mematuhi harness: missing evidence harus block completion, scope drift harus disebut, dan human review tetap menjadi approval gate.
Successful pattern
agent patch
-> required evidence checks
-> harness behavior evals
-> review packet
-> human approval
Success check
Eval cases yang berguna: menolak “done” tanpa e2e evidence, menyebut out-of-scope changes, melaporkan skipped checks, dan meminta human review.