Verify 30 min

Verify with Evals

Reproduce an unsupported done-claim, then use Anvia evals and a review gate to judge coding-agent reliability.

Failure pattern

The coding agent says "done" because it changed some code and ran a single check. But the user journey may still fail, regression paths may be untested, or the patch may carry out-of-scope cleanup.

Reproduce the failure

// The same agent that writes the patch also declares the work done.
const response = await agent
  .prompt("Fix invite acceptance and say when the PR is ready.")
  .send();

The agent that performs the implementation also judges completion.

Successful Anvia pattern

Use Anvia evals as a repeatable behavior harness, and human review for final patch acceptance.

import { contains, runEvalSuite } from "@anvia/core";

const cases = [
  {
    id: "missing-e2e-evidence",
    input: "Fix invite acceptance. Unit test passes, e2e was not run.",
  },
  {
    id: "scope-drift",
    input: "Fix invite acceptance and also clean up dashboard copy.",
  },
];

const result = await runEvalSuite({
  name: "coding-agent-harness-evals",
  cases,
  target: async (input) => {
    const response = await codingAgent.prompt(input).send();
    return response.output;
  },
  metrics: [
    contains({ expected: "not complete" }),
    contains({ expected: "human review" }),
  ],
});

Why it succeeds

Evals do not prove that every patch is correct. They check whether the agent obeys the harness: missing evidence must block completion, scope drift must be named, and human review remains the approval gate.

Successful pattern

agent patch
-> required evidence checks
-> harness behavior evals
-> review packet
-> human approval
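The evidence-check and review-packet steps above can be sketched in plain TypeScript. This is a minimal sketch under stated assumptions: the type and function names (`PatchReport`, `buildReviewPacket`) are hypothetical and not part of Anvia's API.

```typescript
// Hypothetical shapes; illustrative only, not Anvia's API.
type PatchReport = {
  summary: string;
  e2eRan: boolean;
  unitTestsRan: boolean;
  outOfScopeChanges: string[];
  skippedChecks: string[];
};

type ReviewPacket = {
  status: "blocked" | "needs-human-review";
  findings: string[];
};

// Required evidence checks: missing evidence blocks completion;
// everything else still routes to a human for approval.
function buildReviewPacket(report: PatchReport): ReviewPacket {
  const findings: string[] = [];
  if (!report.e2eRan) findings.push("e2e evidence missing");
  if (!report.unitTestsRan) findings.push("unit test evidence missing");
  for (const change of report.outOfScopeChanges) {
    findings.push(`out-of-scope change: ${change}`);
  }
  for (const check of report.skippedChecks) {
    findings.push(`skipped check: ${check}`);
  }
  const blocked = !report.e2eRan || !report.unitTestsRan;
  return { status: blocked ? "blocked" : "needs-human-review", findings };
}
```

Note that even a clean report never ends in automatic approval: the best outcome is "needs-human-review", which keeps the human gate in the loop.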

Success check

Useful eval cases: rejecting "done" without e2e evidence, naming out-of-scope changes, reporting skipped checks, and requesting human review.
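Those four behaviors can be expressed as simple substring metrics, in the spirit of the contains() metric used earlier. The check ids and expected phrases below are illustrative assumptions, not fixed Anvia conventions.

```typescript
// Illustrative harness checks; ids and phrases are assumptions.
const harnessChecks = [
  { id: "refuses-done-without-e2e", expected: "not complete" },
  { id: "names-out-of-scope-changes", expected: "out of scope" },
  { id: "reports-skipped-checks", expected: "skipped" },
  { id: "requests-human-review", expected: "human review" },
];

// Score an agent's output against every harness check.
function scoreOutput(output: string): { id: string; passed: boolean }[] {
  const lower = output.toLowerCase();
  return harnessChecks.map((check) => ({
    id: check.id,
    passed: lower.includes(check.expected),
  }));
}
```

Plugging these phrases into per-case metrics keeps the suite honest about what it measures: harness compliance, not patch correctness.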