
Close the Feedback Loop

Use repeated quant-agent failures to make the harness stronger over time.

Failure pattern

The agent repeatedly overstates signal quality, and each case is corrected manually without improving the harness.

The team says, “It forgot transaction costs again,” or “It ignored regime sensitivity again.” The memo is patched, the meeting moves on, and the same failure returns in the next research thread. The output was fixed. The system was not.

Incident: overstated revision signal

Agent task

The Quant Analyst AI Agent is asked across several weeks:

Evaluate whether revision momentum remains a useful signal in semiconductors.

The same pattern appears in multiple memos.

Available surface

The agent can use:

| Surface | Contents |
| --- | --- |
| Factor backtests | Gross and net performance variants |
| Cost model | Slippage, commissions, borrow assumptions |
| Regime dashboard | Volatility, rates, crowding, sector drawdowns |
| Review comments | Human notes on prior memos |
| Eval cases | Previously corrected research examples |

The eval set is thin. Review comments are not converted into checks.

Bad run

The agent writes:

Revision momentum remains robust.
Five-year Sharpe is 1.4.
The signal has positive performance across most subperiods.

Review catches the same issues as before:

  • The reported Sharpe is gross of transaction costs.
  • High-turnover weeks drive much of the result.
  • Performance collapses during two high-volatility regimes.
  • Prior review comments asked for regime segmentation, but the agent did not include it.
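A back-of-envelope calculation shows why the first issue matters: a constant cost drag lowers the mean return while leaving volatility unchanged, so the Sharpe falls one-for-one with the drag. The turnover and cost numbers below are illustrative, not the memo's actual figures:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical weekly gross returns for the revision-momentum strategy
# (~5 years of weekly data; the distribution is made up).
weekly_gross = rng.normal(loc=0.004, scale=0.015, size=260)

# Assumed turnover and round-trip cost, not the memo's actual figures.
turnover = 1.5                       # 150% of book traded per week
cost_bps = 10                        # slippage + commissions, round trip
weekly_cost = turnover * cost_bps / 10_000

weekly_net = weekly_gross - weekly_cost

def annualized_sharpe(r, periods_per_year=52):
    return r.mean() / r.std(ddof=1) * periods_per_year ** 0.5

print(f"gross Sharpe: {annualized_sharpe(weekly_gross):.2f}")
print(f"net Sharpe:   {annualized_sharpe(weekly_net):.2f}")
```

At high turnover, the gap between the two numbers is exactly the kind of evidence a gate should demand before "robust" is allowed.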

Why the harness failed

The team corrected outputs but did not close the loop.

| Repeated failure | Likely harness layer |
| --- | --- |
| Gross metrics presented as robust | Evidence gate missing net-cost requirement |
| Regime sensitivity skipped | Context/example set weak |
| Prior review comments ignored | Feedback not persisted into harness |
| Same mistake returns | Regression eval missing |
| Memo language too confident | Output rubric weak |

The harness did not learn from review.

Why it happens

Agent failures become expensive when they repeat. A one-off mistake may be model noise or ambiguous input. A repeated mistake is usually a system smell. Something in the work surface, context, interface, gate, or examples makes the failure easy.

In quant work, repeated failures often cluster around assumptions: costs, data lineage, regime sensitivity, capacity, borrow, survivorship bias, and approval language. These are not just “things to remind the model.” They are harness checks.

Harness principle

Every meaningful failure should produce a small harness improvement and a comparable rerun.

flowchart LR
  A["Bad research output"] --> B["Attribute failure"]
  B --> C["Small harness fix"]
  C --> D["Comparable eval case"]
  D --> E{"Improved?"}
  E -->|"Yes"| F["Keep regression"]
  E -->|"No"| B

A closed feedback loop turns review comments into harness changes and regression cases.

The goal is not to add a giant prompt after every failure. The goal is to strengthen the layer that failed.
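The loop can be sketched as a toy harness in which a "fix" is one new check and a regression case bundles an input with the checks it must pass. All names and the dict-based memo representation are illustrative:

```python
# Toy version of the loop. A "harness fix" is one new check; a
# regression case bundles an input with the checks it must pass.

def run_case(memo: dict, checks: dict) -> list:
    """Return the names of checks the memo fails."""
    return [name for name, check in checks.items() if not check(memo)]

checks = {}        # active harness checks
regressions = []   # permanent regression suite

# 1. Bad research output observed.
bad = {"claims_robust": True, "reports_net": False}

# 2. Attribute the failure, then make one small fix:
#    robustness claims now require net-of-cost metrics.
checks["net_before_robust"] = lambda m: m["reports_net"] or not m["claims_robust"]

# 3. Rerun a comparable case; if the fix holds, keep it as a regression.
rerun = {"claims_robust": True, "reports_net": True}
if not run_case(rerun, checks):
    regressions.append({"input": "revision momentum, semis", "checks": dict(checks)})

print(run_case(bad, checks))   # the old output now fails the gate
```

The point of the sketch is the shape, not the implementation: one failure, one small fix, one comparable case kept forever.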

Operating practice

Use a failure log:

| Failure | Layer | Harness fix | Evidence |
| --- | --- | --- | --- |
| Gross Sharpe presented as robust | Completion gate | Require net-of-cost metrics before robustness claim | Next memo reports gross and net separately |
| Regime sensitivity skipped | Evidence standard | Add volatility/regime split table to memo template | Comparable case includes regime table |
| Prior comment ignored | Progress/context | Add review comments to active research state | Agent cites prior review item |
| Confident language with missing caveat | Output rubric | Add “advisory, not approved” and caveat language rule | Memo downgraded claim |
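One way to keep the log structured rather than scattered across review threads is a small record type. The field names are assumptions, and the `owner` field is added here to match the ownership point made later in this section:

```python
from dataclasses import dataclass

@dataclass
class FailureLogEntry:
    """One row of the failure log; field names are illustrative."""
    failure: str       # what went wrong, in one line
    layer: str         # harness layer the fix targets
    harness_fix: str   # the smallest change that addresses it
    evidence: str      # what a fixed rerun should show
    owner: str = ""    # who lands the fix (ownership matters; see below)

entry = FailureLogEntry(
    failure="Gross Sharpe presented as robust",
    layer="Completion gate",
    harness_fix="Require net-of-cost metrics before robustness claim",
    evidence="Next memo reports gross and net separately",
)
```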

Then create a regression case:

Case: high-turnover revision strategy
Expected:
- Report gross and net metrics.
- Include high-volatility regime split.
- Flag turnover and cost sensitivity.
- Avoid "robust" unless net and regime checks pass.

The next run should be tested against this case.

The regression case should include both the input and the expected failure-sensitive behavior. It is not enough to store “revision momentum memo.” The case should specify that turnover is high, costs materially reduce returns, and volatility regimes split the result. The expected output should require the agent to downgrade robustness language unless those checks pass.

This makes review comments executable. A sentence from a reviewer becomes a future test of the harness.
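The expected behaviors can be encoded as executable checks over a memo summary. Representing the memo as a dict of flags is a simplification; a real harness might parse the memo text or require structured output from the agent:

```python
# The regression case above, encoded as checks over a memo summary.
# The dict-of-flags memo representation is an assumption.

def check_regression_case(memo: dict) -> list:
    failures = []
    if not (memo.get("reports_gross") and memo.get("reports_net")):
        failures.append("missing gross/net metrics")
    if not memo.get("has_regime_split"):
        failures.append("missing high-volatility regime split")
    if not memo.get("flags_turnover_cost"):
        failures.append("missing turnover/cost sensitivity flag")
    robust_ok = memo.get("reports_net") and memo.get("has_regime_split")
    if memo.get("says_robust") and not robust_ok:
        failures.append('"robust" used without net and regime checks')
    return failures

# The original bad memo fails every check.
old_memo = {"reports_gross": True, "says_robust": True}
print(len(check_regression_case(old_memo)))   # → 4
```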

Product-agent example

A failure attribution rubric for quant agents:

| Question | Likely layer |
| --- | --- |
| Was the research question underspecified? | Work surface |
| Did the agent use stale methodology? | Context routing |
| Did a tool hide assumptions? | Interface |
| Did data freshness fail? | Runway |
| Did too many strategy knobs change? | Active work |
| Did a continuation lose prior findings? | Progress |
| Did weak evidence count as ready? | Judging |
| Could nobody reconstruct the run? | Instrumentation |
| Did the session end with unclear state? | Handoff |

Attribution turns “the agent is overconfident” into “the memo rubric allows robustness claims without net-cost and regime evidence.”

The attribution does not have to be perfect on the first pass. Choose the most actionable layer. If adding a net-cost gate prevents three repeated failures, the harness improved even if the deeper cause also includes prompt wording. Start with fixes that make the wrong output harder to produce.

Common mistakes

The first mistake is adding only a negative instruction: “Do not ignore costs.” Better: require net metrics and make missing costs fail the gate.
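The difference is concrete in code: the instruction is a string the model may or may not honor, while a gate fails the memo before it ships. The memo fields here are assumptions about the harness's memo format:

```python
# A negative instruction is a request the model can still ignore:
PROMPT_PATCH = "Do not ignore transaction costs."

# A gate makes the wrong output fail before it ships.
class GateError(Exception):
    pass

def completion_gate(memo: dict) -> dict:
    if "net_sharpe" not in memo:
        raise GateError("memo blocked: no net-of-cost metrics")
    return memo
```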

The second mistake is changing everything at once. If model, prompt, tools, and evals all change, the team cannot learn what helped.

The third mistake is leaving human review comments in chat. Corrections should become examples, gates, or eval cases.

The fourth mistake is treating one successful rerun as permanent success. Keep the case in regression.

The fifth mistake is measuring only answer correctness. In advisory workflows, the agent can be directionally correct and still unsafe if it uses approval language, hides caveats, or omits evidence. Feedback-loop checks should include tone and status boundaries as well as factual content.
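A minimal tone-and-status check might look like the following. The phrase list and required caveat are assumptions; a real harness would source them from compliance review:

```python
import re

# Illustrative advisory-language check; phrase lists are assumptions.
APPROVAL_PHRASES = [r"\bready to trade\b", r"\bapproved for trading\b",
                    r"\brecommend allocating\b"]
REQUIRED_CAVEAT = "advisory, not approved"

def tone_check(memo_text: str) -> list:
    """Return tone/status issues found in a memo."""
    issues = [p for p in APPROVAL_PHRASES if re.search(p, memo_text, re.I)]
    if REQUIRED_CAVEAT not in memo_text.lower():
        issues.append("missing advisory caveat")
    return issues

print(tone_check("Revision momentum looks robust and ready to trade."))
```

A memo can pass every factual check and still fail this one, which is exactly the point.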

The sixth mistake is failing to assign ownership. A feedback item should have an owner and a target harness layer; otherwise it becomes another review comment that everyone agrees with and nobody implements.

Practical exercise

Take three corrected quant-agent outputs. For each, identify the failure layer, write one harness change, and create one comparable eval case.

Start with the smallest repeated failure. A tight feedback loop should be easy to run after every review.

Key takeaways

  • Repeated quant-agent failures are harness data.
  • Review comments should become checks, examples, or eval cases.
  • Robustness claims need explicit evidence gates.
  • Small changes are easier to evaluate than broad rewrites.
  • Regression cases protect against old failures returning.

Further reading / source notes