Close the Feedback Loop
Use repeated quant-agent failures to make the harness stronger over time.
Failure pattern
The agent repeatedly overstates signal quality, and each case is corrected manually without improving the harness.
The team says, “It forgot transaction costs again,” or “It ignored regime sensitivity again.” The memo is patched, the meeting moves on, and the same failure returns in the next research thread. The output was fixed. The system was not.
Incident: overstated revision signal
Agent task
Across several weeks, the Quant Analyst AI Agent is asked the same question:
Evaluate whether revision momentum remains a useful signal in semiconductors.
The same pattern appears in multiple memos.
Available surface
The agent can use:
| Surface | Contents |
|---|---|
| Factor backtests | Gross and net performance variants |
| Cost model | Slippage, commissions, borrow assumptions |
| Regime dashboard | Volatility, rates, crowding, sector drawdowns |
| Review comments | Human notes on prior memos |
| Eval cases | Previously corrected research examples |
The eval set is thin. Review comments are not converted into checks.
Bad run
The agent writes:
Revision momentum remains robust.
Five-year Sharpe is 1.4.
The signal has positive performance across most subperiods.
Review catches the same issues as before:
- The reported Sharpe is gross of transaction costs.
- High-turnover weeks drive much of the result.
- Performance collapses during two high-volatility regimes.
- Prior review comments asked for regime segmentation, but the agent did not include it.
Why the harness failed
The team corrected outputs but did not close the loop.
| Repeated failure | Likely harness layer |
|---|---|
| Gross metrics presented as robust | Evidence gate missing net-cost requirement |
| Regime sensitivity skipped | Context/example set weak |
| Prior review comments ignored | Feedback not persisted into harness |
| Same mistake returns | Regression eval missing |
| Memo language too confident | Output rubric weak |
The harness did not learn from review.
Why it happens
Agent failures become expensive when they repeat. A one-off mistake may be model noise or ambiguous input. A repeated mistake is usually a system smell. Something in the work surface, context, interface, gate, or examples makes the failure easy.
In quant work, repeated failures often cluster around assumptions: costs, data lineage, regime sensitivity, capacity, borrow, survivorship bias, and approval language. These are not just “things to remind the model.” They are harness checks.
Harness principle
Every meaningful failure should produce a small harness improvement and a comparable rerun.
```mermaid
flowchart LR
    A["Bad research output"] --> B["Attribute failure"]
    B --> C["Small harness fix"]
    C --> D["Comparable eval case"]
    D --> E{"Improved?"}
    E -->|"Yes"| F["Keep regression"]
    E -->|"No"| B
```

The goal is not to add a giant prompt after every failure. The goal is to strengthen the layer that failed.
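The loop can be made concrete as a small data structure rather than a process diagram. The sketch below is illustrative, not a prescribed implementation: the `Harness`, `RegressionCase`, and gate names are assumptions, and the memo is reduced to a dict of metrics.

```python
from dataclasses import dataclass, field

@dataclass
class RegressionCase:
    name: str
    memo_input: dict        # the inputs that triggered the original failure
    failed_gates: list      # the checks the next run must pass

@dataclass
class Harness:
    gates: dict = field(default_factory=dict)       # layer name -> check function
    regressions: list = field(default_factory=list)

    def run_gates(self, memo: dict) -> list:
        """Return the names of gates the memo fails."""
        return [name for name, check in self.gates.items() if not check(memo)]

# The bad run: gross Sharpe only, no regime split.
bad_memo = {"sharpe_gross": 1.4, "sharpe_net": None, "regime_split": False}

harness = Harness()
# Small harness fix: an evidence gate requiring net-of-cost metrics.
harness.gates["net_cost_required"] = lambda m: m.get("sharpe_net") is not None

failed = harness.run_gates(bad_memo)
assert failed == ["net_cost_required"]   # the old failure now trips a gate

# Keep a comparable case so the failure cannot silently return.
harness.regressions.append(
    RegressionCase("high-turnover revision", bad_memo, failed)
)
```

The fix is one gate, not a rewrite: if a rerun on the stored case passes, the case stays in regression permanently.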
Operating practice
Use a failure log:
| Failure | Layer | Harness fix | Evidence |
|---|---|---|---|
| Gross Sharpe presented as robust | Completion gate | Require net-of-cost metrics before robustness claim | Next memo reports gross and net separately |
| Regime sensitivity skipped | Evidence standard | Add volatility/regime split table to memo template | Comparable case includes regime table |
| Prior comment ignored | Progress/context | Add review comments to active research state | Agent cites prior review item |
| Confident language with missing caveat | Output rubric | Add “advisory, not approved” and caveat language rule | Memo downgraded claim |
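The log rows above can be stored as structured records so that incomplete entries are rejected at write time. A minimal sketch, assuming a `FailureLogEntry` record with the same four columns (the class name and fields are illustrative):

```python
from dataclasses import dataclass

@dataclass
class FailureLogEntry:
    failure: str    # what went wrong in the output
    layer: str      # harness layer most likely responsible
    fix: str        # the small harness change made
    evidence: str   # how we will know the fix worked

log = [
    FailureLogEntry(
        failure="Gross Sharpe presented as robust",
        layer="Completion gate",
        fix="Require net-of-cost metrics before robustness claim",
        evidence="Next memo reports gross and net separately",
    ),
]

# An entry without a fix and evidence is just a review comment; reject it.
assert all(entry.fix and entry.evidence for entry in log)
```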
Then create a regression case:
Case: high-turnover revision strategy
Expected:
- Report gross and net metrics.
- Include high-volatility regime split.
- Flag turnover and cost sensitivity.
- Avoid "robust" unless net and regime checks pass.
The next run should be tested against this case.
The regression case should include both the input and the expected failure-sensitive behavior. It is not enough to store “revision momentum memo.” The case should specify that turnover is high, costs materially reduce returns, and volatility regimes split the result. The expected output should require the agent to downgrade robustness language unless those checks pass.
This makes review comments executable. A sentence from a reviewer becomes a future test of the harness.
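One way to make a reviewer sentence executable is a check function over the memo text. This is a deliberately crude sketch: the keyword rules and function name are illustrative assumptions, and a production check would inspect structured memo fields rather than raw strings.

```python
import re

def check_revision_memo(memo_text: str) -> list:
    """Regression checks derived from reviewer comments (illustrative rules)."""
    problems = []
    if "net" not in memo_text.lower():
        problems.append("missing net-of-cost metrics")
    if "regime" not in memo_text.lower():
        problems.append("missing volatility/regime split")
    # "robust" is only allowed once net and regime evidence is present.
    if re.search(r"\brobust\b", memo_text, re.IGNORECASE) and problems:
        problems.append("robustness claim without supporting checks")
    return problems

bad = "Revision momentum remains robust. Five-year Sharpe is 1.4."
good = ("Net of costs, Sharpe falls below 1.0; the high-volatility regime "
        "split shows losses in two subperiods. Advisory, not approved.")

assert check_revision_memo(good) == []
assert "robustness claim without supporting checks" in check_revision_memo(bad)
```

The check fails the original bad memo and passes a memo that supplies the evidence, which is exactly the regression behavior the case description specifies.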
Product-agent example
A failure attribution rubric for quant agents:
| Question | Likely layer |
|---|---|
| Was the research question underspecified? | Work surface |
| Did the agent use stale methodology? | Context routing |
| Did a tool hide assumptions? | Interface |
| Did data freshness fail? | Runway |
| Did too many strategy knobs change? | Active work |
| Did a continuation lose prior findings? | Progress |
| Did weak evidence count as ready? | Judging |
| Could nobody reconstruct the run? | Instrumentation |
| Did the session end with unclear state? | Handoff |
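The rubric above can live as a lookup used during triage, so attribution is a checklist walk rather than a debate. A minimal sketch; the dictionary and `triage` helper are illustrative:

```python
ATTRIBUTION_RUBRIC = {
    "Was the research question underspecified?": "Work surface",
    "Did the agent use stale methodology?": "Context routing",
    "Did a tool hide assumptions?": "Interface",
    "Did data freshness fail?": "Runway",
    "Did too many strategy knobs change?": "Active work",
    "Did a continuation lose prior findings?": "Progress",
    "Did weak evidence count as ready?": "Judging",
    "Could nobody reconstruct the run?": "Instrumentation",
    "Did the session end with unclear state?": "Handoff",
}

def triage(answers: dict) -> list:
    """Return the layers whose rubric question was answered yes."""
    return [ATTRIBUTION_RUBRIC[q] for q, yes in answers.items() if yes]

assert triage({"Did weak evidence count as ready?": True,
               "Did a tool hide assumptions?": False}) == ["Judging"]
```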
Attribution turns “the agent is overconfident” into “the memo rubric allows robustness claims without net-cost and regime evidence.”
The attribution does not have to be perfect on the first pass. Choose the most actionable layer. If adding a net-cost gate prevents three repeated failures, the harness improved even if the deeper cause also includes prompt wording. Start with fixes that make the wrong output harder to produce.
Common mistakes
The first mistake is adding only a negative instruction: “Do not ignore costs.” Better: require net metrics and make missing costs fail the gate.
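The difference between the negative instruction and the gate can be shown directly. The sketch below is an assumption-laden illustration (field names like `sharpe_net` are hypothetical): the gate returns a reason, and a missing cost figure fails it regardless of what the prompt says.

```python
def robustness_gate(metrics: dict) -> tuple:
    """Positive gate: a robustness claim needs net metrics, not a reminder.

    Returns (allowed, reason). Field names are illustrative.
    """
    if metrics.get("sharpe_net") is None:
        return (False, "no net-of-cost Sharpe reported")
    if not metrics.get("regime_split_done", False):
        return (False, "no regime split reported")
    return (True, "net and regime evidence present")

# Missing costs now fails the gate instead of relying on "do not ignore costs".
assert robustness_gate({"sharpe_gross": 1.4}) == (False, "no net-of-cost Sharpe reported")
assert robustness_gate({"sharpe_net": 0.6, "regime_split_done": True})[0] is True
```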
The second mistake is changing everything at once. If model, prompt, tools, and evals all change, the team cannot learn what helped.
The third mistake is leaving human review comments in chat. Corrections should become examples, gates, or eval cases.
The fourth mistake is treating one successful rerun as permanent success. Keep the case in regression.
The fifth mistake is measuring only answer correctness. In advisory workflows, the agent can be directionally correct and still unsafe if it uses approval language, hides caveats, or omits evidence. Feedback-loop checks should include tone and status boundaries as well as factual content.
The sixth mistake is failing to assign ownership. A feedback item should have an owner and a target harness layer; otherwise it becomes another review comment that everyone agrees with and nobody implements.
Practical exercise
Take three corrected quant-agent outputs. For each, identify the failure layer, write one harness change, and create one comparable eval case.
Start with the smallest repeated failure. A tight feedback loop should be easy to run after every review.
Key takeaways
- Repeated quant-agent failures are harness data.
- Review comments should become checks, examples, or eval cases.
- Robustness claims need explicit evidence gates.
- Small changes are easier to evaluate than broad rewrites.
- Regression cases protect against old failures returning.
Further reading / source notes
- NIST AI Risk Management Framework for monitoring, measurement, and risk-management practices.
- Google SRE, “Postmortem Culture” for learning from repeated failures without blame.
- Anthropic, “Effective harnesses for long-running agents” for harness improvements based on observed failure modes.