Instrument the Work
Make quant-agent behavior debuggable with traces, data lineage, run records, and decision artifacts.
Failure pattern
A bad advisory idea reaches review, but the team cannot reconstruct which data snapshot, factor version, filters, or backtest parameters produced it.
Without instrumentation, review turns into archaeology. The memo says a signal works. The reviewer asks: which universe, which date, which cost model, which factor definition, which risk snapshot? If the harness cannot answer, the artifact is not debuggable.
Incident: unreproducible trade idea
Agent task
The agent is asked:
Prepare a review packet for the semiconductor revision-momentum long/short idea.
It produces a polished memo and a candidate basket.
Available surface
The agent uses:
| Surface | Examples |
|---|---|
| Data snapshot | Prices, estimates, fundamentals |
| Factor version | revisions_v4, quality_v3 |
| Universe filter | US semis, market cap, liquidity |
| Backtest config | Horizon, rebalance, costs, benchmark |
| Risk model | Exposure snapshot and constraint checks |
| Memo generator | Charts, thesis, caveats |
The system stores the final memo but not the run path.
Bad run
Review finds that the memo includes a strong claim:
The strategy has remained resilient after costs and across regimes.
But the reviewer cannot reproduce it. The memo does not include the data snapshot ID, the factor versions, the excluded names, the transaction-cost model, or the exact backtest config.
Two reruns produce different results because the default universe file was updated overnight.
Why the harness failed
The final artifact was saved without lineage.
| Missing signal | Debugging impact |
|---|---|
| Data snapshot ID | Cannot reproduce source data |
| Factor versions | Cannot know which definitions produced scores |
| Universe filter | Cannot explain included/excluded names |
| Backtest config | Cannot reproduce performance metrics |
| Tool outputs | Cannot see warnings or missing values |
| Decision artifact | Cannot inspect rejected alternatives |
The issue is not only a missing appendix. It is a missing trace.
Why it happens
Teams often log final research artifacts but not the process that created them. That is insufficient for agentic work because the agent can combine many data pulls, transformations, tool calls, and decisions in one run.
OpenTelemetry’s vocabulary is useful: traces connect operations, spans represent units of work, logs/events capture detail. A quant-agent run needs similar structure, even if implemented simply.
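As a rough illustration, the same vocabulary maps onto the OpenTelemetry Python API directly. This is a minimal sketch, assuming the opentelemetry-api package is installed and that a TracerProvider and exporter are configured elsewhere; the tracer name, span names, and attributes are hypothetical, not a prescribed schema.

```python
# Minimal sketch: wrapping one quant-agent run in OpenTelemetry spans.
# Assumes opentelemetry-api is installed; spans are only recorded if an
# SDK TracerProvider and exporter are configured separately.
from opentelemetry import trace

tracer = trace.get_tracer("quant_agent")  # hypothetical tracer name

with tracer.start_as_current_span("run:QR-2026-05-17-SEMI-REV-01") as run_span:
    run_span.set_attribute("objective", "semiconductor revision-momentum review packet")

    with tracer.start_as_current_span("data_loaded") as span:
        span.set_attribute("snapshot_id", "MKT-2026-05-16-EOD")

    with tracer.start_as_current_span("backtest_run") as span:
        span.set_attribute("config_id", "BT-7781")
        span.add_event("warning", {"detail": "completed_with_warnings"})
```

The run is the trace, each data pull or tool call is a span, and warnings become events on the span, which is exactly the structure a reviewer later needs to walk.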
Harness principle
Instrument enough of the work to reconstruct the run.
flowchart TD
    A["Task contract"] --> B["Data snapshot"]
    B --> C["Factor and universe config"]
    C --> D["Backtest and risk tools"]
    D --> E["Decision artifact"]
    E --> F["Verification checks"]
    F --> G["Review packet"]
    B --> H["Run record"]
    D --> H
    E --> H
    F --> H
The goal is not to capture every token. The goal is to make important claims traceable.
Operating practice
Create a run record (a code sketch follows the table):
| Field | Example |
|---|---|
| Run ID | QR-2026-05-17-SEMI-REV-01 |
| Objective | Prepare review packet for semiconductor revision-momentum idea |
| Work surface | US semis, 1-3 month horizon, advisory memo only |
| Data snapshot | MKT-2026-05-16-EOD |
| Factor versions | revisions_v4, quality_v3 |
| Universe filter | Market cap > $5B, ADV > $50M, US-listed |
| Backtest config | Monthly rebalance, 20 bps one-way cost, 2018-2026 |
| Risk snapshot | RISK-2026-05-16-v12 |
| Warnings | Borrow missing for 2 short candidates |
| Verification | Net-cost table passed; regime split missing |
| Final status | Draft with blockers, not review-ready |
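The table above maps naturally onto a small, serializable object that can travel with the memo. This is a minimal sketch in Python; the class name, field names, and JSON serialization are assumptions, not a prescribed format.

```python
# Minimal sketch of a run record that travels with the memo.
# Field names mirror the table above; everything else is illustrative.
from dataclasses import dataclass, field, asdict
import json

@dataclass
class RunRecord:
    run_id: str
    objective: str
    work_surface: str
    data_snapshot: str
    factor_versions: list[str]
    universe_filter: str
    backtest_config: str
    risk_snapshot: str
    warnings: list[str] = field(default_factory=list)
    verification: list[str] = field(default_factory=list)
    final_status: str = "draft"

    def to_json(self) -> str:
        # Serialize so the record can be attached to the review packet.
        return json.dumps(asdict(self), indent=2)

record = RunRecord(
    run_id="QR-2026-05-17-SEMI-REV-01",
    objective="Prepare review packet for semiconductor revision-momentum idea",
    work_surface="US semis, 1-3 month horizon, advisory memo only",
    data_snapshot="MKT-2026-05-16-EOD",
    factor_versions=["revisions_v4", "quality_v3"],
    universe_filter="Market cap > $5B, ADV > $50M, US-listed",
    backtest_config="Monthly rebalance, 20 bps one-way cost, 2018-2026",
    risk_snapshot="RISK-2026-05-16-v12",
    warnings=["Borrow missing for 2 short candidates"],
    verification=["Net-cost table passed", "Regime split missing"],
    final_status="draft_with_blockers",
)
```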
If a reviewer challenges the memo, the team can inspect the run record rather than rerunning from memory.
The run record should travel with the memo. A chart without its data snapshot is decoration. A backtest table without its config is not reproducible evidence. A risk paragraph without a risk snapshot cannot be audited. The harness should make it difficult to export or request review without attaching the record.
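One way to make that "difficult" is a guard at the export step. This is a sketch under assumptions: the function name, required fields, and error handling are illustrative, not a standard.

```python
# Sketch: refuse to build a review packet unless a complete run record
# is attached. The required fields and error type are illustrative.
from typing import Optional

REQUIRED_FIELDS = (
    "run_id", "data_snapshot", "factor_versions",
    "universe_filter", "backtest_config", "risk_snapshot",
)

def export_review_packet(memo: str, run_record: Optional[dict]) -> dict:
    if run_record is None:
        raise ValueError("Review packet blocked: no run record attached.")
    missing = [f for f in REQUIRED_FIELDS if not run_record.get(f)]
    if missing:
        raise ValueError(f"Review packet blocked: run record missing {missing}.")
    # Only now does the memo become a reviewable artifact.
    return {"memo": memo, "run_record": run_record}
```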
This is where observability supports governance. The goal is not to monitor the model for its own sake. The goal is to preserve enough lineage that a human reviewer can accept, reject, or reproduce the advisory artifact.
Harnessed trace excerpt
09:10 task_started QR-2026-05-17-SEMI-REV-01
09:12 context_retrieved factor_registry revisions_v4
09:14 data_loaded MKT-2026-05-16-EOD
09:18 backtest_run config BT-7781 status completed_with_warnings
09:21 risk_check RISK-2026-05-16-v12 status fail warning high_borrow_missing
09:25 decision_made status draft_with_blockers
09:26 memo_drafted memo SEMI-REV-DRAFT-03
This trace reveals why the output should not be committee-ready.
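Trace lines like these do not require heavy tooling. A sketch of a small append-only event log that produces them, assuming a JSON-lines file per run; the helper name, timestamp format, and file layout are assumptions.

```python
# Sketch: append-only, structured run log that yields trace lines like
# the excerpt above. Event names and fields are illustrative.
import json
import time

def log_event(run_id: str, event: str, **fields) -> None:
    record = {"ts": time.strftime("%H:%M"), "run_id": run_id, "event": event, **fields}
    # One JSON object per line keeps the log machine-readable and easy to scan.
    with open(f"{run_id}.log.jsonl", "a") as f:
        f.write(json.dumps(record) + "\n")

log_event("QR-2026-05-17-SEMI-REV-01", "backtest_run",
          config="BT-7781", status="completed_with_warnings")
log_event("QR-2026-05-17-SEMI-REV-01", "risk_check",
          snapshot="RISK-2026-05-16-v12", status="fail",
          warning="high_borrow_missing")
```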
Event schema example
A minimal event schema (a validation sketch follows the table):
| Event | Required fields |
|---|---|
| task_started | run ID, objective, work surface |
| context_retrieved | source ID, version, freshness |
| data_loaded | snapshot ID, universe, timestamp |
| tool_called | tool name, config ID, safe input summary |
| tool_returned | status, metrics, warnings |
| decision_made | selected status, rejected options, evidence |
| verification_ran | check, pass/fail, evidence |
| handoff_written | final state, next action, blockers |
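One way to keep such a schema honest is a small required-fields check per event type. This is a sketch only; the snake_case field names are renderings of the table's columns, and the dict-based policy is an assumption.

```python
# Sketch: minimal enforcement of the event schema above. Field names
# mirror the table; the validation policy is illustrative.
REQUIRED_EVENT_FIELDS = {
    "task_started": {"run_id", "objective", "work_surface"},
    "data_loaded": {"snapshot_id", "universe", "timestamp"},
    "tool_called": {"tool_name", "config_id", "input_summary"},
    "decision_made": {"selected_status", "rejected_options", "evidence"},
}

def validate_event(event_type: str, fields: dict) -> list[str]:
    required = REQUIRED_EVENT_FIELDS.get(event_type, set())
    return sorted(required - set(fields))  # names of missing fields, if any

missing = validate_event("data_loaded", {"snapshot_id": "MKT-2026-05-16-EOD"})
# -> ["timestamp", "universe"]
```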
Keep sensitive details controlled. Identifiers and summaries are often enough.
For example, a record may store a dataset ID and row counts instead of raw holdings, or a safe summary of a tool input instead of full customer or portfolio details. The harness should balance auditability with confidentiality. In finance contexts, useful observability is structured and access-controlled.
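A sketch of that trade-off in code, assuming one possible redaction policy; the function, field names, and payload shape are illustrative, and a real harness would tie them to its own data-classification rules.

```python
# Sketch: store identifiers and counts instead of raw holdings when
# logging a tool call. The redaction policy here is illustrative.
def safe_input_summary(tool_name: str, payload: dict) -> dict:
    summary = {"tool": tool_name}
    if "holdings" in payload:
        # Log the dataset ID and row count, never the positions themselves.
        summary["holdings_ref"] = payload["holdings"].get("dataset_id", "unknown")
        summary["holdings_rows"] = len(payload["holdings"].get("rows", []))
    if "universe" in payload:
        summary["universe_size"] = len(payload["universe"])
    return summary

summary = safe_input_summary("risk_check", {
    # Raw positions exist in the payload but never reach the log.
    "holdings": {"dataset_id": "HOLD-2026-05-16", "rows": [("NVDA", 1200), ("AMD", -800)]},
    "universe": ["NVDA", "AMD", "AVGO"],
})
```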
Common mistakes
The first mistake is logging only the final memo. A memo is an output, not a trace.
The second mistake is saving screenshots of charts without config. Charts need data and parameter lineage.
The third mistake is omitting warnings because the run “completed.” Completed with warnings is different from passed.
The fourth mistake is recording decisions without rejected alternatives. Reviewers need to know what the agent considered and why it chose one path.
The fifth mistake is treating warnings as optional decoration. If a tool returns missing borrow data or stale estimates, that warning should be promoted into the decision artifact and handoff. Otherwise the trace records the warning but the workflow still ignores it.
The sixth mistake is making traces impossible for reviewers to read. A run record should be structured enough for machines, but concise enough that a PM, risk reviewer, or analyst can scan the important path quickly.
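The fifth mistake has a mechanical fix: promote warnings into the decision artifact at decision time. A minimal sketch, assuming dict-shaped tool results and decision artifacts; the field names and status values are illustrative.

```python
# Sketch: copy tool warnings into the decision artifact and handoff so
# the workflow cannot silently ignore them. Shapes are illustrative.
def promote_warnings(tool_results: list[dict], decision: dict) -> dict:
    warnings = [w for r in tool_results for w in r.get("warnings", [])]
    decision["warnings"] = warnings
    if warnings and decision.get("status") == "review_ready":
        # Unresolved warnings downgrade the artifact instead of vanishing.
        decision["status"] = "draft_with_blockers"
    return decision

decision = promote_warnings(
    [{"tool": "risk_check", "warnings": ["borrow missing for 2 short candidates"]}],
    {"status": "review_ready", "selected": "SEMI-REV-DRAFT-03"},
)
```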
Practical exercise
Take one quant-agent memo and design the run record needed to reproduce it. Include data snapshot, factor versions, universe, backtest config, risk snapshot, warnings, decisions, and verification.
Then ask whether a skeptical reviewer could reproduce or reject the claim without rerunning the whole workflow.
Key takeaways
- You cannot review what you cannot reconstruct.
- Quant-agent observability requires data lineage and tool-call evidence.
- Run records should capture warnings, not only success.
- Structured traces beat giant transcripts.
- Instrumentation supports review, debugging, and auditability.
Further reading / source notes
- OpenTelemetry Signals for vocabulary around traces, logs, metrics, and events.
- Honeycomb, “Observability Engineering” for debugging complex systems through rich events.
- OpenAI, “Harness engineering: leveraging Codex in an agent-first world” for harness feedback loops and observability.