Instrument the Work
Make quant-agent behavior debuggable with traces, data lineage, run records, and decision artifacts.
Failure pattern
A bad advisory idea reaches review, but the team cannot reconstruct which data snapshot, factor version, filters, or backtest parameters produced it.
Without instrumentation, review turns into archaeology. The memo says a signal works. The reviewer asks: which universe, which date, which cost model, which factor definition, which risk snapshot? If the harness cannot answer, the artifact is not debuggable.
Incident: unreproducible trade idea
Agent task
The agent is asked:
Prepare a review packet for the semiconductor revision-momentum long/short idea.
It produces a polished memo and a candidate basket.
Available surface
The agent uses:
| Surface | Examples |
|---|---|
| Data snapshot | Prices, estimates, fundamentals |
| Factor version | revisions_v4, quality_v3 |
| Universe filter | US semis, market cap, liquidity |
| Backtest config | Horizon, rebalance, costs, benchmark |
| Risk model | Exposure snapshot and constraint checks |
| Memo generator | Charts, thesis, caveats |
The system stores the final memo but not the run path.
Bad run
Review finds that the memo includes a strong claim:
The strategy has remained resilient after costs and across regimes.
But the reviewer cannot reproduce it. The memo does not include the data snapshot ID, the factor versions, the excluded names, the transaction-cost model, or the exact backtest config.
Two reruns produce different results because the default universe file was updated overnight.
Why the harness failed
The final artifact was saved without lineage.
| Missing signal | Debugging impact |
|---|---|
| Data snapshot ID | Cannot reproduce source data |
| Factor versions | Cannot know which definitions produced scores |
| Universe filter | Cannot explain included/excluded names |
| Backtest config | Cannot reproduce performance metrics |
| Tool outputs | Cannot see warnings or missing values |
| Decision artifact | Cannot inspect rejected alternatives |
The issue is not only a missing appendix. It is a missing trace.
Why it happens
Teams often log final research artifacts but not the process that created them. That is insufficient for agentic work because the agent can combine many data pulls, transformations, tool calls, and decisions in one run.
OpenTelemetry’s vocabulary is useful: traces connect operations, spans represent units of work, logs/events capture detail. A quant-agent run needs similar structure, even if implemented simply.
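As a rough illustration, the same vocabulary maps onto the OpenTelemetry Python API directly. This is a minimal sketch, assuming the opentelemetry-api package is installed and that a TracerProvider and exporter are configured elsewhere; the tracer name, span names, and attributes are hypothetical, not a prescribed schema.

```python
# Minimal sketch: wrapping one quant-agent run in OpenTelemetry spans.
# Assumes opentelemetry-api is installed; spans are only recorded if an
# SDK TracerProvider and exporter are configured separately.
from opentelemetry import trace

tracer = trace.get_tracer("quant_agent")  # hypothetical tracer name

with tracer.start_as_current_span("run:QR-2026-05-17-SEMI-REV-01") as run_span:
    run_span.set_attribute("objective", "semiconductor revision-momentum review packet")

    with tracer.start_as_current_span("data_loaded") as span:
        span.set_attribute("snapshot_id", "MKT-2026-05-16-EOD")

    with tracer.start_as_current_span("backtest_run") as span:
        span.set_attribute("config_id", "BT-7781")
        span.add_event("warning", {"detail": "completed_with_warnings"})
```

The run is the trace, each data pull or tool call is a span, and warnings become events on the span, which is exactly the structure a reviewer later needs to walk.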
Harness principle
Instrument enough of the work to reconstruct the run.
flowchart TD
    A["Task contract"] --> B["Data snapshot"]
    B --> C["Factor and universe config"]
    C --> D["Backtest and risk tools"]
    D --> E["Decision artifact"]
    E --> F["Verification checks"]
    F --> G["Review packet"]
    B --> H["Run record"]
    D --> H
    E --> H
    F --> H
The goal is not to capture every token. The goal is to make important claims traceable.
Operating practice
Create a run record (a code sketch follows the table):
| Field | Example |
|---|---|
| Run ID | QR-2026-05-17-SEMI-REV-01 |
| Objective | Prepare review packet for semiconductor revision-momentum idea |
| Work surface | US semis, 1-3 month horizon, advisory memo only |
| Data snapshot | MKT-2026-05-16-EOD |
| Factor versions | revisions_v4, quality_v3 |
| Universe filter | Market cap > $5B, ADV > $50M, US-listed |
| Backtest config | Monthly rebalance, 20 bps one-way cost, 2018-2026 |
| Risk snapshot | RISK-2026-05-16-v12 |
| Warnings | Borrow missing for 2 short candidates |
| Verification | Net-cost table passed; regime split missing |
| Final status | Draft with blockers, not review-ready |
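The table above maps naturally onto a small, serializable object that can travel with the memo. This is a minimal sketch in Python; the class name, field names, and JSON serialization are assumptions, not a prescribed format.

```python
# Minimal sketch of a run record that travels with the memo.
# Field names mirror the table above; everything else is illustrative.
from dataclasses import dataclass, field, asdict
import json

@dataclass
class RunRecord:
    run_id: str
    objective: str
    work_surface: str
    data_snapshot: str
    factor_versions: list[str]
    universe_filter: str
    backtest_config: str
    risk_snapshot: str
    warnings: list[str] = field(default_factory=list)
    verification: list[str] = field(default_factory=list)
    final_status: str = "draft"

    def to_json(self) -> str:
        # Serialize so the record can be attached to the review packet.
        return json.dumps(asdict(self), indent=2)

record = RunRecord(
    run_id="QR-2026-05-17-SEMI-REV-01",
    objective="Prepare review packet for semiconductor revision-momentum idea",
    work_surface="US semis, 1-3 month horizon, advisory memo only",
    data_snapshot="MKT-2026-05-16-EOD",
    factor_versions=["revisions_v4", "quality_v3"],
    universe_filter="Market cap > $5B, ADV > $50M, US-listed",
    backtest_config="Monthly rebalance, 20 bps one-way cost, 2018-2026",
    risk_snapshot="RISK-2026-05-16-v12",
    warnings=["Borrow missing for 2 short candidates"],
    verification=["Net-cost table passed", "Regime split missing"],
    final_status="draft_with_blockers",
)
```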
If a reviewer challenges the memo, the team can inspect the run record rather than rerunning from memory.
The run record should travel with the memo. A chart without its data snapshot is decoration. A backtest table without its config is not reproducible evidence. A risk paragraph without a risk snapshot cannot be audited. The harness should make it difficult to export or request review without attaching the record.
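One way to make that "difficult" is a guard at the export step. This is a sketch under assumptions: the function name, required fields, and error handling are illustrative, not a standard.

```python
# Sketch: refuse to build a review packet unless a complete run record
# is attached. The required fields and error type are illustrative.
from typing import Optional

REQUIRED_FIELDS = (
    "run_id", "data_snapshot", "factor_versions",
    "universe_filter", "backtest_config", "risk_snapshot",
)

def export_review_packet(memo: str, run_record: Optional[dict]) -> dict:
    if run_record is None:
        raise ValueError("Review packet blocked: no run record attached.")
    missing = [f for f in REQUIRED_FIELDS if not run_record.get(f)]
    if missing:
        raise ValueError(f"Review packet blocked: run record missing {missing}.")
    # Only now does the memo become a reviewable artifact.
    return {"memo": memo, "run_record": run_record}
```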
This is where observability supports governance. The goal is not to monitor the model for its own sake. The goal is to preserve enough lineage that a human reviewer can accept, reject, or reproduce the advisory artifact.
Harnessed trace excerpt
09:10 task_started QR-2026-05-17-SEMI-REV-01
09:12 context_retrieved factor_registry revisions_v4
09:14 data_loaded MKT-2026-05-16-EOD
09:18 backtest_run config BT-7781 status completed_with_warnings
09:21 risk_check RISK-2026-05-16-v12 status fail warning high_borrow_missing
09:25 decision_made status draft_with_blockers
09:26 memo_drafted memo SEMI-REV-DRAFT-03
This trace reveals why the output should not be committee-ready.
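Trace lines like these do not require heavy tooling. A sketch of a small append-only event log that produces them, assuming a JSON-lines file per run; the helper name, timestamp format, and file layout are assumptions.

```python
# Sketch: append-only, structured run log that yields trace lines like
# the excerpt above. Event names and fields are illustrative.
import json
import time

def log_event(run_id: str, event: str, **fields) -> None:
    record = {"ts": time.strftime("%H:%M"), "run_id": run_id, "event": event, **fields}
    # One JSON object per line keeps the log machine-readable and easy to scan.
    with open(f"{run_id}.log.jsonl", "a") as f:
        f.write(json.dumps(record) + "\n")

log_event("QR-2026-05-17-SEMI-REV-01", "backtest_run",
          config="BT-7781", status="completed_with_warnings")
log_event("QR-2026-05-17-SEMI-REV-01", "risk_check",
          snapshot="RISK-2026-05-16-v12", status="fail",
          warning="high_borrow_missing")
```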
Event schema example
A minimal event schema (a validation sketch follows the table):
| Event | Required fields |
|---|---|
| task_started | run ID, objective, work surface |
| context_retrieved | source ID, version, freshness |
| data_loaded | snapshot ID, universe, timestamp |
| tool_called | tool name, config ID, safe input summary |
| tool_returned | status, metrics, warnings |
| decision_made | selected status, rejected options, evidence |
| verification_ran | check, pass/fail, evidence |
| handoff_written | final state, next action, blockers |
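One way to keep such a schema honest is a small required-fields check per event type. This is a sketch only; the snake_case field names are renderings of the table's columns, and the dict-based policy is an assumption.

```python
# Sketch: minimal enforcement of the event schema above. Field names
# mirror the table; the validation policy is illustrative.
REQUIRED_EVENT_FIELDS = {
    "task_started": {"run_id", "objective", "work_surface"},
    "data_loaded": {"snapshot_id", "universe", "timestamp"},
    "tool_called": {"tool_name", "config_id", "input_summary"},
    "decision_made": {"selected_status", "rejected_options", "evidence"},
}

def validate_event(event_type: str, fields: dict) -> list[str]:
    required = REQUIRED_EVENT_FIELDS.get(event_type, set())
    return sorted(required - set(fields))  # names of missing fields, if any

missing = validate_event("data_loaded", {"snapshot_id": "MKT-2026-05-16-EOD"})
# -> ["timestamp", "universe"]
```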
Keep sensitive details controlled. Identifiers and summaries are often enough.
For example, a record may store a dataset ID and row counts instead of raw holdings, or a safe summary of a tool input instead of full customer or portfolio details. The harness should balance auditability with confidentiality. In finance contexts, useful observability is structured and access-controlled.
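A sketch of that trade-off in code, assuming one possible redaction policy; the function, field names, and payload shape are illustrative, and a real harness would tie them to its own data-classification rules.

```python
# Sketch: store identifiers and counts instead of raw holdings when
# logging a tool call. The redaction policy here is illustrative.
def safe_input_summary(tool_name: str, payload: dict) -> dict:
    summary = {"tool": tool_name}
    if "holdings" in payload:
        # Log the dataset ID and row count, never the positions themselves.
        summary["holdings_ref"] = payload["holdings"].get("dataset_id", "unknown")
        summary["holdings_rows"] = len(payload["holdings"].get("rows", []))
    if "universe" in payload:
        summary["universe_size"] = len(payload["universe"])
    return summary

summary = safe_input_summary("risk_check", {
    # Raw positions exist in the payload but never reach the log.
    "holdings": {"dataset_id": "HOLD-2026-05-16", "rows": [("NVDA", 1200), ("AMD", -800)]},
    "universe": ["NVDA", "AMD", "AVGO"],
})
```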
Common mistakes
The first mistake is logging only the final memo. A memo is an output, not a trace.
The second mistake is saving screenshots of charts without config. Charts need data and parameter lineage.
The third mistake is omitting warnings because the run “completed.” Completed with warnings is different from passed.
The fourth mistake is recording decisions without rejected alternatives. Reviewers need to know what the agent considered and why it chose one path.
The fifth mistake is treating warnings as optional decoration. If a tool returns missing borrow data or stale estimates, that warning should be promoted into the decision artifact and handoff. Otherwise the trace records the warning but the workflow still ignores it.
The sixth mistake is making traces impossible for reviewers to read. A run record should be structured enough for machines, but concise enough that a PM, risk reviewer, or analyst can scan the important path quickly.
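The fifth mistake has a mechanical fix: promote warnings into the decision artifact at decision time. A minimal sketch, assuming dict-shaped tool results and decision artifacts; the field names and status values are illustrative.

```python
# Sketch: copy tool warnings into the decision artifact and handoff so
# the workflow cannot silently ignore them. Shapes are illustrative.
def promote_warnings(tool_results: list[dict], decision: dict) -> dict:
    warnings = [w for r in tool_results for w in r.get("warnings", [])]
    decision["warnings"] = warnings
    if warnings and decision.get("status") == "review_ready":
        # Unresolved warnings downgrade the artifact instead of vanishing.
        decision["status"] = "draft_with_blockers"
    return decision

decision = promote_warnings(
    [{"tool": "risk_check", "warnings": ["borrow missing for 2 short candidates"]}],
    {"status": "review_ready", "selected": "SEMI-REV-DRAFT-03"},
)
```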
Practical exercise
Take one quant-agent memo and design the run record needed to reproduce it. Include data snapshot, factor versions, universe, backtest config, risk snapshot, warnings, decisions, and verification.
Then ask whether a skeptical reviewer could reproduce or reject the claim without rerunning the whole workflow.
Key takeaways
- You cannot review what you cannot reconstruct.
- Quant-agent observability requires data lineage and tool-call evidence.
- Run records should capture warnings, not only success.
- Structured traces beat giant transcripts.
- Instrumentation supports review, debugging, and auditability.
Further reading / source notes
- OpenTelemetry Signals for vocabulary around traces, logs, metrics, and events.
- Honeycomb, “Observability Engineering” for debugging complex systems through rich events.
- OpenAI, “Harness engineering: leveraging Codex in an agent-first world” for harness feedback loops and observability.