Observe 42 min

Instrument the Work

Make quant-agent behavior debuggable with traces, data lineage, run records, and decision artifacts.

Failure pattern

A bad advisory idea reaches review, but the team cannot reconstruct which data snapshot, factor version, filters, or backtest parameters produced it.

Without instrumentation, review turns into archaeology. The memo says a signal works. The reviewer asks: which universe, which date, which cost model, which factor definition, which risk snapshot? If the harness cannot answer, the artifact is not debuggable.

Incident: unreproducible trade idea

Agent task

The agent is asked:

Prepare a review packet for the semiconductor revision-momentum long/short idea.

It produces a polished memo and a candidate basket.

Available surface

The agent uses:

| Surface | Examples |
| --- | --- |
| Data snapshot | Prices, estimates, fundamentals |
| Factor version | revisions_v4, quality_v3 |
| Universe filter | US semis, market cap, liquidity |
| Backtest config | Horizon, rebalance, costs, benchmark |
| Risk model | Exposure snapshot and constraint checks |
| Memo generator | Charts, thesis, caveats |

The system stores the final memo but not the run path.

Bad run

Review finds that the memo includes a strong claim:

The strategy has remained resilient after costs and across regimes.

But the reviewer cannot reproduce it. The memo does not include the data snapshot ID, the factor versions, the excluded names, the transaction-cost model, or the exact backtest config.

Two reruns produce different results because the default universe file updated overnight.

Why the harness failed

The final artifact was saved without lineage.

| Missing signal | Debugging impact |
| --- | --- |
| Data snapshot ID | Cannot reproduce source data |
| Factor versions | Cannot know which definitions produced scores |
| Universe filter | Cannot explain included/excluded names |
| Backtest config | Cannot reproduce performance metrics |
| Tool outputs | Cannot see warnings or missing values |
| Decision artifact | Cannot inspect rejected alternatives |

The issue is not only a missing appendix. It is a missing trace.

Why it happens

Teams often log final research artifacts but not the process that created them. That is insufficient for agentic work because the agent can combine many data pulls, transformations, tool calls, and decisions in one run.

OpenTelemetry’s vocabulary is useful: traces connect operations, spans represent units of work, logs/events capture detail. A quant-agent run needs similar structure, even if implemented simply.
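One lightweight way to mirror that structure is a run-scoped tracer where each unit of work is a span carrying attributes and events. This is a sketch in plain Python, not the OpenTelemetry SDK; the `RunTrace` class and its field names are illustrative assumptions.

```python
import time
from contextlib import contextmanager

class RunTrace:
    """Minimal trace for one agent run: spans are units of work, events add detail."""

    def __init__(self, run_id: str):
        self.run_id = run_id
        self.spans = []

    @contextmanager
    def span(self, name: str, **attrs):
        # Record start, attributes, and an event list for this unit of work.
        rec = {"name": name, "attrs": attrs, "events": [], "start": time.time()}
        self.spans.append(rec)
        try:
            yield rec
            rec["status"] = "ok"
        except Exception:
            rec["status"] = "error"
            raise
        finally:
            rec["end"] = time.time()

trace = RunTrace("QR-2026-05-17-SEMI-REV-01")
with trace.span("data_loaded", snapshot_id="MKT-2026-05-16-EOD") as s:
    s["events"].append("warning: borrow missing for 2 short candidates")
```

Even this minimal shape preserves the link between a claim ("data was loaded") and its lineage (which snapshot, which warnings), which is the property the incident above lacked.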

Harness principle

Instrument enough of the work to reconstruct the run.

```mermaid
flowchart TD
  A["Task contract"] --> B["Data snapshot"]
  B --> C["Factor and universe config"]
  C --> D["Backtest and risk tools"]
  D --> E["Decision artifact"]
  E --> F["Verification checks"]
  F --> G["Review packet"]
  B --> H["Run record"]
  D --> H
  E --> H
  F --> H
```

*A quant run record connects task, data, tools, decisions, verification, and the final memo.*

The goal is not to capture every token. The goal is to make important claims traceable.

Operating practice

Create a run record:

| Field | Example |
| --- | --- |
| Run ID | QR-2026-05-17-SEMI-REV-01 |
| Objective | Prepare review packet for semiconductor revision-momentum idea |
| Work surface | US semis, 1-3 month horizon, advisory memo only |
| Data snapshot | MKT-2026-05-16-EOD |
| Factor versions | revisions_v4, quality_v3 |
| Universe filter | Market cap > $5B, ADV > $50M, US-listed |
| Backtest config | Monthly rebalance, 20 bps one-way cost, 2018-2026 |
| Risk snapshot | RISK-2026-05-16-v12 |
| Warnings | Borrow missing for 2 short candidates |
| Verification | Net-cost table passed; regime split missing |
| Final status | Draft with blockers, not review-ready |

If a reviewer challenges the memo, the team can inspect the run record rather than rerunning from memory.
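The run record above is simple enough to express as a typed structure. This is one possible sketch; the `RunRecord` class and its field names are assumptions, not a prescribed schema.

```python
from dataclasses import dataclass, field

@dataclass
class RunRecord:
    """Lineage needed to reproduce or reject one quant-agent run."""
    run_id: str
    objective: str
    data_snapshot: str
    factor_versions: list
    universe_filter: str
    backtest_config: str
    risk_snapshot: str
    warnings: list = field(default_factory=list)
    verification: dict = field(default_factory=dict)
    final_status: str = "draft"

record = RunRecord(
    run_id="QR-2026-05-17-SEMI-REV-01",
    objective="Prepare review packet for semiconductor revision-momentum idea",
    data_snapshot="MKT-2026-05-16-EOD",
    factor_versions=["revisions_v4", "quality_v3"],
    universe_filter="Market cap > $5B, ADV > $50M, US-listed",
    backtest_config="Monthly rebalance, 20 bps one-way cost, 2018-2026",
    risk_snapshot="RISK-2026-05-16-v12",
    warnings=["Borrow missing for 2 short candidates"],
    final_status="draft_with_blockers",
)
```

A typed record makes the lineage fields mandatory at construction time, so a run cannot silently omit its snapshot or config.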

The run record should travel with the memo. A chart without its data snapshot is decoration. A backtest table without its config is not reproducible evidence. A risk paragraph without a risk snapshot cannot be audited. The harness should make it difficult to export or request review without attaching the record.
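One way to make export-without-lineage difficult is a gate that lists blockers before a memo can be sent for review. This is a hedged sketch; the field names and the `export_blockers` helper are hypothetical.

```python
# Lineage fields a memo's run record must carry before review (illustrative list).
REQUIRED_LINEAGE = ("run_id", "data_snapshot", "factor_versions",
                    "backtest_config", "risk_snapshot")

def export_blockers(memo: dict) -> list:
    """Return reasons this memo may not be exported for review."""
    record = memo.get("run_record") or {}
    missing = [f for f in REQUIRED_LINEAGE if not record.get(f)]
    blockers = [f"missing lineage field: {f}" for f in missing]
    if record.get("verification", {}).get("failed"):
        blockers.append("verification checks failed")
    return blockers

memo = {"title": "Semi revision momentum", "run_record": {"run_id": "QR-01"}}
blockers = export_blockers(memo)
# Flags data_snapshot, factor_versions, backtest_config, risk_snapshot.
```

The gate does not judge the idea; it only refuses to treat an artifact without lineage as reviewable evidence.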

This is where observability supports governance. The goal is not to monitor the model for its own sake. The goal is to preserve enough lineage that a human reviewer can accept, reject, or reproduce the advisory artifact.

Harnessed trace excerpt

```
09:10 task_started QR-2026-05-17-SEMI-REV-01
09:12 context_retrieved factor_registry revisions_v4
09:14 data_loaded MKT-2026-05-16-EOD
09:18 backtest_run config BT-7781 status completed_with_warnings
09:21 risk_check RISK-2026-05-16-v12 status fail warning high_borrow_missing
09:25 decision_made status draft_with_blockers
09:26 memo_drafted memo SEMI-REV-DRAFT-03
```

This trace reveals why the output should not be committee-ready.

Event schema

A minimal event schema:

| Event | Required fields |
| --- | --- |
| task_started | run ID, objective, work surface |
| context_retrieved | source ID, version, freshness |
| data_loaded | snapshot ID, universe, timestamp |
| tool_called | tool name, config ID, safe input summary |
| tool_returned | status, metrics, warnings |
| decision_made | selected status, rejected options, evidence |
| verification_ran | check, pass/fail, evidence |
| handoff_written | final state, next action, blockers |

Keep sensitive details controlled. Identifiers and summaries are often enough.

For example, a record may store a dataset ID and row counts instead of raw holdings, or a safe summary of a tool input instead of full customer or portfolio details. The harness should balance auditability with confidentiality. In finance contexts, useful observability is structured and access-controlled.
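The schema above can be enforced mechanically. Here is a minimal sketch of a validator that rejects events missing required fields; the field names follow the table, with underscored identifiers as an assumption.

```python
# Required fields per event type, mirroring the schema table (illustrative names).
REQUIRED_FIELDS = {
    "task_started": {"run_id", "objective", "work_surface"},
    "context_retrieved": {"source_id", "version", "freshness"},
    "data_loaded": {"snapshot_id", "universe", "timestamp"},
    "tool_called": {"tool_name", "config_id", "input_summary"},
    "tool_returned": {"status", "metrics", "warnings"},
    "decision_made": {"selected_status", "rejected_options", "evidence"},
    "verification_ran": {"check", "result", "evidence"},
    "handoff_written": {"final_state", "next_action", "blockers"},
}

def validate_event(event: dict) -> list:
    """Return the required fields missing from a structured run event."""
    required = REQUIRED_FIELDS.get(event.get("type"), set())
    return sorted(required - event.keys())

event = {"type": "data_loaded", "snapshot_id": "MKT-2026-05-16-EOD",
         "universe": "US semis"}
missing = validate_event(event)  # -> ['timestamp']
```

Rejecting incomplete events at write time is what keeps the trace trustworthy later: a reviewer can assume every `data_loaded` event carries its snapshot ID.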

Common mistakes

The first mistake is logging only the final memo. A memo is an output, not a trace.

The second mistake is saving screenshots of charts without config. Charts need data and parameter lineage.

The third mistake is omitting warnings because the run “completed.” Completed with warnings is different from passed.

The fourth mistake is recording decisions without rejected alternatives. Reviewers need to know what the agent considered and why it chose one path.

The fifth mistake is treating warnings as optional decoration. If a tool returns missing borrow data or stale estimates, that warning should be promoted into the decision artifact and handoff. Otherwise the trace records the warning but the workflow still ignores it.

The sixth mistake is making traces impossible for reviewers to read. A run record should be structured enough for machines, but concise enough that a PM, risk reviewer, or analyst can scan the important path quickly.

Practical exercise

Take one quant-agent memo and design the run record needed to reproduce it. Include data snapshot, factor versions, universe, backtest config, risk snapshot, warnings, decisions, and verification.

Then ask whether a skeptical reviewer could reproduce or reject the claim without rerunning the whole workflow.

Key takeaways

  • You cannot review what you cannot reconstruct.
  • Quant-agent observability requires data lineage and tool-call evidence.
  • Run records should capture warnings, not only success.
  • Structured traces beat giant transcripts.
  • Instrumentation supports review, debugging, and auditability.

Further reading / source notes