Prepare the Runway
Separate setup, data checks, and first verification from quant research work.
Failure pattern
The agent starts research before the data and tools are trustworthy, so every later result is ambiguous.
In quant work, this is especially costly. A factor screen can look precise while using stale prices, missing corporate actions, broken benchmark membership, or a backtest engine that silently changed defaults. Once analysis starts on a bad runway, no one knows whether the result is insight or setup noise.
Incident: factor screen before preflight
Agent task
A researcher asks:
Run a semiconductor factor screen for revision momentum and quality. Bring me the top long and short candidates before the morning meeting.
The agent moves immediately into screening.
Available surface
The workflow depends on:
| Surface | Required condition |
|---|---|
| Market data snapshot | Prices and estimates updated after prior close |
| Corporate actions | Splits, dividends, and restatements applied |
| Universe membership | Point-in-time semiconductor universe available |
| Benchmark data | Current sector benchmark and weights loaded |
| Backtest engine | Baseline strategy test passes |
| Risk model | Latest factor exposures available |
The harness has no mandatory preflight. The agent can call the screen tool directly.
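One way to make these required conditions checkable is to encode each surface as an explicit start condition. The sketch below uses hypothetical names and stubbed data sources, not any real harness API:

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable

# Hypothetical sketch: each surface in the table above becomes an explicit,
# checkable start condition instead of an implicit assumption.

@dataclass(frozen=True)
class SurfaceCheck:
    name: str
    required_condition: str
    check: Callable[[], bool]  # returns True when the surface is trustworthy

def prices_after_prior_close() -> bool:
    # Stub: compare the market-data snapshot timestamp against the prior close.
    snapshot_ts = datetime(2024, 5, 2, 6, 15, tzinfo=timezone.utc)    # illustrative values
    prior_close_ts = datetime(2024, 5, 1, 20, 0, tzinfo=timezone.utc)
    return snapshot_ts > prior_close_ts

def corporate_actions_applied() -> bool:
    # Stub: count unresolved splits, dividends, and restatements for the universe.
    unresolved = 0  # would come from the corporate-actions pipeline
    return unresolved == 0

REQUIRED_SURFACES = [
    SurfaceCheck("market_data", "Prices and estimates updated after prior close",
                 prices_after_prior_close),
    SurfaceCheck("corporate_actions", "Splits, dividends, and restatements applied",
                 corporate_actions_applied),
    # universe, benchmark, backtest baseline, and risk model follow the same pattern
]
```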
Bad run
The agent returns:
Top long candidates:
- Name A: high revisions, strong quality
- Name B: improving margins
Top short candidates:
- Name C: negative revisions
- Name D: weak quality
Later, the analyst discovers three runway problems:
- Estimates data was stale by one trading day.
- A split adjustment was missing for one candidate.
- The benchmark constituent file failed to load, so relative rankings used a fallback sector list.
The screen is not reliable. The agent did not fail at ranking; it ranked from an unverified environment.
Why the harness failed
The harness let product work start before runway checks.
| Missing check | Consequence |
|---|---|
| Data freshness | Agent used stale estimates |
| Corporate-action status | Split-adjusted returns were wrong |
| Universe membership | Ranking used fallback constituents |
| Baseline backtest | Tool health was not proven |
| Known broken state | Fallback behavior was invisible |
The output looked like research, but it was setup debt.
Why it happens
Agents are task-directed. If asked to run a screen, they run a screen. They may not stop to ask whether the market-data snapshot is current unless the harness makes that a start condition.
Humans often know the daily ritual: check data loads, read pipeline alerts, confirm benchmark files, scan corporate-action warnings, and run a small baseline. A quant agent needs that ritual encoded. Otherwise, stale inputs produce confident artifacts.
Harness principle
Initialization is its own phase.
Before research execution, the harness should prove:
- Data is fresh enough.
- Universe and benchmark files are available.
- Corporate-action adjustments are applied.
- Backtest and screen tools pass a baseline run.
- Known broken state is recorded.
- Degraded modes are explicit.
flowchart LR
A["Start research run"] --> B["Check data freshness"]
B --> C["Check universe and benchmark"]
C --> D["Check corporate actions"]
D --> E["Run baseline tool test"]
E --> F{"Runway clear?"}
F -->|"Yes"| G["Run factor screen"]
F -->|"No"| H["Stop or use declared degraded mode"] A runway check is not bureaucracy. It protects the meaning of results.
Operating practice
Use a preflight record:
| Check | Pass condition | Result |
|---|---|---|
| Prices | Snapshot after prior close | Pass |
| Estimates | Vendor load timestamp after 06:00 | Fail |
| Corporate actions | No unresolved adjustments for universe | Pass |
| Universe | Point-in-time semiconductor file loaded | Pass |
| Benchmark | Sector benchmark weights loaded | Fail |
| Baseline screen | Known sample returns the expected top names | Not run |
With this record, the agent should not produce final rankings. It should return:
Runway blocked:
- Estimates snapshot is stale.
- Benchmark weights failed to load.
- Factor screen not executed as reviewable output.
Possible degraded mode:
- Run exploratory absolute ranking only, clearly marked not committee-ready.
That is a successful harness response.
The preflight record should be saved with the run, even when all checks pass. Later, when a researcher questions a surprising signal, the team can see whether the run started from a clean environment. This matters because many quant errors are discovered after the fact. A clean preflight does not prove the thesis, but it narrows the search space when debugging.
For advisory workflows, the preflight can also control output level. If every check passes, the agent may create a review packet. If a non-critical benchmark check fails, it may create an exploratory note. If risk model or point-in-time data fails, it should stop before producing advisory language.
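Both ideas can be sketched together: persist the preflight record next to the run, and derive the allowed output level from it. The names, check labels, and file layout below are illustrative only:

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class PreflightRecord:
    run_id: str
    checked_at: str
    results: dict[str, str]  # check name -> "pass" / "fail" / "not_run"

    def allowed_output(self) -> str:
        hard_stops = {"risk_model", "point_in_time_universe"}
        if any(self.results.get(check) != "pass" for check in hard_stops):
            return "none"            # stop before producing any advisory language
        if all(r == "pass" for r in self.results.values()):
            return "review_packet"   # every check passed
        return "exploratory_note"    # non-critical failure, clearly labeled as such

record = PreflightRecord(
    run_id="screen-2024-05-02-a",
    checked_at=datetime.now(timezone.utc).isoformat(),
    results={"prices": "pass", "estimates": "fail", "benchmark": "fail",
             "risk_model": "pass", "point_in_time_universe": "pass"},
)

# Save the record with the run even when all checks pass, so a surprising signal
# can later be traced back to a clean (or dirty) starting environment.
with open(f"{record.run_id}_preflight.json", "w") as fh:
    json.dump(asdict(record), fh, indent=2)
```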
Product-agent example
A quant preflight contract should define hard stops and degraded modes:
| Condition | Behavior |
|---|---|
| Risk model missing | Stop advisory output |
| Estimates stale | Stop revision analysis |
| Benchmark missing | Allow absolute screen only, not relative ranking |
| Corporate actions unresolved | Exclude affected names or stop |
| Baseline tool test fails | Stop and report runway failure |
The contract prevents the agent from treating partial data as complete research.
This is also where the harness can encode desk-specific tolerance. Some teams may allow exploratory screens with stale benchmark weights if the memo is clearly marked exploratory. They should not allow the same output to enter committee review. The runway check should therefore produce both a technical status and an allowed output mode.
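A hedged sketch of such a contract as data the harness can enforce, including the desk-specific tolerance just described. Condition names and behaviors are illustrative, not a prescribed schema:

```python
# Illustrative condition names and behaviors only; the real contract belongs to the desk.
PREFLIGHT_CONTRACT = {
    "risk_model_missing":           "stop_advisory_output",
    "estimates_stale":              "stop_revision_analysis",
    "benchmark_missing":            "allow_absolute_screen_only",
    "corporate_actions_unresolved": "exclude_affected_names_or_stop",
    "baseline_tool_test_failed":    "stop_and_report_runway_failure",
}

def allowed_output_mode(failed_conditions: list[str], desk_allows_exploratory: bool) -> str:
    """Return the output mode implied by the contract plus desk-specific tolerance."""
    behaviors = {PREFLIGHT_CONTRACT[c] for c in failed_conditions if c in PREFLIGHT_CONTRACT}
    if any(b.startswith("stop") for b in behaviors):
        return "blocked"             # hard stop: no advisory output at all
    if behaviors:
        # e.g. stale benchmark weights: an exploratory memo may be allowed,
        # but the same output never enters committee review
        return "exploratory_only" if desk_allows_exploratory else "blocked"
    return "committee_ready"

# allowed_output_mode(["benchmark_missing"], desk_allows_exploratory=True)
# -> "exploratory_only"
```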
Common mistakes
The first mistake is checking only tool availability. A data tool can respond and still return stale data.
The second mistake is allowing silent fallback. Fallback universe or benchmark behavior must be visible.
The third mistake is mixing runway repair with research output. If the agent fixes data loads and produces a thesis in one run, evidence becomes muddy.
The fourth mistake is treating exploratory output as committee-ready. Degraded mode should be labeled.
Practical exercise
Write a preflight checklist for one quant-agent workflow. Include data freshness, methodology version, universe, benchmark, tool baseline, and risk model checks.
Then define what the agent should do when each check fails: stop, degrade, exclude, or escalate.
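One possible starting template for the exercise, with purely illustrative entries:

```python
# Replace every entry with the pass conditions and failure actions of your own workflow.
PREFLIGHT_CHECKLIST = {
    # check                  (pass condition,                                  on failure)
    "data_freshness":        ("snapshot newer than prior close",               "stop"),
    "methodology_version":   ("factor definitions match the approved version", "escalate"),
    "universe":              ("point-in-time membership file loaded",          "stop"),
    "benchmark":             ("sector weights loaded",                         "degrade"),
    "tool_baseline":         ("known sample returns the expected top names",   "stop"),
    "risk_model":            ("latest factor exposures available",             "stop"),
}
```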
Key takeaways
- Quant research started on an unverified runway is already suspect.
- Freshness and methodology checks must precede analysis.
- Fallback behavior should never be silent.
- Degraded output must be labeled as degraded.
- A blocked runway is useful information, not a failed agent.
Further reading / source notes
- Anthropic, “Effective harnesses for long-running agents” for setup and baseline checks in long-running harnesses.
- OpenAI, “Harness engineering: leveraging Codex in an agent-first world” for environment design and feedback loops around agents.
- NIST AI Risk Management Framework for risk-aware AI system operation and monitoring.