Limit Active Work
Prevent strategy drift by forcing the agent to finish one research behavior before starting the next.
Failure pattern
A request to “improve the strategy” causes the agent to change universe filters, factor weights, risk constraints, rebalance cadence, and memo framing at the same time.
The run looks productive because many things changed. The research is worse because no one can tell which change helped, which change hurt, and which interaction created the result.
In quant research, uncontrolled active work turns analysis into accidental curve fitting.
Incident: strategy improvement spiral
Agent task
A portfolio manager says:
Improve the semiconductor long/short screen. The latest backtest is weak after 2023.
The agent sees this as a broad optimization problem.
Available surface
The agent can change:
| Surface | Possible changes |
|---|---|
| Universe filter | Market cap, liquidity, region, subsector |
| Factor weights | Revision momentum, quality, valuation, short interest |
| Risk constraints | Beta, sector, country, factor exposure |
| Rebalance cadence | Weekly, monthly, event-driven |
| Cost model | Spread, borrow, slippage assumptions |
| Memo framing | Thesis, caveats, charts, recommendation language |
No state model defines what is active, queued, blocked, or verified.
Bad run
The agent changes five things:
- Removes small-cap names under $10B.
- Increases revision momentum weight from 30% to 50%.
- Adds a quality floor.
- Changes rebalance from monthly to weekly.
- Rewrites memo to emphasize near-term estimate revisions.
The new backtest improves after 2023. But review finds that turnover doubled, borrow constraints worsened, and the improvement mostly came from excluding weak small-cap shorts. The team cannot attribute the performance change because too many variables moved.
Why the harness failed
The harness allowed many active behaviors.
| Missing control | Consequence |
|---|---|
| One active item | Agent changed filters, factors, cadence, and memo together |
| Verification per change | No evidence linked to each modification |
| Blocked state | Borrow-cost question was bypassed |
| Deferred list | Memo improvements mixed with strategy changes |
| Attribution rule | Performance improvement could not be explained |
The agent did not optimize a strategy. It opened too many fronts.
Why it happens
Agents are good at association. If the backtest is weak, universe, weights, costs, cadence, and narrative all feel related. A human researcher may explore several hypotheses privately, but a reviewable research process needs controlled changes.
Quant work especially needs isolation. A strategy result is only meaningful if the team knows what changed. Without active-work limits, the agent can accidentally produce a better-looking backtest that cannot survive review.
Harness principle
Limit active work to one research behavior at a time.
```mermaid
stateDiagram-v2
    [*] --> Queued
    Queued --> Active: selected hypothesis
    Active --> Verified: evidence passes
    Active --> Blocked: missing data or decision
    Blocked --> Active: unblocked
    Verified --> [*]
```
A large research goal can still proceed through many steps. The rule is that each run has one active behavior and one evidence standard.
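As a concrete illustration, here is a minimal Python sketch of that state model. The `Hypothesis` class and `TRANSITIONS` table are hypothetical names, not part of any specific harness; the point is that illegal transitions fail loudly instead of drifting silently.

```python
from dataclasses import dataclass
from enum import Enum, auto


class State(Enum):
    QUEUED = auto()
    ACTIVE = auto()
    VERIFIED = auto()
    BLOCKED = auto()
    DEFERRED = auto()


# Legal transitions, mirroring the state diagram above.
TRANSITIONS = {
    State.QUEUED: {State.ACTIVE},
    State.ACTIVE: {State.VERIFIED, State.BLOCKED},
    State.BLOCKED: {State.ACTIVE},
    State.VERIFIED: set(),
    State.DEFERRED: set(),
}


@dataclass
class Hypothesis:
    name: str
    state: State = State.QUEUED

    def move_to(self, new_state: State) -> None:
        # Fail loudly on any transition the diagram does not allow.
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(
                f"illegal transition {self.state.name} -> {new_state.name}"
            )
        self.state = new_state
```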
Operating practice
Turn “improve the strategy” into a queue:
| Item | State | Evidence |
|---|---|---|
| Test whether small-cap exclusion explains post-2023 weakness | active | Backtest with only universe floor changed |
| Test revision momentum weight from 30% to 50% | queued | Same universe, same cadence, same costs |
| Evaluate weekly rebalance | queued | Turnover and cost-adjusted return comparison |
| Add borrow-cost constraint | blocked | Borrow dataset coverage check |
| Rewrite memo | deferred | Only after strategy evidence is stable |
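A harness can enforce the single-active rule over such a queue mechanically. This is a sketch under assumed names (a hypothetical `Board`, with plain string states for brevity); the invariant check in `activate` is the part that matters.

```python
class Board:
    """Tracks hypothesis states ("queued", "active", "blocked", ...) by name."""

    def __init__(self) -> None:
        self.states: dict[str, str] = {}

    def add(self, name: str, state: str = "queued") -> None:
        self.states[name] = state

    def activate(self, name: str) -> None:
        # Core invariant: at most one hypothesis is active at a time.
        if "active" in self.states.values():
            raise RuntimeError("another hypothesis is already active")
        self.states[name] = "active"


board = Board()
board.add("Test small-cap exclusion")
board.add("Test revision momentum weight 30% -> 50%")
board.activate("Test small-cap exclusion")
# board.activate("Test revision momentum weight 30% -> 50%")  # would raise
```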
The active item has a tight contract:
| Field | Example |
|---|---|
| Behavior | Determine whether small-cap exclusion improves post-2023 performance |
| Allowed changes | Market-cap floor only |
| Excluded changes | Factor weights, rebalance cadence, memo language, risk constraints |
| Evidence | Before/after metrics, turnover, exposure changes, affected names |
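The contract can also be machine-checked rather than left as documentation. A sketch, assuming the harness can report which surfaces a run touched; all field and surface names here are illustrative:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ActiveItemContract:
    behavior: str
    allowed_changes: frozenset[str]   # surfaces the run may touch
    excluded_changes: frozenset[str]  # surfaces that must stay fixed
    evidence: tuple[str, ...]         # artifacts required to verify

    def check_run(self, changed_surfaces: set[str]) -> None:
        """Reject the run if it touched anything outside the allowed set."""
        violations = changed_surfaces - self.allowed_changes
        if violations:
            raise ValueError(f"run changed excluded surfaces: {sorted(violations)}")


contract = ActiveItemContract(
    behavior="Determine whether small-cap exclusion improves post-2023 performance",
    allowed_changes=frozenset({"universe.market_cap_floor"}),
    excluded_changes=frozenset(
        {"factor_weights", "rebalance_cadence", "memo", "risk_constraints"}
    ),
    evidence=("before_after_metrics", "turnover", "exposure_changes", "affected_names"),
)
contract.check_run({"universe.market_cap_floor"})  # passes; anything broader raises
```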
Harnessed run
The agent returns:
- Active hypothesis: small-cap exclusion.
- Only change: market-cap floor from $5B to $10B.
- Result: post-2023 Sharpe improves from 0.42 to 0.71.
- Cost impact: turnover unchanged.
- Exposure change: short book loses 6 high-borrow names.
- Caveat: improvement may reflect borrow constraint, not size alone.
- Next item: evaluate explicit borrow-cost constraint.
This output is slower than broad optimization. It is also useful because the team can attribute the result.
The harness should preserve the baseline for every active item. In quant research, “improved” only has meaning relative to a named baseline. The active item should identify the previous strategy version, dataset snapshot, factor versions, and cost model. If those change during the run, the run is no longer testing one behavior.
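One way to make that concrete is to fingerprint the baseline before the run and re-check it after. A minimal sketch with hypothetical identifiers for the strategy version, dataset snapshot, factor versions, and cost model:

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Baseline:
    strategy_version: str             # e.g. tag of the previous strategy
    dataset_snapshot: str             # immutable snapshot id, not "latest"
    factor_versions: tuple[str, ...]
    cost_model: str

    def fingerprint(self) -> str:
        # A stable hash so any drift in the baseline is detectable.
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()[:12]


before = Baseline(
    "screen-v7", "prices-2024-06-30", ("rev_mom_v2", "quality_v1"), "spread+borrow_v3"
)
after_run = Baseline(
    "screen-v7", "prices-2024-06-30", ("rev_mom_v2", "quality_v1"), "spread+borrow_v3"
)
assert before.fingerprint() == after_run.fingerprint(), "baseline drifted during the run"
```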
This does not prevent exploration. The agent can still discover that borrow cost, liquidity, and rebalance cadence deserve attention. It simply records those as queued or blocked hypotheses instead of folding them into the current result.
Product-agent example
A quant active-work board should track research hypotheses, not files:
| Bad item | Better item |
|---|---|
| Update strategy | Test small-cap exclusion |
| Improve factors | Compare revision weight change only |
| Fix risk | Evaluate beta-neutral constraint breach |
| Clean memo | Draft caveats after evidence passes |
Behavior-sized work is easier to verify and easier to reject.
A good active-work board also protects the memo. Narrative should follow evidence. If the agent changes the story while the research is still moving, it may hide unresolved uncertainty. Treat memo framing as its own work item unless wording is required to describe the active evidence.
Reviewers should be able to ask one simple question: “What changed in this run?” If the answer contains more than one research behavior, the harness should reject the run as too broad or split it into separate artifacts.
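That reviewer question can itself be automated as a run gate. A sketch, assuming the harness records the set of research behaviors a run touched:

```python
def review_gate(changed_behaviors: set[str]) -> str:
    """Answer "what changed in this run?" and reject runs that are too broad."""
    if not changed_behaviors:
        return "nothing changed; run is a no-op"
    if len(changed_behaviors) > 1:
        raise ValueError(
            f"run too broad ({len(changed_behaviors)} behaviors); "
            f"split into separate artifacts: {sorted(changed_behaviors)}"
        )
    return f"changed exactly one behavior: {next(iter(changed_behaviors))}"
```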
Common mistakes
The first mistake is optimizing multiple knobs in one run. That produces performance, not evidence.
The second mistake is treating memo edits as harmless. Narrative changes can hide uncertainty and should wait until evidence is stable.
The third mistake is allowing blocked data to become guessed assumptions. If borrow data is missing, the item is blocked.
The fourth mistake is marking a strategy "better" without attribution. Better compared to which exact baseline?
Practical exercise
Take one strategy-improvement request and split it into five hypotheses. For each, write the one allowed change and the evidence required.
Then pick exactly one active hypothesis. Anything else the agent discovers should become queued, blocked, or deferred.
Key takeaways
- Active-work limits protect research attribution.
- Strategy changes should move one behavior at a time.
- Better backtest results are not enough if the cause is unclear.
- Blocked data should not become guessed assumptions.
- Deferred findings are useful, but they are not current scope.
Further reading / source notes
- Anthropic, “Effective harnesses for long-running agents” for tracking active work and progress in agent tasks.
- OpenAI, “Harness engineering: leveraging Codex in an agent-first world” for specifying intent and constraining agent execution.