
Context pruning is a bet on the future

When an agent's window fills, the obvious move is to drop the oldest, biggest tool results. That's a cache-eviction bet you can't make optimally without seeing the future, and the right one depends entirely on your workload.


When an agent’s context window fills up, the obvious move is to prune: drop the oldest, biggest tool results and keep going. It reads as good hygiene. It is usually a bet, and on the wrong workload a losing one.

The reframe that makes the trade-offs visible: the window is a cache, pruning is eviction, and eviction is one of the few problems in computing where the optimal policy provably requires seeing the future. Once you hold it that way, “just trim the old stuff” stops looking like obvious hygiene and starts looking like what it is - a wager about what you’ll need again. This is the management half of the problem that context engineering only sets up; composing the window is one job, keeping it useful as it grows is another.

The cost axis is flatter than you think

The instinct to prune is mostly about money, and money is the wrong axis. Anthropic’s prompt cache charges a cache read at roughly 10% of the input price. A warm window is already a tenth of full freight, so carrying it is cheap. Aggressively shrinking a warm window trades that cheap read for a full-price rewrite of everything after the cut. Pruning to save money can cost money. Inference is a real marginal cost on every turn, but the cache is what flattens it, and pruning fights the cache.
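A back-of-envelope sketch makes the trade concrete. It assumes the 10% cache-read rate quoted above and treats everything after the cut as re-billed at the plain input price; real pricing may add a premium for cache writes, and the model ignores the new tokens each turn appends. The function names and token counts are hypothetical.

```python
CACHE_READ_RATE = 0.10  # cached token cost as a fraction of input price (from the text)
FULL_RATE = 1.00        # assumption: tokens after the cut are billed at the plain input price


def turn_cost(cached_tokens: int, uncached_tokens: int) -> float:
    """Cost of one turn, in units of one input token at full price."""
    return cached_tokens * CACHE_READ_RATE + uncached_tokens * FULL_RATE


def breakeven_turns(window: int, cut_at: int, pruned: int) -> float:
    """Turns of cheaper reads needed to repay the one-off rewrite after a prune.

    window: tokens in the warm window, cut_at: token offset where the prune
    starts, pruned: number of tokens removed.
    """
    suffix = window - cut_at - pruned           # surviving tokens after the cut
    keep_cost = turn_cost(window, 0)            # leave it alone: everything is a cache read
    prune_cost = turn_cost(cut_at, suffix)      # prefix still cached, suffix re-billed
    extra = prune_cost - keep_cost              # penalty paid on the turn that prunes
    saving_per_turn = pruned * CACHE_READ_RATE  # later turns read a smaller window
    return extra / saving_per_turn


print(breakeven_turns(window=100_000, cut_at=20_000, pruned=10_000))
```

Under these assumptions, dropping 10,000 tokens from a point 20,000 tokens into a 100,000-token window takes roughly 62 further turns to pay for itself on cost alone.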

If cost is not the axis, two things are. The first is attention quality: a window stuffed with stale output buries the signal the model needs, and a model reasoning over its own noise gets worse in ways no invoice shows you. The second is the ceiling: every model has a hard token limit, and a long session walks toward it whether you like the economics or not. Those are the honest reasons to prune. Naming them changes what a good policy looks like, because a policy tuned for cost and a policy tuned for attention are not the same policy.

Every edit to the prefix has to earn its rewrite

The cache has one rule: the prefix you send must match the prefix already in the cache byte-for-byte, up to the point you mark. Appending a new message keeps the prefix intact and stays cheap. Editing history does not. Change a byte and everything after it is invalidated and re-billed at full price.

A prune is an edit. It deletes bytes in the middle of the cached region, which forces a rewrite of the entire suffix that follows. Sometimes that is worth paying. The point is that it is never free, so any pruning policy is spending real money each time it fires, and a policy that fires on a bad guess is spending it for nothing.
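The prefix rule can be illustrated with a small sketch. It compares message lists rather than raw bytes, which is a simplification, and the `tokens` callback is a hypothetical stand-in for a real tokenizer.

```python
from typing import Callable


def shared_prefix_len(cached: list[str], sent: list[str]) -> int:
    """Number of leading messages that are identical in both lists."""
    n = 0
    for old, new in zip(cached, sent):
        if old != new:
            break
        n += 1
    return n


def rewrite_bill(cached: list[str], sent: list[str],
                 tokens: Callable[[str], int] = len) -> tuple[int, int]:
    """Return (tokens served from cache, tokens re-billed) for a request."""
    keep = shared_prefix_len(cached, sent)
    reused = sum(tokens(m) for m in sent[:keep])
    rebilled = sum(tokens(m) for m in sent[keep:])
    return reused, rebilled


cached = ["sys", "a", "b", "c"]
appended = cached + ["d"]   # append: the whole old prefix still matches
pruned = ["sys", "b", "c"]  # delete "a": everything after it is rewritten

print(rewrite_bill(cached, appended))  # (6, 1)
print(rewrite_bill(cached, pruned))    # (3, 2)
```

The append re-bills only the new message. The prune removes one message and still re-bills every message that followed it.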

Why blind pruning thrashes

The standard policy is age plus size: evict any result older than N turns and bigger than M tokens. That rule is a bet that old and big implies never needed again. When the bet is wrong - a large result referenced again just outside the retention window - you get a loop. You evict it, the model asks for it, you re-inflate it, it ages back out, you evict it again. Evict, recall, re-inflate, evict. Each cycle pays a rewrite and leaves dead stubs behind, for negative value.
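A toy simulation shows the loop. It tracks a single large result under an age-only rule (size is assumed to be over the threshold already); the function name and the turn numbers are illustrative assumptions, not measurements.

```python
def simulate_blind_eviction(access_turns, total_turns: int, max_age: int):
    """Count evictions and recalls of one large result under an age rule.

    The result is loaded at turn 0, evicted once it is older than max_age
    turns, and re-inflated (with its age reset) whenever it is needed
    while evicted.
    """
    accesses = set(access_turns)
    resident, loaded_at = True, 0
    evictions = recalls = 0
    for turn in range(1, total_turns + 1):
        if turn in accesses and not resident:
            resident, loaded_at = True, turn  # recall: re-inflate the result
            recalls += 1
        if resident and turn - loaded_at > max_age:
            resident = False                  # evict: an edit to the cached prefix
            evictions += 1
    return evictions, recalls


# Needed every 7 turns, retained for 5: evicted just before every use.
print(simulate_blind_eviction(range(7, 29, 7), total_turns=28, max_age=5))  # (4, 4)
# Never needed again: the bet pays off with a single eviction.
print(simulate_blind_eviction([], total_turns=28, max_age=5))               # (1, 0)
```

The same rule produces one clean eviction on the second trace and four evict-recall cycles on the first. Nothing in the rule itself can tell the two workloads apart.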

This is not a tuning problem you can dial away. It is Belady’s result from the page-replacement literature: the provably optimal eviction policy requires knowing the future sequence of accesses, which you do not have. Every real policy approximates the future from the past, and age-and-size is a crude proxy for the future. On the wrong workload that proxy is worse than doing nothing.
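The gap between a past-based policy and the clairvoyant one is easy to reproduce on a toy trace. The sketch below uses the textbook setting of uniform-size items and a fixed capacity, which is simpler than a real context window, and compares least-recently-used eviction with Belady's rule of evicting the item whose next use is furthest away.

```python
from collections import OrderedDict


def lru_misses(trace: list[str], capacity: int) -> int:
    """Misses under least-recently-used eviction (uses only the past)."""
    cache: OrderedDict[str, bool] = OrderedDict()
    misses = 0
    for key in trace:
        if key in cache:
            cache.move_to_end(key)         # mark as most recently used
            continue
        misses += 1
        if len(cache) >= capacity:
            cache.popitem(last=False)      # drop the least recently used
        cache[key] = True
    return misses


def belady_misses(trace: list[str], capacity: int) -> int:
    """Misses under Belady's rule (needs the full future trace)."""
    cache: set[str] = set()
    misses = 0
    for i, key in enumerate(trace):
        if key in cache:
            continue
        misses += 1
        if len(cache) >= capacity:
            def next_use(k: str) -> float:
                try:
                    return trace.index(k, i + 1)
                except ValueError:
                    return float("inf")    # never used again
            cache.remove(max(cache, key=next_use))
        cache.add(key)
    return misses


trace = list("abcabcabc")
print(lru_misses(trace, capacity=2))     # 9
print(belady_misses(trace, capacity=2))  # 6
```

On this cyclic trace the past-based policy misses on every access, while the clairvoyant one misses six times out of nine. The second function is only runnable because the whole trace is handed to it up front, which is exactly what a live agent does not have.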

So a pruning policy has two honest forms. Drop only what is provably dead, or learn from what gets recalled. Everything in between is a guess.

The two safe moves, and the one to refuse

The first honest form is provably-dead hygiene. Some results are dead with certainty rather than by a guess about age: a file read whose contents were overwritten by a later write in the same session, or the earlier copy of a query that was later run again with identical parameters. Those bytes are stale or redundant and can never be useful again, so dropping them cannot thrash - there is nothing to recall, because the future access does not exist by construction. This is the only kind of prune that is unconditionally safe, and it is the one to reach for first.
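A minimal sketch of such a detector follows. The `ToolCall` record, the tool names `read_file` and `write_file`, and the convention that the first argument is a path are all assumptions for illustration; it also assumes one tool call per turn, so a turn number identifies a result.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    turn: int
    tool: str    # hypothetical tool names: "read_file", "write_file", "query"
    args: tuple  # hashable arguments; for file tools, args[0] is the path


def provably_dead(calls: list[ToolCall]) -> set[int]:
    """Turns whose results are superseded with certainty."""
    dead: set[int] = set()
    live_reads: dict[str, list[int]] = {}  # path -> reads not yet overwritten
    latest: dict[tuple, int] = {}          # (tool, args) -> latest identical call
    for call in sorted(calls, key=lambda c: c.turn):
        if call.tool == "write_file":
            # Every earlier read of this path now shows stale contents.
            dead.update(live_reads.pop(call.args[0], []))
            continue
        key = (call.tool, call.args)
        if key in latest:
            dead.add(latest[key])          # an identical later call supersedes it
        latest[key] = call.turn
        if call.tool == "read_file":
            live_reads.setdefault(call.args[0], []).append(call.turn)
    return dead


calls = [
    ToolCall(1, "read_file", ("a.py",)),
    ToolCall(2, "query", ("SELECT 1",)),
    ToolCall(3, "write_file", ("a.py", "new contents")),
    ToolCall(4, "read_file", ("a.py",)),
    ToolCall(5, "query", ("SELECT 1",)),
    ToolCall(6, "read_file", ("b.py",)),
]
print(sorted(provably_dead(calls)))  # [1, 2]
```

Only the overwritten read and the duplicated query are flagged. The fresh read of `a.py`, the repeat query, and the untouched `b.py` all stay in the window.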

The second is offload, don’t delete. Instead of removing a large ageing result, replace it in the live window with a one-line stub that points to the verbatim original in your own storage. If the model needs it, it pulls the exact bytes back from your store rather than re-running the tool against the external source. And the rule that makes this correct rather than fragile: once a result has been recalled, exempt it from eviction. A recall is proof the result is in the working set, so the policy learns from its own mistakes instead of re-evicting the same bytes every few turns. That single exemption is the whole difference between an offload tier that helps and one that thrashes.
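One way the offload tier and its recall exemption might look, as a sketch: the class name, the stub wording, and the in-memory dictionary standing in for real storage are all hypothetical, and size is measured in characters as a rough proxy for tokens.

```python
class OffloadStore:
    """Swap large, old results for stubs; never re-evict a recalled one."""

    def __init__(self, max_age: int, min_size: int) -> None:
        self.max_age = max_age
        self.min_size = min_size
        self.blobs: dict[str, str] = {}  # result id -> verbatim original
        self.pinned: set[str] = set()    # ids recalled at least once

    def render(self, result_id: str, text: str, age: int) -> str:
        """Return what should sit in the live window for this result."""
        if result_id in self.pinned:
            return text                  # recalled before: exempt from eviction
        if age > self.max_age and len(text) > self.min_size:
            self.blobs[result_id] = text # keep the exact bytes in our own store
            return f"[result {result_id} offloaded; recall it to get the full text]"
        return text

    def recall(self, result_id: str) -> str:
        """Return the verbatim original and pin it in the working set."""
        self.pinned.add(result_id)
        return self.blobs[result_id]


store = OffloadStore(max_age=5, min_size=10)
big = "x" * 100
print(store.render("r1", big, age=6))          # one-line stub
print(store.recall("r1") == big)               # True: exact bytes come back
print(store.render("r1", big, age=20) == big)  # True: pinned, however old
```

The `pinned` set is the whole exemption: without it, the third call would hand back a stub again and the cycle would repeat.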

Fit for purpose: the workload decides

None of this has a universal answer, which is the actual point. The right policy is a function of your workload and your cost structure, and those vary widely; the policy has to follow them rather than a default.

A coding agent lives on read-edit-read churn. It reads a file, edits it, reads it again, and that first read is now genuinely dead. This workload manufactures provably-dead results by the dozen, so hygiene alone reclaims a lot of window at zero risk, and an offload tier earns its keep on top.

A data-analysis agent looks nothing like that. A schema it pulled an hour ago, a statistical result from turn three - those stay live, and they get referenced again when the agent writes its conclusion. The same age-and-size rule that is roughly safe on the coding agent misfires here, because on this workload “old” simply does not predict “dead.” The results age without dying, and a policy that confuses the two throws away the working set.

And when the problem is just that the session got long, pruning is often the wrong instrument entirely. Compaction - summarising the transcript into working memory and dropping the raw turns - preserves the hard-won conclusions better than any rule guessing which raw bytes to discard. It is lossy on purpose, and the skill is choosing what to lose; a summary that keeps the chatter and drops the conclusions is worse than no summary at all. Different problem, different tool.

Measure before you wire

The order that keeps you honest: instrument first, read the distributions, then decide which policy, if any, earns its complexity. How long do sessions actually get? How often does a pruned result get recalled? How much of your window is provably dead on a real trace rather than a hypothetical one? Those numbers tell you whether you need hygiene, offload, compaction, or nothing at all, and they routinely say “less than you assumed.”
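The three questions map onto three numbers. A minimal sketch, assuming each session has already been reduced to a few counters by your own logging; the `SessionStats` fields and the sample values are hypothetical.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class SessionStats:
    turns: int          # how long the session ran
    window_tokens: int  # tokens in the window at the end
    dead_tokens: int    # tokens in provably-dead results
    pruned: int         # results a policy evicted (or would have)
    recalled: int       # evicted results the model asked for again


def summarize(sessions: list[SessionStats]) -> dict[str, float]:
    """Aggregate the numbers that decide which policy, if any, is worth it."""
    pruned = sum(s.pruned for s in sessions)
    window = sum(s.window_tokens for s in sessions)
    return {
        "median_turns": median(s.turns for s in sessions),
        "max_turns": max(s.turns for s in sessions),
        "recall_rate": sum(s.recalled for s in sessions) / pruned if pruned else 0.0,
        "dead_fraction": sum(s.dead_tokens for s in sessions) / window if window else 0.0,
    }


sessions = [
    SessionStats(turns=10, window_tokens=1000, dead_tokens=100, pruned=4, recalled=1),
    SessionStats(turns=30, window_tokens=3000, dead_tokens=900, pruned=6, recalled=4),
    SessionStats(turns=20, window_tokens=2000, dead_tokens=0, pruned=0, recalled=0),
]
print(summarize(sessions))
```

A high `recall_rate` says the eviction rule is guessing badly, a high `dead_fraction` says hygiene alone would reclaim real space, and short sessions say none of it may be needed.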

The instinct is to reach for the cleverest policy; the discipline is to use the one your workload justifies. That’s usually the simplest, and sometimes it’s none. Pruning is a bet on data you don’t have yet, so make the smallest bet that works and check the numbers.