Summarizer

Prompt Cache Economics

Debate about whether context compression breaks prompt caching, with concerns that verbose but cached context might be cheaper than compressed context that invalidates cache

← Back to MCP server that reduces Claude Code context consumption by 98%

The debate over prompt cache economics pits the cost-effectiveness of "fat" context, which remains cheap when cached as a stable prefix, against the performance benefits of aggressive context compression. Critics warn that "snipping" conversation history to save tokens can be penny-wise and pound-foolish, as it often invalidates the cache and may inadvertently degrade the model's reasoning quality. To resolve this, many proponents advocate for a retrieval-based architecture that filters and summarizes data locally before it ever enters the conversation window. This approach maintains a stable, deterministic prompt cache while avoiding the "context bloat" that otherwise slows down generation and dilutes the model's focus.

12 comments tagged with this topic

View on HN · Topics
The FTS5 index approach here is right, but I'd push further: pure BM25 underperforms on tool outputs because they're a mix of structured data (JSON, tables, config) and natural language (comments, error messages, docstrings). Keyword matching falls apart on the structured half.

I built a hybrid retriever for a similar problem, compressing a 15,800-file Obsidian vault into a searchable index for Claude Code. The stack is Model2Vec (potion-base-8M, 256-dimensional embeddings) + sqlite-vec for vector search + FTS5 for BM25, combined via Reciprocal Rank Fusion. The database is 49,746 chunks in 83MB. RRF is the important piece: it merges ranked lists from both retrieval methods without needing score calibration, so you get BM25's exact-match precision on identifiers and function names plus vector search's semantic matching on descriptions and error context.

The incremental indexing matters too. If you're indexing tool outputs per-session, the corpus grows fast. My indexer has a --incremental flag that hashes content and only re-embeds changed chunks. A full reindex of 15,800 files takes ~4 minutes; an incremental run on a typical day's changes is under 10 seconds.

On the caching question raised upthread: this approach actually helps prompt caching, because the compressed output is deterministic for the same query. The raw tool output would differ every time (timestamps, ordering), but the retrieved summary is stable if the underlying data hasn't changed.

One thing I'd add to Context Mode's architecture: the same retriever could run as a PostToolUse hook, compressing outputs before they enter the conversation. That way it's transparent to the agent: it never sees the raw dump, just the relevant subset.
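RRF itself is only a few lines. A minimal sketch of the fusion step (the chunk IDs are illustrative; k=60 is the constant from the original RRF paper, commonly used as a default):

```python
def rrf_merge(ranked_lists, k=60):
    """Merge best-first ranked lists via Reciprocal Rank Fusion.

    A document's fused score is the sum of 1 / (k + rank) over every
    list it appears in, so items ranking well in either retriever float
    to the top without any score calibration between BM25 and cosine
    similarity.
    """
    scores = {}
    for results in ranked_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25 = ["chunk_a", "chunk_b", "chunk_c"]  # exact-match hits on identifiers
vec = ["chunk_c", "chunk_a", "chunk_d"]   # semantic hits on descriptions
print(rrf_merge([bm25, vec]))  # chunk_a ranks high in both lists, so it wins
```

Because RRF only looks at ranks, not raw scores, the two retrievers can use entirely different scoring scales and the merge still behaves sensibly.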
View on HN · Topics
Does your technique break the cache? edit: Thanks.
View on HN · Topics
Nope. The raw data never enters the conversation history in the first place, so there's nothing to invalidate. Tool output runs in a sandbox, a short summary comes back, and the full data sits in a local FTS5 index. The conversation cache stays intact because the context itself doesn't change after the fact.
View on HN · Topics
Oh, that's quite a nice idea: agentic context management (riffing on agentic memory management). There are some challenges around the LLM having enough output tokens to easily specify what it wants its next input tokens to be, but "snips" can be expressed concisely (i.e. "the next input should include everything sent previously except the chunk that starts XXX and ends YYY"). The upside is tighter context; the downside is that it busts the prompt cache (perhaps the optimal trade-off is to batch the snips).
View on HN · Topics
Good point on prompt cache invalidation. Context-mode sidesteps this by never letting the bloat in to begin with, rather than snipping it out after. Tool output runs in a sandbox, a short summary enters context, and the raw data sits in a local search index. No cache busting because the big payload never hits the conversation history in the first place.
View on HN · Topics
Is it because of caching? If the context changes arbitrarily every turn then you would have to throw away the cache.
View on HN · Topics
So use a block based cache and tune the block size to maximize the hit rate? This isn’t rocket science.
View on HN · Topics
This seems misguided: you have to cache a prefix, because attention makes each token's cached state depend on every token before it. Edit anything mid-context and everything after the edit must be recomputed.
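This is also why block-based KV caches (the vLLM-style design the parent suggests) don't escape the prefix constraint: blocks are keyed by a hash chained over the entire prefix, so a block is only reusable when everything before it matches. A toy sketch (whitespace-split words stand in for real tokenizer output; block size is illustrative):

```python
import hashlib

BLOCK = 4  # tokens per cache block; real systems use larger blocks

def block_hashes(tokens):
    """Key each block by a hash chained over all tokens up to and
    including it, so block i is only reusable when the whole prefix
    [0, (i+1)*BLOCK) matches -- the constraint attention imposes.
    Trailing partial blocks are not cached."""
    keys, h = [], hashlib.sha256()
    for i in range(0, len(tokens) - len(tokens) % BLOCK, BLOCK):
        h.update(" ".join(tokens[i:i + BLOCK]).encode())
        keys.append(h.copy().hexdigest())
    return keys

cached = block_hashes("sys prompt tool schema old turn one two".split())
edited = block_hashes("sys prompt tool schema NEW turn one two".split())
assert cached[0] == edited[0]  # untouched prefix block: reusable
assert cached[1] != edited[1]  # one edited token kills that block onward
```

Tuning block size trades granularity against bookkeeping overhead, but it can never make a mid-context edit cheap: every block downstream of the edit still misses.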
View on HN · Topics
The compression numbers look great, but I keep wondering: does the model actually produce equivalent output with compressed context vs. full context? Extending sessions from 30 minutes to 3 hours only matters if reasoning quality holds up in hour 2.

esafak's cache economics point is underrated. With prompt caching, verbose context that gets reused is basically free. If compression breaks cache continuity, you might save tokens while spending more money.

The deeper issue is that most MCP tools do SELECT * when they should return summaries with drill-down. That's a protocol design problem, not a compression problem.
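The summaries-with-drill-down shape is easy to illustrate. A hypothetical tool response (the field names and 500-row dataset are invented, not any real MCP server's schema): instead of dumping every row, return aggregates, a small sample, and a hint for how to page deeper.

```python
import json

# Stand-in dataset: 500 rows, roughly one in seven in an error state.
ROWS = [{"id": i, "status": "error" if i % 7 == 0 else "ok"} for i in range(500)]

def query_tool(filter_status=None, limit=5):
    """Summary-first tool response: counts plus a small sample and a
    drill-down hint, instead of SELECT * dumped into context."""
    rows = [r for r in ROWS if filter_status in (None, r["status"])]
    return {
        "total": len(rows),
        "by_status": {s: sum(r["status"] == s for r in rows) for s in ("ok", "error")},
        "sample": rows[:limit],
        "next": "call again with filter_status and a larger limit to expand",
    }

resp = query_tool(filter_status="error")
print(json.dumps({"total": resp["total"], "sample_size": len(resp["sample"])}))
```

The agent gets enough signal to decide whether to drill down, and the context cost per turn stays proportional to what it actually asked about, not to the size of the underlying data.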
View on HN · Topics
> With prompt caching, verbose context that gets reused is basically free.

But it's not. It might be discounted cost-wise, but it will still degrade attention and make generation slower and more computationally expensive, even if you have a long prefix you can reuse during prefill.
View on HN · Topics
If this breaks the cache, it is penny-wise, pound-foolish; cached full queries carry more information and are cheap. The article does not mention caching; does anyone know? I just enable fat MCP servers as needed, and try to use skills instead.
View on HN · Topics
It doesn't break the cache. The raw data never enters the conversation history, so there's nothing to invalidate. A short summary goes into context instead of the full payload, and the model can search the full data from a local FTS5 index if it needs specifics later. Cache stays intact because you're just appending smaller messages to the conversation.