Summarizer

LLM Input

llm/9b2efe03-4d9e-4db2-a79a-13cee83b17d6/topic-2-6e27f20c-d71e-4308-aca4-29aa0b4c32b6-input.json

prompt

The following is content for you to summarize. Do not respond to the comments—summarize them.

<topic>
Prompt Cache Economics # Debate about whether context compression breaks prompt caching, with concerns that verbose but cached context might be cheaper than compressed context that invalidates cache
</topic>

<comments_about_topic>
1. The FTS5 index approach here is right, but I'd push further: pure BM25 underperforms on tool outputs because they're a mix of structured data (JSON, tables, config) and natural language (comments, error messages, docstrings). Keyword matching falls apart on the structured half.

I built a hybrid retriever for a similar problem, compressing a 15,800-file Obsidian vault into a searchable index for Claude Code. Stack is Model2Vec (potion-base-8M, 256-dimensional embeddings) + sqlite-vec for vector search + FTS5 for BM25, combined via Reciprocal Rank Fusion. The database is 49,746 chunks in 83MB. RRF is the important piece: it merges ranked lists from both retrieval methods without needing score calibration, so you get BM25's exact-match precision on identifiers and function names plus vector search's semantic matching on descriptions and error context.

The incremental indexing matters too. If you're indexing tool outputs per-session, the corpus grows fast. My indexer has a --incremental flag that hashes content and only re-embeds changed chunks. Full reindex of 15,800 files takes ~4 minutes; incremental on a typical day's changes is under 10 seconds.

On the caching question raised upthread: this approach actually helps prompt caching because the compressed output is deterministic for the same query. The raw tool output would be different every time (timestamps, ordering), but the retrieved summary is stable if the underlying data hasn't changed.

One thing I'd add to Context Mode's architecture: the same retriever could run as a PostToolUse hook, compressing outputs before they enter the conversation. That way it's transparent to the agent, it never sees the raw dump, just the relevant subset.

2. Does your technique break the cache? edit: Thanks.

3. Nope. The raw data never enters the conversation history in the first place, so there's nothing to invalidate. Tool output runs in a sandbox, a short summary comes back, and the full data sits in a local FTS5 index. The conversation cache stays intact because the context itself doesn't change after the fact.

4. Oh that's quite a nice idea - agentic context management (riffing on agentic memory management).

There's some challenges around the LLM having enough output tokens to easily specify what it wants its next input tokens to be, but "snips" should be able to be expressed concisely (i.e. the next input should include everything sent previously except the chunk that starts XXX and ends YYY). The upside is tighter context, the downside is it'll bust the prompt cache (perhaps the optimal trade-off is to batch the snips).

5. Good point on prompt cache invalidation. Context-mode sidesteps this by never letting the bloat in to begin with, rather than snipping it out after. Tool output runs in a sandbox, a short summary enters context, and the raw data sits in a local search index. No cache busting because the big payload never hits the conversation history in the first place.

6. Is it because of caching? If the context changes arbitrarily every turn then you would have to throw away the cache.

7. So use a block based cache and tune the block size to maximize the hit rate? This isn’t rocket science.

8. This seems misguided, you have to cache a prefix due to attention.

9. The compression numbers look great but I keep wondering: does the model actually produce equivalent output with compressed context vs full context? Extending sessions from 30min to 3hrs only matters if reasoning quality holds up in hour 2.

esafak's cache economics point is underrated. With prompt caching, verbose context that gets reused is basically free. If compression breaks cache continuity you might save tokens while spending more money.

The deeper issue is that most MCP tools do SELECT * when they should return summaries with drill-down. That's a protocol design problem, not a compression problem.

10. > With prompt caching, verbose context that gets reused is basically free.

But it's not. It might be discounted cost-wise, however it will still degrade attention and make generation slower/more computationally expensive even if you have a long prefix you can reuse during prefill.

11. If this breaks the cache it is penny wise, pound foolish; cached full queries have more information and are cheap. The article does not mention caching; does anyone know?

I just enable fat MCP servers as needed, and try to use skills instead.

12. It doesn't break the cache. The raw data never enters the conversation history, so there's nothing to invalidate. A short summary goes into context instead of the full payload, and the model can search the full data from a local FTS5 index if it needs specifics later. Cache stays intact because you're just appending smaller messages to the conversation.
</comments_about_topic>

Write a concise, engaging paragraph (3-5 sentences) summarizing the key points and perspectives in these comments about the topic. Focus on the most interesting viewpoints. Do not use bullet points—write flowing prose.

topic

Prompt Cache Economics # Debate about whether context compression breaks prompt caching, with concerns that verbose but cached context might be cheaper than compressed context that invalidates cache

commentCount

12

← Back to job