Discussion of combining BM25 with vector search using Model2Vec embeddings, sqlite-vec, and Reciprocal Rank Fusion for better handling of mixed structured and natural language data in tool outputs
To address the challenge of massive tool outputs overwhelming LLM context windows, developers are advocating a "pre-compaction" strategy that replaces raw data dumps with searchable local indexes. A hybrid retrieval stack, fusing BM25's precision on structured identifiers with the semantic matching of Model2Vec vector embeddings via Reciprocal Rank Fusion, lets tools like "Context Mode" keep conversation histories lean and prompt caches intact. Rather than simply truncating output, this architecture serves the model a concise summary while keeping the full original data in a searchable sandbox database (e.g., sqlite-vec) for on-demand queries. The result is drastically lower token consumption and less context bloat: critical technical details remain accessible without burying the agent in noise.
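The fusion step itself is simple. A minimal sketch of Reciprocal Rank Fusion in Python (the function name, the example document IDs, and the conventional constant `k = 60` are illustrative assumptions, not taken from the tool's actual implementation):

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: merge several ranked lists of doc IDs.

    Each document scores sum(1 / (k + rank)) across the lists it appears
    in, so items ranked highly by multiple retrievers rise to the top.
    rankings: list of ranked lists (best first).
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical outputs from a BM25 index and a vector index over the same corpus:
bm25_hits = ["err_log_42", "api.py", "README"]
vector_hits = ["api.py", "timeout docs", "err_log_42"]
print(rrf_fuse([bm25_hits, vector_hits]))
# → ['api.py', 'err_log_42', 'timeout docs', 'README']
```

Because RRF works on ranks rather than raw scores, it needs no calibration between BM25's term-frequency scores and cosine similarities, which is what makes it a convenient glue layer for mixed structured and natural language data.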