Summarizer

TurboQuant and Optimization

Google's KV cache compression technique, 6x memory reduction claims, skepticism about marketing numbers, discussion of alternative quantization schemes like SpectralQuant


Google’s TurboQuant has drawn attention for its claimed 6x memory reduction in KV caches, but commenters are skeptical of both its "state of the art" status and its marketing-driven benchmarks. They argue that the headline numbers come from comparisons against inefficient baselines (a BF16 KV cache rather than an existing 8-bit or 4-bit scheme), and that alternative quantization schemes such as SpectralQuant, as well as architectural changes, often deliver better results. One commenter reports that current implementations reach roughly 3.8x-4.9x compression at 80-100% of baseline speed, that is, no speedup and sometimes a regression, which underlines the gap between the claims and practical deployment. Even with these optimizations, commenters expect demand for tokens to outpace software efficiency gains, which may shift attention toward hardware options such as high-bandwidth flash storage.

8 comments tagged with this topic

View on HN · Topics
I'm a bit surprised the article makes no mention of Google's TurboQuant[0], introduced 26 days prior. Given that TurboQuant results in a 6x reduction in memory usage for KV caches and up to an 8x boost in speed, this optimization is already showing up in llama.cpp, enabling significantly bigger contexts without having to run a smaller model to fit it all in memory. Some people thought it might significantly improve the RAM situation, though I remain a bit skeptical: the demand is probably still larger than the reduction TurboQuant brings.

[0] https://news.ycombinator.com/item?id=47513475
View on HN · Topics
TurboQuant is known across the industry to not be state of the art. There are superior schemes for KV quant at every bitrate, e.g. SpectralQuant: https://github.com/Dynamis-Labs/spectralquant among many, many papers.

> Given that TurboQuant results in a 6x reduction in memory usage for KV caches

It all depends on the baseline. The "6x" is by stylistic comparison to a BF16 KV cache, not to a state-of-the-art 8-bit or 4-bit KV cache scheme.
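To make the baseline point concrete, here is a minimal arithmetic sketch. It assumes the claimed 6x is measured against a 16-bit cache, as the comment states, and it ignores per-block metadata such as scales. The function names are hypothetical and do not come from any TurboQuant implementation.

```python
def effective_bits(baseline_bits: float, compression_ratio: float) -> float:
    """Bits per cached element implied by a compression ratio over a baseline."""
    return baseline_bits / compression_ratio


def ratio_against(baseline_bits: float, compressed_bits: float) -> float:
    """Compression ratio of the same scheme measured against another baseline."""
    return baseline_bits / compressed_bits


# Claimed 6x reduction relative to a 16-bit (BF16) KV cache.
claimed_bits = effective_bits(16, 6.0)

# The same scheme compared against progressively stronger baselines.
ratios = {
    name: ratio_against(bits, claimed_bits)
    for name, bits in [("BF16", 16), ("8-bit", 8), ("4-bit", 4)]
}

for name, ratio in ratios.items():
    print(f"vs {name} KV cache: {ratio:.2f}x")
```

Under these assumptions the same scheme is a 3x reduction against an 8-bit cache and a 1.5x reduction against a 4-bit cache, which is why the choice of baseline matters so much for the headline number.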
View on HN · Topics
BTW, a number of corrections. The TurboQuant paper was submitted to arXiv back in April 2025: https://arxiv.org/abs/2504.19874

Current "TurboQuant" implementations are about 3.8X-4.9X on compression (with the higher end taking some significant hits to GSM8K performance) and run at about 80-100% of baseline speed (no improvement, and at the low end a regression): https://github.com/vllm-project/vllm/pull/38479

For those not paying attention, it's probably worth sending this and the ongoing discussion for vLLM (https://github.com/vllm-project/vllm/issues/38171) and llama.cpp through your summarizer of choice. TurboQuant is fine, but not a magic bullet. Personally, I've been experimenting with DMS, and I think it has a lot more promise and can be stacked with various quantization schemes.

The biggest savings in KV cache, though, come from improved model architecture. Gemma 4's SWA/global hybrid saves up to 10X KV cache, MLA/DSA (the latter of which helps solve global attention compute) does as well, and using linear/SSM layers saves even more.

None of these reduce memory demand, though (Jevons paradox, etc.). Looking at my coding tools, I'm using about 10-15B cached tokens/mo currently (it was 5-8B a couple of months ago). While I'm probably above average on the curve, I don't consider myself to be doing anything especially crazy, and this year, between mainstream developers and more and more agents, I don't think there's really any limit to the number of tokens that people will want to consume.
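As a rough illustration of why a sliding-window/global hybrid shrinks the KV cache, the sketch below compares a model where every layer caches the full context with one where most layers cache only a fixed window. All of the numbers (layer counts, window size, head dimensions, context length) are hypothetical and are not taken from Gemma 4 or any other real model.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_tokens: int, bytes_per_elem: float = 2.0) -> float:
    """Size of a KV cache: one K and one V tensor per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_elem


def hybrid_kv_cache_bytes(n_global_layers: int, n_local_layers: int, window: int,
                          n_kv_heads: int, head_dim: int, n_tokens: int,
                          bytes_per_elem: float = 2.0) -> float:
    """Global layers cache every token; local layers cache at most `window` tokens."""
    global_part = kv_cache_bytes(n_global_layers, n_kv_heads, head_dim,
                                 n_tokens, bytes_per_elem)
    local_part = kv_cache_bytes(n_local_layers, n_kv_heads, head_dim,
                                min(window, n_tokens), bytes_per_elem)
    return global_part + local_part


# Hypothetical configuration, chosen only to make the arithmetic visible.
CONTEXT = 131072           # tokens in the context
N_KV_HEADS, HEAD_DIM = 8, 128

full = kv_cache_bytes(48, N_KV_HEADS, HEAD_DIM, CONTEXT)
hybrid = hybrid_kv_cache_bytes(8, 40, 1024, N_KV_HEADS, HEAD_DIM, CONTEXT)
saving = full / hybrid

print(f"full attention: {full / 2**30:.2f} GiB")
print(f"hybrid:         {hybrid / 2**30:.2f} GiB")
print(f"saving:         {saving:.2f}x")
```

The saving grows with context length, because the local layers stop growing once the window is full. Quantization can be stacked on top by lowering `bytes_per_elem` (for example 0.5 for a 4-bit cache, ignoring metadata overhead).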
View on HN · Topics
> 6x reduction in memory usage for KV caches and up to 8x boost in speed

Mind that you're quoting marketing material that's largely based on unfair baseline testing (like comparing 4-bit vs. 32-bit to get "8x speed"): https://www.youtube.com/watch?v=haoAI2lIZ74
View on HN · Topics
That's not what consumes the most memory at scale. The KV caches are per-user.
View on HN · Topics
The large models are incredibly inefficient. We'll be squeezing them down for generations.
View on HN · Topics
How efficient is AI at reducing RAM consumption?
View on HN · Topics
I've read that the chip manufacturers are looking into high-bandwidth flash for on-package storage of AI models. That would solve some of the cost issue, since flash is significantly cheaper than DRAM.