Local LLM Requirements

Need for 64GB+ RAM to run local models, SSD-based inference as alternative, PCIe 5.0 throughput for slower but functional operation, democratization concerns

Running capable local LLMs remains a significant hardware challenge: the bottleneck has shifted from raw compute to memory bandwidth and capacity, with capacity needs that often exceed 64GB. Some enthusiasts rely on older, high-bandwidth GPUs, or stream weights from fast PCIe 5.0 SSDs as a slower alternative to RAM for non-chat tasks, while others are banking on aggressive model optimization to "squeeze" state-of-the-art intelligence into consumer-grade hardware. The community is caught between today's high hardware barriers, which favor centralized providers, and a hopeful future in which a glut of high-capacity RAM finally democratizes powerful local AI.

View on HN · Topics

> Both Vega (and Fiji before it) showed that excess memory BW alone is not sufficient to win.

That's correct if you're targeting gamers, but local AI inference changes this picture substantially.

View on HN · Topics

Modern GPUs like RTX 5080 are much faster for the applications that are limited by computational capabilities, mainly because they have more execution units, whose clock frequencies have also increased.

I suppose that most games are limited by computation, so they are indeed much faster on modern GPUs.

However, there are applications that are limited by memory throughput, not by computation, including AI inference and many scientific/technical computing applications.

For such applications, old GPUs with higher memory throughput are still faster.

This is why I am still using an old Radeon VII and a couple of other ancient AMD GPUs with high memory throughput.

Last year I bought an Intel GPU, which is still slower than my old GPUs, but it at least had very good performance per dollar, competitive with that of the old GPUs, because it was very cheap, while the current AMD and especially NVIDIA GPUs have poor performance per dollar.
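
The memory-throughput argument above can be made concrete with a back-of-envelope bound. The sketch below assumes that generating one token requires reading every model weight from GPU memory once, so single-stream decode speed cannot exceed bandwidth divided by weight size. The function name, the two "cards", and the 16 GB weight size are illustrative assumptions, not measurements of any GPU named in the thread.

```python
def decode_tokens_per_second(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on single-stream decode speed, assuming each generated
    token requires one full read of the model weights from memory."""
    if bandwidth_gb_s <= 0 or weights_gb <= 0:
        raise ValueError("bandwidth and weight size must be positive")
    return bandwidth_gb_s / weights_gb


if __name__ == "__main__":
    weights_gb = 16.0  # hypothetical quantized model size
    # Hypothetical cards: same model, different memory bandwidth.
    for label, bandwidth in [("card A (1000 GB/s)", 1000.0),
                             ("card B (500 GB/s)", 500.0)]:
        bound = decode_tokens_per_second(bandwidth, weights_gb)
        print(f"{label}: at most {bound:.1f} tokens/s")
```

Under this simplified model, extra execution units do not raise the bound at all; only more bandwidth or a smaller model does.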

View on HN · Topics

And it just so happens that many people will now have to use OpenAI’s products because they can’t get enough RAM to run a local LLM. What a coincidence.

View on HN · Topics

> If AI makes software easier to create, that will drive the price down.

Supposedly AI drives down the cost of producing software, not the "price".

> How are software companies going to make enough revenue to pay for AI, when the amount of money being spent on AI is already multiples of the current total global expenditure on software?

Currently, the cost of AI is between $20/month and around $200/month per developer.

I think the billions you're seeing in the news are investments in AI companies, which are burning through cash to build the compute infrastructure needed for both training and serving users.

> This demand for RAM is built on a foundation of sand, there will be a glut of capacity when it all shakes out.

Who knows? What I know is that I need >64GB of RAM to run local models, and that means most people will need to upgrade from their 8GB/16GB setup to do the same. Graphics cards follow mostly the same pattern.

View on HN · Topics

You need >64 GB of DRAM to run local models fast.

You can run huge local models slowly with the weights stored on SSDs.

Nowadays there are many computers that can have e.g. 2 PCIe 5.0 SSDs, which allow a reading throughput of 20 to 30 gigabytes per second, depending on the SSDs (or 1 PCIe 5.0 + 1 PCIe 4.0, for a throughput in the range 15-20 GB/s).

There are still a lot of improvements that can be made to inference back-ends like llama.cpp to reach the inference speed limit determined by the SSD throughput.

It seems that it is possible to reach inference speed in the range from a few seconds per token to a few tokens per second.

That may be too slow for a chat, but it should be good enough for an AI coding assistant, especially if many tasks are batched, so that they can progress simultaneously during a single read pass over the SSD data.
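
The SSD-bound estimate and the batching idea above can be sketched as simple arithmetic. This assumes inference is limited only by SSD read throughput, that each token step needs one read pass over a fixed number of gigabytes, and that every task in a batch reuses the same pass. The function names and the 200 GB and 10 GB read sizes are illustrative assumptions; only the 20 to 30 GB/s range comes from the comment.

```python
def seconds_per_pass(read_gb: float, ssd_gb_s: float) -> float:
    """Time for one read pass over the weights needed for a token step."""
    if read_gb <= 0 or ssd_gb_s <= 0:
        raise ValueError("read size and throughput must be positive")
    return read_gb / ssd_gb_s


def batched_tokens_per_second(read_gb: float, ssd_gb_s: float,
                              batch_size: int) -> float:
    """Aggregate tokens/s when batch_size tasks share a single read pass."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return batch_size / seconds_per_pass(read_gb, ssd_gb_s)


if __name__ == "__main__":
    # Hypothetical cases: 200 GB read per token (large dense model) and
    # 10 GB read per token (sparse model with few active weights).
    for read_gb in (200.0, 10.0):
        for ssd_gb_s in (20.0, 30.0):
            t = seconds_per_pass(read_gb, ssd_gb_s)
            agg = batched_tokens_per_second(read_gb, ssd_gb_s, batch_size=8)
            print(f"{read_gb:.0f} GB/pass at {ssd_gb_s:.0f} GB/s: "
                  f"{t:.2f} s/token single, {agg:.2f} tokens/s with 8 tasks")
```

Each individual task still waits one full pass per token; batching only raises the total work completed per pass.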

View on HN · Topics

You can do that, but you're going to have rather low throughput unless you have lots of PCIe lanes to attach storage to. That's going to require either a HEDT or some kind of compute cluster.

Batching inferences doesn't necessarily help that much since as models get sparser the individual inferences are going to share fewer experts. It does always help wrt. shared routing layers, of course.
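
The point about sparser models sharing fewer experts can be illustrated with a small probability sketch. It assumes, purely for illustration, that each token in a batch independently routes to `top_k` distinct experts chosen uniformly at random out of `num_experts` in a layer; real routers are not uniform, so this is a rough model rather than a description of any specific architecture. All names and numbers are hypothetical.

```python
def expected_unique_experts(num_experts: int, top_k: int,
                            batch_size: int) -> float:
    """Expected number of distinct experts touched in one layer by a batch,
    assuming uniform, independent top-k routing per token."""
    if not 0 < top_k <= num_experts:
        raise ValueError("need 0 < top_k <= num_experts")
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    # Probability that a given expert is skipped by one token.
    p_miss = 1.0 - top_k / num_experts
    # An expert is loaded unless every token in the batch skips it.
    return num_experts * (1.0 - p_miss ** batch_size)


if __name__ == "__main__":
    top_k, batch_size = 8, 8
    for num_experts in (16, 64, 256):
        unique = expected_unique_experts(num_experts, top_k, batch_size)
        # 1.0 would mean perfect sharing; batch_size means no sharing.
        amplification = unique / top_k
        print(f"{num_experts} experts: ~{unique:.1f} loaded per layer, "
              f"{amplification:.2f}x the reads of a single token")
```

As the expert count grows with `top_k` fixed, the read amplification approaches the batch size, which is the sense in which batching helps less for sparser models.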

View on HN · Topics

> Who knows? What I know is that I need >64GB of RAM to run local models, and that means most people will need to upgrade from their 8GB/16GB setup to do the same. Graphics cards follow mostly the same pattern.

Depends how big the models are, how fast you want them to run and how much context you need for your usage. If you're okay with running only smaller models (which are still very capable in general, their main limitation is world knowledge) making very simple inferences at low overall throughput, you can just repurpose the RAM, CPUs/iGPUs and storage in the average setup.
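
How model size and context length translate into memory can be estimated roughly as follows. The sketch assumes weights take `bits_per_weight` bits each and that the KV cache stores one key and one value vector per layer, per KV head, per token. The model dimensions below (32B parameters, 64 layers, 8 KV heads, head dimension 128) are hypothetical and do not describe any model named in the thread; runtime overhead and activations are ignored.

```python
GIB = 2 ** 30


def weights_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GiB."""
    return params_billion * 1e9 * bits_per_weight / 8 / GIB


def kv_cache_gib(layers: int, kv_heads: int, head_dim: int,
                 context_tokens: int, bytes_per_value: int = 2) -> float:
    """Approximate KV cache size in GiB; the factor 2 covers keys and values."""
    total_bytes = (2 * layers * kv_heads * head_dim
                   * context_tokens * bytes_per_value)
    return total_bytes / GIB


if __name__ == "__main__":
    # Hypothetical 32B-parameter model at two quantization levels.
    for bits in (16, 4):
        print(f"weights at {bits}-bit: {weights_gib(32, bits):.1f} GiB")
    # Hypothetical architecture with a 32k-token context and 16-bit cache.
    cache = kv_cache_gib(layers=64, kv_heads=8, head_dim=128,
                         context_tokens=32768)
    print(f"KV cache at 32k context: {cache:.1f} GiB")
```

The same model can land well under or well over a given RAM budget depending on quantization and context length, which is why a single ">64GB" figure does not apply to every use.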

View on HN · Topics

I got a 128 GB MBP, and the current models are fit enough to manage the calendar or do research on the web (very slowly), but not to be useful companions for coding as I had hoped.

View on HN · Topics

The work going into local models seems to be targeting lower RAM/VRAM, which will definitely help.

For example, Gemma 4 32B, which you can run on an off-the-shelf laptop, is at around the same intelligence level as the SOTA models from 2 years ago (e.g. gpt-4o), or even higher. Probably by the time memory prices come down we will have something as smart as Opus 4.7 that can be run locally.

Bigger models of course have more embedded knowledge, but just knowing that they should make a tool call to do a web search can bypass a lot of that.

View on HN · Topics

You still need to hold the model in memory. If you have, for example, 16 GB of RAM, the gains aren't that much.

View on HN · Topics

Won't happen. People are ok with swapping to their SSDs; the MacBook Neo confirms that.

View on HN · Topics

The large models are incredibly inefficient. We'll be squeezing them down for generations.

View on HN · Topics

This could be great.

There's a future where RAM makers tool up for this massively increased demand, then the AI companies go broke as the bubble bursts, so RAM is cheap as. So laptop manufacturers get on that and start making laptops with 1TB+ memory so we can run decent LLMs on the local machine. Everyone happy :)
