Summarizer

Cost vs Performance Tradeoffs

Analysis of inference costs versus capabilities: Gemini Flash praised for its cost-performance ratio, concerns about the $13.62-per-task ARC-AGI price, and debate over what price point makes models practical for real applications

← Back to Gemini 3 Deep Think

The conversation centers on the tension between high-cost reasoning benchmarks, such as the $13.62 price tag for ARC-AGI tasks, and the rapid commoditization of intelligence represented by Gemini Flash’s dominant performance-to-cost ratio. While some critics dismiss expensive inference as "practically unusable" for autonomous agents, others argue that AI is already delivering massive labor arbitrage in high-volume tasks like historical archiving and 3D modeling. This economic shift has sparked a polarizing debate over whether current prices are sustainable or if the industry is merely following a "rideshare playbook" of subsidizing costs to achieve market saturation. Ultimately, many users are finding that the "Pareto frontier" of AI utility lies not in raw intelligence, but in the efficiency of models that provide the most "vibe-coding" and tool-use potential for the lowest marginal cost.

34 comments tagged with this topic

View on HN · Topics
I don't think the creator believes ARC3 can't be solved but rather that it can't be solved "efficiently" and >$13 per task for ARC2 is certainly not efficient. But at this rate, the people who talk about the goal posts shifting even once we achieve AGI may end up correct, though I don't think this benchmark is particularly great either.
View on HN · Topics
https://arcprize.org/leaderboard $13.62 per task - so we need another 5-10 years for the price of running this to become reasonable? But the real question is whether they just fit the model to the benchmark.
View on HN · Topics
Why 5-10 years? At current rates, price per equivalent output is dropping by 99.9% over 5 years. That's basically $0.01 in 5 years. Does it really need to be that cheap to be worth it? Keep in mind, $0.01 in 5 years is worth less than $0.01 today.
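Taking the quoted trend at face value, the arithmetic is straightforward. A quick sketch, assuming the ~99.9% five-year decline continues to apply to the $13.62 figure:

```python
# Extrapolating the quoted trend: a ~99.9% price decline over 5 years,
# applied to today's $13.62-per-task figure (assumption: the trend holds).
start_price = 13.62           # dollars per ARC-AGI-2 task today
five_year_decline = 0.999     # fraction of the price lost over 5 years

price_in_5y = start_price * (1 - five_year_decline)
yearly_factor = (1 - five_year_decline) ** (1 / 5)  # per-year multiplier

print(f"${price_in_5y:.4f} per task in 5 years")        # ≈ $0.0136
print(f"~{(1 - yearly_factor):.0%} cheaper each year")  # ≈ 75%
```

So the claim amounts to prices falling roughly 75% per year, compounding to about a penny per task in five years.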
View on HN · Topics
https://epoch.ai/data-insights/llm-inference-price-trends
View on HN · Topics
A grad student hour is probably more expensive…
View on HN · Topics
In my experience, a grad student hour is treated as free :(
View on HN · Topics
Grad students are incredibly cheap? In the UK for instance their stipend is £20,780 a year...
View on HN · Topics
What’s reasonable? It’s less than minimum hourly wage in some countries.
View on HN · Topics
Burned in seconds.
View on HN · Topics
Getting the work done faster for the same money doesn't make the work more expensive. You could slow down the inference to make the task take longer, if $/sec matters.
View on HN · Topics
You're right, but I don't think we're getting an hour's worth of work out of single prompts yet. Usually it's an hour's worth of work out of 10 prompts of iteration. Now that's a day's wage for an hour of work. I'm certain the crossover will come soon, but it doesn't feel like we're there yet.
View on HN · Topics
Am I the only one who can't find Gemini useful except when you want something cheap? I don't get what the whole code red was about, or all that PR. To me there's no reason to use Gemini instead of a GPT and Anthropic combo. I should add that I've tried it as a chatbot, for coding through Copilot, and as part of a multi-model prompt-generation setup. Gemini was always the worst by a big margin. I see some people saying it is smarter, but it doesn't seem smart at all.
View on HN · Topics
Yes but with a significant (logarithmic) increase in cost per task. The ARC-AGI site is less misleading and shows how GPT and Claude are not actually far behind https://arcprize.org/leaderboard
View on HN · Topics
At $13.62 per task it's practically unusable for agent tasks due to the cost. I found that anything over $2/task on ARC-AGI-2 ends up being way too much for use in coding agents.
View on HN · Topics
It's very hard to tell the difference between bad models and stinginess with compute. I subscribe to both Gemini ($20/mo) and ChatGPT Pro ($200/mo). If I give the same question to "Gemini 3.0 Pro" and "ChatGPT 5.2 Thinking + Heavy thinking", the latter is 4x slower and it gives smarter answers. I shouldn't have to enumerate all the different plausible explanations for this observation. Anything from Gemini nerfing the reasoning effort to save compute, to TPUs being faster, to Gemini simply being worse, to this being my idiosyncratic experience fits the same data, and all are plausible.
View on HN · Topics
I’ve been using Gemini 3 Pro on a historical document archiving project for an old club. One of the guys had been working on scanning old handwritten minutes books written in German that were challenging to read (1885 through 1974). Anyways, I was getting decent results on a first pass with 50-page chunks but ended up doing 1 page at a time (accuracy probably 95%). For each page, I submit the page for a transcription pass followed by a translation of the returned transcription. About 2,370 pages and sitting at about $50 in Gemini API billing. The output will need manual review, but the time savings are impressive.
View on HN · Topics
They could likely increase their budget slightly and run an LLM-based judge.
View on HN · Topics
It is interesting that the video demo generates an .stl model. I run a lot of tests of LLMs generating OpenSCAD code (as I recently launched https://modelrift.com, a text-to-CAD AI editor), and the Gemini 3 family of LLMs actually gives the best price-to-performance ratio now. But they are very, VERY far from being able to spit out a complex OpenSCAD model in one shot. So I had to implement a full-fledged "screenshot-vibe-coding" workflow where you draw arrows on a 3D model snapshot to explain to the LLM what is wrong with the geometry. Without a human in the loop, all top-tier LLMs hallucinate at debugging 3D geometry in agentic mode, and fail spectacularly.
View on HN · Topics
I feel like a Luddite: unless I am running small local models, I use gemini-3-flash for almost everything: great for tool use, embedded use in applications, and Python agentic libraries, broad knowledge, a good built-in web search tool, etc. Oh, and it is fast and cheap. I really only use gemini-3-pro occasionally when researching and trying to better understand something. I guess I am not a good customer for the hyperscalers. That said, when I get home from travel, I will make a point of using Gemini 3 Deep Think for some practical research. I need a business card with the title "Old Luddite."
View on HN · Topics
3 Flash is criminally underappreciated for its performance/cost/speed trifecta. Absolutely in a category of its own.
View on HN · Topics
I'm impressed with the ARC-AGI-2 results - though readers beware: they achieved this score at a cost of $13.62 per task. For context, Opus 4.6's best score is 68.8%, but at a cost of $3.64 per task.
View on HN · Topics
Indeed. And when you factor in the amount invested... yeah it looks less impressive. The question is how much more money needs to be invested to get this thing closer to reality? And not just in this instance. But for any instance e.g. a seahorse on a bike.
View on HN · Topics
They could do it this way: generate 10 reasoning traces and then every N tokens they prune the 9 that have the lowest likelihood, and continue from the highest likelihood trace. This is a form of task-agnostic test time search that is more general than multi agent parallel prompt harnesses. 10 traces makes sense because ChatGPT 5.2 Pro is 10x more expensive per token. That's something you can't replicate without access to the network output pre token sampling.
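A toy simulation of the pruning scheme described above. Dummy random scores stand in for real token log-likelihoods, since (as the comment notes) the real thing needs the model's pre-sampling outputs:

```python
import random

random.seed(0)  # deterministic demo

def chunk_logprob() -> float:
    # Stand-in for the average log-likelihood of the next N tokens of a
    # trace; the real scheme needs the model's pre-sampling token
    # probabilities, which API consumers generally can't access.
    return random.uniform(-2.0, 0.0)

def pruned_search(num_traces: int = 10, num_rounds: int = 4) -> float:
    # Launch num_traces parallel reasoning traces, tracked by
    # cumulative log-likelihood.
    scores = [0.0] * num_traces
    for _ in range(num_rounds):
        # Every trace generates its next chunk of N tokens.
        scores = [s + chunk_logprob() for s in scores]
        # Prune all but the highest-likelihood trace, then fork fresh
        # continuations from the single survivor.
        best = max(scores)
        scores = [best] * num_traces
    return max(scores)

final_score = pruned_search()
```

This is essentially a beam search with beam width 1 and branching factor 10 at the chunk level, which is why it costs roughly 10x per token while remaining task-agnostic.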
View on HN · Topics
So last week I tried Gemini 3 Pro, Opus 4.6, GLM 5, and Kimi 2.5. So far, using Kimi 2.5 yielded the best results (in terms of cost/performance) for me in a mid-size Go project. Curious to know what others think?
View on HN · Topics
I predict Gemini Flash will dominate when you try it. If you're going for a cost-performance balance, choosing Gemini Pro is bewildering. Gemini Flash _outperforms_ Pro in some coding benchmarks and is the clear Pareto frontier leader for intelligence/cost. It's even cheaper than Kimi 2.5. https://artificialanalysis.ai/?media-leaderboards=text-to-im...
View on HN · Topics
It's still useful as a benchmark of cost/efficiency.
View on HN · Topics
I’m someone who’d like to deploy a lot more workers than I want to manage. Put another way, I’m on the capital side of the conversation. The good news for labor that has experience and creativity is that it just started costing 1/100,000 what it used to to get on that side of the equation.
View on HN · Topics
Well he also thinks $10.00 in LLM tokens is equivalent to a $1mm labor budget. These are the same people who were grifting during the NFTs days, claiming they were the future of art.
View on HN · Topics
lmao, you are an idealistic moron. If LLMs can replace labor at 1/100k of the cost (lmfao), why are you looking to "deploy" more workers? So are you trying to say that if I have $100.00 in tokens, I have the equivalent of $10mm in labor potential? What kind of statement is this? This is truly the dumbest statement I've ever seen on this site, for too many reasons to list. You people sound like NFT people in 2021 telling people that they're creating and redefining art. Oh look, [email protected] is a "web3" guy. It's all the same grifters from the NFT days behaving the same way.
View on HN · Topics
So what happens if the AI companies can't make money? I see more and more advances and breakthroughs, but they are taking on debt with no revenue in sight. I seem to understand debt is very bad here, since they could just sell more shares but aren't (either the valuation is stretched or there are no buyers). Just a recession? Something else? Aren't they too big to fail? Edit0: Revenue isn't the right word; profit is more correct. Amazon not being profitable fucks with my understanding of business. Not an economist.
View on HN · Topics
They're using the rideshare app playbook: subsidize the product to reach market saturation, then, once you've found a market segment that depends on your product, raise the price to break even. One major difference, though, is that rideshares haven't really changed in capabilities since they launched: it's a map that shows a little car with your driver coming and a pin where you're going. But it's reasonable to believe that AI will have new fundamental capabilities in the 2030s, 2040s, and so on.
View on HN · Topics
They may quantize the models after release to save money.
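For context on what post-release quantization means in practice, here is a toy int8 round-trip showing the basic mechanism. Illustrative only, under the usual symmetric-scaling assumption, and not how any particular provider actually does it:

```python
# Toy int8 round-trip: the basic mechanism behind weight quantization,
# trading precision for a ~4x memory reduction versus float32.
def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    scale = max(abs(w) for w in weights) / 127  # map the range onto int8
    return [round(w / scale) for w in weights], scale

def dequantize(q: list[int], scale: float) -> list[float]:
    return [x * scale for x in q]

weights = [0.303, -1.27, 0.052, 0.9]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
# Rounding error is bounded by half the quantization step (scale / 2),
# which is why a quantized model is cheaper to serve but slightly lossy.
```

The serving-cost upside (smaller weights, faster memory-bound inference) against a small accuracy loss is exactly why a provider might be tempted to do this quietly after launch.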
View on HN · Topics
I'm sorry, but this is an insane take. Flash leads its category by far, absolutely destroying Sonnet, 5.2, etc. in both performance and cost. Pro still leads in visual intelligence. The company that most locks away its gold is Anthropic, IMO, and for good reason, as Opus 4.6 is expensive AF.
View on HN · Topics
It indeed departs from instructions pretty regularly. But I find it very useful, and for the price it beats the world. "The price" is the marginal price I am paying on top of my existing Google One, YouTube Premium, and Google Fi subs, so basically nothing on the margin.