Summarizer

LLM Input

llm/2ad2a7bb-5462-4391-a2da-bf11064993c9/topic-7-32de88a6-84d8-4486-973c-2360d5705a64-input.json

prompt

The following is content for you to summarize. Do not respond to the comments—summarize them.

<topic>
Cost vs Performance Tradeoffs # Analysis of inference costs versus capabilities, Gemini Flash praised for cost-performance ratio, concerns about $13.62 per ARC-AGI task, and debate over what price makes models practical for real applications
</topic>

<comments_about_topic>
1. I don't think the creator believes ARC3 can't be solved but rather that it can't be solved "efficiently" and >$13 per task for ARC2 is certainly not efficient.

But at this rate, the people who talk about the goal posts shifting even once we achieve AGI may end up correct, though I don't think this benchmark is particularly great either.

2. https://arcprize.org/leaderboard

$13.62 per task - so we need another 5-10 years for the price of running this to become reasonable?

But the real question is if they just fit the model to the benchmark.

3. Why 5-10 years?

At current rates, price per equivalent output is dropping by 99.9% over 5 years.

That's basically $0.01 in 5 years.

Does it really need to be that cheap to be worth it?

Keep in mind, $0.01 in 5 years is worth less than $0.01 today.
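A quick sketch of the arithmetic behind this projection. The 99.9% five-year decline is the commenter's assumption (taken at face value from the linked price-trend data), not an established forecast:

```python
# Projecting today's ARC-AGI-2 cost forward under the assumed decline.
current_cost = 13.62      # dollars per task today
five_year_drop = 0.999    # assumed 99.9% price decline over 5 years
projected_cost = current_cost * (1 - five_year_drop)
print(f"Projected cost in 5 years: ${projected_cost:.4f} per task")
# prints: Projected cost in 5 years: $0.0136 per task
```

That is roughly the "$0.01 in 5 years" figure the commenter cites.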

4. https://epoch.ai/data-insights/llm-inference-price-trends

5. A grad student hour is probably more expensive…

6. In my experience, a grad student hour is treated as free :(

7. Grad students are incredibly cheap? In the UK for instance their stipend is £20,780 a year...

8. What’s reasonable? It’s less than minimum hourly wage in some countries.

9. Burned in seconds.

10. Getting the work done faster for the same money doesn't make the work more expensive.

You could slow down the inference to make the task take longer, if $/sec matters.

11. You're right, but I don't think we're getting an hour's worth of work out of single prompts yet. Usually it's an hour's worth of work out of 10 prompts for iteration. Now that's a day's wage for an hour of work. I'm certain the crossover will come soon, but it doesn't feel there yet.

12. Am I the only one who can’t find Gemini useful except when you want something cheap? I don’t get what the whole code red was about, or all that PR. To me, I see no reason to use Gemini instead of a GPT and Anthropic combo. I should add that I’ve tried it as a chatbot, for coding through Copilot, and also as part of a multi-model prompt generation setup.

Gemini was always the worst by a big margin. I see some people saying it is smarter but it doesn’t seem smart at all.

13. Yes but with a significant (logarithmic) increase in cost per task. The ARC-AGI site is less misleading and shows how GPT and Claude are not actually far behind

https://arcprize.org/leaderboard

14. At $13.62 per task it's practically unusable for agent tasks due to the cost.

I found that anything over $2/task on ARC-AGI-2 ends up being way too much for use in coding agents.

15. It's very hard to tell the difference between bad models and stinginess with compute.

I subscribe to both Gemini ($20/mo) and ChatGPT Pro ($200/mo).

If I give the same question to "Gemini 3.0 Pro" and "ChatGPT 5.2 Thinking + Heavy thinking", the latter is 4x slower and it gives smarter answers.

I shouldn't have to enumerate all the different plausible explanations for this observation. Anything from Gemini deciding to nerf the reasoning effort to save compute, versus TPUs being faster, to Gemini being worse, to this being my idiosyncratic experience, all fit the same data, and are all plausible.

16. I’ve been using Gemini 3 Pro on a historical document archiving project for an old club. One of the guys had been working on scanning old handwritten minutes books written in German that were challenging to read (1885 through 1974). Anyways, I was getting decent results on a first pass with 50 page chunks but ended up doing 1 page at a time (accuracy probably 95%). For each page, I submit the page for a transcription pass followed by a translation of the returned transcription. About 2370 pages and sitting at about $50 in Gemini API billing. The output will need manual review, but the time savings is impressive.
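The two-pass, page-at-a-time workflow described above can be sketched roughly as follows. `transcribe_page` and `translate` are hypothetical stand-ins for the actual Gemini API calls, not real SDK functions:

```python
def process_page(page_image, transcribe_page, translate):
    """One page: pass 1 transcribes the handwriting, pass 2 translates
    the returned transcription. Both outputs are kept for manual review."""
    german = transcribe_page(page_image)   # handwritten German -> text
    english = translate(german)            # German -> English
    return german, english
```

Running this over ~2,370 pages, as the commenter did, makes the per-page API cost (about two cents here) easy to reason about.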

17. They could likely increase their budget slightly and run an LLM-based judge.

18. It is interesting that the video demo generates an .stl model.
I run a lot of tests of LLMs generating OpenSCAD code (as I have recently launched https://modelrift.com, a text-to-CAD AI editor) and Gemini 3 family LLMs are actually giving the best price-to-performance ratio now. But they are very, VERY far from being able to spit out a complex OpenSCAD model in one shot. So I had to implement a full-fledged "screenshot-vibe-coding" workflow where you draw arrows on a 3D model snapshot to explain to the LLM what is wrong with the geometry. Without a human in the loop, all top-tier LLMs hallucinate at debugging 3D geometry in agentic mode and fail spectacularly.

19. I feel like a luddite: unless I am running small local models, I use gemini-3-flash for almost everything: great for tool use, embedded use in applications, and Python agentic libraries, broad knowledge, good built in web search tool, etc. Oh, and it is fast and cheap.

I really only use gemini-3-pro occasionally when researching and trying to better understand something. I guess I am not a good customer for super scalers. That said, when I get home from travel, I will make a point of using Gemini 3 Deep Think for some practical research. I need a business card with the title "Old Luddite."

20. 3 Flash is criminally underappreciated for its performance/cost/speed trifecta. Absolutely in a category of its own.

21. I'm impressed with the Arc-AGI-2 results - though readers beware... They achieved this score at a cost of $13.62 per task.

For context, Opus 4.6's best score is 68.8% - but at a cost of $3.64 per task.

22. Indeed. And when you factor in the amount invested... yeah it looks less impressive. The question is how much more money needs to be invested to get this thing closer to reality? And not just in this instance. But for any instance e.g. a seahorse on a bike.

23. They could do it this way: generate 10 reasoning traces, and then every N tokens prune the 9 that have the lowest likelihood and continue from the highest-likelihood trace.

This is a form of task-agnostic test time search that is more general than multi agent parallel prompt harnesses.

10 traces makes sense because ChatGPT 5.2 Pro is 10x more expensive per token.

That's something you can't replicate without access to the network's output before token sampling.
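The scheme described above amounts to a periodic prune-and-rebranch search over reasoning traces. A minimal toy sketch, assuming a hypothetical `step(tokens)` callable that samples the next token and returns it with its log-probability (as the commenter notes, a real version needs access to the model's pre-sampling outputs):

```python
def prune_and_continue(step, n_traces=10, prune_every=64, max_tokens=512):
    # Each trace is a pair: (token_list, cumulative_logprob).
    traces = [([], 0.0)] * n_traces
    for _ in range(max_tokens // prune_every):
        # Extend every trace by prune_every sampled tokens.
        for _ in range(prune_every):
            traces = [(toks + [tok], lp + tok_lp)
                      for toks, lp in traces
                      for tok, tok_lp in [step(toks)]]
        # Prune: keep only the highest-likelihood trace, re-branch from it.
        best = max(traces, key=lambda t: t[1])
        traces = [best] * n_traces
    return max(traces, key=lambda t: t[1])[0]
```

With a stochastic `step`, the re-branched copies diverge again after each pruning point, which is what distinguishes this from a single best-of-N pass.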

24. So last week I tried Gemini 3 Pro, Opus 4.6, GLM 5, and Kimi 2.5. So far, Kimi 2.5 yielded the best results (in terms of cost/performance) for me in a mid-size Go project. Curious to know what others think?

25. I predict Gemini Flash will dominate when you try it.

If you're going for a cost/performance balance, choosing Gemini Pro is bewildering. Gemini Flash _outperforms_ Pro in some coding benchmarks and is the clear Pareto frontier leader for intelligence/cost. It's even cheaper than Kimi 2.5.

https://artificialanalysis.ai/?media-leaderboards=text-to-im...

26. It's still useful as a benchmark of cost/efficiency.

27. I’m someone who’d like to deploy a lot more workers than I want to manage.

Put another way, I’m on the capital side of the conversation.

The good news for labor that has experience and creativity is that it just started costing 1/100,000 what it used to to get on that side of the equation.

28. Well, he also thinks $10.00 in LLM tokens is equivalent to a $1mm labor budget. These are the same people who were grifting during the NFT days, claiming they were the future of art.

29. lmao, you are an idealistic moron. If LLMs can replace labor at 1/100k of the cost (lmfao), why are you looking to "deploy" more workers? So are you trying to say that if I have $100.00 in tokens, I have the equivalent of $10mm in labor potential? What kind of statement is this?

This is truly the dumbest statement I've ever seen on this site for too many reasons to list.

You people sound like NFT people in 2021 telling people that they're creating and redefining art.

Oh look, [email protected] is a "web3" guy. It's all the same grifters from the NFT days behaving the same way.

30. So what happens if the AI companies can't make money? I see more and more advances and breakthroughs, but they are taking on debt with no revenue in sight.

I seem to understand debt is very bad here, since they could just sell more shares but aren't (either valuation is stretched or there are no buyers).

Just a recession? Something else? Aren't they too big to fail?

Edit0: Revenue isn't the right word; profit is more correct. Amazon not being profitable fucks with my understanding of business. Not an economist.

31. They're using the ride-share app playbook. Subsidize the product to reach market saturation. Once you've found a market segment that depends on your product, you raise the price to break even. One major difference, though, is that ride shares haven't really changed in capabilities since they launched: it's a map that shows a little car with your driver coming and a pin where you're going. But it's reasonable to believe that AI will have new fundamental capabilities in the 2030s, 2040s, and so on.

32. They may quantize the models after release to save money.

33. I'm sorry but this is an insane take. Flash is leading its category by far. Absolutely destroys sonnet, 5.2 etc in both perf and cost.

Pro still leads in visual intelligence.

The company that most locks away their gold is Anthropic IMO and for good reason, as Opus 4.6 is expensive AF

34. It indeed departs from instructions pretty regularly. But I find it very useful and for the price it beats the world.

"The price" is the marginal price I am paying on top of my existing Google One, YouTube Premium, and Google Fi subs, so basically nothing on the margin.
</comments_about_topic>

Write a concise, engaging paragraph (3-5 sentences) summarizing the key points and perspectives in these comments about the topic. Focus on the most interesting viewpoints. Do not use bullet points—write flowing prose.


commentCount

34
