Summarizer

LLM Input

llm/2ad2a7bb-5462-4391-a2da-bf11064993c9/topic-13-5ae136df-6a58-4786-ac71-ebd79587af56-input.json

prompt

The following is content for you to summarize. Do not respond to the comments—summarize them.

<topic>
Real World Task Performance # Frustration that benchmark gains don't translate to practical improvements, examples of models failing simple debugging tasks, and arguments that actual work product matters more than test scores
</topic>

<comments_about_topic>
1. Codex ranks higher for instruction following

2. >because we can no longer find tasks that are feasible for normal humans but unsolved by AI.

"Answer "I don't know" if you don't know an answer to one of the questions"

3. I've been surprised how difficult it is for LLMs to simply answer "I don't know."

It also seems oddly difficult for them to 'right-size' the length and depth of their answers based on prior context. I either have to give it a fixed length limit or put up with exhaustive answers.

4. > I've been surprised how difficult it is for LLMs to simply answer "I don't know."

It's very difficult to train for that. Of course you can include a Question+Answer pair in your training data for which the answer is "I don't know" but in that case where you have a ready question you might as well include the real answer anyways, or else you're just training your LLM to be less knowledgeable than the alternative. But then, if you never have the pattern of "I don't know" in the training data it also won't show up in results, so what should you do?

If you could predict the blind spots ahead of time you'd plug them up, either with knowledge or with "idk". But nobody can predict the blind spots perfectly, so instead they become the main hallucinations.

5. The best pro/research-grade models from Google and OpenAI now have little difficulty recognizing when they don't know how or can't find enough information to solve a given problem. The free chatbot models rarely will, though.

6. This seems true for info not in the question - eg. "Calculate the volume of a cylinder with height 10 meters".

However it is less true with info missing from the training data - ie. "I have a Diode marked UM16, what is the maximum current at 125C?"

7. This seems fine...?

https://chatgpt.com/share/698e992b-f44c-800b-a819-f899e83da2...

I don't see anything wrong with its reasoning. UM16 isn't explicitly mentioned in the data sheet, but the UM prefix is listed in the 'Device marking code' column. The model hedges its response accordingly ("If the marking is UM16 on an SMA/DO-214AC package...") and reads the graph in Fig. 1 correctly.

Of course, it took 18 minutes of crunching to get the answer, which seems a tad excessive.

8. Indeed that answer is awesome. Much better than Gemini 2.5 pro which invented a 16 kilovolt diode which it just hoped would be marked "UM16".

9. > The average human tested scores 60%. So the machines are already smarter on an individual basis than the average human.

Maybe it's testing the wrong things then. Even those of use who are merely average can do lots of things that machines don't seem to be very good at.

I think ability to learn should be a core part of any AGI. Take a toddler who has never seen anybody doing laundry before and you can teach them in a few minutes how to fold a t-shirt. Where are the dumb machines that can be taught?

10. There's no shortage of laundry-folding robot demos these days. Some claim to benefit from only minimal monkey-see/monkey-do levels of training, but I don't know how credible those claims are.

11. > Because the gains from spending time improving the model overall outweigh the gains from spending time individually training on benchmarks.

This may not be the case if you just e.g. roll the benchmarks into the general training data, or make running on the benchmarks just another part of the testing pipeline. I.e. improving the model generally and benchmaxing could very conceivably just both be done at the same time, it needn't be one or the other.

I think the right take away is to ignore the specific percentages reported on these tests (they are almost certainly inflated / biased) and always assume cheating is going on. What matters is that (1) the most serious tests aren't saturated, and (2) scores are improving . I.e. even if there is cheating, we can presume this was always the case, and since models couldn't do as well before even when cheating, these are still real improvements.

And obviously what actually matters is performance on real-world tasks.

12. > Could it also be that the models are just a lot better than a year ago?

No, the proof is in the pudding.

After AI we're having higher prices, higher deficits and lower standard of living. Electricity, computers and everything else costs more. "Doing better" can only be justified by that real benchmark.

If Gemini 3 DT was better we would have falling prices of electricity and everything else at least until they get to pre-2019 levels.

13. I suspect the non-spikey part is the more interesting comparison

Why is it so easy for me to open the car door, get in, close the door, buckle up. You can do this in the dark and without looking.

There are an infinite number of little things like this you think zero about, take near zero energy, yet which are extremely hard for Ai

14. What is the point of comparing performance of these tools to humans? Machines have been able to accomplish specific tasks better than humans since the industrial revolution. Yet we don't ascribe intelligence to a calculator.

None of these benchmarks prove these tools are intelligent, let alone generally intelligent. The hubris and grift are exhausting.

15. It can be reasonable to be skeptical that advances on benchmarks may be only weakly or even negatively correlated with advances on real-world tasks. I.e. a huge jump on benchmarks might not be perceptible to 99% of users doing 99% of tasks, or some users might even note degradation on specific tasks. This is especially the case when there is some reason to believe most benchmarks are being gamed.

Real-world use is what matters, in the end. I'd be surprised if a change this large doesn't translate to something noticeable in general, but the skepticism is not unreasonable here.

16. > What evidence of intelligence would satisfy you?

That is a loaded question. It presumes that we can agree on what intelligence is, and that we can measure it in a reliable way. It is akin to asking an atheist the same about God. The burden of proof is on the claimer.

The reality is that we can argue about that until we're blue in the face, and get nowhere.

In this case it would be more productive to talk about the practical tasks a pattern matching and generation machine can do, rather than how good it is at some obscure puzzle. The fact that it's better than humans at solving some problems is not particularly surprising, since computers have been better than humans at many tasks for decades. This new technology gives them broader capabilities, but ascribing human qualities to it and calling it intelligence is nothing but a marketing tactic that's making some people very rich.

17. You're right, but I don't think we're getting an hour's worth of work out of single prompts yet. Usually it's an hour's worth of work out of 10 prompts for iteration. Now that's a day's wage for an hour of work. I'm certain the crossover will come soon, but it doesn't feel there yet.

18. It's garbage really, cannot get how they get so high in benchmarks.

19. You are not the only one, it's to the point where I think that these benchmark results must be faked somehow because it doesn't match my reality at all.

20. I think you are making that confusion. Any robotic system in the place of his parents would fail with a few hours.

There are more novel tasks in a day than ARC provides.

21. I must be holding these things wrong because I'm not seeing any of these God like superpowers everyone seem to enjoy.

22. Who said they’re godlike today?

And yes, you are probably using them wrong if you don’t find them useful or don’t see the rapid improvement.

23. Let's come back in 12 months and discuss your singularity then. Meanwhile I spent like $30 on a few models as a test yesterday, none of them could tell me why my goroutine system was failing, even though it was painfully obvious (I purposefully added one too many wg.Done), gemini, codex, minimax 2.5, they all shat the bed on a very obvious problem but I am to believe they're 98% conscious and better at logic and math than 99% of the population.

Every new model release neckbeards come out of the basements to tell us the singularity will be there in two more weeks

24. On the flip side, twice I put about 800K tokens of code into Gemini and asked it to find why my code was misbehaving, and it found it.

The logic related to the bug wasn't all contained in one file, but across several files.

This was Gemini 2.5 Pro. A whole generation old.

25. Mind sharing the file?

Also, did you use Codex 5.3 Xhigh through the Codex CLI or Codex App?

26. Post the file here

27. Meanwhile I've been using Kimi K2T and K2.5 to work in Go with a fair amount of concurrency and it's been able to write concurrent Go code and debug issues with goroutines equal to, and much more complex then, your issue, involving race conditions and more, just fine.

Projects:

https://github.com/alexispurslane/oxen

https://github.com/alexispurslane/org-lsp

(Note that org-lsp has a much improved version of the same indexer as oxen; the first was purely my design, the second I decided to listen to K2.5 more and it found a bunch of potential race conditions and fixed them)

shrug

28. Out of curiosity, did you give a test for them to validate the code?

I had a test failing because I introduced a silly comparison bug (> instead of <), and claude 4.6 opus figured out it wasn't the test the problem, but the code and fixed the bug (which I had missed).

29. There was a test and a very useful golang error that literally explain what was wrong. The model tried implementing a solution, failed and when I pointed out the error most of them just rolled back the "solution"

30. What exact models were you using? And with what settings? 4.6 / 5.3 codex both with thinking / high modes?

31. minimax 2.5, kimi k2.5, codex 5.2, gemini 3 flash and pro, glm 4.7, devstral2 123b, etc.

32. > I purposefully added one too many wg.Done

What do you believe this shows? Sometimes I have difficulty finding bugs in other people's code when they do things in ways I would never use. I can rewrite their code so it works, but I can't necessarily quickly identify the specific bug.

Expecting a model to be perfect on every problem isn't reasonable. No known entity is able to do that. AIs aren't supposed to be gods.

(Well not yet anyway - there is as yet insufficient data for a meaningful answer.)

33. It's hard to evaluate "logic" and "math", since they're made up of many largely disparate things. But I think modern AI models are clearly better at coding, for example, than 99% of the population. If you asked 100 people at your local grocery store why your goroutine system was failing, do you think multiple of them would know the answer?

34. Ok, here I am living in the real world finding these models have advanced incredibly over the past year for coding.

Benchmaxxing exists, but that’s not the only data point. It’s pretty clear that models are improving quickly in many domains in real world usage.

35. I use agentic tools daily and SOTA models have certainly improved a lot in the last year. But still in a linear, "they don't light my repo on fire as often when they get a confusing compiler error" kind of way, not a "I would now trust Opus 4.6 to respond to every work email and hands-off manage my banking and investment portfolio" kind of way.

They're still afflicted by the same fundamental problems that hold LLMs back from being a truly autonomous "drop-in human replacement" that would enable an entire new world of use cases.

And finally live up to the hype/dreams many of us couldn't help but feeling was right around in the corner circa 2022/3 when things really started taking off.

36. It's so capable at some things, and others are garbage.
I uploaded a photo of some words for a spelling bee and asked it to quiz my kid on the words. The first word it asked, wasn't on the list. After multiple attempts to get it to start asking only the words in the uploaded pic, it did, and then would get the spellings wrong in the Q&A. I gave up.

37. I had it process a photo of my D&D character sheet and help me debug it as I'm a n00b at the game. Also did a decent, although not perfect, job of adding up a handwritten bowling score sheet.

38. 100x agree. It gives inconsistent edits, would regularly try to perform things I explicitly command to not.

39. These benchmarks are super impressive. That said, Gemini 3 Pro benchmarked well on coding tasks, and yet I found it abysmal. A distant third behind Codex and Claude.

Tool calling failures, hallucinations, bad code output. It felt like using a coding model from a year ago.

Even just as a general use model, somehow ChatGPT has a smoother integration with web search (than google!!), knowing when to use it, and not needing me to prompt it directly multiple times to search.

Not sure what happened there. They have all the ingredients in theory but they've really fallen behind on actual usability.

Their image models are kicking ass though.

40. You nailed it. Gemini 3 Pro seems very "lazy" and seems to never reason for more than 30 seconds, which significantly impacts the quality of its outputs.

41. They seem to be optimizing for benchmarks instead of real world use

42. Google is still behind the largest models I'd say, in real world utility. Gemini 3 Pro still has many issues.

43. Fair enough. I'm always astonished how different experiences are because mine is the complete opposite. I almost solely use it for help with Go and Javascript programming and found Gemini Pro to be more useful than any other model. ChatGPT was the worst offender so far, completely useless, but Claude has also been suboptimal for my use cases.

I guess it depends a lot on what you use LLMs for and how they are prompted. For example, Gemini fails the simple "count from 1 to 200 in words" test whereas Claude does it without further questions.

Another possible explanation would be that processing time is distributed unevenly across the globe and companies stay silent about this. Maybe depending on time zones?

44. Gemini is completely unusable in VS Code.
It's rated 2/5 stars, pathetic: https://marketplace.visualstudio.com/items?itemName=Google.g...

Requests regularly time out, the whole window freezes, it gets stuck in schizophrenic loops, edits cannot be reverted and more.

It doesn't even come close to Claude or ChatGPT.

45. I’ve been using Gemini 3 Pro on a historical document archiving project for an old club. One of the guys had been working on scanning old handwritten minutes books written in German that were challenging to read (1885 through 1974). Anyways, I was getting decent results on a first pass with 50 page chunks but ended up doing 1 page at a time (accuracy probably 95%). For each page, I submit the page for a transcription pass followed by a translation of the returned transcription. About 2370 pages and sitting at about $50 in Gemini API billing. The output will need manual review, but the time savings is impressive.

46. Suggestion: run the identical prompt N times (2 identical calls to Gemini 3.0 Pro + 2 identical calls to GPT 5.2 Thinking), then running some basic text post-processing to see where the 4 responses agree vs disagree. The disagreements (substrings that aren't identical matches) are where scrutiny is needed. But if all 4 agree on some substring it's almost certainly a correct transcription. Wouldn't be too hard to get codex to vibe code all this.

47. Have you tried providing multiple pages at a time to the model? It might do better transcription as it have bigger context to work with.

48. It sounds like a job where one pass might also be a viable option. Until you do the manual review you won't have a full sense of the time savings involved.

49. Good idea. I’ll try modifying the prompt to transcribe, identify the language, and translate if not English, and then return a structured result. In my spot checks, most of the errors are in people’s names and if the handwriting trails into margins (especially into the fold of the binding). Even with the data still needing review, the translations from it has revealed a lot of interesting characters as well as this little anecdote from the minutes of the June 6, 1941 Annual Meeting:

It had already rained at the beginning of the meeting. During the same, however, a heavy thunderstorm set in, whereby our electric light line was put out of operation. Wax candles with beer bottles as light holders provided the lighting.
In the meantime the rain had fallen in a cloudburst-like manner, so that one needed help to get one's automobile going. In some streets the water stood so high that one could reach one's home only by detours.
In this night 9.65 inches of rain had fallen.

50. One discovery I've made with gemini is that ocr accuracy is much higher when document is perfectly aligned at 0 degree. When we provided images with handwritten text to gemini which were horizontal (90 or 180 degree) it had lots of issues reading dates, names etc. Then we used paddle ocr image orientation model to find orientation and rotate the image it solved most of our issues with ocr.

51. yes, i had the same experience. As good as LLMs are now at coding - it seems they are still far away from being useful in vision dominated engineering tasks like CAD/design. I guess it is a training data problem. Maybe world models / artificial data can help here?

52. It's ahead in raw power but not in function. Like it's got the worlds fast engine but one gear! Trouble is some benchmarks only measure horse power.

53. > Trouble is some benchmarks only measure horse power.

IMO it's the other way around. Benchmarks only measure applied horse power on a set plane, with no friction and your elephant is a point sphere. Goog's models have always punched over what benchmarks said, in real world use @ high context. They don't focus on "agentic this" or "specialised that", but the raw models, with good guidance are workhorses. I don't know any other models where you can throw lots of docs at it and get proper context following and data extraction from wherever it's at to where you'd need it.

54. Gemini has always felt like someone who was book smart to me. It knows a lot of things. But if you ask it do anything that is offscript it completely falls apart

55. I made offmetaedh.com with it. Feels pretty great to me.

56. It found a small but nice little optimization in Stockfish: https://github.com/official-stockfish/Stockfish/pull/6613

Previous models including Claude Opus 4.6 have generally produced a lot of noise/things that the compiler already reliably optimizes out.

57. The reflection of the sun in the water is completely wrong. LLMs are still useless. (/s)

58. It's worth noting that you mean excellent in terms of prior AI output. I'm pretty sure this wouldn't be considered excellent from a "human made art" perspective. In other words, it's still got a ways to go!

Edit: someone needs to explain why this comment is getting downvoted, because I don't understand. Did someone's ego get hurt, or what?

59. Indeed. And when you factor in the amount invested... yeah it looks less impressive. The question is how much more money needs to be invested to get this thing closer to reality? And not just in this instance. But for any instance e.g. a seahorse on a bike.

60. Highly disagree.

I was expecting something more realistic... the true test of what you are doing is how representative is the thing in relation to the real world. E.g. does the pelican look like a pelican as it exists in reality? This cartoon stuff is cute but doesnt pass muster in my view.

If it doesn't relate to the real world, then it most likely will have no real effect on the real economy. Pure and simple.

61. Companies are optimizing for all the big benchmarks. This is why there is so little correlation between benchmark performance and real world performance now.

62. Isn’t there? I mean, Claude code has been my biggest usecase and it basically one shots everything now

63. Yes, LLMs have become extremely good at coding (not software engineer though). But try using them for anything original that cannot be adapted from GitHub and Stack Overflow. I haven't seen much improvement at all at such tasks.

64. But 80% sounds far from good enough, that's 20% error rate, unusable in autonomous tasks. Why stop at 80%? If we aim for AGI, it should 100% any benchmark we give.

65. Actually faster and worse is a very common characterization of a LOT of automation.

66. That's true.

The problem is that if the automation breaks at any point, the entire system fails. And programming automations are extremely sensitive to minor errors (i.e. a missing semicolon).

AI does have an interesting feature though, it tends to self-healing in a way, when given tools access and a feedback loop. The only problem is that self-healing can incorrectly heal errors, then the final reault will be wrong in hard-to-detect ways.

So the more wuch hidden bugs there are, the nore unexpectedly the automations will perform.

I still don't trust current AI for any tasks more than data parsing/classification/translation and very strict tool usage.

I don't beleive in the full-assistant/clawdbot usage safety and reliability at this time (it might be good enough but the end of the year, but then the SWE bench should be at 100%).

67. It's a useless meaningless benchmark though, it just got a catchy name, as in, if the models solve this it means they have "AGI", which is clearly rubbish.

Arc-AGI score isn't correlated with anything useful.

68. Give it a prompt like

>can u make the progm for helps that with what in need for shpping good cheap products that will display them on screen and have me let the best one to get so that i can quickly hav it at home

And get back an automatic coupon code app like the user actually wanted.

69. Right now I'm still stuck with AI that can't even install other AI.

70. Not trained for agentic workflows yet unfortunately - this looks like it will be fantastic when they have an agent friendly one. Super exciting.

71. Highly valuable right now with how high leverage it can make a good engineer. Who knows for how long.

72. I tried to debug a Wireguard VPN issue. No luck.

We need more than AGI.

73. top 10 elo in codeforces is pretty absurd

74. The benchmark should be: can you ask it to create a profitable business or product and send you the profit?

Everything else is bike shedding.

75. > Advertising, how will they kill ads any better than the current cat and mouse games with ad blockers?

The Ad Blocker cat and mouse game relies on human-written metaheuristics and rules. It's annoying for humans to keep up. It's difficult to install.

Agents/Bots or super slim detection models will easily be able to train on ads and nuke them whatever form they come in: javascript, inline DOM, text content, video content.

Train an anti-Ad model and it will cleanse the web of ads. You just need a place to run it from the top.

You wouldn't even have to embed this into a browser. It could run in memory with permissions to overwrite the memory of other applications.

> Social Media, how will they kill social media?

MoltClawd was only the beginning. Soon the signal will become so noisy it will be intolerable. Just this week, X's Nikita Bier suggested we have less than six months before he sees no solution.

Speaking of X, they just took down Higgsfield's (valued at $1.3B) main account because they were doing it across a molt bot army, and they're not the only ones. Extreme measures were the only thing they could do. For the distributed spam army, there will be no fix. People are already getting phone calls from this stuff.

76. When will AI come up with a cure / vaccine for the common cold? and then cancer next?

77. Nonsense releases. Until they allow for medical diagnosis and legal advice who cares? You own all the prompts and outputs but somehow they can still modify them and censor them? No.

These 'Ai' are just sophisticated data collection machines, with the ability to generate meh code.

78. It seems to be adept at reviewing/editing/critiquing, at least for my use cases. It always has something valuable to contribute from that perspective, but has been comparatively useless otherwise (outside of moats like "exclusive access to things involving YouTube").

79. I need to test the sketch creation a s a p. I need this in my life because learning to use Freecad is too difficult for a busy person like me (and frankly, also quite lazy)

80. Always the same with Google.

Gemini has been way behind from the start.

They use the firehose of money from search to make it as close to free as possible so that they have some adoption numbers.

They use the firehose from search to pay for tons of researchers to hand hold academics so that their non-economic models and non-economic test-time-compute can solve isolated problems.

It's all so tiresome.

Try making models that are actually competitive, Google.

Sell them on the actual market and win on actual work product in millions of people lives.

81. Does anyone actually use Gemini 3 now? I cant stand its sleek salesy way of introduction, and it doesnt hold to instructions hard – makes it unapplicable for MECE breakdowns or for writing.

82. I do. It's excellent when paired with an MCP like context7.

83. I use Gemini Pro for basically everything. I just started learning systems biology as I didn't even know this was a subject until it came up in a conversation.

Biology is subject I am quite lacking in but it is unbelievable to me what I have learned in the last few weeks. Not even in what Gemini says exactly but in the text and papers it has led me to.

One major reason is that it has never cut me off until last night. I ran several deep researches yesterday and then finally got cut off in a sprawling 2 hour conversation.

For me it is the first model now that has something new coming out but I haven't extracted all the value from the old model that I am bored with it. I still haven't tried Opus 4.5 let alone 4.6 because I know I will get cut off right when things get rolling.

I don't think I have even logged into ChatGPT in a month now.
</comments_about_topic>

Write a concise, engaging paragraph (3-5 sentences) summarizing the key points and perspectives in these comments about the topic. Focus on the most interesting viewpoints. Do not use bullet points—write flowing prose.

topic

Real World Task Performance # Frustration that benchmark gains don't translate to practical improvements, examples of models failing simple debugging tasks, and arguments that actual work product matters more than test scores

commentCount

83

← Back to job