llm/2ad2a7bb-5462-4391-a2da-bf11064993c9/batch-2-ed5b5c7a-6b54-49ed-9cc4-88e0c26f7453-input.json
The following is content for you to classify. Do not respond to the comments—classify them.
<topics>
1. ARC-AGI Benchmark Validity
Related: Debate over whether ARC-AGI measures general intelligence or just spatial reasoning puzzles, concerns about benchmarkmaxxing, semi-private vs private test sets, cost per task at $13.62, and whether solving it indicates anything meaningful about AGI capabilities
2. Gemini vs Claude for Coding
Related: Strong consensus that Claude dominates agentic coding workflows while Gemini lags behind, discussion of tool calling failures, instruction following issues, and hallucinations when using Gemini for development tasks
3. Benchmarkmaxxing Concerns
Related: Skepticism that high benchmark scores reflect real-world performance, suspicions that labs optimize specifically for popular tests, concerns about training data leakage, and debate over whether improvements are genuine or gamed
4. Definition of AGI
Related: Philosophical debate about what constitutes artificial general intelligence, whether consciousness is required, Chollet's definition involving tasks feasible for humans but unsolved by AI, and moving goalposts in AI evaluation
5. Google Product Quality Issues
Related: Complaints about Gemini app UX problems including context loss, Russian propaganda sources, switching languages mid-sentence, document upload failures, and poor integration compared to ChatGPT
6. Balatro Gaming Benchmark
Related: Discussion of Gemini 3's ability to play the card game Balatro from text descriptions alone, debate over whether this demonstrates generalization, and comparisons showing other models like DeepSeek failing at the task
7. Model Release Acceleration
Related: Observation that AI model releases are accelerating dramatically, multiple frontier models released within days, connection to Chinese New Year timing, and competition between US and Chinese labs
8. Cost vs Performance Tradeoffs
Related: Analysis of inference costs versus capabilities, Gemini Flash praised for cost-performance ratio, concerns about $13.62 per ARC-AGI task, and debate over what price makes models practical for real applications
9. Deep Research Reliability
Related: Mixed experiences with AI deep research capabilities, complaints about garbage citations, hallucinated sources, contradictory information, and questions about whether it saves time when sources must be verified
10. Google's Competitive Position
Related: Debate over whether Google is leading or behind in AI, discussion of their data advantages from YouTube and Books, claims they let competitors think they were behind, and analysis of their strengths in visual AI
11. Pelican on Bicycle Benchmark
Related: Simon Willison's informal SVG generation test, discussion of whether it's being trained on specifically, quality improvements in latest models, and debate over its validity as a casual benchmark
12. AI Consciousness Claims
Related: Pushback against suggestions that passing tests indicates consciousness, comparisons to simple programs claiming consciousness, discussion of self-awareness research, and skepticism about anthropomorphizing AI capabilities
13. Test Time Compute Approaches
Related: Analysis of thinking vs non-thinking models, best-of-N approaches like Deep Think, computational complexity differences, and questions about whether sufficiently large non-thinking models can match smaller thinking ones
14. Real World Task Performance
Related: Frustration that benchmark gains don't translate to practical improvements, examples of models failing simple debugging tasks, and arguments that actual work product matters more than test scores
15. AI Job Displacement Fears
Related: Concerns about software engineers being replaced, comparisons to factory worker displacement, debate over whether AI creates or destroys jobs, and skepticism about optimistic narratives from AI company executives
16. Spatial Reasoning Limitations
Related: Discussion of LLMs struggling with spatial tasks, image orientation affecting OCR accuracy, and whether ARC-AGI improvements indicate genuine spatial reasoning advances or benchmark-specific solutions
17. Model Architecture Secrecy
Related: Observation that frontier labs no longer share architecture details like parameter counts, shift from technical discussions to capability-focused marketing, and desire for more transparency
18. Academic vs Practical Intelligence
Related: Distinction between Gemini excelling at academic benchmarks while feeling less useful for practical tasks, discussion of book smart vs street smart analogies for AI capabilities
19. First Proof Mathematical Challenge
Related: Discussion of newly released unsolved math problems designed to test frontier models, predictions about whether current models can solve genuine research-level mathematics
20. Subscription Pricing Frustration
Related: Complaints about $250/month Google AI Ultra subscription required for Deep Think access, desire to test new models without platform lock-in, and calls for OpenRouter availability
0. Does not fit well in any category
</topics>
<comments_to_classify>
[
{
"id": "46997076",
"text": "> Because the gains from spending time improving the model overall outweigh the gains from spending time individually training on benchmarks.\n\nThis may not be the case if you just e.g. roll the benchmarks into the general training data, or make running on the benchmarks just another part of the testing pipeline. I.e. improving the model generally and benchmaxing could very conceivably just both be done at the same time, it needn't be one or the other.\n\nI think the right take away is to ignore the specific percentages reported on these tests (they are almost certainly inflated / biased) and always assume cheating is going on. What matters is that (1) the most serious tests aren't saturated, and (2) scores are improving . I.e. even if there is cheating, we can presume this was always the case, and since models couldn't do as well before even when cheating, these are still real improvements.\n\nAnd obviously what actually matters is performance on real-world tasks."
}
,
{
"id": "46993130",
"text": "* that you weren't supposed to be able to"
}
,
{
"id": "46993070",
"text": "Could it also be that the models are just a lot better than a year ago?"
}
,
{
"id": "46994792",
"text": "> Could it also be that the models are just a lot better than a year ago?\n\nNo, the proof is in the pudding.\n\nAfter AI we're having higher prices, higher deficits and lower standard of living. Electricity, computers and everything else costs more. \"Doing better\" can only be justified by that real benchmark.\n\nIf Gemini 3 DT was better we would have falling prices of electricity and everything else at least until they get to pre-2019 levels."
}
,
{
"id": "46995012",
"text": "> If Gemini 3 DT was better we would have falling prices of electricity and everything else at least\n\nMan, I've seen some maintenance folks down on the field before working on them goalposts but I'm pretty sure this is the first time I saw aliens from another Universe literally teleport in, grab the goalposts, and teleport out."
}
,
{
"id": "46995300",
"text": "You might call me crazy, but at least in 2024, consumers spent ~1% less of their income on expenses than 2019[2], which suggests that 2024 is more affordable than 2019.\n\nThis is from the BLS consumer survey report released in dec[1]\n\n[1] https://www.bls.gov/news.release/cesan.nr0.htm\n\n[2] https://www.bls.gov/opub/reports/consumer-expenditures/2019/\n\nPrices are never going back to 2019 numbers though"
}
,
{
"id": "46995676",
"text": "That's an improper analysis.\n\nFirst off, it's dollar-averaging every category, so it's not \"% of income\", which varies based on unit income.\n\nSecond, I could commit to spending my entire life with constant spending (optionally inflation adjusted, optionally as a % of income), by adusting quality of goods and service I purchase. So the total spending % is not a measure of affordability."
}
,
{
"id": "46996052",
"text": "Almost everyone lifestyle ratchets, so the handful that actually downgrade their living rather than increase spending would be tiny.\n\nThis part of a wider trend too, where economic stats don't align with what people are saying. Which is most likley explained by the economic anomaly of the pandemic skewing peoples perceptions."
}
,
{
"id": "46996956",
"text": "We have centuries of historical evidence that people really, really don’t like high inflation, and it takes a while & a lot of turmoil for those shocks to work their way through society."
}
,
{
"id": "46993215",
"text": "https://chatgpt.com/s/m_698e2077cfcc81919ffbbc3d7cccd7b3"
}
,
{
"id": "46993361",
"text": "I don't understand what you want to tell us with this image."
}
,
{
"id": "46994201",
"text": "they're accusing GGP of moving the goalposts."
}
,
{
"id": "46993309",
"text": "Would be cool to have a benchmark with actually unsolved math and science questions, although I suspect models are still quite a long way from that level."
}
,
{
"id": "46995693",
"text": "Does folding a protein count? How about increasing performance at Go?"
}
,
{
"id": "46999765",
"text": "\"Optimize this extremely nontrivial algorithm\" would work. But unless the provided solution is novel you can never be certain there wasn't leakage. And anyway at that point you're pretty obviously testing for superintelligence."
}
,
{
"id": "47000430",
"text": "It's worth noting that neither of those were accomplished by LLMs."
}
,
{
"id": "46992482",
"text": "Here's a good thread over 1+ month, as each model comes out\n\nhttps://bsky.app/profile/pekka.bsky.social/post/3meokmizvt22...\n\ntl;dr - Pekka says Arc-AGI-2 is now toast as a benchmark"
}
,
{
"id": "46992839",
"text": "If you look at the problem space it is easy to see why it's toast, maybe there's intelligence in there, but hardly general."
}
,
{
"id": "46993178",
"text": "the best way I've seen this describes is \"spikey\" intelligence, really good at some points, those make the spikes\n\nhumans are the same way, we all have a unique spike pattern, interests and talents\n\nai are effectively the same spikes across instances, if simplified. I could argue self driving vs chatbots vs world models vs game playing might constitute enough variation. I would not say the same of Gemini vs Claude vs ... (instances), that's where I see \"spikey clones\""
}
,
{
"id": "46993219",
"text": "You can get more spiky with AIs, whereas with human brain we are more hard wired.\n\nSo maybe we are forced to be more balanced and general whereas AI don't have to."
}
,
{
"id": "46993268",
"text": "I suspect the non-spikey part is the more interesting comparison\n\nWhy is it so easy for me to open the car door, get in, close the door, buckle up. You can do this in the dark and without looking.\n\nThere are an infinite number of little things like this you think zero about, take near zero energy, yet which are extremely hard for Ai"
}
,
{
"id": "46996911",
"text": ">Why is it so easy for me to open the car door\n\nBecause this part of your brain has been optimized for hundreds of millions of years. It's been around a long ass time and takes an amazingly low amount of energy to do these things.\n\nOn the other hand the 'thinking' part of your brain, that is your higher intelligence is very new to evolution. It's expensive to run. It's problematic when giving birth. It's really slow with things like numbers, heck a tiny calculator and whip your butt in adding.\n\nThere's a term for this, but I can't think of it at the moment."
}
,
{
"id": "47001811",
"text": "> There's a term for this, but I can't think of it at the moment.\n\nMoravec's paradox: https://epoch.ai/gradient-updates/moravec-s-paradox"
}
,
{
"id": "46995720",
"text": "You are asking a robotics question, not an AI question. Robotics is more and less than AI. Boston Dynamics robots are getting quite near your benchmark."
}
,
{
"id": "46997832",
"text": "Boston dynamics is missing just about all the degrees of freedom involved in the scenario op mentions."
}
,
{
"id": "46995704",
"text": "> maybe there's intelligence in there, but hardly general.\n\nOf course. Just as our human intelligence isn't general."
}
,
{
"id": "46993500",
"text": "I'm excited for the big jump in ARC-AGI scores from recent models, but no one should think for a second this is some leap in \"general intelligence\".\n\nI joke to myself that the G in ARC-AGI is \"graphical\". I think what's held back models on ARC-AGI is their terrible spatial reasoning, and I'm guessing that's what the recent models have cracked.\n\nLooking forward to ARC-AGI 3, which focuses on trial and error and exploring a set of constraints via games."
}
,
{
"id": "46993983",
"text": "Agreed. I love the elegance of ARC, but it always felt like a gotcha to give spatial reasoning challenges to token generators- and the fact that the token generators are somehow beating it anyway really says something."
}
,
{
"id": "46993946",
"text": "The average ARC AGI 2 score for a single human is around 60%.\n\n\"100% of tasks have been solved by at least 2 humans (many by more) in under 2 attempts. The average test-taker score was 60%.\"\n\nhttps://arcprize.org/arc-agi/2/"
}
,
{
"id": "46994535",
"text": "Worth keeping in mind that in this case the test takers were random members of the general public. The score of e.g. people with bachelor's degrees in science and engineering would be significantly higher."
}
,
{
"id": "46994815",
"text": "Random members of the public = average human beings. I thought those were already classified as General Intelligences."
}
,
{
"id": "46999239",
"text": "Average human beings with average human problems."
}
,
{
"id": "46995350",
"text": "What is the point of comparing performance of these tools to humans? Machines have been able to accomplish specific tasks better than humans since the industrial revolution. Yet we don't ascribe intelligence to a calculator.\n\nNone of these benchmarks prove these tools are intelligent, let alone generally intelligent. The hubris and grift are exhausting."
}
,
{
"id": "46995660",
"text": "What's the point of denying or downplaying that we are seeing amazing and accelerating advancements in areas that many of us thought were impossible?"
}
,
{
"id": "46997166",
"text": "It can be reasonable to be skeptical that advances on benchmarks may be only weakly or even negatively correlated with advances on real-world tasks. I.e. a huge jump on benchmarks might not be perceptible to 99% of users doing 99% of tasks, or some users might even note degradation on specific tasks. This is especially the case when there is some reason to believe most benchmarks are being gamed.\n\nReal-world use is what matters, in the end. I'd be surprised if a change this large doesn't translate to something noticeable in general, but the skepticism is not unreasonable here."
}
,
{
"id": "47000924",
"text": "The GP comment is not skeptical of the jump in benchmark scores reported by one particular LLM. It's skeptical of machine intelligence in general, claims that there's no value in comparing their performances with those of human beings, and accuses those who disagree with this take of \"hubris and grift\". This has nothing to do with any form or reasonable skepticism."
}
,
{
"id": "46996919",
"text": "I would suggest it is a phenomenon that is well studied, and has many forms. I guess mostly identify preservation. If you dislike AI from the start, it is generally a very strongly emotional view. I don't mean there is no good reason behind it, I mean, it is deeply rooted in your psyche, very emotional.\n\nPeople are incredibly unlikely to change those sort of views, regardless of evidence. So you find this interesting outcome where they both viscerally hate AI, but also deny that it is in any way as good as people claim.\n\nThat won't change with evidence until it is literally impossible not to change."
}
,
{
"id": "46996944",
"text": "The hubris and grift are exhausting.\n\nAnd moving the goalposts every few months isn't? What evidence of intelligence would satisfy you?\n\nPersonally, my biggest unsatisfied requirement is continual-learning capability, but it's clear we aren't too far from seeing that happen."
}
,
{
"id": "46997324",
"text": "> What evidence of intelligence would satisfy you?\n\nThat is a loaded question. It presumes that we can agree on what intelligence is, and that we can measure it in a reliable way. It is akin to asking an atheist the same about God. The burden of proof is on the claimer.\n\nThe reality is that we can argue about that until we're blue in the face, and get nowhere.\n\nIn this case it would be more productive to talk about the practical tasks a pattern matching and generation machine can do, rather than how good it is at some obscure puzzle. The fact that it's better than humans at solving some problems is not particularly surprising, since computers have been better than humans at many tasks for decades. This new technology gives them broader capabilities, but ascribing human qualities to it and calling it intelligence is nothing but a marketing tactic that's making some people very rich."
}
,
{
"id": "46998215",
"text": "(Shrug) Unless and until you provide us with your own definition of intelligence, I'd say the marketing people are as entitled to their opinion as you are."
}
,
{
"id": "46997796",
"text": "> What evidence of intelligence would satisfy you?\n\nImposing world peace and/or exterminating homo sapiens"
}
,
{
"id": "46995565",
"text": "> Machines have been able to accomplish specific tasks...\n\nIndeed, and the specific task machines are accomplishing now is intelligence. Not yet \"better than human\" (and certainly not better than every human) but getting closer."
}
,
{
"id": "46995952",
"text": "> Indeed, and the specific task machines are accomplishing now is intelligence.\n\nHow so? This sentence, like most of this field, is making baseless claims that are more aspirational than true.\n\nMaybe it would help if we could first agree on a definition of \"intelligence\", yet we don't have a reliable way of measuring that in living beings either.\n\nIf the people building and hyping this technology had any sense of modesty, they would present it as what it actually is: a large pattern matching and generation machine. This doesn't mean that this can't be very useful, perhaps generally so, but it's a huge stretch and an insult to living beings to call this intelligence.\n\nBut there's a great deal of money to be made on this idea we've been chasing for decades now, so here we are."
}
,
{
"id": "46996143",
"text": "> Maybe it would help if we could first agree on a definition of \"intelligence\", yet we don't have a reliable way of measuring that in living beings either.\n\nHow about this specific definition of intelligence?\n\nSolve any task provided as text or images.\n\nAGI would be to achieve that faster than an average human."
}
,
{
"id": "46996491",
"text": "I still can't understand why they should be faster. Humans have general intelligence, afaik. It doesn't matter if it's fast or slow. A machine able to do what the average human can do (intelligence-wise) but 100 times slower still has general intelligence. Since it's artificial, it's AGI."
}
,
{
"id": "46993948",
"text": "Wouldn't you deal with spatial reasoning by giving it access to a tool that structures the space in a way it can understand or just is a sub-model that can do spatial reasoning? These \"general\" models would serve as the frontal cortex while other models do specialized work. What is missing?"
}
,
{
"id": "46993992",
"text": "That's a bit like saying just give blind people cameras so they can see."
}
,
{
"id": "46996938",
"text": "I mean, no not really. These models can see, you're giving them eyes to connect to that part of their brain."
}
,
{
"id": "46996215",
"text": "They should train more on sports commentary, perhaps that could give spatial reasoning a boost."
}
,
{
"id": "46992864",
"text": "https://arcprize.org/leaderboard\n\n$13.62 per task - so we need another 5-10 years for the price to run this to become reasonable?\n\nBut the real question is if they just fit the model to the benchmark."
}
]
</comments_to_classify>
Based on the comments above, assign each to up to 3 relevant topics.
Return ONLY a JSON array with this exact structure (no other text):
[
{
"id": "comment_id_1",
"topics": [
1,
3,
5
]
}
,
{
"id": "comment_id_2",
"topics": [
2
]
}
,
{
"id": "comment_id_3",
"topics": [
0
]
}
,
...
]
Rules:
- Each comment can have 0 to 3 topics
- Use 1-based topic indices for matches
- Use index 0 if the comment does not fit well in any category
- Only assign topics that are genuinely relevant to the comment
Remember: Output ONLY the JSON array, no other text.