The following is content for you to classify. Do not respond to the comments—classify them.
<topics>
1. ARC-AGI Benchmark Validity
Related: Debate over whether ARC-AGI measures general intelligence or just spatial reasoning puzzles, concerns about benchmarkmaxxing, semi-private vs private test sets, cost per task at $13.62, and whether solving it indicates anything meaningful about AGI capabilities
2. Gemini vs Claude for Coding
Related: Strong consensus that Claude dominates agentic coding workflows while Gemini lags behind, discussion of tool calling failures, instruction following issues, and hallucinations when using Gemini for development tasks
3. Benchmarkmaxxing Concerns
Related: Skepticism that high benchmark scores reflect real-world performance, suspicions that labs optimize specifically for popular tests, concerns about training data leakage, and debate over whether improvements are genuine or gamed
4. Definition of AGI
Related: Philosophical debate about what constitutes artificial general intelligence, whether consciousness is required, Chollet's definition involving tasks feasible for humans but unsolved by AI, and moving goalposts in AI evaluation
5. Google Product Quality Issues
Related: Complaints about Gemini app UX problems including context loss, Russian propaganda sources, switching languages mid-sentence, document upload failures, and poor integration compared to ChatGPT
6. Balatro Gaming Benchmark
Related: Discussion of Gemini 3's ability to play the card game Balatro from text descriptions alone, debate over whether this demonstrates generalization, and comparisons showing other models like DeepSeek failing at the task
7. Model Release Acceleration
Related: Observation that AI model releases are accelerating dramatically, multiple frontier models released within days, connection to Chinese New Year timing, and competition between US and Chinese labs
8. Cost vs Performance Tradeoffs
Related: Analysis of inference costs versus capabilities, Gemini Flash praised for cost-performance ratio, concerns about $13.62 per ARC-AGI task, and debate over what price makes models practical for real applications
9. Deep Research Reliability
Related: Mixed experiences with AI deep research capabilities, complaints about garbage citations, hallucinated sources, contradictory information, and questions about whether it saves time when sources must be verified
10. Google's Competitive Position
Related: Debate over whether Google is leading or behind in AI, discussion of their data advantages from YouTube and Books, claims they let competitors think they were behind, and analysis of their strengths in visual AI
11. Pelican on Bicycle Benchmark
Related: Simon Willison's informal SVG generation test, discussion of whether it's being trained on specifically, quality improvements in latest models, and debate over its validity as a casual benchmark
12. AI Consciousness Claims
Related: Pushback against suggestions that passing tests indicates consciousness, comparisons to simple programs claiming consciousness, discussion of self-awareness research, and skepticism about anthropomorphizing AI capabilities
13. Test Time Compute Approaches
Related: Analysis of thinking vs non-thinking models, best-of-N approaches like Deep Think, computational complexity differences, and questions about whether sufficiently large non-thinking models can match smaller thinking ones
14. Real World Task Performance
Related: Frustration that benchmark gains don't translate to practical improvements, examples of models failing simple debugging tasks, and arguments that actual work product matters more than test scores
15. AI Job Displacement Fears
Related: Concerns about software engineers being replaced, comparisons to factory worker displacement, debate over whether AI creates or destroys jobs, and skepticism about optimistic narratives from AI company executives
16. Spatial Reasoning Limitations
Related: Discussion of LLMs struggling with spatial tasks, image orientation affecting OCR accuracy, and whether ARC-AGI improvements indicate genuine spatial reasoning advances or benchmark-specific solutions
17. Model Architecture Secrecy
Related: Observation that frontier labs no longer share architecture details like parameter counts, shift from technical discussions to capability-focused marketing, and desire for more transparency
18. Academic vs Practical Intelligence
Related: Distinction between Gemini excelling at academic benchmarks while feeling less useful for practical tasks, discussion of book smart vs street smart analogies for AI capabilities
19. First Proof Mathematical Challenge
Related: Discussion of newly released unsolved math problems designed to test frontier models, predictions about whether current models can solve genuine research-level mathematics
20. Subscription Pricing Frustration
Related: Complaints about $250/month Google AI Ultra subscription required for Deep Think access, desire to test new models without platform lock-in, and calls for OpenRouter availability
0. Does not fit well in any category
</topics>
<comments_to_classify>
[
{
"id": "46991897",
"text": "Google is absolutely running away with it. The greatest trick they ever pulled was letting people think they were behind."
}
,
{
"id": "46993950",
"text": "Their models might be impressive, but their products absolutely suck donkey balls. I’ve given Gemini web/cli two months and ran away back to ChatGPT. Seriously, it would just COMPLETELY forget context mid dialog. When asked about improving air quality it just gave me a list of (mediocre) air purifiers without asking for any context whatsoever, and I can list thousands of conversations like that. Shopping or comparing options is just nonexistent.\nIt uses Russian propaganda sources for answers and switches to Chinese mid sentence (!), while explaining some generic Python functionality.\nIt’s an embarrassment and I don’t know how they justify 20 euro price tag on it."
}
,
{
"id": "46994415",
"text": "I agree. On top of that, in true Google style, basic things just don't work.\n\nAny time I upload an attachment, it just fails with something vague like \"couldn't process file\". Whether that's a simple .MD or .txt with less than 100 lines or a PDF. I tried making a gem today. It just wouldn't let me save it, with some vague error too.\n\nI also tried having it read and write stuff to \"my stuff\" and Google drive. But it would consistently write but not be able to read from it again. Or would read one file from Google drive and ignore everything else.\n\nTheir models are seriously impressive. But as usual Google sucks at making them work well in real products."
}
,
{
"id": "46995093",
"text": "I don't find that at all. At work, we've no access to the API, so we have to force feed a dozen (or more) documents, code and instruction prompts through the web interface upload interface. The only failures I've ever had in well over 300 sessions were due to connectivity issues, not interface failures.\n\nContext window blowouts? All the time, but never document upload failures."
}
,
{
"id": "46999770",
"text": "I'm talking about Gemini in the app and on the web. As well as AI studio. At work we go through Copilot, but there the agentic mode with Gemini isn't the best either."
}
,
{
"id": "46996996",
"text": "Honestly this is as Google product as you can get. Prizes for some, beatings for others."
}
,
{
"id": "46998991",
"text": "Antigravity is an embarrassment.\n\nThe models feel terrible, somehow, like they're being fed terrible system prompts.\n\nPlus the damn thing kept crashing and asking me to \"restart it\". What?!\n\nAt least Kiro does what it says on the tin."
}
,
{
"id": "46999864",
"text": "My experience with Antigravity is the opposite. It's the first time in over 10 years that an IDE has managed to take me out a bit out of the jetbrain suite. I did not think that was something possible as I am a hardcore jetbrain user/lover."
}
,
{
"id": "47000277",
"text": "It's literally just vscode? I tried it the other day and I couldn't tell it apart from windsurf besides the icon in my dock"
}
,
{
"id": "46995188",
"text": "How can the models be impressive if they switch to Chinese mid-sentence? I've observed those bizarre bugs too. Even GPT-3 didn't have those. Maybe GPT-2 did. It's actually impressive that they managed to botch it so badly.\n\nGoogle is great at some things, but this isn't it."
}
,
{
"id": "46994730",
"text": "It's so capable at some things, and others are garbage.\nI uploaded a photo of some words for a spelling bee and asked it to quiz my kid on the words. The first word it asked, wasn't on the list. After multiple attempts to get it to start asking only the words in the uploaded pic, it did, and then would get the spellings wrong in the Q&A. I gave up."
}
,
{
"id": "46998280",
"text": "I had it process a photo of my D&D character sheet and help me debug it as I'm a n00b at the game. Also did a decent, although not perfect, job of adding up a handwritten bowling score sheet."
}
,
{
"id": "46999428",
"text": "100x agree. It gives inconsistent edits, would regularly try to perform things I explicitly command to not."
}
,
{
"id": "46995659",
"text": "Agreed on the product. I can't make Gemini read my emails on GMail. One day it says it doesn't have access, the other day it says Query unsuccessful.\nClaude Desktop has no problem reaching to GMail, on the other hand :)"
}
,
{
"id": "46994826",
"text": "Sadly true.\n\nIt is also one of the worst models to have a sort of ongoing conversation with."
}
,
{
"id": "47000478",
"text": "And it gives incorrect answers about itself and google’s services all the time. It kept pointing me to nonexistent ui elements. At least it apologizes profusely! ffs"
}
,
{
"id": "46994136",
"text": "Their models are absolutely not impressive.\n\nNot a single person is using it for coding (outside of Google itself).\n\nMaybe some people on a very generous free plan.\n\nTheir model is a fine mid 2025 model, backed by enormous compute resources and an army of GDM engineers to help the “researchers” keep the model on task as it traverses the “tree of thoughts”.\n\nBut that isn’t “the model” that’s an old model backed by massive money."
}
,
{
"id": "46998390",
"text": "Uhh, just false."
}
,
{
"id": "46999902",
"text": "I don't have any of these issues with Gemini. I use it heavily everyday. A few glitches here and there, but it's been enormously productive for me. Far more so then chatgpt, which I find mostly useless."
}
,
{
"id": "47002070",
"text": "These benchmarks are super impressive. That said, Gemini 3 Pro benchmarked well on coding tasks, and yet I found it abysmal. A distant third behind Codex and Claude.\n\nTool calling failures, hallucinations, bad code output. It felt like using a coding model from a year ago.\n\nEven just as a general use model, somehow ChatGPT has a smoother integration with web search (than google!!), knowing when to use it, and not needing me to prompt it directly multiple times to search.\n\nNot sure what happened there. They have all the ingredients in theory but they've really fallen behind on actual usability.\n\nTheir image models are kicking ass though."
}
,
{
"id": "46993652",
"text": "Peacetime Google is not like wartime Google.\n\nPeacetime Google is slow, bumbling, bureaucratic. Wartime Google gets shit done."
}
,
{
"id": "46993942",
"text": "OpenAI is the best thing that happened to Google apparently."
}
,
{
"id": "46996091",
"text": "Just not search. The search product has pretty much become useless over the past 3 years and the AI answers often will get just to the level of 5 years ago. This creates a sense that that things are better - but really it’s just become impossible to get reliable information from an avenue that used to work very well.\n\nI don’t think this is intentional, but I think they stopped fighting SEO entirely to focus on AI. Recipes are the best example - completely gutted and almost all receive sites (therefore the entire search page) run by the same company. I didn’t realize how utterly consolidated huge portions of information on the internet was until every recipe site about 3 months ago simultaneously implemented the same anti-Adblock."
}
,
{
"id": "46999383",
"text": "The search product become useless on a particular day of 2019 as discussed on HN News some time ago:\n\nhttps://news.ycombinator.com/item?id=40133976"
}
,
{
"id": "46994664",
"text": "Competition always is. I think there was a real fear that their core product was going to be replaced. They're already cannibalizing it internally so it was THE wake up call."
}
,
{
"id": "46996296",
"text": "Next they compete on ads..."
}
,
{
"id": "46994644",
"text": "Wartime Google gave us Google+. Wartime Google is still bumbling, and despite OpenAI's numerous missteps, I don't think it has to worry about Google hurting its business yet."
}
,
{
"id": "46996749",
"text": "I do miss Google+. For my brain / use case, it was by far the best social network out there, and the Circle friends and interest management system is still unparalleled :)"
}
,
{
"id": "46999963",
"text": "Google+ was fun. Failed in the market though.\n\nApple made a social network called Ping. Disaster. MobileMe was silly.\n\nMicrosoft made Zune and the Kin 1 and Kin 2 devices and Windows phone and all sorts of other disasters.\n\nThese things happen."
}
,
{
"id": "46993484",
"text": "But wait two hours for what OpenAI has! I love the competition and how someone just a few days ago was telling how ARC-AGI-2 was proof that LLMs can't reason. The goalposts will shift again. I feel like most of human endeavor will soon be just about trying to continuously show that AI's don't have AGI."
}
,
{
"id": "46994483",
"text": "> I feel like most of human endeavor will soon be just about trying to continuously show that AI's don't have AGI.\n\nI think you overestimate how much your average person-on-the-street cares about LLM benchmarks. They already treat ChatGPT or whichever as generally intelligent (including to their own detriment), are frustrated about their social media feeds filling up with slop and, maybe, if they're white-collar, worry about their jobs disappearing due to AI. Apart from a tiny minority in some specific field, people already know themselves to be less intelligent along any measurable axis than someone somewhere."
}
,
{
"id": "46994032",
"text": "\"AGI\" doesn't mean anything concrete, so it's all a bunch of non-sequiturs. Your goalposts don't exist.\n\nAnyone with any sense is interested in how well these tools work and how they can be harnessed, not some imaginary milestone that is not defined and cannot be measured."
}
,
{
"id": "46994239",
"text": "I agree. I think the emergence of LLMs have shown that AGI really has no teeth. I think for decades the Turing test was viewed as the gold standard, but it's clear that there doesn't appear to be any good metric."
}
,
{
"id": "46996545",
"text": "The turing test was passed in the 80s, somehow it has remained relevant in pop culture despite the fact that it's not a particularly difficult technical achievement"
}
,
{
"id": "46997573",
"text": "It wasn’t passed in the 80s. Not the general Turing test."
}
,
{
"id": "46998708",
"text": "c. 2022 for me."
}
,
{
"id": "46993623",
"text": "Soon they can drop the bioweapon to welcome our replacement."
}
,
{
"id": "47001618",
"text": "I'd personally bet on Google and Meta in the long run since they have access to the most interesting datasets from their other operations."
}
,
{
"id": "46998872",
"text": "Not in my experience with Gemini Pro and coding. It hallucinates APIs that aren't there. Claude does not do that.\n\nGemini has flashes of brilliance, but I regard it as unpolished some things work amazingly, some basics don't work."
}
,
{
"id": "46999397",
"text": "It's very hard to tell the difference between bad models and stinginess with compute.\n\nI subscribe to both Gemini ($20/mo) and ChatGPT Pro ($200/mo).\n\nIf I give the same question to \"Gemini 3.0 Pro\" and \"ChatGPT 5.2 Thinking + Heavy thinking\", the latter is 4x slower and it gives smarter answers.\n\nI shouldn't have to enumerate all the different plausible explanations for this observation. Anything from Gemini deciding to nerf the reasoning effort to save compute, versus TPUs being faster, to Gemini being worse, to this being my idiosyncratic experience, all fit the same data, and are all plausible."
}
,
{
"id": "47000105",
"text": "You nailed it. Gemini 3 Pro seems very \"lazy\" and seems to never reason for more than 30 seconds, which significantly impacts the quality of its outputs."
}
,
{
"id": "46999429",
"text": "It was obvious to me that they were top contender 2 years ago ... https://www.reddit.com/r/LocalLLaMA/comments/1c0je6h/google_..."
}
,
{
"id": "46997789",
"text": "Don't let the benchmarks fool you. Gemini models are completely useless not matter how smart they are. Google still hasn't figure out tool calling and making the model follow instructions. They seem to only care about benchmarking and being the most intelligent model on paper. This has been a problem of Gemini since 1.0 and they still haven't fixed it.\n\nAlso the worst model in terms of hallucinations."
}
,
{
"id": "46997854",
"text": "Disagree.\n\nClaude Code is great for coding, Gemini is better than everything else for everything else."
}
,
{
"id": "46999426",
"text": "What is \"everything else\" in your view? Just curious -- I really only seriously use models for coding, so I am curious what I am missing."
}
,
{
"id": "47000942",
"text": "Role-playing but Claude is as bad, same censored garbage with the CEO wanting to be your dad. Grok is best for everything else by far."
}
,
{
"id": "47000970",
"text": "And mathematics?"
}
,
{
"id": "46997967",
"text": "Are you using Gemini model itself or using the Gemini App? They are different."
}
,
{
"id": "46998006",
"text": "Both"
}
,
{
"id": "46999353",
"text": "They seem to be optimizing for benchmarks instead of real world use"
}
]
</comments_to_classify>
Based on the comments above, assign each to up to 3 relevant topics.
Return ONLY a JSON array with this exact structure (no other text):
[
{
"id": "comment_id_1",
"topics": [
1,
3,
5
]
}
,
{
"id": "comment_id_2",
"topics": [
2
]
}
,
{
"id": "comment_id_3",
"topics": [
0
]
}
,
...
]
Rules:
- Each comment can have 0 to 3 topics
- Use 1-based topic indices for matches
- Use index 0 if the comment does not fit well in any category
- Only assign topics that are genuinely relevant to the comment
Remember: Output ONLY the JSON array, no other text.
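For maintainers of this batch pipeline, a minimal sketch of how the returned array could be checked against the rules above. It is an illustration only: NUM_TOPICS, the function name, the messages, and the assumption that index 0 is used on its own are mine, not part of the task specification.

import json

# Illustrative validator for the classifier's output. NUM_TOPICS, the function
# name, and the "index 0 stands alone" check are assumptions, not part of the
# task specification above.
NUM_TOPICS = 20  # topics 1-20, plus 0 for "does not fit"

def validate_classification(raw: str, expected_ids: set[str]) -> list[str]:
    """Return a list of rule violations found in the model's JSON output."""
    problems = []
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"output is not valid JSON: {exc}"]
    if not isinstance(data, list):
        return ["output is not a JSON array"]
    seen = set()
    for item in data:
        if not isinstance(item, dict):
            problems.append(f"non-object entry: {item!r}")
            continue
        cid = item.get("id")
        topics = item.get("topics", [])
        if cid not in expected_ids:
            problems.append(f"unknown comment id {cid!r}")
        if cid in seen:
            problems.append(f"duplicate entry for id {cid!r}")
        seen.add(cid)
        if len(topics) > 3:
            problems.append(f"{cid}: {len(topics)} topics (rule allows at most 3)")
        for t in topics:
            if not isinstance(t, int) or not 0 <= t <= NUM_TOPICS:
                problems.append(f"{cid}: topic index {t!r} outside 0-{NUM_TOPICS}")
        if 0 in topics and len(topics) > 1:
            problems.append(f"{cid}: index 0 mixed with other topics")
    missing = expected_ids - seen
    if missing:
        problems.append(f"no classification returned for {len(missing)} comments")
    return problems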