Summarizer

LLM Input

llm/2ad2a7bb-5462-4391-a2da-bf11064993c9/batch-6-ba14c543-8c75-4501-8d6c-05fdd808c2b7-input.json

prompt

The following is content for you to classify. Do not respond to the comments—classify them.

<topics>
1. ARC-AGI Benchmark Validity
   Related: Debate over whether ARC-AGI measures general intelligence or just spatial reasoning puzzles, concerns about benchmarkmaxxing, semi-private vs private test sets, cost per task at $13.62, and whether solving it indicates anything meaningful about AGI capabilities
2. Gemini vs Claude for Coding
   Related: Strong consensus that Claude dominates agentic coding workflows while Gemini lags behind, discussion of tool calling failures, instruction following issues, and hallucinations when using Gemini for development tasks
3. Benchmarkmaxxing Concerns
   Related: Skepticism that high benchmark scores reflect real-world performance, suspicions that labs optimize specifically for popular tests, concerns about training data leakage, and debate over whether improvements are genuine or gamed
4. Definition of AGI
   Related: Philosophical debate about what constitutes artificial general intelligence, whether consciousness is required, Chollet's definition involving tasks feasible for humans but unsolved by AI, and moving goalposts in AI evaluation
5. Google Product Quality Issues
   Related: Complaints about Gemini app UX problems including context loss, Russian propaganda sources, switching languages mid-sentence, document upload failures, and poor integration compared to ChatGPT
6. Balatro Gaming Benchmark
   Related: Discussion of Gemini 3's ability to play the card game Balatro from text descriptions alone, debate over whether this demonstrates generalization, and comparisons showing other models like DeepSeek failing at the task
7. Model Release Acceleration
   Related: Observation that AI model releases are accelerating dramatically, multiple frontier models released within days, connection to Chinese New Year timing, and competition between US and Chinese labs
8. Cost vs Performance Tradeoffs
   Related: Analysis of inference costs versus capabilities, Gemini Flash praised for cost-performance ratio, concerns about $13.62 per ARC-AGI task, and debate over what price makes models practical for real applications
9. Deep Research Reliability
   Related: Mixed experiences with AI deep research capabilities, complaints about garbage citations, hallucinated sources, contradictory information, and questions about whether it saves time when sources must be verified
10. Google's Competitive Position
   Related: Debate over whether Google is leading or behind in AI, discussion of their data advantages from YouTube and Books, claims they let competitors think they were behind, and analysis of their strengths in visual AI
11. Pelican on Bicycle Benchmark
   Related: Simon Willison's informal SVG generation test, discussion of whether it's being trained on specifically, quality improvements in latest models, and debate over its validity as a casual benchmark
12. AI Consciousness Claims
   Related: Pushback against suggestions that passing tests indicates consciousness, comparisons to simple programs claiming consciousness, discussion of self-awareness research, and skepticism about anthropomorphizing AI capabilities
13. Test Time Compute Approaches
   Related: Analysis of thinking vs non-thinking models, best-of-N approaches like Deep Think, computational complexity differences, and questions about whether sufficiently large non-thinking models can match smaller thinking ones
14. Real World Task Performance
   Related: Frustration that benchmark gains don't translate to practical improvements, examples of models failing simple debugging tasks, and arguments that actual work product matters more than test scores
15. AI Job Displacement Fears
   Related: Concerns about software engineers being replaced, comparisons to factory worker displacement, debate over whether AI creates or destroys jobs, and skepticism about optimistic narratives from AI company executives
16. Spatial Reasoning Limitations
   Related: Discussion of LLMs struggling with spatial tasks, image orientation affecting OCR accuracy, and whether ARC-AGI improvements indicate genuine spatial reasoning advances or benchmark-specific solutions
17. Model Architecture Secrecy
   Related: Observation that frontier labs no longer share architecture details like parameter counts, shift from technical discussions to capability-focused marketing, and desire for more transparency
18. Academic vs Practical Intelligence
   Related: Distinction between Gemini excelling at academic benchmarks while feeling less useful for practical tasks, discussion of book smart vs street smart analogies for AI capabilities
19. First Proof Mathematical Challenge
   Related: Discussion of newly released unsolved math problems designed to test frontier models, predictions about whether current models can solve genuine research-level mathematics
20. Subscription Pricing Frustration
   Related: Complaints about $250/month Google AI Ultra subscription required for Deep Think access, desire to test new models without platform lock-in, and calls for OpenRouter availability
0. Does not fit well in any category
</topics>

<comments_to_classify>
[
  
{
  "id": "46992240",
  "text": "Those black nazis in the first image model were a cause of inside trading."
}
,
  
{
  "id": "46998698",
  "text": "I'm leery to use a Google product in light of their history of discontinuing services. It'd have to be significantly better than a similar product from a committed competitor."
}
,
  
{
  "id": "46992205",
  "text": "Trick? Lol not a chance. Alphabet is a pure play tech firm that has to produce products to make the tech accessible. They really lack in the latter and this is visible when you see the interactions of their VP's. Luckily for them, if you start to create enough of a lead with the tech, you get many chances to sort out the product stuff."
}
,
  
{
  "id": "46993012",
  "text": "You sound like Russ Hanneman from SV"
}
,
  
{
  "id": "46993273",
  "text": "It's not about how much you earn. It's about what you're worth."
}
,
  
{
  "id": "46993659",
  "text": "Google is still behind the largest models I'd say, in real world utility. Gemini 3 Pro still has many issues."
}
,
  
{
  "id": "46993376",
  "text": "Gemini's UX (and of course privacy cred as with anything Google) is the worst of all the AI apps. In the eyes of the Common Man, it's UI that will win out, and ChatGPT's is still the best."
}
,
  
{
  "id": "46993931",
  "text": "Google privacy cred is ... excellent? The worst data breach I know of them having was a flaw that allowed access to names and emails of 500k users."
}
,
  
{
  "id": "46994258",
  "text": "Link? Are you conflating with \"500k Gmail accounts leaked [by a third party]\" with Gmail having a breach?\n\nAfaik, Google has had no breaches ever."
}
,
  
{
  "id": "46995310",
  "text": "https://en.wikipedia.org/wiki/2018_Google_data_breach"
}
,
  
{
  "id": "46994782",
  "text": "Google is the breach."
}
,
  
{
  "id": "46998447",
  "text": "Their SECURITY cred is fantastic.\n\nPrivacy, not so much. How many hundreds of millions have they been fined for “incognito mode” in chrome being a blatant lie?"
}
,
  
{
  "id": "46998760",
  "text": "> Their SECURITY cred is fantastic.\n\nIn a world where Android vulnerabilities and exploits don't exist"
}
,
  
{
  "id": "46994335",
  "text": "If you consider \"privacy\" to be 'a giant corporation tracks every bit of possible information about you and everyone else'?"
}
,
  
{
  "id": "46995329",
  "text": "OpenAI is running ads. Do you think they'll track less?"
}
,
  
{
  "id": "46994069",
  "text": "They don't even let you have multiple chats if you disable their \"App Activity\" or whatever (wtf is with that ass naming? they don't even have a \"Privacy\" section in their settings the last time I checked)\n\nand when I swap back into the Gemini app on my iPhone after a minute or so the chat disappears. and other weird passive-aggressive take-my-toys-away behavior if you don't bare your body and soul to Googlezebub.\n\nChatGPT and Grok work so much better without accounts or with high privacy settings."
}
,
  
{
  "id": "46993677",
  "text": "> Gemini's UX ... is the worst of all the AI apps\n\nBeen using Gemini + OpenCode for the past couple weeks.\n\nSuddenly, I get a \"you need a Gemini Access Code license\" error but when you go to the project page there is no mention of this or how to get the license.\n\nYou really feel the \"We're the phone company and we don't care. Why? Because we don't have to.\" [0] when you use these Google products.\n\nPS for those that don't get the reference: US phone companies in the 1970s had a monopoly on local and long distance phone service. Similar to Google for search/ads (really a \"near\" monopoly but close enough).\n\n0 - https://vimeo.com/355556831"
}
,
  
{
  "id": "46996058",
  "text": "I find Gemini's web page much snappier to use than ChatGPT - I've largely swapped to it for most things except more agentic tasks."
}
,
  
{
  "id": "46993450",
  "text": "You mean AI Studio or something like that, right? Because I can't see a problem with Google's standard chat interface. All other AI offerings are confusing both regarding their intended use and their UX, though, I have to concur with that."
}
,
  
{
  "id": "46993843",
  "text": "The lack of \"projects\" alone makes their chat interface really unpleasant compared to ChatGPT and Claude."
}
,
  
{
  "id": "46993748",
  "text": "AI Studio is also significantly improved as of yesterday."
}
,
  
{
  "id": "46994012",
  "text": "No projects, completely forgets context mid dialog, mediocre responses even on thinking, research got kneecapped somehow and is completely uses now, uses propaganda Russian videos as the search material (what’s wrong with you, Google?), janky on mobile, consumes GIGABYTES of RAM on web (seriously, what the fuck?). Left a couple of tabs over night, Mac is almost complete frozen because 10 tabs consumed 8 GBs of RAM doing nothing. It’s a complete joke."
}
,
  
{
  "id": "47000526",
  "text": "Fair enough. I'm always astonished how different experiences are because mine is the complete opposite. I almost solely use it for help with Go and Javascript programming and found Gemini Pro to be more useful than any other model. ChatGPT was the worst offender so far, completely useless, but Claude has also been suboptimal for my use cases.\n\nI guess it depends a lot on what you use LLMs for and how they are prompted. For example, Gemini fails the simple \"count from 1 to 200 in words\" test whereas Claude does it without further questions.\n\nAnother possible explanation would be that processing time is distributed unevenly across the globe and companies stay silent about this. Maybe depending on time zones?"
}
,
  
{
  "id": "46995233",
  "text": "Gemini is completely unusable in VS Code.\nIt's rated 2/5 stars, pathetic: https://marketplace.visualstudio.com/items?itemName=Google.g...\n\nRequests regularly time out, the whole window freezes, it gets stuck in schizophrenic loops, edits cannot be reverted and more.\n\nIt doesn't even come close to Claude or ChatGPT."
}
,
  
{
  "id": "46998401",
  "text": "Once Google launched Antigravity, I stopped using VS Code."
}
,
  
{
  "id": "46998299",
  "text": "Smart idea to say anything against Google here from a throwaway account, I'm sitting in negative karma for that :')"
}
,
  
{
  "id": "46998784",
  "text": "Anti Google comments do pretty well on average. It's a popular sentiment. However, low effort comments don't."
}
,
  
{
  "id": "46998438",
  "text": "They were behind. Way behind. But they caught up."
}
,
  
{
  "id": "46995964",
  "text": "I’ve been using Gemini 3 Pro on a historical document archiving project for an old club. One of the guys had been working on scanning old handwritten minutes books written in German that were challenging to read (1885 through 1974). Anyways, I was getting decent results on a first pass with 50 page chunks but ended up doing 1 page at a time (accuracy probably 95%). For each page, I submit the page for a transcription pass followed by a translation of the returned transcription. About 2370 pages and sitting at about $50 in Gemini API billing. The output will need manual review, but the time savings is impressive."
}
,
  
{
  "id": "46999836",
  "text": "Suggestion: run the identical prompt N times (2 identical calls to Gemini 3.0 Pro + 2 identical calls to GPT 5.2 Thinking), then running some basic text post-processing to see where the 4 responses agree vs disagree. The disagreements (substrings that aren't identical matches) are where scrutiny is needed. But if all 4 agree on some substring it's almost certainly a correct transcription. Wouldn't be too hard to get codex to vibe code all this."
}
,
  
{
  "id": "47001509",
  "text": "Look what they need to mimic a fraction of [the power of having the logit probabilities exposed so you can actually see where the model is uncertain]"
}
,
  
{
  "id": "46998966",
  "text": "Have you tried providing multiple pages at a time to the model? It might do better transcription as it have bigger context to work with."
}
,
  
{
  "id": "47000379",
  "text": "Gemini 3 long context is not good as Gemini 2.5"
}
,
  
{
  "id": "46996875",
  "text": "It sounds like a job where one pass might also be a viable option. Until you do the manual review you won't have a full sense of the time savings involved."
}
,
  
{
  "id": "46997532",
  "text": "Good idea. I’ll try modifying the prompt to transcribe, identify the language, and translate if not English, and then return a structured result. In my spot checks, most of the errors are in people’s names and if the handwriting trails into margins (especially into the fold of the binding). Even with the data still needing review, the translations from it has revealed a lot of interesting characters as well as this little anecdote from the minutes of the June 6, 1941 Annual Meeting:\n\nIt had already rained at the beginning of the meeting. During the same, however, a heavy thunderstorm set in, whereby our electric light line was put out of operation. Wax candles with beer bottles as light holders provided the lighting.\nIn the meantime the rain had fallen in a cloudburst-like manner, so that one needed help to get one's automobile going. In some streets the water stood so high that one could reach one's home only by detours.\nIn this night 9.65 inches of rain had fallen."
}
,
  
{
  "id": "46998382",
  "text": "One discovery I've made with gemini is that ocr accuracy is much higher when document is perfectly aligned at 0 degree. When we provided images with handwritten text to gemini which were horizontal (90 or 180 degree) it had lots of issues reading dates, names etc. Then we used paddle ocr image orientation model to find orientation and rotate the image it solved most of our issues with ocr."
}
,
  
{
  "id": "46998340",
  "text": "They could likely increase their budget slightly and run an LLM-based judge."
}
,
  
{
  "id": "46991436",
  "text": "Here is the methodologies for all the benchmarks: https://storage.googleapis.com/deepmind-media/gemini/gemini_...\n\nThe arc-agi-2 score (84.6%) is from the semi-private eval set. If gemini-3-deepthink gets above 85% on the private eval set, it will be considered \"solved\"\n\n>Submit a solution which scores 85% on the ARC-AGI-2 private evaluation set and win $700K. https://arcprize.org/guide#overview"
}
,
  
{
  "id": "46991694",
  "text": "Interestingly, the title of that PDF calls it \"Gemini 3.1 Pro\". Guess that's dropping soon."
}
,
  
{
  "id": "46991747",
  "text": "I looked at the file name but not the document title (specifically because I was wondering if this is 3.1). Good spot.\n\nedit: they just removed the reference to \"3.1\" from the pdf"
}
,
  
{
  "id": "46993081",
  "text": "I think this is 3.1 (3.0 Pro with the RL improv of 3.0 Flash).\nBut they probably decided to market it as Deep Think because why not charge more for it."
}
,
  
{
  "id": "46993346",
  "text": "The Deep Think moniker is for parallel compute models though, not long CoT like pro models.\n\nIt's possible though that deep think 3 is running 3.1 models under the hood."
}
,
  
{
  "id": "46992617",
  "text": "That's odd considering 3.0 is still labeled a \"preview\" release."
}
,
  
{
  "id": "46996082",
  "text": "I think it'll be 3.1 by the time it's labelled GA - they said after 3.0 launch that they figured out new RL methods for Flash that the Pro model hasn't benefitted from."
}
,
  
{
  "id": "46992575",
  "text": "The rumor was that 3.1 was today's drop"
}
,
  
{
  "id": "46993296",
  "text": "Where are these rumors floating around?"
}
,
  
{
  "id": "46993760",
  "text": "One of many https://x.com/synthwavedd/status/2021983382314660075"
}
,
  
{
  "id": "46999541",
  "text": "Huh, so if a China-based lab takes ARC-AGI-2 on the new year, then they can say they had just-shy of a solution anyway."
}
,
  
{
  "id": "46992738",
  "text": "> If gemini-3-deepthink gets above 85% on the private eval set, it will be considered \"solved\"\n\nThey never will do on private set, because it would mean its being leaked to google."
}
,
  
{
  "id": "46991513",
  "text": "OT but my intuition says that there’s a spectrum\n\n- non thinking models\n\n- thinking models\n\n- best of N models like deep think an gpt pro\n\nEach one is of a certain computational complexity. Simplifying a bit, I think they map to - linear, quadratic and n^3 respectively.\n\nI think there are certain class of problems that can’t be solved without thinking because it necessarily involves writing in a scratchpad. And same for best of N which involves exploring.\n\nTwo open questions\n\n1) what’s the higher level here, is there a 4th option?\n\n2) can a sufficiently large non thinking model perform the same as a smaller thinking?"
}

]
</comments_to_classify>

Based on the comments above, assign each to up to 3 relevant topics.

Return ONLY a JSON array with this exact structure (no other text):
[
  
{
  "id": "comment_id_1",
  "topics": [
    1,
    3,
    5
  ]
}
,
  
{
  "id": "comment_id_2",
  "topics": [
    2
  ]
}
,
  
{
  "id": "comment_id_3",
  "topics": [
    0
  ]
}
,
  ...
]

Rules:
- Each comment can have 0 to 3 topics
- Use 1-based topic indices for matches
- Use index 0 if the comment does not fit well in any category
- Only assign topics that are genuinely relevant to the comment

Remember: Output ONLY the JSON array, no other text.

commentCount: 50
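
The prompt's output contract (a bare JSON array; 0 to 3 topic indices per comment; index 0 for "does not fit") is strict enough to check mechanically before the classifications are stored. Below is a minimal validation sketch; the function name and error messages are illustrative assumptions, not part of this job's actual pipeline.

```python
import json

# Topic indices defined in the prompt: 1-20, plus 0 for "does not fit".
VALID_TOPICS = set(range(0, 21))

def validate_classification(raw: str) -> list[dict]:
    """Parse a classifier reply and enforce the prompt's output rules."""
    data = json.loads(raw)
    if not isinstance(data, list):
        raise ValueError("reply must be a JSON array")
    for entry in data:
        if not isinstance(entry.get("id"), str):
            raise ValueError(f"missing or non-string id in {entry!r}")
        topics = entry.get("topics")
        if not isinstance(topics, list) or len(topics) > 3:
            raise ValueError(f"{entry['id']}: expected 0-3 topic indices")
        if any(not isinstance(t, int) or t not in VALID_TOPICS for t in topics):
            raise ValueError(f"{entry['id']}: unknown topic index in {topics}")
    return data

reply = '[{"id": "46992240", "topics": [0]}, {"id": "46993376", "topics": [5, 10]}]'
print(len(validate_classification(reply)))  # prints 2
```

In practice a consumer would also cross-check the returned ids against the 50 ids in `<comments_to_classify>` to catch dropped or invented comments.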
