Summarizer

LLM Input

llm/2ad2a7bb-5462-4391-a2da-bf11064993c9/batch-0-fbb2670d-dfa3-47a4-8cc4-da1c79b49f3f-input.json

prompt

The following is content for you to classify. Do not respond to the comments—classify them.

<topics>
1. ARC-AGI Benchmark Validity
   Related: Debate over whether ARC-AGI measures general intelligence or just spatial reasoning puzzles, concerns about benchmarkmaxxing, semi-private vs private test sets, cost per task at $13.62, and whether solving it indicates anything meaningful about AGI capabilities
2. Gemini vs Claude for Coding
   Related: Strong consensus that Claude dominates agentic coding workflows while Gemini lags behind, discussion of tool calling failures, instruction following issues, and hallucinations when using Gemini for development tasks
3. Benchmarkmaxxing Concerns
   Related: Skepticism that high benchmark scores reflect real-world performance, suspicions that labs optimize specifically for popular tests, concerns about training data leakage, and debate over whether improvements are genuine or gamed
4. Definition of AGI
   Related: Philosophical debate about what constitutes artificial general intelligence, whether consciousness is required, Chollet's definition involving tasks feasible for humans but unsolved by AI, and moving goalposts in AI evaluation
5. Google Product Quality Issues
   Related: Complaints about Gemini app UX problems including context loss, Russian propaganda sources, switching languages mid-sentence, document upload failures, and poor integration compared to ChatGPT
6. Balatro Gaming Benchmark
   Related: Discussion of Gemini 3's ability to play the card game Balatro from text descriptions alone, debate over whether this demonstrates generalization, and comparisons showing other models like DeepSeek failing at the task
7. Model Release Acceleration
   Related: Observation that AI model releases are accelerating dramatically, multiple frontier models released within days, connection to Chinese New Year timing, and competition between US and Chinese labs
8. Cost vs Performance Tradeoffs
   Related: Analysis of inference costs versus capabilities, Gemini Flash praised for cost-performance ratio, concerns about $13.62 per ARC-AGI task, and debate over what price makes models practical for real applications
9. Deep Research Reliability
   Related: Mixed experiences with AI deep research capabilities, complaints about garbage citations, hallucinated sources, contradictory information, and questions about whether it saves time when sources must be verified
10. Google's Competitive Position
   Related: Debate over whether Google is leading or behind in AI, discussion of their data advantages from YouTube and Books, claims they let competitors think they were behind, and analysis of their strengths in visual AI
11. Pelican on Bicycle Benchmark
   Related: Simon Willison's informal SVG generation test, discussion of whether it's being trained on specifically, quality improvements in latest models, and debate over its validity as a casual benchmark
12. AI Consciousness Claims
   Related: Pushback against suggestions that passing tests indicates consciousness, comparisons to simple programs claiming consciousness, discussion of self-awareness research, and skepticism about anthropomorphizing AI capabilities
13. Test Time Compute Approaches
   Related: Analysis of thinking vs non-thinking models, best-of-N approaches like Deep Think, computational complexity differences, and questions about whether sufficiently large non-thinking models can match smaller thinking ones
14. Real World Task Performance
   Related: Frustration that benchmark gains don't translate to practical improvements, examples of models failing simple debugging tasks, and arguments that actual work product matters more than test scores
15. AI Job Displacement Fears
   Related: Concerns about software engineers being replaced, comparisons to factory worker displacement, debate over whether AI creates or destroys jobs, and skepticism about optimistic narratives from AI company executives
16. Spatial Reasoning Limitations
   Related: Discussion of LLMs struggling with spatial tasks, image orientation affecting OCR accuracy, and whether ARC-AGI improvements indicate genuine spatial reasoning advances or benchmark-specific solutions
17. Model Architecture Secrecy
   Related: Observation that frontier labs no longer share architecture details like parameter counts, shift from technical discussions to capability-focused marketing, and desire for more transparency
18. Academic vs Practical Intelligence
   Related: Distinction between Gemini excelling at academic benchmarks while feeling less useful for practical tasks, discussion of book smart vs street smart analogies for AI capabilities
19. First Proof Mathematical Challenge
   Related: Discussion of newly released unsolved math problems designed to test frontier models, predictions about whether current models can solve genuine research-level mathematics
20. Subscription Pricing Frustration
   Related: Complaints about $250/month Google AI Ultra subscription required for Deep Think access, desire to test new models without platform lock-in, and calls for OpenRouter availability
0. Does not fit well in any category
</topics>

<comments_to_classify>
[
  
{
  "id": "46991443",
  "text": "Arc-AGI-2: 84.6% (vs 68.8% for Opus 4.6)\n\nWow.\n\nhttps://blog.google/innovation-and-ai/models-and-research/ge..."
}
,
  
{
  "id": "46993620",
  "text": "Even before this, Gemini 3 has always felt unbelievably 'general' for me.\nIt can beat Balatro (ante 8) with text description of the game alone[0]. Yeah, it's not an extremely difficult goal for humans, but considering:\n\n1. It's an LLM, not something trained to play Balatro specifically\n\n2. Most (probably >99.9%) players can't do that at the first attempt\n\n3. I don't think there are many people who posted their Balatro playthroughs in text form online\n\nI think it's a much stronger signal of its 'generalness' than ARC-AGI. By the way, Deepseek can't play Balatro at all.\n\n[0]: https://balatrobench.com/"
}
,
  
{
  "id": "46996183",
  "text": "Per BalatroBench, gemini-3-pro-preview makes it to round (not ante) 19.3 ± 6.8 on the lowest difficulty on the deck aimed at new players. Round 24 is ante 8's final round. Per BalatroBench, this includes giving the LLM a strategy guide, which first-time players do not have. Gemini isn't even emitting legal moves 100% of the time."
}
,
  
{
  "id": "46997898",
  "text": "It beats ante eight 9 times out of 15 attempts. I do consider 60% winning chance very good for a first time player.\n\nThe average is only 19.3 rounds because there is a bugged run where Gemini beats round 6 but the game bugs out when it attempts to sell Invisible Joker (a valid move)[0]. That being said, Gemini made a big mistake in round 6 that would have costed it the run at higher difficulty.\n\n[0]: given the existence of bugs like this, perhaps all the LLMs' performances are underestimated."
}
,
  
{
  "id": "46997253",
  "text": "https://balatrobench.com/"
}
,
  
{
  "id": "46997481",
  "text": "Hi, BalatroBench creator here. Yeah, Google models perform well (I guess the long context + world knowledge capabilities). Opus 4.6 looks good on preliminary results (on par with Gemini 3 Pro). I'll add more models and report soon. Tbh, I didn't expect LLMs to start winning runs. I guess I have to move to harder stakes (e.g. red stake)."
}
,
  
{
  "id": "46998167",
  "text": "Thank you for the site! I've got a few suggestions:\n\n1. I think winrate is more telling than the average round number.\n\n2. Some runs are bugged (like Gemini's run 9) and should be excluded from the result. Selling Invisible Joker is always bugged, rendering all the runs with the seed EEEEEE invalid.\n\n3. Instead of giving them \"strategy\" like \"flush is the easiest hand...\" it's fairer to clarify some mechanisms that confuse human players too. e.g. \"played\" vs \"scored\".\n\nEspecially, I think this kind of prompt gives LLM an unfair advantage and can skew the result:\n\n> ### Antes 1-3: Foundation\n\n> - *Priority*: One of your primary goals for this section of the game should be obtaining a solid Chips or Mult joker"
}
,
  
{
  "id": "46997344",
  "text": "My experience also shows that Gemini has unique strength in “generalized” (read: not coding) tasks. Gemini 2.5 Pro and 3 Pro seems stronger at math and science for me, and their Deep Research usually works the hardest, as long as I run it during off-hours. Opus seems to beat Gemini almost “with one hand tied behind its back” in coding, but Gemini is so cheap that it’s usually my first stop for anything that I think is likely to be relatively simple. I never worry about my quota on Gemini like I do with Opus or Chat-GPT.\n\nComparisons generally seem to change much faster than I can keep my mental model updated. But the performance lead of Gemini on more ‘academic’ explorations of science, math, engineering, etc has been pretty stable for the past 4 months or so, which makes it one of the longer-lasting trends for me in comparing foundation models.\n\nI do wish I could more easily get timely access to the “super” models like Deep Think or o3 pro. I never seem to get a response to requesting access, and have to wait for public access models to catch up, at which point I’m never sure if their capabilities have gotten diluted since the initial buzz died down.\n\nThey all still suck at writing an actually good essay/article/literary or research review, or other long-form things which require a lot of experienced judgement to come up with a truly cohesive narrative. I imagine this relates to their low performance in humor - there’s just so much nuance and these tasks represent the pinnacle of human intelligence. Few humans can reliably perform these tasks to a high degree of performance either. I myself am only successful some percentage of the time."
}
,
  
{
  "id": "47002109",
  "text": "> their Deep Research usually works the hardest\n\nThat's sortof damning with faint praise I think. So, for $work I needed to understand the legal landscape for some regulations (around employment screening) so I kicked off a deep research for all the different countries. That was fineish, but tended to go off the rails towards the end.\n\nSo, then I split it out into Americas, APAC and EMEA requirements. This time, I spent the time checking all of the references (or almost all anyways), and they were garbage. Like, it ~invented a term and started telling me about this new thing, and when I looked at the references they had no information about the thing it was talking about.\n\nIt linked to reddit for an employment law question. When I read the reddit thread, it didn't even have any support for the claims. It contradicted itself from the beginning to the end. It claimed something was true in Singapore, based on a Swedish source.\n\nLike, I really want this to work as it would be a massive time-saver, but I reckon that right now, it only saves time if you don't want to check the sources, as they are garbage. And Google make a business of searching the web, so it's hard for me to understand why this doesn't work better.\n\nI'm becoming convinced that this technology doesn't work for this purpose at the moment. I think that it's technically possible, but none of the major AI providers appear to be able to do this well."
}
,
  
{
  "id": "46996197",
  "text": "Agreed. Gemini 3 Pro for me has always felt like it has had a pretraining alpha if you will. And many data points continue to support that. Even as flash, which was post trained with different techniques than pro is good or equivalent at tasks which require post training, occasionally even beating pro. (eg: in apex bench from mercor, which is basically a tool calling test - simplifying - flash beats pro). The score on arc agi2 is another datapoint in the same direction. Deepthink is sort of parallel test time compute with some level of distilling and refinement from certain trajectories (guessing based on my usage and understanding) same as gpt-5.2-pro and can extract more because of pretraining datasets.\n\n(i am sort of basing this on papers like limits of rlvr, and pass@k and pass@1 differences in rl posttraining of models, and this score just shows how \"skilled\" the base model was or how strong the priors were. i apologize if this is not super clear, happy to expand on what i am thinking)"
}
,
  
{
  "id": "46995056",
  "text": "It's trained on YouTube data. It's going to get roffle and drspectred at the very least."
}
,
  
{
  "id": "46997822",
  "text": "I don't think it'd need Balatro playthroughs to be in text form though. Google owns YouTube and has been doing automatic transcriptions of vocalized content on most videos these days, so it'd make sense that they used those subtitles, at the very least, as training data."
}
,
  
{
  "id": "46994964",
  "text": "Google has a library of millions of scanned books from their Google Books project that started in 2004. I think we have reason to believe that there are more than a few books about effectively playing different traditional card games in there, and that an LLM trained with that dataset could generalize to understand how to play Balatro from a text description.\n\nNonetheless I still think it's impressive that we have LLMs that can just do this now."
}
,
  
{
  "id": "46995515",
  "text": "Winning in Balatro has very little to do with understanding how to play traditional poker. Yes, you do need a basic knowledge of different types of poker hands, but the strategy for succeeding in the game is almost entirely unrelated to poker strategy."
}
,
  
{
  "id": "46995344",
  "text": "If it tried to play Balatro using knowledge of, e.g., poker, it would lose badly rather than win. Have you played?"
}
,
  
{
  "id": "46995456",
  "text": "I think I weakly disagree. Poker players have intuitive sense of the statistics of various hand types showing up, for instance, and that can be a useful clue as to which build types are promising."
}
,
  
{
  "id": "46995550",
  "text": ">Poker players have intuitive sense of the statistics of various hand types showing up, for instance, and that can be a useful clue as to which build types are promising.\n\nMaybe in the early rounds, but deck fixing (e.g. Hanged Man, Immolate, Trading Card, DNA, etc) quickly changes that. Especially when pushing for \"secret\" hands like the 5 of a kind, flush 5, or flush house."
}
,
  
{
  "id": "46994026",
  "text": "DeepSeek hasn't been SotA in at least 12 calendar months, which might as well be a decade in LLM years"
}
,
  
{
  "id": "46994072",
  "text": "What about Kimi and GLM?"
}
,
  
{
  "id": "46995127",
  "text": "These are well behind the general state of the art (1yr or so), though they're arguably the best openly-available models."
}
,
  
{
  "id": "46996977",
  "text": "Idk man, GLM 5 in my tests matches opus 4.5 which is what, two months old?"
}
,
  
{
  "id": "46999641",
  "text": "4.5 was never sota"
}
,
  
{
  "id": "46995938",
  "text": "According to artificial analysis ranking, GLM-5 is at #4 after Claude Opus 4.5, GPT-5.2-xhigh and Claude Opus 4.6 ."
}
,
  
{
  "id": "46999124",
  "text": "Yes, agentic-wise, Claude Opus is best. Complex coding is GPT-5.x. But for smartness, I always felt Gemini 3 Pro is best."
}
,
  
{
  "id": "47000458",
  "text": "Can you give an example of smartness where Gemini is better than the other 2? I have found Gemini 3 pro the opposite of smartness on the tasks I gave him (evaluation, extraction, copy writing, judging, synthesising ) with gpt 5.2 xhigh first and opus 4.5/4.6 second. Not to mention it likes to hallucinate quite a bit ."
}
,
  
{
  "id": "46997399",
  "text": "Strange, because I could not for the life of me get Gemini 3 to follow my instructions the other day to work through an example with a table, Claude got it first try."
}
,
  
{
  "id": "46997765",
  "text": "Claude is king for agentic workflows right now because it’s amazing at tool calling and following instructions well (among other things)"
}
,
  
{
  "id": "46999648",
  "text": "Codex ranks higher for instruction following"
}
,
  
{
  "id": "46994331",
  "text": "But... there's Deepseek v3.2 in your link (rank 7)"
}
,
  
{
  "id": "46997607",
  "text": "Grok (rank 6) and below didn't beat the game even once.\n\nEdit: in my original comment I said it wrong. I meant to say Deepseek can't beat Balatro at all, not can't play. Sorry"
}
,
  
{
  "id": "46994317",
  "text": "> . I don't think there are many people who posted their Balatro playthroughs in text form online\n\nThere are * tons * of balatro content on YouTube though, and it makes absolutely zero doubt that Google is using YouTube content to train their model."
}
,
  
{
  "id": "46994365",
  "text": "Yeah, or just the steam text guides would be a huge advantage.\n\nI really doubt it's playing completely blind"
}
,
  
{
  "id": "46998172",
  "text": "Not sure it's 99.9%. I beat it on my first attempt, but that was probably mostly luck."
}
,
  
{
  "id": "46996059",
  "text": "How does it do on gold stake?"
}
,
  
{
  "id": "46994635",
  "text": "> Most (probably >99.9%) players can't do that at the first attempt\n\nEh, both myself and my partner did this. To be fair, we weren’t going in completely blind, and my partner hit a Legendary joker, but I think you might be slightly overstating the difficulty. I’m still impressed that Gemini did it."
}
,
  
{
  "id": "46992341",
  "text": "Weren't we barely scraping 1-10% on this with state of the art models a year ago and it was considered that this is the final boss, ie solve this and its almost AGI-like?\n\nI ask because I cannot distinguish all the benchmarks by heart."
}
,
  
{
  "id": "46993367",
  "text": "François Chollet, creator of ARC-AGI, has consistently said that solving the benchmark does not mean we have AGI. It has always been meant as a stepping stone to encourage progress in the correct direction rather than as an indicator of reaching the destination. That's why he is working on ARC-AGI-3 (to be released in a few weeks) and ARC-AGI-4.\n\nHis definition of reaching AGI, as I understand it, is when it becomes impossible to construct the next version of ARC-AGI because we can no longer find tasks that are feasible for normal humans but unsolved by AI."
}
,
  
{
  "id": "46994613",
  "text": "> His definition of reaching AGI, as I understand it, is when it becomes impossible to construct the next version of ARC-AGI because we can no longer find tasks that are feasible for normal humans but unsolved by AI.\n\nThat is the best definition I've yet to read. If something claims to be conscious and we can't prove it's not, we have no choice but to believe it.\n\nThats said, I'm reminded of the impossible voting tests they used to give black people to prevent them from voting. We dont ask nearly so much proof from a human, we take their word for it. On the few occasions we did ask for proof it inevitably led to horrific abuse.\n\nEdit: The average human tested scores 60%. So the machines are already smarter on an individual basis than the average human."
}
,
  
{
  "id": "46994798",
  "text": "> If something claims to be conscious and we can't prove it's not, we have no choice but to believe it.\n\nThis is not a good test.\n\nA dog won't claim to be conscious but clearly is, despite you not being able to prove one way or the other.\n\nGPT-3 will claim to be conscious and (probably) isn't, despite you not being able to prove one way or the other."
}
,
  
{
  "id": "46998319",
  "text": "Agreed, it's a truly wild take. While I fully support the humility of not knowing, at a minimum I think we can say determinations of consciousness have some relation to specific structure and function that drive the outputs, and the actual process of deliberating on whether there's consciousness would be a discussion that's very deep in the weeds about architecture and processes.\n\nWhat's fascinating is that evolution has seen fit to evolve consciousness independently on more than one occasion from different branches of life. The common ancestor of humans and octopi was, if conscious, not so in the rich way that octopi and humans later became. And not everything the brain does in terms of information processing gets kicked upstairs into consciousness. Which is fascinating because it suggests that actually being conscious is a distinctly valuable form of information parsing and problem solving for certain types of problems that's not necessarily cheaper to do with the lights out. But everything about it is about the specific structural characterizations and functions and not just whether it's output convincingly mimics subjectivity."
}
,
  
{
  "id": "46995534",
  "text": "An LLM will claim whatever you tell it to claim. (In fact this Hacker News comment is also conscious.) A dog won’t even claim to be a good boy."
}
,
  
{
  "id": "47000861",
  "text": "A classic relevant comic:\n\nhttps://www.threepanelsoul.com/comic/dog-philosophy"
}
,
  
{
  "id": "46997764",
  "text": "My dog wags his tail hard when I ask \"hoosagoodboi?\". Pretty definitive I'd say."
}
,
  
{
  "id": "46995011",
  "text": ">because we can no longer find tasks that are feasible for normal humans but unsolved by AI.\n\n\"Answer \"I don't know\" if you don't know an answer to one of the questions\""
}
,
  
{
  "id": "46996182",
  "text": "I've been surprised how difficult it is for LLMs to simply answer \"I don't know.\"\n\nIt also seems oddly difficult for them to 'right-size' the length and depth of their answers based on prior context. I either have to give it a fixed length limit or put up with exhaustive answers."
}
,
  
{
  "id": "47000318",
  "text": "> I've been surprised how difficult it is for LLMs to simply answer \"I don't know.\"\n\nIt's very difficult to train for that. Of course you can include a Question+Answer pair in your training data for which the answer is \"I don't know\" but in that case where you have a ready question you might as well include the real answer anyways, or else you're just training your LLM to be less knowledgeable than the alternative. But then, if you never have the pattern of \"I don't know\" in the training data it also won't show up in results, so what should you do?\n\nIf you could predict the blind spots ahead of time you'd plug them up, either with knowledge or with \"idk\". But nobody can predict the blind spots perfectly, so instead they become the main hallucinations."
}
,
  
{
  "id": "46996880",
  "text": "The best pro/research-grade models from Google and OpenAI now have little difficulty recognizing when they don't know how or can't find enough information to solve a given problem. The free chatbot models rarely will, though."
}
,
  
{
  "id": "46997995",
  "text": "This seems true for info not in the question - eg. \"Calculate the volume of a cylinder with height 10 meters\".\n\nHowever it is less true with info missing from the training data - ie. \"I have a Diode marked UM16, what is the maximum current at 125C?\""
}
,
  
{
  "id": "46998566",
  "text": "This seems fine...?\n\nhttps://chatgpt.com/share/698e992b-f44c-800b-a819-f899e83da2...\n\nI don't see anything wrong with its reasoning. UM16 isn't explicitly mentioned in the data sheet, but the UM prefix is listed in the 'Device marking code' column. The model hedges its response accordingly (\"If the marking is UM16 on an SMA/DO-214AC package...\") and reads the graph in Fig. 1 correctly.\n\nOf course, it took 18 minutes of crunching to get the answer, which seems a tad excessive."
}
,
  
{
  "id": "47000603",
  "text": "Indeed that answer is awesome. Much better than Gemini 2.5 pro which invented a 16 kilovolt diode which it just hoped would be marked \"UM16\"."
}

]
</comments_to_classify>

For each comment above, assign up to 3 relevant topics.

Return ONLY a JSON array with this exact structure (no other text):
[
  {"id": "comment_id_1", "topics": [1, 3, 5]},
  {"id": "comment_id_2", "topics": [2]},
  {"id": "comment_id_3", "topics": [0]},
  ...
]

Rules:
- Each comment can have 0 to 3 topics
- Use 1-based topic indices for matches
- Use index 0 if the comment does not fit well in any category
- Only assign topics that are genuinely relevant to the comment

Remember: Output ONLY the JSON array, no other text.

commentCount

50
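The prompt above pins down a strict output schema: a JSON array of `{"id", "topics"}` objects, each with 0 to 3 topic indices, where 0 means "no fit" and 1-20 index the listed topics. A consumer of this batch job could check a model reply against those rules with a small validator. This is a minimal sketch, not part of the job itself; the function name `validate_classification` and its signature are assumptions for illustration.

```python
import json

def validate_classification(reply: str, valid_ids: set[str], n_topics: int = 20) -> list:
    """Parse the model's JSON reply and enforce the prompt's rules:
    every entry must reference a known comment id and carry at most 3
    topic indices, each either 0 ("does not fit") or in 1..n_topics."""
    entries = json.loads(reply)  # raises json.JSONDecodeError on non-JSON output
    for entry in entries:
        if entry["id"] not in valid_ids:
            raise ValueError(f"unknown comment id: {entry['id']}")
        topics = entry["topics"]
        if len(topics) > 3:
            raise ValueError(f"{entry['id']}: more than 3 topics assigned")
        if any(not (0 <= t <= n_topics) for t in topics):
            raise ValueError(f"{entry['id']}: topic index out of range")
    return entries
```

A reply that passes simply comes back parsed; a malformed one fails loudly, which makes it easy to retry the batch item rather than silently ingest a bad classification.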
