Summarizer

LLM Input

llm/2ad2a7bb-5462-4391-a2da-bf11064993c9/batch-4-a06c376c-631f-4eaa-a216-bba5f1206632-input.json

prompt

The following is content for you to classify. Do not respond to the comments—classify them.

<topics>
1. ARC-AGI Benchmark Validity
   Related: Debate over whether ARC-AGI measures general intelligence or just spatial reasoning puzzles, concerns about benchmarkmaxxing, semi-private vs private test sets, cost per task at $13.62, and whether solving it indicates anything meaningful about AGI capabilities
2. Gemini vs Claude for Coding
   Related: Strong consensus that Claude dominates agentic coding workflows while Gemini lags behind, discussion of tool calling failures, instruction following issues, and hallucinations when using Gemini for development tasks
3. Benchmarkmaxxing Concerns
   Related: Skepticism that high benchmark scores reflect real-world performance, suspicions that labs optimize specifically for popular tests, concerns about training data leakage, and debate over whether improvements are genuine or gamed
4. Definition of AGI
   Related: Philosophical debate about what constitutes artificial general intelligence, whether consciousness is required, Chollet's definition involving tasks feasible for humans but unsolved by AI, and moving goalposts in AI evaluation
5. Google Product Quality Issues
   Related: Complaints about Gemini app UX problems including context loss, Russian propaganda sources, switching languages mid-sentence, document upload failures, and poor integration compared to ChatGPT
6. Balatro Gaming Benchmark
   Related: Discussion of Gemini 3's ability to play the card game Balatro from text descriptions alone, debate over whether this demonstrates generalization, and comparisons showing other models like DeepSeek failing at the task
7. Model Release Acceleration
   Related: Observation that AI model releases are accelerating dramatically, multiple frontier models released within days, connection to Chinese New Year timing, and competition between US and Chinese labs
8. Cost vs Performance Tradeoffs
   Related: Analysis of inference costs versus capabilities, Gemini Flash praised for cost-performance ratio, concerns about $13.62 per ARC-AGI task, and debate over what price makes models practical for real applications
9. Deep Research Reliability
   Related: Mixed experiences with AI deep research capabilities, complaints about garbage citations, hallucinated sources, contradictory information, and questions about whether it saves time when sources must be verified
10. Google's Competitive Position
   Related: Debate over whether Google is leading or behind in AI, discussion of their data advantages from YouTube and Books, claims they let competitors think they were behind, and analysis of their strengths in visual AI
11. Pelican on Bicycle Benchmark
   Related: Simon Willison's informal SVG generation test, discussion of whether it's being trained on specifically, quality improvements in latest models, and debate over its validity as a casual benchmark
12. AI Consciousness Claims
   Related: Pushback against suggestions that passing tests indicates consciousness, comparisons to simple programs claiming consciousness, discussion of self-awareness research, and skepticism about anthropomorphizing AI capabilities
13. Test Time Compute Approaches
   Related: Analysis of thinking vs non-thinking models, best-of-N approaches like Deep Think, computational complexity differences, and questions about whether sufficiently large non-thinking models can match smaller thinking ones
14. Real World Task Performance
   Related: Frustration that benchmark gains don't translate to practical improvements, examples of models failing simple debugging tasks, and arguments that actual work product matters more than test scores
15. AI Job Displacement Fears
   Related: Concerns about software engineers being replaced, comparisons to factory worker displacement, debate over whether AI creates or destroys jobs, and skepticism about optimistic narratives from AI company executives
16. Spatial Reasoning Limitations
   Related: Discussion of LLMs struggling with spatial tasks, image orientation affecting OCR accuracy, and whether ARC-AGI improvements indicate genuine spatial reasoning advances or benchmark-specific solutions
17. Model Architecture Secrecy
   Related: Observation that frontier labs no longer share architecture details like parameter counts, shift from technical discussions to capability-focused marketing, and desire for more transparency
18. Academic vs Practical Intelligence
   Related: Distinction between Gemini excelling at academic benchmarks while feeling less useful for practical tasks, discussion of book smart vs street smart analogies for AI capabilities
19. First Proof Mathematical Challenge
   Related: Discussion of newly released unsolved math problems designed to test frontier models, predictions about whether current models can solve genuine research-level mathematics
20. Subscription Pricing Frustration
   Related: Complaints about $250/month Google AI Ultra subscription required for Deep Think access, desire to test new models without platform lock-in, and calls for OpenRouter availability
0. Does not fit well in any category
</topics>

<comments_to_classify>
[
  
{
  "id": "47000143",
  "text": "In fact, many Asian countries use lunisolar calendars, which basically follow the moon for the months but add an extra month every few years so the seasons don't drift.\n\nAs these calendars also rely on time zones for date calculation, there are rare occasions where the New Year start date differs by an entire month between 2 countries."
}
,
  
{
  "id": "46999546",
  "text": "If that's a sole problem, it should be called \"Chinese-Japanese-Korean-whateverelse new year\" instead. Maybe \"East Asian new year\" for short. (Not that there are absolutely no discrepancies within them, but they are so similar enough that new year's day almost always coincide.)"
}
,
  
{
  "id": "47001381",
  "text": "It's not Japanese either.\n\nThis non-problem sounds like it's on the same scale as \"The British Isles\", a term which is mildly annoying to Irish people but in common use everywhere else."
}
,
  
{
  "id": "46994104",
  "text": "I'm having trouble just keeping track of all these different types of models.\n\nIs \"Gemini 3 Deep Think\" even technically a model? From what I've gathered, it is built on top of Gemini 3 Pro, and appears to be adding specific thinking capabilities, more akin to adding subagents than a truly new foundational model like Opus 4.6.\n\nAlso, I don't understand the comments about Google being behind in agentic workflows. I know that the typical use of, say, Claude Code feels agentic, but also a lot of folks are using separate agent harnesses like OpenClaw anyway. You could just as easily plug Gemini 3 Pro into OpenClaw as you can Opus, right?\n\nCan someone help me understand these distinctions? Very confused, especially regarding the agent terminology. Much appreciated!"
}
,
  
{
  "id": "46997115",
  "text": "The term “model” is one of those super overloaded terms. Depending on the conversation it can mean:\n\n- a product (most accurate here imo)\n\n- a specific set of weights in a neural net\n\n- a general architecture or family of architectures (BERT models)\n\nSo while you could argue this is a “model” in the broadest sense of the term, it’s probably more descriptive to call it a product. Similarly we call LLMs “language” models even if they can do a lot more than that, for example draw images."
}
,
  
{
  "id": "47000175",
  "text": "I'm pretty sure only the second is properly called a model, and \"BERT models\" are simply models with the BERT architecture."
}
,
  
{
  "id": "47001890",
  "text": "It depends on time. 5 years ago it was quite well defined that it’s the last one, maybe the second one in some context. Especially when distinction was important, it was always the last one. In our case it was. We trained models to have weights. We even stored models and weights separately, because models change slower than weights. You could choose a model and a set of weights, and run them. You could change weights any time.\n\nThen marketing, and huge amount of capital came."
}
,
  
{
  "id": "47001930",
  "text": "It seems unlikely \"model\" was ever equivalent in meaning to \"architecture\". Otherwise there would be just one \"CNN model\" or just one \"transformer model\" insofar there is a single architecture involved."
}
,
  
{
  "id": "46994641",
  "text": "> Also, I don't understand the comments about Google being behind in agentic workflows.\n\nIt has to do with how the model is RL'd. It's not that Gemini can't be used with various agentic harnesses, like open code or open claw or theoretically even claude code. It's just that the model is trained less effectively to work with those harnesses, so it produces worse results."
}
,
  
{
  "id": "46994137",
  "text": "There are hints this is a preview to Gemini 3.1."
}
,
  
{
  "id": "47001352",
  "text": "More focus has been put on post-training recently. Where a full model training run can take a month and often requires multiple tries because it can collapse and fail, post-training is don't on the order of 5 or 6 days.\n\nMy assumption is that they're all either pretty happy with their base models or unwilling to do those larger runs, and post-training is turning out good results that they release quickly."
}
,
  
{
  "id": "46997231",
  "text": "So, yes, for the past couple weeks it has felt that way to me. But it seems to come in fits and starts. Maybe that will stop being the case, but that's how it's felt to me for awhile."
}
,
  
{
  "id": "47001548",
  "text": "its cause of a chain of events.\n\nNext week Chinese New year -> Chinese labs release all the models at once before it starts -> US labs respond with what they have already prepared\n\nalso note that even in US labs a large proportion of researchers and engineers are chinese and many celebrate the Chinese New Year too.\n\nTLDR: Chinese New Year. Happy Horse year everybody!"
}
,
  
{
  "id": "46993464",
  "text": "Fast takeoff."
}
,
  
{
  "id": "46999360",
  "text": "They are spending literal trillions. It may even accelerate"
}
,
  
{
  "id": "46994115",
  "text": "There's more compute now than before."
}
,
  
{
  "id": "46993545",
  "text": "Anthropic took the day off to do a $30B raise at a $380B valuation."
}
,
  
{
  "id": "46993732",
  "text": "Most ridiculous valuation in the history of markets. Cant wait to watch these compsnies crash snd burn when people give up on the slot machine."
}
,
  
{
  "id": "46995113",
  "text": "As usual don't take financial advice from HN folks!"
}
,
  
{
  "id": "46997594",
  "text": "not as if you could get in on it even if you wanted to"
}
,
  
{
  "id": "46993830",
  "text": "WeWork almost IPO’s at $50bn. It was also a nice crash and burn."
}
,
  
{
  "id": "46994432",
  "text": "Why? They had $10+ billion arr run rate in 2025 trippeled from 2024\nI mean 30x is a lot but also not insane at that growth rate right?"
}
,
  
{
  "id": "46995587",
  "text": "It's a 13 days old account with IHateAI handle."
}
,
  
{
  "id": "46994059",
  "text": "They are using the current models to help develop even smarter models. Each generation of model can help even more for the next generation.\n\nI don’t think it’s hyperbolic to say that we may be only a single digit number of years away from the singularity."
}
,
  
{
  "id": "46994162",
  "text": "I must be holding these things wrong because I'm not seeing any of these God like superpowers everyone seem to enjoy."
}
,
  
{
  "id": "46994707",
  "text": "Who said they’re godlike today?\n\nAnd yes, you are probably using them wrong if you don’t find them useful or don’t see the rapid improvement."
}
,
  
{
  "id": "46994852",
  "text": "Let's come back in 12 months and discuss your singularity then. Meanwhile I spent like $30 on a few models as a test yesterday, none of them could tell me why my goroutine system was failing, even though it was painfully obvious (I purposefully added one too many wg.Done), gemini, codex, minimax 2.5, they all shat the bed on a very obvious problem but I am to believe they're 98% conscious and better at logic and math than 99% of the population.\n\nEvery new model release neckbeards come out of the basements to tell us the singularity will be there in two more weeks"
}
,
  
{
  "id": "46995436",
  "text": "On the flip side, twice I put about 800K tokens of code into Gemini and asked it to find why my code was misbehaving, and it found it.\n\nThe logic related to the bug wasn't all contained in one file, but across several files.\n\nThis was Gemini 2.5 Pro. A whole generation old."
}
,
  
{
  "id": "46997859",
  "text": "Mind sharing the file?\n\nAlso, did you use Codex 5.3 Xhigh through the Codex CLI or Codex App?"
}
,
  
{
  "id": "46994984",
  "text": "You are fighting straw men here. Any further discussion would be pointless."
}
,
  
{
  "id": "46995609",
  "text": "Of course, n-1 wasn't good enough but n+1 will be singularity, just two more weeks my dudes, two more week... rinse and repeat ad infinitum"
}
,
  
{
  "id": "46995771",
  "text": "Like I said, pointless strawmanning.\n\nYou’ve once again made up a claim of “two more weeks” to argue against even though it’s not something anybody here has claimed.\n\nIf you feel the need to make an argument against claims that exist only in your head, maybe you can also keep the argument only in your head too?"
}
,
  
{
  "id": "46997968",
  "text": "It's presumably a reference to this saying: https://www.urbandictionary.com/define.php?term=2%20more%20w..."
}
,
  
{
  "id": "47000879",
  "text": "It's basically bunch of people who see themselves as too smart to believe in God, instead they have just replaced it with AI and Singularity and attribute similar stuff to it eg. eternal life which is just heaven in religion. Amodei was hawking doubling of human lifespan to a bunch of boomers not too long ago. Ponce de León also went to search for the fountain of youth. It's a very common theme across human history. AI is just the new iteration where they mirror all their wishes and hopes."
}
,
  
{
  "id": "46995244",
  "text": "Post the file here"
}
,
  
{
  "id": "46994954",
  "text": "Meanwhile I've been using Kimi K2T and K2.5 to work in Go with a fair amount of concurrency and it's been able to write concurrent Go code and debug issues with goroutines equal to, and much more complex then, your issue, involving race conditions and more, just fine.\n\nProjects:\n\nhttps://github.com/alexispurslane/oxen\n\nhttps://github.com/alexispurslane/org-lsp\n\n(Note that org-lsp has a much improved version of the same indexer as oxen; the first was purely my design, the second I decided to listen to K2.5 more and it found a bunch of potential race conditions and fixed them)\n\nshrug"
}
,
  
{
  "id": "46995390",
  "text": "Out of curiosity, did you give a test for them to validate the code?\n\nI had a test failing because I introduced a silly comparison bug (> instead of <), and claude 4.6 opus figured out it wasn't the test the problem, but the code and fixed the bug (which I had missed)."
}
,
  
{
  "id": "46995558",
  "text": "There was a test and a very useful golang error that literally explain what was wrong. The model tried implementing a solution, failed and when I pointed out the error most of them just rolled back the \"solution\""
}
,
  
{
  "id": "46997973",
  "text": "What exact models were you using? And with what settings? 4.6 / 5.3 codex both with thinking / high modes?"
}
,
  
{
  "id": "47000767",
  "text": "minimax 2.5, kimi k2.5, codex 5.2, gemini 3 flash and pro, glm 4.7, devstral2 123b, etc."
}
,
  
{
  "id": "46995987",
  "text": "Ok, thanks for the info"
}
,
  
{
  "id": "46997884",
  "text": "> I purposefully added one too many wg.Done\n\nWhat do you believe this shows? Sometimes I have difficulty finding bugs in other people's code when they do things in ways I would never use. I can rewrite their code so it works, but I can't necessarily quickly identify the specific bug.\n\nExpecting a model to be perfect on every problem isn't reasonable. No known entity is able to do that. AIs aren't supposed to be gods.\n\n(Well not yet anyway - there is as yet insufficient data for a meaningful answer.)"
}
,
  
{
  "id": "46997871",
  "text": "It's hard to evaluate \"logic\" and \"math\", since they're made up of many largely disparate things. But I think modern AI models are clearly better at coding, for example, than 99% of the population. If you asked 100 people at your local grocery store why your goroutine system was failing, do you think multiple of them would know the answer?"
}
,
  
{
  "id": "46997330",
  "text": "> using the current models to help develop even smarter models.\n\nThat statement is plausible. However, extrapolating that to assert all the very different things which must be true to enable any form of 'singularity' would be a profound category error. There are many ways in which your first two sentences can be entirely true, while your third sentence requires a bunch of fundamental and extraordinary things to be true for which there is currently zero evidence.\n\nThings like LLMs improving themselves in meaningful and novel ways and then iterating that self-improvement over multiple unattended generations in exponential runaway positive feedback loops resulting in tangible, real-world utility. All the impressive and rapid achievements in LLMs to date can still be true while major elements required for Foom-ish exponential take-off are still missing."
}
,
  
{
  "id": "46995491",
  "text": "> I don’t think it’s hyperbolic to say that we may be only a single digit number of years away from the singularity.\n\nWe're back to singularity hype, but let's be real: benchmark gains are meaningless in the real world when the primary focus has shifted to gaming the metrics"
}
,
  
{
  "id": "46995684",
  "text": "Ok, here I am living in the real world finding these models have advanced incredibly over the past year for coding.\n\nBenchmaxxing exists, but that’s not the only data point. It’s pretty clear that models are improving quickly in many domains in real world usage."
}
,
  
{
  "id": "46998910",
  "text": "I use agentic tools daily and SOTA models have certainly improved a lot in the last year. But still in a linear, \"they don't light my repo on fire as often when they get a confusing compiler error\" kind of way, not a \"I would now trust Opus 4.6 to respond to every work email and hands-off manage my banking and investment portfolio\" kind of way.\n\nThey're still afflicted by the same fundamental problems that hold LLMs back from being a truly autonomous \"drop-in human replacement\" that would enable an entire new world of use cases.\n\nAnd finally live up to the hype/dreams many of us couldn't help but feeling was right around in the corner circa 2022/3 when things really started taking off."
}
,
  
{
  "id": "46996715",
  "text": "Yet even Anthropic has shown the downsides to using them. I don't think it is a given that improvements in models scores and capabilities + being able to churn code as fast as we can will lead us to a singularity, we'll need more than that."
}
,
  
{
  "id": "46998926",
  "text": "I agree completely. I think we're in alignment with Elon Musk who says that AI will bypass coding entirely and create the binary directly.\n\nIt's going to be an exciting year."
}
,
  
{
  "id": "46999559",
  "text": "There’s about as much sense doing this as there is in putting datacenters in orbit, i.e. it isn’t impossible, but literally any other option is better."
}

]
</comments_to_classify>

Based on the comments above, assign each comment to up to 3 relevant topics.

Return ONLY a JSON array with this exact structure (no other text):
[
  
{
  "id": "comment_id_1",
  "topics": [
    1,
    3,
    5
  ]
}
,
  
{
  "id": "comment_id_2",
  "topics": [
    2
  ]
}
,
  
{
  "id": "comment_id_3",
  "topics": [
    0
  ]
}
,
  ...
]

Rules:
- Each comment can have 0 to 3 topics
- Use 1-based topic indices for matches
- Use index 0 if the comment does not fit well in any category
- Only assign topics that are genuinely relevant to the comment

Remember: Output ONLY the JSON array, no other text.

commentCount

50
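
For downstream processing, here is a minimal validation sketch in Python that checks a model response against the rules stated in the prompt. The function name validate_classification, the TOPIC_COUNT constant, and the stand-alone check for index 0 are illustrative assumptions, not part of the job definition.

import json

TOPIC_COUNT = 20  # topics 1-20 above; index 0 means "does not fit"

def validate_classification(raw: str, expected_ids: set[str]) -> list[dict]:
    """Parse a model response and check it against the prompt's rules."""
    entries = json.loads(raw)  # fails fast if the model wrapped the array in prose
    if not isinstance(entries, list):
        raise ValueError("top level must be a JSON array")
    seen: set[str] = set()
    for entry in entries:
        cid, topics = entry["id"], entry["topics"]  # KeyError if a field is missing
        if cid in seen:
            raise ValueError(f"duplicate comment id {cid}")
        seen.add(cid)
        # Rule: each comment can have 0 to 3 topics.
        if not 0 <= len(topics) <= 3:
            raise ValueError(f"{cid}: {len(topics)} topics, expected at most 3")
        # Rule: 1-based indices for matches, 0 for "does not fit".
        if any(not 0 <= t <= TOPIC_COUNT for t in topics):
            raise ValueError(f"{cid}: topic index out of range 0-{TOPIC_COUNT}")
        # Assumption (not stated in the prompt): index 0 should not be
        # combined with real topic indices.
        if 0 in topics and len(topics) > 1:
            raise ValueError(f"{cid}: index 0 must stand alone")
    missing = expected_ids - seen
    if missing:
        raise ValueError(f"no classification for ids {sorted(missing)}")
    return entries

Run over the batch output with the 50 ids from <comments_to_classify>, this flags truncated arrays, out-of-range indices, and duplicate ids before results are aggregated.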
