Summarizer

LLM Input

llm/2ad2a7bb-5462-4391-a2da-bf11064993c9/batch-3-4b9da79c-74d9-4993-8700-6b98dd3942dd-input.json

prompt

The following is content for you to classify. Do not respond to the comments—classify them.

<topics>
1. ARC-AGI Benchmark Validity
   Related: Debate over whether ARC-AGI measures general intelligence or just spatial reasoning puzzles, concerns about benchmarkmaxxing, semi-private vs private test sets, cost per task at $13.62, and whether solving it indicates anything meaningful about AGI capabilities
2. Gemini vs Claude for Coding
   Related: Strong consensus that Claude dominates agentic coding workflows while Gemini lags behind, discussion of tool calling failures, instruction following issues, and hallucinations when using Gemini for development tasks
3. Benchmarkmaxxing Concerns
   Related: Skepticism that high benchmark scores reflect real-world performance, suspicions that labs optimize specifically for popular tests, concerns about training data leakage, and debate over whether improvements are genuine or gamed
4. Definition of AGI
   Related: Philosophical debate about what constitutes artificial general intelligence, whether consciousness is required, Chollet's definition involving tasks feasible for humans but unsolved by AI, and moving goalposts in AI evaluation
5. Google Product Quality Issues
   Related: Complaints about Gemini app UX problems including context loss, Russian propaganda sources, switching languages mid-sentence, document upload failures, and poor integration compared to ChatGPT
6. Balatro Gaming Benchmark
   Related: Discussion of Gemini 3's ability to play the card game Balatro from text descriptions alone, debate over whether this demonstrates generalization, and comparisons showing other models like DeepSeek failing at the task
7. Model Release Acceleration
   Related: Observation that AI model releases are accelerating dramatically, multiple frontier models released within days, connection to Chinese New Year timing, and competition between US and Chinese labs
8. Cost vs Performance Tradeoffs
   Related: Analysis of inference costs versus capabilities, Gemini Flash praised for cost-performance ratio, concerns about $13.62 per ARC-AGI task, and debate over what price makes models practical for real applications
9. Deep Research Reliability
   Related: Mixed experiences with AI deep research capabilities, complaints about garbage citations, hallucinated sources, contradictory information, and questions about whether it saves time when sources must be verified
10. Google's Competitive Position
   Related: Debate over whether Google is leading or behind in AI, discussion of their data advantages from YouTube and Books, claims they let competitors think they were behind, and analysis of their strengths in visual AI
11. Pelican on Bicycle Benchmark
   Related: Simon Willison's informal SVG generation test, discussion of whether it's being trained on specifically, quality improvements in latest models, and debate over its validity as a casual benchmark
12. AI Consciousness Claims
   Related: Pushback against suggestions that passing tests indicates consciousness, comparisons to simple programs claiming consciousness, discussion of self-awareness research, and skepticism about anthropomorphizing AI capabilities
13. Test Time Compute Approaches
   Related: Analysis of thinking vs non-thinking models, best-of-N approaches like Deep Think, computational complexity differences, and questions about whether sufficiently large non-thinking models can match smaller thinking ones
14. Real World Task Performance
   Related: Frustration that benchmark gains don't translate to practical improvements, examples of models failing simple debugging tasks, and arguments that actual work product matters more than test scores
15. AI Job Displacement Fears
   Related: Concerns about software engineers being replaced, comparisons to factory worker displacement, debate over whether AI creates or destroys jobs, and skepticism about optimistic narratives from AI company executives
16. Spatial Reasoning Limitations
   Related: Discussion of LLMs struggling with spatial tasks, image orientation affecting OCR accuracy, and whether ARC-AGI improvements indicate genuine spatial reasoning advances or benchmark-specific solutions
17. Model Architecture Secrecy
   Related: Observation that frontier labs no longer share architecture details like parameter counts, shift from technical discussions to capability-focused marketing, and desire for more transparency
18. Academic vs Practical Intelligence
   Related: Distinction between Gemini excelling at academic benchmarks while feeling less useful for practical tasks, discussion of book smart vs street smart analogies for AI capabilities
19. First Proof Mathematical Challenge
   Related: Discussion of newly released unsolved math problems designed to test frontier models, predictions about whether current models can solve genuine research-level mathematics
20. Subscription Pricing Frustration
   Related: Complaints about $250/month Google AI Ultra subscription required for Deep Think access, desire to test new models without platform lock-in, and calls for OpenRouter availability
0. Does not fit well in any category
</topics>

<comments_to_classify>
[
  
{
  "id": "46993722",
  "text": "Why 5-10 years?\n\nAt current rates, price per equivalent output is dropping at 99.9% over 5 years.\n\nThat's basically $0.01 in 5 years.\n\nDoes it really need to be that cheap to be worth it?\n\nKeep in mind, $0.01 in 5 years is worth less than $0.01 today."
}
,
  
{
  "id": "46994366",
  "text": "Wow that's incredible! Could you show your work?"
}
,
  
{
  "id": "46994962",
  "text": "https://epoch.ai/data-insights/llm-inference-price-trends"
}
,
  
{
  "id": "46995760",
  "text": "A grad student hour is probably more expensive…"
}
,
  
{
  "id": "46996350",
  "text": "In my experience, a grad student hour is treated as free :("
}
,
  
{
  "id": "46998041",
  "text": "You never applied for a grant, have you?"
}
,
  
{
  "id": "47000684",
  "text": "Grad students are incredibly cheap? In the UK for instance their stipend is £20,780 a year..."
}
,
  
{
  "id": "47000998",
  "text": "As it should be. They're a human!"
}
,
  
{
  "id": "46994118",
  "text": "What’s reasonable? It’s less than minimum hourly wage in some countries."
}
,
  
{
  "id": "46994385",
  "text": "Burned in seconds."
}
,
  
{
  "id": "46995804",
  "text": "Getting the work done faster for the same money doesn't make the work more expensive.\n\nYou could slow down the inference to make the task take longer, if $/sec matters."
}
,
  
{
  "id": "47001013",
  "text": "You're right, but I don't think we're getting an hour's worth of work out of single prompts yet. Usually it's an hour's worth of work out of 10 prompts for iteration. Now that's a day's wage for an hour of work. I'm certain the crossover will come soon, but it doesn't feel there yet."
}
,
  
{
  "id": "46993096",
  "text": "That's not a long time in the grand scheme of things."
}
,
  
{
  "id": "46993260",
  "text": "Speak for yourself. Five years is a long time to wait for my plans of world domination."
}
,
  
{
  "id": "46995730",
  "text": "This concerns me actually. With enough people (n>=2) wanting to achieve world domination, we have a problem."
}
,
  
{
  "id": "46996334",
  "text": "It’s not that I want to achieve world domination (imagine how much work that would be!), it’s just that it’s the inevitable path for AI and I’d rather it be me than then next shmuck with a Claude Max subscription."
}
,
  
{
  "id": "47001068",
  "text": "Don't build your castle in someone else's kingdom."
}
,
  
{
  "id": "46996968",
  "text": "I mean everyone with prompt access to the model says these things, but people like Sam and Elon say these things and mean it."
}
,
  
{
  "id": "46995811",
  "text": "n = 2 is Pinky and the Brain."
}
,
  
{
  "id": "46997821",
  "text": "I'm convinced that a substantial fraction of current tech CEOs were unwittingly programmed as children by that show."
}
,
  
{
  "id": "46993860",
  "text": "Yes, you better hurry."
}
,
  
{
  "id": "46999283",
  "text": "Am I the only one that can’t find Gemini useful except if you want something cheap? I don’t get what was the whole code red about or all that PR. To me I see no reason to use Gemini instead of of GPT and Anthropic combo. I should add that I’ve tried it as chat bot, coding through copilot and also as part of a multi model prompt generation.\n\nGemini was always the worst by a big margin. I see some people saying it is smarter but it doesn’t seem smart at all."
}
,
  
{
  "id": "47001492",
  "text": "maybe it depends on the usage, but in my experience most of the times the Gemini produces much better results for coding, especially for optimization parts. The results that were produced by Claude wasn't even near that of Gemini. But again, depends on the task I think."
}
,
  
{
  "id": "47000847",
  "text": "It's garbage really, cannot get how they get so high in benchmarks."
}
,
  
{
  "id": "46999365",
  "text": "You are not the only one, it's to the point where I think that these benchmark results must be faked somehow because it doesn't match my reality at all."
}
,
  
{
  "id": "46999306",
  "text": "I find the quality is not consistent at all and of all the LLMs I use Gemini is the one most likely to just verge off and ignore my instructions."
}
,
  
{
  "id": "47002005",
  "text": "Same, as far as I am concerned, Gemini is optimized for benchmarks.\n\nI mean last week it insisted suddenly on two consecutive prompts that my code was in python. It was in rust."
}
,
  
{
  "id": "46992340",
  "text": "Well, fair comparison would be with GPT-5.x Pro, which is the same class of a model as Gemini Deep Think."
}
,
  
{
  "id": "46994544",
  "text": "Yes but with a significant (logarithmic) increase in cost per task. The ARC-AGI site is less misleading and shows how GPT and Claude are not actually far behind\n\nhttps://arcprize.org/leaderboard"
}
,
  
{
  "id": "46999398",
  "text": "I read somewhere that Google will ultimately always produce the best LLMs, since \"good AI\" relies on massive amounts of data and Google owns the most data.\n\nIs that a based assumption?"
}
,
  
{
  "id": "46998536",
  "text": "At $13.62 per task it's practically unusable for agent tasks due to the cost.\n\nI found that anything over $2/task on Arc-AGI-2 ends up being way to much for use in coding agents."
}
,
  
{
  "id": "46997518",
  "text": "I’m surprised that gemini 3 pro is so low at 31.1% though compared to opus 4.6 and gpt 5.2. This is a great achievement but its only available to ultra subscribers unfortunately"
}
,
  
{
  "id": "46992958",
  "text": "Arc-AGI (and Arc-AGI-2) is the most overhyped benchmark around though.\n\nIt's completely misnamed. It should be called useless visual puzzle benchmark 2.\n\nIt's a visual puzzle, making it way easier for humans than for models trained on text firstly. Secondly, it's not really that obvious or easy for humans to solve themselves!\n\nSo the idea that if an AI can solve \"Arc-AGI\" or \"Arc-AGI-2\" it's super smart or even \"AGI\" is frankly ridiculous. It's a puzzle that means nothing basically, other than the models can now solve \"Arc-AGI\""
}
,
  
{
  "id": "46993055",
  "text": "The puzzles are calibrated for human solve rates, but otherwise I agree."
}
,
  
{
  "id": "46993151",
  "text": "My two elderly parents cannot solve Arc-AGI puzzles, but can manage to navigate the physical world, their house, garden, make meals, clean the house, use the TV, etc.\n\nI would say they do have \"general intelligence\", so whatever Arc-AGI is \"solving\" it's definitely not \"AGI\""
}
,
  
{
  "id": "46993569",
  "text": "You are confusing fluid intelligence with crystallised intelligence."
}
,
  
{
  "id": "46993655",
  "text": "I think you are making that confusion. Any robotic system in the place of his parents would fail with a few hours.\n\nThere are more novel tasks in a day than ARC provides."
}
,
  
{
  "id": "46993800",
  "text": "Children have great levels of fluid intelligence, that's how they are able to learn to quickly navigate in a world that they are still very new to. Seniors with decreasing capacity increasingly rely on crystallised intelligence, that's why they can still perform tasks like driving a car but can fail at completely novel tasks, sometimes even using a smartphone if they have not used one before."
}
,
  
{
  "id": "46996634",
  "text": "My late grandma learnt how to use an iPad by herself during her 70s to 80s without any issues, mostly motivated by her wish to read her magazines, doomscroll facebook and play solitaire. Her last job was being a bakery cashier in her 30s and she didn't learn how to use a computer in-between, so there was no skill transfer going on.\n\nHumans and their intelligence are actually incredible and probably will continue to be so, I don't really care what tech/\"think\" leaders wants us to think."
}
,
  
{
  "id": "46994700",
  "text": "It really depends on motivation. My 90 year old grandmother can use a smartphone just fine since she needs it to see pictures of her (great) grandkids."
}
,
  
{
  "id": "46991558",
  "text": "It is over"
}
,
  
{
  "id": "46991653",
  "text": "I for one welcome our new AI overlords."
}
,
  
{
  "id": "46993441",
  "text": "Is it me or is the rate of model release is accelerating to an absurd degree? Today we have Gemini 3 Deep Think and GPT 5.3 Codex Spark. Yesterday we had GLM5 and MiniMax M2.5. Five days before that we had Opus 4.6 and GPT 5.3. Then maybe two weeks I think before that we had Kimi K2.5."
}
,
  
{
  "id": "46994169",
  "text": "I think it is because of the Chinese new year.\nThe Chinese labs like to publish their models arround the Chinese new year, and the US labs do not want to let a DeepSeek R1 (20 January 2025) impact event happen again, so i guess they publish models that are more capable then what they imagine Chinese labs are yet capable of producing."
}
,
  
{
  "id": "46995230",
  "text": "Singularity or just Chinese New Year?"
}
,
  
{
  "id": "46996485",
  "text": "The Singularity will occur on a Tuesday, during Chinese New Year"
}
,
  
{
  "id": "46997412",
  "text": "I guess. Deepseek v3 was released on boxing day a month prior\n\nhttps://api-docs.deepseek.com/news/news1226"
}
,
  
{
  "id": "46999744",
  "text": "And made almost zero impact, it was just a bigger version of Deepseek V2 and when mostly unnoticed because its performances weren't particularly notable especially for its size.\n\nIt was R1 with its RL-training that made the news and crashed the srock market."
}
,
  
{
  "id": "46998664",
  "text": "Aren't we saying \"lunar new year\" now?"
}
,
  
{
  "id": "46999013",
  "text": "I don't think so; there are different lunar calendars."
}

]
</comments_to_classify>

Based on the comments above, assign each to up to 3 relevant topics.

Return ONLY a JSON array with this exact structure (no other text):
[
  
{
  "id": "comment_id_1",
  "topics": [
    1,
    3,
    5
  ]
}
,
  
{
  "id": "comment_id_2",
  "topics": [
    2
  ]
}
,
  
{
  "id": "comment_id_3",
  "topics": [
    0
  ]
}
,
  ...
]

Rules:
- Each comment can have 0 to 3 topics
- Use 1-based topic indices for matches
- Use index 0 if the comment does not fit well in any category
- Only assign topics that are genuinely relevant to the comment

Remember: Output ONLY the JSON array, no other text.

commentCount

50
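The output rules in the prompt above (a JSON array of `{"id", "topics"}` objects, at most 3 topics each, 1-based indices with 0 as the no-fit bucket) can be checked with a minimal validator sketch. The function name and arguments here are hypothetical, assuming the model's reply is parsed directly as JSON:

```python
import json

def validate_classification(reply_text, valid_ids, n_topics=20):
    """Check a model reply against the prompt's output rules:
    a JSON list of {"id": str, "topics": [int, ...]} objects,
    each with 0 to 3 topics, indices in 0..n_topics, and ids
    drawn from the batch that was sent for classification."""
    items = json.loads(reply_text)
    assert isinstance(items, list), "reply must be a JSON array"
    for item in items:
        assert set(item) == {"id", "topics"}, "unexpected keys"
        assert item["id"] in valid_ids, f"unknown id {item['id']!r}"
        topics = item["topics"]
        assert isinstance(topics, list) and len(topics) <= 3
        assert all(isinstance(t, int) and 0 <= t <= n_topics
                   for t in topics), "topic index out of range"
    return items
```

For example, a reply like `[{"id": "46993722", "topics": [8]}, {"id": "46991558", "topics": [0]}]` passes, while an entry with four topics or an index above 20 raises an assertion.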
