Summarizer

LLM Input

llm/2ad2a7bb-5462-4391-a2da-bf11064993c9/batch-9-378e7b54-e316-4c9f-9f0f-99686d568815-input.json

prompt

The following is content for you to classify. Do not respond to the comments—classify them.

<topics>
1. ARC-AGI Benchmark Validity
   Related: Debate over whether ARC-AGI measures general intelligence or just spatial reasoning puzzles, concerns about benchmarkmaxxing, semi-private vs private test sets, cost per task at $13.62, and whether solving it indicates anything meaningful about AGI capabilities
2. Gemini vs Claude for Coding
   Related: Strong consensus that Claude dominates agentic coding workflows while Gemini lags behind, discussion of tool calling failures, instruction following issues, and hallucinations when using Gemini for development tasks
3. Benchmarkmaxxing Concerns
   Related: Skepticism that high benchmark scores reflect real-world performance, suspicions that labs optimize specifically for popular tests, concerns about training data leakage, and debate over whether improvements are genuine or gamed
4. Definition of AGI
   Related: Philosophical debate about what constitutes artificial general intelligence, whether consciousness is required, Chollet's definition involving tasks feasible for humans but unsolved by AI, and moving goalposts in AI evaluation
5. Google Product Quality Issues
   Related: Complaints about Gemini app UX problems including context loss, Russian propaganda sources, switching languages mid-sentence, document upload failures, and poor integration compared to ChatGPT
6. Balatro Gaming Benchmark
   Related: Discussion of Gemini 3's ability to play the card game Balatro from text descriptions alone, debate over whether this demonstrates generalization, and comparisons showing other models like DeepSeek failing at the task
7. Model Release Acceleration
   Related: Observation that AI model releases are accelerating dramatically, multiple frontier models released within days, connection to Chinese New Year timing, and competition between US and Chinese labs
8. Cost vs Performance Tradeoffs
   Related: Analysis of inference costs versus capabilities, Gemini Flash praised for cost-performance ratio, concerns about $13.62 per ARC-AGI task, and debate over what price makes models practical for real applications
9. Deep Research Reliability
   Related: Mixed experiences with AI deep research capabilities, complaints about garbage citations, hallucinated sources, contradictory information, and questions about whether it saves time when sources must be verified
10. Google's Competitive Position
   Related: Debate over whether Google is leading or behind in AI, discussion of their data advantages from YouTube and Books, claims they let competitors think they were behind, and analysis of their strengths in visual AI
11. Pelican on Bicycle Benchmark
   Related: Simon Willison's informal SVG generation test, discussion of whether it's being trained on specifically, quality improvements in latest models, and debate over its validity as a casual benchmark
12. AI Consciousness Claims
   Related: Pushback against suggestions that passing tests indicates consciousness, comparisons to simple programs claiming consciousness, discussion of self-awareness research, and skepticism about anthropomorphizing AI capabilities
13. Test Time Compute Approaches
   Related: Analysis of thinking vs non-thinking models, best-of-N approaches like Deep Think, computational complexity differences, and questions about whether sufficiently large non-thinking models can match smaller thinking ones
14. Real World Task Performance
   Related: Frustration that benchmark gains don't translate to practical improvements, examples of models failing simple debugging tasks, and arguments that actual work product matters more than test scores
15. AI Job Displacement Fears
   Related: Concerns about software engineers being replaced, comparisons to factory worker displacement, debate over whether AI creates or destroys jobs, and skepticism about optimistic narratives from AI company executives
16. Spatial Reasoning Limitations
   Related: Discussion of LLMs struggling with spatial tasks, image orientation affecting OCR accuracy, and whether ARC-AGI improvements indicate genuine spatial reasoning advances or benchmark-specific solutions
17. Model Architecture Secrecy
   Related: Observation that frontier labs no longer share architecture details like parameter counts, shift from technical discussions to capability-focused marketing, and desire for more transparency
18. Academic vs Practical Intelligence
   Related: Distinction between Gemini excelling at academic benchmarks while feeling less useful for practical tasks, discussion of book smart vs street smart analogies for AI capabilities
19. First Proof Mathematical Challenge
   Related: Discussion of newly released unsolved math problems designed to test frontier models, predictions about whether current models can solve genuine research-level mathematics
20. Subscription Pricing Frustration
   Related: Complaints about $250/month Google AI Ultra subscription required for Deep Think access, desire to test new models without platform lock-in, and calls for OpenRouter availability
0. Does not fit well in any category
</topics>

<comments_to_classify>
[
  {
    "id": "46993212",
    "text": "Indeed. And when you factor in the amount invested... yeah it looks less impressive. The question is how much more money needs to be invested to get this thing closer to reality? And not just in this instance. But for any instance e.g. a seahorse on a bike."
  },
  {
    "id": "46993144",
    "text": "Highly disagree.\n\nI was expecting something more realistic... the true test of what you are doing is how representative is the thing in relation to the real world. E.g. does the pelican look like a pelican as it exists in reality? This cartoon stuff is cute but doesnt pass muster in my view.\n\nIf it doesn't relate to the real world, then it most likely will have no real effect on the real economy. Pure and simple."
  },
  {
    "id": "46993352",
    "text": "I disagree. The task asks for an SVG; which is a vector format associated with line drawings, clipart and cartoons. I think it's good that models are picking up on that context.\n\nIn contrast, the only \"realistic\" SVGs I've seen are created using tools like potrace, and look terrible .\n\nI also think the prompt itself, of a pelican on bicycle, is unrealistic and cartoonish; so making a cartoon is a good way to solve the task."
  },
  {
    "id": "46993243",
    "text": "The request is for an SVG, generally _not_ the format for photorealistic images. If you want to start your own benchmark, feel free to ask for a photorealistic JPEG or PNG of a pelican riding a bicycle. Could be interesting to compare and contrast, honestly."
  },
  {
    "id": "46993093",
    "text": "I can't shake of the feeling that Googles Deep Think Models are not really different models but just the old ones being run with higher number of parallel subagents, something you can do by yourself with their base model and opencode."
  },
  {
    "id": "46993115",
    "text": "And after i do that, how do i combine the output of 1000 subagents into one output? (Im not being snarky here, i think it's a nontrivial problem)"
  },
  {
    "id": "46993462",
    "text": "You just pipe it to another agent to do the reduce step (i.e. fan-in) of the mapreduce (fan-out)\n\nIt's agents all the way down."
  },
  {
    "id": "46993646",
    "text": "The idea is that each subagent is focused on a specific part of the problem and can use its entire context window for a more focused subtask than the overall one. So ideally the results arent conflicting, they are complimentary. And you just have a system that merges them.. likely another agent."
  },
  {
    "id": "46998793",
    "text": "Claude Cowork does this by default and you can see how exactly it is coordinating them etc."
  },
  {
    "id": "46993536",
    "text": "Start with 1024 and use half the number of agents each turn to distill the final result."
  },
  {
    "id": "46998952",
    "text": "They could do it this way: generate 10 reasoning traces and then every N tokens they prune the 9 that have the lowest likelihood, and continue from the highest likelihood trace.\n\nThis is a form of task-agnostic test time search that is more general than multi agent parallel prompt harnesses.\n\n10 traces makes sense because ChatGPT 5.2 Pro is 10x more expensive per token.\n\nThat's something you can't replicate without access to the network output pre token sampling."
  },
  {
    "id": "46996902",
    "text": "It’s incredible how fast these models are getting better. I thought for sure a wall would be hit, but these numbers smashes previous benchmarks. Anyone have any idea what the big unlock that people are finding now?"
  },
  {
    "id": "46997205",
    "text": "Companies are optimizing for all the big benchmarks. This is why there is so little correlation between benchmark performance and real world performance now."
  },
  {
    "id": "46997622",
    "text": "Isn’t there? I mean, Claude code has been my biggest usecase and it basically one shots everything now"
  },
  {
    "id": "46997731",
    "text": "Yes, LLMs have become extremely good at coding (not software engineer though). But try using them for anything original that cannot be adapted from GitHub and Stack Overflow. I haven't seen much improvement at all at such tasks."
  },
  {
    "id": "46998728",
    "text": "Strongly disagree with this. And I'm going to provide as much evidence as you did."
  },
  {
    "id": "47001208",
    "text": "I don't get it, why is Claude still number 1 while the numbers say different, let's see that new Gemini in the terminal also"
  },
  {
    "id": "46992505",
    "text": "Do we get any model architecture details like parameter size etc.? Few months back, we used to talk more on this, now it's mostly about model capabilities."
  },
  {
    "id": "46992547",
    "text": "I'm honestly not sure what you mean? The frontier labs have kept arch as secrets since gpt3.5"
  },
  {
    "id": "46994510",
    "text": "At the very least gemini 3's flyer claims 1T parameters."
  },
  {
    "id": "47000040",
    "text": "Too bad we can’t use it. Whenever Google releases something, I can never seem to use it in their coding cli product."
  },
  {
    "id": "47001069",
    "text": "You can but only via Gemini Ultra plan which you can buy or Gemini API with early access."
  },
  {
    "id": "47001327",
    "text": "I know, and neither of these options are feasible for me. I can't get the early access and I am not willing to drop $250 in order to just try their new model. By the time I can use it, the other two companies have something similar and I lose my interest in Google's models."
  },
  {
    "id": "47000390",
    "text": "So last week I tried Gemini pro 3, Opus 4.6, GLM 5, Kimi2.5 so far using Kimi2.5 yeilded the best results (in terms of cost/performance) for me in a mid size Go project. Curious to know what others think ?"
  },
  {
    "id": "47001277",
    "text": "I predict Gemini Flash will dominate when you try it.\n\nIf you're going for cost performance balance choosing Gemini Pro is bewildering. Gemini Flash _outperforms_ Pro in some coding benchmarks and is the clear parento frontier leader for intelligence/cost. It's even cheaper than Kimi 2.5.\n\nhttps://artificialanalysis.ai/?media-leaderboards=text-to-im..."
  },
  {
    "id": "46992151",
    "text": "Less than a year to destroy Arc-AGI-2 - wow."
  },
  {
    "id": "46992321",
    "text": "I unironically believe that arc-agi-3 will have a introduction to solved time of 1 month"
  },
  {
    "id": "46994813",
    "text": "Not very likely?\n\nARC-AGI-3 has a nasty combo of spatial reasoning + explore/exploit. It's basically adversarial vs current AIs."
  },
  {
    "id": "46999275",
    "text": "We will see at the end of April right? It's more of a guess than a strongly held conviction--but I see models improving rapidly at long horizon tasks so I think it's possible. I think a benchmark which can survive a few months (maybe) would be if it genuinely tested long time-frame continual learning/test-time learning/test-time posttraining (idk honestly the differences b/t these).\n\nBut i'm not sure how to give such benchmarks. I'm thinking of tasks like learning a language/becoming a master at chess from scratch/becoming a skill artists but where the task is novel enough for the actor to not be anywhere close to proficient at beginning--an example which could be of interest is, here is a robot you control, you can make actions, see results...become proficient at table tennis. Maybe another would be, here is a new video game, obtain the best possible 0% speedrun."
  },
  {
    "id": "46992791",
    "text": "The AGI bar has to be set even higher, yet again."
  },
  {
    "id": "46998795",
    "text": "And that's the way it should be. We're past the \"Look! It can talk! How cute!\" stage. AGI should be able to deal with any problem a human can."
  },
  {
    "id": "46993517",
    "text": "wow solving useless puzzles, such a useful metric!"
  },
  {
    "id": "46995585",
    "text": "How is spatial reasoning useless??"
  },
  {
    "id": "46993798",
    "text": "It's still useful as a benchmark of cost/efficiency."
  },
  {
    "id": "46993244",
    "text": "But why only a +0.5% increase for MMMU-Pro?"
  },
  {
    "id": "46994375",
    "text": "Its possibly label noise. But you can't tell from a single number.\n\nYou would need to check to see if everyone is having mistakes on the same 20% or different 20%. If its the same 20% either those questions are really hard, or they are keyed incorrectly, or they aren't stated with enough context to actually solve the problem.\n\nIt happens. Old MMLU non pro had a lot of wrong answers. Simple things like MNIST have digits labeled incorrect or drawn so badly its not even a digit anymore."
  },
  {
    "id": "46993520",
    "text": "Everyone is already at 80% for that one. Crazy that we were just at 50% with GPT-4o not that long ago."
  },
  {
    "id": "46997382",
    "text": "But 80% sounds far from good enough, that's 20% error rate, unusable in autonomous tasks. Why stop at 80%? If we aim for AGI, it should 100% any benchmark we give."
  },
  {
    "id": "46999243",
    "text": "I'm not sure the benchmark is high enough quality that >80% of problems are well-specified & have correct labels tbh. (But I guess this question has been studied for these benchmarks)"
  },
  {
    "id": "46997566",
    "text": "Are humans 100%?"
  },
  {
    "id": "46997624",
    "text": "If they are knowledgeable enough and pay attention, yes. Also, if they are given enough time for the task.\n\nBut the idea of automation is to make a lot fewer mistakes than a human, not just to do things faster and worse."
  },
  {
    "id": "46998759",
    "text": "Actually faster and worse is a very common characterization of a LOT of automation."
  },
  {
    "id": "47000548",
    "text": "That's true.\n\nThe problem is that if the automation breaks at any point, the entire system fails. And programming automations are extremely sensitive to minor errors (i.e. a missing semicolon).\n\nAI does have an interesting feature though, it tends to self-healing in a way, when given tools access and a feedback loop. The only problem is that self-healing can incorrectly heal errors, then the final reault will be wrong in hard-to-detect ways.\n\nSo the more wuch hidden bugs there are, the nore unexpectedly the automations will perform.\n\nI still don't trust current AI for any tasks more than data parsing/classification/translation and very strict tool usage.\n\nI don't beleive in the full-assistant/clawdbot usage safety and reliability at this time (it might be good enough but the end of the year, but then the SWE bench should be at 100%)."
  },
  {
    "id": "46993203",
    "text": "It's a useless meaningless benchmark though, it just got a catchy name, as in, if the models solve this it means they have \"AGI\", which is clearly rubbish.\n\nArc-AGI score isn't correlated with anything useful."
  },
  {
    "id": "46995231",
    "text": "It's correlated with the ability to solve logic puzzles.\n\nIt's also interesting because it's very very hard for base LLMs, even if you try to \"cheat\" by training on millions of ARC-like problems. Reasoning LLMs show genuine improvement on this type of problem."
  },
  {
    "id": "46993259",
    "text": "how would we actually objectively measure a model to see if it is AGI if not with benchmarks like arc-AGI?"
  },
  {
    "id": "46994360",
    "text": "Give it a prompt like\n\n>can u make the progm for helps that with what in need for shpping good cheap products that will display them on screen and have me let the best one to get so that i can quickly hav it at home\n\nAnd get back an automatic coupon code app like the user actually wanted."
  },
  {
    "id": "46996713",
    "text": "ARC-AGI 2 is an IQ test. IQ tests have been shown over and over to have predictive power in humans. People who score well on them tend to be more successful"
  },
  {
    "id": "46997349",
    "text": "IQ tests only work if the participants haven't trained for them. If they do similar tests a few times in a row, scores increase a lot. Current LLMs are hyper-optimized for the particular types of puzzles contained in popular \"benchmarks\"."
  },
  {
    "id": "47001024",
    "text": "We're getting to the point where we can ask AI to invent new programming languages."
  }
]
</comments_to_classify>

Based on the comments above, assign each to up to 3 relevant topics.

Return ONLY a JSON array with this exact structure (no other text):
[
  {
    "id": "comment_id_1",
    "topics": [1, 3, 5]
  },
  {
    "id": "comment_id_2",
    "topics": [2]
  },
  {
    "id": "comment_id_3",
    "topics": [0]
  },
  ...
]

Rules:
- Each comment can have 0 to 3 topics
- Use 1-based topic indices for matches
- Use index 0 if the comment does not fit well in any category
- Only assign topics that are genuinely relevant to the comment

Remember: Output ONLY the JSON array, no other text.

commentCount

50
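The output contract above is strict enough to check mechanically before ingesting a batch result. Below is a minimal validator sketch in Python; the `validate` helper, its constants, and its assertion messages are illustrative assumptions, not part of the job itself:

```python
import json

NUM_TOPICS = 20              # topics are indexed 1-20; 0 means "does not fit"
MAX_TOPICS_PER_COMMENT = 3   # "assign each to up to 3 relevant topics"
EXPECTED_COUNT = 50          # commentCount reported for this batch

def validate(raw: str) -> list:
    """Parse a model response and check it against the prompt's output rules."""
    data = json.loads(raw)
    assert isinstance(data, list), "response must be a bare JSON array"
    assert len(data) == EXPECTED_COUNT, "one entry per classified comment"
    for entry in data:
        assert set(entry) == {"id", "topics"}, "exact {id, topics} structure"
        topics = entry["topics"]
        assert len(topics) <= MAX_TOPICS_PER_COMMENT, "0 to 3 topics allowed"
        assert all(0 <= t <= NUM_TOPICS for t in topics), "indices 0-20 only"
    return data
```

A check like this catches truncated arrays, extra prose around the JSON, and out-of-range topic indices before the classifications reach downstream aggregation.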
