The following is content for you to classify. Do not respond to the comments—classify them.
<topics>
1. ARC-AGI Benchmark Validity
Related: Debate over whether ARC-AGI measures general intelligence or just spatial reasoning puzzles, concerns about benchmarkmaxxing, semi-private vs private test sets, cost per task at $13.62, and whether solving it indicates anything meaningful about AGI capabilities
2. Gemini vs Claude for Coding
Related: Strong consensus that Claude dominates agentic coding workflows while Gemini lags behind, discussion of tool calling failures, instruction following issues, and hallucinations when using Gemini for development tasks
3. Benchmarkmaxxing Concerns
Related: Skepticism that high benchmark scores reflect real-world performance, suspicions that labs optimize specifically for popular tests, concerns about training data leakage, and debate over whether improvements are genuine or gamed
4. Definition of AGI
Related: Philosophical debate about what constitutes artificial general intelligence, whether consciousness is required, Chollet's definition involving tasks feasible for humans but unsolved by AI, and moving goalposts in AI evaluation
5. Google Product Quality Issues
Related: Complaints about Gemini app UX problems including context loss, Russian propaganda sources, switching languages mid-sentence, document upload failures, and poor integration compared to ChatGPT
6. Balatro Gaming Benchmark
Related: Discussion of Gemini 3's ability to play the card game Balatro from text descriptions alone, debate over whether this demonstrates generalization, and comparisons showing other models like DeepSeek failing at the task
7. Model Release Acceleration
Related: Observation that AI model releases are accelerating dramatically, multiple frontier models released within days, connection to Chinese New Year timing, and competition between US and Chinese labs
8. Cost vs Performance Tradeoffs
Related: Analysis of inference costs versus capabilities, Gemini Flash praised for cost-performance ratio, concerns about $13.62 per ARC-AGI task, and debate over what price makes models practical for real applications
9. Deep Research Reliability
Related: Mixed experiences with AI deep research capabilities, complaints about garbage citations, hallucinated sources, contradictory information, and questions about whether it saves time when sources must be verified
10. Google's Competitive Position
Related: Debate over whether Google is leading or behind in AI, discussion of their data advantages from YouTube and Books, claims they let competitors think they were behind, and analysis of their strengths in visual AI
11. Pelican on Bicycle Benchmark
Related: Simon Willison's informal SVG generation test, discussion of whether it's being trained on specifically, quality improvements in latest models, and debate over its validity as a casual benchmark
12. AI Consciousness Claims
Related: Pushback against suggestions that passing tests indicates consciousness, comparisons to simple programs claiming consciousness, discussion of self-awareness research, and skepticism about anthropomorphizing AI capabilities
13. Test Time Compute Approaches
Related: Analysis of thinking vs non-thinking models, best-of-N approaches like Deep Think, computational complexity differences, and questions about whether sufficiently large non-thinking models can match smaller thinking ones
14. Real World Task Performance
Related: Frustration that benchmark gains don't translate to practical improvements, examples of models failing simple debugging tasks, and arguments that actual work product matters more than test scores
15. AI Job Displacement Fears
Related: Concerns about software engineers being replaced, comparisons to factory worker displacement, debate over whether AI creates or destroys jobs, and skepticism about optimistic narratives from AI company executives
16. Spatial Reasoning Limitations
Related: Discussion of LLMs struggling with spatial tasks, image orientation affecting OCR accuracy, and whether ARC-AGI improvements indicate genuine spatial reasoning advances or benchmark-specific solutions
17. Model Architecture Secrecy
Related: Observation that frontier labs no longer share architecture details like parameter counts, shift from technical discussions to capability-focused marketing, and desire for more transparency
18. Academic vs Practical Intelligence
Related: Distinction between Gemini excelling at academic benchmarks while feeling less useful for practical tasks, discussion of book smart vs street smart analogies for AI capabilities
19. First Proof Mathematical Challenge
Related: Discussion of newly released unsolved math problems designed to test frontier models, predictions about whether current models can solve genuine research-level mathematics
20. Subscription Pricing Frustration
Related: Complaints about $250/month Google AI Ultra subscription required for Deep Think access, desire to test new models without platform lock-in, and calls for OpenRouter availability
0. Does not fit well in any category
</topics>
<comments_to_classify>
[
{
"id": "46992906",
"text": "The 1st proof original solutions are due to be published in about 24h, AIUI."
}
,
{
"id": "46998986",
"text": "Feels like an unforced blunder to make the time window so short after going to so much effort and coming up with something so useful."
}
,
{
"id": "47001010",
"text": "5 days for Ai is by no mean short! If it can solve it, it would need perhaps 1-2 hours. If it can not, 5 days continuous running would produce gibberish only. We can safely assume that such private models will run inferences entirely on dedicated hardware, sharing with nobody. So if they could not solve the problems, it's not due to any artificial constraint or lack of resources, far from it.\n\nThe 5 days window, however, is a sweat spot because it likely prevents cheating by hiring a math PhD and feed the AI with hints and ideas."
}
,
{
"id": "47001142",
"text": "5 days is short for memetic propagation on social media to reach everyone who has their own harness and agentic setup that wants to have a go."
}
,
{
"id": "46996339",
"text": "Really surprised that 1stproof.org was submitted three times and never made front page at HN.\n\nhttps://hn.algolia.com/?q=1stproof\n\nThis is exactly the kind of challenge I would want to judge AI systems based on. It required ten bleeding-edge-research mathematicians to publish a problem they've solved but hold back the answer . I appreciate the huge amount of social capital and coordination that must have taken.\n\nI'm really glad they did it."
}
,
{
"id": "47000036",
"text": "Of course it isn't made the front page. If something is promising they hunt it down, and when conquered they post about it. Lot of times the new category has much better results, than the default HN view."
}
,
{
"id": "47001466",
"text": "I'm impressed with the Arc-AGI-2 results - though readers beware... They achieved this score at a cost of $13.62 per task.\n\nFor context, Opus 4.6's best score is 68.8% - but at a cost of $3.64 per task."
}
,
{
"id": "46992644",
"text": "The pelican riding a bicycle is excellent . I think it's the best I've seen.\n\nhttps://simonwillison.net/2026/Feb/12/gemini-3-deep-think/"
}
,
{
"id": "46999072",
"text": "So, you've said multiple times in the past that you're not concerned about AI labs training for this specific test because if they did, it would be so obviously incongruous that you'd easily spot the manipulation and call them out.\n\nWhich tbh has never really sat right with me, seemingly placing way too much confidence in your ability to differentiate organic vs. manipulated output in a way I don't think any human could be expected to.\n\nTo me, this example is an extremely neat and professional SVG and so far ahead it almost seems too good to be true. But like with every previous model, you don't seem to have the slightest amount of skepticism in your review. I don't think I truly believe Google cheated here, but it's so good it does therefore make me question whether there could ever be an example of a pelican SVG in the future that actually could trigger your BS detector?\n\nI know you say it's just a fun/dumb benchmark that's not super important, but you're easily in the top 3 most well known AI \"influencers\" whose opinion/reviews about model releases carry a lot of weight, providing a lot of incentive with trillions of dollars flying around. Are you still not at all concerned by the amount of attention this benchmark receives now/your risk of unwittingly being manipulated?"
}
,
{
"id": "47001739",
"text": "The other SVGs I tried from my private collection of prompts were all similarly impressive."
}
,
{
"id": "46995836",
"text": "Tbh they'd have to be absolutely useless at benchmarkmaxxing if they didn't include your pelican riding a bicycle..."
}
,
{
"id": "46999458",
"text": "This benchmark outcome is actually really impressive given the difficulty of this task. It shows that this particular model manages to \"think\" coherently and maintain useful information in its context for what has to be an insane overall amount of tokens, likely across parallel \"thinking\" chains. Likely also has access to SVG-rendering tools and can \"see\" and iterate on the result via multimodal input."
}
,
{
"id": "46998577",
"text": "Wow. I wonder how it would do with pure CSS a la https://diana-adrianne.com/"
}
,
{
"id": "46997551",
"text": "We've reached PGI"
}
,
{
"id": "46994537",
"text": "I routinely check out the pelicans you post and I do agree, this is the best yet. It seemed to me that the wings/arms were such a big hangup for these generators."
}
,
{
"id": "46992681",
"text": "How likely this problem is already on the training set by now?"
}
,
{
"id": "46992849",
"text": "If anyone trains a model on https://simonwillison.net/tags/pelican-riding-a-bicycle/ they're going to get some VERY weird looking pelicans."
}
,
{
"id": "46993675",
"text": "Why would they train on that? Why not just hire someone to make a few examples."
}
,
{
"id": "46993901",
"text": "I look forward to them trying. I'll know when the pelican riding a bicycle is good but the ocelot riding a skateboard sucks."
}
,
{
"id": "47001019",
"text": "Would it not be better to have 100 such tests \"Pelican on bicycle\", \"Tiger on stilts\"..., and generate them all for every new model but only release a new one each time. That way you could show progression across all models, attempts at benchmaxxing would be more obvious.\n\nGiven the crazy money and vying for supremacy among AI companies right now it does seem naive to belive that no attempt at better pelicans on bicycles is being made. You can argue \"but I will know because of the quality of ocelots on skateboards\" but without a back catalog of ocelots on skateboards to publish its one datapoint and leaves the AI companies with too much plausible deniability.\n\nThe pelicans-on-bicycles is a bit of fun for you (and us!) but it has become a measure of the quality of models so its serious business for them.\n\nThere is an assymetry of incentives and high risk you are being their useful idiot. Sorry to be blunt."
}
,
{
"id": "46994247",
"text": "But they could just train on an assortment of animals and vehicles. It's the kind of relatively narrow domain where NNs could reasonably interpolate."
}
,
{
"id": "46994326",
"text": "The idea that an AI lab would pay a small army of human artists to create training data for $animal on $transport just to cheat on my stupid benchmark delights me."
}
,
{
"id": "46994368",
"text": "When you're spending trillions on capex, paying a couple of people to make some doodles in SVGs would not be a big expense."
}
,
{
"id": "46998644",
"text": "Vetting them for the potential for whistleblowing might be a bit more involved. But conspiracy theories have an advantage because the lack of evidence is evidence for the theory."
}
,
{
"id": "46999565",
"text": "Huh? AI labs are routinely spending millions to billions to various 3rd party contractors specializing in creating/labeling/verifying specialized content for pre/post-training.\n\nThis would just be one more checkbox buried in hundreds of pages of requests, and compared to plenty of other ethical grey areas like copyright laundering with actual legal implications, leaking that someone was asked to create a few dozen pelican images seems like it would be at the very bottom of the list of reputational risks."
}
,
{
"id": "46999808",
"text": "How do you think who's in on that? Not only pelicans, I mean, the whole thing. CEOs, top researchers, select mathematicians, congressmen? Does China participate in maintaining the bubble?\n\nI, myself, prefer the universal approximation theorem and empirical finding that stochastic gradient descent is good enough (and \"no 'magic' in the brain\", of course)."
}
,
{
"id": "47001852",
"text": "Well, since we're all talking about sourcing training material to \"benchmaxx\" for social proof, and not litigating the whole \"AI bubble\" debate, just the entire cottage industry of data curation firms:\n\nhttps://scale.com/data-engine\n\nhttps://www.appen.com/llm-training-data\n\nhttps://www.cogitotech.com/generative-ai/\n\nhttps://www.telusdigital.com/solutions/data-for-ai-training/...\n\nhttps://www.nexdata.ai/industries/generative-ai\n\n---\n\nP.S. Google Comms would have been consulted re putting a pelican in the I/O keynote :-)\n\nhttps://x.com/simonw/status/1924909405906338033"
}
,
{
"id": "47002276",
"text": "Cool. At least they are working across the board and benchmaxing random things like the theory of mind."
}
,
{
"id": "46994829",
"text": "The embarrassment of getting caught doing that would be expensive."
}
,
{
"id": "46992761",
"text": "For every combination of animal and vehicle? Very unlikely.\n\nThe beauty of this benchmark is that it takes all of two seconds to come up with your own unique one. A seahorse on a unicycle. A platypus flying a glider. A man’o’war piloting a Portuguese man of war. Whatever you want."
}
,
{
"id": "46992830",
"text": "No, not every combination. The question is about the specific combination of a pelican on a bicycle. It might be easy to come up with another test, but we're looking at the results from a particular one here."
}
,
{
"id": "47002038",
"text": "You can easily make a RLAIF loop.\n\n- Take a list of n animals * m vehicule\n\n- Ask a LLM to generate SVG for this n*m options\n\n- Generate png from the svg\n\n- Ask a Model with vision to grade the result\n\n- Change your weight accordingly\n\nNo need to human to draw the dataset, no need of human to evaluate."
}
,
{
"id": "46993480",
"text": "More likely you would just train for emitting svg for some description of a scene and create training data from raster images."
}
,
{
"id": "46996985",
"text": "None of this works if the testers are collaborating with the trainers. The tests ostensibly need to be arms-length from the training. If the trainers ever start over-fitting to the test, the tester would come up with some new test secretly."
}
,
{
"id": "46992798",
"text": "You can always ask for a tyrannosaurus driving a tank."
}
,
{
"id": "46992740",
"text": "I've heard it posited that the reason the frontier companies are frontier is because they have custom data and evals. This is what I would do too"
}
,
{
"id": "46994083",
"text": "Is there a list of these for each model, that you've catalogued somewhere?"
}
,
{
"id": "47001901",
"text": "At the moment that's mostly my tag page here but I really need to formalize it: https://simonwillison.net/tags/pelican-riding-a-bicycle/"
}
,
{
"id": "46992710",
"text": "The reflection of the sun in the water is completely wrong. LLMs are still useless. (/s)"
}
,
{
"id": "46993360",
"text": "It's not actually, look up some photos of the sun setting over the ocean. Here's an example:\n\nhttps://stockcake.com/i/sunset-over-ocean_1317824_81961"
}
,
{
"id": "46993619",
"text": "That’s only if the sun is above the horizon entirely."
}
,
{
"id": "46994287",
"text": "No, it's not.\n\nhttps://stockcake.com/i/serene-ocean-sunset_1152191_440307"
}
,
{
"id": "46996173",
"text": "Yes, it is. In that photo the sun is clearly above the horizon, the bottom half is just obscured by clouds."
}
,
{
"id": "46992984",
"text": "Do you have to still keep trying to bang on about this relentlessly?\n\nIt was sort of humorous for the maybe first 2 iterations, now it's tacky, cheesy, and just relentless self-promotion.\n\nAgain, like I said before, it's also a terrible benchmark."
}
,
{
"id": "46995192",
"text": "I'll agree to disagree. In any thread about a new model, I personally expect the pelican comment to be out there. It's informative, ritualistic and frankly fun. Your comment however, is a little harsh. Why mad?"
}
,
{
"id": "46993913",
"text": "It being a terrible benchmark is the bit."
}
,
{
"id": "46993047",
"text": "Eh, i find it more of a not very informative but lighthearted commentary"
}
,
{
"id": "46992669",
"text": "It's worth noting that you mean excellent in terms of prior AI output. I'm pretty sure this wouldn't be considered excellent from a \"human made art\" perspective. In other words, it's still got a ways to go!\n\nEdit: someone needs to explain why this comment is getting downvoted, because I don't understand. Did someone's ego get hurt, or what?"
}
,
{
"id": "46993092",
"text": "It depends, if you meant from a human coding an SVG \"manually\" the same way, I'd still say this is excellent (minus the reflection issue). If you meant a human using a proper vector editor, then yeah."
}
,
{
"id": "46993508",
"text": "maybe you're a pro vector artist but I couldn't create such a cool one myself in illustrator tbh"
}
]
</comments_to_classify>
Based on the comments above, assign each comment to up to 3 relevant topics.
Return ONLY a JSON array with this exact structure (no other text):
[
{
"id": "comment_id_1",
"topics": [
1,
3,
5
]
}
,
{
"id": "comment_id_2",
"topics": [
2
]
}
,
{
"id": "comment_id_3",
"topics": [
0
]
}
,
...
]
Rules:
- Each comment can have 0 to 3 topics
- Use 1-based topic indices for matches
- Use index 0 if the comment does not fit well in any category
- Only assign topics that are genuinely relevant to the comment
Remember: Output ONLY the JSON array, no other text.
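The output contract above can be checked mechanically. Below is a minimal sketch of a validator for the classifier's response, assuming only the rules stated in this prompt: a JSON array of objects with an `id` string and a `topics` list of 0 to 3 integers, where valid indices run from 0 (no fitting category) through 20 (the highest topic number defined above). The function name `validate_output` and the constants are illustrative, not part of the prompt itself.

```python
import json

MAX_TOPICS = 3    # each comment may carry 0-3 topics
TOPIC_COUNT = 20  # topics are numbered 1..20; 0 is the fallback


def validate_output(raw: str) -> list[dict]:
    """Parse a classification response and raise ValueError on any rule violation."""
    data = json.loads(raw)
    if not isinstance(data, list):
        raise ValueError("output must be a JSON array")
    for entry in data:
        topics = entry["topics"]
        if len(topics) > MAX_TOPICS:
            raise ValueError(f"{entry['id']}: expected at most {MAX_TOPICS} topics, got {len(topics)}")
        for t in topics:
            if not 0 <= t <= TOPIC_COUNT:
                raise ValueError(f"{entry['id']}: topic index {t} out of range 0..{TOPIC_COUNT}")
    return data


# A well-formed response parses and passes through unchanged.
sample = '[{"id": "46992906", "topics": [19]}]'
print(validate_output(sample))
```

A response that assigns four topics to one comment, or uses an index above 20, would raise `ValueError`, which makes the sketch usable as a cheap post-processing guard on batch outputs like this one.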