The following is content for you to summarize. Do not respond to the comments—summarize them. <topic> ARC-AGI Benchmark Validity # Debate over whether ARC-AGI measures general intelligence or just spatial reasoning puzzles, concerns about benchmarkmaxxing, semi-private vs private test sets, cost per task at $13.62, and whether solving it indicates anything meaningful about AGI capabilities </topic> <comments_about_topic> 1. Arc-AGI-2: 84.6% (vs 68.8% for Opus 4.6) Wow. https://blog.google/innovation-and-ai/models-and-research/ge... 2. Even before this, Gemini 3 has always felt unbelievably 'general' to me. It can beat Balatro (ante 8) with a text description of the game alone[0]. Yeah, it's not an extremely difficult goal for humans, but considering: 1. It's an LLM, not something trained to play Balatro specifically 2. Most (probably >99.9%) players can't do that on the first attempt 3. I don't think there are many people who have posted their Balatro playthroughs in text form online I think it's a much stronger signal of its 'generalness' than ARC-AGI. By the way, Deepseek can't play Balatro at all. [0]: https://balatrobench.com/ 3. Agreed. Gemini 3 Pro has always felt to me like it has a pretraining alpha, if you will. And many data points continue to support that. Even Flash, which was post-trained with different techniques than Pro, is good or equivalent at tasks which require post-training, occasionally even beating Pro (e.g. in Apex Bench from Mercor, which is basically a tool-calling test - simplifying - Flash beats Pro). The score on ARC-AGI-2 is another data point in the same direction. Deepthink is sort of parallel test-time compute with some level of distilling and refinement from certain trajectories (guessing based on my usage and understanding), same as gpt-5.2-pro, and can extract more because of pretraining datasets.
(I am sort of basing this on papers like the limits of RLVR, and pass@k vs pass@1 differences in RL post-training of models; this score just shows how "skilled" the base model was or how strong the priors were. I apologize if this is not super clear, happy to expand on what I am thinking) 4. Weren't we barely scraping 1-10% on this with state-of-the-art models a year ago, and wasn't it considered the final boss, i.e. solve this and it's almost AGI-like? I ask because I cannot distinguish all the benchmarks by heart. 5. François Chollet, creator of ARC-AGI, has consistently said that solving the benchmark does not mean we have AGI. It has always been meant as a stepping stone to encourage progress in the correct direction rather than as an indicator of reaching the destination. That's why he is working on ARC-AGI-3 (to be released in a few weeks) and ARC-AGI-4. His definition of reaching AGI, as I understand it, is when it becomes impossible to construct the next version of ARC-AGI because we can no longer find tasks that are feasible for normal humans but unsolved by AI. 6. > His definition of reaching AGI, as I understand it, is when it becomes impossible to construct the next version of ARC-AGI because we can no longer find tasks that are feasible for normal humans but unsolved by AI. That is the best definition I've read yet. If something claims to be conscious and we can't prove it's not, we have no choice but to believe it. That said, I'm reminded of the impossible voting tests they used to give black people to prevent them from voting. We don't ask nearly so much proof from a human; we take their word for it. On the few occasions we did ask for proof, it inevitably led to horrific abuse. Edit: The average human tested scores 60%. So the machines are already smarter on an individual basis than the average human. 7. Normal humans don't pass this benchmark either, as evidenced by the existence of religion, among other things. 8. 
> The average human tested scores 60%. So the machines are already smarter on an individual basis than the average human. Maybe it's testing the wrong things then. Even those of us who are merely average can do lots of things that machines don't seem to be very good at. I think the ability to learn should be a core part of any AGI. Take a toddler who has never seen anybody doing laundry before, and you can teach them in a few minutes how to fold a t-shirt. Where are the dumb machines that can be taught? 9. > Edit: The average human tested scores 60%. So the machines are already smarter on an individual basis than the average human. I think being better at this particular benchmark does not imply they're 'smarter'. 10. https://x.com/fchollet/status/2022036543582638517 11. I don't think the creator believes ARC3 can't be solved, but rather that it can't be solved "efficiently" - and >$13 per task for ARC2 is certainly not efficient. But at this rate, the people who talk about the goalposts shifting even once we achieve AGI may end up correct, though I don't think this benchmark is particularly great either. 12. ARC-AGI-3 uses dynamic games whose rules LLMs must determine, and is MUCH harder. LLMs can also be ranked on how many steps they required. 13. Yes, but benchmarks like this are often flawed because leading model labs frequently participate in 'benchmarkmaxxing' - i.e. improvements on ARC-AGI2 don't necessarily indicate similar improvements in other areas (though it does seem like this is a step-function increase in intelligence for the Gemini line of models) 14. Isn't the point of ARC that you can't train against it? Or doesn't it achieve that goal anymore somehow? 15. How can you make sure of that? AFAIK, these SOTA models run exclusively on their developers' hardware. So any test, any benchmark, anything you do, does leak by definition. 
Considering the nature of us humans and the typical prisoner's dilemma, I don't see how they wouldn't focus on improving benchmarks even when it gets a bit... shady? I say this as a person who really enjoys AI, by the way. 16. > does leak by definition. As a measure focused solely on fluid intelligence, learning novel tasks and test-time adaptability, ARC-AGI was specifically designed to be resistant to pre-training - for example, unlike many mathematical and programming test questions, ARC-AGI problems don't have first-order patterns which can be learned to solve a different ARC-AGI problem. The ARC non-profit foundation has private versions of their tests which are never released and which only ARC can administer. There are also public versions and semi-public sets for labs to do their own pre-tests. But a lab self-testing on ARC-AGI can be susceptible to leaks or benchmaxing, which is why only "ARC-AGI Certified" results using a secret problem set really matter. The 84.6% is certified, and that's a pretty big deal. IMHO, ARC-AGI is a unique test that's different from any other AI benchmark in a significant way. It's worth spending a few minutes learning about why: https://arcprize.org/arc-agi . 17. > which is why only "ARC-AGI Certified" results using a secret problem set really matter. The 84.6% is certified and that's a pretty big deal. So, I'd agree if this were on the true fully private set, but Google themselves say they test only on the semi-private: > ARC-AGI-2 results are sourced from the ARC Prize website and are ARC Prize Verified. The set reported is v2, semi-private ( https://storage.googleapis.com/deepmind-media/gemini/gemini_... ) This also seems to contradict what ARC-AGI claims "Verified" means on their site. 
> How Verified Scores Work: Official Verification: Only scores evaluated on our hidden test set through our official verification process will be recognized as verified performance scores on ARC-AGI ( https://arcprize.org/blog/arc-prize-verified-program ) So, which is it? IMO you can trivially train / benchmax on the semi-private data, because it is still basically just public; you just have to jump through some hoops to get access. This is clearly an advance, but it seems to me reasonable to conclude this could be driven by some amount of benchmaxing. EDIT: Hmm, okay, it seems their policy and wording are a bit contradictory. They do say ( https://arcprize.org/policy ): "To uphold this trust, we follow strict confidentiality agreements. [...] We will work closely with model providers to ensure that no data from the Semi-Private Evaluation set is retained. This includes collaborating on best practices to prevent unintended data persistence. Our goal is to minimize any risk of data leakage while maintaining the integrity of our evaluation process." But it surely is still trivial to just make a local copy of each question served from the API without this being detected. It would violate the contract, but there are strong incentives to do this, so I guess it just comes down to how much one trusts the model providers here. I wouldn't trust them, given e.g. https://www.theverge.com/meta/645012/meta-llama-4-maverick-b... . It is just too easy to cheat without being caught here. 18. Chollet himself says "We certified these scores in the past few days." https://x.com/fchollet/status/2021983310541729894 . The ARC-AGI papers claim to show that training on a public or semi-private set of ARC-AGI problems is of very limited value in passing a private set. <--- If the prior sentence is not correct, then none of ARC-AGI can possibly be valid. 
So, before "public, semi-private or private" answers leaking or 'benchmaxing' on them can even matter, you need to first assess whether their published papers and data demonstrate their core premise to your satisfaction. There is no "trust" regarding the semi-private set. My understanding is the semi-private set exists only to reduce the likelihood those exact answers unintentionally end up in web-crawled training data. This is to help an honest lab's own internal self-assessments be more accurate. However, a lab's internal eval on the semi-private set still counts for literally zero to the ARC-AGI org. They know labs could cheat on the semi-private set (either intentionally or unintentionally), so they assume all labs are benchmaxing on the public AND semi-private answers and ensure it doesn't matter. 19. They could also cheat on the private set, though. The frontier models presumably never leave the provider's datacenter. So either the frontier models aren't permitted to test on the private set, or the private set gets sent out to the datacenter. But I think such quibbling largely misses the point. The goal is really just to guarantee that the test isn't unintentionally trained on. For that, semi-private is sufficient. 20. Here's a good thread spanning 1+ month, as each model comes out: https://bsky.app/profile/pekka.bsky.social/post/3meokmizvt22... tl;dr - Pekka says Arc-AGI-2 is now toast as a benchmark 21. If you look at the problem space, it is easy to see why it's toast; maybe there's intelligence in there, but hardly general. 22. I'm excited for the big jump in ARC-AGI scores from recent models, but no one should think for a second this is some leap in "general intelligence". I joke to myself that the G in ARC-AGI is "graphical". I think what's held back models on ARC-AGI is their terrible spatial reasoning, and I'm guessing that's what the recent models have cracked. 
Looking forward to ARC-AGI 3, which focuses on trial and error and exploring a set of constraints via games. 23. Agreed. I love the elegance of ARC, but it always felt like a gotcha to give spatial reasoning challenges to token generators - and the fact that the token generators are somehow beating it anyway really says something. 24. The average ARC AGI 2 score for a single human is around 60%. "100% of tasks have been solved by at least 2 humans (many by more) in under 2 attempts. The average test-taker score was 60%." https://arcprize.org/arc-agi/2/ 25. Worth keeping in mind that in this case the test takers were random members of the general public. The score of e.g. people with bachelor's degrees in science and engineering would be significantly higher. 26. https://arcprize.org/leaderboard $13.62 per task - so we need another 5-10 years for the price of running this to become reasonable? But the real question is whether they just fit the model to the benchmark. 27. Yes, but with a significant (logarithmic) increase in cost per task. The ARC-AGI site is less misleading and shows how GPT and Claude are not actually far behind: https://arcprize.org/leaderboard 28. At $13.62 per task it's practically unusable for agent tasks due to the cost. I found that anything over $2/task on Arc-AGI-2 ends up being way too much for use in coding agents. 29. I'm surprised that Gemini 3 Pro is so low at 31.1%, though, compared to Opus 4.6 and GPT 5.2. This is a great achievement, but it's only available to Ultra subscribers, unfortunately. 30. Arc-AGI (and Arc-AGI-2) is the most overhyped benchmark around, though. It's completely misnamed. It should be called useless visual puzzle benchmark 2. Firstly, it's a visual puzzle, making it way easier for humans than for models trained on text. Secondly, it's not really that obvious or easy for humans to solve themselves! So the idea that if an AI can solve "Arc-AGI" or "Arc-AGI-2" it's super smart or even "AGI" is frankly ridiculous. 
It's a puzzle that means nothing, basically, other than that the models can now solve "Arc-AGI" 31. The puzzles are calibrated for human solve rates, but otherwise I agree. 32. My two elderly parents cannot solve Arc-AGI puzzles, but can manage to navigate the physical world, their house, garden, make meals, clean the house, use the TV, etc. I would say they do have "general intelligence", so whatever Arc-AGI is "solving", it's definitely not "AGI" 33. You are confusing fluid intelligence with crystallised intelligence. 34. I think you are making that confusion. Any robotic system in the place of his parents would fail within a few hours. There are more novel tasks in a day than ARC provides. 35. But wait two hours for what OpenAI has! I love the competition, and how someone just a few days ago was telling us how ARC-AGI-2 was proof that LLMs can't reason. The goalposts will shift again. I feel like most of human endeavor will soon be just about trying to continuously show that AIs don't have AGI. 36. Here are the methodologies for all the benchmarks: https://storage.googleapis.com/deepmind-media/gemini/gemini_... The ARC-AGI-2 score (84.6%) is from the semi-private eval set. If gemini-3-deepthink gets above 85% on the private eval set, it will be considered "solved": > Submit a solution which scores 85% on the ARC-AGI-2 private evaluation set and win $700K. https://arcprize.org/guide#overview 37. Huh, so if a China-based lab takes ARC-AGI-2 in the new year, then they can say they had come just shy of a solution anyway. 38. > If gemini-3-deepthink gets above 85% on the private eval set, it will be considered "solved" They never will on the private set, because it would mean it's being leaked to Google. 39. I'm impressed with the Arc-AGI-2 results - though readers beware... They achieved this score at a cost of $13.62 per task. For context, Opus 4.6's best score is 68.8% - but at a cost of $3.64 per task. 40. Less than a year to destroy Arc-AGI-2 - wow. 41. 
I unironically believe that ARC-AGI-3 will have an introduction-to-solved time of 1 month 42. Not very likely? ARC-AGI-3 has a nasty combo of spatial reasoning + explore/exploit. It's basically adversarial vs current AIs. 43. Wow, solving useless puzzles, such a useful metric! 44. How is spatial reasoning useless?? 45. It's still useful as a benchmark of cost/efficiency. 46. It's a useless, meaningless benchmark though; it just got a catchy name, as in, if the models solve this it means they have "AGI", which is clearly rubbish. Arc-AGI score isn't correlated with anything useful. 47. It's correlated with the ability to solve logic puzzles. It's also interesting because it's very, very hard for base LLMs, even if you try to "cheat" by training on millions of ARC-like problems. Reasoning LLMs show genuine improvement on this type of problem. 48. How would we actually objectively measure a model to see if it is AGI, if not with benchmarks like ARC-AGI? 49. ARC-AGI 2 is an IQ test. IQ tests have been shown over and over to have predictive power in humans. People who score well on them tend to be more successful. 50. IQ tests only work if the participants haven't trained for them. If they do similar tests a few times in a row, scores increase a lot. Current LLMs are hyper-optimized for the particular types of puzzles contained in popular "benchmarks". 51. Praying this isn't another Llama4 situation where the benchmark numbers are cooked. 84.6% on Arc-AGI is incredible! </comments_about_topic> Write a concise, engaging paragraph (3-5 sentences) summarizing the key points and perspectives in these comments about the topic. Focus on the most interesting viewpoints. Do not use bullet points—write flowing prose.