Summarizer

LLM Input

llm/2ad2a7bb-5462-4391-a2da-bf11064993c9/topic-2-48bb7e6f-1652-47ac-b8c3-3e93419babcf-input.json

prompt

The following is content for you to summarize. Do not respond to the comments—summarize them.

<topic>
Benchmarkmaxxing Concerns # Skepticism that high benchmark scores reflect real-world performance, suspicions that labs optimize specifically for popular tests, concerns about training data leakage, and debate over whether improvements are genuine or gamed
</topic>

<comments_about_topic>
1. Yeah, or just the Steam text guides would be a huge advantage.

I really doubt it's playing completely blind.

2. > Edit: The average human tested scores 60%. So the machines are already smarter on an individual basis than the average human.

I think being better at this particular benchmark does not imply they're 'smarter'.

3. Yes, but benchmarks like this are often flawed because leading model labs frequently participate in 'benchmarkmaxxing', i.e. improvements on ARC-AGI-2 don't necessarily indicate similar improvements in other areas (though it does seem like this is a step-function increase in intelligence for the Gemini line of models)

4. Isn’t the point of ARC that you can’t train against it? Or doesn’t it achieve that goal anymore somehow?

5. How can you make sure of that? AFAIK, these SOTA models run exclusively on their developers' hardware. So any test, any benchmark, anything you do, does leak by definition. Considering the nature of us humans and the typical prisoner's dilemma, I don't see how they wouldn't focus on improving benchmarks even when it gets a bit... shady?

I tell this as a person who really enjoys AI by the way.

> does leak by definition.

As a measure focused solely on fluid intelligence, learning novel tasks and test-time adaptability, ARC-AGI was specifically designed to be resistant to pre-training - for example, unlike many mathematical and programming test questions, ARC-AGI problems don't have first-order patterns that can be learned from one problem and reused to solve a different ARC-AGI problem.

The ARC non-profit foundation has private versions of their tests that are never released and that only ARC can administer. There are also public versions and semi-public sets for labs to do their own pre-tests. But a lab self-testing on ARC-AGI can be susceptible to leaks or benchmaxing, which is why only "ARC-AGI Certified" results using a secret problem set really matter. The 84.6% is certified and that's a pretty big deal.

IMHO, ARC-AGI is a unique test that's different from any other AI benchmark in a significant way. It's worth spending a few minutes learning about why: https://arcprize.org/arc-agi .

7. > which is why only "ARC-AGI Certified" results using a secret problem set really matter. The 84.6% is certified and that's a pretty big deal.

So, I'd agree if this were on the true, fully private set, but Google themselves say they test only on the semi-private set:

> ARC-AGI-2 results are sourced from the ARC Prize website and are ARC Prize Verified. The set reported is v2, semi-private ( https://storage.googleapis.com/deepmind-media/gemini/gemini_... )

This also seems to contradict what ARC-AGI claims about what "Verified" means on their site.

> How Verified Scores Work: Official Verification: Only scores evaluated on our hidden test set through our official verification process will be recognized as verified performance scores on ARC-AGI ( https://arcprize.org/blog/arc-prize-verified-program )

So, which is it? IMO you can trivially train / benchmax on the semi-private data, because it is still basically public; you just have to jump through some hoops to get access. This is clearly an advance, but it seems to me reasonable to conclude this could be driven by some amount of benchmaxing.

EDIT: Hmm, okay, it seems their policy and wording are a bit contradictory. They do say ( https://arcprize.org/policy ):

"To uphold this trust, we follow strict confidentiality agreements.
[...] We will work closely with model providers to ensure that no data from the Semi-Private Evaluation set is retained. This includes collaborating on best practices to prevent unintended data persistence. Our goal is to minimize any risk of data leakage while maintaining the integrity of our evaluation process."

But it surely is still trivial to just make a local copy of each question served from the API, without this being detected. It would violate the contract, but there are strong incentives to do this, so I guess it just comes down to how much one trusts the model providers here. I wouldn't trust them, given e.g. https://www.theverge.com/meta/645012/meta-llama-4-maverick-b... . It is just too easy to cheat without being caught here.

8. Chollet himself says "We certified these scores in the past few days." https://x.com/fchollet/status/2021983310541729894 .

The ARC-AGI papers claim to show that training on a public or semi-private set of ARC-AGI problems is of very limited value in passing a private set. <--- If the prior sentence is not correct, then none of ARC-AGI can possibly be valid. So, before leaks of "public, semi-private or private" answers or 'benchmaxing' on them can even matter, you first need to assess whether their published papers and data demonstrate their core premise to your satisfaction.

There is no "trust" regarding the semi-private set. My understanding is the semi-private set exists only to reduce the likelihood that those exact answers unintentionally end up in web-crawled training data. This is to help an honest lab's own internal self-assessments be more accurate. However, a lab doing an internal eval on the semi-private set still counts for literally zero to the ARC-AGI org. They know labs could cheat on the semi-private set (either intentionally or unintentionally), so they assume all labs are benchmaxing on the public AND semi-private answers and ensure it doesn't matter.

9. They could also cheat on the private set though. The frontier models presumably never leave the provider's datacenter. So either the frontier models aren't permitted to test on the private set, or the private set gets sent out to the datacenter.

But I think such quibbling largely misses the point. The goal is really just to guarantee that the test isn't unintentionally trained on. For that, semi-private is sufficient.

10. Particularly for the large organizations at the frontier, the risk-reward does not seem worth it.

Cheating on the benchmark in such a blatantly intentional way would create a large reputational risk for both the org and the researcher personally.

When you're already at the top, why would you do that just for optimizing one benchmark score?

11. Because the gains from spending time improving the model overall outweigh the gains from spending time individually training on benchmarks.

The pelican benchmark is a good example, because it's been representative of models' ability to generate SVGs, not just pelicans on bikes.

12. > Because the gains from spending time improving the model overall outweigh the gains from spending time individually training on benchmarks.

This may not be the case if you just e.g. roll the benchmarks into the general training data, or make running on the benchmarks just another part of the testing pipeline. I.e. improving the model generally and benchmaxing could very conceivably just both be done at the same time, it needn't be one or the other.

I think the right takeaway is to ignore the specific percentages reported on these tests (they are almost certainly inflated / biased) and always assume cheating is going on. What matters is that (1) the most serious tests aren't saturated, and (2) scores are improving. I.e. even if there is cheating, we can presume this was always the case, and since models couldn't do as well before even when cheating, these are still real improvements.

And obviously what actually matters is performance on real-world tasks.

13. "Optimize this extremely nontrivial algorithm" would work. But unless the provided solution is novel you can never be certain there wasn't leakage. And anyway at that point you're pretty obviously testing for superintelligence.

14. It can be reasonable to suspect that advances on benchmarks are only weakly or even negatively correlated with advances on real-world tasks. I.e. a huge jump on benchmarks might not be perceptible to 99% of users doing 99% of tasks, or some users might even note degradation on specific tasks. This is especially the case when there is some reason to believe most benchmarks are being gamed.

Real-world use is what matters, in the end. I'd be surprised if a change this large doesn't translate to something noticeable in general, but the skepticism is not unreasonable here.

15. https://arcprize.org/leaderboard

$13.62 per task - so we need another 5-10 years for the price of running this to become reasonable?

But the real question is if they just fit the model to the benchmark.

16. It's garbage really; I can't see how they score so high on benchmarks.

17. You are not the only one; it's to the point where I think these benchmark results must be faked somehow, because they don't match my reality at all.

18. Same, as far as I am concerned, Gemini is optimized for benchmarks.

I mean, last week it suddenly insisted on two consecutive prompts that my code was in Python. It was in Rust.

19. > I don’t think it’s hyperbolic to say that we may be only a single digit number of years away from the singularity.

We're back to singularity hype, but let's be real: benchmark gains are meaningless in the real world when the primary focus has shifted to gaming the metrics

20. Ok, here I am living in the real world finding these models have advanced incredibly over the past year for coding.

Benchmaxxing exists, but that’s not the only data point. It’s pretty clear that models are improving quickly in many domains in real world usage.

21. Yet even Anthropic has shown the downsides to using them. I don't think it is a given that improvements in model scores and capabilities, plus being able to churn out code as fast as we can, will lead us to a singularity; we'll need more than that.

22. Don't let the benchmarks fool you. Gemini models are completely useless no matter how smart they are. Google still hasn't figured out tool calling and making the model follow instructions. They seem to only care about benchmarking and being the most intelligent model on paper. This has been a problem with Gemini since 1.0 and they still haven't fixed it.

Also the worst model in terms of hallucinations.

23. They seem to be optimizing for benchmarks instead of real world use

24. Here are the methodologies for all the benchmarks: https://storage.googleapis.com/deepmind-media/gemini/gemini_...

The arc-agi-2 score (84.6%) is from the semi-private eval set. If gemini-3-deepthink gets above 85% on the private eval set, it will be considered "solved"

>Submit a solution which scores 85% on the ARC-AGI-2 private evaluation set and win $700K. https://arcprize.org/guide#overview

25. > If gemini-3-deepthink gets above 85% on the private eval set, it will be considered "solved"

They never will on the private set, because it would mean it's being leaked to Google.

26. It's ahead in raw power but not in function. Like it's got the world's fastest engine but one gear! Trouble is, some benchmarks only measure horsepower.

> Trouble is, some benchmarks only measure horsepower.

IMO it's the other way around. Benchmarks only measure applied horsepower on a set plane, with no friction, and your elephant is a point sphere. Goog's models have always punched above what benchmarks said, in real-world use @ high context. They don't focus on "agentic this" or "specialised that", but the raw models, with good guidance, are workhorses. I don't know any other model where you can throw lots of docs at it and get proper context following and data extraction from wherever it's at to where you'd need it.

28. 5 days for AI is by no means short! If it can solve it, it would need perhaps 1-2 hours. If it cannot, 5 days of continuous running would produce only gibberish. We can safely assume that such private models will run inference entirely on dedicated hardware, sharing with nobody. So if they could not solve the problems, it's not due to any artificial constraint or lack of resources, far from it.

The 5-day window, however, is a sweet spot because it likely prevents cheating by hiring a math PhD to feed the AI hints and ideas.

29. So, you've said multiple times in the past that you're not concerned about AI labs training for this specific test because if they did, it would be so obviously incongruous that you'd easily spot the manipulation and call them out.

Which tbh has never really sat right with me, seemingly placing way too much confidence in your ability to differentiate organic vs. manipulated output in a way I don't think any human could be expected to.

To me, this example is an extremely neat and professional SVG and so far ahead it almost seems too good to be true. But like with every previous model, you don't seem to have the slightest amount of skepticism in your review. I don't think I truly believe Google cheated here, but it's so good it does therefore make me question whether there could ever be an example of a pelican SVG in the future that actually could trigger your BS detector?

I know you say it's just a fun/dumb benchmark that's not super important, but you're easily in the top 3 most well known AI "influencers" whose opinion/reviews about model releases carry a lot of weight, providing a lot of incentive with trillions of dollars flying around. Are you still not at all concerned by the amount of attention this benchmark receives now/your risk of unwittingly being manipulated?

30. Tbh they'd have to be absolutely useless at benchmarkmaxxing if they didn't include your pelican riding a bicycle...

31. How likely is it that this problem is already in the training set by now?

32. If anyone trains a model on https://simonwillison.net/tags/pelican-riding-a-bicycle/ they're going to get some VERY weird looking pelicans.

33. Why would they train on that? Why not just hire someone to make a few examples.

34. I look forward to them trying. I'll know when the pelican riding a bicycle is good but the ocelot riding a skateboard sucks.

35. Would it not be better to have 100 such tests ("Pelican on bicycle", "Tiger on stilts", ...), generate them all for every new model, but only release a new one each time? That way you could show progression across all models, and attempts at benchmaxxing would be more obvious.

Given the crazy money and vying for supremacy among AI companies right now, it does seem naive to believe that no attempt at better pelicans on bicycles is being made. You can argue "but I will know because of the quality of ocelots on skateboards", but without a back catalog of ocelots on skateboards to publish it's one data point and leaves the AI companies with too much plausible deniability.

The pelicans-on-bicycles is a bit of fun for you (and us!) but it has become a measure of the quality of models, so it's serious business for them.

There is an asymmetry of incentives and a high risk you are being their useful idiot. Sorry to be blunt.

36. But they could just train on an assortment of animals and vehicles. It's the kind of relatively narrow domain where NNs could reasonably interpolate.

37. The idea that an AI lab would pay a small army of human artists to create training data for $animal on $transport just to cheat on my stupid benchmark delights me.

38. When you're spending trillions on capex, paying a couple of people to make some doodles in SVGs would not be a big expense.

39. Vetting them for the potential for whistleblowing might be a bit more involved. But conspiracy theories have an advantage because the lack of evidence is evidence for the theory.

40. Huh? AI labs are routinely spending millions to billions on various third-party contractors specializing in creating/labeling/verifying specialized content for pre/post-training.

This would just be one more checkbox buried in hundreds of pages of requests, and compared to plenty of other ethical grey areas like copyright laundering with actual legal implications, leaking that someone was asked to create a few dozen pelican images seems like it would be at the very bottom of the list of reputational risks.

41. Who do you think is in on that? Not only pelicans, I mean, the whole thing. CEOs, top researchers, select mathematicians, congressmen? Does China participate in maintaining the bubble?

I, myself, prefer the universal approximation theorem and empirical finding that stochastic gradient descent is good enough (and "no 'magic' in the brain", of course).

42. Well, since we're all talking about sourcing training material to "benchmaxx" for social proof, and not litigating the whole "AI bubble" debate, here's just the entire cottage industry of data curation firms:

https://scale.com/data-engine

https://www.appen.com/llm-training-data

https://www.cogitotech.com/generative-ai/

https://www.telusdigital.com/solutions/data-for-ai-training/...

https://www.nexdata.ai/industries/generative-ai

---

P.S. Google Comms would have been consulted re putting a pelican in the I/O keynote :-)

https://x.com/simonw/status/1924909405906338033

43. Cool. At least they are working across the board and benchmaxing random things like the theory of mind.

44. The embarrassment of getting caught doing that would be expensive.

45. You can easily make an RLAIF loop.

- Take a list of n animals × m vehicles

- Ask an LLM to generate an SVG for each of the n×m combinations

- Render a PNG from each SVG

- Ask a model with vision to grade the result

- Update your weights accordingly

No need for humans to draw the dataset, no need for humans to evaluate. A rough sketch of such a loop is below.
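
A minimal sketch of that loop, assuming placeholder functions for the generator, rasterizer, and vision grader (every name here is hypothetical, not a real API; in practice each stub would call your own models or a rendering library):

```python
import itertools

# Hypothetical prompt pool for the n animals × m vehicles grid.
ANIMALS = ["pelican", "ocelot", "tiger", "walrus"]
VEHICLES = ["bicycle", "skateboard", "unicycle", "stilts"]

def generate_svg(prompt: str) -> str:
    # Placeholder: call the model being trained and return its SVG text.
    return "<svg><!-- model output would go here --></svg>"

def render_png(svg: str) -> bytes:
    # Placeholder: rasterize the SVG (e.g. with a library such as cairosvg).
    return b""

def grade_image(png: bytes, prompt: str) -> float:
    # Placeholder: ask a vision model how well the image matches the prompt,
    # returning a reward in [0, 1].
    return 0.0

def collect_reward_data() -> list[dict]:
    """Build (prompt, completion, reward) records for an RL-style update."""
    records = []
    for animal, vehicle in itertools.product(ANIMALS, VEHICLES):
        prompt = f"Generate an SVG of a {animal} riding a {vehicle}"
        svg = generate_svg(prompt)
        reward = grade_image(render_png(svg), prompt)
        records.append({"prompt": prompt, "completion": svg, "reward": reward})
    return records  # feed these into whatever policy-update step you use

if __name__ == "__main__":
    print(f"collected {len(collect_reward_data())} graded samples")
```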

46. None of this works if the testers are collaborating with the trainers. The tests ostensibly need to be arm's-length from the training. If the trainers ever start over-fitting to the test, the testers would secretly come up with some new test.

47. I've heard it posited that the reason the frontier companies are frontier is that they have custom data and evals. This is what I would do too.

48. It's incredible how fast these models are getting better. I thought for sure a wall would be hit, but these numbers smash previous benchmarks. Anyone have any idea what the big unlock is that people are finding now?

49. Companies are optimizing for all the big benchmarks. This is why there is so little correlation between benchmark performance and real world performance now.

50. We will see at the end of April, right? It's more of a guess than a strongly held conviction, but I see models improving rapidly at long-horizon tasks, so I think it's possible. I think a benchmark that could survive a few months (maybe) would be one that genuinely tested long-time-frame continual learning / test-time learning / test-time post-training (honestly, I don't know the differences between these).

But I'm not sure how to set up such benchmarks. I'm thinking of tasks like learning a language, becoming a master at chess from scratch, or becoming a skilled artist, but where the task is novel enough that the actor is nowhere close to proficient at the beginning. An example that could be of interest is: here is a robot you control, you can make actions and see results... become proficient at table tennis. Maybe another would be: here is a new video game, obtain the best possible 0% speedrun.

51. Wow, solving useless puzzles, such a useful metric!

52. But why only a +0.5% increase for MMMU-Pro?

53. It's possibly label noise. But you can't tell from a single number.

You would need to check whether everyone is making mistakes on the same 20% or a different 20%. If it's the same 20%, either those questions are really hard, or they are keyed incorrectly, or they aren't stated with enough context to actually solve the problem.

It happens. The old non-Pro MMLU had a lot of wrong answers. Simple things like MNIST have digits labeled incorrectly or drawn so badly it's not even a digit anymore.
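
A quick sketch of that check (the model names and miss sets here are made up for illustration): compare which question IDs each model gets wrong and see how much the miss sets overlap.

```python
# Hypothetical per-model sets of question IDs answered incorrectly.
wrong = {
    "model_a": {3, 7, 19, 42, 55},
    "model_b": {3, 7, 19, 42, 61},
    "model_c": {3, 7, 19, 48, 55},
}

shared = set.intersection(*wrong.values())  # missed by every model
union = set.union(*wrong.values())          # missed by at least one model
jaccard = len(shared) / len(union)

print(f"missed by all models: {sorted(shared)}")
print(f"overlap (Jaccard): {jaccard:.2f}")  # high overlap hints at bad labels or underspecified items
```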

54. Everyone is already at 80% for that one. Crazy that we were just at 50% with GPT-4o not that long ago.

55. But 80% sounds far from good enough; that's a 20% error rate, unusable in autonomous tasks. Why stop at 80%? If we aim for AGI, it should score 100% on any benchmark we give.

56. I'm not sure the benchmark is high enough quality that >80% of problems are well-specified & have correct labels tbh. (But I guess this question has been studied for these benchmarks)

57. IQ tests only work if the participants haven't trained for them. If they do similar tests a few times in a row, scores increase a lot. Current LLMs are hyper-optimized for the particular types of puzzles contained in popular "benchmarks".

58. The benchmark should be: can you ask it to create a profitable business or product and send you the profit?

Everything else is bikeshedding.

59. Praying this isn't another Llama 4 situation where the benchmark numbers are cooked. 84.6% on ARC-AGI is incredible!
</comments_about_topic>

Write a concise, engaging paragraph (3-5 sentences) summarizing the key points and perspectives in these comments about the topic. Focus on the most interesting viewpoints. Do not use bullet points—write flowing prose.

topic

Benchmarkmaxxing Concerns # Skepticism that high benchmark scores reflect real-world performance, suspicions that labs optimize specifically for popular tests, concerns about training data leakage, and debate over whether improvements are genuine or gamed

commentCount

59
