Benchmark Skepticism

Questions about benchmark reliability, accusations of gaming benchmarks, noting regressions in long-context retrieval, and debates about whether benchmarks reflect real-world performance

While official benchmarks suggest steady progress in coding capabilities, many users report a growing "vibe" that models are being optimized specifically for test scores at the expense of general reliability and long-context retrieval. This skepticism is fueled by dramatic performance regressions in specific metrics like the MRCR benchmark, leading critics to argue that labs are "benchmaxxing" or trading away core logic to inflate their marketing claims. The discourse highlights a widening gap between sterile automated scores and the messy reality of daily workflows, where perceived declines in intelligence are often attributed to "silent nerfing" to save compute costs. Ultimately, the community remains divided over whether these regressions are a calculated engineering trade-off or a psychological byproduct of shifting user expectations and "anecdata."

View on HN · Topics

Well, at least we know that's one gotcha/benchmark they aren't gaming.

View on HN · Topics

A secret backup test to the pelican? This is as noteworthy as 4.7 dropping.

View on HN · Topics

A major reason for that is because there's no way to objectively evaluate the performance of LLMs. So the meme projects are equally as valid as the serious ones, since the merits of both are based entirely on anecdata.

It also doesn't help that projects and practices are promoted and adopted based on influencer clout. Karpathy's takes will drown out ones from "lesser" personas, whether they have any value or not.

View on HN · Topics

It really wasn't. Most of the argument was around product portfolio and agentic coding performance.

View on HN · Topics

Proof they are nerfing the model? It is stable in benchmarks: https://marginlab.ai/trackers/claude-code-historical-perform...

All this reads like just another case of mass psychosis to me.

View on HN · Topics

Proof they don't nerf it only after testing that the benchmarks there stay the same? So overall performance degrades but they isolate those benchmarks?

View on HN · Topics

Currently GPT just works much better, and so does Gemini but it's more expensive right now. Going through Opencode stats, their claim is that Gemini is the current best model followed by GPT 5.4 on their benchmarks, but the difference is slim.

My personal experience is best with GPT but it could be the specific kind of work I use it for which is heavy on maths and cpp (and some LISP).

View on HN · Topics

Meh. At $work we were on CC for one month, then switched to Codex for one month, and now will be on CC again to test. We haven’t seen any obvious difference between CC and Codex; both are sometimes very good and sometimes very stupid. You have to test for a long time, not just test one day and call it a benchmark just because you have a single example.

View on HN · Topics

Part of me wonders if there's some subtle behavioral change with it too. Early on we distrust a model, so we give it more detail to compensate for its assumed inability, and we're blown away when it outperforms our expectations. Weeks later we're more aligned with its capabilities and so we become lazy. The model is very good, so why put in as much work to provide specifics, specs, ACs, etc.? So of course the quality slides, because we assumed its capabilities somehow absolved the need for the same detailed guardrails (specs, ACs, etc.) for the LLM.

This scenario obviously does not apply to folks who run their own benches with the same inputs between models. I'm just discussing a possible and unintentional human behavioral bias.

Even if this isn't the root cause, humans are really bad at perceiving reality. Like, really really bad. LLMs are also really difficult to objectively measure. I'm sure the coupling of these two facts plays a part, possibly a significant one, in our perception of LLM quality over time.

View on HN · Topics

Or it could be a selection bias. The ground truth is not what HN herd mentality complains about, but the usage stats.

View on HN · Topics

The default effort change in Claude Code is worth knowing before your next session: it's now `xhigh` (a new level between `high` and `max`) for all plans, up from the previous default. Combined with the 1.0–1.35× tokenizer overhead on the same prompts, actual token spend per agentic session will likely exceed naive estimates from 4.6 baselines.

Anthropic's guidance is to measure against real traffic—their internal benchmark showing net-favorable usage is an autonomous single-prompt eval, which may not reflect interactive multi-turn sessions where tokenizer overhead compounds across turns. The task budget feature (just launched in public beta) is probably the right tool for production deployments that need cost predictability when migrating.
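
The tokenizer overhead quoted above can be turned into a rough budget range. This is a minimal sketch, assuming a hypothetical per-session baseline measured on 4.6; the function name `token_range` and the 2,000,000-token baseline are illustrative and not from the comment. It covers only the 1.0x to 1.35x tokenizer factor and does not model the extra tokens from the higher default effort level, which the comment does not quantify.

```python
def token_range(baseline_tokens: int, low: float = 1.0, high: float = 1.35) -> tuple[int, int]:
    """Scale a measured 4.6 baseline by the quoted tokenizer overhead range.

    Returns (best case, worst case) token counts for the same prompts.
    """
    return round(baseline_tokens * low), round(baseline_tokens * high)


# Hypothetical baseline: 2,000,000 tokens per agentic session on 4.6.
best, worst = token_range(2_000_000)
print(f"expected tokens per session: {best:,} to {worst:,}")
```

Measuring against real traffic, as the comment suggests, would replace the hypothetical baseline with observed numbers.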

View on HN · Topics

The model card confirms the chain-of-thought supervision error from Mythos was present during Opus 4.7 training too, affecting 7.8% of episodes. That's not a one-time bug that got patched. It's a training pipeline issue that persisted across model generations. The long-context regression from 91.9% to 59.2% is also worth noting — they traded retrieval accuracy for coding benchmarks, which is a reasonable engineering choice, but the framing buries it.

View on HN · Topics

From a quick test, it seems to hallucinate a lot more than Opus 4.6. I like to ask random knowledge questions like "What are the best Chinese RPGs with decent translations for someone who is not familiar with them? The classics one should not miss?" and 4.6 gave accurate answers, while 4.7 hallucinated the names of games, gave wrong information on how to run them, etc.

Seems common for any type of slightly obscure knowledge.

View on HN · Topics

These stuck out as promising things to try. It looks like xhigh on 4.7 scores significantly higher on the internal coding benchmark (71% vs 54%, though unclear what that is exactly)

> More effort control: Opus 4.7 introduces a new xhigh (“extra high”) effort level between high and max, giving users finer control over the tradeoff between reasoning and latency on hard problems. In Claude Code, we’ve raised the default effort level to xhigh for all plans. When testing Opus 4.7 for coding and agentic use cases, we recommend starting with high or xhigh effort.

The new /ultrareview command looks like something I've been trying to invoke myself with looping, happy that it's free to test out.

> The new /ultrareview slash command produces a dedicated review session that reads through changes and flags bugs and design issues that a careful reviewer would catch. We’re giving Pro and Max Claude Code users three free ultrareviews to try it out.

View on HN · Topics

Quite a big improvement in coding benchmarks, doesn’t seem like progress is plateauing as some people predicted.

View on HN · Topics

But it majorly regressed in long context retrieval? Which is arguably getting more and more important?

View on HN · Topics

Only in benchmarks. After a couple of minutes of use it feels just as dumb as the nerfed 4.6.

View on HN · Topics

Are you one of those naive people that still take these coding benchmarks seriously?

View on HN · Topics

Some of the benchmarks went down, has that happened before?

View on HN · Topics

If you mean for Anthropic in particular, I don't think so. But it's not the first time a major AI lab publishes an incremental update of a model that is worse at some benchmarks. I remember that a particular update of Gemini 2.5 Pro improved results in LiveCodeBench but scored lower overall in most benchmarks.

https://news.ycombinator.com/item?id=43906555

View on HN · Topics

Ask it to create an iOS app which natively runs Gemma via Litert-lm.

It’s incredibly trivial to find stuff outside their capabilities. In fact most stuff I want AI to do it just can’t, and the stuff it can isn’t interesting to me.

View on HN · Topics

Constantly. Minor revisions can easily "wobble" on benchmarks that the training didn't explicitly push them for.

Whether it's genuine loss of capability or just measurement noise is typically unclear.
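
One rough way to separate a real regression from measurement noise is a two-proportion z-test on pass counts. The sketch below is an illustration under stated assumptions, not a method anyone in the thread describes: the function name and the pass counts are hypothetical, and the test assumes independent tasks with a pass/fail outcome.

```python
import math


def two_proportion_z(pass_a: int, n_a: int, pass_b: int, n_b: int) -> tuple[float, float]:
    """Return (z statistic, two-sided p-value) for the difference in pass rates."""
    p_a, p_b = pass_a / n_a, pass_b / n_b
    pooled = (pass_a + pass_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # Both runs passed everything or failed everything: no evidence of a difference.
        return 0.0, 1.0
    z = (p_a - p_b) / se
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical example: 150/200 tasks passed before an update, 140/200 after.
z, p = two_proportion_z(150, 200, 140, 200)
print(f"z = {z:.2f}, p = {p:.3f}")
```

With these made-up counts, a five-point drop on 200 tasks is not statistically distinguishable from noise, which is consistent with the "wobble" described above.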

View on HN · Topics

Looking at the system card for Opus 4.7, the MRCR benchmark used for long-context tasks dropped significantly, from 78% to 32%.

I wonder what caused such a large regression in this benchmark

View on HN · Topics

> Usage limits are necessary but I guess people expect more subsidized inference than the company can afford. So they make very angry comments online

This is reductive. You're both calling people unreasonably angry but then acknowledging there's a limit in compute that is a practical reality for Anthropic. This isn't that hard. They have two choices, rate limit, or silently degrade to save compute.

I have never hit a rate limit, but I have seen it get noticeably stupider. It doesn't make me angry, but comments like these are a bit annoying to read, because you are trying to make people sound delusional while, at the same time, confirming everything they're saying.

I don't think they have turned a big knob that makes it stupider for everyone. I think they can see when a user is overtapping their $20 plan and silently degrade them. Because there's no alert for that. Which is why AI benchmark sites are irrelevant.

View on HN · Topics

I recognize the sarcasm. The data I can find says it's performing at baseline however?

https://marginlab.ai/trackers/claude-code/

View on HN · Topics

Huge regression for long-context tasks, interestingly.

The MRCR benchmark went from 78% to 32%.

View on HN · Topics

Funny how they use Mythos Preview in these benchmarks like a carrot on a stick.

View on HN · Topics

If the model is based on a new tokenizer, that means that it's very likely a completely new base model. Changing the tokenizer is changing the whole foundation a model is built on. It'd be more straightforward to add reasoning to a model architecture compared to swapping the tokenizer to a new one.

Usually a ground up rebuild is related to a bigger announcement. So, it's weird that they'd be naming it 4.7.

Swapping out the tokenizer is a massive change. Not an incremental one.

View on HN · Topics

> Usually a ground up rebuild is related to a bigger announcement. So, it's weird that they'd be naming it 4.7.

Benchmarks say it all. Gains over previous model are too small to announce it as a major release. That would be humiliating for Anthropic. It may scare investors that the curve flattened and there are only diminishing returns.

View on HN · Topics

The most important question is: does it perform better than 4.6 in real world tasks? What's your experience?

View on HN · Topics

How should one compare benchmark results?
For example, SWE-bench Pro improved ~11% compared with Opus 4.6. Should one interpret that as 4.7 being able to solve more difficult problems, or as 11% fewer hallucinations?

View on HN · Topics

There is no hallucination benchmark currently.

I was researching how to predict hallucinations using the literature (Fastowski et al., 2025; Cecere et al., 2025), and the general-ish situation is that there are ways to introspect model certainty levels by probing the model from the outside to get the same certainty metric that you _would_ have gotten if the model had been trained as a Bayesian model, i.e., it knows what it knows and it knows what it doesn't know.

This significantly improves claim-level false-positive rates (measured with the AUARC metric, i.e., abstention rates: having the model shut up when it is actually uncertain).

This would be great to include as a metric in benchmarks because right now the benchmark just says "it solves x% of benchmarks", whereas the real question real-world developers care about is "it solves x% of benchmarks *reliably*" AND "it creates false positives y% of the time".

So the answer to your question is: we don't know. It might be a cherry-picked result, it might be fewer hallucinations (better metacognition), or it might be the capability to solve more difficult problems (better intelligence).

The benchmarks don't make this explicit.
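
To make the abstention idea concrete, here is a minimal sketch of one common way to compute an accuracy-rejection score. It assumes AUARC here refers to the area under the accuracy-rejection curve, and that each answer comes with a confidence score and a correctness label; the function name and the sample data are hypothetical.

```python
def auarc(confidences: list[float], correct: list[bool]) -> float:
    """Approximate the area under the accuracy-rejection curve.

    Answers are ranked from most to least confident. For every cutoff k we
    keep the k most confident answers (abstaining on the rest) and record
    the accuracy on the kept set; the score is the mean of those accuracies.
    """
    if not confidences or len(confidences) != len(correct):
        raise ValueError("need two non-empty lists of equal length")
    order = sorted(range(len(confidences)), key=lambda i: confidences[i], reverse=True)
    hits = 0
    total = 0.0
    for kept, idx in enumerate(order, start=1):
        hits += correct[idx]
        total += hits / kept
    return total / len(order)


# Hypothetical data: same answers, but confidence is informative in one case only.
well_calibrated = auarc([0.9, 0.8, 0.3, 0.2], [True, True, False, False])
poorly_calibrated = auarc([0.9, 0.8, 0.3, 0.2], [False, False, True, True])
print(well_calibrated > poorly_calibrated)
```

Both hypothetical models solve 50% of the tasks, yet the one whose confidence tracks correctness scores higher, which is the distinction the comment says plain pass-rate benchmarks miss.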

View on HN · Topics

Benchmarks are meaningless. Try it on your own problems and see if it has improved for what you want to use it for.

View on HN · Topics

Benchmark results don’t directly translate to actual real world improvement. So we might guess it’s somewhat better but hard to say exactly in what way

View on HN · Topics

11% further along the particular bell curve of SWE-bench. Not really easy to extrapolate to real world, especially given that eg the Chinese models tend to heavily train on the benchmarks. But a 10% bump with the same model should equate to “feels noticeably smarter”.

A more quantifiable eval would be METR’s task time - it’s the duration of tasks that the model can complete on average 50% of the time, we’ll have to wait to see where 4.7 lands on this one.

View on HN · Topics

Interesting to see the benchmark numbers, though at this point I find these incremental seeming updates hard to interpret into capability increases for me beyond just "it might be somewhat better".

Maybe I've skimmed too quickly and missed it, but does calling it 4.7 instead of 5 imply that it's the same as 4.6, just trained with further refined data/fine tuned to adapt the 4.6 weights to the new tokenizer etc?

View on HN · Topics

Do we have any performance benchmark with token length? Now that the context size is 1 M. I would want to know if I can exhaust all of that or should I clear earlier?

View on HN · Topics

The benchmarks of Opus 4.6 they compare to MUST be retaken the day of the new model release. If it was nerfed we need to know how much.

View on HN · Topics

Interesting that the MCP-Atlas score for 4.6 jumped to 75.8% compared to 59.5% https://www.anthropic.com/news/claude-opus-4-6

There's other small single digit differences, but I doubt that the benchmark is that unreliable...?

View on HN · Topics

page is updated to state:

MCP-Atlas: The Opus 4.6 score has been updated to reflect revised grading methodology from Scale AI.

View on HN · Topics

Tried it, after about 10 messages, Opus 4.7 ceased to be able to recall conversation beyond the initial 10 messages. Super weird.

View on HN · Topics

They even use Mythos, which isn't available for most people to use, as a comparison in their own benchmarks. What a joke.

View on HN · Topics

Might be sticking with 4.6. It's only been 20 minutes of using 4.7 and there are annoyances I didn't face with 4.6, what the heck. Huge downgrade on MRCR too.

256K:

- Opus 4.6: 91.9%
- Opus 4.7: 59.2%

1M:

- Opus 4.6: 78.3%
- Opus 4.7: 32.2%
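
The size of the MRCR regression quoted above can be worked out both in percentage points and relative to the 4.6 score. A minimal sketch using only the four numbers from the comment; the function name `regression` is illustrative.

```python
# MRCR scores (percent) as quoted in the comment above.
scores = {
    "256K": {"Opus 4.6": 91.9, "Opus 4.7": 59.2},
    "1M": {"Opus 4.6": 78.3, "Opus 4.7": 32.2},
}


def regression(old: float, new: float) -> tuple[float, float]:
    """Return (absolute drop in percentage points, relative drop in percent)."""
    drop = old - new
    return round(drop, 1), round(100 * drop / old, 1)


for context, s in scores.items():
    points, relative = regression(s["Opus 4.6"], s["Opus 4.7"])
    print(f"{context}: -{points} points ({relative}% relative)")
# 256K: -32.7 points (35.6% relative)
# 1M: -46.1 points (58.9% relative)
```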

View on HN · Topics

Just started using Codex. Claude is just a marketing machine and benchmaxxing, and you can only use their dangerous model if you pay a gazillion and show your ID.

View on HN · Topics

Gemini and Codex already scored higher on benchmarks than Opus 4.6 and they recently added a $100 tier with limited 2x limits, that's their answer and it seems people have caught on.

View on HN · Topics

You are operating purely on vibes, https://marginlab.ai/trackers/claude-code-historical-perform...

View on HN · Topics

Not rejecting reality, but increasingly doubting the effectiveness of these tests. And yes, it's a subjective n=1, but for many months now I have literally been creating and shipping projects from the same forked GitHub template repository, doing essentially the same steps with a few different brand touches and near-muscle-memory prompting to do just the right next steps mechanically, over and over again. The amount of work getting done per step got worse and the quality degraded too, with basic things forgotten a few prompts in. As I said, n=1, but the very repetitive nature of my current work days, always doing a new thing from the exact same starting point that hasn't changed in half a year, is kind of my personal benchmark. YMMV, but on my end the effects are real, specifically when tracking hours on this stuff.

View on HN · Topics

It seems like we're hitting a solid plateau of LLM performance with only slight changes each generation. The jumps between versions are getting smaller. When will the AI bubble pop?

View on HN · Topics

SWE-bench pro is ~20% higher than the previous .1 generation which was released 2 months ago. For their SWE benchmark, the token consumption iso-performance is down 2x from the model they released 2 months ago.

If this is a plateau I struggle to imagine what you consider fast progress.
