Model Quality Degradation

Widespread user complaints about Opus 4.6 becoming noticeably dumber in weeks before 4.7 release, with theories about intentional nerfing, compute constraints, quantization routing based on time of day, and suspicions that new releases restore previously available capabilities

Users are reporting a sharp decline in Opus 4.6’s intelligence just before the 4.7 release, fueling "tinfoil hat" theories that Anthropic "nerfed" the model through aggressive quantization or compute-throttling to make the new version feel like a more dramatic upgrade. While some observers dismiss these claims as psychological bias or the natural result of managing increasingly messy codebases, many power users insist the model has become "lobotomized," prone to hallucinations, and suddenly unable to follow basic spatial logic. The launch of Opus 4.7 has only intensified the debate, as early feedback highlights a frustrating trade-off between improved agentic capabilities and massive, measurable regressions in long-context retrieval and research depth.

View on HN · Topics

As far as I can tell, no one has asked this question publicly before, but I wonder if they're tinkering with quantizations in the background.

When I was exploring local inference, one of the early mistakes I made was using lower quantizations for models, and then assuming that the model sucked based on the results. It took me a while to learn that as the number of bits decreases, the fidelity of "simulation" decreases as well. But so does the total size in memory.

For example for Kimi K2 - a trillion parameter open source model, the 4-bit version is around 580GB in size. The 8-bit version is, 1.1 TB (yes TB - 1,090 GB). And the full fidelity 16-bit version is, 2.05TB.

So essentially the 4-bit version is roughly (give or take how you're packaging it) 1/4th the size of the 16-bit one. And significantly dumber.

If you have a limited amount of resources / don't want to utilize 6 GB200s to run the model, and want to make do with three much cheaper (comparatively) B200 units, then a lower quantization allows you to do that.

I wonder if Opus' inconsistent performance is because - depending on the time of the day - Anthropic routes you to a differently quantized version?

View on HN · Topics

Its especially concerning / frustrating because boris’s reply to my bug report on opus being dumber was “we think adaptive thinking isnt working” and then thats the last I heard of it: https://news.ycombinator.com/item?id=47668520

Now disabling adaptive thinking plus increasing effort seem to be what has gotten me back to baseline performance but “our internal evals look good“ is not good enough right now for what many others have corroborated seeing

View on HN · Topics

Sure, but that blackbox was giving me a lot of value last month.

View on HN · Topics

Input: I want to wash my car. The car wash is 50 meters away. Should I walk or drive?

Output: Walk. It'll take you under a minute, and driving 50 meters barely gets the engine warm — plus you'd just have to park again at the other end. Honestly, by the time you started the car, you'd already be there on foot.

---

I asked it to figure out why it made the mistake:

"Physical/spatial common sense. Exactly what just happened — I pattern-matched "50 meters, walk vs drive" to a pedestrian trip and skipped the part where the car is the thing being transported. Similar traps: "I need to move this couch upstairs, should I take the elevator or stairs?" (the couch has to go with you), or anything where the object in the task needs to accompany you.

Counting and letter-level operations. How many R's in "strawberry," counting words in a sentence, reversing strings. I process text as tokens, not characters, so these are surprisingly error-prone.

Arithmetic at scale. Small sums are fine; multi-digit multiplication, long division, or anything requiring careful carrying is where I'll confidently produce a wrong answer. I should reach for a calculator and often don't.
Recent events and live data. My training has a cutoff, and even after that I can be wrong about who currently holds a role, what a stock is worth, or whether a restaurant is still open. I should search more often than I do.

Negation and constraint-stacking. "Name a movie that's not a sequel, not from the 2010s, not American, and has no dialogue" — the more filters, the more likely I quietly violate one.
Your specific situation. I'll give generic advice when the right answer depends on details I didn't ask about. Sometimes I should stop and ask; often I don't.

Confidently wrong on obscure specifics. Quoting a passage from a real book, citing a real paper's exact title, naming the B-side of a 1974 single — I'll often generate something plausible rather than admit I don't know.

Taking a premise too literally. The car wash answer is a version of this: I answered the question as posed instead of stepping back to ask what you were actually trying to do.
The meta-gotcha: I'm often most wrong when I sound most confident. If an answer comes out fast and tidy on a question that should be messy, that's a signal to push back."

View on HN · Topics

| I want to wash my car. The car wash is 50 meters away. Should I walk or drive?

● Drive. The car needs to be at the car wash.

Wonder if this is just randomness because its an LLM, or if you have different settings than me?

View on HN · Topics

My settings are pretty standard:

% claude
Claude Code v2.1.111
Opus 4.7 (1M context) with xhigh effort · Claude Max
~/...
Welcome to Opus 4.7 xhigh! · /effort to tune speed vs. intelligence

I want to wash my car. The car wash is 50 meters away. Should I walk or drive?

Walk. 50 meters is shorter than most parking lots — you'd spend more time starting the car and parking than walking there. Plus, driving to a car wash you're about to use defeats the purpose if traffic or weather dirties it en route.

View on HN · Topics

I've run several local models that get this right. Qwen 3.5 122B-A10B gets this right, as does Gemma 4 31B. These are local models I'm running on my laptop GPU (Strix Halo, 128 GiB of unified RAM).

And I've been using this commonly as a test when changing various parameters, so I've run it several times, these models get it consistently right. Amazing that Opus 4.6 whiffs it.

View on HN · Topics

... here's the pelican, I think Qwen3.6-35B-A3B running locally did a better job! https://simonwillison.net/2026/Apr/16/qwen-beats-opus/

View on HN · Topics

A secret backup test to the pelican? This is as noteworthy as 4.7 dropping.

View on HN · Topics

It's likely hiding the model downgrade path they require to meet sustainable revenue. Should be interesting if they can enshittify slowly enough to avoid the ablative loss of customers! Good luck all VCs!

View on HN · Topics

Caveman hurt model performance. If you need a dumber model with less token output, just use sonnet-4-6 or other non-reasoning model.

View on HN · Topics

Too late, personally after how bad 4.6 was the past week I was pushed to codex, which seems to mostly work at the same level from day to day. Just last night I was trying to get 4.6 to lookup how to do some simple tensor parallel work, and the agent used 0 web fetches and just hallucinated 17K very wrong tokens. Then the main agent decided to pretend to implement tp, and just copied the entire model to each node...

View on HN · Topics

I havent been using my claude sub lately but I liked 4.6 three weeks ago. Did something change?

View on HN · Topics

That’s more a leadership decision because Anthropic are nerfing the model to cut costs, if they stop doing that then they’ll stay ahead.

View on HN · Topics

Proof they are nerfing the model? It is stable in benchmarks: https://marginlab.ai/trackers/claude-code-historical-perform...

All this just reads like just another case of mass psychosis to me

View on HN · Topics

Proof they don't nerf it only after testing that the benchmarks there stay the same? So overall performance degrades but they isolate those benchmarks?

View on HN · Topics

> It seems like 90% of Claude's recent problems are strictly lack of compute related.

Downtime is annoying, but the problem is that over the past 2-3 weeks Claude has been outrageously stupid when it does work. I have always been skeptical of everything produced - but now I have no faith whatsoever in anything that it produces. I'm not even sure if I will experiment with 4.7, unless there are glowing reviews.

Codex has had none of these problems. I still don't trust anything it produces, but it's not like everything it produces is completely and utterly useless.

View on HN · Topics

So many people confuse sycophantic behavior with producing results.

View on HN · Topics

What is your use case? I read comments like this and it's totally opposite of my experience, I have both CC Opus 4.6 and Codex 5.4 and Codex is much more thorough and checks before it starts making changes maybe even to a fault but I accept it because getting Opus to redo work because it messes up and jumps in the first attempt is a massive waste of time, all tasks and spec are atomic and granularly spec'd, I'd say 30% of the time I regret when I decide to use Opus for 'simpler' and work

View on HN · Topics

>> I switched to Codex and found it extremely inferior for my use case.

Yeah, 100% the case for me. I sometimes use it to do adversarial reviews on code that Opus wrote but the stuff it comes back with is total garbage more often than not. It just fabricates reasons as to why the code it's reviewing needs improvement.

View on HN · Topics

My tinfoil hat theory, which may not be that crazy, is that providers are sandbagging their models in the days leading up to a new release, so that the next model "feels" like a bigger improvement than it is.

An important aspect of AI is that it needs to be seen as moving forward all the time. Plateaus are the death of the hype cycle, and would tether people's expectations closer to reality.

View on HN · Topics

My purely unfounded, gut reaction to Opus 4.7 being released today was "Oh, that explains the recent 4.6 performance - they were spinning up inference on 4.7."

Of course, I have no information on how they manage the deployment of their models across their infra.

View on HN · Topics

What about your chat history? That has some value, at least for me. But what has even more value is stable releases.

View on HN · Topics

I don't have much quality drop from 4.6. But I also notice that I use codex more often these days than claude code

View on HN · Topics

It's been shockingly bad for me - for another example when asked to make a new python script building off an existing one; for some cursed reason the model choose to .read() the py files, use 100 of lines of regex to try to patch the changes in, and exec'd everything at the end...

View on HN · Topics

That's wild that you think 4.6 is bad..... Each model has its strengths and weaknesses I find that Codex is good for architectural design and Claude Is actually better the engineering and building

View on HN · Topics

For me, making it high effort just fixed all the quality problems, and even cut down on token use somehow

View on HN · Topics

Strange. Opus 4.6 has been great for me. On Max 20x

View on HN · Topics

Before opus released we also saw huge backlash with it being dumber.

Perhaps they need the compute for the training

View on HN · Topics

I've noticed the same over the last two weeks. Some days Claude will just entirely lose its marbles. I pay for Claude and Codex so I just end up needing to use codex those days and the difference is night and day.

View on HN · Topics

Same! I thought people were exaggerating how bad Claude has gotten until it deleted several files by accident yesterday

Codex isn’t as pretty in output but gets the job done much more consistently

View on HN · Topics

I've been raging pretty hard too. Thought either I'm getting cleverer by the day or Claude has been slipping and sliding toward the wrong side of the "smart idiot" equation pretty fast.

Have caught it flat-out skipping 50% of tasks and lying about it.

View on HN · Topics

How do you get codex to generate any code?

I describe the problem and codex runs in circles basically:

codex> I see the problem clearly. Let me create a plan so that I can implement it. The plan is X, Y, Z. Do you want me to implement this?

me> Yes please, looks good. Go ahead!

codex> Okay. Thank you for confirming. So I am going to implement X, Y, Z now. Shall I proceeed?

me> Yes, proceed.

codex> Okay. Implementing.

...codex is working... you see the internal monologue running in circles

codex> Here is what I am going to implement: X, Y, Z

me> Yes, you said that already. Go ahead!

codex> Working on it.

...codex in doing something...

codex> After examining the problem more, indeed, the steps should be X, Y, Z. Do you want me to implement them?

etc.

Very much every sessions ends up being like this. I was unable to get any useful code apart from boilerplate JS from it since 5.4

So instead I just use ChatGPT to create a plan and then ask Opus to code, but it's a hit and miss. Almost every time the prompt seems to be routed to cheaper model that is very dumb (but says Opus 4.6 when asked). I have to start new session many times until I get a good model.

View on HN · Topics

I've been using Claude and Codex in tandem ($100 CC, $20 Codex), and have made heavy use of claude-co-commands [0] to make them talk. Outside of the last 1-2 weeks (which we now have confirmation YET AGAIN that Claude shits the fucking bed in the run-up to a new model release), I usually will put Claude on max + /plan to gin up a fever dream to implement. When the plan is presented, I tell it to /co-validate with Codex, which tends to fill in many implementation gaps. Claude then codes the amended plan and commits, then I have a Codex skill that reviews the commit for gaps, missed edge cases, incorrect implementation, missed optimizations, etc, and fix them. This had been working quite well up until the beginning of the month, Claude more or less got CTE, and after a week of that I swapped to $100 Codex, $20 CC plans. Now I'm using co-validation a lot less and just driving primarily via Codex. When Claude works, it provides some good collaborative insights and counter-points, but Codex at the very least is consistently predictable (for text-oriented, data-oriented stuff -- I don't use either for designing or implementing frontend / UI / etc).

As always, YMMV!

[0] https://github.com/SnakeO/claude-co-commands

View on HN · Topics

Maybe stick with 4.6 until the bugs are worked out? Is this new filter retroactive?

View on HN · Topics

These threads are always full of superstitious nonsense. Had a bad week at the AIs? Someone at Anthropic must have nerfed the model!

The roulette wheel isn't rigged, sometimes you're just unlucky. Try another spin, maybe you'll do better. Or just write your own code.

View on HN · Topics

Start vibe-coding -> the model does wonders -> the codebase grows with low code quality -> the spaghetti code builds up to the point where the model stops working -> attempts to fix the codebase with AI actually make it worse -> complain online "model is nerfed"

View on HN · Topics

Part of me wonders if there's some subtle behavioral change with it too. Early on we're distrusting of a model and so we're blown away, we were giving it more details to compensate for assumed inability, but the model outperformed our expectations. Weeks later we're more aligned with its capabilities and so we become lazy. The model is very good, why do we have to put in as much work to provide specifics, specs, ACs, etc. So then of course the quality slides because we assumed it's capabilities somehow absolved the need for the same detailed guardrails (spec, ACs, etc) for the LLM.

This scenario obviously does not apply to folks who run their own benches with the same inputs between models. I'm just discussing a possible and unintentional human behavioral bias.

Even if this isn't the root cause, humans are really bad at perceiving reality. Like, really really bad. LLMs are also really difficult to objectively measure. I'm sure the coupling of these two facts play a part, possibly significant, in our perception of LLM quality over time.

View on HN · Topics

Still I don't previously remember Claude constantly trying to stop conversations or work, as in "something is too much to do", "that's enough for this session, let's leave rest to tomorrow", "goodbye", etc. It's almost impossible to get it do refactoring or anything like that, it's always "too massive", etc.

View on HN · Topics

Hasn't Opus 4.5 been famously consistent while 4.6 was floating all over the place?

View on HN · Topics

Working on some research projects to test Opus 4.7.

The first thing I notice is that it never dives straight into research after the first prompt. It insists on asking follow-up questions. "I'd love to dive into researching this for you. Before I start..." The questions are usually silly, like, "What's your angle on this analysis?" It asks some form of this question as the first follow-up every time.

The second observation is "Adaptive thinking" replaces "Extended thinking" that I had with Opus 4.6. I turned this off, but I wish I had some confidence that the model was working as hard as possible (I don't want it to mysteriously limit its thinking capabilities based on what it assumes requires less thought. I'd rather control the thinking level). I always ran research prompts with extended thinking enabled on Opus 4.6, and it gave me confidence that it was taking time to get the details right.

The third observation is it'll sit in a silent state of "Creating my research plan" for several minutes without starting to burn tokens. At first I thought this was because I had 2 tabs running a research prompt at the same time, but it later happened again when nothing else was running beside it. Perhaps this is due to high demand from several people trying to test the new model.

And fourth, the research output is significantly shorter and less detailed than Opus 4.6. Where before I would get several pages of research findings, now I get a short 2-3 pager.

Overall, I feel a bit confused. It doesn't seem better than 4.6, and from a research standpoint it might be worse. It seems like it got several different "features" that I'm supposed to learn now.

View on HN · Topics

I'm not sure how much I trust Anthropic recently.

This coming right after a noticeable downgrade just makes me think Opus 4.7 is going to be the same Opus i was experiencing a few months ago rather than actual performance boost.

Anthropic need to build back some trust and communicate throtelling/reasoning caps more clearly.

View on HN · Topics

Honestly, I personally would rather a time-out than the quality of my response noticably downgrading. I think what I found especially distrustful is the responses from employees claiming that no degredation has occured.

An honest response of "Our compute is busy, use X model?" would be far better than silent downgrading.

View on HN · Topics

What I want to know is why my bedrock-backed Claude gets dumber along with commercial users. Surely they're not touching the bedrock model itself. Only thing I can think of is that updates to the harness are the main cause of performance degradation.

View on HN · Topics

Usually they're hemorrhaging performance while training.

From that it's pretty likely they were training mythos for the last few weeks, and then distilling it to opus 4.7

Pure speculation of course, but would also explain the sudden performance gains for mythos - and why they're not releasing it to the general public (because it's the undistilled version which is too expensive to run)

View on HN · Topics

> This coming right after a noticeable downgrade just makes me think Opus 4.7 is going to be the same Opus i was experiencing a few months ago rather than actual performance boost.

If they are indeed doing this, I wonder how long they can keep it up?

View on HN · Topics

Quick everyone to your side projects. We have ~3 days of un-nerfed agentic coding again.

View on HN · Topics

we arent bots because we disagree with you. I switch between codex and opus, they have their differing strengths. As many people have mentioned, opus in the past few weeks has had less than stellar results. Generally I find opus would rather stub something and do it the faster way than to do a more complete job, although its much better at front end. I've had times where I've thrown the same problem at opus 4/5 times without success and codex gets it first shot. Just my experience.

View on HN · Topics

It is the new "You are absolutely right!"

View on HN · Topics

I think my results have actually become worse with Opus 4.7.

I have a pretty robust setup in place to ensure that Claude, with its degradations, ensures good quality. And even the lobotomized 4.6 from the last few days was doing better than 4.7 is doing right now at xhigh.

It's over-engineering. It is producing more code than it needs to. It is trying to be more defensible, but its definition of defensible seems to be shaky because it's landing up creating more edge cases. I think they just found a way to make it more expensive because I'm just gonna have to burn more tokens to keep it in check.

View on HN · Topics

I've noticed it getting dumber in certain situations , can't point to it directly as of now , but seems like its hallucinating a bit more .. and ditto on the Adaptive thinking being confusing

View on HN · Topics

The model card confirms the chain-of-thought supervision error from Mythos was present during Opus 4.7 training too, affecting 7.8% of episodes. That's not a one-time bug that got patched. It's a training pipeline issue that persisted across model generations. The long-context regression from 91.9% to 59.2% is also worth noting — they traded retrieval accuracy for coding benchmarks, which is a reasonable engineering choice, but the framing buries it.

View on HN · Topics

They don't have demand for the price it would require for inference.

They are definitely distilling it into a much smaller model and ~98% as good, like everybody does.

View on HN · Topics

Some people are speculating that Opus 4.7 is distilled from Mythos due to the new tokenizer (it means Opus 4.7 is a new base model, not just an improved Opus 4.6)

View on HN · Topics

The new tokenizer is interesting, but it definitely is possible to adapt a base model to a new tokenizer without too much additional training, especially if you're distilling from a model that uses the new tokenizer. (see, e.g., https://openreview.net/pdf?id=DxKP2E0xK2 ).

View on HN · Topics

Not impossible, but you have to be at least a little bit mad to deploy tokenizer replacement surgery at this scale.

They also changed the image encoder, so I'm thinking "new base model". Whatever base that was powering 4.5/4.6 didn't last long then.

View on HN · Topics

productivity going right back down again, ah well they weren't going to pay us more anyway

View on HN · Topics

From a quick tests, it seems to hallucinate a lot more than opus 4.6. I like to ask random knowledge questions like "What are the best chinese rpgs with a decent translations for someone who is not familiar with them? The classics one should not miss?" and 4.6 gave accurate answers, 4.7 hallucinated the name of games, gave wrong information on how to run them etc...

Seems common for any type of slightly obscure knowledge.

View on HN · Topics

Only in benchmarks. After couple of minutes of use it feels same dumb as nerfed 4.6

View on HN · Topics

Ask it to create an iOS app which natively runs Gemma via Litert-lm.

It’s incredibly trivial to find stuff outside their capabilities. In fact most stuff I want AI to do it just can’t, and the stuff it can isn’t interesting to me.

View on HN · Topics

This made me LOL. They keep trying to fleece us by nerfing functionality and then adding it back next release. It’s an abusive relationship at this point.

View on HN · Topics

I liked Opus 4.5 but hated 4.6. Every few weeks I tried 4.6 and, after a tirade against, I switched back to 4.5. They said 4.6 had a "bias towards action", which I think meant it just made stuff up if something was unclear, whereas 4.5 would ask for clarfication. I hope 4.7 is more of a collaborator like 4.5 was.

View on HN · Topics

I just subscribed this month again because I wanted to have some fun with my projects.

Tried out opus 4.6 a bit and it is really really bad. Why do people say it's so good? It cannot come up with any half-decent vhdl. No matter the prompt. I'm very disappointed. I was told it's a good model

View on HN · Topics

Because it was good until January 2026, then it detoriated into a opus-3.1. Probably given much less context windows or ram.

View on HN · Topics

I've seen a similar psychological phenomenon where people like something a lot, and then they get unreasonably angry and vocal about changes to that thing.

Usage limits are necessary but I guess people expect more subsidized inference than the company can afford. So they make very angry comments online.

For example, there is no evidence that 4.6 ever degraded in quality: https://marginlab.ai/trackers/claude-code-historical-perform...

View on HN · Topics

> Usage limits are necessary but I guess people expect more subsidized inference than the company can afford. So they make very angry comments online

This is reductive. You're both calling people unreasonably angry but then acknowledging there's a limit in compute that is a practical reality for Anthropic. This isn't that hard. They have two choices, rate limit, or silently degrade to save compute.

I have never hit a rate limit, but I have seen it get noticeably stupider. It doesn't make me angry, but comments like these are a bit annoying to read, because you are trying to make people sound delusional while, at the same time, confirming everything they're saying.

I don't think they have turned a big knob that makes it stupider for everyone. I think they can see when a user is overtapping their $20 plan and silently degrade them. Because there's no alert for that. Which is why AI benchmark sites are irrelevant.

View on HN · Topics

Doesn't matter. My vibes say it got bad in January 2026. Thus, they secretly nerfed Opus 4.6 in January 2026.

The fact that it didn't exist back then is completely and utterly irrelevant to my narrative.

View on HN · Topics

I recognize the sarcasm. The data I can find says it's performing at baseline however?

https://marginlab.ai/trackers/claude-code/

View on HN · Topics

Yeah, that's my point. Humans are not reliable LLM evaluators. "Secret model nerfs" happen in "vibes" far more often than they do in any reality.

View on HN · Topics

Huge regression for long contest tasks interestingly.

Mrcr benchmark went from 78% to 32%

View on HN · Topics

not a bot, voiced frustration is real here. I kind of depend on good LLMs now and wouldn't even mind if they had frozen the LLMs capabilities around dec 2025 forver and would hppily continue to pay, even more. but when suddenly the very same workload that was fine for months isn't possible anymore with the very same LLM out of nowhere and gets increasingly worse, its a huge disappointment. and having codex in parallel as a backup since ever I started also using it again with gpt 5.4 and it just rips without the diva sensitivity or overfitting into the latest prompt opus/sonnet is doing. GPT just does the job, maybe thinks a bit long, but even over several rounds of chat compression in the same chat for days stays well within the initial set of instructions and guardrails I spelled out, without me having to remind every time. just works, quietly, and gets there. Opus doesn't even get there anymore without nearly spelling out by hand manual steps or what not to do.

View on HN · Topics

Yeah, my personal anecdata is that Claude has just gotten better and better since January. I haven’t felt like even making the minor effort to compare with Codex’s current state. Just yesterday Claude Code made a major visible improvement in planning/executing — maybe it switched to 4.7 without me noticing? (Task: various internal Go services and Preact frontends.)

View on HN · Topics

Looks to me like a mob of humans, angry they've been deceived by ambiguous communications, product nerfing, surprisingly low usage limits, and an appallingly sycophantic overconfident coding agent

View on HN · Topics

Is Codex the new goto? Opus stopped being useful about 45-60 days ago.

View on HN · Topics

I haven’t noticed much difference compared to Jan/Feb. Maybe depends what you use it for

View on HN · Topics

What a joke Opus 4.7 at max is.

I gave it an agentic software project to critically review.

It claimed gemini-3.1-pro-preview is wrong model name, the current is 2.5. I said it's a claim not verified.

It offered to create a memory. I said it should have a better procedure, to avoid poisoning the process with unverified claims, since memories will most likely be ignored by it.

It agreed. It said it doesn't have another procedure, and it then discovered three more poisonous items in the critical review.

I said that this is a fabrication defect, it should not have been in production at all as a model.

It agreed, it said it can help but I would need to verify its work. I said it's footing me with the bill and the audit.

We amicably parted ways.

I would have accepted a caveman-style vocabulary but not a lobotomized model.

I'm looking forward to LobotoClaw. Not really.

View on HN · Topics

So this is the norm: quantized version of the SOTA model is previous model. Full model becomes latest model. Rinse and repeat.

View on HN · Topics

I hope this will fix up the poor quality that we're seeing on Claude Opus 4.6

But degrading a model right before a new release is not the way to go.

View on HN · Topics

I wish someone would elaborate on what they were doing and observed since Jan on opus 4.6. I’ve been using it with 1m context on max thinking since it was released - as a software engineer to write most of my code, code reviews + research and explain unfamiliar code - and haven’t notice a degradation. I’ve seen this mentioned a lot though.

I have seen that codex -latest highest effort - will find some important edge cases that opus 4.6 overlooked when I ask both of them to review my PRs.

View on HN · Topics

I don't use it for coding, but I do use it for real world tasks like general assistant.

I did notice multiple times context rot even in pretty short convos, it trying to overachie and do everything before even asking for my input and forgetting basic instructions (For example I have to "always default to military slang" in my prompt, and it's been forgetting it often, even though it worked fine before)

View on HN · Topics

As the author of the now (in)famous report in https://github.com/anthropics/claude-code/issues/42796 issue (sorry stella :) all I can say is... sigh. Reading through the changelog felt as if they codified every bad experiment they ran that hurt Opus 4.6. It makes it clear that the degradation was not accidental.

I'm still sad. I had a transformative 6 months with Opus and do not regret it, but I'm also glad that I didn't let hope keep me stuck for another few weeks: had I been waiting for a correction I'd be crushed by this.

Hypothesis: Mythos maintains the behavior of what Opus used to be with a few tricks only now restricted to the hands of a few who Anthropic deems worthy. Opus is now the consumer line. I'll still use Opus for some code reviews, but it does not seem like it'll ever go back to collaborator status by-design. :(

View on HN · Topics

I've always seen people complaining about model getting dumber just before the new one drops and always though this was confirmation bias. But today, several hours before the 4.7 release, opus 4.6 was acting like it was sonnet 2 or something from that era of models.

It didn't think at all, it was very verbose, extremely fast, and it was just... dumb.

So now I believe everyone who says models do get nerfed without any notification for whatever reasons Anthropic considers just.

So my question is: what is the actual reason Anthropic lobotomizes the model when the new one is about to be dropped?

View on HN · Topics

What's the point of baking the best and most impressive models in the world and then serving it with degraded quality a month after releases so that intelligence from them is never fully utilised??

View on HN · Topics

The benchmarks of Opus 4.6 they compare to MUST be retaken the day of the new model release. If it was nerfed we need to know how much.

View on HN · Topics

With the new tokenizer did they A/B test this one?

I'm curious if that might be responsible for some of the regressions in the last month. I've been getting feedback requests on almost every session lately, but wasn't sure if that was because of the large amount of negative feedback online.

View on HN · Topics

Is it just Opus 4.6 with throttling removed?

View on HN · Topics

It seems they nerf it, then release a new version with previous power. So they can do this forever without actually making another step function model release.

View on HN · Topics

while it seems even with 4.7 we will never see the quality of early 4.6 days, some dude is posting 'agi arrived!!!' on instagram and linkedIn.

View on HN · Topics

Tried it, after about 10 messages, Opus 4.7 ceased to be able to recall conversation beyond the initial 10 messages. Super weird.

View on HN · Topics

Pretty bad. As nerfed 4.6

View on HN · Topics

four prompts with opus 4.6 today is equivalent to 30 or 40 two months ago. infernal downgrade in my case.

View on HN · Topics

Will it be like the usual: let it work great for 2 weeks, nerf it after?

View on HN · Topics

Regardless of the model quality improvement, the corporate damage was done by not only ignoring the Opus quality degradation but gaslighting users into thinking they aren’t using it right.

I switched to Codex 5.4 xhigh fast and found it to be as good as the old Claude. So I’ll keep using that as my daily driver and only assess 4.7 on my personal projects when I have time.

View on HN · Topics

Might be sticking with 4.6 it's only been 20 minutes of using 4.7 and there are annoyances I didn't face with 4.6 what the heck. Huge downgrade on MRCR too....

256K:

- Opus 4.6: 91.9%
- Opus 4.7: 59.2%

1M:

- Opus 4.6: 78.3%
- Opus 4.7: 32.2%

View on HN · Topics

Here’s the problem. The distribution of query difficulty / task complexity is probably heavily right-skewed which drives up the average cost dramatically. The logical thing for anthropic to do, in order to keep costs under control, is to throttle high-cost queries. Claude can only approximate the true token cost of a given query prior to execution. That means anything near the top percentile will need to get throttled as well.

By definition this means that you’re going to get subpar results for difficult queries. Anything too complicated will get a lightweight model response to save on capacity. Or an outright refusal which is also becoming more common.

New models are meaningless in this context because by definition the most impressive examples from the marketing material will not be consistently reproducible by users. The more users who try to get these fantastically complex outputs the more those outputs get throttled.

View on HN · Topics

even sonnet right now has degraded for me to the point of like ChatGPT 3.5 back then. took ~5 hours on getting a playwright e2e test fixed that waited on a wrong css selector. literlly, dumb as fuck. and it had been better than opus for the last week or so still... did roughly comparable work for the last 2 weeks and it all went increasingly worse - taking more and more thinking tokens circling around nonsense and just not doing 1 line changes that a junior dev would see on the spot. Too used to vibing now to do it by hand (yeah i know) so I kept watching and meanwhile discovered that codex just fleshed out a nontrivial app with correct financial data flows in the same time without any fuzz. I really don't get why antrhopic is dropping their edge so hard now recently, in my head they might aim for increasing hype leading to the IPO, not disappointment crashes from their power user base.

View on HN · Topics

You use Claude Code? Then harness changes will have had much more impact than any model "stealth nerfing".

View on HN · Topics

Reminder that 4.7 may seem like a huge upgrade to 4.6 because they nerfed the F out of 4.6 ahead of this launch so 4.7 would seem like a remarkable improvement...

View on HN · Topics

New model - that explains why for the past week/two weeks I had this feeling of 4.6 being much less "intelligent". I hope this is only some kind of paranoia and we (and investors) are not being played by the big corp. /s

View on HN · Topics

I don't get it. Why would they make the previous model worse before releasing an update?

View on HN · Topics

Why do stores increase prices before a sale?

View on HN · Topics

Ok, so the answer is "they make the existing model worse to make it seem that the new model is good". I'm almost certain that this is not what's going on. It's hard to make the argument that the benefits outweigh the drawbacks of such approach. It doesn't give the more market share or revenue.

View on HN · Topics

Tbf I don't think that it's just this one reason. While I'm not a subscriber to any LLM provider, the general feeling I get from reading comments online is that the models have a long history of getting worse over time. Of course, we don't know why, but presumably they're quantizing models or downgrading you to a weaker model transparently.

Now as for why, I imagine that it's just money. Anthropic presumably just got done training Mythos and Opus 4.7. that must have cost a lot of cash. They have a lot of subscribers and users, but not enough hardware.

What's a little further tweaking of the model when you've already had to dumb it down due to constraints.

View on HN · Topics

TL;DR; iPhone is getting better every year

The surprise: agentic search is significantly weaker somehow hmm...

View on HN · Topics

TL;DR; iPhone is getting better every year

The surprise: agentic search is significantly weaker somehow hmm...

Summarizer