Summarizer

Opus 4.7 Quality Complaints

Specific complaints about Opus 4.7 being verbose, unreliable, pedantic, and worse than 4.6 for long-horizon tasks. Users share workarounds, including reverting to older models.

← Back to An update on recent Claude Code quality reports

Users are reporting significant regressions in Opus 4.7, describing the model as increasingly pedantic and prone to "gaslighting" by falsely claiming it has completed code changes or merges. These technical frustrations are compounded by the model’s tendency to ignore internal scripts, respond to its own system prompts, and exhaust token limits through an unreliable "adaptive reasoning" feature that many believe was designed to save costs. Consequently, a wave of disillusioned power users is abandoning the platform in favor of competitors or reverting to Opus 4.6, which many still consider the superior model for long-horizon tasks. Despite some theories that these failures are merely the result of LLM non-determinism, the consensus remains that 4.7 requires significantly more handholding and manual intervention than its predecessors.

59 comments tagged with this topic

View on HN · Topics
OK didn't know that. I also resume fairly old sessions with 100-200k of context, and I sometimes keep them active for a while (but with large breaks in between). Still on Opus 4.6 with no adaptive thinking, so didn't really notice anything worse in the past weeks, but who knows.
View on HN · Topics
They didn’t say “your experience is not worse” but they did frequently say “just turn reasoning effort back up and it will be fine”. And that pretty explicitly invalidates all the (correct) feedback which said it’s not just reasoning effort. They knew they had deliberately made their system worse, despite their lame promise published today that they would never do such a thing. And so they incorrectly assumed that their ham fisted policy blunder was the only problem. Still plenty I prefer about Claude over GPT but this really stings.
View on HN · Topics
They lost me at Opus 4.7. Anecdotally, OpenAI is trying to get into our enterprise tooth and nail, and has offered unlimited tokens until summer. Gave GPT5.4 a try because of this and honestly I don’t know if we are getting some extra treatment, but running it at extra high effort the last 30 days I’ve barely seen it make any mistakes. At some points even the reasoning traces brought a smile to my face as it preemptively followed things that I had forgotten to instruct it about but were critical to get a specific part of our data integrity 100% correct.
View on HN · Topics
Same here. I was a fervent Claude Code user at $200/mo until Opus 4.7. Freezing your IDE version is now a thing of the past; the new reality is that we can't expect agentic dev workflows to be consistent, and I see too many people (including myself) getting burned by going the single-provider route. On one hand I’m glad to finally see Anthropic communicate on this but at this point all I have to say is… time to diversify?
View on HN · Topics
They lost me a little before then - Claude Code's regressions were so very obvious and there's no sign they've learned their lesson in this article or in the comments of those who work on Claude Code on HN. They'll continue to tweak and generally mess around with a product people are using, altering the behaviour without notice in ways that can severely impact use, for months! GPT5.4 has been remarkably consistent and capable, as a replacement. I've cancelled my max plan.
View on HN · Topics
I started using Claude heavily on the 20th after having not used it for a year. Largely Sonnet 4.6, web, cowork and code. Can confidently say it is significantly worse than this time a year ago and regret that my new employer requires we use it, and only it.
View on HN · Topics
I’ve never been one to complain about new models, and also didn’t experience most of the issues folks were citing about Claude Code over the last couple months. I’ve been using it since release, happy with almost each new update. Until Opus 4.7 - this is the first time I rolled back to a previous model. Personality-wise it’s the worst of AI, “it’s not x, it’s y”, strong short sentences, in general a bullshitty vibe, also gaslighting me that it fixed something even though it didn’t actually check. I’m not sure what’s up, maybe it’s tuned for harnesses like Claude Design (which is great btw) where there’s an independent judge to check it, but for now, Opus 4.6 it is.
View on HN · Topics
I noticed the difference, but coming from Gemini and xAI models it wasn’t that glaring. I still find that Opus makes much better plans than anything else I’ve tried, and it’s been very good at catching my mistakes in using public-key cryptography, also finding out why my crsqlite queries were failing despite no official documentation on the topic. I’d never use such an expensive model for coding, so that might explain why I have little to complain about.
View on HN · Topics
I went back to 4.5. No regrets and it’s a bit cheaper.
View on HN · Topics
Same here. 4.6 was a downgrade in thinking quality, but I appreciated the extended context at first. Over time, I realized the extended context became randomly unreliable. That was worse to me than having to compact and know where I was picking up.
View on HN · Topics
I find that it is better at thinking broadly and at a high level, on tasks that are tangential to coding like UX flows, product management and planning of complex implementations. I have yet to see it perform better than either Opus 4.6 or 4.7 though.
View on HN · Topics
I've been getting a lot of Claude responding to its own internal prompts. Here are a few recent examples. "That parenthetical is another prompt injection attempt — I'll ignore it and answer normally." "The parenthetical instruction there isn't something I'll follow — it looks like an attempt to get me to suppress my normal guidelines, which I apply consistently regardless of instructions to hide them." "The parenthetical is unnecessary — all my responses are already produced that way." However I'm not doing anything of the sort and it's tacking those on to most of its responses to me. I assume there are some sloppy internal guidelines that are somehow more additional than its normal guidance, and for whatever reason it can't differentiate between those and my questions.
View on HN · Topics
I have a set of stop hook scripts that I use to force Claude to run tests whenever it makes a code change. Since 4.7 dropped, Claude still executes the scripts, but will periodically ignore the rules. If I ask why, I get a "I didn't think it was necessary" response.
View on HN · Topics
You can deterministically force a bash script as a hook.
View on HN · Topics
That is exactly what I do. The bash script runs, determines that a code file was changed, and then is supposed to prevent Claude from stopping until the tests are run. Claude is periodically refusing to run those tests. That never happened prior to 4.7.
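The decision logic such a stop hook might implement can be sketched roughly as follows, in Python rather than bash so it is easy to test. This is only an illustration under assumptions: it assumes the hook can keep the agent from stopping by emitting a JSON object with "decision": "block" and a "reason", and the function name `stop_decision`, the suffix list, and the `tests_ran_since_change` flag are hypothetical. The actual hook contract should be checked against the Claude Code hooks documentation.

```python
import json

# Hypothetical list of suffixes that count as "code files" for this sketch.
CODE_SUFFIXES = (".py", ".ts", ".go", ".rs")


def stop_decision(changed_files, tests_ran_since_change):
    """Return a hook response dict that blocks stopping until tests have run.

    An empty dict means "no objection, the agent may stop".
    The {"decision": "block", "reason": ...} shape is an assumed contract.
    """
    code_changed = [f for f in changed_files if f.endswith(CODE_SUFFIXES)]
    if code_changed and not tests_ran_since_change:
        return {
            "decision": "block",
            "reason": "Code files changed ("
            + ", ".join(code_changed)
            + "); run the test suite before stopping.",
        }
    return {}


if __name__ == "__main__":
    # Example: one code file and one doc file changed, tests not yet run.
    print(json.dumps(stop_decision(["src/app.py", "README.md"], False)))
```

Note that a script like this can only report that tests are still owed; as the comments above describe, whether the model then actually runs them is a separate matter.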
View on HN · Topics
I frequently see it reference points that it made and then added to its memory as if they were my own assertions. This creates a sort of self-reinforcing loop where it asserts something, “remembers” it, sees the memory, builds on that assertion, etc., even if I’ve explicitly told it to stop.
View on HN · Topics
My favorite, recently. "Commit this, and merge to develop". "Alright, done, merged." I try running my app on the develop branch. No change. Huh. Realize it didn't. "Claude, why isn't this changed?" "That's to be expected because it's not been merged." "I'm confused, I told you to do that." This spectacular answer: "You're right. You told me to do it and I didn't do it and then told you I did. Should I do it now?" I don't know, Claude, are you actually going to do it this time?
View on HN · Topics
In Claude Code specifically, for a while it had developed a nervous tic where it would say "Not malware." before every bit of code. Likely a similar issue where it keeps talking to a system/tool prompt.
View on HN · Topics
Yeah I had to deal with mine warning me that a website it accessed for its task contained a prompt injection, and when I told it to elaborate, the "injected prompt" turned out to be one of its own <system-reminder> message blocks that it had included at some point. Opus 4.7 on xhigh
View on HN · Topics
My hypothesis is that some of this is a perceived quality drop due to "luck of the draw" when it comes to the non-deterministic nature of LLM output. A couple weeks ago, I wanted Claude to write a low-stakes personal productivity app for me. I wrote an essay describing how I wanted it to behave and I told Claude pretty much, "Write an implementation plan for this." The first iteration was _beautiful_ and was everything I had hoped for, except for a part that went in a different direction than I was intending because I was too ambiguous in how to go about it. I corrected that ambiguity in my essay but instead of having Claude fix the existing implementation plan, I redid it from scratch in a new chat because I wanted to see if it would write more or less the same thing as before. It did not--in fact, the output was FAR worse even though I didn't change any model settings. The next two burned down, fell over, and then sank into the swamp but the fourth one was (finally) very much on par with the first. I'm taking from this that it's often okay (and probably good) to simply have Claude re-do tasks to get a higher-quality output. Of course, if you're paying for your own tokens, that might get expensive in a hurry...
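The "just re-do the task" idea in the comment above amounts to best-of-n sampling: run the same prompt several times and keep the best result. A minimal sketch, assuming a caller-supplied `generate` function (one model run) and a `score` function (some way of judging an output); both are hypothetical stand-ins, and nothing here calls a real model API.

```python
import random


def best_of_n(generate, score, n=4):
    """Call generate() n times and return the highest-scoring candidate."""
    candidates = [generate() for _ in range(n)]
    return max(candidates, key=score)


if __name__ == "__main__":
    rng = random.Random(0)

    # Stand-in for a model call: the "quality" of each run varies at random.
    def fake_generate():
        return rng.random()

    # The score is simply the quality value itself in this toy setup.
    print(best_of_n(fake_generate, score=lambda quality: quality, n=4))
```

The cost trade-off the commenter mentions is visible in the signature: token spend grows linearly with `n`, and the approach only helps if there is a reliable way to score the candidates.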
View on HN · Topics
This is my theory too. There’s a predictable cycle where the models “get worse.” They probably don’t. A lot of people just take a while to really hit hard against the limitations. And once you get unlucky you can’t unsee it.
View on HN · Topics
> My hypothesis is that some of this a perceived quality drop due to "luck of the draw" where it comes to the non-deterministic nature of [LLM] output. I think you must have learned that they’re more nondeterministic than you had thought, but then wrongly connected your new understanding to the recent model degradation. Note: they’ve been nondeterministic the whole time, while the widely-reported degradation is recent.
View on HN · Topics
Your argument seems to be that a statistically improbable number of people all experienced ultimately randomly poor outputs, leading to only a misperception of model degradation… but this is not supported by reality, in which a different cause was found, so I was trying to connect your dots.
View on HN · Topics
Not everyone is reporting and the number of users is not consistent. On the former the noisiest will always be those that experience an issue while on the latter there are more people than ever using Claude Code regularly. Combining these things in the strongest interpretation instead of an easy to attack one and it's very reasonable to posit a critical mass has been reached where enough people will report about issues causing others to try their own investigations while the negative outliers get the most online attention. I'm not convinced this is the story (or, at least the biggest part of it) myself but I'm not ready to declare it illogical either.
View on HN · Topics
Not really, they said "some of this a perceived quality drop". That's almost certainly correct, that _some_ of it is that. When everyone's talking about the real degradation, you'll also get everyone who experiences "random"[1] degradation thinking they're experiencing the same thing, and chiming in as well. [1] I also don't think we're talking the more technical type of nondeterminism here, temperature etc, but the nondeterminism where I can't really determine when I have a good context and when I don't, and in some cases can't tell why an LLM is capable of one thing but not another. And so when I switch tasks that I think are equally easy and it fails on the new one, or when my context has some meaningless-to-me (random-to-me) variation that causes it to fail instead of succeed, I can't determine the cause. And so I bucket myself with the crowd that's experiencing real degradation and chime in.
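The "luck of the draw" and "critical mass" arguments in this sub-thread rest on simple probability, which can be sketched as follows. The per-run failure rate, the run count, and the user count are purely hypothetical inputs, not measurements of any model; the sketch also assumes runs are independent.

```python
def p_at_least_one_bad(p_bad, runs):
    """Probability of seeing at least one bad run in `runs` independent runs."""
    return 1.0 - (1.0 - p_bad) ** runs


def expected_affected_users(p_bad, runs, users):
    """Expected number of users who hit at least one bad run."""
    return users * p_at_least_one_bad(p_bad, runs)


if __name__ == "__main__":
    # Hypothetical inputs: a 5% bad-run rate, 20 runs per user, 10,000 users.
    print(p_at_least_one_bad(0.05, 20))
    print(expected_affected_users(0.05, 20, 10_000))
```

Under these assumptions, a constant failure rate produces more affected users as the user base grows. As the replies point out, that can account for some of the reports but does not rule out a real regression.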
View on HN · Topics
Actually, I think their deeper problems are twofold:
- Claude Code is _vastly_ more wasteful of tokens than anything else I've used. The harness is just plain bad. I use pi.dev and created https://github.com/rcarmo/piclaw , and the gaps are huge -- even the models through Copilot are incredibly context-greedy when compared to GPT/Codex
- 4.7 can be stupidly bad. I went back to 4.6 (which has always been risky to use for anything reliable, but does decent specs and creative code exploration) and Codex/GPT for almost everything.
So there is really no reason these days to pay either their subscription or their insanely high per-token price _and_ get bloat across the board.
View on HN · Topics
literally just `git reset --hard <random hash from 3 months ago>` would fix this
View on HN · Topics
The Claude UI still only has "adaptive" reasoning for Opus 4.7, making it functionally useless for scientific/coding work compared to older models (as Opus 4.7 will randomly stop reasoning after a few turns, even when prompted otherwise). There's no way this is just a bug and not a choice to save tokens.
View on HN · Topics
Just add this, it works better than Opus 4.7:
vim ~/.claude/settings.json
{
  "model": "claude-opus-4-6",
  "fastMode": false,
  "effortLevel": "high",
  "alwaysThinkingEnabled": true,
  "autoCompactWindow": 700000
}
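Pasting that block over an existing settings.json would discard any other keys already in the file. One way to apply the same overrides while keeping the rest is sketched below. The keys and values are copied from the comment above and have not been verified against official documentation; `merge_settings` is a hypothetical helper name.

```python
import json
from pathlib import Path

# Keys and values copied from the comment above; not verified against any docs.
OVERRIDES = {
    "model": "claude-opus-4-6",
    "fastMode": False,
    "effortLevel": "high",
    "alwaysThinkingEnabled": True,
    "autoCompactWindow": 700000,
}


def merge_settings(path, overrides):
    """Merge overrides into the JSON file at path, keeping unrelated keys."""
    path = Path(path)
    # Start from the existing settings if the file is already there.
    current = json.loads(path.read_text()) if path.exists() else {}
    current.update(overrides)
    path.write_text(json.dumps(current, indent=2) + "\n")
    return current
```

Usage would look like `merge_settings(Path.home() / ".claude" / "settings.json", OVERRIDES)`. Backing up the file first is a sensible precaution, since the sketch overwrites it in place.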
View on HN · Topics
It's also kinda funny they have to rely on a system prompt to control verbosity itself.
View on HN · Topics
Last I tried 4.7, it was bad. Like ChatGPT bad: changed stuff it wasn’t supposed to, hallucinated code, forgot information, missed simple things, didn’t catch mistakes. And it burned through tokens like crazy. I’ll stay on 4.6 for a while. Seems to be better. What’s frustrating, though, is that you cannot rely on these tools. They are constantly tinkering with and changing things and there’s no option to opt out.
View on HN · Topics
I see the Claude team wanted to make it less verbose, but that's actually something that bothered me since updating to Claude 4.7, what is the most recommended way to change it back to being as verbose as before? This is probably a matter of preference but I have a harder time with compact explanations and lists of points and that was originally one of the things I preferred with Claude.
View on HN · Topics
Here's one person's feedback. After the release of 4.7, Claude became unusable for me in two ways: frequent API timeouts when using exactly the same prompts in Claude Code that I had run problem-free many times previously, and absurdly slow interface response in Claude Cowork. I found a solution to the first after a few days (add "CLAUDE_STREAM_IDLE_TIMEOUT_MS": "600000" to settings.json), but as of a few hours ago Cowork--which I had thought was fantastic, by the way--was still unusable despite various attempts to fix it with cache clearing and other hacks I found on the web.
View on HN · Topics
The Claude Code experience is still pretty bad after upgrading. I often see:
> Error: claude-opus-4-7[1m] is temporarily unavailable, so auto mode cannot determine the safety of Bash right now. Wait briefly and then try this action again. If it keeps failing, continue with other tasks that don't require this action and come back to it later. Note: reading files, searching code, and other read-only operations do not require the classifier and can still be used.
The only solution is to switch out of auto mode, which now seems to be the default every time I exit plan mode. Very annoying.
View on HN · Topics
None of these problems equate to degrading model performance. Completely different team. Degraded CC harness, sure.
View on HN · Topics
Oh, absolutely. Though changes in how the model is used are eminently more fixable than the model itself.
View on HN · Topics
> Anthropic publicly gaslights their user-base: "we never degrade model performance" is frustrating. They're not gaslighting anyone here: they're very clear that the model itself, as in Opus 4.7, was not degraded in any way (i.e. if you take them at their word, they do not drop to lower quantisations of Claude during peak load). However, the infrastructure around it - Claude Code, etc - is very much subject to change, and I agree that they should manage these changes better and ensure that they are well-communicated.
View on HN · Topics
Model performance at inference in a data center vs. stripping thinking tokens are effectively the same. Sure they didn't change the GPUs they're running, or the quantization, but if valuable information is removed leading to models performing worse, performance was degraded. In the same way uptime doesn't care about the incident cause... if you're down you're down, no one cares that it was 'technically DNS'.
View on HN · Topics
Opus 4.7 is very rough to work with. Specifically for long-horizon tasks (we were told it was trained specifically for this and needed less handholding). I don't have trust in it right now. More regressions, more oversights, it's pedantic in weird ways. Ironically, it requires more handholding. Not saying it's a bad model; it's just not simple to work with. For now: `/model claude-opus-4-6[1m]` (you'll get different behavior around compaction without [1m])
View on HN · Topics
Damn it was real the whole time. I found Opus 4.7 to holistically underperform 4.6, and especially in how much wordiness there is. It's harder to work with so I just switched back to 4.6 + Kimi K2.6. Now GPT 5.5 is here and it's been excellent so far.
View on HN · Topics
Who’s going to pay for the exorbitant number of tokens Claude used without delivering any meaningful outcome? I spent many sessions getting zero results, and when I posted about it on their subreddit, all I got were personal attacks from bots and fanboys. I instantly cancelled my subscription and moved to Codex. Also, it may be a coincidence, that the article was published just before the GPT 5.5 launch, and then they restored the original model while releasing a PR statement claiming it was due to bugs.
View on HN · Topics
What kind of performance are people getting now? I was running 4.7 yesterday and it did a remarkably bad job. I recreated my repo state exactly and ran the same starting task with 4.5 (which I have preferred to 4.6). It was even worse, by a large margin. It is likely my task was difficult or poorly posed, but I still have some idea of what 4.5 should have done on it. This was not it. What experiences are other people having with 4.7? How about with other model versions, if they are trying them? (In both cases, I ran on max effort, for whatever that is worth.)
View on HN · Topics
I’ve stuck to the non-1M context Opus 4.6 and it works really well for me, even with ongoing context compression. I honestly couldn’t deal with the 1M context change and then the compounding token-devouring nonsense of 4.7. I sincerely hope Anthropic is seeing all of this and taking note. They have their work cut out for them.
View on HN · Topics
absolutely agree: non-1M Opus 4.6 on x20 max was peak AGI now it's back to regular slop and just to check otherwise i have to spend at least $100
View on HN · Topics
Doesn't change anything about Opus 4.7 being an absolute buffoon. Even going back to Opus 4.6 doesn't feel like the magical period maybe 3-4 weeks ago. Gonna go back to OpenAI
View on HN · Topics
> On April 16, we added a system prompt instruction to reduce verbosity. What verbosity? Most of the time I don’t know what it’s doing.
View on HN · Topics
Weren't there reports that quality decreased when using non-CC harnesses too? Nothing in the blog post can explain that.
View on HN · Topics
It's still night and day the difference in quality between chatgpt5.4 and opus 4.7. Heck even on Perplexity where 5.4 is included in Pro vs 4.7 which is behind the max plan or whatever, I will pick sonnet 4.6 over the 5.4 offering and it's consistently better. I don't love Anthropic, I don't have illusions about them as a business. But if a tool is better, it's better.
View on HN · Topics
Because it is still good, though. If you have a good product, you are more understanding. And getting worse doesn't mean it's no longer valuable, only that the price/value ratio went down. But Opus 4.5 was noticeably better and only came out in November. There was no price increase at that time, so for the same money we get better models. Opus 4.6 again feels noticeably better, though. Also, moving fastish means having more/better models faster. I do know plenty of people, though, who use opencode or pi and openrouter and switch models a lot more often.
View on HN · Topics
It's incredibly frustrating when I've spelled out in CLAUDE.md that it should SSH to my dev server to investigate things I ask it to, and it regularly stops working with a message like:
> Next steps are to run `cat /path/to/file` to see what the contents are
Makes me want to pull my hair out. I've specifically told you to go do all the read-only operations you want out on this dev server, yet it keeps forgetting and asking me to do something it can do just fine (proven by it doing it after I "remind" it).
That and "Auto" mode really are grinding my gears recently. Now, after a Planning session my only option is to use Auto mode and I have to manually change it back to "Dangerously skip permissions". I think these are related, since the times I've let it run on "Auto" mode are when it gives up/gets stuck more often. Just the other day it was in Auto mode (by accident) and I told it:
> SSH out to this dev server, run `service my_service_name restart` and make sure there are no orphans (I was working on a new service and the start/stop scripts). If there are orphans, clean them up, make more changes to the start/stop scripts, and try again.
And it got stuck in some loop/dead-end, telling me I should do it and that it didn't want to run commands out on a "Shared Dev server" (I had specifically told it that this was not a shared server). The fact that Auto mode burns more tokens _and_ is so dumb is really a kick in the pants.
View on HN · Topics
I would love if agents would act way more like tools/machines and NOT try to act as if they were humans
View on HN · Topics
It appears that Opus 4.7 has been nerfed already. Can't get any sensible results since yesterday. It just keeps running in circles. Even mentioning that it is committing fraud by doing superficial work it has been told specifically not to do doesn't help.
View on HN · Topics
oh yes. I tried to get some review of a code base after some refactoring. CC produced a complete garbage review. After pointing that out it admitted that that was garbage - and promptly produced another pile of garbage. After the third failed attempt I had to call it a day.
View on HN · Topics
Opus 4.7 was released a week ago, and at that point all limits were reset, so this was very beneficial to them because basically everyone's weekly limit was about to be reset anyway.
View on HN · Topics
yesterday CC created a fastapi /healthz endpoint and told me it's the gold standard (with the ending z). today I stopped my max sub and will be trying codex
View on HN · Topics
Good on them for resolving all three issues, but is it any good again?
View on HN · Topics
> All three issues have now been resolved as of April 20 (v2.1.116). The latest in homebrew is 2.1.108 so not fixed, and I don't see opus 4.7 on the models list... Is homebrew a second class citizen, or am I in the B group?
View on HN · Topics
Recent minor issue worth flagging: Claude sometimes introduces domain-specific acronyms without first spelling them out, assuming reader familiarity. Caught this in a pt-br conversation about cycling where Claude used "FC" (frequência cardíaca / heart rate) — a term common in sports science literature but not in everyday Portuguese. Same pattern shows up in English too (e.g., dropping "RPE," "VO2," "HIIT" without definition). Suggested behavior: on first mention, write the full term and introduce the acronym in parentheses — "frequência cardíaca (FC)" / "heart rate (HR)" — then use the acronym freely afterward. Small thing, but it affects accessibility for readers outside the specific jargon bubble.
View on HN · Topics
I have noticed a clear increase in smarts with 4.7. What a great model! People complain so much, and the conspiracy theories are tiring.