Summarizer

Enterprise Trust Issues

Concerns about using Claude for production code when behavior changes unpredictably, impact on building workflows and pipelines, and comparison to other enterprise software stability expectations


Developers are increasingly disillusioned by Anthropic’s silent model regressions and cost-saving optimizations, with many accusing the company of "gaslighting" users who noticed significant performance drops in their production workflows. This perceived instability has sparked a debate over whether Anthropic has fallen into a "complexity trap," prioritizing rapid feature development over the core reliability and transparency expected of expensive enterprise software. While some defenders view these hiccups as the inevitable growing pains of bleeding-edge technology, others are actively diversifying their toolsets, arguing that ethical alignment cannot compensate for unpredictable "rug-pulls" on performance. Ultimately, the community is demanding better version control and honesty, asserting that it is impossible to build stable engineering pipelines on a foundation that shifts randomly beneath them.

48 comments tagged with this topic

View on HN · Topics
> I was never under the impression that gaps in conversations would increase costs nor reduce quality. Both are surprising and disappointing. You didn't do your due diligence on an expensive API. A naïve implementation of an LLM chat is going to have O(N^2) costs from prompting with the entire context every time. Caching is needed to bring that down to O(N), but the cache itself takes resources, so evictions have to happen eventually.
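To make the scaling argument in the comment above concrete, here is a minimal sketch under a deliberately simplified cost model: every turn adds the same number of tokens, cached history is treated as free to reuse, and a cache eviction forces one full re-read of the history on that turn. The function names and numbers are illustrative assumptions, not Anthropic's actual pricing or cache behavior.

```python
def uncached_prompt_tokens(turns: int, tokens_per_turn: int) -> int:
    """Tokens processed when every turn re-sends the whole history: O(N^2)."""
    return sum(turn * tokens_per_turn for turn in range(1, turns + 1))


def cached_prompt_tokens(turns: int, tokens_per_turn: int) -> int:
    """Tokens freshly processed when prior history is served from a cache: O(N)."""
    return turns * tokens_per_turn


def cached_with_evictions(turns: int, tokens_per_turn: int, evicted_turns: set) -> int:
    """Like the cached case, but an evicted turn must re-read the full history once.

    `evicted_turns` is a hypothetical set of turn numbers (1-based) at which the
    cache entry had expired, e.g. after a long gap in the conversation.
    """
    total = 0
    for turn in range(1, turns + 1):
        if turn in evicted_turns:
            total += turn * tokens_per_turn  # cache miss: whole history again
        else:
            total += tokens_per_turn  # cache hit: only the new tokens
    return total


if __name__ == "__main__":
    print(uncached_prompt_tokens(100, 500))
    print(cached_prompt_tokens(100, 500))
    print(cached_with_evictions(100, 500, {50}))
```

Under these assumptions, a 100-turn conversation at 500 tokens per turn processes 2,525,000 prompt tokens uncached versus 50,000 cached, and a single eviction at turn 50 adds 24,500 tokens on top of the cached figure. Real caches also charge for reads and writes, so treat this only as an illustration of the growth rates.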
View on HN · Topics
> Claude Code abstracts the API, so it should abstract this behavior as well, or educate the user about it. Does mmap(2) educate the developer on how disk I/O works? At some point you have to know something about the technology you're using, or accept that you're a consumer of the ever-shifting general best practice, shifting with it as the best practice shifts.
View on HN · Topics
I've spent my entire working career dealing with companies that do the opposite. The product still goes stale. Find a better excuse. You're acquiring users as a recurring revenue source. Consider stability and transparency of implementation details cost of doing business, or hemorrhage users as a result.
View on HN · Topics
I don't envy you Boris. Getting flak from all sorts of places can't be easy. But thanks for keeping a direct line with us. I wish Anthropic's leadership would understand that the dev community is such a vital community that they should appreciate a bit more (i.e. it's not nice sending lawyers after various devs without asking nicely first, banning accounts without notice, etc.). Appreciate it's not easy to scale. OpenAI seems to be doing a much better job when it comes to developer relations, but I would like to see you guys 'win' since Anthropic shows more integrity and has clear ethical red lines they are not willing to cross unlike OpenAI's leadership.
View on HN · Topics
I agree with this. I'm writing this message even though I don't have much to add because it's often the case on HN that criticism is vocal and appreciation is silent and I'd like to balance out the sentiment. Anthropic has fumbled on many fronts lately but engaging honestly like this is the right thing to do. I trust you'll get back on track.
View on HN · Topics
We should encourage minimal dependency on multibillion-dollar tech companies like Anthropic. They, and similar companies, are just milking us… but since their toys are so shiny, we don’t care.
View on HN · Topics
I'm aiming for intellectual honesty here. I'm not taking a side for a person or an org, but I'm taking a stand for a quality bar. > They knew they had deliberately made their system worse Define "they". The teams that made particular changes? In real-world organizations, not all relevant information flows to all the right places at the right time. Mistakes happen because these are complex systems. Define "worse". There are a lot of factors involved. With a given amount of capacity at a given time, some aspect of "quality" has to give. So "quality" is a judgment call. It is easy to use a non-charitable definition to "gotcha" someone. (Some concepts are inherently indefensible. Sometimes you just can't win. "Quality" is one of those things. As soon as I define quality one way, you can attack me by defining it another way. A particular version of this principle is explained in The Alignment Problem by Brian Christian, by the way, regarding predictive policing iirc.) I'm seeing a lot of moral outrage but not enough intellectual curiosity. It is embarrassingly easy to say "they should have done better" ... ok. Until someone demonstrates to me they understand the complexity of a nearly-billion dollar company rapidly scaling with new technology, growing faster than most people comprehend, I think ... they are just complaining and cooking up reasons so they are right in feeling that way. This possible truth, that complex systems are hard to do well, apparently doesn't scratch that itch for many people. So they reach for blame. This is not the way to learn. Blaming tends to cut off curiosity. I suggest this instead: redirect if you can to "what makes these things so complicated?" and go learn about that. You'll be happier, smarter, and ... most importantly ... be building a habit that will serve you well in life. Take it from an old guy who is late to the game on this. I've bailed on companies because "I thought I knew better". :/
View on HN · Topics
Same here. I was a fervent Claude Code user at $200/mo until Opus 4.7. Freezing your IDE version is now a thing of the past; the new reality is that we can't expect agentic dev workflows to be consistent, and I see too many people (including myself) getting burned by going the single-provider route. On one hand I’m glad to finally see Anthropic communicate on this but at this point all I have to say is… time to diversify?
View on HN · Topics
I often have Claude commit and open a PR; in the last week I've seen several instances of it deciding to do extra work as part of the commit. It falls over when it tries to 'git add', but it got past me when I was trying auto mode once.
View on HN · Topics
I have found Claude to be especially unpredictable. I've mostly switched to GPT-5.4 now - although it's slightly less capable, it's massively more reliable.
View on HN · Topics
I wonder how well the "good" versions worked if you threw awkward edge cases at it.
View on HN · Topics
That implies it's broken. Juicing revenue and slashing opex at the expense of brand and customer retention is the feature.
View on HN · Topics
I think most frustrating is the system prompt issue after the postmortem from September[1]. These bugs have all of the same symptoms: undocumented model regressions at the application layer, and engineering cost optimizations that resulted in real performance regressions. I have some follow-up questions to this update: - Why didn't September's "Quality evaluations in more places" catch the prompt change regression, or the cache-invalidation bug? - How is Anthropic using these satisfaction questions? My own analysis of my own Claude logs showed strong material declines in satisfaction here, and I always answer those surveys honestly. Can you share what the data looked like and if you were using that to identify some of these issues? - There was no refund or comped tokens in September. Will there be some sort of comp to affected users? - How should subscribers of Claude Code trust that Anthropic-side engineering changes that hit our usage limits are being suitably addressed? To be clear, I am not trying to attribute malice or guilt here, I am asking how Anthropic can try and boost trust here. When we look at something like the cache-invalidation there's an engineer inside of Anthropic who says "if we do this we save $X a week", and virtually every manager is going to take that vs a soft-change in a sentiment metric. - Lastly, when Anthropic changes Claude Code's prompt, how much performance against the stated Claude benchmarks are we losing? I actually think this is an important question to ask, because users subscribe to the model's published benchmark performance and are sold a different product through Claude Code (as other harnesses are not allowed). [1] https://www.anthropic.com/engineering/a-postmortem-of-three-...
View on HN · Topics
IMO this is the consequence of a relentless focus on feature development over core product refinement. I often have the impression that Anthropic would benefit from a few senior product people. Someone needs to lend them a copy of “Escaping the Build Trap.” Just because we _can_ rapidly add features now doesn’t mean we should. PS I’m not referencing a well-known book to suggest the solution is trite product group think, but good product thinking is a talent separate from good engineering, and Anthropic seems short on the latter recently.
View on HN · Topics
Essentially they should hire a few of the old school product guys from Apple. Beat me to it, but the obsession with UX and quality from earlier Apple is exactly what they urgently need instead of tech folks trying to engineer themselves into complicated rabbit holes and shenanigans.
View on HN · Topics
They're losing customers because of quality concerns. Pausing development and focusing 100% on quality is how you fix that. That said, that may not have been obvious at all in the Jan/Feb time frame when they got a wave of customers due to ethical concerns.
View on HN · Topics
I think they've dug themselves into a complexity trap. Beyond the stochastic nature of the models themselves, I don't think they're able to reason about their software anymore. Too many levers, too many dials, and code that likely nobody understands. But worse, based on the pronouncements of Dario et al I suspect management is entirely unsympathetic because they believe we (SWEs) are on the chopping block to be replaced. And any suggestion of putting guard rails around these tools for quality concerns is, I suspect, being ignored or discouraged. In the end, I feel like Claude Code itself started as a bit of a science experiment and it doesn't smell to me like it's adopted mature best practices coming out of that.
View on HN · Topics
I agree. My real fear: if this is how the company works, how are systems with real implications (e.g. defense) being treated?
View on HN · Topics
This black box approach that large frontier labs have adopted is going to drive people away. To change fundamental behavior like this without notifying them, and only retroactively explaining what happened, is the reason they will move to self-hosting their own models. You can't build pipelines, workflows and products on a base that is just randomly shifting beneath you.
View on HN · Topics
They would honestly have been better off refusing customers if compute is so limited. Degrading the quality leads to customers leaving in the short term, and ruins their long term reputation. But in either case, if compute is so limited, they’ll have to compete with local coding agents. Qwen3.6-27B is good enough to beat having to wait until 5PM for your Claude Code limit to reset.
View on HN · Topics
If Anthropic couldn't catch these issues before people started screaming at them, do we really believe 50% of software engineering jobs are going away?
View on HN · Topics
Anthropic releases used to feel thorough and well done, with the models feeling immaculately polished. It felt like using a premium product, and it never felt like they were racing to keep up with the news cycle, or reply to competitors. Recently that immaculately polished feel is harder to find. It coincides with the daily releases of CC, Desktop App, unknown/undocumented changes to the various harnesses used in CC/Cowork. I find it an unwelcome shift. I still think they're the best option on the market, but the delta isn't as high as it was. Sometimes slowing down is the way to move faster.
View on HN · Topics
> investment in polish, quality, and reliability For there to be any trust in the above, the tool needs to behave predictably day to day. It shouldn't be possible to open your laptop and find that Claude suddenly has an IQ 50 points lower than yesterday. I'm not sure how you can achieve predictability while keeping inference costs in check and messing with quantization, prompts, etc on the backend. Maybe a better approach might be to version both the models and the system prompts, but frequently adjust the pricing of a given combination based on token efficiency, to encourage users to switch to cheaper modes on their own. Let users choose how much they pay for given quality of output though.
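The versioning idea in the comment above can be sketched from the consumer's side as a pin check: record the model version and a fingerprint of the system prompt that a pipeline was validated against, then refuse to run when either drifts. This is only a sketch under a strong assumption, namely that the provider exposes both values to the user; `PinnedConfig`, `check_pin`, and the example version strings are hypothetical names, not part of any real Claude Code or Anthropic API.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class PinnedConfig:
    """The (model, system prompt) combination a pipeline was validated against."""
    model_version: str
    system_prompt_sha256: str


def fingerprint(system_prompt: str) -> str:
    """Stable fingerprint of a system prompt, so any silent edit is detectable."""
    return hashlib.sha256(system_prompt.encode("utf-8")).hexdigest()


def check_pin(pin: PinnedConfig, model_version: str, system_prompt: str) -> list:
    """Return a list of human-readable drift problems; empty means the pin holds."""
    problems = []
    if model_version != pin.model_version:
        problems.append(
            f"model version changed: {pin.model_version} -> {model_version}"
        )
    if fingerprint(system_prompt) != pin.system_prompt_sha256:
        problems.append("system prompt changed since the pipeline was validated")
    return problems


if __name__ == "__main__":
    # Hypothetical values recorded when the workflow was last verified.
    pin = PinnedConfig("example-model-v1", fingerprint("You are a coding assistant."))
    # Hypothetical values reported today.
    for problem in check_pin(pin, "example-model-v2", "You are a terse coding assistant."):
        print("DRIFT:", problem)
```

A check like this does not make output predictable; it only turns a silent change into a loud one, which is the property several commenters are asking for.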
View on HN · Topics
Sure, I've cancelled my Max 20 subscription because you guys prioritize cutting your costs/increasing token efficiency over model performance. I use expensive frontier labs to get the absolute best performance, else I'd use an Open Source/Chinese one. Frontier LLMs still suck a lot, you can't afford planned degradation yet.
View on HN · Topics
My biggest problem with CC as a harness is that I can't trust "Plan" mode. Long running sessions frequently start bypassing plan mode and executing, updating files and stuff, without permission, while still in plan mode. And the only recovery seems to be to quit and reload CC. Right now my solution is to run CC in tmux and keep a 2nd CC pane with /loop watching the first pane and killing CC if it detects plan mode being bypassed. Burning tokens to work around a bug.
View on HN · Topics
I am considering providing my feedback by not providing my money any longer.
View on HN · Topics
I think you're being a bit harsh. ... But then again, many of us are paying out of pocket $100, $200USD a month. Far more than any other development tools. Services that cost that much money generally come with expectations.
View on HN · Topics
I've noticed the same thing in my own AI assisted work. Feels like I'm moving too fast and it's easy to implement decisions quickly but they really have to be the right f--ing decisions. In the past dev was so slow so you had a lot of time to vet the hard decisions and now you don't.
View on HN · Topics
None of these problems equate to degrading model performance. Completely different team. Degraded CC harness, sure.
View on HN · Topics
One of Anthropic's ostensible ethical goals is to produce AI that is "understandable" as well as exceptionally "well-aligned". It's striking that some of the same properties that make AI risky also just make it hard to consistently deliver a good product. It occurs to me that if Anthropic really makes some breakthroughs in those areas, everyone will feel it in terms of product quality whether they're worried about grandiose/catastrophic predictions or not. But right now it seems like, in the case of (3), these systems are really sensitive and unpredictable. I'd characterize that as an alignment problem, too.
View on HN · Topics
ngl lost alot of trust in cc after reading this, specially point 1 how do you just do that to millions of users building prod code with your shit
View on HN · Topics
At least personally, it feels like the choices are the one that's okay with being used for mass surveillance and autonomous weapons targeting, the one that's on track to get acquired by the AI company that dragged its feet in getting around to stopping people from making child porn with it, the one that nobody seems to use from Google, and the one that everyone complains about but also still seems to be using because it at least sometimes works well. At this point I've opted out of personal LLM coding by canceling my subscription (although my employer still has subscriptions and wants us to keep using them, so I'll presumably keep using Claude there) but if I had to pick one to spend my own money on I'd still go with Claude.
View on HN · Topics
It's fairly small issues for an amazing product, and the company is just a few years old and growing rapidly. Also, they are leading a powerful technological revolution and their competitors are known to have multiple straight up evil tendencies. A little degradation is not an issue.
View on HN · Topics
What's the alternative? Are you suggesting other LLM providers don't charge high price? Or that they don't make mistakes? Or that they provide better quality? We're talking about dynamically developed products, something that most people would have considered impossible just 5 years ago. A non-deterministic product that's very hard to test. Yes, Anthropic makes mistakes, models can get worse over time, their ToS change often. But again, is Gemini/GPT/Grok a better alternative?
View on HN · Topics
Because it is still good though. If you have a good product, you are more understanding. And getting worse doesn't mean it's no longer valuable, only that the price/value ratio went down. But Opus 4.5 was significantly better and only came out in November. There was no price increase at that time, so for the same money we get better models. Opus 4.6 again feels significantly better though. Also moving fastish means having more/better models faster. I do know plenty of people, though, who use opencode or pi and openrouter and switch models a lot more often.
View on HN · Topics
> It’s incredible how forgiving you guys are with Anthropic and their errors. Ironically, I was thinking the exact opposite. This is bleeding edge stuff and they keep pushing new models and new features. I would expect issues. I was surprised at how much complaining there is -- especially coming from people who have probably built and launched a lot of stuff and know how easy it is to make mistakes.
View on HN · Topics
Confused as well. I rather supposed Anthropic had some standing for saying no to Trump and being declared a national security threat, but the anger they got, and people leaving for OpenAI again, who gladly said yes to autonomous killing AI, did astonish me a bit. And I also had weird things happening with my usage limits and was not happy about it. But it is still very useful to me - and I only pay for the pro plan.
View on HN · Topics
>I rather supposed Anthropic had some standing for saying no to Trump and being declared national security threat I never understood why people cheered for Anthropic then when they happily work together with Palantir.
View on HN · Topics
Remember Louis CK talking about Wi-Fi on an airplane? People are dealing with highly experimental technology here
View on HN · Topics
Exactly. They've done now like 6 rug-pulls. Idiots keep throwing money at real-time enshittification and 'I am changing the terms. Pray I do not change them further'. And yes, I am absolutely calling people who keep getting screwed and paying for more 'service' idiots. And Anthropic has proved that they will pay for less and less. So, why not fuck them over and make more company money?
View on HN · Topics
something i note from this is that this is not a model weights change, but it is a hidden state change anthropic is doing to the outputs that can tune the quality up and down on the "model" without breaking the "we aren't changing the model" promise. how often do these changes happen?
View on HN · Topics
The funny thing is, in the last 3 days Claude has gotten substantially worse. So this claim, "All three issues have now been resolved as of April 20 (v2.1.116)" does not land with me at all.
View on HN · Topics
for me at least, yes. just wrote it to coworkers this afternoon. Behaves way more "stable" in terms of quality and i don't have the feeling of the model getting way worse after 100k tokens of context or so. What i notice: after 300k there's some slight quality drop, but i just make sure to compact before that threshold.
View on HN · Topics
My takeaway is that they knew they were changing a bunch of stuff while their reps were gaslighting us in the comments here. Why should we ever trust what they say again, or trust that they won’t be rug-pulling again once this blows over?
View on HN · Topics
If you think that you can just silently modify the model without any announcements and only react when it doesn't go through unnoticed, then be 100% sure that your clients will check every possible alternative and will leave you as soon as they find anything similar in quality (and no, not a degraded one).
View on HN · Topics
So it turns out Anthropic was gaslighting everyone on twitter about this then? Swearing that nothing had changed and people were imagining the models got worse?
View on HN · Topics
I genuinely don't understand what they have been trying to achieve. All of these incremental "improvements" have ... not improved anything, and have had the opposite effect. My trust is gone. When day-to-day updates do nothing but cause hundreds of dollars in lost $$$ tokens and the response is "we ... sorta messed up but just a little bit here and there and it added up to a big mess up" bro get fuckin real.
View on HN · Topics
Resuming from sessions has been broken since Feb (I had to get claude to write a hook to fix that itself), the monitoring tool doesn't work and blocks usage of what does (simple sleep - except it doesn't even block correctly so you just sidestep in more ridiculous ways), and yet there seem to be more annoying activity proxies/spinner wheels (staring into middle distance)... Like I don't know how in a span of a few months you lose such focus on your product goals. Has Anthropic reached that point in their lifecycle already where their product team is no longer staffed by engineers and they have more and more non-technical MBAs joining trying to ride the hype train?