Dogfooding Failures

Discussion of how Anthropic employees use different internal builds than customers, how internal experiments masked bugs, and the suggestion that eating your own dog food only works with the shipped version.

Critics argue that Anthropic's engineering culture has drifted into a "vibe-coding" approach in which severe, avoidable bugs persist because internal staff do not work under the same constraints or in the same environments as paying customers. The central complaint is a failure of the company's "dogfooding" process: commenters point out that employees use internal experimental builds with effectively unlimited tokens, which masked regressions such as the "forgetful" Claude bug that would have been obvious on the public build. The result is a perception that the company prioritizes speed over stability and leaves high-paying users acting as involuntary beta testers for software that commenters believe lacks rigorous unit testing and quality control.

View on HN · Topics

> I’m sorry to be harsh, but your engineering culture must change. There are some types of software you can yolo. This isn’t one of them. The downstream cost of stupid mistakes is way, way too high, and far too many entirely avoidable bugs — and poor design choices — are shipping to customers way too often.

I have to imagine this isn't helped by working somewhere where you effectively have infinite tokens and usage of the product that people are paying for, sometimes a lot.

View on HN · Topics

“After evals and dogfooding”? Couldn’t they have done this before releasing the model? We are paying $200/month to beta test the software for you.

View on HN · Topics

It's likely a corner case for their developers. The danger of working on a project is assuming that users behave the way you do.

View on HN · Topics

Even for all of us plan users, where we got barely any use from our plan because we'd destroy our 5h and 1w usage limits, also unlikely, after all they have an out of "your usage limits are guaranteed to be 5x of Pro users" (who are also being screwed).

Of course, all their vibe coding is being done with effectively infinite tokens, so...

View on HN · Topics

I guess it's a bit of desperation to find a sustainable business model.

The AI hype is dying, at least outside the silicon valley bubble which hackernews is very much a part of.

That and all the dogfooding by slop coding their user facing application(s).

View on HN · Topics

> 2. Old sessions had the thinking tokens stripped, resuming the session made Claude stupid (took 15 days to notice and remediate)

This one was egregious: after a one hour user pause, apparently they cleared the cache and then continued to apply “forgetting” for the rest of the session after the resume!

Seems like a very basic software engineering error that would be caught by normal unit testing.

View on HN · Topics

They should really test everything thoroughly and then make it available to the general public to avoid these issues!!

View on HN · Topics

This is a very interesting read on failure modes of AI agents in prod.

Curious about this section on the system prompt change:

> After multiple weeks of internal testing and no regressions in the set of evaluations we ran, we felt confident about the change and shipped it alongside Opus 4.7 on April 16. As part of this investigation, we ran more ablations (removing lines from the system prompt to understand the impact of each line) using a broader set of evaluations. One of these evaluations showed a 3% drop for both Opus 4.6 and 4.7. We immediately reverted the prompt as part of the April 20 release.

Curious what helped catch this in the later eval versus the initial ones. Was it that the initial testing was an online A/B comparison of aggregate metrics, or that the dataset was not broad enough?
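The ablation procedure mentioned in the quoted passage (removing lines from the system prompt to see the impact of each line) can be sketched as a leave-one-out loop. This is only a generic illustration: the postmortem does not say how the ablations were implemented, and `ablate_lines`, `toy_score`, and the prompt lines below are hypothetical names and data, not Anthropic's harness or evaluations.

```python
from typing import Callable, List, Sequence, Tuple


def ablate_lines(
    prompt_lines: Sequence[str],
    score: Callable[[str], float],
) -> List[Tuple[str, float]]:
    """Leave-one-out ablation: for each line, report how much the score
    changes when that single line is removed from the prompt."""
    lines = list(prompt_lines)
    baseline = score("\n".join(lines))
    results = []
    for i, line in enumerate(lines):
        ablated_prompt = "\n".join(lines[:i] + lines[i + 1:])
        # Positive delta: the eval improves without this line,
        # i.e. the line was hurting performance on this eval.
        results.append((line, score(ablated_prompt) - baseline))
    return results


# Hypothetical prompt and scorer, for illustration only.
LINES = ["Be helpful.", "Be terse.", "Cite sources."]


def toy_score(prompt: str) -> float:
    # Assumption: a stand-in for a real evaluation suite. It pretends that
    # one specific line costs three points on this eval.
    return 1.0 - (0.03 if "Be terse." in prompt else 0.0)


if __name__ == "__main__":
    for line, delta in ablate_lines(LINES, toy_score):
        print(f"{delta:+.3f}  {line}")
```

A per-line report like this only surfaces a regression if the evaluation behind `score` is sensitive to it, which fits the commenter's question about whether the initial dataset was broad enough.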

View on HN · Topics

Useful update. Would be useful to me to switch to a nightly / release cycle but I can see why they don't: they want to be able to move fast and it's not like I'm going to churn over these errors. I can only imagine that the benchmark runs are prohibitively expensive or slow or not using their standard harness because that would be a good smoke test on a weekly cadence. At the least, they'd know the trade-offs they're making.

Many of these things have bitten me too. Firing off a request that is slow because it has been kicked out of the cache, with zero cache hits, makes everything far more expensive, so it makes sense that they would do this. I tried skipping tool calls and thinking as well, and it made the agent much stupider. These all seem like natural things to try. Pity.

View on HN · Topics

The third bug is the one worth dwelling on. Dropping thinking blocks every turn instead of just once is the kind of regression that only shows up in production traffic. A unit test for "idle-threshold clearing" would assert "was thinking cleared after an hour of idle" (yes) without asserting "is thinking preserved on subsequent turns" (no). The invariant is negative space.

The real lesson is that an internal message-queuing experiment masked the symptoms in their own dogfooding. Dogfooding only works when the eaten food is the shipped food.
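The missing invariant described in this comment can be sketched with a toy session model. Everything here is an assumption made for illustration: `Session`, `Turn`, the `sticky_clear` flag, and the timestamps are hypothetical and are not taken from Claude Code. The first check is the one the commenter expects a naive test to contain; the second is the "negative space" check that separates the intended behavior from the reported failure mode.

```python
from dataclasses import dataclass, field
from typing import List, Optional

IDLE_THRESHOLD_S = 3600.0  # "idle for over an hour", per the quoted postmortem


@dataclass
class Turn:
    text: str
    thinking: Optional[str] = None


@dataclass
class Session:
    # Hypothetical model. sticky_clear=True mimics the reported failure mode,
    # where clearing keeps happening on every turn after the resume.
    sticky_clear: bool = False
    turns: List[Turn] = field(default_factory=list)
    last_active: float = 0.0
    clearing: bool = False

    def add_turn(self, text: str, thinking: str, now: float) -> None:
        if now - self.last_active > IDLE_THRESHOLD_S:
            self.clearing = True
        if self.clearing:
            for turn in self.turns:
                turn.thinking = None  # drop older thinking
            if not self.sticky_clear:
                self.clearing = False  # intended behavior: clear only once
        self.turns.append(Turn(text, thinking))
        self.last_active = now


def run_scenario(sticky_clear: bool) -> Session:
    session = Session(sticky_clear=sticky_clear)
    session.add_turn("first", "t0", now=0.0)
    session.add_turn("resume", "t1", now=4000.0)  # resumed after > 1 hour idle
    session.add_turn("follow-up", "t2", now=4010.0)  # ordinary next turn
    return session


def cleared_on_resume(session: Session) -> bool:
    # The assertion a naive test would make.
    return session.turns[0].thinking is None


def preserved_after_resume(session: Session) -> bool:
    # The missing assertion: thinking produced after the resume must survive.
    return session.turns[1].thinking == "t1"


if __name__ == "__main__":
    for flag in (False, True):
        s = run_scenario(flag)
        print(flag, cleared_on_resume(s), preserved_after_resume(s))
```

In this sketch both variants pass `cleared_on_resume`, so a test containing only that check cannot tell them apart; only `preserved_after_resume` fails for the sticky variant.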

View on HN · Topics

Experienced engineers who know the codebase and system well, and who have enough time to think the problem through, would likely consider this case.

But if we're vibing... This is the kind of bug that should make it back into a review agent/skill's instructions in a more generic form. Essentially: if something is done to the message history, check that there are tests confirming that subsequent turns work as expected.

But yeah, you'd have to piss off a bunch of users in prod first to discover the blind spot.

View on HN · Topics

Those are exactly the kind of issues you run into when your app is AI-coded: you build one thing and kill something else.

You have too many benchmarks, and the wrong ones.

View on HN · Topics

I had a similar experience just before 4.5 and before 4.6 were released.

Somehow, three times makes me not feel confident in this response.

Also, if this is all true and correct, how the heck do they validate quality before shipping anything?

Shipping software without quality is a pretty easy job even without AI. Just saying...

View on HN · Topics

> On March 26, we shipped a change to clear Claude's older thinking from sessions that had been idle for over an hour, to reduce latency when users resumed those sessions. A bug caused this to keep happening every turn for the rest of the session instead of just once, which made Claude seem forgetful and repetitive. We fixed it on April 10. This affected Sonnet 4.6 and Opus 4.6.

Is it just me or does this seem kind of shocking? Such a severe bug affecting millions of users with a non-trivial effect on the context window that should be readily evident to anyone looking at the analytics. Makes me wonder if this is the result of Anthropic's vibe-coding culture. No one's actually looking at the product, its code, or its outputs?

View on HN · Topics

It's really hard to understand. There needs to be really loud batman sign in the sky type signals from some hero third party calling out objective product degradation. Do they use cc internally? If so do they use a different version? This should've been almost as loud a break as service just going down altogether, yet it took 2 weeks to fix?!

View on HN · Topics

> ... we’ll ensure that a larger share of internal staff use the exact public build of Claude Code (as opposed to the version we use to test new features) ...

Apparently they are using another version internally.
