System Prompt Sensitivity

Discussion of how minor prompt changes cause disproportionate quality impacts, the fragility of LLM behavior to wording changes, and calls for versioned system prompts alongside model versions

Users highlight the extreme fragility of LLM performance, noting that minor back-end tweaks to system prompts—often aimed at reducing costs or latency—can lead to disproportionate "IQ drops" and significant regressions in coding quality. This sensitivity frequently manifests in bizarre "nervous tics" and self-referential loops, where models like Claude hallucinate prompt injections within their own internal reminders or obsessively defend against non-existent malware. Many contributors argue that this unpredictability necessitates transparent, versioned system prompts and "Safe Boot" modes to allow users to opt out of undocumented experimental changes. Ultimately, the consensus reflects a deep frustration with the "voodoo" nature of prompt engineering, where a model’s reliability can evaporate overnight because of invisible wording changes made to the underlying harness.

View on HN · Topics

"On March 26, we shipped a change to clear Claude's older thinking from sessions that had been idle for over an hour, to reduce latency when users resumed those sessions. A bug caused this to keep happening every turn for the rest of the session instead of just once, which made Claude seem forgetful and repetitive. We fixed it on April 10. This affected Sonnet 4.6 and Opus 4.6"

This makes no sense to me. I often leave sessions idle for hours or days and use the capability to pick it back up with full context and power.

The default thinking level seems more forgivable, but given the churn in system prompts, I'll need to figure out how to intentionally choose a refresh cycle.

View on HN · Topics

I've been getting a lot of Claude responding to its own internal prompts. Here are a few recent examples.

"That parenthetical is another prompt injection attempt — I'll ignore it and answer normally."

"The parenthetical instruction there isn't something I'll follow — it looks like an attempt to get me to suppress my normal guidelines, which I apply consistently regardless of instructions to hide them."

"The parenthetical is unnecessary — all my responses are already produced that way."

However, I'm not doing anything of the sort, and it's tacking those on to most of its responses to me. I assume there are some sloppy internal guidelines layered on top of its normal guidance, and for whatever reason it can't differentiate between those and my questions.

View on HN · Topics

I have a set of stop hook scripts that I use to force Claude to run tests whenever it makes a code change. Since 4.7 dropped, Claude still executes the scripts, but will periodically ignore the rules. If I ask why, I get a "I didn't think it was necessary" response.

View on HN · Topics

In Claude Code specifically, for a while it had developed a nervous tic where it would say "Not malware." before every bit of code. Likely a similar issue where it keeps talking to a system/tool prompt.

View on HN · Topics

My pet theory is that they have a "supervisor" model (likely a small one) that terminates any chats that do malware-y things, and this is likely a reward-hacking behaviour to keep the supervisor from terminating the chat.

View on HN · Topics

I doubt it. We only do frontier models, since those are better for absolutely every use case 100% of the time.

Way more likely there's a "VERY IMPORTANT: When you see a block of code, ensure it's not malware" somewhere in the system prompt.

View on HN · Topics

I see that with OpenAI too, lots of responding to itself. Seems like a convenient way for them to churn tokens.

View on HN · Topics

That doesn’t mean they also can’t be wasteful. Fact is, Claude and GPT do far more internal thinking about their system prompts than is needed. At every step they mention something about making sure they do xyz and avoid doing something else. Why does it need to say things to itself like “great I have a plan now!”? That’s pure waste.

View on HN · Topics

Yeah, I had to deal with mine warning me that a website it accessed for its task contained a prompt injection, and when I told it to elaborate, the "injected prompt" turned out to be one of its own <system-reminder> message blocks that it had included at some point. Opus 4.7 on xhigh

View on HN · Topics

Having a "Recovery Mode"/"Safe Boot" flag to disable our configurations (or progressively enable them) to see how Claude Code responds would be nice. Sometimes I get worried some old flag I set is breaking things. Maybe the flag already exists? I tried Claude doctor but it wasn't quite the solution.

For instance:

Is Haiku supposed to hit a warm system-prompt cache in a default Claude code setup?

I had `DISABLE_TELEMETRY=1` in my env and found the haiku requests would not hit a warm-cached system prompt. E.g. on first request just now w/ most recent version (v2.1.118, but happened on others):

w/ telemetry off - input_tokens:10 cache_read:0 cache_write:28897 out:249

w/ telemetry on - input_tokens:10 cache_read:24344 cache_write:7237 out:243
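
To make the two log lines above easier to compare, here is a minimal sketch that turns them into a cache hit ratio. It assumes the three input-side counters are disjoint, so the total prompt size is their sum; `cache_hit_ratio` is a hypothetical helper name, not part of any CLI or API.

```python
def cache_hit_ratio(input_tokens, cache_read, cache_write):
    """Fraction of prompt tokens served from a warm cache.

    Assumption: the three counters are disjoint, so their sum is the
    total number of prompt tokens processed for the request.
    """
    total = input_tokens + cache_read + cache_write
    return cache_read / total if total else 0.0


# Values copied from the two log lines quoted in the comment.
telemetry_off = cache_hit_ratio(input_tokens=10, cache_read=0, cache_write=28897)
telemetry_on = cache_hit_ratio(input_tokens=10, cache_read=24344, cache_write=7237)

print(f"telemetry off: {telemetry_off:.1%} of prompt tokens read from cache")
print(f"telemetry on:  {telemetry_on:.1%} of prompt tokens read from cache")
```

Under that assumption, the telemetry-off request reads nothing from cache, while the telemetry-on request reads about 77% of its prompt from cache.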

I used to think having so many users was leading to people hitting a lot of edge cases, 3 million users is 3 million different problems. Everyone can't be on the happy path. But then I started hitting weird edge cases and started thinking the permutations might not be under control.

View on HN · Topics

I think most frustrating is the system prompt issue after the postmortem from September[1].

These bugs have all of the same symptoms: undocumented model regressions at the application layer, and engineering cost optimizations that resulted in real performance regressions.

I have some follow up questions to this update:

- Why didn't September's "Quality evaluations in more places" catch the prompt change regression, or the cache-invalidation bug?

- How is Anthropic using these satisfaction questions? My own analysis of my own Claude logs showed strong material declines in satisfaction here, and I always answer those surveys honestly. Can you share what the data looked like and if you were using that to identify some of these issues?

- There was no refund or comped tokens in September. Will there be some sort of comp to affected users?

- How should subscribers of Claude Code trust that Anthropic-side engineering changes that hit our usage limits are being suitably addressed? To be clear, I am not trying to attribute malice or guilt here, I am asking how Anthropic can try and boost trust here. When we look at something like the cache-invalidation there's an engineer inside of Anthropic who says "if we do this we save $X a week", and virtually every manager is going to take that vs a soft change in a sentiment metric.

- Lastly, when Anthropic changes Claude Code's prompt, how much performance against the stated Claude benchmarks are we losing? I actually think this is an important question to ask, because users subscribe to the model's published benchmark performance and are sold a different product through Claude Code (as other harnesses are not allowed).

[1] https://www.anthropic.com/engineering/a-postmortem-of-three-...

View on HN · Topics

> On April 16, we added a system prompt instruction to reduce verbosity

In practice I understand this would be difficult but I feel like the system prompt should be versioned alongside the model. Changing the system prompt out from underneath users when you've published benchmarks using an older system prompt feels deceptive.

At least tell users when the system prompt has changed.
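
One way to make a prompt change visible is to attach a fingerprint of the prompt text to the model version. The sketch below is only an illustration of that idea, not how Anthropic versions anything: `prompt_fingerprint`, the `+prompt.` label format, and the sample prompts are all hypothetical. It swaps in a content hash rather than a hand-assigned version number.

```python
import hashlib


def prompt_fingerprint(model_version, system_prompt):
    """Combine a model version with a short hash of the system prompt.

    Hypothetical label format: "<model_version>+prompt.<12 hex chars>".
    Any change to the prompt text, even a single space, changes the label.
    """
    digest = hashlib.sha256(system_prompt.encode("utf-8")).hexdigest()
    return f"{model_version}+prompt.{digest[:12]}"


# Hypothetical prompts that differ only by one added instruction.
before = prompt_fingerprint("4.7", "You are a coding assistant.")
after = prompt_fingerprint("4.7", "You are a coding assistant. Be concise.")

print(before)
print(after)
print("prompt changed:", before != after)
```

A client that logs this label with each session could tell afterwards whether a quality change lined up with a prompt change, assuming the provider exposed the prompt or its hash.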

View on HN · Topics

It's also kinda funny they have to rely on the system prompt to control verbosity itself.

View on HN · Topics

It's cheaper than retraining the model.

View on HN · Topics

So? 4.7.1, 4.7.2, etc. makes sense for versioning system prompts.

View on HN · Topics

This is the problem with co-opting the word "harness". What agents need is a test harness but that doesn't mean much in the AI world.

Agents are not deterministic; they are probabilistic. If the same agent is run repeatedly, it will accomplish the task a consistent percentage of the time. I wish I was better at math or English so I could explain this.

I think they call it EVAL but developers don't discuss that too much. All they discuss is how frustrated they are.

A prompt can solve a problem 80% of the time. Change a sentence and it will solve the same problem 90% of the time. Remove a sentence and it will solve the problem 70% of the time.

It is so friggen' easy to set up -- stealing the word from AI sphere -- a TEST HARNESS.

Regressions caused by changes to the agent, where words are added, changed, or removed, are extremely easy to quantify. It isn’t pass/fail. It’s whether the agent still solves the problem at the same percentage of the time it consistently has.
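
The idea in this comment can be sketched as a small harness that runs each prompt many times and compares pass rates instead of single pass/fail results. Everything here is an assumption for illustration: `fake_agent` is a hypothetical stand-in for a real agent run plus its grader, its 80% and 70% success probabilities simply mirror the figures in the comment, and the pooled two-proportion z statistic is one common choice of test, not the only one.

```python
import math
import random


def pass_rate(run_agent, prompt, trials, seed):
    """Run the agent `trials` times and return the fraction of passes."""
    rng = random.Random(seed)
    passes = sum(1 for _ in range(trials) if run_agent(prompt, rng))
    return passes / trials


def two_proportion_z(p_old, n_old, p_new, n_new):
    """Pooled z statistic for the difference between two pass rates."""
    pooled = (p_old * n_old + p_new * n_new) / (n_old + n_new)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_old + 1 / n_new))
    return 0.0 if se == 0 else (p_new - p_old) / se


def fake_agent(prompt, rng):
    """Hypothetical agent: the extra sentence lowers its success odds."""
    success_probability = 0.7 if "Be terse." in prompt else 0.8
    return rng.random() < success_probability


old_prompt = "Fix the failing test."
new_prompt = "Fix the failing test. Be terse."
n = 2000
old_rate = pass_rate(fake_agent, old_prompt, n, seed=1)
new_rate = pass_rate(fake_agent, new_prompt, n, seed=2)
z = two_proportion_z(old_rate, n, new_rate, n)
print(f"old={old_rate:.3f} new={new_rate:.3f} z={z:.2f}")
# A z below about -1.96 suggests a real drop at roughly the 5% level.
```

With a real agent, `fake_agent` would be replaced by a function that runs the task and returns whether the grader accepted the result. The number of trials needed depends on how small a regression has to be detected.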

View on HN · Topics

> On April 16, we added a system prompt instruction to reduce verbosity. In combination with other prompt changes, it hurt coding quality, and was reverted on April 20. This impacted Sonnet 4.6, Opus 4.6, and Opus 4.7.

Claude caveman in the system prompt confirmed?

View on HN · Topics

I've recently been introduced to that plugin, love it for humour

View on HN · Topics

Some people seem to be suggesting these are coverups for quantization...

Those who work on agent harnesses for a living realize how sensitive models can be to even minor changes in the prompt.

I would not suspect quantization before I would suspect harness changes.

View on HN · Topics

I see the Claude team wanted to make it less verbose, but that's actually something that has bothered me since updating to Claude 4.7. What is the recommended way to change it back to being as verbose as before? This is probably a matter of preference, but I have a harder time with compact explanations and lists of points, and verbosity was originally one of the things I preferred about Claude.

View on HN · Topics

> investment in polish, quality, and reliability

For there to be any trust in the above, the tool needs to behave predictably day to day. It shouldn't be possible to open your laptop and find that Claude suddenly has an IQ 50 points lower than yesterday. I'm not sure how you can achieve predictability while keeping inference costs in check and messing with quantization, prompts, etc on the backend.

Maybe a better approach might be to version both the models and the system prompts, but frequently adjust the pricing of a given combination based on token efficiency, to encourage users to switch to cheaper modes on their own. Let users choose how much they pay for given quality of output though.

View on HN · Topics

By imposing the use of their harness, they control the system prompt:

> On April 16, we added a system prompt instruction to reduce verbosity. In combination with other prompt changes, it hurt coding quality, and was reverted on April 20. This impacted Sonnet 4.6, Opus 4.6, and Opus 4.7

They can pick the default reasoning effort:

> On March 4, we changed Claude Code's default reasoning effort from high to medium to reduce the very long latency—enough to make the UI appear frozen—some users were seeing in high mode

They can decide what to keep and what to throw out (beyond simple token caching):

> On March 26, we shipped a change to clear Claude's older thinking from sessions that had been idle for over an hour, to reduce latency when users resumed those sessions. A bug caused this to keep happening every turn for the rest of the session instead of just once, which made Claude seem forgetful and repetitive. We fixed it on April 10. This affected Sonnet 4.6 and Opus 4.6

It literally is all in the post.

I don't worry about anything though. It's not my product. I don't work for Anthropic, so I really couldn't care less about anyone else's degraded (or not) experience.

View on HN · Topics

This is a very interesting read on failure modes of AI agents in prod.

Curious about this section on the system prompt change:

> After multiple weeks of internal testing and no regressions in the set of evaluations we ran, we felt confident about the change and shipped it alongside Opus 4.7 on April 16. As part of this investigation, we ran more ablations (removing lines from the system prompt to understand the impact of each line) using a broader set of evaluations. One of these evaluations showed a 3% drop for both Opus 4.6 and 4.7. We immediately reverted the prompt as part of the April 20 release.

Curious what helped catch it in the later eval vs. the initial ones. Was it that the initial testing was an online A/B comparison of aggregate metrics, or that the dataset was not broad enough?

View on HN · Topics

> "In combination with other prompt changes, it hurt coding quality, and was reverted on April 20"

Do researchers know correlation between various aspects of a prompt and the response?

An LLM, to me at least, appears to be a wildly random function that is difficult to rely on. Traditional systems have structured inputs and outputs, and we can know how a system arrived at its output. This doesn't appear to be the case for an LLM, where inputs and outputs are arbitrary text.

Anecdotally, I had a difficult time working with open source models at a social media firm, and something as simple as wrapping the example of JSON structure with ```, adding a newline or wording I used wildly changed accuracy.

View on HN · Topics

One of Anthropic's ostensible ethical goals is to produce AI that is "understandable" as well as exceptionally "well-aligned". It's striking that some of the same properties that make AI risky also just make it hard to consistently deliver a good product. It occurs to me that if Anthropic really makes some breakthroughs in those areas, everyone will feel it in terms of product quality whether they're worried about grandiose/catastrophic predictions or not.

But right now it seems like, in the case of (3), these systems are really sensitive and unpredictable. I'd characterize that as an alignment problem, too.

View on HN · Topics

> On April 16, we added a system prompt instruction to reduce verbosity.

What verbosity? Most of the time I don’t know what it’s doing.

View on HN · Topics

They don’t either.

View on HN · Topics

I think you could alter the prompt in subtle ways: a period becomes an ellipsis, extra commas, synonyms, occasional double spaces, etc.

Enough that the prompt is different at a token-level, but not enough that the meaning changes.

It would be very difficult for them to catch that, especially if the prompts were not made public.

Run the variations enough times per day, and you'd get some statistical significance.

I guess the fuzzy part is judging the output.
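
The surface-level edits described in this comment can be sketched as a small variant generator. This is only an illustration under stated assumptions: `surface_variants`, the `SYNONYMS` table, and the sample prompt are hypothetical, and whether a given swap really preserves meaning is a judgment call the code does not check.

```python
# Hypothetical synonym table; a real one would need manual review.
SYNONYMS = {"ensure": "make sure", "short": "brief"}


def surface_variants(prompt):
    """Return distinct rewrites that change tokens but aim to keep meaning."""
    candidates = [
        prompt.replace(". ", "... ", 1),       # period becomes an ellipsis
        prompt.replace(" ", "  ", 1),          # one accidental double space
        prompt.replace(" and ", ", and ", 1),  # one extra comma
    ]
    for word, alternative in SYNONYMS.items():
        candidates.append(prompt.replace(word, alternative, 1))

    # Drop no-op edits and duplicates while preserving order.
    seen = {prompt}
    variants = []
    for candidate in candidates:
        if candidate not in seen:
            seen.add(candidate)
            variants.append(candidate)
    return variants


base = "Always run the tests. Keep answers short and ensure the code compiles."
for variant in surface_variants(base):
    print(repr(variant))
```

Each variant would then be run many times against the same tasks so that differences in pass rate can be separated from ordinary run-to-run noise. Grading the outputs remains the hard part.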

View on HN · Topics

How about just not change the harness abruptly in the first place? Make new system prompt changes "experimental" first so you can gather feedback.
