Token Anxiety and Pricing

Discussion of mental overhead from token-based pricing, comparisons to per-minute phone billing, requests for clearer usage indicators, and debates about whether Anthropic should simply charge more for better quality.

Users are grappling with significant "token anxiety" as unpredictable billing and the technical overhead of KV caching turn AI interactions into a stressful experience reminiscent of outdated per-minute phone billing. While Anthropic attempted to mitigate costs by pruning conversation history and "thinking" logs during idle sessions, many power users expressed frustration with the resulting quality degradation, arguing they would rather pay a premium for full context than suffer the "silent lobotomy" of a degraded model. Consequently, the community is calling for greater transparency through real-time usage indicators and the introduction of higher-priced, high-performance tiers that prioritize context retention and reliability over forced efficiency.

View on HN · Topics

Hey, Boris from the Claude Code team here.

Normally, when you have a conversation with Claude Code, if your convo has N messages, then (N-1) messages hit prompt cache -- everything but the latest message.

The challenge is: when you let a session idle for >1 hour, when you come back to it and send a prompt, it will be a full cache miss, all N messages. We noticed that this corner case led to outsized token costs for users. In an extreme case, if you had 900k tokens in your context window, then idled for an hour, then sent a message, that would be >900k tokens written to cache all at once, which would eat up a significant % of your rate limits, especially for Pro users.
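A rough sketch of the accounting described above may help. It assumes a one-hour cache lifetime and treats each message as a plain token count; the function name and the `CACHE_TTL_SECONDS` constant are illustrative and not part of Claude Code.

```python
CACHE_TTL_SECONDS = 3600  # assumption: the ~1 hour idle window mentioned above


def tokens_by_cache_status(message_tokens, idle_seconds, ttl=CACHE_TTL_SECONDS):
    """Split a conversation's tokens into (read_from_cache, written_fresh).

    message_tokens: token counts per message, oldest first; the last
    entry is the new prompt being sent.
    """
    if not message_tokens:
        return 0, 0
    total = sum(message_tokens)
    if idle_seconds > ttl:
        # Full cache miss: all N messages are processed and written again.
        return 0, total
    # Warm cache: the first N-1 messages hit, only the latest one is new.
    return total - message_tokens[-1], message_tokens[-1]


history = [300_000, 300_000, 299_000, 1_000]  # 900k tokens in total
print(tokens_by_cache_status(history, idle_seconds=120))   # (899000, 1000)
print(tokens_by_cache_status(history, idle_seconds=4000))  # (0, 900000)
```

The same 1,000-token prompt costs 1,000 fresh tokens on a warm cache and 900,000 after the idle window has passed, which is the corner case being described.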

We tried a few different approaches to improve this UX:

1. Educating users on X/social

2. Adding an in-product tip to recommend running /clear when re-visiting old conversations (we shipped a few iterations of this)

3. Eliding parts of the context after idle: old tool results, old messages, thinking. Of these, thinking performed the best, and when we shipped it, that's when we unintentionally introduced the bug in the blog post.

Hope this is helpful. Happy to answer any questions you have.

View on HN · Topics

I appreciate the reply, but I was never under the impression that gaps in conversations would increase costs nor reduce quality. Both are surprising and disappointing.

I feel like that is a choice best left up to users.

i.e. "Resuming this conversation with full context will consume X% of your 5-hour usage bucket, but that can be reduced by Y% by dropping old thinking logs"
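As a minimal sketch of how such a notice could be computed: the inputs below (context size, thinking-log size, and the size of the 5-hour bucket in tokens) are all hypothetical, since the thread does not say how the bucket is measured or how uncached tokens are weighted.

```python
def resume_notice(context_tokens, thinking_tokens, bucket_tokens):
    """Build the kind of notice suggested above from hypothetical token counts."""
    if context_tokens <= 0 or bucket_tokens <= 0:
        raise ValueError("token counts must be positive")
    # Share of the usage bucket that a full, uncached resume would consume.
    full_pct = 100 * context_tokens / bucket_tokens
    # Relative reduction in that cost if old thinking logs are dropped.
    saved_pct = 100 * thinking_tokens / context_tokens
    return (
        f"Resuming this conversation with full context will consume "
        f"{full_pct:.0f}% of your 5-hour usage bucket, but that can be "
        f"reduced by {saved_pct:.0f}% by dropping old thinking logs"
    )


print(resume_notice(context_tokens=400_000, thinking_tokens=100_000,
                    bucket_tokens=2_000_000))
```

With these made-up numbers the notice reports 20% of the bucket and a 25% reduction.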

View on HN · Topics

Another way to think about it might be that caching is part of Anthropic's strategy to reduce costs for its users, but they are now trying to be more mindful of their costs (probably partly due to significant recent user growth as well as plans to IPO which demand fiscal prudence).

Perhaps if we were willing to pay more for our subscriptions, Anthropic would be able to offer longer cache windows. But IDK, one hour seems like a reasonable amount of time given the context, and it is a limitation I'm happy to work around (it's not that hard to work around) to pay just $100 or $200 a month for the industry-leading LLM.

Full disclosure: I've recently signed up for ChatGPT Pro as well in addition to my Claude Max sub so not really biased one way or the other. I just want a quality LLM that's affordable.

View on HN · Topics

I might be willing to pay more, maybe a lot more, for a higher subscription than Claude Max 20x, but the only thing higher is pay-per-token, and I really don't like products that make me be that minutely aware of my usage, especially when it has unpredictability to it. I think there's a reason most telecoms moved away from per-minute or especially per-MB charging. Even per-GB, as they often now offer X GB, and I'm OK with that on a phone but much less so on a computer because of the unpredictability of software update sizes.

Kinda like when restaurants make me pay for ketchup or a takeaway box, i get annoyed, just increase the compiled price.

View on HN · Topics

Token anxiety is real mental overhead.

View on HN · Topics

It doesn’t make sense to pay more for cache warming. Your session for the most part is already persisted. Why would it be reasonable to pay again to continue where you left off at any time in the future?

View on HN · Topics

That is because LLM KV caching is not like caches you are used to (see my other comments, but it's 10s of GB per request and involves internal LLM state that must live on or be moved onto a GPU and much of the cost is in moving all that data around). It cannot be made transparent for the user because the bandwidth costs are too large a fraction of unit economics for Anthropic to absorb, so they have to be surfaced to the user in pricing and usage limits. The alternative is a situation where users whose clients use the cache efficiently end up dramatically subsidizing users who use it inefficiently, and I don't think that's a good solution at all. I'd much rather this be surfaced to users as it is with all commercial LLM apis.

View on HN · Topics

Well sure if you put it that way, they're similar. But it's either you don't see it and you get surprised by increased quota usage, or you do see it and you know what it means. Bonus points if they let you turn it off.

No need to gamify it. It's just UI.

View on HN · Topics

Nit: It doesn’t have to live in GPU memory. The system will use multiple levels of caching and will evict older cached data to CPU RAM or to disk if a request hasn’t recently come in that used that prefix. The problem is, the KV caches are huge (many GB) and so moving them back onto the GPU is expensive: GPU memory bandwidth is the main resource constraint in inference. It’s also slow.

The larger point stands: the cache is expensive. It still saves you money but Anthropic must charge for it.

Edit: there are a lot of comments here where people don't understand LLM prefix caching, aka the KV cache. That's understandable: it is a complex topic and the usual intuitions about caching you might have from e.g. web development don't apply: a single cache blob for a single request is in the 10s of GB at least for a big model, and a lot of the key details turn on the problems of moving it in and out of GPU memory. The contents of the cache are internal model state; it's not your context or prompt or anything like that. Furthermore, this isn't some Anthropic-specific thing; all LLM inference with a stable context prefix will use it because it makes inference faster and cheaper. If you want to read up on this subject, be careful, as a lot of blogs will tell you about the KV cache as it is used within inference for a single request (a critical concept in how LLMs work) but they will gloss over how the KV cache is persisted between requests, which is what we're all talking about here. I would recommend Philip Kiely's new book Inference Engineering for a detailed discussion of that stuff, including the multiple caching levels.
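To get a feel for the "10s of GB" figure, here is a back-of-the-envelope sketch using the usual per-token KV cache size formula (one key and one value vector per layer and per KV head). The model shape below is hypothetical and chosen only for illustration; the real dimensions of any Anthropic model are not public.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """Size of the key/value cache for one sequence.

    Each token stores one key and one value vector (hence the factor 2)
    per layer and per KV head; bytes_per_value=2 corresponds to fp16/bf16.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


# Hypothetical model shape, for illustration only.
size = kv_cache_bytes(tokens=100_000, layers=80, kv_heads=8, head_dim=128)
print(f"{size / 1e9:.1f} GB")  # 32.8 GB
```

Even with grouped KV heads, a 100k-token context in this made-up configuration lands in the tens of gigabytes, which is why moving it between storage tiers and GPU memory is not free.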

View on HN · Topics

I too would far rather bear a token cost than have my sessions rot silently beneath my feet. I usually have ~5 running CC sessions, some of which I may leave for a week or two of inactivity at a time.

View on HN · Topics

> How do you do "due diligence" on an API that frequently makes undocumented changes and only publishes acknowledgement of change after users complain?

1. Compute scaling with the length of the sequence is applicable to transformer models in general, i.e. every frontier LLM since ChatGPT's initial release.

2. As undocumented changes happen frequently, users should be even more incentivized to at least try to have a basic understanding of the product's cost structure.

> You're also talking about internal technical implementations of a chat bot. 99.99% of users won't even understand the words that are being used.

I think "internal technical implementation" is a stretch. Users don't need to know what a "transformer" is to understand the trade-off. It's not trivial but it's not something incomprehensible to laypersons.

View on HN · Topics

That might be an absurd comparison, but we can fix that.

If you were being charged per character, or running down character limits, and printing on printers that were shared and had economic costs for stalled and started print runs, then:

You wouldn’t “need” to understand. The prints would complete regardless. But you might want to. Personal preference.

Which is true of this issue too.

View on HN · Topics

I do think having some insight into the current state of the cache and a realistic estimate for prompt token use is something we should demand.

View on HN · Topics

It seems you haven't done the due diligence on what part of the API is expensive - constructing a prompt shouldn't be the same charge/cost as an LLM pass.

View on HN · Topics

I said "prompting with the entire context every time," I think it should be clear even to laypersons that the "prompting" cost refers to what the model provider charges you when you send them a prompt.

View on HN · Topics

What they mean when they say 'cached' is that it is loaded into the GPU memory on Anthropic's servers.

You already have the data on your own machine, and that 'upload and restore' process is exactly what is happening when you restart an idle session. The issue is that it takes time, and it counts as token usage because you have to send the data for the GPU to load, and that data is the 'tokens'.

View on HN · Topics

> upload and restore it when the user starts their next interaction

The data is the conversation (along with the thinking tokens).

There is no download - you already have it.

The issue is that it gets expunged from the (very expensive, very limited) GPU cache and to reload the cache you have to reprocess the whole conversation.

That is doable, but as Boris notes it costs lots of tokens.

View on HN · Topics

I often see a local model QWEN3.5-Coder-Next grow to about 5 GB or so over the course of a session using llamacpp-server. I'd bet these trillion-parameter models are even worse. Even if you wanted to download it or offload it or offered that as a service, to start back up again, you'd _still_ be paying the token cost because all of that context _is_ the tokens you've just done.

The cache is what makes your journey from 1k prompt to 1million token solution speedy in one 'vibe' session. Loading that again will cost the entire journey.

View on HN · Topics

He was surprised because it was not clearly communicated. There's a lot of theory behind a product that you could (or could not) better understand, but in the end, something like price doesn't have much to do with the theoretical and practical behavior of the actual application.

View on HN · Topics

Is there a way to say: I am happy to pay a premium (in tokens or extra usage) to make sure that my resumed 1h+ session has all the old thinking?

I understand you wouldn't want this to be the default, particularly for people who have one giant running session for many topics - and I can only imagine the load involved in full cache misses at scale. But there are other use cases where this thinking is critical - for instance, a session for a large refactor or a devops/operations use case consolidating numerous issue reports and external findings over time, where the periodic thinking was actually critical to how the session evolved.

For example, if N-4 was a massive dump of some relevant, some irrelevant material (say, investigating for patterns in a massive set of data, but prompted to be concise in output), then N-4's thinking might have been critical to N-2 not getting over-fixated on that dump from N-4. I'd consider it mission-critical, and pay a premium, when resuming an N some hours later to avoid pitfalls just as N-2 avoided those pitfalls.

Could we have an "ultraresume" that, similar to ultrathink, would let a user indicate they want to watch Return of the (Thin)king: Extended Edition?

View on HN · Topics

Pointing at their terms of service will definitely be the instantly summoned defense (as it would be for most modern companies), but the fact that SaaS can so suddenly shift the quality of product being delivered for their subscription without clear notification or explicit re-enrollment is definitely a legal oversight right now, and Italy actually did recently clamp down on Netflix doing this[1]. It's hard to define what user expectations of a continuous product are and how companies may have violated them - and for a long time social constructs kept this pretty in check. As obviously inactive and forgotten-about subscriptions have become a more significant revenue source for services, that agreement has been eroded, though, and the legal system has yet to catch up.

1. Specifically, this suit was about price increases without clear consideration for both parties - but the same justifications apply to service restrictions without corresponding price decreases.

https://fortune.com/2026/04/20/italian-court-netflix-refunds...

View on HN · Topics

The trace goes back fine, that's not the issue.

The issue is that if they send the full trace back, it will have to be processed from the start if the cache expired, and doing that will cause a huge one-time hit against your token limit if the session has grown large.

So what Boris talked about is stripping things out of the trace that goes back to regenerate the session if the cache expires. Doing this would help avert burning up the token limit, but it is technically a different conversation, so if CC chooses poorly on stripping parts of the context then it would lead to Claude getting all scatter-brained.

View on HN · Topics

>and doing that will cause a huge one-time hit against your token limit if the session has grown large.

Anthropic already profited from generating those tokens. They can afford to subsidize reloading context.

View on HN · Topics

They are sending it back to the cache, the part you are missing is they were charging you for it.

View on HN · Topics

The blog post says they prune them now not to charge you. That’s the change they implemented.

View on HN · Topics

right. they were charging you for it, now they aren't because they are just dropping your conversation history.

View on HN · Topics

> There would be nothing lost if they said „If you click yes, we will prune your old thinking making Claude faster and saving you tons of tokens“. Most people would say yes probably so why not ask them

The irony is that Claude Design does this. I did a big test building a design system, and when I came back to it, it had in the chat window "Do you need all this history for your next block of work? Save 120K tokens and start a new chat. Claude will still be able to use the design system." Or words to that effect.

View on HN · Topics

Don't you have that by just resuming old convo?

The only issue is that it didn't hit the cache so it was expensive if you resume later.

View on HN · Topics

As some others have mentioned.

I think the best option would be tell a user who is about to resurrect a conversation that has been evicted from cache that the session is not cached anymore and the user will have to face a full cost of replaying a session, not only the incremental question and answer.

(I understand that under the hood LLMs are n^2 by default, but it's very counterintuitive - and given how popular CC is becoming outside of nerd circles, probably a smaller and smaller fraction of users is aware of it.)

I would like to decide on it case by case. Sometimes the session has some really deep insight I want to preserve, sometimes it's discardable.
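The "n^2 by default" remark can be made concrete by counting attention query-key pairs under causal masking. This is only a sketch of the compute asymmetry between a warm and a cold resume: it ignores every other cost in the model, says nothing about how tokens are billed, and the function name is made up for illustration.

```python
def attention_pairs(new_tokens, cached_tokens=0):
    """Count query-key pairs scored when processing `new_tokens` tokens
    that follow `cached_tokens` tokens whose keys/values are already cached.

    With causal attention, each token attends to itself and to every
    earlier token.
    """
    total = cached_tokens + new_tokens
    # Pairs for the full sequence minus pairs already covered by the cache.
    return total * (total + 1) // 2 - cached_tokens * (cached_tokens + 1) // 2


warm = attention_pairs(new_tokens=1_000, cached_tokens=899_000)
cold = attention_pairs(new_tokens=900_000)
print(warm)  # 899500500
print(cold)  # 405000450000
```

For the same 900k-token conversation, the cold resume scores roughly 450 times as many pairs as the warm one, which is why replaying an evicted session is so much more expensive than an incremental question and answer.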

View on HN · Topics

I got exactly this warning message yesterday, saying that it could use up a significant amount of my token budget if I resumed the conversation without compaction.

View on HN · Topics

> I think the best option would be tell a user who is about to resurrect a conversation that has been evicted from cache that the session is not cached anymore and the user will have to face a full cost of replaying a session

This feature has been live for a few days/weeks now, and with that knowledge I try to remember to at least get a process report written when I'm, for example, close to the quota limit and the context is reasonably large. Or continue with a /compact, but that tends to lead to me having to repeat some things that didn't get included in the summary. Context management is just hard.

View on HN · Topics

Right, and reloading that context is the same cost as refilling the cache, so really, they're charging the same, and making it hard.

View on HN · Topics

This points to a fairly fundamental mismatch between the realities of running an LLM and the expectations of users. As a user, I _expect_ the cost of resuming X hours/days later to be no different to resuming seconds or minutes later. The fact that there is a difference, means it's now being compensated for in fairly awkward ways -- none of the solutions seem good, just varying degrees of bad.

Is there a more fundamental issue of trying to tie something with such nuanced costs to an interaction model which has decades of prior expectation of every message essentially being free?

View on HN · Topics

> As a user, I _expect_ the cost of resuming X hours/days later to be no different to resuming seconds or minutes later.

As an informed user who understands his tools, I of course expect large uncached conversations to massively eat into my token budget, since that's how all of the big LLM providers work. I also understand these providers are businesses trying to make money and they aren't going to hold every conversation in their caches indefinitely.

View on HN · Topics

The cache is stored on Anthropic's servers, since it's a saved state of the LLM's weights at the time of processing. It's several gigs in size. Every SINGLE TIME you send a message and it's a cache miss, you have to reprocess the entire message again, eating up tons of tokens in the process.

View on HN · Topics

I leave sessions idle for hours constantly - that's my primary workflow. If resuming a 900k context session eats my rate limit, fine, show me the cost and let me decide whether to /clear or push through. You already show a banner suggesting /clear at high context - just do the same thing here instead of silently lobotomizing the model.

View on HN · Topics

> Adding an in-product tip to recommend running /clear when re-visiting old conversations (we shipped a few iterations of this)

I feel like I'm missing something here. Why would I revisit an old conversation only to clear it?

To me it sounds like a prompt-cache miss for a big context absolutely needs to be a per-instance warning and confirmation. Or even better a live status indicating what sending a message will cost you in terms of input tokens.

View on HN · Topics

I'm also a Claude Code user from day 1 here, back from when it wasn't included in the Pro/Max subscriptions yet, and I was absolutely not aware of this either. Your explanation makes sense, but I naively was also under the impression that re-using older existing conversations that I had open would just continue the conversation as is and not be treated as a full cache miss.

My biggest learning here is the 1 hour cache window. I often have multiple Claudes open and it happens frequently that they're idle for 1+ hours.

This cache information should probably get displayed somewhere within Claude Code
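One way such a display could work is a simple countdown from the last request. The sketch below assumes a fixed one-hour lifetime measured from the most recent request; both the constant and the `cache_status` helper are hypothetical and do not reflect how Claude Code tracks cache state.

```python
from datetime import datetime, timedelta, timezone

CACHE_TTL = timedelta(hours=1)  # assumption: the 1 hour window from the thread


def cache_status(last_request_at, now, ttl=CACHE_TTL):
    """Return a short status string for a session's prompt cache."""
    remaining = ttl - (now - last_request_at)
    if remaining <= timedelta(0):
        return "cache expired: next message reprocesses the full context"
    minutes, seconds = divmod(int(remaining.total_seconds()), 60)
    return f"cache warm: expires in {minutes:02d}:{seconds:02d}"


start = datetime(2026, 1, 1, 12, 0, tzinfo=timezone.utc)
print(cache_status(start, start + timedelta(minutes=48, seconds=30)))
# cache warm: expires in 11:30
print(cache_status(start, start + timedelta(hours=2)))
# cache expired: next message reprocesses the full context
```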

View on HN · Topics

Yep, agree. We added a little "/clear to save XXX tokens" notice in the bottom right, and will keep iterating on this. Thanks for being an early user!

View on HN · Topics

But.. that doesn't solve the problem of having no indication in-session when it'll lose the cache. A nudge to /clear does nothing to indicate "or else face significant cost" nor does it indicate "your cache is stale".

Love the product. <3

View on HN · Topics

I assume they are already storing the cache on flash storage instead of keeping it all in VRAM. KV caches are huge - that’s why it’s impractical to transfer to/from the client. It would also allow figuring out a lot about the underlying model, though I guess you could encrypt it.

What would be an interesting option would be to let the user pay more for longer caching, but if the base length is 1 hour I assume that would become expensive very quickly.

View on HN · Topics

That is understandable, but the issue is the sudden drop in quality and the silent surge in token usage.

It also seems like the warning should be in channel and not on X. If I wanted to find out how broken things are on X, I'd be a Grok user.

View on HN · Topics

> that would be >900k tokens written to cache all at once

Probably that's why I hit my weekly limits 3-4 days ago, and was scheduled to reset later today. I just checked, and they are already reset.

Not sure if it's already done, but shouldn't there be a check somewhere that alerts when an outrageous number of tokens is being written, since that suggests something is not right?

View on HN · Topics

For idle sessions I would MUCH rather pay the cost in tokens than reduced quality. Frankly, it's shocking to me that you would make that trade-off for users without their knowledge or consent.

View on HN · Topics

what about selling long term cache space to users?

or even, let the user control the cache expiry on a per request basis. with a /cache command

that way they decide if they want to drop the cache right away, or extend it for 20 hours etc

it would cost tokens even if the underlying resource is memory/SSD space, not compute

View on HN · Topics

I actually have a suggestion here - do not hide token count in non-verbose mode in Claude Code.

View on HN · Topics

Sorry but I think this should be left up to the user to decide how it works and how they want to burn their tokens. Also a countdown timer is better than all of these other options you mention.

View on HN · Topics

as a variation:

how does this help me as a customer? if i have to redo the context from scratch, i will pay both the high token cost again, but also pay my own time to fill it.

the cost of reloading the window didnt go away, it just went up even more

View on HN · Topics

> tokens written to cache all at once, which would eat up a significant % of your rate limits

Construction of context is not an llm pass - it shouldn't even count towards token usage. The word 'caching' itself says don't recompute me.

Since the devs on HN (& the whole world) are buying what looks like nonsense to me - what am I missing?

View on HN · Topics

Bit surprised about the amount of flak they're getting here. I found the article seemed clear, honest and definitely plausible.

The deterioration was real and annoying, and shines a light on the problematic lack of transparency of what exactly is going on behind the scenes and the somewhat arbitrary token-cost based billing - too many factors at play; if you wanted to trace that as a user, you might as well just do the work yourself instead.

The fact that waiting for a long time before resuming a convo incurs additional cost and lag seemed clear to me from having worked with LLM APIs directly, but it might be important to make this more obvious in the TUI.

View on HN · Topics

Extra high burns tokens, I find. I run 5.4 on medium for 90% of the tasks, and high if I see medium struggling; it's very focused and makes minimal changes.

View on HN · Topics

Note mini-high is similar perf/latency to medium, but much cheaper

View on HN · Topics

Rework burns tokens.

View on HN · Topics

Not a problem if they're offering unlimited, lol

View on HN · Topics

I see that with openai too, lots of responding to itself. Seems like a convenient way for them to churn tokens.

View on HN · Topics

Sure it is. They're well aware their product is a money furnace and they'd have to charge users a few orders of magnitude more just to break even, which is obviously not an option. So all that's left is.. convince users to burn tokens harder, so graphs go up, so they can bamboozle more investors into keeping the ship afloat for a bit longer.

View on HN · Topics

If this claim is true (inference is priced below cost), it makes little sense that there are tens of small inference providers on OpenRouter. Where are they getting their investor money? Is the bubble that big?

Incidentally, the hardware they run is known as well. The claim should be easy to check.

View on HN · Topics

To be clear, I'm talking about subscription pricing. API pricing for Anthropic is probably at-cost.

I dare you to run CC on API pricing and see how much your usage actually costs.

(We did this internally at work, that's where my "few orders of magnitude" comment above comes from)

View on HN · Topics

It's an option and they are going to do it. Chinese models will be banned and the labs will happily go dollar for dollar in plan price increases. $20 plans won't go away, but usage limits and model access will drive people to $40-$60-$80 plans.

At cell phone plan adoption levels, and cell phone plan costs, the labs are looking at 5-10yr ROI.

View on HN · Topics

That doesn’t mean they also can’t be wasteful. Fact is, Claude and GPT do far more internal thinking about their system prompts than is needed. At every step they mention something about making sure they do xyz and don't do whatever. Why does it need to say things to itself like “great I have a plan now!” - that’s pure waste.

View on HN · Topics

No, the argument is they want to sell more product to more people, not just more product (to the same people). Given that a lot of their income is from flat-rate subscriptions, they make money with more people burning tokens rather than just burning more tokens.

After all, "the first hit's free" model doesn't apply to repeat customers ;-)

View on HN · Topics

You don’t have to use compute to pad the token count.

View on HN · Topics

This, so much this!

Pay-per-token while token usage is totally opaque is a super convenient money-printing machine.

View on HN · Topics

My hypothesis is that some of this is a perceived quality drop due to "luck of the draw" when it comes to the non-deterministic nature of LLM output.

A couple weeks ago, I wanted Claude to write a low-stakes personal productivity app for me. I wrote an essay describing how I wanted it to behave and I told Claude pretty much, "Write an implementation plan for this." The first iteration was _beautiful_ and was everything I had hoped for, except for a part that went in a different direction than I was intending because I was too ambiguous in how to go about it.

I corrected that ambiguity in my essay but instead of having Claude fix the existing implementation plan, I redid it from scratch in a new chat because I wanted to see if it would write more or less the same thing as before. It did not--in fact, the output was FAR worse even though I didn't change any model settings. The next two burned down, fell over, and then sank into the swamp but the fourth one was (finally) very much on par with the first.

I'm taking from this that it's often okay (and probably good) to simply have Claude re-do tasks to get a higher-quality output. Of course, if you're paying for your own tokens, that might get expensive in a hurry...

View on HN · Topics

I also think some of this stems from the default 1m context window. Performance starts to degrade when context size increases, and each token over (I think the level is) 400k counts more towards your usage limit. Defaulting to 1m context size, if people aren't carefully managing context (which they shouldn't ever have to in an ideal world), they would notice somewhat degraded performance and increased token usage regardless.

View on HN · Topics

So will we have to do what image generation people have been doing for ages: generate 50 versions of output for the prompt, then pick the best manually? Anthropic must be licking its figurative chops hearing this.

View on HN · Topics

“after evals and dogfooding” couldn’t have done this before releasing the model? We are paying $200/month to beta test the software for you.

View on HN · Topics

Actually, I think their deeper problems are twofold:

- Claude Code is _vastly_ more wasteful of tokens than anything else I've used. The harness is just plain bad. I use pi.dev and created https://github.com/rcarmo/piclaw , and the gaps are huge -- even the models through Copilot are incredibly context-greedy when compared to GPT/Codex

- 4.7 can be stupidly bad. I went back to 4.6 (which has always been risky to use for anything reliable, but does decent specs and creative code exploration) and Codex/GPT for almost everything.

So there is really no reason these days to pay either their subscription or their insanely high per-token price _and_ get bloat across the board.

View on HN · Topics

I see some Anthropic Claude Code people are reading the comments. A day or two ago I watched a video by Theo t3.gg on whether Claude got dumber. Even though he was really harsh on Anthropic and said some mean stuff, I thought some of the points he raised about Claude Code were quite apt, especially when it comes to the harness bloat. I really hope the new features now stop and there is a real hard push for polish and optimization. Otherwise I think a lot of people will start exploring less bloated, more optimized alternatives. Focus on making the harness better and less token consuming.

https://youtu.be/KFisvc-AMII?is=NskPZ21BAe6eyGTh

View on HN · Topics

Everything else aside, their brief "experiment" with removing CC support from the Pro plan got me seriously considering other options. I've been wary of vendor lock-in the whole time, but it was a useful reminder. (opencode+openrouter will probably be my first port of call)

View on HN · Topics

I'm 3 weeks into switching from CC to OpenCode, and in some ways it is far superior to CC right out of the box, and I've maybe burned $200 in tokens to make a private fork that is my ultimate development and personal agent platform. Totally worth it.

Still use CC at work because team standards, but I'd take my OpenCode stack over it any day.

View on HN · Topics

The best possible situation that I can imagine is that Anthropic just wanted to measure how much value Claude Code has for Pro users and didn't mean to change the plan itself (so those users would get CC as a "bonus"), but that alone is already questionable to start with.

View on HN · Topics

I went with MiniMax. The token plans are over what I currently need, 4500 messages per 5h, 45000 messages per week for 40$. I can run multiple agents and they don't think for 5-10 minutes like Sonnet did. Also I can finally see the thinking process while Anthropic chose to hide it all from me.

I'm using Zed and Claude Code as my harnesses.

View on HN · Topics

A suggestion to Anthropic: just start charging the real price for your software. Of course you have to dumb it down when the $200 tier in reality produces 5-10 thousand dollars in monthly costs when used by people who know how to max it out. So then you come up with creative nonsense like "adaptive thinking" when your tool is sometimes working and sometimes outright not - the irony of "intelligent tools" not "thinking" aside. Of course this would kind of ruin your current value proposition, as charging the actual price would make your core idea of making large swaths of the skilled population unemployed unfeasible, but I am sure if you feed it into Claude, it will find some points for and against, just like how Karpathy uses his LLM of choice to excrete his blog posts.

View on HN · Topics

I presume they don't yet have a cohesive monetization strategy, and this is why there is such huge variability in results on a weekly basis. It appears that Anthropic are skipping from one "experiment" to another. As users we only get to see the visible part (the results). Can't design a UI that indicates the software is thinking vs frozen? Does anyone actually believe that?

View on HN · Topics

They would honestly have been better off refusing customers if compute is so limited. Degrading the quality leads to customers leaving in the short term, and ruins their long term reputation.

But in either case, if compute is so limited, they’ll have to compete with local coding agents. Qwen3.6-27B is good enough to beat having to wait until 5PM for your Claude Code limit to reset.

View on HN · Topics

this is one reason i will not pay for extra usage - it is an incentive for them to be inefficient, or at least to not spend any effort on improving my token usage efficiency.

View on HN · Topics

Even for all of us plan users, who got barely any use from our plans because we'd destroy our 5h and 1w usage limits, it's also unlikely; after all, they have an out in "your usage limits are guaranteed to be 5x those of Pro users" (who are also being screwed).

Of course, all their vibe coding is being done with effectively infinite tokens, so...

View on HN · Topics

> Today we are resetting usage limits for all subscribers.

I asked for this via support, got a horrible corporate reply thread, and eventually downgraded my account. I'm using Codex now as we speak. I could not use Claude any more, I couldn't get anything done.

Will they restore my account usage limits? Since I no longer have Max?

Is that one week usage restored, or the entire buggy timespan?

View on HN · Topics

Last I tried 4.7, it was bad. Like ChatGPT bad: changed stuff it wasn’t supposed to, hallucinated code, forgot information, missed simple things, didn’t catch mistakes. And it burned through tokens like crazy.

I’ll stay on 4.6 for a while. Seems to be better. What’s frustrating, though, is that you cannot rely on these tools. They are constantly tinkering with and changing things, and there’s no way to opt out.

View on HN · Topics

Some of these changes and effects seriously affect my flow. I'm a very interactive Claude user, preferring to provide detailed guidance for my more serious projects instead of just letting them run. And I have multiple projects active at once, with some being untouched for days at a time. Along with the session limits this feels like compounding penalties as I'm hit when I have to wait for session reset (worse in the middle of a long task), when I take time to properly review output and provide detailed feedback, when I'm switching among currently active projects, when I go back to a project after a couple days or so,... This is honestly starting to feel untenable.

View on HN · Topics

My biggest problem with CC as a harness is that I can't trust "Plan" mode. Long running sessions frequently start bypassing plan mode and executing, updating files and stuff, without permission, while still in plan mode. And the only recovery seems to be to quit and reload CC.

Right now my solution is to run CC in tmux and keep a 2nd CC pane with /loop watching the first pane and killing CC if it detects plan mode being bypassed. Burning tokens to work around a bug.

View on HN · Topics

I think you're being a bit harsh.

... But then again, many of us are paying out of pocket $100, $200USD a month.

Far more than any other development tools.

Services that cost that much money generally come with expectations.

View on HN · Topics

Yeah you don't have to convince me. I switched to Codex mid-January in part because of the dubious quality of the tui itself and the unreliability of the model. Briefly switched back through March, and yep, still a mistake.

Once OpenAI added the $100 plan, it was kind of a no-brainer.

View on HN · Topics

Nothing you wrote makes sense. The limits are so Anthropic isn't operating at a loss. If they can customize Claude using Code, I see no reason why they couldn't do so with other wrappers. Other wrappers can also make use of cache.

If you worry about "degraded" experience, then let people choose. People won't be using other wrappers if they turn out to be bad. People ain't stupid.

View on HN · Topics

Given the price I don't really think they're the best option. They're sloppy and competitors are catching up. I'm getting the same results with other models, and very close results with Kimi, which is waaay cheaper.

View on HN · Topics

Right, a very simple UI thing that they should have, and that would have prevented so much misunderstanding, is a simple counter: how much usage have I used and how much is left.

If a message will trigger a cache re-creation, the cost of that should be visible.
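A rough sketch of the kind of indicator being asked for here, assuming the one-hour idle threshold described earlier in the thread. The names `SessionState`, `estimate_next_message`, and `format_usage` are hypothetical and not part of Claude Code, and how cached versus uncached tokens are actually weighted against usage limits is not modeled.

```python
from dataclasses import dataclass

# Assumption: the thread describes idling for more than 1 hour as a full cache miss.
CACHE_TTL_SECONDS = 60 * 60


@dataclass
class SessionState:
    context_tokens: int      # tokens already in the conversation
    new_message_tokens: int  # tokens in the message about to be sent
    idle_seconds: float      # time since the last request in this session


def estimate_next_message(state: SessionState, ttl: float = CACHE_TTL_SECONDS) -> dict:
    """Estimate how many tokens the next message would process uncached."""
    cache_miss = state.idle_seconds > ttl
    if cache_miss:
        # Whole conversation has to be written to cache again.
        uncached = state.context_tokens + state.new_message_tokens
        cached = 0
    else:
        # Only the newest message misses the cache.
        uncached = state.new_message_tokens
        cached = state.context_tokens
    return {"cache_miss": cache_miss, "uncached_tokens": uncached, "cached_tokens": cached}


def format_usage(used: float, limit: float) -> str:
    """Render a simple 'used / left' counter as percentages."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    used_pct = min(100, max(0, round(100 * used / limit)))
    return f"{used_pct}% used, {100 - used_pct}% left"


if __name__ == "__main__":
    stale = SessionState(context_tokens=900_000, new_message_tokens=200, idle_seconds=2 * 60 * 60)
    print(estimate_next_message(stale))
    print(format_usage(37, 100))
```

Showing the estimate before the message is sent would let a user decide whether to resume the stale session or start a fresh one.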

View on HN · Topics

Just as a note to CC fans/users here since I had an opportunity to do so... I tested resuming a session that was stale at 950k tokens after returning from a full day or so of being idle, thus a fully empty quota/session.

Resuming it cost 5% of the current session and 1% of the weekly session on a max subscription.

View on HN · Topics

I’ve stuck to the non-1M context Opus 4.6 and it works really well for me, even with ongoing context compression. I honestly couldn’t deal with the 1M context change and then the compounding token-devouring nonsense of 4.7.
I sincerely hope Anthropic is seeing all of this and taking note. They have their work cut out for them.

View on HN · Topics

It’s incredible how forgiving you guys are with Anthropic and their errors. Especially considering you pay a high price for their service and receive lower quality than expected.

View on HN · Topics

I pay for 20x max and get so much more value out of it than I pay.

View on HN · Topics

Anthropic actually not so bad. Anthropic models code good, usually. Price not so high compared to time to do it by self.

View on HN · Topics

What high price? I pay $200/m for an insane number of tokens.

View on HN · Topics

A lot of people are provided their access through work.

They don't actually pay the bill or see it.

View on HN · Topics

If anthropic is doing this as a result of "optimizations" they need to stop doing that and raise the price.
The other thing, there should be a way to test a model and validate that the model is answering exactly the same each time.
I have experienced twice... when a new model is going to come out... the quality of the top dog one starts going down... and bam.. the new model is so good.... like the previous one 3 months ago.

The other thing, when anthropic turns on lazy claude... (I want to coin here the term Claudez for the version of claude that's lazy.. Claude zzZZzz = Claudez) that thing is terrible... you ask the model for something... and it's like... oh yes, that will probably depend on memory bandwidth... do you want me to search that?...

YES... DO IT... FRICKING MACHINE..
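One way the "validate that the model is answering the same each time" idea could be sketched is a fixed prompt set whose answers are fingerprinted and compared against a stored baseline. This assumes the model can be called in a way that yields repeatable output, which may not hold in practice; `ask` is a hypothetical caller-supplied function, not a real client API.

```python
import hashlib
from typing import Callable, Dict, List


def fingerprint(text: str) -> str:
    """Hash an answer after collapsing whitespace, so trivial spacing changes are ignored."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def record_baseline(prompts: List[str], ask: Callable[[str], str]) -> Dict[str, str]:
    """Run every prompt once and store the fingerprint of each answer."""
    return {prompt: fingerprint(ask(prompt)) for prompt in prompts}


def find_drift(baseline: Dict[str, str], ask: Callable[[str], str]) -> List[str]:
    """Return the prompts whose current answer no longer matches the baseline."""
    return [prompt for prompt, expected in baseline.items()
            if fingerprint(ask(prompt)) != expected]


if __name__ == "__main__":
    # Stand-in for a model call; hypothetical, for illustration only.
    def fake_model(prompt: str) -> str:
        return prompt.upper()

    baseline = record_baseline(["what is 2+2?", "name a prime"], fake_model)
    print(find_drift(baseline, fake_model))
```

Exact-match fingerprints are deliberately strict; a looser comparison (for example scoring answers against expected facts) would be needed if outputs legitimately vary between runs.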

View on HN · Topics

It's incredibly frustrating when I've spelled out in CLAUDE.md that it should SSH to my dev server to investigate things I ask it to and it regularly stops working with a message of something like:

> Next steps are to run `cat /path/to/file` to see what the contents are

Makes me want to pull my hair out. I've specifically told you to go do all the read-only operations you want out on this dev server yet it keeps forgetting and asking me to do something it can do just fine (proven by it doing it after I "remind" it).

That and "Auto" mode really are grinding my gears recently. Now, after a Planning session my only option is to use Auto mode and I have to manually change it back to "Dangerously skip permissions". I think these are related since the times I've let it run on "Auto" mode are when it gives up/gets stuck more often.

Just the other day it was in Auto mode (by accident) and I told it:

> SSH out to this dev server, run `service my_service_name restart` and make sure there are no orphans (I was working on a new service and the start/stop scripts). If there are orphans, clean them up, make more changes to the start/stop scripts, and try again.

And it got stuck in some loop/dead-end, telling me I should do it myself and that it didn't want to run commands on a "Shared Dev server" (even though I had specifically told it that this was not a shared server).

The fact that Auto mode burns more tokens _and_ is so dumb is really a kick in the pants.

View on HN · Topics

Apart from Anthropic nobody knows how much the average user costs them. However the consensus is "much more than that".

If they have to raise prices to stop hemorrhaging money, would you be willing to pay 1000 bucks a month for a max plan? Or 100$ per 1M output tokens (playing numberWang here, but the point stands).

If I had to guess, they are trying to get their balance sheet in order for an IPO and they basically have 3 ways of achieving that:

1. Raising prices like you said, but the user drop could be catastrophic for the IPO itself and so they won't do that

2. Dumb the models down (basically decreasing their cost per token)

3. Send fewer tokens (i.e. capping thinking budgets aggressively).

2 and 3 are palatable because, even if they annoy the technical crowd, investors still see a big number of active users with a positive margin for each.

View on HN · Topics

$1000/mo for guaranteed functionality >= Opus 4.6 at its peak? Yes, I'd probably grumble a bit and then whip out the credit card.

I'm not a heavy LLM user, and I've never come anywhere near the limits of the $200/month plan I'm already subscribed to. But when I do use it, I want the smartest, most relentless model available, operating at the highest performance level possible.

Charge what it takes to deliver that, and I'll probably pay it. But you can damned well run your A/B tests on somebody else.

View on HN · Topics

Whatever they did, with the max plan, my daily usage quota was consumed in less than 10 minutes. Weird, let's hope they fix the usage now.

View on HN · Topics

Please for the love of god just put the max price plan up like 4x or 5x in cost and make it actually work.

View on HN · Topics

I think that would also have busted cache all the time, and uncached requests consume usage limits rapidly.

View on HN · Topics

I genuinely don't understand what they have been trying to achieve. All of these incremental "improvements" have ... not improved anything, and have had the opposite effect.

My trust is gone. When day-to-day updates do nothing but cause hundreds of dollars in lost $$$ tokens and the response is "we ... sorta messed up but just a little bit here and there and it added up to a big mess up" bro get fuckin real.
