Token Usage Concerns

Frustration over 35% tokenizer overhead, unclear subscription limits, rapid token consumption, confusing billing multipliers, and lack of transparency about what usage is actually included in subscription tiers

Users are increasingly frustrated by a "calorie counting" experience where Claude’s recent tokenizer overhead feels like a shadow price hike, with many reporting that their entire five-hour session limits are now exhausted in mere minutes. This lack of transparency has sparked a wave of "self-defense" tactics, ranging from users adopting "caveman" dialects to compress prompts to others abandoning the platform for competitors like Codex that offer more predictable usage. Beyond the technical friction, there is a deep-seated cynicism regarding corporate incentives; critics argue that providers may be training models to be intentionally wordy to maximize revenue at the expense of user efficiency. Ultimately, the community feels caught in a "black box" of confusing multipliers and shifting defaults that prioritize infrastructure costs over the practical needs of power users.

View on HN · Topics

I keep saying even if there's not current malfeasance, the incentives being set up where the model ultimately determines the token use which determines the model provider's revenue will absolutely overcome any safeguards or good intentions given long enough.

View on HN · Topics

'Hey Claude, these tokens are utter unrelated bollocks, but obviously we still want to charge the user for them regardless. Please construct a plausible explanation as to why we should still be able to do that.'

View on HN · Topics

> Opus 4.7 uses an updated tokenizer that improves how the model processes text. The tradeoff is that the same input can map to more tokens—roughly 1.0–1.35× depending on the content type.

caveman[0] is becoming more relevant by the day. I already enjoy reading its output more than vanilla so suits me well.

[0] https://github.com/JuliusBrussee/caveman/tree/main

View on HN · Topics

I hope people realize that tools like caveman are mostly joke/prank projects - almost the entirety of the context spent is in file reads (for input) and reasoning (in output), you will barely save even 1% with such a tool, and might actually confuse the model more or have it reason for more tokens because it'll have to formulate its respone in the way that satisfies the requirements.

View on HN · Topics

Who would suspect that the companies selling 'tokens' would (unintentionally) train their models to prefer longer answers, reaping a HIGHER ROI (the thing a publicly traded company is legally required to pursue: good thing these are all still private...)... because it's not like private companies want to make money...

View on HN · Topics

I don’t think this is a plausible argument, as they’re generally capacity constrained, and everyone would like shorter (= faster) responses.

I’m fairly certain that in a few more releases we’ll have models with shorter CoT chains. Whether they’ll still let us see those is another question, as it seems like Anthropic wants to start hiding their CoT, potentially because it reveals some secret sauce.

View on HN · Topics

Some labs do it internally because RLVR is very token-expensive. But it degrades CoT readability even more than normal RL pressure does.

It isn't free either - by default, models learn to offload some of their internal computation into the "filler" tokens. So reducing raw token count always cuts into reasoning capacity somewhat. Getting closer to "compute optimal" while reducing token use isn't an easy task.

View on HN · Topics

Yeah the readability suffers, but as long as the actual output (ie the non-CoT part) stays unaffected it’s reasonably fine.

I work on a few agentic open source tools and the interesting thing is that once I implemented these things, the overall feedback was a performance improvement rather than performance reduction, as the LLM would spend much less time on generating tokens.

I didn’t implement it fully, just a few basic things like “reduce prose while thinking, don’t repeat your thoughts” etc would already yield massive improvements.

View on HN · Topics

Exactly. The model is exquisitely sensitive to language. The idea that you would encourage it to think like a caveman to save a few tokens is hilarious but extremely counter-productive if you care about the quality of its reasoning.

View on HN · Topics

This specific form may be a joke, but token conscious work is becoming more and more relevant..
Look at
https://github.com/AgusRdz/chop

And

https://github.com/toon-format/toon

View on HN · Topics

I hesitated 100% when i saw caveman gaining steam, changing something like this absolutely changes the behaviour of the models responses, simply including like a "lmao" or something casual in any reply will change the tone entirely into a more relaxed style like ya whatever type mode.

I think a lot of people echo my same criticism, I would assume that the major LLM providers are the actual winners of that repo getting popular as well, for the same reason you stated.

> you will barely save even 1% with such a tool

For the end user, this doesnt make a huge impact, in fact it potentially hurts if it means that you are getting less serious replies from the model itself. However as with any minor change across a ton of users, this is significant savings for the providers.

I still think just keeping the model capable of easily finding what it needs without having to comb through a lot of files for no reason, is the best current method to save tokens. it takes some upfront tokens potentially if you are delegating that work to the agent to keep those navigation files up to date, but it pays dividends when future sessions your context window is smaller and only the proper portions of the project need to be loaded into that window.

View on HN · Topics

Output tokens are more expensive

View on HN · Topics

Help me understand: I get that the file reading can be a lot. But I also expand the box to see its “reasoning” and there’s a ton of natural language going on there.

View on HN · Topics

They are indeed impractical in agentic coding.

However in deep research-like products you can have a pass with LLM to compress web page text into caveman speak, thus hugely compressing tokens.

View on HN · Topics

Good catch actually.

Okay maybe not exactly caveman dialect, but text compression using LLM is definitely possible to save on tokens in deep research.

View on HN · Topics

Caveman is fun, but the real tool you want to reduce token usage is headroom

https://github.com/gglucass/headroom-desktop (mac app)

https://github.com/chopratejas/headroom (cli)

View on HN · Topics

Headroom looks great for client-side trimming. If you want to tackle this at the infrastructure level, we built Edgee ( https://www.edgee.ai ) as an AI Gateway that handles context compression, caching, and token budgeting across requests, so you're not relying on each client to do the right thing.

(I work at Edgee, so biased, but happy to answer questions.)

View on HN · Topics

I tried to use rtk for the same, and my agent session would just loop the same tool call over and over again. Does headroom work better?

View on HN · Topics

I was doing some experiments with removing top 100-1000 most common English words from my prompts. My hypothesis was that common words are effectively noise to agents. Based on the first few trials I attempted, there was no discernible difference in output. Would love to compare results with caveman.

Caveat: I didn’t do enough testing to find the edge cases (eg, negation).

View on HN · Topics

I literally just posted a blog on this. Some seemingly insignificant words are actually highly structural to the model. https://www.ruairidh.dev/blog/compressing-prompts-with-an-au...

View on HN · Topics

Doesn't it just use more tokens in reasoning?

View on HN · Topics

Oh wow, I love this idea even if it's relatively insignificant in savings.

I am finding my writing prompt style is naturally getting lazier, shorter, and more caveman just like this too. If I was honest, it has made writing emails harder.

While messing around, I did a concept of this with HTML to preserve tokens, worked surprisingly well but was only an experiment. Something like:

> <h1 class="bg-red-500 text-green-300"><span>Hello</span></h1>

AI compressed to:

> h1 c bgrd5 tg3 sp hello sp h1

Or something like that.

View on HN · Topics

Interesting, it doesn't seem intuitive at all to me.

My (wrong?) understanding was that there was a positive correlation between how "good" a tokenizer is in terms of compression and the downstream model performance. Guess not.

View on HN · Topics

To reduce token count on command outputs you can also use RTK [0]

[0]: https://github.com/rtk-ai/rtk

View on HN · Topics

I find grep and common cli command spam to be the primary issue. I enjoy Rust Token Killer https://github.com/rtk-ai/rtk , and agents know how to get around it when it truncates too hard.

View on HN · Topics

caveman stops being a style tool and starts being self-defense. once prompt comes in up to 1.35x fatter, they've basically moved visibility and control entirely into their black box.

View on HN · Topics

1.35 times! For Input!
For what kinds of tokens precisely? Programming? Unicode? If they seriously increased token usage by 35% for typical tasks this is gonna be rough.

View on HN · Topics

Too late, personally after how bad 4.6 was the past week I was pushed to codex, which seems to mostly work at the same level from day to day. Just last night I was trying to get 4.6 to lookup how to do some simple tensor parallel work, and the agent used 0 web fetches and just hallucinated 17K very wrong tokens. Then the main agent decided to pretend to implement tp, and just copied the entire model to each node...

View on HN · Topics

they've also introduced a lot of caching and token burn related bugs which makes things worse. any bug that multiplies the token burn also multiplies their infrastructure problems.

View on HN · Topics

You can believe whatever you want. I found claude unusable due to limits. Codex works very well for my use cases.

View on HN · Topics

It works better, until you run out of tokens. Running out of tokens is something that used to never happen to me, but this month now regularly happens.

Maybe I could avoid running out of tokens by turning off 1M tokens and max effort, but that's a cure worse than the disease IMO.

View on HN · Topics

I would risk a guess that people have a wrong intuition about the long-context pricing and are complaining because of that.

Yeah, the per-token price stays the same, even with large context. But that still means that you're spending 4x more cache-read tokens in a 400k context conversation, on each turn, than you would be in a 100k context conversation.

View on HN · Topics

Personally I find using and managing Claude sessions and limits is getting exhausting and feels similar to calorie counting. You think you are going to have an amazing low calories meal only to realize the meal is full of processed sugars and you overshot the limit within 2-3 bites. Now "you have exhausted your limit for this time. Your session limits resets in next 4 hrs".

View on HN · Topics

Yep, it just feels terrible, the usage bars give me anxiety, and I think that's in their interest as they definitely push me towards paying for higher limits. Won't do that, though.

View on HN · Topics

so even with a new tokenizer that can map to more tokens than before, their answer is still just "you're not managing your context well enough"

"Opus 4.7 uses an updated tokenizer that [...] can map to more tokens—roughly 1.0–1.35× depending on the content type.

[...]

Users can control token usage in various ways: by using the effort parameter, adjusting their task budgets, or prompting the model to be more concise."

View on HN · Topics

Same for me.

I cancelled my subscription and will be moving to Codex for the time being.

Tokens are way too opaque and Claude was way smarter for my work a couple of months ago.

View on HN · Topics

Anecdotally, codex has been burning through way more tokens for me lately. Claude seems to just sit and spin for a long time doing nothing, but at least token use is moderate.

All options are starting to suck more and more

View on HN · Topics

More like 2 hours considering these usage limits

View on HN · Topics

Perhaps on the 10x plan.

It went through my $20 plan's session limit in 15 minutes, implementing two smallish features in an iOS app.

That was with the effort on auto.

It looks like full time work would require the 20x plan.

View on HN · Topics

No, I am happy with the results.

For a first test, it did seem like it burned through the usage even faster than usual.

GitHub Copilot’s 7.5x billing factor over 3x with Opus 4.6 seems to suggest it indeed consumes more tokens.

Now I’m just waiting for OpenAI to show their hand before deciding which of the plans to upgrade from the $20 to the $100 plan.

View on HN · Topics

What a waste of tokens. No wonder Anthropic can't serve their customers. It's not just a lack of compute, it's a ridiculous waste of the limited compute they have. I think (hope?) we look back at the insanity of all this theatre, the same way we do about GPT-2 [1].

1. https://techcrunch.com/2019/02/17/openai-text-generator-dang...

View on HN · Topics

i think you are being needlessly paranoid here

openai doest offer affiliate marketing links

the reason you see lot of users switching to codex is for the dismal weekly usage you get from claude

what users care about is actual weekly usage , they dont care a model is a few points smarter , let us use the damn thing for actual work

only codex pro really offers that

View on HN · Topics

I think my results have actually become worse with Opus 4.7.

I have a pretty robust setup in place to ensure that Claude, with its degradations, ensures good quality. And even the lobotomized 4.6 from the last few days was doing better than 4.7 is doing right now at xhigh.

It's over-engineering. It is producing more code than it needs to. It is trying to be more defensible, but its definition of defensible seems to be shaky because it's landing up creating more edge cases. I think they just found a way to make it more expensive because I'm just gonna have to burn more tokens to keep it in check.

View on HN · Topics

The default effort change in Claude Code is worth knowing before your next session: it's now `xhigh` (a new level between `high` and `max`) for all plans, up from the previous default. Combined with the 1.0–1.35× tokenizer overhead on the same prompts, actual token spend per agentic session will likely exceed naive estimates from 4.6 baselines.

Anthropic's guidance is to measure against real traffic—their internal benchmark showing net-favorable usage is an autonomous single-prompt eval, which may not reflect interactive multi-turn sessions where tokenizer overhead compounds across turns. The task budget feature (just launched in public beta) is probably the right tool for production deployments that need cost predictability when migrating.

View on HN · Topics

That depends a bit on token efficiency. From their "Agentic coding performance by effort level" graph, it looks like they get similar outcome for 4.7 medium at half the token usage as 4.6 at high.

Granted that is, as you say, a single prompt, but it is using the agentic process where the model self prompts until completion. It's conceivable the model uses fewer tokens for the same result with appropriate effort settings.

View on HN · Topics

Have they effectively communicated what a 20x or 10x Claude subscription actually means? And with Claude 4.7 increasing usage by 1.35x does that mean a 20x plan is now really a 13x plan (no token increase on the subscription) or a 27x plan (more tokens given to compensate for more computer cost) relative to Claude Opus 4.6?

View on HN · Topics

The more efficient tokenizer reduces usage by representing text more efficiently with fewer tokens. But the lack of transparancy does indeed mean Anthropic could still scale down limits to account for that.

View on HN · Topics

a few months ago it was for weekly:

pro = 5m tokens, 5x = 41m tokens, 20x = 83m tokens

making 5x the best value for the money (8.33x over pro for max 5x). this information may be outdated though, and doesn't apply to the new on peak 5h multipliers. anything that increases usage just burns through that flat token quota faster.

View on HN · Topics

Will they actually give you enough usage ? Biggest complaint is that codex offers way more weekly usage. Also this means GPT 5.5 release is imminent (I suspect thats what Elephant is on OR)

View on HN · Topics

Interestingly github-copilot is charging 2.5x as much for opus 4.7 prompts as they charged for opus 4.6 prompts (7.5x instead of 3x). And they're calling this "promotional pricing" which sounds a lot like they're planning to go even higher.

Note they charge per-prompt and not per-token so this might in part be an expectation of more tokens per prompt.

https://github.blog/changelog/2026-04-16-claude-opus-4-7-is-...

View on HN · Topics

I've been using up way more tokens in the past 10 days with 4.6 1M context.

So I've grown wary of how Anthropic is measuring token use. I had to force the non-1M halfway through the week because I was tearing through my weekly limit (this is the second week in a row where that's happened, whereas I never came CLOSE to hitting my weekly limit even when I was in the $100 max plan).

So something is definitely off. and if they're saying this model uses MORE tokens, I'm getting more nervous.

View on HN · Topics

> Opus 4.7 uses an updated tokenizer that improves how the model processes text. The tradeoff is that the same input can map to more tokens—roughly 1.0–1.35× depending on the content type. Second, Opus 4.7 thinks more at higher effort levels, particularly on later turns in agentic settings. This improves its reliability on hard problems, but it does mean it produces more output tokens.

I guess that means bad news for our subscription usage.

View on HN · Topics

In GitHub Copilot it costs 7.5x whereas Opus 4.6 is 3x

View on HN · Topics

I have not seen any comment from the early tests of 4.7 claiming that it does not work better than the previous version.

However, there have been some valuable warnings about problems that have been hit in the first minutes after switching to 4.7.

For instance that the new guardrails can block working at projects where the previous version could be used without problems and that if you are not careful the changed default settings can make you reach the subscription limits much faster than with the previous version.

View on HN · Topics

If the model is based on a new tokenizer, that means that it's very likely a completely new base model. Changing the tokenizer is changing the whole foundation a model is built on. It'd be more straightforward to add reasoning to a model architecture compared to swapping the tokenizer to a new one.

Usually a ground up rebuild is related to a bigger announcement. So, it's weird that they'd be naming it 4.7.

Swapping out the tokenizer is a massive change. Not an incremental one.

View on HN · Topics

It doesn't need to be. Text can be tokenized in many different ways even if the token set is the same.

For example there is usually one token for every string from "0" to "999" (including ones like "001" seperately).

This means there are lots of ways you can choose to tokenize a number. Like 27693921. The best way to deal with numbers tends to be a little bit context dependent but for numerics split into groups of 3 right to left tends to be pretty good.

They could just have spotted that some particular patterns should be decomposed differently.

View on HN · Topics

Just before the end is this one-liner:

> the same input can map to more tokens—roughly 1.0–1.35× depending on the content type

Does this mean that we get a 35% price increase for a 5% efficiency gain? I'm not sure that's worth it.

View on HN · Topics

> Opus 4.7 is a direct upgrade to Opus 4.6, but two changes are worth planning for because they affect token usage. First, Opus 4.7 uses an updated tokenizer that improves how the model processes text. The tradeoff is that the same input can map to more tokens—roughly 1.0–1.35× depending on the content type. Second, Opus 4.7 thinks more at higher effort levels, particularly on later turns in agentic settings. This improves its reliability on hard problems, but it does mean it produces more output tokens.

This is concerning & tone-deaf especially given their recent change to move Enterprise customers from $xxx/user/month plans to the $20/mo + incremental usage.

IMO the pursuit of ultraintelligence is going to hurt Anthropic, and a Sonnet 5 release that could hit near-Opus 4.6 level intelligence at a lower cost would be received much more favorably. They were already getting extreme push-back on the CC token counting and billing changes made over the past quarter.

View on HN · Topics

Do we have any performance benchmark with token length? Now that the context size is 1 M. I would want to know if I can exhaust all of that or should I clear earlier?

View on HN · Topics

if only. but more token costs, yes.

View on HN · Topics

Interesting that despite Anthropic billing it at the same rate as Opus 4.6, GitHub CoPilot bills it at 7.5x rather than 3x.

View on HN · Topics

Excited to use 1 prompt and have my whole 5-hour window at 100%. They can keep releasing new ones but if they don't solve their whole token shrinkage and gaslighting it is not gonna be interesting to se.

View on HN · Topics

It seems a lot of the problem isn't "token shrinkage" (reducing plan limits), but rather changes they made to prompt caching - things that used to be cached for 1 hour now only being cached for 5 min.

Coding agents rely on prompt caching to avoid burning through tokens - they go to lengths to try to keep context/prompt prefixes constant (arranging non-changing stuff like tool definitions and file content first, variable stuff like new instructions following that) so that prompt caching gets used.

This change to a new tokenizer that generates up to 35% more tokens for the same text input is wild - going to really increase token usage for large text inputs like code.

View on HN · Topics

on Tuesday, with 4.6, I waited for my 5 hour window to reset, asked it to resume, and it burned up all my tokens for the next 5 hour window and ran for less than 10 seconds. I’ve never cancelled a subscription so fast.

View on HN · Topics

I tried the Claude Extension for VSCode on WSL for a reverse engineering task, it consumed all of my tokens, broke and didn't even save the conversatioon

View on HN · Topics

four prompts with opus 4.6 today is equivalent to 30 or 40 two months ago. infernal downgrade in my case.

View on HN · Topics

> First, Opus 4.7 uses an updated tokenizer that improves how the model processes text

wow can I see it and run it locally please? Making API calls to check token counts is retarded.

View on HN · Topics

> In Claude Code, we’ve raised the default effort level to xhigh for all plans.

Does it also mean faster to getting our of credits?

Summarizer