Peak Hour Throttling

Discussion of off-peak credits and peak-hour limit reductions, theories about A/B testing heavy users, suspicion of silent quality degradation for users approaching limits

Users are increasingly suspicious that Anthropic is managing peak demand through a "carrot and stick" approach, pairing off-peak credits with aggressive limit reductions that leave heavy users feeling penalized for high utilization. Beyond simple rate-limiting, a prominent theory suggests that model quality is being silently degraded during high-traffic periods by using lower quantization or routing complex, high-cost queries to more efficient but less capable models. This perceived inconsistency has fueled frustration over a lack of transparency, with many speculating that stealthy A/B testing is being used to throttle those who frequently approach their monthly usage caps. Ultimately, the community fears that as compute costs rise, the most impressive capabilities of these tools are being intentionally gated or "dumbed down" to preserve system stability.

View on HN · Topics

As far as I can tell, no one has asked this question publicly before, but I wonder if they're tinkering with quantizations in the background.

When I was exploring local inference, one of the early mistakes I made was using lower quantizations for models, and then assuming that the model sucked based on the results. It took me a while to learn that as the number of bits decreases, the fidelity of "simulation" decreases as well. But so does the total size in memory.

For example, for Kimi K2 - a trillion-parameter open-source model - the 4-bit version is around 580 GB in size. The 8-bit version is 1.1 TB (yes, TB - 1,090 GB). And the full-fidelity 16-bit version is 2.05 TB.

So essentially the 4-bit version is roughly (give or take how you're packaging it) 1/4th the size of the 16-bit one. And significantly dumber.

If you have a limited amount of resources / don't want to utilize 6 GB200s to run the model, and want to make do with three much cheaper (comparatively) B200 units, then a lower quantization allows you to do that.

I wonder if Opus' inconsistent performance is because - depending on the time of the day - Anthropic routes you to a differently quantized version?
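The size figures quoted for Kimi K2 in the comment above can be sanity-checked with simple arithmetic. The following is a minimal sketch, assuming exactly one trillion parameters and counting only raw weight storage (parameters × bits ÷ 8); the function name `weight_size_gb` is hypothetical, and real checkpoints carry extra overhead that this ignores.

```python
def weight_size_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate raw weight storage in decimal gigabytes (no packaging overhead)."""
    return num_params * bits_per_weight / 8 / 1e9


# Assumption: exactly one trillion parameters, as a round figure for illustration.
PARAMS = 1e12

sizes = {bits: weight_size_gb(PARAMS, bits) for bits in (4, 8, 16)}

for bits, gb in sizes.items():
    print(f"{bits:>2}-bit: ~{gb:,.0f} GB")

# Ratio of the 4-bit footprint to the 16-bit footprint under this estimate.
ratio_4_to_16 = sizes[4] / sizes[16]
print(f"4-bit / 16-bit ratio: {ratio_4_to_16:.2f}")
```

The estimate comes out somewhat below the quoted 580 GB, 1,090 GB, and 2.05 TB, which fits the commenter's "give or take how you're packaging it" caveat. The quoted figures give a 4-bit to 16-bit ratio of about 0.28, close to the idealized 0.25.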

View on HN · Topics

Two weeks ago the rolling session usage plummeted to borderline unusable. I'd say my weekly output is now equivalent to 2 session windows from before the change.

View on HN · Topics

I've been using it with `/effort max` all the time, and it's been working better than ever.

I think that's part of the problem: it's hard to measure this, and you also don't know which A/B test cohorts you may currently be in or how they are affecting results.

View on HN · Topics

Is that why Anthropic recently gave out free credits for use in off-hours? Possibly an attempt to more evenly distribute their compute load throughout the day?

View on HN · Topics

That was the carrot, but it was followed immediately by the stick (5 hour session limits were halved during peak hours)

View on HN · Topics

I suspect they get cheap off-peak electricity, so compute is cheaper at those times.

View on HN · Topics

I think it's a lot simpler than that. At peak, GPUs are all running hot. During low volume, they aren't.

View on HN · Topics

> Is that why Anthropic recently gave out free credits for use in off-hours?

That was the carrot for the stick. The limits and the issues were never officially recognized or communicated. Neither were the "off-hours credits". You would only know about them if you logged in to your dashboard. When was the last time you logged in there?

View on HN · Topics

Anthropic isn't going to give us that information. It's not actually static, it depends on subscription demand and idle compute available.

View on HN · Topics

I am 90% sure it's looking at month-long usage trends now and punishing people who utilize 80%+ week over week. It's the only way to explain how some people burn through their limit in an hour while others who still use it a lot get through their hourly limits fine.

View on HN · Topics

It's hard to say. Admittedly I'm a heavy user, as I intentionally cap out my 5x plan every week. I've personally found that I get more usage by staying on older versions of CC and being very vigilant about context management. But nobody can say for sure. We know they have A/B test capabilities from the CC leaks, so it's just a matter of turning on a flag for a heavy user.

View on HN · Topics

Not that anybody can actually use it though, as a large percentage of Copilot users are facing seemingly random multi-day rate limits.

https://www.theregister.com/2026/04/15/github_copilot_rate_l...

View on HN · Topics

I am waiting for the 2x usage window to close to try it out today.

If they are charging 2x usage during the most important part of the day, doesn't this give OpenAI a slight advantage as people might naturally use Codex during this period?

View on HN · Topics

It's a combination of factors. Anthropic implemented rate-limiting where the 5hr usage limit would be burned through faster at peak hours. I was personally bitten by this multiple times before one guy from Anthropic announced it publicly via Twitter. Terrible communication. It wasn't small either: ~15 minutes of work ended up burning the entire 5hr limit. That annoyed me enough to switch to Codex for the month at that point.

Now people are saying the model response quality went down. I can't vouch for that since I wasn't using Claude Code, but I don't think this many people saying the same thing is total noise.

View on HN · Topics

Did they A/B test the new tokenizer?

I'm curious if that might be responsible for some of the regressions in the last month. I've been getting feedback requests on almost every session lately, but wasn't sure if that was because of the large amount of negative feedback online.

View on HN · Topics

Is it just Opus 4.6 with throttling removed?

View on HN · Topics

Here’s the problem. The distribution of query difficulty / task complexity is probably heavily right-skewed, which drives up the average cost dramatically. The logical thing for Anthropic to do, in order to keep costs under control, is to throttle high-cost queries. Claude can only approximate the true token cost of a given query prior to execution. That means anything near the top percentile will need to get throttled as well.

By definition this means that you’re going to get subpar results for difficult queries. Anything too complicated will get a lightweight model response to save on capacity. Or an outright refusal which is also becoming more common.

New models are meaningless in this context because by definition the most impressive examples from the marketing material will not be consistently reproducible by users. The more users who try to get these fantastically complex outputs the more those outputs get throttled.
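The right-skew argument in the comment above can be illustrated with a small simulation. This is only a sketch under loud assumptions: per-query cost is drawn from a lognormal distribution with made-up parameters, chosen purely because it is right-skewed. None of the numbers come from Anthropic or from any real usage data.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical per-query costs in arbitrary units (assumed lognormal, sigma=1.5).
costs = sorted(rng.lognormvariate(0.0, 1.5) for _ in range(20_000))

mean_cost = statistics.fmean(costs)
median_cost = statistics.median(costs)

# Share of total cost attributable to the most expensive 1% of queries.
k = len(costs) // 100
top_1pct = costs[-k:]
top_share = sum(top_1pct) / sum(costs)

print(f"median cost: {median_cost:.2f}")
print(f"mean cost:   {mean_cost:.2f}")
print(f"top 1% of queries account for {top_share:.0%} of total cost")
```

Under these assumptions the mean sits well above the median, and a small fraction of queries carries a disproportionate share of total cost. That is the shape of incentive the commenter describes; whether real traffic looks like this is not something the thread establishes.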
