AI Industry Vibes Culture

Observations that LLM evaluation is highly subjective and vibes-based, comparison to gambling psychology, criticism of cargo-culting prompt engineering tricks from influencers

The current AI landscape is increasingly dominated by a "vibes-based" culture where objective benchmarks are often dismissed in favor of subjective anecdotes and "coding voodoo," leading to frequent, unverified claims that models are being secretly "nerfed." This environment is frequently compared to gambling psychology, with users acting like bettors on a lucky streak as they swap magical prompt incantations and treat non-deterministic outputs like rigged slot machines. Such an atmosphere fuels a "cargo cult" of influencer-driven hype and joke projects that gain unearned prestige through bot-inflated metrics and venture capital interest, potentially masking a plateau in actual model capabilities. Consequently, the discourse reflects a deep tension between those who see "mass psychosis" in user complaints and those who insist that personal intuition is the only way to measure the "sloppiness" of a model’s reasoning.

View on HN · Topics

so it's also a skinner box

View on HN · Topics

Idk but ironically, I had to re-read the first part of GP's comment three times, wondering WTF they're implying a mistake, before I noticed it's the car wash, not the car, that's 50 meters away.

I'd say it's a very human mistake to make.

View on HN · Topics

CoT is basically bullshit, entirely confabulated and not related to any "thought process"...

View on HN · Topics

> I hope people realize that tools like caveman are mostly joke/prank projects

This seems to be a common thread in the LLM ecosystem; someone starts a project for shits and giggles, makes it public, most people get the joke, others think it's serious, author eventually tries to turn the joke project into a VC-funded business, some people are standing watching with the jaws open, the world moves on.

View on HN · Topics

Or - making sensational statements gets attention. A dangerous tool is necessarily a powerful tool, so that statement is pretty much exactly what you'd say if you wanted to generate hype, make people excited and curious about your mysterious product that you won't let them use.

View on HN · Topics

It's the same as the crypto/NFT hype cycles, except this time one of the joke projects is going to crash the economy.

View on HN · Topics

A major reason for that is because there's no way to objectively evaluate the performance of LLMs. So the meme projects are equally as valid as the serious ones, since the merits of both are based entirely on anecdata.

It also doesn't help that projects and practices are promoted and adopted based on influencer clout. Karpathy's takes will drown out ones from "lesser" personas, whether they have any value or not.

View on HN · Topics

All LLMs also effectively work by ”larping” a role. You steer it towards larping a caveman and well.. let’s just say they weren’t known for their high iq

View on HN · Topics

I hesitated 100% when I saw caveman gaining steam. Changing something like this absolutely changes the behaviour of the model's responses; simply including a "lmao" or something casual in any reply will shift the tone entirely into a more relaxed, "ya whatever" type of mode.

I think a lot of people echo my same criticism, I would assume that the major LLM providers are the actual winners of that repo getting popular as well, for the same reason you stated.

> you will barely save even 1% with such a tool

For the end user, this doesn't make a huge impact; in fact it potentially hurts if it means that you are getting less serious replies from the model itself. However, as with any minor change across a ton of users, this is a significant saving for the providers.

I still think that keeping the model capable of easily finding what it needs, without having to comb through a lot of files for no reason, is the best current method to save tokens. It potentially takes some upfront tokens if you are delegating to the agent the work of keeping those navigation files up to date, but it pays dividends in future sessions, when your context window is smaller and only the relevant portions of the project need to be loaded into it.

View on HN · Topics

We started out with oobabooga, so caveman is the next logical evolution on the road to AGI.

View on HN · Topics

You really think the 33k people that starred a 40 line markdown file realize that?

View on HN · Topics

You mean the 33k bots that created a nearly linear stars/day graph? There's a dip in the middle, but it was very blatant at the start (and now)

View on HN · Topics

Stars are more akin to bookmarks and likes these days, as opposed to a show of support or "I use this"

View on HN · Topics

I use them like bookmarks.

View on HN · Topics

I intentionally throw some weird ones on there just in case anyone is actually ever checking them. Gotta keep interviewers guessing.

View on HN · Topics

I use them as likes

View on HN · Topics

The amount of cargo culting amongst AI halfwits (who seem to have a lot of overlap with influencers and crypto bros) is INSANE

I mean just look at the growth of all these "skills" that just reiterate knowledge the models already have

View on HN · Topics

I mean we had a shoe company pivot to AI and raise their stock value by 300%, how can we even know anymore

View on HN · Topics

Yeah, when I'm writing code I try to avoid zeros and ones, since those are the most common bits, making them essentially noise

View on HN · Topics

I guess just a spell-check in the repo? But yes, I'd imagine that they have an effect. Even running the same input twice is non-deterministic.

View on HN · Topics

I really enjoy the party game "Neanderthal Poetry", in which you can only speak using monosyllabic words. I bet you would too.

View on HN · Topics

People are really trigger-happy when it comes to throwing magic tools on top of AI that claim to "fix" the weak parts (often placeboing themselves because Anthropic just fixed some issue on their end).

Then the next month 90% of this can be replaced with a new batch of supply chain attack-friendly gimmicks

Especially Reddit seems to be full of such coding voodoo

View on HN · Topics

My favorite to chuckle at is the prompt hack voodoo stuff, like, “tell it to be correct” or “say please” or “tell it someone will die if it doesn't do a good job,” often presented very seriously and with some fast cutting animations in a 30 second reel

View on HN · Topics

Make no mistakes!

View on HN · Topics

> coding voodoo

Well, we've sacrificed the precision of actual programming languages for the ease of English prose interpreted by a non-deterministic black box that we can't reliably measure the outputs of. It's only natural that people are trying to determine the magical incantations required to get correct, consistent results.

View on HN · Topics

It's funny watching LLM users act like gamblers. Every other week swearing by one model and cursing another, like a gambler who thinks a certain slot machine or table is cold this week. These LLM companies are literally building slot machine mechanics into their UIs too, I don't think this phenomenon is a coincidence.

Stop using these dopamine brain poisoning machines, think for yourself, don't pay a billionaire for their thinking machine.

View on HN · Topics

Proof they are nerfing the model? It is stable in benchmarks: https://marginlab.ai/trackers/claude-code-historical-perform...

All this just reads like just another case of mass psychosis to me

View on HN · Topics

So many people confuse sycophantic behavior with producing results.

View on HN · Topics

The market here is extraordinarily vibes-based, and burning billions of dollars for an ephemeral PR boost, which might only last another couple of weeks until people find a reason to hate Codex, does not reflect well on OAI's long-term viability.

View on HN · Topics

Agree. I keep effort max on Claude and xhigh on GPT for all tasks and keep tasks as scoped units of work instead of boil the ocean type prompts. It is hard to measure but ultimately the tasks are getting completed and I'm validating so I consider it "working as expected".

View on HN · Topics

Usually the problems that cause this kind of thing are:

1) Bad prompt/context. No matter what the model is, the input determines the output. This is a really big subject as there's a ton of things you can do to help guide it or add guardrails, structure the planning/investigation, etc.

2) Misaligned model settings. If temperature/top_p/top_k are too high, you will get more hallucination and possibly loops. If they're too low, you don't get "interesting" enough results. Same for the repeat protection settings.

I'm not saying it didn't screw up, but it's not really the model's fault. Every model has the potential for this kind of behavior. It's our job to do a lot of stuff around it to make it less likely.

The agent harness is also a big part of it. Some agents have very specific restrictions built in, like max number of responses or response tokens, so you can prevent it from just going off on a random tangent forever.
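The sampling settings mentioned in point 2 can be pictured with a toy sampler. This is a minimal sketch, not any provider's actual implementation: the function name `sample_token`, the convention that `top_k=0` means "disabled", and the example logits are all made up for illustration, and real inference stacks may apply these filters in a different order.

```python
import math
import random


def sample_token(logits, temperature=1.0, top_k=0, top_p=1.0, rng=None):
    """Pick one token index from raw logits (hypothetical toy sampler).

    temperature: divides the logits; higher values flatten the distribution.
    top_k: keep only the k most likely tokens (0 disables this filter here).
    top_p: keep the smallest set of top tokens whose probability mass
           reaches top_p (nucleus sampling).
    """
    rng = rng or random.Random()

    # Treat a non-positive temperature as greedy decoding (an assumption of
    # this sketch): always return the most likely token.
    if temperature <= 0:
        return max(range(len(logits)), key=lambda i: logits[i])

    # Softmax over temperature-scaled logits, shifted for numerical stability.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Rank tokens from most to least likely, then apply the top-k cut.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:
        order = order[:top_k]

    # Apply the top-p cut: stop once the kept mass reaches top_p.
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # Sample among the survivors, weighted by their original probabilities.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


# With top_k=1 only the most likely token survives, so this is always 0.
always_first = sample_token([2.0, 1.0, 0.1], top_k=1)
```

Loosening any of the three knobs lets lower-probability tokens through, which is one way to read the "too high" versus "too low" trade-off described above. It also shows why the same input can produce different outputs from run to run whenever more than one token survives the filters.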

View on HN · Topics

That's wild that you think 4.6 is bad... Each model has its strengths and weaknesses. I find that Codex is good for architectural design and Claude is actually better at the engineering and building

View on HN · Topics

codex low-key seems to be better than claude. and i say this as an 18-hour-a-day user of both (mostly claude)

View on HN · Topics

Meh. At $work we were on CC for one month, then switched to Codex for one month, and now will be on CC again to test. We haven’t seen any obvious difference between CC and Codex; both are sometimes very good and sometimes very stupid. You have to test for a long time, not just test one day and call it a benchmark just because you have a single example.
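The point about a single example not being a benchmark can be made concrete with a confidence interval on a pass rate. The sketch below uses the standard Wilson score interval; the function names and the pass counts are invented for illustration and do not come from any real comparison of CC and Codex.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a pass rate (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    z2 = z * z
    denom = 1 + z2 / trials
    center = (p + z2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z2 / (4 * trials * trials)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def intervals_overlap(a, b):
    """True if two (low, high) intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical numbers: the same 70% vs 50% pass rates at two sample sizes.
small = intervals_overlap(wilson_interval(7, 10), wilson_interval(5, 10))
large = intervals_overlap(wilson_interval(700, 1000), wilson_interval(500, 1000))
# small is True: ten tasks each cannot separate the two tools.
# large is False: a thousand tasks each can.
```

Overlapping intervals are only a rough heuristic, not a formal hypothesis test, but they show why a week of impressions rarely settles which tool is better.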

View on HN · Topics

> It then becomes clear just how "sloppy" CC is.

Have you done the reverse? In my experience models will always find something to criticize in another model's work.

View on HN · Topics

These threads are always full of superstitious nonsense. Had a bad week at the AIs? Someone at Anthropic must have nerfed the model!

The roulette wheel isn't rigged, sometimes you're just unlucky. Try another spin, maybe you'll do better. Or just write your own code.

View on HN · Topics

Start vibe-coding -> the model does wonders -> the codebase grows with low code quality -> the spaghetti code builds up to the point where the model stops working -> attempts to fix the codebase with AI actually make it worse -> complain online "model is nerfed"

View on HN · Topics

Part of me wonders if there's some subtle behavioral change with it too. Early on we're distrusting of a model and so we're blown away: we were giving it more details to compensate for assumed inability, but the model outperformed our expectations. Weeks later we're more aligned with its capabilities and so we become lazy. The model is very good, why do we have to put in as much work to provide specifics, specs, ACs, etc.? So then of course the quality slides, because we assumed its capabilities somehow absolved the need for the same detailed guardrails (spec, ACs, etc.) for the LLM.

This scenario obviously does not apply to folks who run their own benches with the same inputs between models. I'm just discussing a possible and unintentional human behavioral bias.

Even if this isn't the root cause, humans are really bad at perceiving reality. Like, really really bad. LLMs are also really difficult to objectively measure. I'm sure the coupling of these two facts plays a part, possibly a significant one, in our perception of LLM quality over time.
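For the "same inputs between models" case mentioned above, one simple way to compare two models is an exact sign test on the tasks where they disagree. This is only a sketch under stated assumptions: each task yields a single pass/fail per model, the function name `paired_sign_test` is hypothetical, and the result lists are made up.

```python
from math import comb


def paired_sign_test(results_a, results_b):
    """Exact two-sided sign test on paired pass/fail results.

    results_a and results_b are equal-length lists of booleans, one entry
    per task, where both models were given the same input for each task.
    Returns (a_only_wins, b_only_wins, p_value).
    """
    if len(results_a) != len(results_b):
        raise ValueError("both models must be run on the same tasks")

    # Only tasks where exactly one model passed carry information.
    a_only = sum(1 for a, b in zip(results_a, results_b) if a and not b)
    b_only = sum(1 for a, b in zip(results_a, results_b) if b and not a)
    n = a_only + b_only
    if n == 0:
        return a_only, b_only, 1.0

    # Probability of a split at least this lopsided if each disagreement
    # were a fair coin flip, doubled for a two-sided test.
    k = min(a_only, b_only)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return a_only, b_only, min(1.0, 2 * tail)


# Made-up example: model A wins 8 of the 10 tasks where the two disagree.
a_wins, b_wins, p = paired_sign_test(
    [True] * 8 + [False] * 2,
    [False] * 8 + [True] * 2,
)
```

Even an 8 to 2 split on ten disagreements gives a p-value above 0.1 here, which fits the comment's point that human perception of quality shifts is easy to fool.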

View on HN · Topics

Nah dude, that roulette wheel is 100% rigged. From top to bottom. No doubt about that. If you think they are playing fair you are either brand new to this industry, or a masochist.

View on HN · Topics

It's because LLM companies are literally building quasi slot machines, and their UIs support this notion; for instance, you can run a multiplier on your output (x3, x4, x5), like a slot machine. Brain-fried LLM users are behaving like gamblers more and more every day (it's working). They have all sorts of theories about why one model is better than another, like a gambler does about a certain blackjack table or slot machine: it makes sense in their head but makes no sense on paper.

Don't use these technologies if you can't recognize this, just as a person shouldn't gamble unless they understand concretely that the house has a statistical edge and they will lose if they play long enough. You will lose if you play with LLMs long enough too; they are also statistical machines, like casino games.

This stuff is bad for the brain for a lot of people, if not all.

View on HN · Topics

Or it could be a selection bias. The ground truth is not what HN herd mentality complains about, but the usage stats.

View on HN · Topics

I suppose I come forward with my own usage stats, but it is anecdata :)

And the anecdata matches other anecdata.

Maybe I'm missing why that's selection bias.

View on HN · Topics

I don't know, I think Java is the best programming language. I use it for everything I do, no other programming language comes close. Python lost all my trust with how slow its interpreter is, you can't use it for anything.

^^^^
Sarcastic response, but engineers have always loved their holy wars, LLM flavor is no different.

View on HN · Topics

Completely agree. We're at this place where a frontier model's peak perceived value always seems to be right before it releases.

View on HN · Topics

It's frankly becoming difficult for me to imagine what the next level of coding excellence looks like though.

By which I mean, I don't find these latest models really have huge cognitive gaps. There are few problems I throw at them that they can't solve.

And it feels to me like the gap now isn't model performance, it's the agentic harnesses they're running in.

View on HN · Topics

People were "predicting" the plateau since GPT-1. By now, it would take extraordinary evidence for me to take such "predictions" seriously.

View on HN · Topics

It might be a bad idea to put that in all caps, because in the training data, angry conversations are less productive. (I do the same thing, just in lowercase.)

View on HN · Topics

I just subscribed this month again because I wanted to have some fun with my projects.

Tried out Opus 4.6 a bit and it is really really bad. Why do people say it's so good? It cannot come up with any half-decent VHDL. No matter the prompt. I'm very disappointed. I was told it's a good model

View on HN · Topics

because they’re using it for different things where it works well and that’s all they know?

View on HN · Topics

I don’t think I’ve ever seen otherwise reasonable people go completely unhinged over anything like they do with Opus

View on HN · Topics

I've seen a similar psychological phenomenon where people like something a lot, and then they get unreasonably angry and vocal about changes to that thing.

Usage limits are necessary but I guess people expect more subsidized inference than the company can afford. So they make very angry comments online.

For example, there is no evidence that 4.6 ever degraded in quality: https://marginlab.ai/trackers/claude-code-historical-perform...

View on HN · Topics

Doesn't matter. My vibes say it got bad in January 2026. Thus, they secretly nerfed Opus 4.6 in January 2026.

The fact that it didn't exist back then is completely and utterly irrelevant to my narrative.

View on HN · Topics

Yeah, that's my point. Humans are not reliable LLM evaluators. "Secret model nerfs" happen in "vibes" far more often than they do in any reality.

View on HN · Topics

This but unironically.

"I reject your reality, and substitute my own".

It worked for cheeto in chief, and it worked for Elon, so why not do it in our normal daily lives?

View on HN · Topics

And yet another "AI doesn't work" comment without any meaningful information. What were your exact prompts? What was the output?

This is like a user of conventional software complaining that "it crashes", without a single bit of detail, like what they did before the crash, if there was any error message, whether the program froze or completely disappeared, etc.

View on HN · Topics

In the gemini subreddit there is a persistent problem with bots posting "Gemini sucks, I switched to Claude" and then bots replying they did the same.

Old accounts with no posts for a few years, then suddenly really interested in talking up Claude, and their lackeys right behind to comment.

Not even necessarily calling out Anthropic, many fan boys view these AI wars as existential.

View on HN · Topics

Sorry, no, not a bot. I get way better results out of Codex.

It's just ultimately subjective, and, it's like, your opinion, man. Calling people bots who disagree is probably not a good look.

I don't like OpenAI the company, but their model and coding tool is pretty damn good. And I was an early Claude Code booster and go back and forth constantly to try both.

View on HN · Topics

The same people that hyped up Claude will also hype up better alternatives or speak out against it, seems more like you're being disingenuous here.

View on HN · Topics

The most important question is: does it perform better than 4.6 in real world tasks? What's your experience?

View on HN · Topics

11% further along the particular bell curve of SWE-bench. Not really easy to extrapolate to the real world, especially given that e.g. the Chinese models tend to heavily train on the benchmarks. But a 10% bump with the same model should equate to “feels noticeably smarter”.

A more quantifiable eval would be METR’s task time: the duration of tasks that the model can complete on average 50% of the time. We’ll have to wait to see where 4.7 lands on this one.

View on HN · Topics

Interesting to see the benchmark numbers, though at this point I find these incremental seeming updates hard to interpret into capability increases for me beyond just "it might be somewhat better".

Maybe I've skimmed too quickly and missed it, but does calling it 4.7 instead of 5 imply that it's the same as 4.6, just trained with further refined data/fine tuned to adapt the 4.6 weights to the new tokenizer etc?

View on HN · Topics

while it seems even with 4.7 we will never see the quality of early 4.6 days, some dude is posting 'agi arrived!!!' on Instagram and LinkedIn.

View on HN · Topics

someone tell me if i should be happy

View on HN · Topics

even Sonnet right now has degraded for me to the point of like ChatGPT 3.5 back then. took ~5 hours on getting a Playwright e2e test fixed that waited on a wrong CSS selector. literally, dumb as fuck. and it had still been better than Opus for the last week or so... did roughly comparable work for the last 2 weeks and it all went increasingly worse - taking more and more thinking tokens circling around nonsense and just not doing 1 line changes that a junior dev would see on the spot. Too used to vibing now to do it by hand (yeah i know) so I kept watching and meanwhile discovered that Codex just fleshed out a nontrivial app with correct financial data flows in the same time without any fuss. I really don't get why Anthropic is dropping their edge so hard recently; in my head they should be aiming for increasing hype leading to the IPO, not disappointment crashes from their power user base.

View on HN · Topics

You are operating purely on vibes, https://marginlab.ai/trackers/claude-code-historical-perform...

View on HN · Topics

not rejecting reality, but increasing doubts about the effectiveness of these tests. and yes it's subjective n=1, but I have literally been creating and shipping projects for many months now, always forked from the same github template repository, and essentially do the same steps with a few different brand touches and nearly muscle-memory prompting to do just the right next steps mechanically over and over again, and the amount of things getting done per step got worse and the quality degraded too, forgetting basic things along the way a few prompts in. as I said n=1, but the very repetitive nature of my current work days, always doing a new thing from the exact same start point that hasn't changed in half a year, is kind of my personal benchmark. YMMV but on my end the effects are real, specifically when tracking hours over this stuff.

View on HN · Topics

Sigh here we go again, model release day is always the worst day of the quarter for me. I always get a lovely anxiety attack and have to avoid all parts of the internet for a few days :/

View on HN · Topics

It seems like we're hitting a solid plateau of LLM performance with only slight changes each generation. The jumps between versions are getting smaller. When will the AI bubble pop?

View on HN · Topics

Introducing a new upgraded slot machine named "Claude Opus" in the Anthropic casino.

You are in for a treat this time: It is the same price as the last one [0] (if you are using the API.)

But it is slightly less capable than the other slot machine named 'Mythos' the one which everyone wants to play around with. [1]

[0] https://claude.com/pricing#api

[1] https://www.anthropic.com/news/claude-opus-4-7

View on HN · Topics

This is true if you know what you are doing and provide proper guidance. It’s not true if you just want to vibe the whole app.

View on HN · Topics

TL;DR; iPhone is getting better every year

The surprise: agentic search is significantly weaker somehow hmm...
