Summarizer

Hallucinations and Reliability

Frustrations with LLMs producing non-existent functions, incorrect code, and requiring extensive verification and correction

← Back to Web development is fun again

Users express a profound duality regarding LLMs, often oscillating between significant productivity gains and the mental exhaustion of correcting "addled" hallucinations that invent non-existent functions or introduce subtle logic bugs. While critics argue that these non-deterministic systems undermine technical fundamentals by shifting the developer’s role from creative architect to weary debugger, proponents maintain that leveraging AI for boilerplate and documentation can still double an expert's output. Ultimately, the consensus suggests that while the "vibe-coded" output remains a risky "foot-gun," the ability to provide rigorous human verification and guardrails is becoming a crucial survival skill in an industry where code is increasingly generated at scale but lacks inherent reliability.

29 comments tagged with this topic

View on HN · Topics
So far, each and every time I used an LLM to help me with something, it hallucinated non-existent functions or was incorrect in an important but non-obvious way. Though I guess I do treat LLMs as a last-resort longshot for when other documentation is failing me.
View on HN · Topics
"You're holding it wrong" 99% of an LLM's usefulness vanishes, if it behaves like an addled old man. "What's that sonny? But you said you wanted that!" "Wait, we did that last week? Sorry let me look at this again" "What? What do you mean, we already did this part?!"
View on HN · Topics
I'd prefer 1x "wrong stuff" rather than wrong stuff blasted 1000x. How is that helpful? Further, they can't write code that fast, because you have to spend 1000x the effort explaining it to them.
View on HN · Topics
Which LLMs have you tried? Claude Code seems to be decent at not hallucinating; Gemini CLI is more eager. I don't think current LLMs take you all the way, but a powerful code generator is a useful thing; just assemble guardrails and keep an eye on it.
View on HN · Topics
A simple "how do I access x in y framework in the intended way" shouldnt require any more context. instead of telling me about z option it keeps hallucinating something that doesnt exist and even says its in the docs when it isnt. Literally just wasting my time
View on HN · Topics
As long as what it says is reliable and not made up.
View on HN · Topics
Studying gibberish doesn't teach you anything. If you were cargo culting shit before AI you weren't learning anything then either.
View on HN · Topics
Necessarily, LLM output that works isn't gibberish. The code that LLMs output has worked well enough to learn from since the initial launch of ChatGPT, even though back then you might have had to repeatedly say "continue" because it would stop in the middle of writing a function.
View on HN · Topics
Google, Facebook, Amazon, Microsoft... they all have vibe-coded code; it's not about whether the code is vibe-coded or not, it's about how well the code is designed and how efficient and bug-free it is. Of course pro coders can debug it and fix it better than some amateur coder, but LLMs are still so valuable. I let Gemini vibe-code little web projects for me and it serves me well, although you have to explain everything to it step by step, and sometimes when it fixes one bug it accidentally introduces another. But we fix bugs together and learn together. And btw, when Gemini fixes bugs, it puts comments in the code on how the particular bug was fixed.
View on HN · Topics
This goes further into LLM usage than I prefer to go. I learn so much better when I do the research and make the plan myself that I wouldn’t let an LLM do that part even if I trusted the LLM to do a good job. I basically don’t outsource stuff to an LLM unless I know roughly what to expect the LLM output to look like and I’m just saving myself a bunch of typing. “Could you make me a Go module with an API similar to archive/tar.Writer that produces a CPIO archive in the newcx format?” was an example from this project.
View on HN · Topics
Exactly! ...If the printing press spouted gibberish every 9 words.
View on HN · Topics
That was LLMs in 2023.
View on HN · Topics
Respect to you. I ran out of energy to correct people's dated misconceptions. If they want to get left behind, it's not my problem.
View on HN · Topics
At some point no one is going to have to argue about this. I'm guessing a bit here, but my guess is that within 5 years, in 90%+ of jobs, if you're not using an AI assistant to code, you're going to be losing out on jobs. At that point, the argument over whether they're crap or not is done. I say this as someone who has been extremely sceptical of their ability to code in deep, complicated scenarios, but lately Claude Opus is surprising me. And it will just get better.
View on HN · Topics
That's a lot of words to say "trust me bruh" which is kind of poetic given that's the entire model (no pun intended) that LLMs work on.
View on HN · Topics
If only it were that easy. I got really good at centering and aligning stuff, but only when the application is constructed in the way I expect. This is usually not a problem as I'm usually working on something I built myself, but if I need to make a tweak to something I didn't build, I frequently find myself frustrated and irritated, especially when there is some higher or lower level that is overriding the setting I just added. As a bonus, I pay attention to what the AI did and its results, and I have actually learned quite a bit about how to do this myself even without AI assistance.
View on HN · Topics
Yes, I worry about this quite a bit. Obviously nobody knows yet how it will shake out, but what I've been noticing so far is that brand recognition is becoming more important. This is obviously not a good thing for startup yokels like me, but it does provide an opportunity for quality and brand building.

The initial creation and generation is indeed much easier now, but testing, identifying, and fixing bugs is still very much a process that takes some investment and effort, even when AI assisted. There is also considerable room for differentiation among user flows and the way people interact with the app. AI is not good at this yet, so the prompter needs to be able to identify and direct these efforts.

I've also noticed in some of my projects, even ones shipped into production in a professional environment, there are lots of hard-to-fix and mostly annoying bugs that just aren't worth it, or that take so much research and debugging effort that we eventually gave up and accepted the downsides. If you give the AI enough guidance to know what to hunt for, it is getting pretty good at finding these things. Often the suggested fix is a terrible idea, but the AI will usually tell you enough about what is wrong that you can use your existing software engineering skills and experience to figure out a good path forward. At that point you can either fix it yourself, or prompt the AI to do it. My success rate doing this is still only at about 50%, but that's half the bugs that we used to live with that we no longer do, which in my opinion has been a huge positive development.
View on HN · Topics
‘Why were they long term?’ is what you need to ask. Code has become essentially free in relative terms, in both the time and money domains. What stands out now is validation - LLMs aren’t oracles, for better or worse; complex code still needs to be tested, and this takes time and money, too. In projects where validation was a significant percentage of effort (which is every project developed by more than two teams) the speed-up from LLM usage will be much less pronounced… until they figure out validation, too; and they just might, with formal methods.
View on HN · Topics
what "AI" are you speaking of? all the current leading LLMs i know of will _not_ do this (i.e web search for latest libraries) unless you explicitely ask
View on HN · Topics
I'll sometimes ask Claude Sonnet 4.5 for JS and TS library recommendations. Not for "latest" or "most popular". For this case, it seems to love recommending promising-looking code from repos released two months ago with like 63 stars.
View on HN · Topics
Sorry friend, if you can’t identify the important differences between a compiler and an LLM, either intentionally or unintentionally (I can’t tell), then I must question the value of whatever you have to say on the topic.
View on HN · Topics
No, when we write code it has an absolute and specific meaning to the compiler. When we write words to an LLM, they are written in a non-specific informal language (usually English) and processed non-deterministically too. This is an incredibly important distinction that makes coding, and asking the LLM to code, two completely different ball games. One is formal, one is not. And yes, this isn’t a new phenomenon.
View on HN · Topics
There is no x because LLM performance is non-deterministic. You get slop out at varying degrees of quality, and so your job shifts from writing to debugging.
View on HN · Topics
A year or so ago I was seriously thinking of making a series of videos showing how coding agents were just plain bad at producing code. This was based on my experience trying to get them to do very simple things (e.g. a five-pointed star, or text flowing around the edge of a circle, in HTML/CSS). They still tend to fail at things like this, but I've come to realize that there are whole classes of adjacent problems they're good at, and I'm starting to leverage their strengths rather than get hung up on their weaknesses. Perhaps you're not playing to their strengths, or just haven't cracked the code for how to prompt them effectively? Prompt engineering is an art, and slight changes to prompts can make a big difference in the resulting code.
View on HN · Topics
I appreciate your reply. A lot of people just say how wonderful and revolutionary LLMs are, but when asked for more concrete stuff they give vague answers or even worse, berate you for being skeptical/accuse you of being a luddite. Your list gives me a starting point and I'm sure it can even be expanded. I do use LLMs the way you suggested and find them pretty useful most of the time - in chat mode. However, when using them in "agent mode" I find them far less useful.
View on HN · Topics
I think it depends what you are doing. I’ve had Claude write the front end of a Rust/React app and it was 10x, if not more (because I just wouldn’t have attempted it otherwise). I’ve also had it write the documentation for a low-level crate - work that needs to be done for the crate to be used effectively - but which I would have half-arsed, because who likes writing documentation?

Recently I’ve been using it to write some async Rust and it just shits the bed. It regularly introduces the select! drop issue or otherwise completely fails to handle waiting on multiple things. My prompts have gotten quite sweary lately. It is probably 1x or worse. However, I am going to try formulating a pattern with examples to stuff into its context, and we’ll see. I view the situation as a problem to be overcome, not an insurmountable failure.

There may be places where an AI just can’t get it right: I wouldn’t trust it to write the clever bit tricks I’m doing elsewhere. But even there, it writes (most of) the tests and the docs. On the whole, I’m having far more fun with AI, and I am at least 2x as productive, on average.

Consider that you might be stuck in a (very bad) local maximum. They certainly exist, as I’ve discovered. Try some side projects, something that has lots of existing examples in the training set. If you wanted to start a Formula 1 team, you’d need to know how to design a car, but there’s also a shit-ton of logistics - like getting the car to the track - that an AI could just handle for you. Find the boring but vital work the AI can do because, in my experience, that’s 90% of the work.
View on HN · Topics
> I feel like I can manage the entire stack again - with confidence.

I have less confidence after a session; now I second-guess everything, and it slows me down because I know the foot-gun is in there somewhere. For example, yesterday Gemini started adding garbage Unicode and then diagnosed file corruption, which it failed to fix. And before you reply: yes, it's my fault for not adding "URGENT CRITICAL REQUIREMENT: don't add rubbish Unicode" to my GEMINI.md.
View on HN · Topics
> I feel like I can manage the entire stack again - with confidence.

By not managing anything? Ignorance is bliss, I guess. I understand it. I've found myself looking at new stacks and tech, not knowing what I didn't know, and wondering where to start. But if you skip these fundamentals of the modern dev cycle, what happens when the LLM fails?
View on HN · Topics
It's good that these tools create the OP's positive feeling of being on top of the full Web stack again. I just wish the tool that provides that feeling were a deterministic front-end code generator built from software technology and software-engineering insights, and not a neural network utilizing a pseudo-random number generator...