The critical need for external validation mechanisms: commenters note that coding agents succeed because compilers and linters act as truth-checkers, whereas open-ended tasks (spreadsheets, emails) lack rigorous feedback loops, making hallucinations and "truthy" errors dangerous and hard to detect.
From the discussion of “Why didn't AI ‘join the workforce’ in 2025?”
While coding agents benefit from the objective feedback of compilers and linters, commenters warn that open-ended tasks often produce "truthy" hallucinations that are dangerously difficult to detect at a glance. This reasoning deficit leads to significant logic errors, ranging from nonsensical SQL tests to inaccurate medical reports, that still require specialized human expertise to catch, challenging the idea that AI can fully replace specialized roles. Furthermore, there is a pervasive concern that as AI is used to "speed up" work, the necessary human verification will become increasingly cursory or "half-assed" under mounting productivity pressure. Ultimately, the utility of these agents hinges on whether they are treated as fallible interns requiring constant tutoring or as authoritative tools; the latter risks a "clanker incompetence" that undermines professional standards.
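The asymmetry the commenters describe can be made concrete. In the sketch below (hypothetical helper names, Python chosen for illustration), code output gets a hard pass/fail signal from the language toolchain itself, while prose output has no comparably rigorous oracle, so a fluent but wrong claim passes any cheap automatic check:

```python
def verify_code(source: str) -> bool:
    """Objective check: does the agent's snippet even parse as Python?
    A compiler/linter plays the role of an external truth-checker."""
    try:
        compile(source, "<agent-output>", "exec")
        return True
    except SyntaxError:
        return False

def verify_prose(text: str) -> bool:
    """No rigorous feedback loop exists for open-ended output: the best
    cheap automatic check is trivially weak, so 'truthy' hallucinations
    sail through."""
    return len(text.strip()) > 0

good = "def add(a, b):\n    return a + b"
bad = "def add(a, b) return a + b"  # missing colon: caught immediately

print(verify_code(good))   # parse succeeds
print(verify_code(bad))    # parse fails, error surfaces automatically

# A confidently wrong sentence passes the only check we have:
print(verify_prose("The Q3 revenue figure is $4.2M."))
```

The point is not the specific checks but the shape of the loop: coding agents operate inside a system that rejects broken output mechanically, whereas spreadsheets, emails, and reports depend on a human reviewer who, as the commenters worry, may verify less carefully as productivity pressure mounts.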
33 comments tagged with this topic