Trust and Verification

Need to verify AI output, treating AI responses with skepticism, the impossibility of reviewing everything at scale, domain expertise required to catch errors

While AI significantly accelerates production, it often creates a dangerous "illusion of competence" by generating polished, authoritative-looking outputs that mask hallucinations and fundamental logic errors. Commenters argue that this shifts a heavy burden onto reviewers, who must possess deep domain expertise to catch subtle bugs that can lead to catastrophic system failures or "write-only" codebases. There is a growing frustration with "vibe coding," where the lack of "skin in the game" leads users to offload the labor of verification to others, effectively drowning meaningful signals in a sea of low-effort noise that masquerades as high-effort signal. Ultimately, the consensus highlights that AI is a tool of probabilities rather than certainties, requiring a skeptical human intermediary to provide the critical "value add" and accountability that the machine cannot.

View on HN · Topics

I'm starting to see pushback for this. I know a Product Manager that was fired for padding his documentation with AI to the point there were mistakes and wasted work due to AI hallucinations.

View on HN · Topics

If that is your manager, do so, sure. But make sure your manager is "such a manager".

If I was your manager, and you sent me your seventeen page AI generated thing coz you think I'm just gonna summarize anyway and I expect something long: You misread me.

I make a point all the time to everyone that won't listen, to not send me walls of text. I'm not gonna read them. I'm gonna ignore them, close your bug reports until I can understand them because you spent the time to make them short and legible. If you use AI for that, I don't care. But I better have something short and that when I read it makes actual sense and when I verify it, holds up. If I wanted to just ask AI, I'd do it myself. You have to "value add" to the AI if you want to be valuable yourself.

View on HN · Topics

The last place I worked for, if it happened with someone new in the company or the team, I would find a polite way to say "do your job and fix this shit" and it worked.

Some people have put me on their blacklists after these interactions, sure, but they're the exact people I don't want to work with again. The important thing here is that I've never done someone else's work for free.

View on HN · Topics

Because the reviewer ends up doing the real work actually checking it works.

The laziness is offloading work down the line.

View on HN · Topics

That has nothing to do with using AI, if the dev didn't check their work then that is being a bad dev.

View on HN · Topics

Unfortunately, there is pressure to treat this stuff in good faith. Maybe the PR author really did write all this. Maybe they really did spend 6 hours writing this document.

So, I approach it in good faith, but I do get upset when people say "I'll ask claude". You need to be the intermediary, I can also prompt claude and read back the result. If you are going to hire an employee to do work on your behalf, you are responsible for their performance at the end of the day. And that's what an AI assistant is. The buck stops with you. But I don't think people understand that, or that they aren't adding value. At some point, you have to use your brain to decide if the AI is making sense, that's not really my job as the code/doc reviewer. I want to have a conversation with you, not your tooling, basically.

View on HN · Topics

> Why does that content get ranked highly?

Search engines only show a snippet of the content and that always looks convincing. It's the whole content that is off and, unfortunately, a few seconds/minutes can pass before you realize it (If you ever do).

View on HN · Topics

If I paste something from an AI into chat, I always identify it as such by saying something like "my claude instance says this:". I also don't blindly copy paste from it, I always read it first and usually edit it for brevity or tone. Feel like this should be the absolute minimum for sending AI content to a person.

View on HN · Topics

Even that is pretty useless because we have no idea what context "your Claude instance" has. All you're doing is dressing up some bullshit to look authoritative.

When I started my PhD I was already really good at typesetting with LaTeX. I started to bring in fully typeset works in progress for my supervisor to read through. These proofs often had fatal flaws. He asked me to stop typesetting until after the work had been verified because it looked too convincingly correct due to being typeset.

That was about 15 years ago but I've never forgotten it. Drafts should look like drafts. Scrappy work and proofs of concept should look as such. Stop fucking with people by making your bullshit, scrappy ideas look legit. Progress is a cooperative effort. It's not about trying to make people say yes.

View on HN · Topics

My friend built a construction management SaaS entirely via Claude.

It looked damned impressive, and it kind of worked to demo, but he is in no way a programmer, though he understood the problem domain very well. I asked a few basic questions:

- where is the data stored?

- How would you recover from a database failure?

- does it consume tokens at runtime?

- what is the runtime used at the back end?

- why are the web pages 3M in size and take forever to load?

He had no idea.

It's a typical vibe coding scenario, and people like to paint this as why vibe sucks.

I think however that all that is needed to bridge the gap is some very simple feedback from an expert at the right time.

For example, to someone who knows about databases, it's pretty easy to look at a database schema and spot stuff that looks off - denormalised data, weird columns. That takes 10 minutes, and the feedback could be given directly to the LLM.

Likewise someone who knows a little about systems architecture could make sure at the outset that some good practices are followed, e.g.:

- "I want your help to build this system but at runtime I do not want to consume any tokens."

- "I want the system to store its data in Postgres (or whatever) and I want documented recovery plans if the database craps itself".

- "I want web pages to, as much as possible, load and render as quickly as possible, and then pull data in from the back end, with loading indicators showing where the UI was not yet up to date".

View on HN · Topics

Is CRUD low stakes? Even if all you do with the employee database is read and write employees, losing it or corrupting it is disastrous, potentially business-ending.

View on HN · Topics

> That takes 10 minutes

Verifying LLM output needs to occur every time LLM output is generated, so no it doesn’t just take 10 minutes.

It takes (10 minutes + time to change the LLM input + 10 minutes to verify it worked) × ~the number of times the code is generated.

Which is why vibe coding is so common: if you actually care about quality, LLMs are a near-endless time sink.
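To make the arithmetic in this comment concrete, here is a minimal sketch. The function name `total_review_minutes`, the 5-minute re-prompt time, and the 20 regenerations are illustrative assumptions, not figures from the thread; only the two 10-minute figures come from the comment.

```python
def total_review_minutes(review_min: float, reprompt_min: float,
                         verify_min: float, generations: int) -> float:
    """Total human time when every regeneration needs its own review cycle."""
    # One cycle = review the output + change the LLM input + verify the fix worked.
    per_cycle = review_min + reprompt_min + verify_min
    return per_cycle * generations


# A single 10-minute look, as the parent comment imagines it.
one_off = total_review_minutes(10, 0, 0, 1)

# Assumed numbers: 5 minutes to re-prompt, code regenerated 20 times.
repeated = total_review_minutes(10, 5, 10, 20)

print(one_off, repeated)  # 10 500
```

Under these assumed numbers the "10 minutes" grows to more than eight hours, which is the point the commenter is making: the cost scales with how often the code is regenerated, not with how long one review takes.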

View on HN · Topics

So far, when Claude pops out a schema it's pretty spot on, iff you've described the problem correctly.

What the article's author seems to be hinting at is that the problem was described incorrectly from day one, and the LLM picked the wrong schema from day one. Because the person making it is not technically literate enough to describe the problem in a way an LLM interpreted correctly.

The hidden BA work a developer usually does was missing from the process.

View on HN · Topics

If you have a codebase that big, can you even fit enough of it into a context window for the LLM to make correct and meaningful changes across all of it? Admittedly I've only used LLM-based coding for smaller projects.

View on HN · Topics

I’m an LLM enjoyer who also thinks that ‘er ‘jerbs are safe and, taken to their logical conclusion, most LLM-stroking online around coding reduces to an argument that we should be speaking Haskell to LLMs and also in specs and documentation (just kidding, OCaml is prettier). But also, I do a little business.

You’ve hit the real issue, IT management is D-tier and lacks self awareness. “Agile” is effed up as a rule, while also being the simplest business process ever.

That juniors and fakers are whole hog on LLMs is understandable to me. Hype, fashion, and BS are always potent. The part I still cannot understand, as an Executive in spirit: when there is a production issue, and one of these vibes monkeys you are paying has to fix it, how could you watch them copy and paste logs into a service you’re top dollar paying for, over and over, with no idea of what they’re doing, and also not be on your way to jail for highly defensible manslaughter?

We don’t pay mechanics to Google “how to fix car”.

View on HN · Topics

This is definitely ¾ of what you pay a mechanic to do; 1 publisher writes a maintenance manual for a car; mechanics all around the globe can use that to work on that specific car.

It's the mechanics that don't reference Google or the Haynes manual that are more likely to get it incorrect.

As a kicker, mechanics also have a pricing book for the task, they know how many hours a task will take on a certain car (rounded up for the most part).

View on HN · Topics

You are not responding faithfully to the comment. A mechanic looking up the schematics in a manual understands them. Just because they haven't memorized the material does not make it the same. This is more analogous to looking up a function in the documentation that you forgot about.

This is clearly not what the post was referring to, which is instead like googling how to fix a pipe in your home when you've never done any plumbing before in your life. Can it work out? Sure, depends on the issue, can you cause your pipes to freeze, your house to flood, or sediment build up to completely block a pipe? Yes.

View on HN · Topics

Speaking not as a professional mechanic, but as someone who maintains a car, two trucks, a tractor, a couple boats, and has googled quite a lot of torque specs in my time... If you're googling torque specs in 2026 you're gonna have a bad time. They're frequently just flat out wrong, especially the AI summaries ;). Use the authoritative source of truth--the shop manual published by the equipment manufacturer. Accept no substitutes.

View on HN · Topics

Yeah Bentley (and in some cases Haynes) make good aftermarket manuals too. And you can find good information on some forums. But you can also find a lot of bad information. Reliably sifting the good from bad only comes with experience--much like in software.

View on HN · Topics

It would be horrible to rewrite. Not the first commit or whatever. But after a few weeks of people not reading the code it looks more like a write only code base. I refused to go full vibe/agentic coding. So I got to see what was happening. This was only over a short period of time mind you.

There were a lot of duplicate and triplicate methods. A lot of the classes were is-a related without inheritance, not the biggest deal but it was becoming a mess.

Code I used to know well was more or less gone. It was rewritten in a way that wasn't the same approach and had lost lessons learned. Some of it had real battle wounds baked into it. Things qa passed the week before were broken in places no one thought they touched. A good deal of tests were useless or didn't mean anything for production.

Code review is more or less impossible for me. I can read maybe a 1k line change. 20-30k changes all the time? You end up saying "sure buddy lgtm". We had someone put a 200kloc change for a new feature using a 3rd party tool no one had used before. No clue, but it was not my business apparently because we needed to be more individuals now that we were using AI

View on HN · Topics

There is perhaps _some_ truth to this, long term. But I think it’s way too early to remove all the QA.

View on HN · Topics

Yes I get your frustration, the same thing is happening across orgs these days as claude and co-work has become widespread.

Wisdom is a thing, so is competence. Humans have it or they don't but machines do not (yet), but the massive capabilities of the tools are also something that can't be ignored.

We can't throw the baby out with the bathwater. It's going to take some cycles of learning the ropes with this technology for humans to understand it better.

I would push back: why couldn't the senior devs communicate these issues to senior management? It sounds like a broken human system, not a broken tool or technology. All AI did was shine a light on the human issues in that org.

View on HN · Topics

> intelligent autocomplete

I'm curious how much value others are finding in this. Personally I turned it off about a year ago and went back to traditional (jetbrains) IDE autocomplete. In my experience the AI suggestions would predict exactly what I wanted < 1% of the time, were useful perhaps 10% of the time, and otherwise were simply wrong and annoying. Standard IDE features allowing me to quickly search and/or browse methods, variables, etc. are far more useful for translating my thoughts into code (i.e. minimizing typing).

View on HN · Topics

Even worse, I've seen the JetBrains AI auto-complete insert hard-to-spot bugs, like two nested for loops with i and j for loop index variables, where the inner loop was fairly complex and incorrectly used i instead of j in one place.
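A hypothetical illustration of the kind of index mix-up described here (this is not the commenter's actual code; `count_pairs_buggy` and `count_pairs` are made-up names). The buggy version uses `i` where `j` was intended, and a weak test with identical values does not catch it.

```python
def count_pairs_buggy(values, target):
    """Intended to count pairs (i < j) whose values sum to target."""
    count = 0
    for i in range(len(values)):
        for j in range(i + 1, len(values)):
            # Bug: the second operand should be values[j], not values[i].
            if values[i] + values[i] == target:
                count += 1
    return count


def count_pairs(values, target):
    """Correct version: counts pairs (i < j) whose values sum to target."""
    count = 0
    for i in range(len(values)):
        for j in range(i + 1, len(values)):
            if values[i] + values[j] == target:
                count += 1
    return count


# With identical values the two versions agree, so a lazy test passes.
print(count_pairs_buggy([2, 2, 2], 4), count_pairs([2, 2, 2], 4))  # 3 3

# With distinct values the bug shows up.
print(count_pairs_buggy([1, 2, 3, 4], 5), count_pairs([1, 2, 3, 4], 5))  # 0 2
```

The one-character difference is easy to miss in review, especially when the inner loop body is long and the accompanying tests happen to use inputs that hide it.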

View on HN · Topics

ouch, sounds like your manager is more a problem than the llm review!

i find it as a good backstop to catch dumb mistakes or suggest alternatives but is not a replacement for human review (we require human review but llm suggestions are always optional and you're free to ignore)

View on HN · Topics

On troubleshooting, either LLMs used to be better, or I'm in a huge bad luck streak. All of the last few times I tried to ask one, I've got a perfectly believable and completely wrong answer that wasn't even on the right subject.

On code review, the amount of false positives is absolutely overwhelming. And I see no reason for that to improve.

But yes, LLMs can probably help on those lines.

View on HN · Topics

I've found them super hit or miss for debugging. I've gone down several rabbit holes where the LLM wasted hours of my time for a simple fix. On the other hand, they're awesome for ripping through thousands of log lines and then correlating it to something dumb happening in your codebase. My modus operandi with them for debugging is basically "distrust but consider". I'll let one of them rip in the background while I go and debug myself, and if they can find the solution, great, if not, well, I haven't spent much effort or time trying to convince them to find the problem.

View on HN · Topics

Even generating a first-pass of the eventual production code that you can step back and review is useful to get ideas, so long as you guard yourself against laziness of going with the first answer it provides

View on HN · Topics

The right use of AI requires stellar leadership, and to be honest, I don't think that kind of leadership exists. I am using AI just for myself, and the traps and pitfalls I encounter are so many. For example, I generate an article on a topic, and while this is very useful to get started, I then have to go through every sentence because AI makes some overconfident statements that are just not true in this form. This is still very helpful, because then I have to think about why they are not true. But I don't see how that can ever scale, how would I know that colleagues are also diligent like this?

AI is incredible in three scenarios: a) what I just described, to get you started, b) to generate artifacts that can be rigorously checked (and I don't mean tests, I mean proofs), c) where your artifacts don't have a meaningful notion of correctness, like a work of art.

c) is a matter of taste, b) certainly scales, but a) is where I think trust will be essential, and I am not ready to trust anyone with that except myself.

Oh, and I think currently, c) is applied to software engineering, by people who cannot distinguish the engineering from the art part of software. Which is just funny right now, and will eventually be catastrophic.

View on HN · Topics

It's also wrong advice. After an LLM produces code, asking it if it's correct (in a variety of other ways) can often find actual problems with it.

View on HN · Topics

Also, all code is wrong in the wrong context, all code is right in the right context, the reason AI cannot one shot a complete architecture is that it's not a defined and possible task - if you fully specify the architecture the AI isn't designing anything, and if you don't fully specify the architecture how is the AI going to resolve ambiguity without either guessing, asking questions to make you do the necessary work, or refusing to work until it's fully specified?

AI is a stochastic process, it's more like finding the answer to a particular problem using simulated annealing, a genetic algorithm, or a constrained random walk. It's been trained on code well enough that there's a high density probability field around the kinds of code you might want, and that's what you see often - middle of the road solutions are easy to one shot.

But if you have very specific requirements, you're going to quickly run into areas of the probability cloud that are less likely, some so unlikely that the AI has no training data to guide it, at which point it's no better than generating random characters constrained by the syntax of the language unless you can otherwise constrain the output with some sort of inline feedback mechanism (LSP, test, compiler loops, linters, fuzzers, prop testing, manual QA, etc etc).
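A minimal sketch of the "inline feedback mechanism" idea, assuming a stand-in list of candidate snippets in place of real LLM output. The names `passes_checks`, `first_verified`, and `my_sort` are hypothetical. Each candidate must compile (the compiler-loop gate) and then survive fixed and randomized property checks before it is accepted.

```python
import random
from typing import Iterable, Optional


def passes_checks(source: str) -> bool:
    """Accept a candidate only if it compiles and sorts every test input correctly."""
    try:
        code = compile(source, "<candidate>", "exec")  # syntax gate
    except SyntaxError:
        return False
    namespace: dict = {}
    try:
        # Note: exec on untrusted code is unsafe; a real setup would sandbox this.
        exec(code, namespace)
        sort_fn = namespace["my_sort"]
        rng = random.Random(0)
        cases = [[3, 1, 2], [], [5, 5, 1]]  # fixed cases
        cases += [[rng.randint(-50, 50) for _ in range(rng.randint(0, 10))]
                  for _ in range(200)]      # randomized property-style cases
        for data in cases:
            if sort_fn(list(data)) != sorted(data):
                return False
    except Exception:
        return False
    return True


def first_verified(candidates: Iterable[str]) -> Optional[str]:
    """Return the first candidate that survives the checks, or None."""
    for source in candidates:
        if passes_checks(source):
            return source
    return None


# Stand-ins for successive model outputs (assumed, not real LLM responses).
candidates = [
    "def my_sort(xs) return xs",                 # does not compile
    "def my_sort(xs):\n    return xs",          # compiles, wrong on unsorted input
    "def my_sort(xs):\n    return sorted(xs)",  # passes every check
]
chosen = first_verified(candidates)
print(chosen == candidates[2])  # True
```

The checks only constrain what they measure: a candidate that passes is consistent with these properties, not proven correct, which is why the comment lists manual QA alongside the automated gates.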

View on HN · Topics

I have to produce a great deal of documentation at work for our customers, most of it regulatory and compliance assessments.

Some of the sources I need to use come from agencies in the government or working with the government and are often over a thousand pages long.

So AI has been incredibly helpful here because a lot of what I need to do is map this huge bureaucratic set of guidelines and policies to each customer’s particular situation.

Aware of the sloppy nature of LLMs I created my own workflow that resembles more coding than document drafting.

I use Codex, VSCode and plain markdown, I don’t use MS Word or Copilot like all my other colleagues.

I invest a great deal of time still doing manual labor like researching and selecting my sources, which I then make available for Codex to use as its single source of truth.

I start with a skill that generates the outline, which is often longer than it should be. Sometimes I get, say, an 18-section outline and I ask Codex to cut it in half. Then I ask for a preliminary draft of each section (each in a separate markdown file) and read through and update as necessary, before I ask the agent to develop each section in full, then proofread and update again.

When I’m satisfied I merge all the sections into one single markdown and run another skill to check for repetition, ambiguity, length, etc and usually a few legitimate improvements are recommended.

The whole process can still take me several days to produce a 20-30 page compliance document, which gets read, verified and approved by myself and others in my team before it goes out.

The productivity gains are pretty obvious, but most importantly I think the content is of better quality for the customer.

View on HN · Topics

There was a hidden benefit in the old way: it avoided people making effort for things that weren't important. It took effort to make signal cut through noise. When it was low effort, it was obvious it was just noise and could easily be ignored.

Now low effort noise can masquerade as high effort signal, drowning out the signal for things that actually matter.

Direct relationships of trust matter more than ever now. You can't just trust that if something looks high effort that it actually is. You need to know the person producing it and know how they approach work and how they treat you personally. Do they cut corners all the time or only for reasons they clearly communicate? Do they value high quality work? Do they respect your time?

View on HN · Topics

Human mistakes in code usually have reasoning behind them. You can understand how the engineer made the mistake.

AI mistakes aren't like this, mistakes look like someone was lobotomized mid coding.

View on HN · Topics

The most productive people seem to be the ones who are skeptical of AI but found compelling cases to use them for and aren't afraid to correct them.

View on HN · Topics

Using LLMs/agents feels like bowling with bumpers but I'm the bumpers.

View on HN · Topics

It’s like walking a dog that keeps pulling off the path

View on HN · Topics

While I’m not disagreeing, if you ask the LLM to critique something, it will try very hard to find something to critique, regardless of how little it might be warranted. The important thing is that you have to remain the competent judge of its output.

View on HN · Topics

There is always a chance that the LLM will hallucinate something wrong. It's all probabilities, quite possibly the closest thing to quantum mechanics in action that we have at the macro level. The act of receiving information from an LLM collapses its state, which was heretofore unknown.

However, your actions can certainly influence those probabilities.

> If asked properly, LLMs can be used to poke holes in an existing reasoning or come up with new ideas or things to explore.

At the most basic level, LLMs are prediction engines, and one of the things they really, really want (OK, they don't "want", but one of the things they are primed to do) is to respond with what they have predicted you want to see.

Embedding assertions in your prompt is either the worst thing you can do, or the best thing you can do, depending on the assertions. The engine will typically work really hard to generate a response that makes your assertion true.

This is one reason why lawyers keep getting dinged by judges for citations made up from whole cloth. "Find citations that show X" is a command with an embedded assertion. Not knowing any better, the LLM believes (to the extent such a thing is possible) that the assertion you made is true, and attempts to comply, making up shit as it goes if necessary.

View on HN · Topics

> never ask a model for confirmation or encouragement; but you can absolutely ask it to critique something, and that's often of value.

What's the difference? The end result is equally unreliable.

In either case, the value is determined by a human domain expert who can judge whether the output is correct or not, in the right direction or not, if it's worth iterating upon or if it's going to be a giant waste of time, and so on. And the human must remain vigilant at every step of the way, since the tool can quickly derail.

People who are using these tools entirely autonomously, and give them access to sensitive data and services, scare the shit out of me. Not because the tool can wipe their database or whatnot, but because this behavior is being popularized, normalized, and even celebrated. It's only a matter of time until some moron lets it loose on highly critical systems and infrastructure, and we read something far worse than an angry tweet.

View on HN · Topics

It's not AI that's scary, it's people using it in fields they don't know and then defending wrong outputs like they built them themselves.

View on HN · Topics

That's a very good reversal of the horse-riding analogy. But you might still be making an assumption that the horse package doesn't come with a weapon. It might boil down to saying "AI cannot achieve the skills of a senior engineer" - which might not have a strong basis.

View on HN · Topics

i too find lots of value in llms but your example describes a scenario a programmer could have also easily solved and maybe even have written correctly in the first or second shot.

that isn't to say an llm can't be useful but your post implies it's inevitable that llms will replace humans entirely from writing code, which i think is incredibly optimistic at best.

that said we will see!

View on HN · Topics

just last week AI led a developer on our team to brick our git history when he was attempting to fix a deploy. he's not a git expert but an llm should not have led him that far astray, no?

i see on a weekly basis where if an llm was left to do what its initial direction was without human oversight it would have broken otherwise working programs

View on HN · Topics

Great article. Hits on many points that resonate with my experience.

The skin in the game one, in particular, is something I've been thinking about. People have been telling me LLMs are "more intelligent" than "average people". But it's easy to sound intelligent when you have no skin in the game. People have to stand by their word and suffer the consequences of their actions. It's not enough just to sound intelligent.

It seems appropriate also to share an anecdote of an incident that recently happened in my job. A colleague submitted some code for review, quite a lot of it. A second colleague reviewed and questioned a piece of code. Rather than answer the question with a justification, the question was taken rhetorically and the code was removed. The code then failed in production because the removed code was, in fact, necessary. The LLM obviously "knew" this, but neither colleague did. It's leading me to introduce a "no rhetorical questions in code review" rule. The submitter must be able to justify every line of code they submit.
