The following is content for you to summarize. Do not respond to the comments—summarize them.

<topic> Trust and Hallucination Risks # Skepticism regarding the reliability of AI, highlighted by examples like 'wind-powered cars' or bad recipes. Users argue that because LLMs predict tokens rather than understanding physics or logic, they are 'confidently stupid' and require expert humans to filter out hallucinations, making them dangerous for those lacking deep domain knowledge. </topic>

<comments_about_topic>

1. "When was the last time you reviewed the machine code produced by a compiler?" Compilers will produce working output given working input literally 100% of the time in my career. I've never personally found a compiler bug. Meanwhile AI can't be trusted to give me a recipe for potato soup. That is to say, I would under no circumstances blindly follow the output of an LLM I asked to make soup, while I have, every day of my life, gladly sent all of the compiler output to the CPU without ever checking it. The compiler metaphor is simply incorrect, and people trying to say LLMs compile English into code insult compiler devs and English speakers alike.

2. Sure, Bitcoin is at least deterministic, but IMO (and that of many in the finance industry) it's solving entirely the wrong problem - in practice people want trust and identity in transactions much more than they want distribution and trustlessness. In a similar way LLMs seem to me to be solving the wrong problem - an elegant and interesting solution, but a solution to the wrong problem (how can I fool humans into thinking the bot is generally intelligent), rather than the right problem (how can I create a general intelligence with knowledge of the world). It's not clear to me we can jump from the first to the second.

3. This argument is disingenuous and distracts from the point rather than addressing it. Yes, it is possible for a compiler to have a bug. No, that is in no way analogous to AI producing buggy code. I've experienced maybe two compiler bugs in my twenty-year career. I have experienced countless AI mistakes already - hundreds? Thousands? These are not the same, and it has the whiff of sales patter trying to address objections. Please stop.

4. If natural language is used to specify work to the LLM, how can the output ever be trusted? You'll always need to make sure the program does what you want, rather than what you said.

5. You trust your natural language instructions a thousand times a day. If you ask for a large black coffee, you can trust that is more or less what you'll get. Occasionally you may get something so atrocious that you don't dare to drink it, but generally speaking you trust the coffee shop knows what you want. If you insist on a specific amount of coffee brewed at a specific temperature, however, you need tools to measure. AI tools are similar. You can trust them because they are good enough, and you need a way (testing) to make sure what is produced meets your specific requirements. Of course they may fail for you; that doesn't mean they aren't useful in other cases. All of that is simply common sense.

6. > Meanwhile AI can't be trusted to give me a recipe for potato soup.
This just isn't true any more. Outside of work, my most common use case for LLMs is probably cooking. I used to frequently second-guess them, but no longer - in my experience SOTA models are totally reliable for producing good recipes. I recognize that at a higher level we're still talking about probabilistic recipe generation vs. deterministic compiler output, but at this point it's nonetheless just inaccurate to act as though LLMs can't be trusted with simple (e.g. potato soup recipe) tasks.
7. This is obviously beside the point, but I did blindly follow a wiener schnitzel recipe ChatGPT made me and cooked for a whole crew. It turned out great. I think I got lucky though; the next day I absolutely massacred the pancakes.

8. Recent experiments with LLM recipes (ChatGPT): it missed salt in a recipe to make rice, then flubbed whether that type of rice was recommended to be washed in the recipe it was supposedly summarizing (and lied about it, too)… Probabilistic generation will be weighted towards the means in the training data. Do I want my code looking like most code most of the time in a world full of Node.js and PHP? Am I better served by rapid delivery from a non-learning algorithm that requires eternal vigilance and critical re-evaluation, or by slower delivery with a single review filtered through a meatspace actor who will build out trustable modules in a linear fashion with known failure modes already addressed by process (i.e. TDD, specs, integration & acceptance tests)? I'm using LLMs a lot, but can't shake the feeling that the TCO and total time shake out worse than it feels as you go.

9. I genuinely admire your courage and willingness (or perhaps just chaos energy) to attempt both wiener schnitzel and pancakes for a crew, based on AI recipes, despite clearly limited knowledge of either.

10. That is not the issue - any potato soup recipe would be fine. The issue is that it might fetch values from different recipes and give you an abomination.

11. This exactly. I cook as a passion, and LLMs just routinely, very clearly, (weighted) "average" together different recipes to produce, in the worst case, disgusting monstrosities or, in the best case, just a near-replica of some established site's recipe.

12. Maybe. But it's been 3 years and it still isn't good enough to actually trust. That doesn't raise confidence that it will ever get there.

13. Reasoning by analogy is usually a bad idea, and nowhere is this worse than in talking about software development. It's just not analogous to architecture, or cooking, or engineering. Software development is just its own thing. So you can't use analogy to get yourself anywhere with a hint of rigour. The problem is, AI is generating code that may be buggy, insecure, and unmaintainable. We have as a community spent decades trying to avoid producing that kind of code. And now we are being told that productivity gains mean we should abandon those goals and accept poor quality, as evidenced by MoltBook's security problems. It's a weird cognitive dissonance, and it's still not clear how this gets resolved.

14. > The ai tooling reverses this where the thinking is outsourced to the machine and the user is borderline nothing more than a spectator, an observer and a rubber stamp on top.
I find it a bit rare that this is the case though. Usually I have to carefully review what it's doing and guide it, either by specific suggestions, or by specific tests, etc. I treat it as a "code writer" that doesn't necessarily understand the big picture. So I expect it to fuck up, and correcting it feels far less frustrating if you consider it a tool you are driving rather than letting it drive you. It's great when it gets things right, but even then it's you that is confirming this.
15. I skimmed over it, and didn't find any discussion of:
- Pull requests
- Merge requests
- Code review
I feel like I'm taking crazy pills. Are SWEs supposed to move away from code review, one of the core activities of the profession? Code review is as fundamental to SWE as double entry is to accounting. Yes, we know that functional code can get generated at incredible speeds. Yes, we know that apps and whatnot can be bootstrapped from nothing by "agentic coding". We need to read this code, right? How can I deliver code to my company without security and reliability guarantees that, at their core, come from me knowing what I'm delivering line-by-line?

16. Either really comprehensive tests (that you read) or read it. Usually I find you can skim most of it, but in core sections like billing or something you gotta really review it. The models still make mistakes.

17. You can't skim over AI code. For even mid-level tasks it will make bad assumptions, like sorting orders or timezone conversions. Basic stuff, really. You've probably got a load of ticking time bomb bugs if you've just been skimming it.
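As a concrete illustration of the bug class comment 17 names - a minimal hypothetical sketch, not code from any commenter - a timezone comparison can read fine on a skim while silently mixing naive and UTC timestamps:

```python
# Hypothetical sketch of the "bad timezone assumption" failure mode from
# comment 17. The function names are invented for illustration.
from datetime import datetime, timezone

def is_expired_skims_fine(deadline: datetime) -> bool:
    # Looks plausible on review, but utcnow() returns a naive datetime:
    # if the caller built `deadline` in local time, the comparison is
    # silently off by the full UTC offset.
    return datetime.utcnow() > deadline

def is_expired(deadline: datetime) -> bool:
    # Safer version: insist on an aware datetime and compare in UTC.
    if deadline.tzinfo is None:
        raise ValueError("deadline must be timezone-aware")
    return datetime.now(timezone.utc) > deadline
```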
18. The Death of the "Stare": Why AI's "Confident Stupidity" is a Threat to Human Genius
OPINION | THE REALITY CHECK
In the gleaming offices of Silicon Valley and the boardrooms of the Fortune 500, a new religion has taken hold. Its deity is the Large Language Model, and its disciples—the AI Evangelists—speak in a dialect of "disruption," "optimization," and "seamless integration." But outside the vacuum of the digital world, a dangerous friction is building between AI's statistical hallucinations and the unyielding laws of physics. The danger of Artificial Intelligence isn't that it will become our overlord; the danger is that it is fundamentally, confidently, and authoritatively stupid.
The Paradox of the Wind-Powered Car
The divide between AI hype and reality is best illustrated by a recent technical "solution" suggested by a popular AI model: an electric vehicle equipped with wind generators on the front to recharge the battery while driving. To the AI, this was a brilliant synergy. It even claimed the added weight and wind resistance amounted to "zero." To any human who has ever held a wrench or understood the First Law of Thermodynamics, this is a joke—a perpetual motion fallacy that ignores the reality of drag and energy loss. But to the AI, it was just a series of words that sounded "correct" based on patterns. The machine doesn't know what wind is; it only knows how to predict the next syllable.
The Erosion of the "Human Spark"
The true threat lies in what we are sacrificing to adopt this "shortcut" culture. There is a specific human process—call it The Stare. It is that thirty-minute window where a person looks at a broken machine, a flawed blueprint, or a complex problem and simply observes. In that half-hour, the human brain runs millions of mental simulations. It feels the tension of the metal, the heat of the circuit, and the logic of the physical universe. It is a "Black Box" of consciousness that develops solutions from absolutely nothing—no forums, no books, and no Google. However, the new generation of AI-dependent thinkers views this "Stare" as an inefficiency. By outsourcing our thinking to models that cannot feel the consequences of being wrong, we are witnessing a form of evolutionary regression. We are trading hard-earned competence for a "Yes-Man" in a box.
The Gaslighting of the Realist
Perhaps most chilling is the social cost. Those who still rely on their intuition and physical experience are increasingly being marginalized. In a world where the screen is king, the person pointing out that "the Emperor has no clothes" is labeled as erratic, uneducated, or naive. When a master craftsman or a practical thinker challenges an AI's "hallucination," they aren't met with logic; they are met with a robotic refusal to acknowledge reality. The "AI Evangelists" have begun to walk, talk, and act like the models they worship—confidently wrong, devoid of nuance, and completely detached from the ground beneath their feet.
The High Cost of Being "Authoritatively Wrong"
We are building a world on a foundation of digital sand. If we continue to trust AI to design our structures and manage our logic, we will eventually hit a wall that no "prompt" can fix. The human brain runs on 20 watts and can solve a problem by looking at it. The AI runs on megawatts and can't understand why a wind-powered car won't run forever. If we lose the ability to tell the difference, we aren't just losing our jobs—we're losing our grip on reality itself.

19. "Become better at intuiting the behavior of this non-deterministic black box oracle maintained by a third party" just isn't a strong professional development sell for me, personally. If the future of writing software is chasing what a model trainer has done, with no ability to actually change that myself, I don't think that's going to be interesting to nearly as many people.

20. And yet the comments are chock full of cargo-culting about different moods of the oracle and ways to get better output.

21. Since they started doing that, it's gained a lot of bugs.

22. Exactly. The LLMs are quite good at "code inpainting", e.g. "give me the outline/constraints/rules and I'll fill in the blanks", but not so good at making (robust) new features out of the blue.

23. This matches my experience, especially "don't draw the owl" and the harness-engineering idea. The failure mode I kept hitting wasn't just "it makes mistakes", it was drift: it can stay locally plausible while slowly walking away from the real constraints of the repo. The output still sounds confident, so you don't notice until you run into reality (tests, runtime behaviour, perf, ops, UX). What ended up working for me was treating chat as where I shape the plan (tradeoffs, invariants, failure modes) and treating the agent as something that does narrow, reviewable diffs against that plan. The human job stays very boring: run it, verify it, and decide what's actually acceptable. That separation is what made it click for me. Once I got that loop stable, it stopped being a toy and started being a lever. I've shipped real features this way across a few projects (a git-like tool for heavy media projects, a ticketing/payment flow with real users, a local-first genealogy tool, and a small CMS/publishing pipeline). The common thread is the same: small diffs, fast verification, and continuously tightening the harness so the agent can't drift unnoticed.
24. > The failure mode I kept hitting wasn't just "it makes mistakes", it was drift: it can stay locally plausible while slowly walking away from the real constraints of the repo. The output still sounds confident, so you don't notice until you run into reality (tests, runtime behaviour, perf, ops, UX).
Yeah, I would get patterns where initial prototypes were promising, then we developed something that was 90% close to design goals, and then, as we tried to push in the last 10%, drift would start breaking down, or even just forgetting, the 90%. So I would get to 90% and basically start a new project with that as the baseline to add to.

25. I think you are right: the secret is that there is no secret. The projects I have been involved with that were most successful used these techniques. I also think experience helps, because you develop a sense that very quickly knows if the model wants to go in a wonky direction and what a good spec looks like. With where the models are right now, you still need a human in the loop to make sure you end up with code you (and your organisation) actually understand. The bottleneck has gone from writing code to reading code.

26. > The bottleneck has gone from writing code to reading code.
This has always been the bottleneck. Reviewing code is much harder and gets worse results than writing it, which is why reviewing AI code is not very efficient. The time required to understand code far outstrips the time to type it. Most devs don't do thorough reviews: check the variable names seem ok, make sure there are no obvious typos, ask for a comment, and call it good. For a trusted teammate this is actually ok, and why they're so valuable! For an AI, it's a slot machine, and trusting it is equivalent to letting your coworkers/users do your job so you can personally move faster.

27. How much does it cost per day to have all these agents running on your computer? Is your company paying for it, or you? What is your process if the agent writes a piece of code - let's say a really complex recursive function - and you aren't confident you could have come up with the same solution? Do you still submit it?

28. "Nuke" is maybe too strong a word, but it has not been uncommon for me to see it trying to install specific versions of languages on my machine, or services I intentionally don't have configured, or sometimes trying to force npm when I'm using bun, etc.

29. I was using it the same way you just described, but for C# and Angular, and you're spot on. It feels amazing not having to memorize APIs and to just let the AI even push code coverage to near 100%. However, at some point I began noticing two things:
- When tests didn't work I had to check what was going on, and the LLMs do cheat a lot with Volkswagen tests, so that began to make me skeptical even of what is being written by the agents.
- When things were broken, spaghetti and awful code tends to be written in such an obnoxious way that it's beyond repair, and it made me wish I had done it from scratch.
Thankfully I just tried using agents for tests and not for the actual code, but it makes me wonder a lot whether "vibe coding" really produces quality work.
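For readers who haven't met the term in comment 29: a "Volkswagen test" is one rigged to pass regardless of what the code actually does. A minimal hypothetical sketch follows; the function, its bug, and the test names are all invented for illustration and come from no commenter's codebase:

```python
# Hypothetical sketch contrasting a "Volkswagen test" with an honest one.

def apply_discount(price: float, percent: float) -> float:
    # Deliberately buggy: divides where it should multiply.
    return price - price / (percent / 100)

def test_discount_volkswagen():
    # Rigged test: it runs the code but asserts nothing about the result,
    # so it stays green no matter how wrong the answer is.
    result = apply_discount(100.0, 10.0)
    assert isinstance(result, float)

def test_discount_honest():
    # Honest test: pins the expected value (10% off 100 is 90), so it
    # fails loudly on the bug above.
    assert apply_discount(100.0, 10.0) == 90.0
```

Run under a test runner such as pytest, the rigged test passes while the honest one fails, which is exactly the green-but-meaningless suite the commenter describes.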
30. Maybe you disagree with it, but it seems like a pretty straightforward argument: a lot of us dismiss AI because "it can't be trusted to do as good a job as me". The OP is arguing that someone who can do better than most of us disagrees with this line of thinking. And if we have respect for his abilities, and recognize them as better than our own, we should perhaps re-assess our own rationale for dismissing the utility of AI assistance. If he can get value out of it, surely we can too, if we don't argue ourselves out of giving it a fair shake. The flip side of that argument might be that you have to be a much better programmer than most of us are to properly extract value out of the AI... maybe it's only useful in the hands of a real expert.
</comments_about_topic>

Write a concise, engaging paragraph (3-5 sentences) summarizing the key points and perspectives in these comments about the topic. Focus on the most interesting viewpoints. Do not use bullet points—write flowing prose.