llm/5888b8dc-b96e-4444-9c3c-465dde409e92/topic-19-11d69960-6366-4ea7-83ee-f158b0e5f5db-input.json
You are a comment summarizer. Given a topic and a list of comments tagged with that topic, write a single paragraph summarizing the key points and perspectives expressed in the comments.

TOPIC: AI hallucinations and reliability

COMMENTS:
1. So far, each and every time I used an LLM to help me with something, it hallucinated nonexistent functions or was incorrect in an important but non-obvious way. Though, I guess I do treat LLMs as a last-resort long shot for when other documentation is failing me.
2. "You're holding it wrong." 99% of an LLM's usefulness vanishes if it behaves like an addled old man. "What's that, sonny? But you said you wanted that!" "Wait, we did that last week? Sorry, let me look at this again." "What? What do you mean, we already did this part?!"
3. Which LLMs have you tried? Claude Code seems to be decent at not hallucinating; Gemini CLI is more eager. I don't think current LLMs take you all the way, but a powerful code generator is a useful thing; just assemble guardrails and keep an eye on it.
4. A simple "how do I access x in y framework in the intended way" shouldn't require any more context. Instead of telling me about z option, it keeps hallucinating something that doesn't exist and even says it's in the docs when it isn't. Literally just wasting my time.
5. As long as what it says is reliable and not made up.
6. Google, Facebook, Amazon, Microsoft... they literally all have vibe-coded code; it's not about vibe-coded or not, it is about how well the code is designed, efficient, and bug-free. Of course pro coders can debug it and fix it better than some amateur coder, but LLMs are still so valuable. I let Gemini vibe code little web projects for me and it serves me well. Although you have to explain everything step by step to it, and sometimes when it fixes one bug, it accidentally introduces another. But we fix bugs together and learn together. And by the way, when Gemini fixes bugs, it puts comments in the code on how the particular bug was fixed.
7. Exactly!
...If the printing press spouted gibberish every 9 words.
8. That was LLMs in 2023.
9. Or, given that OP is presumably a developer who just doesn't focus fully on front-end code, they could skip straight to checking MDN for "center div" and get a How To article ( https://developer.mozilla.org/en-US/docs/Web/CSS/How_to/Layo... ) as the first result without relying on spicy autocomplete. Given how often people acknowledge that AI slop needs to be verified, it seems like a shitty way to achieve something like this vs. just checking it yourself with well-known, good reference material.
10. That's a lot of words to say "trust me bruh," which is kind of poetic given that's the entire model (no pun intended) that LLMs work on.
11. Yes, I worry about this quite a bit. Obviously nobody knows yet how it will shake out, but what I've been noticing so far is that brand recognition is becoming more important. This is obviously not a good thing for startup yokels like me, but it does provide an opportunity for quality and brand building. The initial creation and generation is indeed much easier now, but testing, identifying, and fixing bugs is still very much a process that takes some investment and effort, even when AI-assisted. There is also considerable room for differentiation among user flows and the way people interact with the app. AI is not good at this yet, so the prompter needs to be able to identify and direct these efforts. I've also noticed in some of my projects, even ones shipped into production in a professional environment, that there are lots of hard-to-fix and mostly annoying bugs that just aren't worth it, or that take so much research and debugging effort that we eventually gave up and accepted the downsides. If you give the AI enough guidance to know what to hunt for, it is getting pretty good at finding these things.
Often the suggested fix is a terrible idea, but the AI will usually tell you enough about what is wrong that you can use your existing software engineering skills and experience to figure out a good path forward. At that point you can either fix it yourself or prompt the AI to do it. My success rate doing this is still only at about 50%, but that's half the bugs that we used to live with that we no longer do, which in my opinion has been a huge positive development.
12. What "AI" are you speaking of? All the current leading LLMs I know of will _not_ do this (i.e., web search for latest libraries) unless you explicitly ask.
13. No, when we write code it has an absolute and specific meaning to the compiler. When we write words to an LLM, they are written in a non-specific informal language (usually English) and processed non-deterministically too. This is an incredibly important distinction that makes coding, and asking the LLM to code, two completely different ball games. One is formal, one is not. And yes, this isn’t a new phenomenon.
14. As you get deeper beyond the starter and bootstrap code, it definitely takes a different approach to get value. This is in part because of the context limits of large code bases and because the knowledge becomes more specialized and the LLM has no training on that kind of code. But people are making it work; it just isn't as black and white.
15. There is no x because LLM performance is non-deterministic. You get slop out at varying degrees of quality, and so your job shifts from writing to debugging.
16. One of my favorite engineers calls AI a "wish fulfillment slot machine."
17. A year or so ago I was seriously thinking of making a series of videos showing how coding agents were just plain bad at producing code. This was based on my experience trying to get them to do very simple things (e.g. a five-pointed star, or text flowing around the edge of a circle, in HTML/CSS).
They still tend to fail at things like this, but I've come to realize that there are whole classes of adjacent problems they're good at, and I'm starting to leverage their strengths rather than get hung up on their weaknesses. Perhaps you're not playing to their strengths, or just haven't cracked the code for how to prompt them effectively? Prompt engineering is an art, and slight changes to prompts can make a big difference in the resulting code.
18. I think it depends what you are doing. I’ve had Claude write the front end of a Rust/React app and it was 10x if not x (because I just wouldn’t have attempted it). I’ve also had it write the documentation for a low-level crate - work that needs to be done for the crate to be used effectively - but which I would have half-arsed, because who likes writing documentation? Recently I’ve been using it to write some async Rust and it just shits the bed. It regularly codes the select! drop issue or otherwise completely fails to handle waiting on multiple things. My prompts have gotten quite sweary lately. It is probably 1x or worse. However, I am going to try formulating a pattern with examples to stuff in its context and we’ll see. I view the situation as a problem to be overcome, not an insurmountable failure. There may be places where an AI just can’t get it right: I wouldn’t trust it to write the clever bit tricks I’m doing elsewhere. But even there, it writes (most of) the tests and the docs. On the whole, I’m having far more fun with AI, and I am at least 2x as productive, on average. Consider that you might be stuck in a local (very bad) maximum. They certainly exist, as I’ve discovered. Try some side projects, something that has lots of existing examples in the training set. If you wanted to start a Formula 1 team, you’re going to need to know how to design a car, but there’s also a shit ton of logistics - like getting the car to the track - that an AI could just handle for you.
Find boring but vital work the AI can do because, in my experience, that’s 90% of the work.
19. I feel like I can manage the entire stack again - with confidence. I have less confidence after a session; now I second-guess everything, and it slows me down because I know the foot-gun is in there somewhere. For example, yesterday Gemini started adding garbage Unicode and then diagnosed file corruption, which it failed to fix. And before you reply, yes, it's my fault for not adding "URGENT CRITICAL REQUIREMENT: don't add rubbish Unicode" to my GEMINI.md.
20. > I feel like I can manage the entire stack again - with confidence. By not managing anything? Ignorance is bliss, I guess. I understand it. I've found myself looking at new stacks and tech, not knowing what I didn't know, and wondering where to start. But if you skip these fundamentals of the modern dev cycle, what happens when the LLM fails?
21. I'm trying to catch up with AI, but it's difficult because most articles I find are kinda vague and there is a lack of clear examples. It's always about prompting or how AI "is great" yadi yada, but hardly any step-by-step examples. I can easily ask Gemini CLI to produce code, for example. But how to work with AI in an existing codebase isn't obvious at all. It also seems that for any serious use you need a paid subscription? It seems like the free models just can't handle large codebases.
22. It's good that tools create the OP's positive feeling about being on top of the full Web stack again. I just wish the tool that provides that feeling were a deterministic front-end code generator built from software technology and software engineering insights and not a neural network utilizing a pseudo-random number generator...
23. We need better chatbots to fix the bugs from the current chatbots that fixed the bugs from the previous chatbots when they fixed the bugs from the previous generation of chatbots that...
Just give Sam Altman more and more of your money and he’ll make a more advanced chatbot to fix the chatbot he sold you that broke everything. You don’t even need to own a computer, just install an app on your phone to do it all. It doesn’t matter that regular people have been completely priced out of personal computing when GPT is just gonna do all the computing anymore anyway. Clearly a sustainable way forward for the industry.

Write a concise, engaging paragraph (3-5 sentences) that captures the main ideas, notable perspectives, and overall sentiment of these comments regarding the topic. Focus on the most interesting and representative viewpoints. Do not use bullet points or lists - write flowing prose.
AI hallucinations and reliability
23