The following is content for you to summarize. Do not respond to the comments—summarize them.

<topic>
Human Review Requirements # Debate about whether all AI-generated code must be reviewed line-by-line. Questions about trust, liability, and whether AI can eventually be trusted without oversight.
</topic>

<comments_about_topic>
1. Agreed. The process described is much more elaborate than what I do but quite similar. I start by discussing in great detail what I want to do, sometimes asking the same question to different LLMs. Then a todo list, then manual review of the code, esp. each function signature, checking if the instructions have been followed and if there are no obvious refactoring opportunities (there almost always are). The LLM does most of the coding, yet I wouldn't call it "vibe coding" at all. "Tele coding" would be more appropriate.
2. > Why would you test implementation details
Because this has never been sufficient. From things like various hard-to-test cases to things like readability and long-term maintenance. Reading and understanding the code is more efficient and necessary for any code worth keeping around.
3. It's also great to describe the full use case flow in the instructions, so you can clearly understand that the LLM won't do some stupid thing on its own.
4. > if you don't plan perfectly, you'll have to start over from scratch if anything goes wrong.
You just revert what the AI agent changed and revise/iterate on the previous step - no need to start over. This can of course involve restricting the work to a smaller change so that the agent isn't overwhelmed by complexity.
5. Just reading that plan would take weeks or months.
6. > but in doing what I don't know as well.
Comments like these really help ground what I read online about LLMs. This matches how low-performing devs at my work use AI, and their PRs are a net negative on the team. They take on tasks they aren't equipped to handle and use LLMs to fill the gaps quickly instead of taking time to learn (which LLMs speed up!).
7. 100,000 lines is approx. one million words. The average person reads at 250 wpm. The entire thing would take 66 hours just to read, assuming you were approaching it like a fiction book, not thinking anything over.
8. i have like the faintest vague thread of "maybe this actually checks out" in a way that has shit all to do with consciousness. sometimes internet arguments get messy, people die on their hills and double / triple down on internet message boards. since historic internet data composes a bit of what goes into an llm, would it make sense that bad-juju prompting sends it to some dark corners of its training model if implementations don't properly sanitize certain negative words/phrases? in some ways llm stuff is a very odd mirror that haphazardly regurgitates things resulting from the many shades of gray we find in human qualities... but presents results as matter of fact. the amount of internet posts with possible code solutions and more where people egotistically die on their respective hills that have made it into these models is probably off the charts, even if the original content was a far cry from a sensible solution. all in all llms really do introduce quite a bit of a black box. lots of benefits, but a ton of unknowns, and one must be hypervigilant to the possible pitfalls of these things... but more importantly be self-aware enough to understand the possible pitfalls that these things introduce to the person using them.
they really possibly dangerously capitalize on everyone's innate need to want to be a valued contributor. it's really common now to see so many people biting off more than they can chew, oftentimes lacking the foundations that would've normally had a competent engineer pumping the brakes. i have a lot of respect/appreciation for people who might be doing a bit of claude here and there but are flat out forward about it in their readme and very plainly state not to have any high expectations because _they_ are aware of the risks involved here. i also want to commend everyone who writes their own damn readme.md. these things are for better or for worse great at causing people to barrel forward through 'problem solving', which is presenting quite a bit of gray area on whether or not the problem is actually solved / how can you be sure / do you understand how the fix/solution/implementation works (in many cases, no). this is why exceptional software engineers can use this technology insanely proficiently as a supplementary worker of sorts, but others find themselves in a design/architect seat for the first time and call tons of terrible shots throughout the course of what it is they are building. i'd at least like to call out that people who feel like they "can do everything on their own and don't need to rely on anyone" anymore seem to have lost the plot entirely. there are facets of that statement that might be true, but less collaboration, especially in organizations, is quite frankly the first step some people take towards becoming delusional. and that is always a really sad state of affairs to watch unfold. doing stuff in a vacuum is fun on your own time, but forcing others to just accept things you built in a vacuum when you're in any sort of team structure is insanely immature and honestly very destructive/risky. i would like to think absolutely no one here is surprised that some sub-orgs at Microsoft force people to use copilot or be fired, a very dangerous path they tread there as they bodyslam into place solutions that are not well understood. suddenly all the decisions that leadership at many companies have made to once again bring back a before-times era of offshoring work make sense: they think that with these technologies existing, the subordinate culture of overseas workers combined with these techs will deliver solutions no one can push back on. great savings and also no one will say no.
9. I've been running AI coding workshops for engineers transitioning from traditional development, and the research phase is consistently the part people skip — and the part that makes or breaks everything. The failure mode the author describes (implementations that work in isolation but break the surrounding system) is exactly what I see in workshop after workshop. Engineers prompt the LLM with "add pagination to the list endpoint" and get working code that ignores the existing query builder patterns, duplicates filtering logic, or misses the caching layer entirely. What I tell people: the research.md isn't busywork, it's your verification that the LLM actually understands the system it's about to modify. If you can't confirm the research is accurate, you have no business trusting the plan. One thing I'd add to the author's workflow: I've found it helpful to have the LLM explicitly list what it does NOT know or is uncertain about after the research phase. This surfaces blind spots before they become bugs buried three abstraction layers deep.
10. Same here. I did bounce off these tools a year ago.
They just didn't work for me 60% of the time. I learned a bit in that initial experience though and walked away with some tasks ChatGPT could replace in my workflow. Mainly replacing scripts and reviewing single files or functions. Fast forward to today and I tried the tools again--specifically Claude Code--about a week ago. I'm blown away. I've reproduced some tools that took me weeks at full-time roles in a single day. This is while reviewing every line of code. The output is more or less what I'd be writing as a principal engineer.
11. > The output is more or less what I'd be writing as a principal engineer.
I certainly hope this is not true, because then you're not competent for that role. Claude Code writes an absolutely incredible amount of unnecessary and superfluous comments, it makes asinine mistakes like forgetting to update logic in multiple places. It'll gladly drop the entire database when changing column formats, just as an example.
12. This is exactly what the article is about. The tradeoff is that you have to thoroughly review the plans and iterate on them, which is tiring. But the LLM will write good code faster than you, if you tell it what good code is.
13. The key part of my comment is "correctly". Does it write maintainable code? Does it write extensible code? Does it write secure code? Does it write performant code? My experience has been it failing most of these. The code might "work", but it's not good for anything more than trivial, well-defined functions (that probably appeared in its training data written by humans). LLMs have a fundamental lack of understanding of what they're doing, and it's obvious when you look at the finer points of the outcomes. That said, I'm sure you could write detailed enough specs and provide enough examples to resolve these issues, but that's the point of my original comment - if you're just writing specs instead of code you're not gaining anything.
14. > But the aha moment for me was what's maintainable by AI vs by me by hand are on different realms. So maintainable has to evolve from good human design patterns to good AI patterns.
How do you square that with the idea that all the code still has to be reviewed by humans? Yourself, and your coworkers.
15. I picture it like semiconductors; the 5nm process is so absurdly complex that operators can't just peek into the system easily. I imagine I'm just so used to hand-crafting code that I can't imagine not being able to peek in. So maybe it's that we won't be reviewing by hand anymore? I.e. it's LLMs all the way down. Trying to embrace that style of development lately, as unnatural as it feels. We're obv not 100% there yet but Claude Opus is a significant step in that direction and they keep getting better and better.
16. Then who is responsible when (not if) that code does horrible things? We have humans to blame right now. I just don't see it happening personally because liability and responsibility are too important.
17. For some software, sure, but not most. And you don't blame humans anyways lol. Everywhere I've worked has had "blameless" postmortems. You don't remove human review unless you have reasonable alternatives like high test coverage and other automated reviews.
18. We still have performance reviews and are fired. There's a human that is responsible. "It's AI all the way down" is either nonsense on its face, or the industry is dead already.
19. > In all, maybe 30-45 minutes of my actual time to add a small feature
Why would this take you multiple days to do if it only took you 30m to review the code? Depends on the problem, but if I'm able to review something, the time it'd take me to write it is usually at most 2x more in the worst-case scenario - often it's about equal. I say this because after having used these tools, most of the speed-ups you're describing come at the cost of me not actually understanding or thoroughly reviewing the code. And this is corroborated by any high-output LLM users - you have to trust the agent if you want to go fast. Which is fine in some cases! But for those of us who have jobs where we are personally responsible for the code, we can't take these shortcuts.
20. Well it's less mental load. It's like Tesla's FSD. Am I a better driver than the FSD? For sure. But is it nice to just sit back and let it drive for a bit even if it's suboptimal and gets me there 10% slower, and maybe slightly pisses off the guy behind me? Yes, nice enough to shell out $99/mo. Code implementation takes a toll on you in the same way that driving does. I think the method in TFA is overall less stressful for the dev. And you can always fix it up manually in the end; AI coding vs manual coding is not either-or.
21. I partly agree with you. But once you have a codebase large enough, the changes take longer even to type in, once figured out. I find the best way to use agents (and I don't use claude) is to hash it out like I'm about to write these changes myself and make my own mental notes, then get the agent to execute on it. Agents don't get tired, they don't start fat-fingering stuff at 4pm, the quality doesn't suffer. And they can be parallelised. Finally, this allows me to stay at a higher level and not get bogged down in "right, oh, did we do this simple thing again?" which wipes some of the context in my mind and gets tiring through the day. Always, 100% review every line of code written by an agent though. I do not condone committing code you don't 'own'. I'll never agree with a job that forces developers to use 'AI'; I sometimes like to write everything by hand. But having this tool available is also very powerful.
22. I think it comes down to "it depends". I work in a NIS2-regulated field and we're quite challenged by the fact that it means we can't give AIs any sort of real access because of the security risk. To be compliant we'd have to have the AI agent ask permission for every single thing it does, before it does it, and four-eye review it. Which is obviously never going to happen. We can discuss how badly the NIS2 four-eye requirement works in the real world another time, but considering how easy it is to break AI security, it might not be something we can actually ever use. This makes sense for some of the stuff we work on, since it could bring an entire powerplant down. On the flip-side, AI risks would be of little concern for a lot of our internal tools, which are basically non-regulated and unimportant enough that they can be down for a while without costing the business anything beyond annoyances. This is where our challenges are. We've built our own chatbot where you can "build" your own agent within the librechat framework and add a "skill" to it. I say "skill" because it's older than claude skills but does exactly the same.
I don't completely buy the author's
> "deeply", "in great details", "intricacies", "go through everything"
bit, but you can obviously save a lot of time by writing a piece of English which tells it what sort of environment you work in. It'll know that when I write Python I use UV, Ruff and Pyrefly, and so on, as an example. I personally also have a "skill" setting that tells the AI not to compliment me because I find that ridiculously annoying, and that certainly works. So who knows? Anyway, employees are going to want more. I've been doing some PoCs running open source models in isolation on a Raspberry Pi (we had spares because we use them in IoT projects) but it's hard to set up an isolation policy which can't be circumvented. We'll have to figure it out though. For powerplant-critical projects we don't want to use AI. But for the web tool that allows a couple of employees to upload three Excel files from an external accountant and then generate some sort of report on them? Who cares who writes it or even what sort of quality it's written with? The lifecycle of that tool will probably be that it never changes until the external accountant changes, and then the tool dies. Not that it would have necessarily been written in worse quality without AI... I mean... Have you seen some of the stuff we've written in the past 40 years?
23. "The workflow I'm going to describe has one core principle: never let Claude write code until you've reviewed and approved a written plan."
I'm not sure we need to be this black and white about things. Speaking from the perspective of leading a dev team, I regularly have Claude Code take a chance at code without reviewing a plan. For example, small issues that I've written clear details about, Claude can go to town on those. I've never been on a team that didn't have too many of these types of issues to address. And a team should have other guards in place that validate that code before it gets merged somewhere important. I don't have to review every single decision one of my teammates is going to make, even those less experienced teammates, but I do prepare teammates with the proper tools (specs, documentation, etc) so they can make a best-effort first attempt. This is how I treat Claude Code in a lot of scenarios.
24. I've been using this same pattern, except not the research phase. Definitely will try to add it to my process as well. Sometimes when doing a big task I ask Claude to implement each phase separately and review the code after each step.
25. I've been working off and on on a vibe-coded FP language and transpiler - mostly just to get more experience with Claude Code and see how it handles complex real-world projects. I've settled on a very similar flow, though I use three documents: plan, context, task list. Multiple rounds of iteration when planning a feature. After completion, have a clean session do an audit to confirm that everything was implemented per the design. Then I have both Claude and CodeRabbit do code review passes before I finally do manual review. VERY heavy emphasis on tests; the project currently has 2x more test code than application code. So far it works surprisingly well. Example planning docs below - https://github.com/mbcrawfo/vibefun/tree/main/.claude/archiv...
26. > it's how good senior engineers already work
The other trick all good ones I've worked with converged on: it's quicker to write code than review it (if we're being thorough).
Agents have some areas where they can really shine (boilerplate you should maybe have automated already being one), but most of their speed comes from passing the quality checking to your users or coworkers. Juniors and other humans are valuable because eventually I trust them enough to not review their work. I don't know if LLMs can ever get here for serious industries.
28. This is similar to what I do. I instruct an Architect mode with a set of rules related to phased implementation and detailed code artifacts output to a report.md file. After a couple of rounds of review, and usually some responses that either tie together behaviors across context, critique poor choices or correct assumptions, there is a piece of work defined for a coder LLM to perform. With the new Opus 4.6 I then select specialist agents to review the report.md, prompted with detailed insight into particular areas of the software. The feedback from these specialist agent reviews is often very good and sometimes catches things I had missed. Once all of this is done, I let the agent make the changes and move on to doing something else. I typically rename and commit the report.md files, which can be useful as an alternative to git diff / commit messages etc.
29. I have to give this a try. My current model for backend is the same as how the author does frontend iteration. My friend does the research-plan-edit-implement loop, and there is no real difference between the quality of what I do and what he does. But I do like this just for how it serves as documentation of the thought process across AI/human, and can be added to version control. Instead of humans reviewing PRs, perhaps humans can review the research/plan document. On the PR review front, I give Claude the ticket number and the branch (or PR) and ask it to review for correctness, bugs and design consistency. The prompt is always roughly the same for every PR. It does a very good job there too. Modelwise, Opus 4.6 is scary good!
30. 1. Don't implement too much at a time. 2. Have the agent review whether it followed the plan and relevant skills accurately.
31. Good article, but I would rephrase the core principle slightly: Never let Claude write code until you've reviewed, *fully understood* and approved a written plan. In my experience, the beginning of chaos is the point at which you trust that Claude has understood everything correctly and claims to present the very best solution. At that point, you leave the driver's seat.
32. I don't deny that AI has use cases, but boy - the workflow described is boring: "Most developers type a prompt, sometimes use plan mode, fix the errors, repeat." Does anyone think this is as epic as, say, watching the Unix archives https://www.youtube.com/watch?v=tc4ROCJYbm0 where Brian demos how pipes work, or Dennis working on C and UNIX? Or even before those, the older machines? I am not at all saying that AI tools are all useless, but there is no real epicness. It is just autogenerated AI slop and blob.
I don't really call this engineering (although I also do agree that it is still engineering; I just don't like using the same word here).
> never let Claude write code until you've reviewed and approved a written plan.
So the junior-dev analogy is quite apt here. I tried to read the rest of the article, but I just got angrier. I never had that feeling watching old-school legends, though perhaps some of their work may be boring, but this AI-generated code ... that's just some mythical random-guessing work. And none of that is "intelligent", even if it may appear to work, and may work to some extent too. This is a simulation of intelligence. If it works very well, why would any software engineer still be required? Supervising would only be necessary if AI produces slop.
33. The post and comments all read like: here are my rituals to the software God. If you follow them then God gives plenty. Omit one step and the God is mad. Sometimes you have to make a sacrifice but that's better for the long term. I've been in eng for decades but never participated in forums. Is the cargo cult new? I use Claude Code a lot. I still don't trust that what's in the plan will actually get written, regardless of details. My ritual is around stronger guardrails outside of prompting. This is the new MongoDB webscale meme.
34. Interesting approach. The separation of planning and execution is crucial, but I think there's a missing layer most people overlook: permission boundaries between the two phases. Right now when Claude Code (or any agent) executes a plan, it typically has the same broad permissions for every step. But ideally, each execution step should only have access to the specific tools and files it needs — least privilege, applied to AI workflows. I've been experimenting with declarative permission manifests for agent tasks. Instead of giving the agent blanket access, you define upfront what each skill can read, write, and execute. Makes the planning phase more constrained but the execution phase much safer. Anyone else thinking about this from a security-first angle?
35. I came to the exact same pattern, with one extra heuristic at the end: spin up a new claude instance after the implementation is complete and ask it to find discrepancies between the plan and the implementation.
36. I don't know. I tried various methods, and this one kind of doesn't work quite a bit of the time. The problem is that the plan naturally always skips some important details, or assumes some library function, but it is then taken as instruction in the next section. And Claude can't handle ambiguity if the instruction is very detailed (e.g. if the plan asks to use a certain library, even if it is a bad fit, Claude won't know that decision is flexible). If the instruction is less detailed, I've seen Claude willing to try multiple things, and if it keeps failing it doesn't fear reverting almost everything. In my experience, the best scenario is that the instruction and plan should be human-written, and be detailed.
</comments_about_topic>

Write a concise, engaging paragraph (3-5 sentences) summarizing the key points and perspectives in these comments about the topic. Focus on the most interesting viewpoints. Do not use bullet points—write flowing prose.