The following is content for you to summarize. Do not respond to the comments—summarize them. <topic> AI Code Review Strategies # Approaches for handling AI-generated code, such as using separate AI instances to review PRs, the necessity of rigorous CI/CD guardrails, and the danger of blindly trusting 'green' tests without human oversight. </topic> <comments_about_topic> 1. Yep. For one of the things I am doing, I am the solo developer on a web application. At any given point, there are 4-5 large features I want, and I instruct Claude to heavily test those features, so it is not unusual for each to run for 30-45 minutes and for overall conversations to span several hours. People are correct that it often makes mistakes, so that testing phase usually uncovers a bunch of issues it has to fix. I usually have 1-2 mop-up terminal windows open for small things I notice as I go along that I want to fix. Claude can be bad about things like putting white text on a white button, and I want a free terminal to just drop every little nitpick into. They exist for me to just throw small tasks into. Yes, you really should start a new convo for every new need, but these are small things and I do not want to disrupt my flow. There are another 2-3 for smaller features that I am regularly reviewing and resetting. And then another one dedicated to just running the tests already built over and over again and solving any failures or investigating things. Another one is for research to tell me things about the codebase. 2. Potentially, a lot of that isn't just code generation, it *is* requirements gathering, design iteration, analysis, debugging, etc. I've been using CC for non-programming tasks and it's been pretty successful so far, at least for personal projects (bordering on the edge of non-trivial). For instance, I'll have a 'designer' agent come up with a spec, and a 'design critic' challenge the design and make the original agent defend their choices.
They can ask open questions after each round and I'll provide human feedback. After a few rounds of this, we whittle it down to a decent spec and try it out after handing it off to a coding agent. Another example from work: I fired off some code analysis to an agent with the goal of creating integration tests, and then ran a set of spec reviewers in parallel to check its work before creating the actual tickets. My point is that there are a lot of steps involved in the whole product development process, and it isn't just "ship production code". And we can reduce the ambiguity/hallucinations/sycophancy by creating validation/checkpoints (either tests, 'critic' agents to challenge designs/specs, or human QA/validation when appropriate). The end game of this approach is that you have dozens or hundreds of agents running via some kind of orchestrator, churning through a backlog that is a combination of human- and AI-generated work, and the system posts questions to the human user(s) to gather feedback. The human spends most of the time doing high-level design/validation and answering open questions. You definitely incur some cognitive debt and risk it doing something you don't want, but that's part of the fun for me (assuming it doesn't kill my AI bill). 3. LLM agents can be a bit like slot machines. The more the merrier. And at least two generate continuous shitposts for their company's Slack. That said, having one write code and a clean context review it is helpful. 4. That's why you have Codex review the code. (I'm only half joking. Having one LLM review the PRs of another is actually useful as a first-line filter.) 5. Even having Opus review code written by Opus works very well as a first pass. I typically have it run a sub-agent to review its own code using a separate prompt. The sub-agent gets fresh context, so it won't get "poisoned" by the top-level context's justifications for the questionable choices it might have made.
The prompts then direct the top-level instance to repeat the verification step until the sub-agent gives the code a "pass", and to fix any issues flagged. The result is change sets that still need review - and fixes - but are vastly cleaner than if you review the first output. Doing runs with other models entirely is also good - they will often identify different issues - but you can get far with sub-agents and different personas (and you can, if you like, have Claude Code use a sub-agent to run Codex to prompt it for a review, or vice versa - a number of the CLI tools seem to have "standardized" on "-p <prompt>" to ask a question on the command line). Basically, reviewing output from Claude (or Codex, or any model) that hasn't been through multiple automated review passes by a model first is a waste of time - it's like reviewing the first draft from a slightly sloppy and overly self-confident developer who hasn't bothered checking if their own work even compiles first. 6. Thanks, that sounds all very reasonable! > Basically, reviewing output from Claude (or Codex, or any model) that hasn't been through multiple automated review passes by a model first is a waste of time - it's like reviewing the first draft from a slightly sloppy and overly self-confident developer who hasn't bothered checking if their own work even compiles first. Well, that's what the CI is for. :) In any case, it seems like a good idea to also feed the output of compiler errors and warnings and the linter back to your coding agent. 7. At the beginning of the project, the runs are fast, but as the project gets bigger, the runs are slower: - there are bigger contexts - the test suite is much longer and slower - you need to split worktrees, resources (like db, ports) and sometimes containers to work in isolation So having 10 workers will run for a long time. Which gives plenty of time to write a good spec. You need a good spec, so the LLM produces good tests, so it can write good code to match those tests.
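The repeat-until-pass loop comment 5 describes can be sketched in a few lines. This is a hedged illustration only: `run_agent` and `stub_agent` stand in for real model invocations (their names, signatures, and behavior are invented for the example, not any real API).

```python
def review_until_pass(change, run_agent, max_rounds=5):
    """Alternate review/fix rounds until a fresh-context reviewer says 'pass'."""
    for round_no in range(1, max_rounds + 1):
        # The reviewer sees only the change set, not the author's reasoning,
        # so it cannot be "poisoned" by the top-level context.
        verdict, issues = run_agent("review", change)
        if verdict == "pass":
            return change, round_no
        # Feed the flagged issues back to the authoring agent for a fix pass.
        _, change = run_agent("fix", (change, issues))
    return change, max_rounds

# Stub agent standing in for real model calls: flags a TODO once, then passes.
def stub_agent(role, payload):
    if role == "review":
        change = payload
        if "TODO" in change:
            return "fail", ["unresolved TODO"]
        return "pass", []
    change, issues = payload
    return "fixed", change.replace("TODO", "done")

final, rounds = review_until_pass("# TODO implement f()", stub_agent)
```

In a real setup the two roles would be separate processes (e.g. separate CLI invocations), which is what gives the reviewer its clean context.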
Having a very strong spec + test suite + quality gates (linter, type checkers, etc.) is the only way to get good results from an LLM as the project becomes more complex. Unlike a human, it's not very good at isolating complexity by itself, nor at stopping and asking questions in the face of ambiguity. So the guardrails are the only thing that keeps it on track. And running a lot of guardrails takes time. E.g.: yesterday I had a big migration to do from HTMX to viewjs. I asked the LLM to produce screenshots of each state, and then do the migration in steps in a way that kept the screenshots 90% identical. This way I knew it would not break the design. But it's very slow to run e2e tests + screenshot comparison every time you make a modification. Still faster than a human, but it gives plenty of time to talk to another LLM. Plus you can assign them very different tasks: - One works on adding a new feature - One improves the design - One refactors part of the code (it's something you should do regularly; LLMs produce tech debt quickly) - One adds more tests to your test suite - One is deploying on a new server - One is analyzing the logs of your dev/test/prod servers and telling you what's up - One is cooking up a new logo for you and generating x versions at different resolutions. Etc. It's basically a small team at your disposal. 8. In other words, nobody cares that the generated code is shit, because there is no human who can review that much code. Not even at a high level. According to the discussion here, they don't even care whether the tests are real. They just care that it's green. If the tests are useless in reality? Who cares, nobody has time to check them! And who will suffer because of this? Who cares, they pray it's not them! 9. Yeah, but there is a difference between whether at least one person at some point in time understood the code (or the specific part of it), and none. Also, there are different levels.
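The "keep screenshots 90% identical" guardrail from comment 7 reduces to a similarity threshold over two images. A minimal stdlib-only sketch, with plain byte buffers standing in for real screenshots (which would come from an e2e tool; real visual diffing is more forgiving than exact byte equality):

```python
def pixel_similarity(before: bytes, after: bytes) -> float:
    """Fraction of identical bytes between two equal-size pixel buffers."""
    if len(before) != len(after):
        raise ValueError("screenshots must have the same dimensions")
    if not before:
        return 1.0
    matching = sum(a == b for a, b in zip(before, after))
    return matching / len(before)

def migration_step_ok(before: bytes, after: bytes, threshold: float = 0.90) -> bool:
    """Gate one migration step: reject it if the UI drifted too far visually."""
    return pixel_similarity(before, after) >= threshold

# Toy 10-"pixel" screenshots: one changed byte leaves them 90% identical.
old = bytes(range(10))
new = bytes([255]) + old[1:]
```

The design point is that the gate is cheap to express but expensive to run per step over a full e2e suite, which is exactly the slowness comment 7 describes.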
Wildfly's code, for example, is utterly incomprehensible, because the flow jumps along huge inheritance chains up and down to random points all the time; some Java Enterprise people are terrible with this. Anyway, the average for tools used by many is way better than that. So it's definitely possible to make it worse. Blindly trusting AI is one possible way to reach those new lows. So it would be good to prevent it before it's too late, not to praise it without that, and not to throw out one of the (broken, but better than nothing) safeguards. Especially since code review is obviously dead with this amount of generated code per week. (The situation wasn't great there before, either.) So it's a two-in-one bad situation. 10. I implemented some of his setup and have been loving it so far. My current workflow is typically 3-5 Claude Codes in parallel - Shallow clone, plan mode back and forth until I get the spec down, hand off to a subagent to write a plan.md - Ralph Wiggum Claude using plan.md and skills until the PR passes tests, CI/CD, auto-responds to greptile reviews, prepares the PR for me to review - Back and forth with Claude for any incremental changes or fixes - Playwright MCP for Claude to view the browser for frontend I still always comb through the PRs and double-check everything, including local testing, which is definitely the bottleneck in my dev cycles, but I'll typically have 2-4 PRs lined up ready for me at any moment. 11. It codes faster and with more abandon. For good results, mix Claude Code with Codex (preferably high or xhigh reasoning) for reviews. 12. > manually fixing crap it produces > it tends to produce so many errors I get some of the skepticism in this thread, but I don't get takes like this. How are you using CC that the output you look at is "full of errors"? By the time I look at the output of a session, the agent has already run linting, formatting, testing and so on.
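Comments 6 and 12 describe the same loop shape: run the project's own checks and feed any failures back to the agent before a human looks at anything. A minimal sketch with both the checks and the model call stubbed out (all names here are illustrative, not real tools):

```python
def check_and_feed_back(code, run_checks, ask_agent, max_rounds=3):
    """Run the project's checks and feed failures back until they come up clean."""
    for _ in range(max_rounds):
        failures = run_checks(code)        # e.g. linter + type checker + tests
        if not failures:
            return code, True              # clean: now worth a human's review
        code = ask_agent(code, failures)   # agent patches code given the report
    return code, False                     # still failing: escalate to a human

# Stubs standing in for a real linter and a real model call.
def toy_checks(code):
    return ["trailing whitespace"] if code != code.rstrip() else []

def toy_agent(code, failures):
    return code.rstrip()

cleaned, ok = check_and_feed_back("x = 1 ", toy_checks, toy_agent)
```

The bounded `max_rounds` matters: without it, an agent that cannot satisfy the checks loops forever, which is the failure mode the "green tests without oversight" worry points at.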
The things I look at are adherence to the conventions, files touched, libraries used, and so on. And the "error rate" on those has been steadily coming down, especially if you also use a review loop (w/ Codex, since it has been the best at review lately). You have to set these things up for success. You need loops with clear feedback. You need a project that has lots of clear things to adhere to. You need tight integrations. But once you have these things, if you're looking at "errors", you're doing something wrong IMO. 13. One thing that's helped me is creating a bake-off. I'll do it between Claude and Codex. Same prompt, but separate environments. They'll both do their thing and then I'll score them at the end. I find it helps me because frequently only one of them makes a mistake, or one of them finds an interesting solution. Then once I declare a winner, I have scripts to reset the bake-off environments. 14. You can have them review each other's work, too. 15. Did you manage to make proper reviews of all 30 PRs? 16. I have formal requirements for all implemented code. This is all on relatively greenfield, solo-developed codebases with tools I know inside out (Django, Click-based CLIs, etc.), so yes. Thanks so much for your concern, internet person! 17. A classic Hacker News post that will surely interest coders from all walks of life! ~ After regular use of an AI coding assistant for some time, I noticed something unusual: my biggest wins came from neither better prompts nor a smarter model. They came from the way I operated. At first, I thought of it as autocomplete. Later, as a junior developer. In the end, as a collaborator who requires constraints. Here is the framework I have landed on. Stage one: ask for everything. You get acceleration, but lots of noise. Stage two: add rules. Less shock, more trust. Stage three: give it room to act, but don't hesitate to review aggressively. A few habits made a big difference.
Specifying what can be touched. Asking it to explain diffs before applying them. Treating "wrong but confident" answers as a signal to tighten scope. I'm wondering what others see over time: What changed after the second or fourth week? When did your trust increase or decrease? What rules do you wish you had added earlier? </comments_about_topic> Write a concise, engaging paragraph (3-5 sentences) summarizing the key points and perspectives in these comments about the topic. Focus on the most interesting viewpoints. Do not use bullet points—write flowing prose.