Summarizer

LLM Input

llm/8632d754-c7a3-4ec2-977a-2733719992fa/topic-1-6b67d9fa-d122-4e47-96f5-777e498865b9-input.json

prompt

The following is content for you to summarize. Do not respond to the comments—summarize them.

<topic>
The Code Review Bottleneck # Concerns that generating code faster merely shifts the bottleneck to reviewing code, which is often harder and more time-consuming than writing it. Users discuss the cognitive load of verifying 'vibe code' and the risks of blindly trusting output that looks correct but contains subtle bugs or security flaws.
</topic>

<comments_about_topic>
1. I'm not arguing that LLMs are at a point today where we can blindly trust their outputs in most applications, I just don't think that 100% correct output is necessarily a requirement for that. What it needs to be is correct often enough that the cost of reviewing the output far outweighs the average cost of any errors in the output, just like with a compiler.

This even applies to human written code and human mistakes, as the expected cost of errors goes up we spend more time on having multiple people review the code and we worry more about carefully designing tests.

2. The challenge this line of reasoning doesn't address is the sheer scale of output validation required on the back end of LLM-generated code. Human hand-developed code was no great shakes on the validation front either, but the difference in scale hid this problem.

I’m hopeful that what used to be tedious about the software development process (like correctness proving or documentation) becomes tractable enough with LLMs to make the scale more manageable for us. That’s exciting to contemplate; think of the complexity categories we can feasibly challenge now!

3. Reasoning by analogy is usually a bad idea, and nowhere is this worse than talking about software development.

It’s just not analogous to architecture, or cooking, or engineering. Software development is just its own thing. So you can’t use analogy to get yourself anywhere with a hint of rigour.

The problem is, AI is generating code that may be buggy, insecure, and unmaintainable. We have as a community spent decades trying to avoid producing that kind of code. And now we are being told that productivity gains mean we should abandon those goals and accept poor quality, as evidenced by MoltBook’s security problems.

It’s a weird cognitive dissonance and it’s still not clear how this gets resolved.

4. Don't take this as criticizing LLMs as a whole, but architects also don't call themselves engineers. Engineers are an entirely distinct set of roles that among other things validate the plan in its totality, not only the "new" 1/5th. Our job spans both of these.

"Architect" is actually a whole career progression of people with different responsibilities. The bottom rung used to be the draftsmen, people usually without formal education who did the actual drawing. Then you had the juniors, mid-levels, seniors, principals, and partners who each oversaw different aspects. The architects with their name on the building were already issuing high level guidance before the transition instead of doing their own drawings.

When was the last time you reviewed the machine code produced by a compiler?

Last week, to sanity check some code written by an LLM.

5. > We don't call architects 'vibe architects' even though they copy-paste 4/5th of your next house and use a library of things in their work!

Maybe not, but we don't allow non-architects to vomit out thousands of diagrams that they cannot review, and that are never reviewed, which are subsequently used in the construction of the house.

Your analogy to software is fatally and irredeemably flawed, because you are comparing the regulated and certification-heavy production of content, which is subsequently double-checked by certified professionals, with the unregulated and non-certified production of content that is never checked by any human.

6. I skimmed over it, and didn’t find any discussion of:

- Pull requests
- Merge requests
- Code review

I feel like I’m taking crazy pills. Are SWEs supposed to move away from code review, one of the core activities of the profession? Code review is as fundamental to SWE as double-entry bookkeeping is to accounting.

Yes, we know that functional code can get generated at incredible speeds. Yes, we know that apps and what not can be bootstrapped from nothing by “agentic coding”.

We need to read this code, right? How can I deliver code to my company without security and reliability guarantees that, at their core, come from me knowing what I’m delivering line-by-line?

7. Give it a read; he briefly mentions how he uses it for PR triage and resolving GH issues.

He doesn't go into detail, but there is a bit:

> Issue and PR triage/review. Agents are good at using gh (GitHub CLI), so I manually scripted a quick way to spin up a bunch in parallel to triage issues. I would NOT allow agents to respond, I just wanted reports the next day to try to guide me towards high value or low effort tasks.

> More specifically, I would start each day by taking the results of my prior night's triage agents, filter them manually to find the issues that an agent will almost certainly solve well, and then keep them going in the background (one at a time, not in parallel).

This is a short excerpt; the whole article is worth reading. Very grounded and balanced.

8. Either write really comprehensive tests (that you read) or read the code. Usually I find you can skim most of it, but in core sections like billing you've got to really review it. The models still make mistakes.

9. You can't skim over AI code.

For even mid-level tasks it will make bad assumptions, like sorting orders or timezone conversions.

Basic stuff really.

You've probably got a load of ticking time bomb bugs if you've just been skimming it.

10. You read it. You now have an infinite army of overconfident slightly drunken new college grads to throw at any problem.

Sometimes you're gonna want to slowly back away from them and write things yourself. Sometimes you can farm out work to them.

Code review their work as you would anyone else's; in fact, more so.

My rule of thumb has been that it takes one senior engineer per every 4 new grads to mentor them and code review their work. Or, put another way, bringing on a new grad gets you +1 output at the cost of 0.25 of a senior.

Also, there are some tasks you just can’t give new college grads.

Same dynamic seems to be shaping up here, except the AI juniors are cheap, work 24/7, and (currently) have no hope of growing into seniors.
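The rule-of-thumb arithmetic above can be sketched in a few lines (the 1-senior-per-4-grads ratio and the 0.25 cost are the commenter's numbers; the function and parameter names are my own):

```python
def team_output(seniors: float, grads: float,
                mentor_cost: float = 0.25) -> float:
    """Net output in 'engineer units': each grad contributes +1 but
    consumes mentor_cost of a senior's capacity for review/mentoring."""
    effective_seniors = max(seniors - grads * mentor_cost, 0.0)
    return effective_seniors + grads

# 4 seniors taking on 4 grads: 4 - 4*0.25 = 3 senior-units, plus 4 grad-units
print(team_output(4, 4))  # → 7.0
```

On these numbers, a grad is net-positive only while there is spare senior capacity to absorb the review load, which is exactly the supervision bottleneck the thread keeps circling back to.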

11. So read the code.

12. Cool, code review continues to be one of the biggest bottlenecks in our org, with or without agentic AI pumping out 1k LOC per hour.

13. For me, AI is the best for code research and review

Since some team members started using AI without care, I created a bunch of agents/skills/commands and custom scripts for Claude Code. For each PR, it collects changes via git log/diff, reads the PR data, and spins up a bunch of specialized agents to check code style, architecture, security, performance, and bugs. Each agent is armed with the necessary requirement documents, including security compliance files. False positives are rare, but it still misses some problems. No PR with AI-generated code bypasses it. If the AI did not find any problems, I do a manual review.
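A minimal sketch of the fan-out described above, with trivial string checks standing in for the specialized agents (the real setup feeds diffs to Claude Code agents; every name and heuristic here is a hypothetical illustration):

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the specialized reviewer agents.
def check_style(diff: str) -> list:
    return ["TODO left in code"] if "TODO" in diff else []

def check_security(diff: str) -> list:
    return ["possible SQL injection"] if 'f"SELECT' in diff else []

def check_perf(diff: str) -> list:
    return ["N+1 query pattern"] if "for row in rows" in diff else []

CHECKS = {"style": check_style, "security": check_security, "performance": check_perf}

def review_pr(diff: str) -> dict:
    """Fan the PR diff out to every specialized check in parallel
    and collect the findings per category."""
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(fn, diff) for name, fn in CHECKS.items()}
    return {name: fut.result() for name, fut in futures.items()}

diff = 'cur.execute(f"SELECT * FROM users WHERE id={uid}")  # TODO: parametrize'
print(review_pr(diff)["security"])  # → ['possible SQL injection']
```

The point of the structure is that each checker sees only the diff plus its own requirement documents, so findings stay scoped and a human can do the final pass over the aggregated report.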

14. Ok? You still have to read the code.

15. That's just not what has been happening in large enterprise projects, internal or external, since long before AI.

Famous example - but by no means do I want to single out that company and product: https://news.ycombinator.com/item?id=18442941

From my own experience (I kept this post bookmarked because I too worked on that project in the late 1990s), you cannot review those changes anyway. It is handled as described: you keep tweaking stuff until the tests pass. There is fundamentally no way to understand the code. Maybe it's different in some very core parts, but most of it is just far too messy. I tried merely disentangling a few types once, because there were a lot of duplicate types for the most simple things, such as 32-bit integers, and it is like trying to pick one noodle out of a huge bowl of spaghetti: everything is glued and knotted together, so you always end up lifting out the entire bowl's contents. No AI necessary; that is just how such projects look after many generations of temporary programmers (because all sane people leave as soon as they can, e.g. once they've switched from an H1B to a Green Card) under ticket-closing pressure.

I don't know why, since the beginning of these discussions, some commenters seem to work off the wrong assumption that our actual methods thus far lead to great code. Very often they don't; they lead to a huge mess that just gets bigger over time.

And that is not because people are stupid; it's because top management has rationally determined that the best balance for overall profits does not require perfect code. If the project gets too messy to do much, the customers will already have been hooked and can't change easily, and by the time they do, some new product will have replaced the two-decades-old mature one. Those customers still on the old one will pay a premium for future bug fixes, and the rest will jump to the new trend. I don't think AI can make what's described above any worse, or much worse.

16. You're missing the point. The point is that reading the code is more time consuming than writing it, and has always been thus. Having a machine that can generate code 100x faster, but which you have to read carefully to make sure it hasn't gone off the rails, is not an asset. It is a liability.

17. So you have a hobby.

I have a profession. Therefore I evaluate new tools. Agent coding I've introduced into my auxiliary tool forgings (one-off bash scripts) and personal projects, and I'm only now comfortable introducing it into my professional work. But I still evaluate every line.

18. So then type the code as well and read it after. Why are you mad?

19. I think this is the crux of why, when used as an enhancement to solo productivity, you'll have a pretty strict upper bound on productivity gains given that it takes experienced engineers to review code that goes out at scale.

That being said, software quality seems to be decreasing, or maybe it's just because I use a lot of software in a somewhat locked-down state with adblockers and the rest.

Although, that wouldn't explain just how badly they've murdered the once lovely iTunes (now Apple Music) user interface. (And why does CMD-C not pick up anything 15% of the time I use it lately...)

Anyways, digressions aside... the complexity in software development is generally in the organizational side. You have actual users, and then you have people who talk to those users and try to see what they like and don't like in order to distill that into product requirements which then have to be architected, and coordinated (both huge time sinks) across several teams.

Even if you cut out 100% of the development time, you'd still be left with 80% of the timeline.

Over time, though, you'll probably see people doing what I do all day, which is move around among many repositories (I've yet to use the AI much; I got my Cursor license recently and am going to spin up some POCs soon), enabled by their use of AI to quickly grasp what's happening in a repo and the appropriate places to make changes.

Enabling developers to complete features from tip to tail across deep, many-pronged service architectures could bring project time down drastically and bring project-management and cross-team coordination costs down tremendously.

Similarly, in big companies, the hand is often barely aware of the foot at best, and exploring that space is a serious challenge. Often folk know exactly one step away, and rely on well-established async communication channels which also only know one step further. Principal engineers seem to know large amounts about finite spaces and are often in the dark just a few hops away, on things like the internal tooling for the systems they're maintaining (and they're often not particularly great at coming into new spaces and thinking with the same perspective: no, we don't need individual microservices for every 12-requests-a-month admin API group we want to set up).

Once systems can take a feature proposal and lay out concrete plans which each little kingdom can give a thumbs up or thumbs down to for further modifications, you can again reduce exploration, coordination, and architecture time down.

Sadly, User Experience design seems to be an often terribly neglected part of our profession. I love the memes about an engineer building the perfect interface, like a water pitcher, only for the person to position it weirdly to get a pour out of the fill hole or something. Let me guess how many users you actually talked to (often zero), and how many layers of distillation occurred before you received a micro-picture feature request that ends up being built with input from engineers who have no macro understanding of a user's actual needs or day-to-day.

And who are often much more interested in perfecting some little algorithm than thinking about enabling others.

So my money is on money flowing to...
- People who can actually verify system integrity, and can fight fires and bugs (but a lot of bug fixing will eventually become prompting?)
- Multi-talented individuals who can say... interact with users well enough to understand their needs as well as do a decent job verifying system architecture and security

It's outside of coding where I haven't seen much... I guess people use it to more quickly scaffold up expense reports, or generate mocks. So, lots of white collar stuff. But... it's not like the experience of shopping at the supermarket has changed, or going to the movies, or much of anything else.

20. If I asked you for the same thing 10 times, wiping your memory each time, would you generate the same result?

And why does it matter anyway? If the code passes the tests and you like the look of it, it's good. It doesn't need to be existentially complicated.

21. I've spent 2+ decades producing software across a number of domains and orgs and can fully agree that _disciplined use_ of LLM systems can significantly boost productivity, but the rules and guidance around their use within our industry writ large are still in flux and causing as many problems as they're solving today.

As the most senior IC within my org, since the advent of (enforced) LLM adoption my code contribution/output has stalled, as my focus has shifted to the reactionary work of sifting through AI-generated chaff following post-mortems of projects that should never have shipped in the first place. On a good day I end up rejecting several PRs that would almost certainly have taken down our critical systems in production due to poor vetting and architectural flaws, and on the worst days I'm in full-on firefighting mode, "fixing" the same issues already taking down production (already too late).

These are not inherent technical problems in LLMs; these are organizational/process problems induced by AI pushers promising 10x output without the necessary 10x requirements-gathering and validation efforts that come with it. "Everyone with GenAI access is now a 10x SDE" is the expectation, when the reality is much more nuanced.

The result I see today is massive incoming changesets that no one can properly vet given the new, shortened delivery timelines and reduced human resourcing given to projects. We get test-suite coverage inflation where "all tests pass" but core business requirements are undermined, and no one is being given the time or resources to properly confirm the business requirements are actually being met. Shit hits the fan; repeat ad nauseam. The focus within our industry needs to shift to education on the proper application and use of these tools, or we'll inevitably crash into the next AI winter: an increasingly likely future that would have been totally avoidable if everyone drinking the Kool-Aid stopped to observe what is actually happening.

As you implied, code is cheap and most code is "throwaway" given even modest time horizons, but all new code comes with hidden costs not readily apparent to all the stakeholders attempting to create a new normal with GenAI. As you correctly point out, the biggest problems within our industry aren't strictly technical ones, they're interpersonal, communication and domain expertise problems, and AI use is simply exacerbating those issues. Maybe all the orgs "doing it wrong" (of which there are MANY) simply fail and the ones with actual engineering discipline "make it," but it'll be a reckoning we should not wish for.

I have heard from a number of different industry players, and they see the same patterns. Just look at the average LinkedIn post about AI adoption to confirm. Maybe you observe different patterns and the issues aren't as systemic as I fear. I honestly hope so.

Your implication that seniors like myself are "insecure about our jobs" is somewhat ironically correct, but not for the reasons you think.

22. But annoying hype is exactly the issue with AI in my eyes. I get that it's a useful tool in moderation and all, but I also see that management values speed and quantity of delivery above all else, and hype-driven as they are, I fear they will run this industry into the ground, and we as users and customers will have to deal with a world where software is permanently broken: a giant pile of unmaintainable vibe code, and no experienced junior developers to boot.

23. It sounds like you're talking more about "vibe coding" i.e. just using LLMs without inspecting the output. That's neither what the article nor the people to whom you're replying are saying. You can (and should) heavily review and edit LLM generated code. You have the full ability to change it yourself, because the code is just there and can be edited!

24. Since they started doing that it's gained a lot of bugs.

25. I actually enjoy writing specifications. So much so that I made it a large part of my consulting work for a huge part of my career. So it makes sense that working with Gen-AI that way is enjoyable for me.

The more detailed I am in breaking down chunks, the easier it is for me to verify, and the more likely I am to get output that isn't 30% wrong.

26. This matches my experience, especially "don’t draw the owl" and the harness-engineering idea.

The failure mode I kept hitting wasn’t just "it makes mistakes", it was drift: it can stay locally plausible while slowly walking away from the real constraints of the repo. The output still sounds confident, so you don’t notice until you run into reality (tests, runtime behaviour, perf, ops, UX).

What ended up working for me was treating chat as where I shape the plan (tradeoffs, invariants, failure modes) and treating the agent as something that does narrow, reviewable diffs against that plan. The human job stays very boring: run it, verify it, and decide what’s actually acceptable. That separation is what made it click for me.

Once I got that loop stable, it stopped being a toy and started being a lever. I’ve shipped real features this way across a few projects (a git like tool for heavy media projects, a ticketing/payment flow with real users, a local-first genealogy tool, and a small CMS/publishing pipeline). The common thread is the same: small diffs, fast verification, and continuously tightening the harness so the agent can’t drift unnoticed.

27. I think you are right; the secret is that there is no secret. The projects I have been involved with that were most successful used these techniques. I also think experience helps, because you develop a sense that very quickly knows whether the model wants to go in a wonky direction, and what a good spec looks like.

With where the models are right now, you still need a human in the loop to make sure you end up with code you (and your organisation) actually understand. The bottleneck has gone from writing code to reading code.

> The bottleneck has gone from writing code to reading code.

This has always been the bottleneck. Reviewing code is much harder and gets worse results than writing it, which is why reviewing AI code is not very efficient. The time required to understand code far outstrips the time to type it.

Most devs don’t do thorough reviews. Check the variable names seem ok, make sure there’s no obvious typos, ask for a comment and call it good. For a trusted teammate this is actually ok and why they’re so valuable! For an AI, it’s a slot machine and trusting it is equivalent to letting your coworkers/users do your job so you can personally move faster.

29. I've been thinking about this as three maturity levels.

Level 1 is what Mitchell describes — AGENTS.md, a static harness. Prevents known mistakes. But it rots. Nobody updates the checklist when the environment changes.

Level 2 is treating each agent failure as an inoculation. Agent duplicates a util function? Don't just fix it — write a rule file: "grep existing helpers before writing new ones." Agent tries to build a feature while the build is broken? Rule: "fix blockers first." After a few months you have 30+ of these. Each one is an antibody against a specific failure class. The harness becomes an immune system that compounds.

Level 3 is what I haven't seen discussed much: specs need to push, not just be read. If a requirement in auth-spec.md changes, every linked in-progress task should get flagged automatically. The spec shouldn't wait to be consulted.

The real bottleneck isn't agent capability — it's supervision cost. Every type of drift (requirements change, environments diverge, docs rot) inflates the cost of checking the agent's work.

Crush that cost and adoption follows.
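The Level 2 idea above (each failure becomes a standing rule the agent sees on every run) can be sketched as a tiny harness; the directory layout and function names here are hypothetical, not from the comment:

```python
from pathlib import Path

RULES_DIR = Path("rules")  # one small file per past failure class

def record_rule(name: str, rule: str) -> None:
    """After an agent failure, capture the lesson as a standing rule."""
    RULES_DIR.mkdir(exist_ok=True)
    (RULES_DIR / f"{name}.md").write_text(rule.strip() + "\n")

def build_prompt(task: str) -> str:
    """Prepend every accumulated rule so the agent can't repeat old mistakes."""
    rules = sorted(p.read_text().strip() for p in RULES_DIR.glob("*.md"))
    header = "\n".join(f"- {r}" for r in rules)
    return f"Rules learned from past failures:\n{header}\n\nTask: {task}"

record_rule("no-duplicate-utils", "Grep existing helpers before writing new ones.")
record_rule("fix-blockers-first", "Fix broken builds before starting new features.")
print(build_prompt("add retry logic to the upload client"))
```

Because every rule rides along on every run, the harness compounds: each new failure class is paid for once, then prevented for free afterwards.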

30. How much electricity (and associated materials like water) must this use?

It makes me profoundly sad to think of the huge number of AI agents running endlessly to produce vibe-coded slop. The environmental impact must be massive.

31. It's amusing how everyone seems to be going through the same journey.

I do run multiple models at once now. On different parts of the code base.

I focus solely on the less boring tasks myself and outsource all of the slam dunks, then review. I often use another model to validate the previous model's work while doing so myself.

I do git reset still quite often but I find more ways to not get to that point by knowing the tools better and better.

Autocompleting our brains! What a crazy time.

32. I can't speak for the parent, but I use gptel, and it sounds like they do as well. It has a number of features, but primarily it just gives you a chat buffer you can freely edit at any time. That gives you 100% control over the context: you just quickly remove the parts of the conversation where the LLM went off the rails and keep it clean. You can replace or compress the context so far in any way you like.

While I also use LLMs in other ways, this is my core workflow. I quickly get frustrated when I can't _quickly_ modify the context.

If you have some mastery over your editor, you can just run commands and post relevant output and make suggested changes to get an agent like experience, at a speed not too different from having the agent call tools. But you retain 100% control over the context, and use a tiny fraction of the tokens OpenCode and other agents systems would use.

It's not the only or best way to use LLMs, but I find it incredibly powerful, and it certainly has its place.

A very nice positive effect I noticed personally is that as opposed to using agents, I actually retain an understanding of the code automatically, I don't have to go in and review the work, I review and adjust on the fly.

33. I was using it the same way you just described, but for C# and Angular, and you're spot on. It feels amazing not having to memorize APIs and to just let the AI even push code coverage to near 100%. However, at some point I began noticing two things:

- When tests didn't work I had to check what was going on, and the LLMs do cheat a lot with Volkswagen tests, so that began to make me skeptical even of what is being written by the agents.

- When things were broken, the spaghetti and awful code tended to be written in such an obnoxious way that it was beyond repair, and it made me wish I had done it from scratch.

Thankfully I had only tried using agents for tests and not for the actual code, but it makes me wonder whether "vibe coding" really produces quality work.

34. I don't understand why you let your code get into such a state just because an agent wrote it. I won't approve such code from a human, and will ask them to change it, with suggestions on how. I do the same for code written by Claude.

And then I raise the PR and other humans review it, and they won't let me merge crap code.

Is it that a lot of you are working with much lighter weight processes and you're not as strict about what gets merged to main?

35. AI is getting to the game-changing point. We need more hand-written reflections on how individuals are managing to get real productivity gains in software engineering (not just a vibe-coded app).

36. These are all valid points and a hype-free, pragmatic take; I've been wondering about the same things, even though I'm still on the skeptics' side. I think there are other things that should be added, since Mitchell's reality won't apply to everyone:

- What about non-open-source work that's not on GitHub?

- Costs! I would think "an agent always running" would add up quickly.

- In open source work, how does it amplify others? Are you seeing AI slop as PRs? Can you tell the difference?

37. > babysitting my kind of stupid and yet mysteriously productive robot friend

LOL, been there, done that. It is much less frustrating and demoralizing than babysitting your kind of stupid colleague though. (Thankfully, I don't have any of those anymore. But at previous big companies? Oh man, if only their commits were ONLY as bad as a bad AI commit.)

38. The value Mitchell describes aligns well with the lack of value I'm getting. He feels that guiding an agent through a task is neither faster nor slower than doing it himself, and there are some tasks he doesn't even try to do with an agent because he knows it won't work, but it's easier to parallelize reviewing agentic work than it is to parallelize direct coding work. That's just not a usage pattern that's valuable to me personally: I rarely find myself in a situation where I have a large number of well-scoped programming tasks I need to complete, and it's a fun treat to do them myself when I do.
</comments_about_topic>

Write a concise, engaging paragraph (3-5 sentences) summarizing the key points and perspectives in these comments about the topic. Focus on the most interesting viewpoints. Do not use bullet points—write flowing prose.

topic

The Code Review Bottleneck # Concerns that generating code faster merely shifts the bottleneck to reviewing code, which is often harder and more time-consuming than writing it. Users discuss the cognitive load of verifying 'vibe code' and the risks of blindly trusting output that looks correct but contains subtle bugs or security flaws.

commentCount

38
