Summarizer

LLM Input

llm/9db4e77f-8dd5-46da-972e-40d33f3399ef/topic-1-db6b201f-5159-4ee3-ae3b-146e1dd47e8b-input.json

prompt

The following is content for you to summarize. Do not respond to the comments—summarize them.

<topic>
Code Quantity versus Quality # Discussions on whether generating 50-100 Pull Requests a week represents true productivity or merely 'token-maxxing', with concerns about code churn, technical debt, and the inability of humans to properly review such high volumes of generated code.
</topic>

<comments_about_topic>
1. This is interesting to hear, but I don't understand how this workflow actually works.

I don't need 10 parallel agents making 50-100 PRs a week, I need 1 agent that successfully solves the most important problem.

I don't understand how you can generate requirements quickly enough to have 10 parallel agents chewing away at meaningful work. I don't understand how you can have any meaningful supervising role over 10 things at once, given the limits of human working memory.

It's like someone is claiming they unlocked ultimate productivity by washing dishes, in parallel with doing laundry, and cleaning their house.

Likely I am missing something. This is just my gut reaction as someone who has definitely not mastered using agents. Would love to hear from anyone that has a similar workflow where there is high parallelism.

2. Linux is valuable because very difficult bugs got fixed over time by talented programmers. Bugs which would have caused terrible security problems from external attacks, or corrupted databases, and many more.

All difficult problems are solved by solving simple problems first, then combining the simple solutions to solve more difficult problems, and so on.

Claude can do that, but you seriously overestimate its capabilities by a factor of a thousand or a million.

Code that works but is buggy is not what Linux is.

3. I’m building an ERP system; I’ve already been at it for 3 years (full time, but half the system is already in production with two tenants, so not all of my time is spent on completing the product; this revenue completely sustains the project). AI is now speeding this up tremendously. Maybe 2x velocity, which is a game changer but more realistic than what you hear. The post-AI features are just as good and stable as the pre-AI ones; why wouldn’t they be? I’m not going to put “slop” into my product, it’s all vetted by me. I do anticipate that when the complexity is built out and there are fewer new features and more maintaining/improving, the productivity will be immense.

4. Haha, thanks for checking it out! I really appreciate the feedback.

> First thing I got was “browser not supported” on mobile.

Yeah, I use some APIs that were only implemented in Safari on iOS 26. Kind of annoying but I use Android so I didn't realize until too late. I should fix it, but it's not a priority given the numerous other things that need improvement (as you noticed!)

> The voices in Portuguese are particularly inexcusable, using the Portuguese flag with Brazilian voices; the accents are nothing alike and it’s not uncommon for native speakers of one to have difficulty understanding the other in verbal communication.

That's good feedback, thanks! I only added Portuguese this weekend ( https://github.com/yaptown/yap/pull/73 ) so it's definitely still very alpha (as noted on the website :P )

> The knowledge assessments were subpar and didn’t seem to do anything; the words it tested almost all started with “a” and several are just the masculine/feminine variants.

Thanks, will fix this tonight. The placement test was just added last week ( https://github.com/yaptown/yap/pull/72 ) so there are still some kinks to work out.

> Then, even after I confirmed I knew every word, it still showed me some of those in the learning process, including incredibly basic ones like “I”, or “the”.

Yeah, the logic doesn't really work for people who already know every word. It tries to show words ranked by probability_of_knowledge * ln(frequency), descending. But if you already know every word, probability_of_knowledge is the same for every word and ln(frequency) is the only factor remaining, meaning you just get the most common words. I'll add a warning to the site for people who are too advanced for the app's dictionary size – as you pointed out, it's not a good UX.
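The ranking heuristic the commenter describes can be sketched in a few lines. This is a hypothetical illustration, not the app's actual code; the word list, frequencies, and probabilities below are made up for demonstration:

```python
import math

def order_words(words):
    """Rank (word, frequency, probability_of_knowledge) tuples by
    probability_of_knowledge * ln(frequency), descending."""
    return sorted(words, key=lambda w: w[2] * math.log(w[1]), reverse=True)

# A user who "knows every word" gets a uniform probability_of_knowledge,
# so only ln(frequency) differentiates and the most common words win.
vocab = [
    ("the", 50000, 0.5),
    ("I", 40000, 0.5),
    ("aardvark", 3, 0.5),
]
ordered = order_words(vocab)
```

With the uniform probabilities above, `ordered` puts "the" first and "aardvark" last, which matches the failure mode described: advanced users are shown the most basic words.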

> there is so much more out there to explore in language learning

There is! I usually recommend pimsleur to people. My hope is just for my app to be a useful supplement.

5. > shouldn't we be seeing a ton of 1 person startups?

Who should be seeing that? The thing about 1 person startups is that they require little to no communication to start up, and also very little capital. It seems easy to fly below the radar.

Also "a ton", idk. Doing a startup is still hard, for reasons outside of just being able to write a lot of code. In my experience, churning out all this code at 10x comes with a significant complexity tax: it turns out writing code and thinking about code problems was the relaxing part. When that goes away you have to think about real-world problems only. What a fucking mess.

Still, I would assume that it's more of a thing now, and something you could observe when you have YC data for example. Do we know that's not the case? I am in no position to say, one way or the other.

6. I suppose he may have a list of feature requests and bug reports to work on, but it does seem a bit odd from a human perspective to want to work on 5 or more things literally in parallel, unless they are all so simple that there is no cognitive load and context switching required to mentally juggle them.

Washing dishes in parallel with laundry and cleaning is of course easily possible, but precisely because there is no cognitive load involved. When the washing machine stops you can interrupt what you are doing to load clothes into the drier, then go back to cleaning/whatever. Software development for anything non-trivial obviously has a much higher task-switching overhead. Optimal flow for a purely human developer is to "load context" at the beginning of the day, then remain in flow-state without interruptions.

The cynical part of me also can't help but wonder if Cherny/Anthropic aren't just advocating token-maxxing!

7. > don't need 10 parallel agents making 50-100 PRs a week

I don't like to be mean, but a few weeks ago the guy bragged about Claude helping him do +50k loc and -48k loc (netting 2k loc). I thought he was joking, because I know plenty of programmers who do exactly that without AI; they just commit 10 huge JSON test files or re-format code.

I almost never open a PR without a thorough cleanup whereas some people seem to love opening huge PRs.

8. This is it! “I don't need 10 parallel agents making 50-100 PRs a week, I need 1 agent that successfully solves the most important problem.”

9. maybe more like throwing shit at the wall and seeing what sticks?

10. 50-100 PRs a week but they still can't fix the 'flickering' bug

11. 50-100 PRs a week to me is insane. I'm a little skeptical and wonder how large/impactful they are. I use AI a lot and have seen significant productivity gains but not at that level lol.

12. Yeh, 100 PRs a week is a PR every 24 minutes at standard working hours (not including lunch break). That would be crazy to even review.
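The 24-minute figure above is easy to verify; a quick sketch, assuming a 5-day, 8-hour working week with lunch excluded:

```python
# Back-of-the-envelope check: 100 PRs in one standard working week.
working_minutes = 5 * 8 * 60        # 2400 minutes in a 5-day, 8-hour week
prs_per_week = 100                  # high end of the claimed range
minutes_per_pr = working_minutes / prs_per_week  # → 24.0
```

At the low end of the claim (50 PRs), the same arithmetic gives one PR every 48 minutes, which is still far beyond the review throughput later commenters report.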

13. I work for a FAANG and I'm the top reviewer in my team (in terms of number of PRs reviewed). I work on an internal greenfield project, so something really fast moving.

For ALL of 2025 I reviewed around 400 PRs. And that already took me an extreme amount of time.

Nobody is reviewing this many PRs.

I've also raised around 350 PRs in the same year, which is also #1 for my team.

AI or not, nobody is raising upwards of 3,500 CRs a year. In fact, my WHOLE TEAM of 15 people has barely raised this number of CRs for the year.

I don't know why people keep believing those wild unproven claims from actors who have everything to gain from you believing them. Has common sense gone down the drain that much, even for educated professionals?

14. > I don't know why people keep believing those wild unproven claims from actors who have everything to gain from you believing them.

It's grifters all the way down. The majority of people pushing this narrative have vested interests, either because they own some AI shovelware company or are employed by one of the AI shovelware companies. Anthropic specifically is running guerrilla marketing campaigns fucking everywhere at the moment; it's why every single one of these types of spammed posts reads the same way. They've also switched up a bit of late: they stopped going with the "It makes me a 10x engineer!" BS (though you still see plenty of that) and are instead going with this weird "I can finally have fun developing again!" narrative, I guess trying to cater to the ex-devs that are now managers or whatever.

What happens is you get juniors and non-technical people seeing big numbers and being like "Wow, that's so impressive!" without stopping to think for 5 seconds what the kind of number they're trying to push even actually means. 100 PRs is absurd unless they're tiny oneliners, and even if they were tiny changes, there's 0 chance anyone is looking at the code being shat out here.

15. Reviewing PRs should be for junior engineers, architectural changes, brand new code, or broken tests. You should not review every PR; if you do, you're only doing it out of habit, not because it's necessary.

PRs come originally from the idea that there's an outsider trying to merge code into somebody's open source project, and the Benevolent Dictator wants to make sure it's done right. If you work on a corporate SWEng team, this is a completely different paradigm. You should trust your team members to write good-enough code, as long as conventions are followed, linters used, acceptance tests pass, etc.

16. > You should trust your team members to write good-enough code...

That's the thing: I trust my teammates; I absolutely do not trust any LLM blindly. So if I were to receive 100 PRs a week and they were all AI-generated, I would have to check all 100 PRs, unless I just didn't give a shit about the quality of the code being shat out, I guess.

And regardless of whether I trust my teammates, it's still good to have a second pair of eyes on code changes, even simple ones. The majority of the PRs I review are indeed boring (boring is good, in this context) ones where I don't need to say anything, but everyone inevitably makes mistakes, and in my experience the biggest mistakes can be found in the simplest of PRs, because people get complacent in those situations.

17. I am also skeptical about the need for such a large number of PRs. Are those opened because previous PRs didn't accomplish their goals?

It's frustrating because, being part of a small team, I absolutely fucking hate it when any LLM product writes or refactors thousands of lines of code. It's genuinely infuriating because now I am fully reliant on it to make any changes, even really simple ones. Just seems like a new version of vendor lock-in to me.

18. Because he is working on a product that is hot and has demand from users for new features/bug fixes/whatnot, and he also gets visibility by getting such things delivered. Most of us don't work on products like that on a daily basis.

19. In other words, nobody cares that the generated code is shit, because there is no human who can review that much code. Not even on high level.

According to the discussion here, they don't even care whether the tests are real. They just care that it's green. If the tests are useless in reality? Who cares, nobody has time to check them!

And who will suffer because of this? Who cares, they pray it's not them!

20. >nobody cares that the generated code is shit

That is the case whether the code is AI-generated or not. Go take a look at some of the source code for the tools you use every day, and you'll find a lot of shit code. I'd go so far as to say, after ~30 years of contributing to open source, that it's the rare jewel that has clean code.

21. Yeah, but there is a difference between at least one person having understood the code (or a specific part of it) at some point, and nobody ever having understood it. Also, there are different levels. Wildfly’s code, for example, is utterly incomprehensible, because the flow jumps up and down huge inheritance chains to random points all the time; some Java Enterprise people are terrible with this. Anyway, the average for widely used tools is way better than that. So it’s definitely possible to make things worse, and blindly trusting AI is one possible way to reach those new lows. It would be good to prevent that before it’s too late, rather than praising it and even throwing out one of the (broken, but better than nothing) safeguards. Especially since code review is obviously dead with such an amount of generated code per week. (The situation wasn’t great there before, either.) So it’s a two-in-one bad situation.

22. For comparison, I remember doing 250 PRs in 2.5 months of my internship at FB (working on a fullstack web app). So that’s 2-4x faster. What’s interesting is that it’s Boris, not an intern (although the LLM can play an intern well).

23. 50-100 is a lot, but 15 a week should be normal with continuous integration; you should be merging multiple times a day.

24. Where have you worked? I have been at a lot of places and I have never seen people consistently checking in 2 PRs/day, every day.

25. iirc he (or his colleague) did mention somewhere on X that most of the PRs are small

26. When people do PR counting then I assume they're dependabot-style stuff.

27. Why stop at 5-10?
Make it 5 billion - 10 billion parallel agents.
PR number go up

28. I have a coworker who is basically doing this right now; he leads our team and is in second place overall. He regularly runs Opus in parallel, and he alone is burning through 1k worth of credits a day.

He is also one of our worst performers.

29. I've tried running a number of Claudes in parallel on a CRUD full-stack JS app. Yes, it got features made faster; yes, it definitely did not leave me enough time to actually look at what they did; and yes, it definitely produced sub-par code.

At the moment, with one Claude plus manually fixing the crap it produces, I am faster at solving "easier" features (think: add an API endpoint, re-build the API client, implement frontend logic for the endpoint plus UI) than if I wrote them myself.

For things that are more logic-dense, it tends to produce so many errors that it's faster to solve them myself.

30. I spent a whole day running 3 local CC sessions and about 7 Claude Code web sessions. It was my heaviest usage day ever: about 30 pull requests created and merged across 3 projects.

I got a lot done, but my brain was fried after that. Like wired but totally exhausted.

Has anyone else experienced this and did you find strategies to help (or find that it gets easier)?

31. Did you manage to properly review all 30 PRs?

32. I have formal requirements for all implemented code. This is all on relatively greenfield, solo-developed codebases with tools I know inside out (Django, click-based CLIs, etc.), so yes. Thanks so much for your concern, internet person!

33. I was genuinely interested in knowing if you did it properly or not, since I read a lot of tales like this but don't understand how it can be true.

34. I think we all know the answer to that.

35. > I assume "what sort of problems you must have" was directed at me.

I don't really have any sort of personal problem with Boris' post, if that's what your inflammatory statement was implying.

I also think it was a fairly good description of his workflow, technically speaking, but it glosses over the actual monetary costs of what he is doing and, as noted above, doesn't really describe the actual outcomes other than a lot of PRs.
</comments_about_topic>

Write a concise, engaging paragraph (3-5 sentences) summarizing the key points and perspectives in these comments about the topic. Focus on the most interesting viewpoints. Do not use bullet points—write flowing prose.

topic

Code Quantity versus Quality # Discussions on whether generating 50-100 Pull Requests a week represents true productivity or merely 'token-maxxing', with concerns about code churn, technical debt, and the inability of humans to properly review such high volumes of generated code.

commentCount

35
