The following is content for you to summarize. Do not respond to the comments—summarize them.
<topic>
Agentic Limitations and Reliability # Criticisms of current AI agents acting like 'slot machines' requiring constant steering, their struggle with complex concurrency bugs, and the observation that they often produce boilerplate rather than solving deep architectural problems.
</topic>
<comments_about_topic>
1. This is what gets me... Even at companies with relatively small engineering teams compared to company size, actually getting coherent requirements and buy-in from every stakeholder on a single direction was enough work that we didn't really struggle with getting things done.
Sure, there was some lead time, but not nearly enough to 2x the team's productivity, let alone 10x.
Even when presented with something, there was still lead time turning that into something actually actionable as edge cases were sussed out.
2. So true, as a mere software developer on a payroll: I might spend 10 minutes doing a task with AI rather than an hour (w/o AI), but trust me - I am going to keep 50 minutes to myself, not deliver 5 more tasks )))) And when I work on my hobby project - having one AI agent crawling around my codebase is like watching a baby in a glassware shop. 10 babies? no thanks!
3. Same. I am doing this as Claude knocked out two annoying yak-shaving tasks I did not really want to do. It required careful review and tweaking.
Claiming that you now have 10 AI minions just wrecking your codebase sounds like showboating. I do not envy the people who will inherit those codebases later.
4. It's not true. The best vibe coders have been able to accomplish is projects which look like corporate boilerplate but have no inherent complexity.
It's nothing more than the surface-level projects we built when we wanted to pad out a resume.
5. Do you have any idea of the man-hours it took to build those large projects you are speaking of? Let's take Linux for example. Suppose for the sake of argument that Claude Code with Opus 4.5 is as smart as an average person (AGI), but with the added benefit that it can work 24/7. Suppose now I have millions of dollars to burn and am running 1000 such instances on max plans. Now say I have been running these agents since the date Claude Opus 4.5 was released, and I prompted them to create a commercial-grade multi-platform OS of the caliber of Linux.
An estimate for the Linux kernel is 100 million man-hours of work. Divide by 1000. We would expect to have a functioning OS like Linux by 2058 from these calculations.
How long has Claude been released? 2 months.
6. Linux is valuable, because very difficult bugs got fixed over time, by talented programmers. Bugs which would cause terrible security problems of external attacks, or corrupted databases and many more.
All difficult problems are solved, by solving simple problems first and combining the simple solutions to solve more difficult problems etc etc.
Claude can do that, but you seriously overestimate its capabilities by a factor of a thousand or a million.
Code that works but is buggy is not what Linux is.
7. Linux is 34 years old; most large software projects are not. Also, you're using a specific version of Claude, and sure, maybe this time is different (and every other time I've heard that over the past 5 years, it just wasn't). I don't buy it, but let's go along with it. Going off that, we have the equivalent of 2 years of development time according to what's being promised. Have you seen any software projects come out of Claude Opus 4.5 that you'd guess to have been a 2-year project? If so, please do share.
8. > check out my bio for one example.
First thing I got was “browser not supported” on mobile. Then I visited the website on desktop and tested languages I’m fluent in and found immediate problems with all of them.
The voices in Portuguese are particularly inexcusable, pairing the Portuguese flag with Brazilian voices; the accents are nothing alike, and it's not uncommon for native speakers of one to have difficulty understanding the other in verbal communication.
The knowledge assessments were subpar and didn’t seem to do anything; the words it tested almost all started with “a” and several are just the masculine/feminine variants. Then, even after I confirmed I knew every word, it still showed me some of those in the learning process, including incredibly basic ones like “I”, or “the”.
The website is something, and I very much appreciate you appear to be trying to build a service which respects the user, but I wouldn’t in good conscience recommend it to anyone. It feels like you have a particular disdain for Duolingo-style apps (I don’t blame you!) but there is so much more out there to explore in language learning.
9. "Built out products" like you're earning money on this? Having actual users, working through edge cases, browser quirks, race conditions, marketing, communication - the real battle testing 5% that's actually 95% of the work that in my view is impossible for the LLM? Because yeah the easy part is to create a big boilerplate app and have it sit somewhere with 2 users.
The hard part is day to day operations for years with thousands of edge cases, actual human feedback and errors, knocking on 1000 doors etc.
Otherwise you're just doing slot-machine coding on crack, where you work and work and work on some amazing thing and then it goes nowhere - and now you haven't even learned anything, because you didn't write the code, so the side project isn't even education anymore.
What's the point of such a project?
10. > I see Bay area startups pushing 996 and requiring living in the Bay area because of the importance of working in an office to reduce communication hurdles.
This is toxic behavior by these companies, and is not backed by any empirical data that I’ve ever seen. It should be shunned and called out.
As far as the remainder of your post, I think you've uncovered solid evidence that the ability of LLMs to code on their own, without human planning, architecting, and constant correction, is significantly oversold by most of the companies pushing the tech.
11. I have not used Claude. But my experience with Gemini and aider is that multiple instances of agents will absolutely stomp over each other. Even in a single session, the agent will often clobber my changes, overwriting them even after I've told it that I made modifications.
12. You should try Claude Opus 4.5 then. I haven't had that issue. The key is that you need well-defined specs and detailed instructions for each agent.
13. > Sloppy? Perhaps, but Claude has never made such a big mess that it has needed its work wiped.
I think a key thing to point out to people here is that Claude's built-in editing tools generally won't allow it to write to a file that has changed since the last time it read it, so if it tries to write and gets an error, it will tend to re-read the file and adjust its changes accordingly before trying again. I don't know how foolproof those checks are, because Claude can get creative with sed and cat to edit files, and of course if a change crosses file boundaries this might not avoid broken changes entirely. But generally - as you said - it seems good at avoiding big messes.
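The guard described in this comment amounts to an optimistic-concurrency check on a file's modification time. The sketch below is a hypothetical reconstruction for illustration, not Claude Code's actual implementation; the `GuardedEditor` class and its behavior are assumptions.

```python
import os

class GuardedEditor:
    """Hypothetical editing tool that records a file's mtime at read time
    and refuses to write if the file has changed since, forcing a re-read
    before retrying (assumed behavior, for illustration only)."""

    def __init__(self):
        self._read_mtimes = {}  # path -> mtime observed at last read

    def read(self, path):
        with open(path) as f:
            content = f.read()
        self._read_mtimes[path] = os.path.getmtime(path)
        return content

    def write(self, path, content):
        seen = self._read_mtimes.get(path)
        if seen is None or os.path.getmtime(path) != seen:
            # The file changed under us (or was never read): bail out so
            # the caller can re-read and re-derive its edit.
            raise RuntimeError(f"{path} changed since last read; re-read it first")
        with open(path, "w") as f:
            f.write(content)
        self._read_mtimes[path] = os.path.getmtime(path)
```

A tool loop built on this never silently clobbers a concurrent edit: the worst case is an extra read-modify-write round trip.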
14. LLM agents can be a bit like slot machines. The more the merrier.
And at least two generate continuous shitposts for their companies Slack.
That said, having one write code and a clean context review it is helpful.
15. > It's like someone is claiming they unlocked ultimate productivity by washing dishes, in parallel with doing laundry, and cleaning their house.
In this case you have to take a leap of faith and assume that Claude or Codex will get each task done correctly enough that your house won't burn down.
16. The problem isn't generating requirements, it's validating work. Spec-driven development and voice chat with ticket/chat context is pretty fast, but the validation loop is still mostly manual. When I'm building, I can orchestrate multiple swarms no problem; however, any time I have to drop in to validate stuff, my throughput drops and I can only drive 1-2 agents at a time.
17. >> I need 1 agent that successfully solves the most important problem
In most of these kinds of posts, that's still you. I don't believe I've come across a pro-faster-keyboard post yet that claims AGI. Despite the name, LLMs have no agency; it's still all on you.
Once you've defined the next most important problem, you have a smaller problem: translate those requirements into code which accurately meets them. That's the bit where these models can successfully take over. I think of them as a faster keyboard, and I've not seen a reason to change my mind yet despite using them heavily.
18. Why do you assume AGI needs to have agency?
19. Not OP, but I think that without some creative impetus like 'agency', how useful is an AGI going to be?
20. If cars do not have agency, how useful are they going to be? If the Internet does not have agency, how useful is it going to be? If fire has no agency (debatable), how useful is it going to be?
21. Call it what you want, but people are going to call the LLM with tools in a loop, and it will do something. There was the AI-slop email to Rob Pike the other day, which came from someone giving an agent the instruction to "do good", or some vague high-level thing like that.
22. Is it?
Yesterday, gemini told me to run this:
echo 'export ANDROID_HOME=/opt/my-user/android-sdk' > ~/.bashrc
Which would have effectively overwritten my whole bashrc config if I had blindly copy-pasted it, since '>' truncates the file rather than appending to it.
A few minutes later, when I asked it to create a .gitignore file for the current project - right after generating a private key - it failed to include the private key file in the .gitignore.
I don't see yet how these tools can be labeled 'major productivity boosters' if you lose basic security and privacy with them...
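The redirection pitfall in the quoted command maps directly onto file-open modes: the shell's `>` truncates like Python's mode `"w"`, while `>>` appends like mode `"a"`. A small demonstration of the difference (using a temp file as a stand-in for `~/.bashrc`):

```python
import os
import tempfile

# Stand-in for ~/.bashrc with some pre-existing configuration.
fd, rc = tempfile.mkstemp()
os.close(fd)
with open(rc, "w") as f:
    f.write("# existing bashrc contents\n")

# Append, as '>>' would have done -- the old config survives:
with open(rc, "a") as f:
    f.write("export ANDROID_HOME=/opt/my-user/android-sdk\n")
with open(rc) as f:
    appended = f.read()

# Truncate, as the suggested '>' actually does -- the old config is gone:
with open(rc, "w") as f:
    f.write("export ANDROID_HOME=/opt/my-user/android-sdk\n")
with open(rc) as f:
    truncated = f.read()

os.unlink(rc)
```

After the append, the file holds both lines; after the truncating write, only the export line remains, which is exactly how a blindly pasted `>` would wipe a user's shell configuration.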
23. > You should trust your team members to write good-enough code...
That's the thing, I trust my teammate, I absolutely do not trust any LLM blindly. So if I were to receive 100 PRs a week and they were all AI-generated, I would have to check all 100 PRs unless I just didn't give a shit about the quality of the code being shit out I guess.
And regardless, whether I trust my teammates or not, it's still good to have 2 eyes on code changes, even if they're simple ones. The majority of the PRs I review are indeed boring (boring is good, in this context) ones where I don't need to say anything, but everyone inevitably makes mistakes, and in my experience the biggest mistakes can be found in the simplest of PRs because people get complacent in those situations.
24. > Neovim has a decade old feature request for multiple clients to be able to connect to it. No traction alas.
Why cram every feature into one giant piece of software instead of using multiple smaller pieces of software in conjunction? For the feature you mentioned I just use tmux, which is built for this stuff.
Also, OpenCode has been extremely unreliable. I opened a PR about one of the simplest tools ever, `ls`, and they haven't fixed it yet. In a folder, their ls doesn't actually do what you'd expect: it iterates over all files of all folders (with a 200-file limit) and shows them to the model...
25. I tried Claude Code a while back when I decided to give "vibe-coding" a go. That was actually quite successful, producing a little utility that I use to this day, completely without looking at the code. (Well, I did briefly glance at it after completion and it made my eyeballs melt.) I concluded the value of this to me personally was nowhere near the price I was charged, so I didn't continue using it, but I was impressed nonetheless.
This brief use of Claude Code was done mostly on a train using my mobile phone's wi-fi hotspot. Since the connection would be lost whenever the train went through a tunnel, I encountered a bug in Claude Code [1]. The result of it was that whenever the connection dropped and came up again I had to edit an internal json file it used to track the state of its tool use, which had become corrupt.
The issue had been open for months then, and still is. The discussion under it is truly remarkable, and includes this comment from the devs:
> While we are always monitoring instances of this error and looking to fix them, it's unlikely we will ever completely eliminate it due to how tricky concurrency problems are in general.
Claude Code is, in principle, a simple command-line utility. I am confident that (given the backend and model, of course) I could implement the functionality of it that I used in (generously!) at most a few thousand lines of Python or JavaScript. I am very confident that I could do so without introducing concurrency bugs, and I am extremely confident that I could do it without messing up the design so badly that concurrency issues crop up continually and I have to admit to being powerless to fix them all.
Programming is hard, concurrency problems are tricky, and I don't like to cast aspersions on other developers, but we're being told this is the future of programming and we'd better get on board or be left behind - and it looks like we're being told this by people who, with presumably unlimited access to all this wonderful tooling, don't appear to be able to write decent software.
[1] https://github.com/anthropics/claude-code/issues/6836
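The corrupt-state-file failure mode this commenter hit (a tool-use JSON left half-written when the connection dropped) is the classic case for the atomic-write pattern: write to a temp file, then swap it into place. This is a generic sketch of that pattern, not how Claude Code actually persists its state; the function names are illustrative.

```python
import json
import os
import tempfile

def save_state(path, state):
    """Atomically persist JSON state: write a temp file in the same
    directory, fsync it, then swap it into place with os.replace.
    A crash or dropped connection mid-write leaves the previous file
    intact instead of a corrupt half-written one."""
    d = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=d, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, path)  # atomic rename on POSIX and Windows
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise

def load_state(path, default=None):
    """Load state, falling back to a default instead of crashing if the
    file is missing or (somehow) still corrupt."""
    try:
        with open(path) as f:
            return json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        return default
```

With writes funneled through `save_state`, a reader only ever sees the old complete file or the new complete file, so the "edit the internal json by hand after every tunnel" workaround would not be needed.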
26. It would be very interesting to see the outputs of his operations. How productive is one of his agents? How long does it take to complete a task, and how often does it require steering?
I'm a bit of a skeptic. Claude Code is good, but I've had varied results during my usage. Even just 5 minutes ago, I asked CC to view the most recent commit diff using git show. Even when I provided the command, it was doing dumb shit like git show --stat and then running wc for some reason...
I've been working on something called postkit[1], which has required me to build incrementally on a codebase that started from nothing and has now grown quite a lot. As it's grown, Claude Code's performance has definitely dipped.
[1] https://github.com/varunchopra/postkit
27. What I find surprising is how much human intervention the creator of Claude Code relies on. Every time Claude does something bad, we write it in claude.md so it learns from it... Why not create an agent to handle this and learn automatically from previous implementations?
B: Outcome Weighting
# memory/store.py
from enum import Enum

class RunOutcome(Enum):  # added so the snippet stands alone
    SUCCESS = "success"
    PARTIAL = "partial"
    FAILED = "failed"
    CANCELLED = "cancelled"

OUTCOME_WEIGHTS = {
    RunOutcome.SUCCESS: 1.0,    # full weight
    RunOutcome.PARTIAL: 0.7,    # some issues but shipped
    RunOutcome.FAILED: 0.3,     # downweighted but still findable
    RunOutcome.CANCELLED: 0.2,  # minimal weight
}

# Applied during scoring:
final_score = score * decay_factor * outcome_weight
C: Anti-Pattern Retrieval
# Similar features → SUCCESS/PARTIAL only
similar_features = store.search(..., outcome_filter=[SUCCESS, PARTIAL])
# Anti-patterns → FAILED only (separate section)
anti_patterns = store.search(..., outcome_filter=[FAILED])
Injected into agent prompt:
## Similar Past Features (Successful)
1. "Add rate limiting with Redis..." (Outcome: success, Score: 0.87)
## Anti-Patterns (What NOT to Do)
_These similar attempts failed - avoid these approaches:_
1. "Add rate limiting with in-memory..." (FAILED, Score: 0.72)
## Watch Out For
- **Redis connection timeout**: Set connection pool size
The flow now:
Query: "Add rate limiting"
│
├──► Similar successful features (ranked by outcome × decay × similarity)
│
├──► Failed attempts (shown as warnings)
│
└──► Agent sees both "what worked" AND "what didn't"
28. How has Claude Code (as a CLI tool, not the backing models) evolved over the last year?
For me it's practically the same, except for features that I don't need, don't work that well and are context-hungry.
Meanwhile, Claude Code still doesn't know how to jump into a dependency's (a library's) source to obtain factual information about it, which is actually quite easy by hand (normally it's cd'ing into a directory or unzipping some file).
So this wasteful workflow has only resulted in vibecoded, non-core features, while at the domain level Claude Code remains overly agnostic, if not stupid.
29. I'm a bit jealous. I would like to experiment with a similar setup, but 10x Opus 4.5 running practically non-stop must amount to a very high inference bill. Is it really worth the output?
From experimentation, I need to coach the models quite closely in order to get enough value. Letting them loose only works when I've given very specific instructions. But I'm using Codex and Clai; perhaps Claude Code is better.
30. I've tried running a number of Claudes in parallel on a CRUD full-stack JS app. Yes, it got features made faster; yes, it definitely did not leave me enough time to actually look at what they did; yes, it definitely produced sub-par code.
At the moment, with one Claude plus manually fixing the crap it produces, I am faster at solving "easier" features (think: add an API endpoint, rebuild the API client, implement the frontend logic for the endpoint plus UI) than if I write them myself.
For things that are more logic-dense, it tends to produce so many errors that it's faster to solve them myself.
31. > manually fixing crap it produces
> it tends to produce so many errors
I get some of the skepticism in this thread, but I don't get takes like this. How are you using CC such that the output you look at is "full of errors"? By the time I look at the output of a session, the agent has already run linting, formatting, testing and so on. The things I look at are adherence to the conventions, files touched, libraries used, and so on. And the "error rate" on those has been steadily coming down. Especially if you also use a review loop (with Codex, since it has been the best at review lately).
You have to set these things up for success. You need loops with clear feedback. You need a project that has lots of clear things to adhere to. You need tight integrations. But once you have these things, if you're looking at "errors", you're doing something wrong IMO.
32. I don't think he meant syntax errors, but thinking errors. I get these a lot with CC, especially with CSS, for example. It produces so much useless code, it blows my mind. Once I deleted 50 lines of code and manually added 4, which was enough to fix the error.
33. One thing that’s helped me is creating a bake-off. I’ll do it between Claude and codex. Same prompt but separate environments. They’ll both do their thing and then I’ll score them at the end. I find it helps me because frequently only one of them makes a mistake, or one of them finds an interesting solution. Then once I declare a winner I have scripts to reset the bake-off environments.
34. Having the 5 instances going at once sounds like Google Antigravity.
I haven't used Claude Code too much. One snag I found is its tendency, when running into problems, to fix them incorrectly by rolling back to older versions of things. I think it would benefit from an MCP server for, say, Maven Central. Likewise, it should prefer to generate code using things like project-scaffolding tooling whenever possible.
35. Yeah... I had a fairly in-depth conversation with Claude a couple of days ago about Claude Code and the way it works, and usage limits, and comparison to how other AI coding tools work, and the extremely blunt advice from Claude was that Claude Code was not suitable for serious software development due to usage limits! (props to Anthropic for not sugar coating it!)
Maybe on the Max 20x plan it becomes viable, and no doubt on the Boris Cherny unlimited usage plan it does, but it seems that without very aggressive non-stop context pruning you will rapidly hit limits and the 5-hour timeout even working with a single session, let alone 5 Claude Code sessions and another 5-10 web ones!
The key to this is the way that Claude Code (the local part) works and interacts with Claude AI (the actual model, running in the cloud). Basically, Claude Code maintains the context, comprising mostly the session history, the contents of source files it has accessed, and the read/write/edit tools (based on Node.js) it provides for Claude AI. This entire context, including all files that have been read and the tool definitions, is sent to Claude AI (eating into your token usage limit) with EVERY request, so once Claude Code has accessed a few source files, the content of those files will "silently" be sent as part of every subsequent request, regardless of what it is. Claude gave me an example where, with 3 smallish files open (a few thousand lines of code), within 5 requests the token usage might be 80,000 or so, vs the 40,000 limit of the Pro plan or the 200,000 limit of the Max 5x plan. Once you hit the limit, you have to wait 5 hours for a usage reset, so without Cherny's infinite usage limit this becomes a game of hurry up and wait (make 5 requests, then wait 5 hours and make 5 more).
You can restrict what source files Claude Code has access to in order to manage context size (e.g., in a C++ project, let it access all the .h module-definition files but block all the .cpp ones), as well as manually inspect the context all the time to see what is being sent that can be removed. I believe there is some automatic context compaction happening periodically too, but apparently not enough to prevent many/most people from hitting usage timeouts when working on larger projects.
Not relevant here, but Claude also explained how Cursor manages to provide fast/cheap autocomplete using its own models, by building a vector index of the codebase to pull only relevant chunks of code into the context.
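The retrieval idea relayed in that last aside (with the caveat that it came from Claude's own description of Cursor, so the specifics are unverified) boils down to nearest-neighbor search over chunk embeddings. A minimal sketch, where the embedding vectors stand in for the output of a real embedding model:

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k_chunks(query_vec, index, k=3):
    # index: list of (chunk_text, embedding) pairs built offline from the
    # codebase. Only the k best-matching chunks are put into the prompt,
    # instead of re-shipping the contents of whole files on every request.
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]
```

The contrast with the every-request full-context behavior described above is the point: an index like this caps prompt size at k chunks no matter how large the codebase grows.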
36. After regular use of an AI coding assistant for some time, I've noticed something unusual: my biggest wins came neither from better prompts nor from a smarter model. They came from the way I operated.
At first I treated it as autocomplete. Later, like a junior developer. In the end, as a collaborator that requires constraints.
Here is the framework I have landed on:
Stage 1: Ask for everything. You get acceleration, but lots of noise.
Stage 2: Add rules. Fewer surprises, more trust.
Stage 3: Give it room to act, but don't hesitate to review aggressively.
A few habits made a big difference:
Specifying exactly what it can touch.
Asking it to explain diffs before applying them.
Treating "wrong but confident" answers as a signal to tighten scope.
I'm wondering what others see over time:
What changed after the second or fourth week?
When did your trust increase or decrease?
What rules do you wish you had added earlier?
</comments_about_topic>
Write a concise, engaging paragraph (3-5 sentences) summarizing the key points and perspectives in these comments about the topic. Focus on the most interesting viewpoints. Do not use bullet points—write flowing prose.