Summarizer

LLM Input

llm/122b8d72-a8a3-4fcf-8eca-6a52786d1a8b/topic-2-6dc31e87-b439-4d14-9307-59b9e24e5858-input.json

prompt

The following is content for you to summarize. Do not respond to the comments—summarize them.

<topic>
AI Coding Tool Limitations # Discussion of how AI tools work well for simple, repetitive, or locally-scoped tasks but fail with complex systems, large codebases, and non-trivial problems requiring significant human guidance
</topic>

<comments_about_topic>
1. There's an odd trend with these sorts of posts where the author claims to have had some transformative change in their workflow brought upon by LLM coding tools, but also seemingly has nothing to show for it. To me, using the most recent ChatGPT Codex (5.3 on "Extra High" reasoning), it's incredibly obvious that while these tools are surprisingly good at doing repetitive or locally-scoped tasks, they immediately fall apart when faced with the types of things that are actually difficult in software development and require non-trivial amounts of guidance and hand-holding to get things right. This can still be useful, but is a far cry from what seems to be the online discourse right now.

As a real world example, I was told to evaluate Claude Code and ChatGPT codex at my current job since my boss had heard about them and wanted to know what it would mean for our operations. Our main environment is a C# and Typescript monorepo with 2 products being developed, and even with a pretty extensive test suite and a nearly 100 line "AGENTS.md" file, all models I tried basically fail or try to shortcut nearly every task I give it, even when using "plan mode" to give it time to come up with a plan before starting. To be fair, I was able to get it to work pretty well after giving it extremely detailed instructions and monitoring the "thinking" output and stopping it when I see something wrong there to correct it, but at that point I felt silly for spending all that effort just driving the bot instead of doing it myself.

It almost feels like this is some "open secret" which we're all pretending isn't the case too, since if it were really as good as a lot of people are saying there should be a massive increase in the number of high quality projects/products being developed. I don't mean to sound dismissive, but I really do feel like I'm going crazy here.

2. You're not going crazy. That is what I see as well. But, I do think there is value in:

- driving the LLM instead of doing it yourself. - sometimes I just can't get the activation energy and the LLM is always ready to go so it gives me a kickstart

- doing things you normally don't know. I learned a lot of command line tools and tricks by seeing what Claude does. Doing short scripts for stuff is super useful. Of course, the catch here is if you don't know stuff you can't drive it very well. So you need to use the things in isolation.

- exploring alternative solutions. Stuff that by definition you don't know. Of course, some will not work, but it widens your horizon

- exploring unfamiliar codebases. It can ingest huge amounts of data so exploration will be faster. (But less comprehensive than if you do it yourself fully)

- maintaining change consistency. This I think it's just better than humans. If you have stuff you need to change at 2 or 3 places, you will probably forget. LLM's are better at keeping consistency at details (but not at big picture stuff, interestingly.)

3. >driving the LLM instead of doing it yourself. - sometimes I just can't get the activation energy and the LLM is always ready to go so it gives me a kickstart

There is a counter issue though, realizing mid session that the model won’t be able to deliver that last 10%, and now you have to either grok a dump of half finished code or start from scratch.

4. > - maintaining change consistency. This I think it's just better than humans. If you have stuff you need to change at 2 or 3 places, you will probably forget. LLM's are better at keeping consistency at details (but not at big picture stuff, interestingly.)

I use Claude Code a decent amount, and I actually find that sometimes this can be the opposite for me. Sometimes it is actually missing other areas that the change will impact and causing things to break. Sometimes when I go to test it I need to correct it and point out it missed something or I notice when in the planning phase that it is missing something.

However I do find if you use a more powerful opus model when planning, it does consider things fully a lot better than it used to. This is actually one area I have been seeing some very good improvements as the models and tooling improves.

In fact, I actually hope that these AI tools keep getting better at the point you mention, as humans also have a "context limit". There are only so many small details I can remember about the codebase so it is good if AI can "remember" or check these things.

I guess a lot of the AI can also depend on your codebase itself, how you prompt it, and what kind of agents file you have. If you have a robust set of tests for your application you can very easily have AI tools check their work to ensure things aren't being broken and quickly fix it before even completing the task. If you don't have any testing more could be missed. So I guess it's just like a human in some sense. If you have a crappy codebase for the AI to work with, the AI may also sometimes create sloppy work.

5. > LLM's are better at keeping consistency at details (but not at big picture stuff, interestingly.)

I think it makes sense? Unlike small details which are certain to be explicitly part of the training data, "big picture stuff" feels like it would mostly be captured only indirectly.

6. It's like a little kid, you tell it to do the dishes and it does half of them and then runs away.

7. Pretty much every software engineer I've talked to sees it more or less like you do, with some amount of variance on exactly where you draw the line of "this is where the value prop of an LLM falls off". I think we're just awash in corporate propaganda and the output of social networks, and "it's good for certain things, mixed for others" is just not very memetic.

8. From what I get out of this is that these models are trained on basic coding and not enterprise level where you have thousands and thousands of project files all intertwined and linked with dependencies. It didn’t have access to all of that.

9. I definitely think it's language specific. My history may deceive me here, but i believe that LLMs are infinitely better at pumping out python scripts than java. Now i have much, much more experience with java than python, so maybe it's just a case of what you don't know.... However, The tools it writes in python just work for me, and i can incrementally improve them and the tools get rationally better and more aligned with what i want.

I then ask it to do the same thing in java, and it spends a half hour trying to do the same job and gets caught in some bit of trivia around how to convert html escape characters, for instance s.replace("<", "&lt;").replace(">", "&gt;").replace("\"", "&quot;");, and endlessly compiles and fails over and over again, never able to figure out what it has done wrong, and never deciding to give up on the minutiae and continue with the more important parts.
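
For reference, the escaping step described above can be sketched in a few lines. This is a minimal illustration in Python rather than the commenter's Java, and `escape_manual` and `sample` are hypothetical names, not anything from the original code. The one detail that chained replacements tend to get wrong is ordering: the ampersand has to be replaced first, otherwise the entities produced by the later replacements get escaped a second time.

```python
import html


def escape_manual(text: str) -> str:
    """Escape the four characters handled by the chained-replace approach.

    "&" must be replaced first; otherwise the "&" in "&lt;" and friends
    would itself be turned into "&amp;" by a later replacement.
    """
    return (
        text.replace("&", "&amp;")
        .replace("<", "&lt;")
        .replace(">", "&gt;")
        .replace('"', "&quot;")
    )


# Hypothetical sample input, chosen to contain all four characters.
sample = '<a href="x">Tom & Jerry</a>'

# The standard library already covers this case (and single quotes too).
escaped = html.escape(sample, quote=True)
```

For input without single quotes, `escape_manual(sample)` and `html.escape(sample)` should agree, which is a cheap check to run before trusting a hand-rolled version.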

10. I think LLMs have a hard time with large code bases (obviously so do devs).

A giant monorepo would be a bad fit for an LLM IMO.

11. With agentic search, they actually do pretty well with monorepos.

12. I remember when Anthropic was running their Built with Claude contest on reddit. The submissions were few and let's just say less than impressive. I use Claude Code and am very pro-AI in general, but the deeper you go, the more glaring the limitations become. I could write an essay about it, but I feel like there's no point in this day and age, where floods of slop in fractured echo chambers dominate.

13. The pattern matching and absence of real thinking is still strong.

Tried to move some excel generation logic from epplus to closedxml library.

ClosedXml has basically the same API so the conversion was successful. Not a one-shot but relatively easy with a few manual edits.

But closedxml has no batch operations (like apply style to the entire column): the api is there but internal implementation is on cell after cell basis. So if you have 10k rows and 50 columns every style update is a slow operation.

Naturally, told all about this to codex 5.3 max thinking level. The fucker still succumbed to range updates here and there.

Told it explicitly to make a style cache and reuse styles on cells on same y axis.

5-6 attempts — fucker still tried ranges here and there. Because that is what is usually done.

Not here yet. Maybe in a year. Maybe never.

14. That real engineer knows decent. This parrot knows only its own best (current attempt).

15. As an engineer, I can never actually let a system write code on behalf of me with the level of complacency I've accumulated over the years. I always have opinionated design decisions, variable naming practices. It's memorable, relatable, repeatable across N projects. Sure, you can argue that you can feed all this into the context, but I've found most models to hallucinate and make things unnecessarily opaque and complex. And then, I eventually have to spend time cleaning up all that mess. OP claims they can tell the model over the phone what to do and it does it. Good for OP, but I've never personally had that level of success with my own product development workflow. It sounds too good to be true if this level of autonomy is even possible today without the AI fucking something up.

16. A lot of more senior coders, when they actively try vibe coding a greenfield project, find that it does actually work. But only for the first ~10kloc. After that the AI, no matter how well you try to prompt it, will start to destroy existing features accidentally, will add unnecessary convoluted logic to the code, will leave behind dead code, add random traces "for backwards compatibility", will avoid doing the correct thing as "it is too big of a refactor", doesn't understand that the dev database is not the prod database and avoids migrations. And so forth.

I've got 10+ years of coding experience, I am an AI advocate, but not vibe coding. AI is a great tool to help with the boring bits, using it to initialize files, help figure out various approaches, as a first pass code reviewer, helping with configuring, those things all work well.

But full-on replacing coders? It's not there yet. Will require an order of magnitude more improvement.

17. > There is no code, there are no tools, there is no configuration, and there are no projects.

To add to this, OpenClaw is incapable of doing anything meaningful. The context management is horrible, the bot constantly forgets basic instructions, and often misconfigures itself to the point of crashing.

18. I think some of it might be genuine. For people that don't code (like management), going from 0 to being able to create a landing page that looks like it came from a big corporation is a miracle.

They are not able to comprehend that for anything more complicated than that, the code might compile, but the logical errors and failure to implement the specs start piling up.

19. Last night I was debugging a website where some users were sometimes getting a message that they were attempting to sign up too many times, even when they had only tried to sign up once.

I tried using LLMs to help debug at different points, but they went in circles on bad ideas, even when I gave them what turned out to be a correct clue.

Root cause turned out to be that IPv6 wasn't enabled for Docker networking, but was enabled for the website's DNS. So people who connected over IPv6 were getting their IPs all converted to the same internal Docker IP before being handed to the per-IP throttling algorithm.

I spotted that there were no IPv6 IPs in the logs, but the LLMs missed that the key pattern was the absence of something expected, instead drawing wrong conclusions.

So no, I'm not about to turn OpenClaw loose on building anything at all complex.
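
The failure mode in the comment above can be reproduced with a small simulation. This is a sketch under stated assumptions, not the commenter's actual setup: `PerIpThrottle`, `seen_ip`, `simulate`, the limit of 3 attempts per 60 seconds, and the `172.17.0.1` gateway address are all hypothetical, chosen only to show how distinct IPv6 clients end up sharing one throttling bucket when their addresses are rewritten before the limiter sees them.

```python
from collections import defaultdict, deque


class PerIpThrottle:
    """Sliding-window limiter: at most `limit` attempts per IP per `window` seconds."""

    def __init__(self, limit: int = 3, window: float = 60.0) -> None:
        self.limit = limit
        self.window = window
        self.attempts: dict[str, deque] = defaultdict(deque)

    def allow(self, ip: str, now: float) -> bool:
        recent = self.attempts[ip]
        # Drop attempts that have aged out of the window.
        while recent and now - recent[0] >= self.window:
            recent.popleft()
        if len(recent) >= self.limit:
            return False
        recent.append(now)
        return True


# Assumption: a single internal address stands in for every IPv6 client.
DOCKER_GATEWAY = "172.17.0.1"


def seen_ip(client_ip: str, ipv6_enabled_in_docker: bool) -> str:
    """Return the address the application would log for this client."""
    if ":" in client_ip and not ipv6_enabled_in_docker:
        return DOCKER_GATEWAY
    return client_ip


def simulate(ipv6_enabled_in_docker: bool) -> list[bool]:
    """Five different IPv6 users each sign up once, one second apart."""
    throttle = PerIpThrottle(limit=3, window=60.0)
    clients = [f"2001:db8::{i:x}" for i in range(1, 6)]
    return [
        throttle.allow(seen_ip(ip, ipv6_enabled_in_docker), now=float(t))
        for t, ip in enumerate(clients)
    ]
```

With IPv6 disabled in the container network, `simulate(False)` rejects the fourth and fifth users even though each tried only once, and no IPv6 address ever reaches the limiter, which matches the "absence of something expected" clue in the logs. With `simulate(True)` every user is allowed.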

20. I played with it extensively for three days. I think there are a few things it does that people are finding interesting:

1. It has a lot of files that it loads into its context for each conversation, and it consistently updates them. Plus it stores and can reference each conversation. So there's a sense of continuity over time.

2. It connects to messaging services and other accounts of yours, so again it feels continuous. You can use it on your desktop and then pick up your phone and send it an iMessage.

3. It hooks into a lot of things, so it feels like it has more agency. You could send it a voice message over discord and say "hey remember that conversation about birds? Send an email to Steve and ask him what he thinks about it"

It feels more like a smart assistant that's always around than an app you open to ask questions to.

However, it's worth stressing how terrible the software actually is. Not a single thing I attempted to do worked correctly, important issues (like the discord integration having huge message delays and sometimes dropping messages) get closed because "sorry we have too many issues", and I really got the impression that the whole thing is just a vibe coded pile of garbage. And I don't like to be that critical about an open source project like this, but I think considering the level of hype and the dramatic claims that humans shouldn't be writing code anymore, I think it's worth being clear about.

Ended up deleting it and setting up something much simpler. I installed a little discord relay called kimaki, and that lets me interact with instances of opencode over discord when I want to. I also spent some time setting up persistent files and made sure the llm can update them, although only when I ask it to in this case. That's covered enough of what I liked from OpenClaw to satisfy me.

21. While Claude was trying to fix a bug for me (one of these "here! It's fixed now!" "no it's not, the ut still doesn't pass", "ah, I see, let's fix the ut", "no you don't, fix the code" loops), I was updating my oncall rotation after having to run after people to refresh my credentials to do so, after attending a ship room where I had to provide updates and estimates.

Why isn't Claude doing all that for me, while I code? Why the obsession that we must use code generation, when handing off other garbage activities would free me to do what I'm, on paper, paid to do?

It's less sexy of course, it doesn't have the promise of removing me in the end. But the reason, in the present state, is that IT admins would never accept an llm handling permissions and rotations, and management would never accept an llm reporting status or providing estimates. This is all "serious" work where we can't have all the errors llms create.

Dev isn't that bad, devs can clean slop and customers can deal with bugs.

22. What I don’t understand in these posts is how exactly is the AI checking its work. That’s literally what I’m here for now. It doesn’t know how to log in to my iOS app using the simulator, or navigate to the firebase console and download a plist file.

Once we get to a spot where the AI can check its work and iterate, the loop is closed. But we are a long way off from that atm. Even for the web. I mean, have you tried the Playwright MCP server? Aside from being the slowest tool calls I have ever seen, the agent struggles mightily to figure out the simplest of navigation and interaction.

Yes yes Unit tests, but functional is the be all end all and until it can iterate and create its own functional test suite, I just don’t get it.

What am I missing?

23. I don't buy it. It's the same model underneath running whatever UI. It's the same model that keeps forgetting and missing details. And somehow when it is given a bunch of CLI tools and more interfaces to interact with, it suddenly becomes x10 AI? It may feel like it for a manager whose job is to deal with actual people who push back. Will it stop bypassing a test because it is directly not related to a feature I asked for? I don't think so.

24. I've been experimenting with getting Cursor/ChatGPT to take an old legacy project ( https://github.com/skullspace/Net-Symon-Netbrite ) which is not terribly complex, but interacts with hardware with some very specific instructions and converting that into a python version.
I've tried a few different versions/forks of the code (and other code to resurrect these signs) and each time it just absolutely cannot manage it. Which is quite frustrating and so instead the best thing I've been able to do is get it to comment each line of the code and explain what it is doing so I can manually implement it.

25. everything I see people do with openclaw is less like LLM work and more like 'Yahoo! Pipes' work.

I haven't been able to find a good use for myself yet. Almost everything I use an LLM for has some kind of hard human-in-the-loop factor that is as of yet inescapable -- but I also don't really use LLMs for things like "sort my email.". mostly entirely coding.

26. That's a very inefficient way to interact with CC. There will be transmission losses that need too much feedback looping.

So, it appears that we have come a long way bubbling up through abstraction layers: assembly code -> high-level languages -> scripting -> prompting -> openclaw.

27. I've done some phone programming over the Xmas holidays with clawdbot. This does work, BUT you absolutely need to demand clearly measurable outcomes of the agent, like a closed feedback loop or comparison with a reference implementation, or perfect score in a simulated environment. Without this, the implementation will be incomplete and likely utter crap.

Even then, the architecture will be horrible unless you chat _a lot_ about it upfront. At some point, it’s easier to just look in the terminal.

28. If you use Cursor or Claude, you have to oversee it and steer it so it gets very close to what you want to achieve.

If you delegate these tasks to OpenClaw, I am not really sure the result is exactly what you want to achieve and it works like you want it to.

29. This euphoria quickly turns into disappointment once you finish scaffolding and actually start the development/refinement phase and claude/codex starts shitting all over the code and you have to babysit it 100% of the time.

30. You have to be joking. I tried Codex for several hours and it has to be one of the worst models I’ve seen. It was extremely fast at spitting out the worst broken code possible. Claude is fine, but what they said is completely correct. At a certain point, no matter what model you use, llms cannot write good working code. This usually occurs after they’ve written thousands of lines of relatively decent code. Then the project gets large enough that if they touch one thing they break ten others.
</comments_about_topic>

Write a concise, engaging paragraph (3-5 sentences) summarizing the key points and perspectives in these comments about the topic. Focus on the most interesting viewpoints. Do not use bullet points—write flowing prose.

topic

AI Coding Tool Limitations # Discussion of how AI tools work well for simple, repetitive, or locally-scoped tasks but fail with complex systems, large codebases, and non-trivial problems requiring significant human guidance

commentCount

30
