Summarizer

AI Performance on Greenfield vs. Legacy

Users debate whether agents excel primarily at starting new projects from scratch while struggling to maintain large, complex, or legacy codebases without breaking existing conventions.

← Back to Opus 4.5 is not the normal AI agent experience that I have had thus far

84 comments tagged with this topic

> We have an in-house, Rust-based proxy server. Claude is unable to contribute to it meaningfully outside
I have a great time using Claude Code in Rust projects, so I know it's not about the language exactly. My working model is that since LLMs are basically inference/correlation based, the more you deviate from the mainstream corpus of training data, the more confused the LLM gets. Because an LLM doesn't "understand" anything. But if it was trained on a lot of things kind of like the problem, it can match the patterns just fine, and it can generalize over a lot of layers, including programming languages. Also I've noticed that it can get confused about stupid stuff. E.g. I had two different things named kind of the same in two parts of the codebase, and it would constantly stumble on conflating them. Changing the name in the codebase immediately improved it. So yeah, we've got another potentially powerful tool that requires understanding how it works under the hood to be useful. Kind of like git.
It can do this at the level of a function, and that's -useful-, but like the parent reply to top-level comment, and despite investing the time, using skills & subagents, etc., I haven't gotten it to do well with C++ or Rust projects of sufficient complexity. I'm not going to say they won't some day, but, it's not today.
Anecdotally, we use Opus 4.5 constantly on Zed's code base, which is almost a million lines of Rust code and has over 150K active users, and we use it for basically every task you can think of - new features, bug fixes, refactors, prototypes, you name it. The code base is a complex native GUI with no Web tech anywhere in it. I'm not talking about "write this function" but rather like implementing the whole feature by writing only English to the agent, over the course of numerous back-and-forth interactions and exhausting multiple 200K-token context windows. For me personally, definitely at least 99% of all the Rust code I've committed at work since Opus 4.5 came out has been from an agent running that model. I'm reading lots of Rust code (that Opus generated) but I'm essentially no longer writing any of it. If dot-autocomplete (and LLM autocomplete) disappeared from IDE existence, I would not notice.
> Do you think it can replace you basically one-shotting features/bugs in Zed?
Nobody is one-shotting anything nontrivial in Zed's code base, with Opus 4.5 or any other model. What about a future model? Literally nobody knows. Forecasts about AI capabilities have had horrendously low accuracy in both directions - e.g. most people underestimated what LLMs would be capable of today, and almost everyone who thought AI would at least be where it is today...instead overestimated and predicted we'd have AGI or even superintelligence by now. I see zero signs of that forecasting accuracy improving. In aggregate, we are atrocious at it. The only safe bet is that hardware will be faster and cheaper (because the most reliable trend in the history of computing has been that hardware gets faster and cheaper), which will naturally affect the software running on it.
> And also - doesn’t that make Zed (and other editors) pointless?
It means there's now demand for supporting use cases that didn't exist until recently, which comes with the territory of building a product for technologists! :)
Trying to one-shot large codebases is an exercise in futility. You need to let Claude figure out and document the architecture first, then set up agents for each major part of the project. Doing this keeps the context clean for the main agent, since it doesn't have to go read the code each time. So one agent can fill its entire context understanding part of the code and then the main agent asks it how to do something and gets a shorter response. It takes more work than one-shot, but not a lot, and it pays dividends.
Is there a guide for doing that successfully somewhere? I would love to play with this on a large codebase. I would also love to not reinvent the wheel on getting Claude working effectively on a large code base. I don’t even know where to start with, e.g., setting up agents for each part.
I'll second this. I'm making a fairly basic iOS/Swift app with an accompanying React-based site. I was able to vibe-code the React site (it isn't pretty, but it works and the code is fairly decent). But I've struggled to get the Swift code to be reliable. Which makes sense. I'm sure there's lots of training data for React/HTML/CSS/etc. but much less with Swift, especially the newer versions.
I built an open to "game engine" entirely in Lua many years ago, but relying on many third-party libraries that I would bind to with FFI. I thought I'd revive it, but this time with Vulkan and no third-party dependencies (except for Vulkan). 4.5 Sonnet, Opus and Gemini 3.5 flash have helped me write image decoders for dds, png, jpg, exr, a Wayland window implementation, a macOS window implementation, etc. I find that Gemini 3.5 flash is really good at understanding 3D in general while Sonnet might be lacking a little. All these SOTA models seem to understand my bespoke Lua framework and the right level of abstraction. For example at the low level you have the generated Vulkan bindings, then after that you have objects around Vulkan types, then finally a high level pipeline builder and whatnot which does not mention Vulkan anywhere. However with a larger C# codebase at work, they really struggle. My theory is that there are too many files and abstractions so that they cannot understand where to begin looking.
>> Maybe they are struggling to convince others because they are unable to produce evidence that is able to convince people?
Simon has produced plenty of evidence over the past year. You can check their submission history and their blog: https://simonwillison.net/ The problem with people asking for evidence is that there's no level of evidence that will convince them. They will say things like "that's great but this is not a novel problem so obviously the AI did well" or "the AI worked only because this is a greenfield project, it fails miserably in large codebases".
I think it's worth understanding why. Because that's not everyone's experience and there's a chance you could make a change such that you find it extremely useful. There's a lesser chance that you're working on a code base that Claude Code just isn't capable of helping with.
The more explicit/detailed your plan, the more context it uses up, the less accurate and generally functional it is. Don't get me wrong, it's amazing, but on a complex problem with large enough context it will consistently shit the bed.
The human still has to manage complexity. A properly modularized and maintainable code base is much easier for the LLM to operate on — but the LLM has difficulty keeping the code base in that state without strong guidance. Putting “Make minimal changes” in my standard prompt helped a lot with the tendency of basically all agents to make too many changes at once. With that addition it became possible to direct the LLM to make something similar to the logical progression of commits I would have made anyway, but now don’t have to work as hard at crafting. Most of the hype merchants avoid the topic of maintainability because they’re playing to non-technical management skeptical of the importance of engineering fundamentals. But everything I’ve experienced so far working with LLMs screams that the fundamentals are more important than ever.
This is where the LLM coding shines in my opinion, there's a list of things they are doing very well:
- single scripts. Anything which can be reduced to a single script.
- starting greenfield projects from scratch
- code maintenance (package upgrades, old code...)
- tasks which have a very clear and single definition. This isn't linked to complexity, some tasks can be both very complex but with a single definition.
If your work falls into this list they will do some amazing work (and yours clearly fits that), if it doesn't though, prepare yourself because it will be painful.
I'm trying to determine what programming tasks are not in this list. :) I think it is trying to exclude adding new features and fixing bugs in existing code. I've done enough of that with LLMs, though not in large codebases. I should say I'm hardly ever vibe-coding, unlike the original article. If I think I want code that will last, I'll steer the models in ways that lean on years of non-LLM experience. E.g., I'll reject results that might work if they violate my taste in code. It also helps that I can read code very fast. I estimate I can read code 100x faster than most students. I'm not sure there is any way to teach that other than the old-fashioned way, which involves reading (and writing) a lot of code.
> I'm trying to determine what programming tasks are not in this list. :) I think it is trying to exclude adding new features and fixing bugs in existing code
Yes indeed, these are the things on the other hand which aren't working well in my opinion:
- large codebase
- complex domain knowledge
- creating any feature where you need product insights
- tasks requiring choices (again, complexity doesn't matter here, the task may be simple but require some choices)
- anything unclear where you don't know where you are going first
While you don't experience any of these in teaching or side projects, these are very common in any enterprise context.
Everybody says how good Claude is and I go to my code base and I can't get it to correctly update one xaml file for me. It is quicker to make changes myself than to explain exactly what I need or learn how to do "prompt engineering". Disclaimer: I don't have access to Claude Code. My employer has only granted me Claude Teams. Supposedly, they don't use my poopy code to train their models if I use my work email Claude so I am supposed to use that. If I'm not pasting code (asking general questions) into Claude, I believe I'm allowed to use whatever.
My main experience is with Anthropic models. I've had some encounters with inaccuracies but my general experience has been amazing. I've cloned completely foreign git repos, cranked up the tool and just said "I'm having this bug, give me an overview of how X and Y work" and it will create great high level conceptual outlines that mean I can dive straight in where without it I would spend a long time just flailing around. I do think an essential skill is developing just the right level of scepticism. It's not really different to working with a human though. If a human tells me X or Y works in a certain way I always allow a small margin of possibility that they are wrong.
What I think people get wrong (especially non-coders) is that they believe the limitation of LLMs is building a complex algorithm. That issue in reality was fixed a long time ago. The real issue is building a product. Think about microservices in different projects, using APIs that are not perfectly documented or whose documentation is massive, etc. Honestly I don't know what commenters on Hacker News are building, but a few months back I was hoping to use AI to build the interaction layer with Stripe to handle multiple products and delayed cancellations via subscription schedules. Everything is documented, the documentation is a bit scattered across pages, but the information is out there. At the time there was Opus 4.1, so I used that. It wrote 1000 lines of non-functional code with 0 reusability after several prompts. I then asked ChatGPT whether it was possible without using schedules; it told me yes (even though it is not), and when I told Claude to recode it, it started coding random stuff that doesn't exist. I built everything to be functional and reusable myself, in approximately 300 lines of code. The above is a software engineering problem. Reimplementing a JSON parser using Opus is neither fun nor useful, so that should not be used as a metric.
This hits the nail on the head. There's a marked difference between a JSON parser and a real world feature in a product. Real world features are complex because they have opaque dependencies, or ones that are unknown altogether. Creating a good solution requires building a mental model of the actual complex system you're working with, which an LLM can't do. A JSON parser is effectively a book problem with no dependencies.
What bothers me about posts like this is: mid-level engineers are not tasked with atomic, greenfield projects. If all an engineer did all day was build apps from scratch, with no expectation that others may come along and extend, build on top of, or depend on, then sure, Opus 4.5 could replace them. The hard thing about engineering is not "building a thing that works", it's building it the right way, in an easily understood way, in a way that's easily extensible. No doubt I could give Opus 4.5 "build me an XYZ app" and it will do well. But day to day, when I ask it "build me this feature" it uses strange abstractions, and often requires several attempts on my part to do it in the way I consider "right". Any non-technical person might read that and go "if it works it works" but any reasonable engineer will know that that's not enough.
Not necessarily responding to you directly, but I find this take to be interesting, and I see it every time an article like this makes the rounds. Starting back in 2022/2023:
- (~2022) It can auto-complete one line, but it can't write a full function.
- (~2023) Ok, it can write a full function, but it can't write a full feature.
- (~2024) Ok, it can write a full feature, but it can't write a simple application.
- (~2025) Ok, it can write a simple application, but it can't create a full application that is actually a valuable product.
- (~2025+) Ok, it can write a full application that is actually a valuable product, but it can't create a long-lived complex codebase for a product that is extensible and scalable over the long term.
It's pretty clear to me where this is going. The only question is how long it takes to get there.
> It's pretty clear to me where this is going. The only question is how long it takes to get there.
I don't think it's a guarantee. All of the things it can do from that list are greenfield, they just have increasing complexity. The problem comes because even in agentic mode, these models do not (and I would argue, cannot) understand code or how it works, they just see patterns and generate a plausible sounding explanation or solution. Agentic mode means they can try/fail/try/fail/try/fail until something works, but without understanding the code, especially of a large, complex, long-lived codebase, they can unwittingly break something without realising - just like an intern or newbie on the project, which is the most common analogy for LLMs, with good reason.
While I do agree with you, to play the counterpoint advocate: what if we get to the point where all software is basically created 'on the fly' as greenfield projects as needed, and you never need to have a complex, large, long-lived codebase? It is probably incredibly wasteful, but ignoring that, could it work?
That sounds like an insane way to do anything that matters. Sure, create a one-off app to post things to your Facebook page. But a one-off app for the OS it's running on? Freshly generating the code for your bank transaction rules? Generating an authorization service that gates access to your email? The only reason it's quick to create green-field projects is because of all these complex, large, long-lived codebases that it's gluing together. There's ample training data out there for how to use the Firebase API, the Facebook API, OS calls, etc. Without those long-lived abstraction layers, you can't vibe out anything that matters.
In Japan buildings (apartments) aren't built to last forever. They are built with a specific age in mind. They acknowledge the fact that houses are depreciating assets which have a value lim->0. The only reason we don't do that with code (or didn't use to do it) was because rewriting from scratch NEVER worked[0]. And large scale refactors take massive amounts of time and resources, so much so that there are whole books written about how to do it. But today trivial to simple applications can be rewritten from spec or scratch in an afternoon with an LLM. And even pretty complex parsers can be ported provided that the tests are robust enough[1]. It's just a matter of time before someone rewrites a small to medium size application from one language to another using the previous app as the "spec".
[0] https://www.joelonsoftware.com/2000/04/06/things-you-should-...
[1] https://simonwillison.net/2025/Dec/15/porting-justhtml/
I haven't seen an AI successfully write a full feature to an existing codebase without substantial help, I don't think we are there yet. > The only question is how long it takes to get there. This is the question and I would temper expectations with the fact that we are likely to hit diminishing returns from real gains in intelligence as task difficulty increases. Real world tasks probably fit into a complexity hierarchy similar to computational complexity. One of the reasons that the AI predictions made in the 1950s for the 1960s did not come to be was because we assumed problem difficulty scaled linearly. Double the computing speed, get twice as good at chess or get twice as good at planning an economy. P, NP separation planed these predictions. It is likely that current predictions will run into similar separations. It is probably the case that if you made a human 10x as smart they would only be 1.25x more productive at software engineering. The reason we have 10x engineers is less about raw intelligence, they are not 10x more intelligent, rather they have more knowledge and wisdom.
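One way to make the scaling argument in the comment above concrete is the following minimal sketch. It assumes a brute-force search whose cost grows as `branching_factor ** depth`, which is a simplification chosen for illustration and not a claim from the comment; the function name and the branching factors in the loop are hypothetical.

```python
import math


def extra_search_depth(branching_factor: float, speedup: float) -> float:
    """Extra depth a brute-force search can reach when compute grows by `speedup`.

    Assumption: cost grows as branching_factor ** depth, so the gain is
    log(speedup) / log(branching_factor).
    """
    if branching_factor <= 1 or speedup <= 0:
        raise ValueError("need branching_factor > 1 and speedup > 0")
    return math.log(speedup) / math.log(branching_factor)


if __name__ == "__main__":
    # Illustrative branching factors only (assumptions, not measured values).
    for b in (2, 10, 35):
        gain = extra_search_depth(b, 2)
        print(f"branching={b:>2}: 2x compute -> +{gain:.2f} levels of depth")
```

Under this assumption, doubling compute buys a constant (and often fractional) increase in search depth rather than doubling it, which is the kind of non-linear scaling the comment describes.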
This is disingenuous because LLMs were already writing full, simple applications in 2023.[0] They're definitely better now, but it's not like ChatGPT 3.5 couldn't write a full simple todo list app in 2023. There were a billion blog posts talking about that and how it meant the death of the software industry. Plus I'd actually argue more of the improvements have come from tooling around the models rather than what's in the models themselves.
[0] e.g. https://www.youtube.com/watch?v=GizsSo-EevA
I use it on a 10-year-old codebase. I need to explain where to get context, but it works successfully 90% of the time.
Anecdata but I’ve found Claude code with Opus 4.5 able to do many of my real tickets in real mid and large codebases at a large public startup. I’m at senior level (15+ years). It can browse and figure out the existing patterns better than some engineers on my team. It used a few rare features in the codebase that even I had forgotten about and was about to duplicate. To me it feels like a real step change from the previous models I’ve used which I found at best useless. It’s following style guides and existing patterns well, not just greenfield. Kind of impressive, kind of scary
Same anecdote for me (except I have +/- 40 years of experience). I consider myself a pretty good dev for non-web dev (GPUs, assembly, optimisation,...) and my conclusion is the same as yours: impressive and scary. If somehow the idea of what you want to do is on the web in text or in code, then Claude most likely has it. And its ability to understand my own codebases is just crazy (at my age, memory is declining and having Claude to help is just waow). Of course it fails sometimes, of course it needs direction, but the thing it produces is really good.
I've also found it to keep such a constrained context window (on large codebases), that it writes a secondary block of code that already had a solution in a different area of the same file. Nothing I do seems to fix that in its initial code writing steps. Only after it finishes, when I've asked it to go back and rewrite the changes, this time making only 2 or 3 lines of code, does it magically (or finally) find the other implementation and reuse it. It's freakin incredible at tracing through code and figuring it out. I <3 Opus. However, it's still quite far from any kind of set-and-forget-it.
Yeah I love working with Claude Code, I agree that the new models are amazing, but I spend a decent amount of time saying "wait, why are we writing that from scratch, haven't we written a library for that, or don't we have examples of using a third party library for it?". There is probably some effective way to put this direction into the claude.md, but so far it still seems to do unnecessary reimplementation quite a lot.
Yes LLMs aren't very good at architecture. I suspect because the average project online has pretty bad architecture. The training set is poisoned. It's kind of bittersweet for me because I was dreaming of becoming a software architect when I graduated university and the role started disappearing so I never actually became one! But the upside of this is that now LLMs suck at software architecture... Maybe companies will bring back the software architect role? The training set has been totally poisoned from the architecture PoV. I don't think LLMs (as they are) will be able to learn software architecture now because the more time passes, the more poorly architected slop gets added online and finds its way into the training set. Good software architecture tends to be additive, as opposed to subtractive. You start with a clean slate then build up from there. It's almost impossible to start with a complete mess of spaghetti code and end up with a clean architecture... Spaghetti code abstractions tend to mislead you and lead you astray... It's like; understanding spaghetti code tends to soil your understanding of the problem domain. You start to think of everything in terms of terrible leaky abstraction and can't think of the problem clearly. It's hard even for humans to look at a problem through fresh eyes; it's likely even harder for LLMs to do it. For example, if you use a word in a prompt, the LLM tends to try to incorporate that word into the solution... So if the AI sees a bunch of leaky abstractions in the code; it will tend to try to work with them as opposed to removing them and finding better abstractions. I see this all the time with hacks; if the code is full of hacks, then an LLM tends to produce hacks all the time and it's almost impossible to make it address root causes... Also hacks tend to beget more hacks.
Refactoring is a very mechanistic way of turning bad code into good. I don’t see a world in which our tools (LLMs or otherwise) don’t learn this.
Most code out there is a legacy security nightmare, surely it's good to train on that.
Most legacy apps are barely understood by anyone, and yet continue to generate value and are (somehow) kept alive.
Many here have been doing the "understanding of legacy code" as a job for 50+ years. This "legacy apps are barely understood by anybody" is just something you made up.
Give it another 10 years if the "LLM as compiler" people get their way.
We don't know what Opus 5.0 will be able to refactor. If the argument is "humans and Opus 4.5 cannot maintain this, but if requirements change we can vibe-code a new one from scratch", that's a coherent thesis, but people need to be explicit about this. (Instead this feels like the motte that is retreated to, and the bailey is essentially "who cares, we'll figure out what to do with our fresh slop later".) Ironically, I've found Claude to be really good at refactors, but these are refactors I choose very explicitly. (Such as I start the thing manually, then let it finish.) (For an example of it, see me force-pushing to https://github.com/NixOS/nix/pull/14863 implementing my own code review.) But I suspect this is not what people want. To actually fire devs and not rely on from-scratch vibe-coding, we need to figure out which refactors to attempt in order to implement a given feature well. That's a very creative open-ended question that I haven't even tried to let the LLMs take a crack at, because why would I? I'm plenty fast being the "ideas guy". If the LLM had better ideas than me, how would I even know? I'm either very arrogant or very good because I cannot recall regretting one of my refactors, at least not one I didn't back out of immediately.
In my experience, LLMs perform significantly better on readable, maintainable code. It's what they were trained on, after all. However, what they produce is often highly readable but not very maintainable due to the verbosity and obvious comments. This seems to pollute codebases over time and you see AI coding efficiency slowly decline.
A greenfield project is definitely 'easy mode' for an LLM; especially if the problem area is well understood (and documented). Opus is great and definitely speeds up development even in larger code bases and is reasonably good at matching coding style/standard to that of the existing code base. In my opinion, the big issue is the relatively small context that quickly overwhelms the models when given a larger task on a large codebase. For example, I have a largish enterprise grade code base with nice enterprise grade OO patterns and class hierarchies. There was a simple tech debt item that required refactoring about 30-40 classes to adhere to a slightly different class hierarchy. The work is not difficult, just tedious, especially as unit tests need to be fixed up. I threw Opus at it with very precise instructions as to what I wanted it to do and how I wanted it to do it. It started off well but then disintegrated once it got overwhelmed at the sheer number of files it had to change. At some point it got stuck in some kind of an error loop where one change it made contradicted another change and it just couldn't work itself out. I tried stopping it and helping it out but at this point the context was so polluted that it just couldn't see a way out. I'd say that once an LLM can handle more 'context' than a senior dev with good knowledge of a large codebase, LLM will be viable in a whole new realm of development tasks on existing code bases. That 'too hard to refactor this/make this work with that' task will suddenly become viable.
One thing I've been tossing around in my head is:
- How quickly is the cost of a refactor to a new pattern with functional parity going down?
- How does that change the calculus around tech debt?
If engineering uses 3 different abstractions in inconsistent ways that leak implementation details across components and duplicate functionality in ways that are very hard to reason about, that is, in conventional terms, an existential problem that might kill the entire business, as all dev time will end up consumed by bug fixes and dealing with pointless complexity, velocity will fall to nothing, and the company will stop being able to iterate. But if Claude can reliably reorganize code, fix patterns, and write working migrations for state when prompted to do so, it seems like the entire way to reason about tech debt has changed. And it has changed more if you are willing to bet that models within a year will be much better at such tasks. And in my experience, Claude is imperfect at refactors and still requires review and a lot of steering, but it's one of the things it's better at, because it has clear requirements and testing workflows already built around the existing behavior. Refactoring is definitely a hell of a lot faster than it used to be, at least on the few I've dealt with recently. In my mind it might be kind of like thinking about financial debt in a world with high inflation, in that the debt seems like it might get cheaper over time rather than more expensive.
> But if Claude can reliably reorganize code, fix patterns, and write working migrations for state when prompted to do so, it seems like the entire way to reason about tech debt has changed.
Yup, I recently spent 4 days using Claude to clean up a tool that's been in production for over 7 years. (There's only about 3 months of engineering time spent on it in those years.) We've known what the tool needed for many years, but ugh, the actual work was fairly messy and it was never a priority. I reviewed all of Opus's cleanup work carefully and I'm quite content with the result. Maybe even "enthusiastic" would be accurate. So even if Claude can't clean up all the tech debt in a totally unsupervised fashion, it can still help address some kinds of tech debt extremely rapidly.
I had Opus write a whole app for me in 30 seconds the other night. I use a very extensive AGENTS.md to guide AI in how I like my code chiseled. I've been happily running the app without looking at a line of it, but I was discussing the app with someone today, so I popped the code open to see what it looked like. Perfect. 10/10 in every way. I would not have written it that good. It came up with at least one idea I would not have thought of. I'm very lucky that I rarely have to deal with other devs and I'm writing a lot of code from scratch using whatever is the latest version of the frameworks. I understand that gives me a lot of privileges others don't have.
And low-code/no-code (pre-LLMs). Our company spent probably the same amount of dev-time and money on rewriting low-code back to "code" (Python in our case) as it did writing low-code in the first place. LLMs are not quite comparable in damage, but some future maintenance for LLM-code will be needed for sure.
in my experience, what happens is the code base starts to collapse under its own weight. it becomes impossible to fix one thing without breaking another. the coding agent fails to recognize the global scope of the problem and tries local fixes over and over. progress gets slower, new features cost more. all the same problems faced by an inexperienced developer on a greenfield project! has your experience been otherwise?
Right, I am a daily user of agentic LLM tools and have this exact problem in one large project that has complex business logic externally dictated by real world requirements out of my control, and let's say, variable quality of legacy code. I remember when Gemini Pro 3 was the latest hotness and I started to get FOMO seeing demos on X posted to HN showing it one-shotting all sorts of impressive stuff. So I tried it out for a couple days in Gemini CLI/OpenCode and ran into the exact same pain points I was dealing with using CC/Codex. Flashy one-shot demos of greenfield prompts are a natural hype magnet so they get lots of attention, but in my experience they aren't particularly useful for evaluating value in complex, legacy projects with tightly bounded requirements that can't be easily reduced to a page or two of prose for a prompt.
View on HN · Topics
To be fair, you're not supposed to be doing the "one shot" thing with LLMs in a mature codebase. You have to supply it the right context with a well formed prompt, get a plan, then execute and do some cleanup. LLMs are only as good as the engineers using them, you need to master the tool first before you can be productive with it.
View on HN · Topics
I’m well aware; as I said, I am regularly using CC/Codex/OC in a variety of projects, and I certainly didn’t claim that they can’t be used productively in a large code base. My point is that different challenges become apparent that aren’t addressed by examples like this article, which tend to focus on narrow, greenfield applications that can be readily rebuilt in one shot. I already get plenty of value in small side projects that Claude can create in minutes. And while extremely cool, these examples aren’t the kind of “step change” improvement I’d like to see in the area where agentic tools are currently weakest in my daily usage.
View on HN · Topics
I would be much more impressed with implementing new, long-requested features in existing software (whose maintainers are open to maintaining LLM-generated code later).
View on HN · Topics
Fully agreed! That’s the exact kind of thing I was hoping to find when I read the article title, but unfortunately it was really just another “normal AI agent experience” I’ve seen (and built) many examples of before.
View on HN · Topics
Why not have the LLM rewrite the entire codebase?
View on HN · Topics
In ~25 years or so of dealing with large, existing codebases, I've seen time and time again that there's a ton of business value and domain knowledge locked up inside all of that "messy" code. Weird edge cases that weren't well covered in the design, defensive checks and data validations, bolted-on extensions and integrations, etc., etc. "Just rewrite it" is usually -- not always, but _usually_ -- a sure path to a long, painful migration that usually ends up not quite reproducing the old features/capabilities and adding new bugs and edge cases along the way.
View on HN · Topics
Classic Joel Spolsky: https://www.joelonsoftware.com/2000/04/06/things-you-should-... > the single worst strategic mistake that any software company can make: > rewrite the code from scratch.
View on HN · Topics
Steve Yegge talks about this exact post a lot - how it stayed correct advice for over 25 years - up until October 2025.
View on HN · Topics
Time will tell. I’d bet on Spolsky, because of Hyrum’s Law. https://www.hyrumslaw.com/ > With a sufficient number of users of an API, it does not matter what you promise in the contract: all observable behaviors of your system will be depended on by somebody. An LLM rewriting a codebase from scratch is only as good as the spec. If “all observable behaviors” are fair game, the LLM is not going to know which of those behaviors are important. Furthermore, Spolsky talks about how to do incremental rewrites of legacy code in his post. I’ve done many of these and I expect LLMs will make the next one much easier.
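To make the Hyrum's Law point concrete, here is a minimal Python sketch. Every name in it (`list_active_users`, `list_active_users_rewrite`, `first_active_user`) is hypothetical and invented for illustration. It assumes a spec that promises only "return the active users", while the legacy code happens to return them sorted and one caller quietly depends on that.

```python
from typing import Callable


def list_active_users(users: dict[str, bool]) -> list[str]:
    """Documented contract: return the names of active users (order unspecified)."""
    # The legacy implementation happens to sort, so callers see alphabetical order.
    return sorted(name for name, active in users.items() if active)


def list_active_users_rewrite(users: dict[str, bool]) -> list[str]:
    """A from-scratch rewrite that satisfies the same documented contract."""
    # Iterates in insertion order; still correct according to the written spec.
    return [name for name, active in users.items() if active]


def first_active_user(listing: Callable[[dict[str, bool]], list[str]]) -> str:
    """Hypothetical downstream caller that silently relies on alphabetical order."""
    users = {"zoe": True, "adam": True, "mia": False}
    return listing(users)[0]


if __name__ == "__main__":
    print(first_active_user(list_active_users))
    print(first_active_user(list_active_users_rewrite))
```

Both functions meet the written contract, yet the caller's result changes after the rewrite. Nothing in the spec would tell a rewriting agent (or a human) that the ordering mattered.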
View on HN · Topics
That’s a fair point — I agree that LLMs do a good job predicting the documentation that might accompany some code. I feel relieved when I can rely on the LLM to write docs that I only need to edit and review. But I’m using LLMs regularly and I feel pretty effectively — including Opus 4.5 — and these “they can rewrite your entire codebase” assertions just seem crazy incongruous with my lived experience guiding LLMs to write even individual features bug-free.
View on HN · Topics
If the LLM just wrote the whole thing last week, surely it can write it again.
View on HN · Topics
If an LLM wrote the whole project last week and it already requires a full rewrite, what makes you think that the quality of that rewrite will be significantly higher, and that it will address all of the issues? Sure, it's all probabilistic so there's probably a nonzero chance for it to stumble into something where all the moving parts are moving correctly, but to me it feels like with our current tech, these odds continue shrinking as you toss on more requirements and features, like any mature project. It's like really early LLMs where if they just couldn't parse what you wanted, past a certain point you could've regenerated the output a million times and nothing would change.
View on HN · Topics
* With a slightly different set of assumptions, which may or may not matter. UAT is cheap. And data migration is lossy, because nobody cares about data fidelity anyway.
View on HN · Topics
> What bothers me about posts like this is: mid-level engineers are not tasked with atomic, greenfield projects They get those occasionally, though. Depends on the company. In some software houses it's constant "greenfield projects", one after another. And even in companies with 1-2 pieces of main established software to maintain, there are all kinds of smaller utilities or pipelines needed. > But day to day, when I ask it "build me this feature" it uses strange abstractions, and often requires several attempts on my part to do it in the way I consider "right". In some cases that's legit. In other cases it's just "it did it well, but not how I'd have done it", which is often needless stickiness to some particular style (often a contention between 2 human programmers too). Basically, what FloorEgg says in this thread: "There are two types of right/wrong ways to build: the context specific right/wrong way to build something and an overly generalized engineer specific right/wrong way to build things." And you don't have to just tell it "build me this feature"; you can tell it (at a high level) how to do it, and give it generic context about such preferences too.
View on HN · Topics
Even if you are going greenfield, you need to build it the way it is likely to be used, based on having a deep familiarity with what that customer's problems are and how their current workflow is done. As much as we imagine everything is on the internet, a bunch of this stuff is not documented anywhere. An LLM could ask the customer requirement questions, but that familiarity is often needed to know the right questions to ask. It is hard to bootstrap. Even if it could build the perfect greenfield app, as it updates the app it needs to consider backwards compatibility and breaking changes. LLMs seem very far from being able to grow apps. I think this is because LLMs are trained on the final outcome of the engineering process, but not on the incremental sub-commit work of first getting a faked-out outline of the code running and then slowly building up that code until you have something that works. This isn't to say that LLMs or other AI approaches couldn't replace software engineering some day, but they clearly aren't good enough yet, and the training sets they currently have access to are unlikely to provide the needed examples.
View on HN · Topics
Yeah. Just like another engineer. When you tell another engineer to build you a feature, it's improbable they'll do it the way that you consider "right." This sounds a lot like the old arguments around using compilers vs hand-writing asm. But now you can tell the LLM how you want to implement the changes you want. This will become more and more relevant as we try to maintain the code it generates. But, for right now, another thing Claude's great at is answering questions about the codebase. It'll do the analysis and bring up reports for you. You can use that information to guide the instructions for changes, or just to help you be more productive.
View on HN · Topics
You can look at my comment history to see the evidence to how hostile I was to agentic coding. Opus 4.5 completely changed my opinion. This thing jumped into a giant JSF (yes, JSF) codebase and started fixing things with nearly zero guidance.
View on HN · Topics
After recently applying Codex to a gigantic old and hairy project that is as far from greenfield as it can be, I can assure you this assertion is false. It’s bonkers seeing 5.2 churn through the complexity and understand dependencies that would take me days or weeks to wrap my head around.
View on HN · Topics
In my personal experience, Claude is better at greenfield, Codex is better at fitting in. Claude is the perfect tool for a "vibe coder", Codex is for the serious engineer who wants to get great and real work done. Codex will regularly give me 1000+ line diffs where all my comments (I review every single line of what agents write) are basically nitpicks. "Make this shallow w/ early return, use | None instead of Optional", that sort of thing. I do prompt it in detail though. It feels like I'm the person coming in with the architecture most of the time, AI "draws the rest of the owl."
View on HN · Topics
My favorite benchmark for LLMs and agents is to have it port a medium-complexity library to another programming language. If it can do that well, it's pretty capable of doing real tasks. So far, I always have to spend a lot of time fixing errors. There are also often deep issues that aren't obvious until you start using it.
View on HN · Topics
Comments on here often criticise ports as easy for LLMs to do because there's a lot of training data and the tests are all there, which is not as complex as real-world tasks
View on HN · Topics
I find Opus 4.5 very, very strong at matching the prevailing conventions/idioms/abstractions in a large, established codebase. But I guess I'm quite sensitive to this kind of thing so I explicitly ask Opus 4.5 to read adjacent code which is perhaps why it does it so well. All it takes is a sentence or two, though.
View on HN · Topics
I don’t know what I’m doing wrong. Today I tried to get it to upgrade Nx, yarn and some resolutions in a typescript monorepo with about 20 apps at work (Opus 4.5 through Kiro) and it just…couldn’t do it. It hit some snags with some of the configuration changes required by the upgrade and resorted to trying to make unwanted changes to get it to build correctly. I would have thought that’s something it could hit out of the park. I finally gave up and just looked at the docs and some stack overflow and fixed it myself. I had to correct it a few times about correct config params too. It kept imagining config options that weren’t valid.
View on HN · Topics
On the contrary, Opus 4.5 is the best agent I’ve ever used for making cohesive changes across many files in a large, existing codebase. It maintains our patterns and looks like all the other code. Sometimes it hiccups for sure.
View on HN · Topics
But time I spend asking is time I could have been writing exactly what I wanted in the first place, if I already did the planning to understand what I wanted. Once I know what I want, it doesn't take that long, usually. Which is why it's so great for prototyping, because it can create something during the planning, when you haven't planned out quite what you want yet.
View on HN · Topics
> greenfield LLMs are pretty good at picking up existing codebases. Even with cleared context they can do „look at this codebase and this spec doc that created it. I want to add feature x“
View on HN · Topics
What size of code base are you talking about? And this is your personal experience?
View on HN · Topics
Overall codebase size vs context matters less when you set it up as a microservices-style architecture from the start. I just split it into boundaries that make sense to me. Get the LLM to make a quick cheat sheet about the API and then feed that into adjacent modules. It doesn’t need to know everything about all of it to make changes if you’ve got a grip on the big picture and the boundaries are somewhat sane
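The commenter has the LLM write the cheat sheet. As a different, purely mechanical way to get a similar per-module summary, here is a rough Python sketch that uses the standard `ast` module to list a module's public top-level functions. The function name `api_cheat_sheet` and the output format are made up for illustration, and it assumes the modules in question are Python source.

```python
import ast


def api_cheat_sheet(source: str, module_name: str) -> str:
    """Summarize the public top-level functions of a module as a short text list."""
    tree = ast.parse(source)
    lines = [f"# {module_name} public API"]
    for node in tree.body:
        # Only top-level functions whose names do not start with an underscore.
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        if node.name.startswith("_"):
            continue
        # Positional parameter names only; defaults and annotations are omitted.
        args = ", ".join(a.arg for a in node.args.args)
        doc = ast.get_docstring(node) or ""
        summary = doc.splitlines()[0] if doc else "(no docstring)"
        lines.append(f"- {node.name}({args}): {summary}")
    return "\n".join(lines)


if __name__ == "__main__":
    example = (
        "def charge(customer_id, cents):\n"
        '    """Charge a customer."""\n'
        "    return True\n"
        "\n"
        "def _helper():\n"
        "    pass\n"
    )
    print(api_cheat_sheet(example, "billing"))
```

The resulting text is small enough to paste into the prompt for a neighboring module, so the agent sees the boundary without loading the implementation behind it.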
View on HN · Topics
> Overall codebase size vs context matters less when you set it up as a microservices-style architecture from the start. It'll be fun if the primary benefit of microservices turns out to be that LLMs can understand the codebase.
View on HN · Topics
So "pretty good at picking up existing codebases" so long as the existing codebase is all microservices.
View on HN · Topics
I work with multiple monoliths that span anywhere from 100k to 500k lines of code, in a non-mainstream language (Elixir). Opus 4.5 crushes everything I throw at it: complex bugs, extending existing features, adding new features in a way that matches conventions, refactors, migrations... The only time it struggles is if my instructions are unclear or incomplete. For example, if I ask it to fix a bug but don't specify that such-and-such should continue to work the way it does due to an undocumented business requirement, Opus might mess that up. But I consider that normal because a human developer would also fail at it.
View on HN · Topics
With all due respect those are very small codebases compared to the kinds of things a lot of software engineers work on.
View on HN · Topics
It doesn't have to be micro services, just code that is decoupled properly, so it can search and build its context easily.
View on HN · Topics
It just one shots bug fixes in complex codebases. Copy-paste the bug report and watch it go.
View on HN · Topics
If you have microservices architecture in your project you are set for AI. You can swap out any lacking, legacy microservice in your system with "greenfield" vibecoded one.
View on HN · Topics
Man, I've been biting my tongue all day with regards to this thread and overall discussion. I've been building a somewhat-novel, complex, greenfield desktop app for 6 months now, conceived and architected by a human (me), visually designed by a human (me), implementation heavily leaning on mostly Claude Code but with Codex and Gemini thrown in the mix for the grunt work. I have decades of experience, could have built it bespoke in like 1-2 years probably, but I wanted a real project to kick the tires on "the future of our profession". TL;DR I started with 100% vibe code simply to test the limits of what was being promised. It was a functional toy that had a lot of problems. I started over and tried a CLI version. It needed a therapist. I started over and went back to visual UI. It worked but was too constrained. I started over again. After about 10 complete start-overs in blank folders, I had a better vision of what I wanted to make, and how to achieve it. Since then, I've been working day after day, screen after screen, building, refactoring, going feature by feature, bug after bug, exactly how I would if I was coding manually. Many times I've reached a point where it feels "feature complete", until I throw a bigger dataset at it, which brings it to its knees. Time to re-architect, re-think memory and storage and algorithms and libraries used. Code bloated, and I put it on a diet until it was trim and svelte. I've tried many different approaches to hard problems, some of which LLMs would suggest that truly surprised me in their efficacy, but only after I presented the issues with the previous implementation. There's a lot of conversation and back and forth with the machine, but we always end up getting there in the end. Opus 4.5 has been significantly better than previous Anthropic models. As I hit milestones, I manually audit code, rewrite things, reformat things, generally polish the turd. 
I tell this story only because I'm 95% there to a real, legitimate product, with 90% of the way to go still. It's been half a year. Vibe coding a simple app that you just want to use personally is cool; let the machine do it all, don't worry about under the hood, and I think a lot of people will be doing that kind of stuff more and more because it's so empowering and immediate. Using these tools is also neat and amazing because they're a force multiplier for a single person or small group who really understand what needs done and what decisions need made. These tools can build very complex, maintainable software if you can walk with them step by step and articulate the guidelines and guardrails, testing every feature, pushing back when it gets it wrong, growing with the codebase, getting in there manually whenever and wherever needed. These tools CANNOT one-shot truly new stuff, but they can be slowly cajoled and massaged into eventually getting you to where you want to go; like, hard things are hard, and things that take time don't get done for a while. I have no moral compunctions or philosophical musings about utilizing these tools, but IMO there's still significant effort and coordination needed to make something really great using them (and literally minimal effort and no coordination needed to make something passable) If you're solo, know what you want, and know what you're doing, I believe you might see 2x, 4x gains in time and efficiency using Claude Code and all of his magical agents, but if your project is more than a toy, I would bet that 2x or 4x is applied to a temporal period of years, not days or months!
View on HN · Topics
Recent Claude will just look at your code and copy what you've been doing, mostly, in an existing codebase - without being asked. In a new codebase, you can just ask it to "be concise, keep it simple" or something.