Summarizer

Opus 4.5 vs. Previous Models

Users describe the specific model as a "step change" or "inflection point" compared to Sonnet 3.5 or GPT-4, citing better reasoning and autonomous behavior.

From the thread: "Opus 4.5 is not the normal AI agent experience that I have had thus far"

83 comments tagged with this topic

View on HN · Topics
Are you still talking about Opus 4.5? I've been working in Rust, Kotlin, and C++ and it's been doing well. Incredible at C++, like the number of mistakes it doesn't make.
View on HN · Topics
I've had Opus 4.5 hand rolling CUDA kernels and writing a custom event loop on io_uring lately and both were done really well. Need to set up the right feedback loops so it can test its work thoroughly but then it flies.
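The "right feedback loops" this comment mentions can be as simple as a loop that runs the checks and hands failures back to the agent until everything is green. A minimal sketch in Python; `run_tests` and `ask_agent_to_fix` are hypothetical stand-ins for whatever harness you use, not a real API:

```python
from typing import Callable

def feedback_loop(
    run_tests: Callable[[], list[str]],        # returns failure messages; empty means green
    ask_agent_to_fix: Callable[[list[str]], None],
    max_rounds: int = 5,
) -> bool:
    """Run the checks, hand failures back to the agent, repeat until green."""
    for _ in range(max_rounds):
        failures = run_tests()
        if not failures:
            return True
        ask_agent_to_fix(failures)
    return False

# Demo with stubs: the "agent" fixes one failing test per round.
pending = ["test_vin", "test_odometer"]
ok = feedback_loop(lambda: list(pending), lambda fails: pending.pop())
```

The point is only the shape: the agent never self-certifies, it iterates against an external verifier.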
View on HN · Topics
Anecdotally, we use Opus 4.5 constantly on Zed's code base, which is almost a million lines of Rust code and has over 150K active users, and we use it for basically every task you can think of - new features, bug fixes, refactors, prototypes, you name it. The code base is a complex native GUI with no Web tech anywhere in it. I'm not talking about "write this function" but rather like implementing the whole feature by writing only English to the agent, over the course of numerous back-and-forth interactions and exhausting multiple 200K-token context windows. For me personally, at least 99% of all the Rust code I've committed at work since Opus 4.5 came out has been from an agent running that model. I'm reading lots of Rust code (that Opus generated) but I'm essentially no longer writing any of it. If dot-autocomplete (and LLM autocomplete) disappeared from IDE existence, I would not notice.
View on HN · Topics
I had surprising success vibe coding a swift iOS app a while back. Just for fun, since I have a bluetooth OBD2 dongle and an electric truck, I told Claude to make me an app that could connect to the truck using the dongle, read me the VIN, odometer, and state of charge. This was middle of 2025, so before Opus 4.5. It took Claude a few attempts and some feedback on what was failing, but it did eventually make a working app after a couple hours. Now, was the code quality any good? Beats me, I am not a swift developer. I did it partly as an experiment to see what Claude was currently capable of and partly because I wanted to test the feasibility of setting up a simple passive data logger for my truck. I'm tempted to take another swing with Opus 4.5 for the science.
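For context on what such an app does under the hood: the dongle speaks ELM327-style text, and standard OBD-II PIDs cover values like state of charge (mode 01, PID 5B, scaled A*100/255 per SAE J1979). A hedged Python sketch of just the response parsing, with no Bluetooth I/O; the reply strings are examples of the wire format, and actual PID support varies by vehicle:

```python
def parse_soc(response: str) -> float:
    """Parse state of charge from a mode 01 PID 5B reply like '41 5B FF'.

    Per SAE J1979 the value is A * 100 / 255 percent, where A is the
    single data byte following the '41 5B' echo.
    """
    parts = response.strip().split()
    if parts[:2] != ["41", "5B"]:
        raise ValueError(f"unexpected reply: {response!r}")
    return int(parts[2], 16) * 100 / 255

# "41 5B 80": data byte 0x80 = 128, i.e. roughly 50.2% charge
soc = parse_soc("41 5B 80")
```

The VIN (mode 09 PID 02) comes back as a multi-frame response and needs reassembly, which is presumably where Claude's "few attempts" went.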
View on HN · Topics
I'm a quite senior frontend dev using React, and even I see Sonnet 4.5 struggle with basic things. Today it wrote my Zod validation incorrectly, mixing up versions, then just decided it wasn't working and attempted to replace the entire thing with a different library.
View on HN · Topics
There’s little reason to use sonnet anymore. Haiku for summaries, opus for anything else. Sonnet isn’t a good model by today’s standards.
View on HN · Topics
Have you experimented with all of these things on the latest models (e.g. Opus 4.5) since Nov 2025? They are significantly better at coding than earlier models.
View on HN · Topics
Yes, December 2025 and January 2026.
View on HN · Topics
I really think a lot of people tried AI coding earlier, got frustrated at the errors and gave up. That's where the rejection of all these doomer predictions comes from. And I get it. Coding with Claude Code really was prompting something, getting errors, and asking it to fix them. Which was still useful, but I could see why a skilled coder adding a feature to a complex codebase would just give up. Opus 4.5 really is at a new tier, however. It just...works. The errors are far fewer and often very minor: "careless" errors, not fundamental issues (like forgetting to add "use client" to a Next.js client component).
View on HN · Topics
This was me. I was a huge AI coding detractor on here for a while (you can check my comment history). But, in order to stay informed and not just be that grouchy curmudgeon all the time, I kept up with the models and regularly tried them out. Opus 4.5 is so much better than anything I've tried before, I'm ready to change my mind about AI assistance. I even gave -True Vibe Coding- a whirl. Yesterday, from a blank directory and text file list of requirements, I had Opus 4.5 build an Android TV video player that could read a directory over NFS, show a grid view of movie poster thumbnails, and play the selected video file on the TV. The result wasn't exactly full-featured Kodi, but it works in the emulator and actual device, it has no memory leaks, crashes, ANRs, no performance problems, no network latency bugs or anything. It was pretty astounding. Oh, and I did this all without ever opening a single source file or even looking at the proposed code changes while Opus was doing its thing. I don't even know Kotlin and still don't know it.
View on HN · Topics
The VS Code Copilot extension harness is not great, but Opus 4.5 with Copilot CLI works quite well.
View on HN · Topics
Thanks for posting this. It's a nice reminder that despite all the noise from hype-mongers and skeptics in the past few years, most of us here are just trying to figure this all out with an open mind and are ready to change our opinions when the facts change. And a lot of people in the industry that I respect on HN or elsewhere have changed their minds about this stuff in the last year, having previously been quite justifiably skeptical. We're not in 2023 anymore. If you were someone saying at the start of 2025 "this is a flash in the pan and a bunch of hype, it's not going to fundamentally change how we write code", that was still a reasonable belief to hold back then. At the start of 2026 that position is basically untenable: it's just burying your head in the sand and wishing for AI to go away. If you're someone who still holds it you really really need to download Claude Code and set it to Opus and start trying it with an open mind: I don't know what else to tell you. So now the question has shifted from whether this is going to transform our profession (it is), to how exactly it's going to play out. I personally don't think we will be replacing human engineers anytime soon ("coders", maybe!), but I'm prepared to change my mind on that too if the facts change. We'll see. I was a fellow mind-changer, although it was back around the first half of last year when Claude Code was good enough to do things for me in a mature codebase under supervision. It clearly still had a long way to go but it was at that tipping point from "not really useful" to "useful". But Opus 4.5 is something different - I don't feel I have to keep pulling it back on track in quite the way I used to with Sonnet 3.7, 4, even Sonnet 4.5. For the record, I still think we're in a bubble. AI companies are overvalued. But that's a separate question from whether this is going to change the software development profession.
View on HN · Topics
I have been out of the loop for a couple of months (vacation). I tried Claude Opus 4.5 at the end of November 2025 with the corporate Github Copilot subscription in Agent mode and it was awful: basically ignoring code and hallucinating. My team is using it with Claude Code and say it works brilliantly, so I'll be giving it another go. How much of the value comes from Opus 4.5, how much comes from Claude Code, and how much comes from the combination?
View on HN · Topics
I strongly concur with your second statement. Anything other than agent mode in GH copilot feels useless to me. If I want to engage Opus through GH copilot for planning work, I still use agent mode and just indicate the desired output is whatever.md. I obviously only do this in environments lacking a better tool (Claude Code).
View on HN · Topics
Check out Antigravity + Google AI Pro $20 plan + Opus 4.5. Apparently the Opus limits are insanely generous (of course that could change on a dime).
View on HN · Topics
I'd used both CC and Copilot Agent Mode in VSCode, but not the combination of CC + Opus 4.5, and I agree, I was happy enough with Copilot. The gap didn't seem big, but in November (which admittedly was when Opus 4.5 was in preview on Copilot) Opus 4.5 with Copilot was awful.
View on HN · Topics
> Even better, start getting the feel for local models. Current gen home hardware is getting good enough and the local models smart enough so you can, with the correct tooling, use them for surprisingly many things.

Are there any local models that are at least somewhat comparable to the latest-and-greatest (e.g. Opus 4.5, Gemini 3), especially in terms of coding?
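One practical note on trying this: most local runners (llama.cpp's server, Ollama, LM Studio) expose an OpenAI-compatible `/v1/chat/completions` endpoint, so existing tooling mostly just needs to be pointed at localhost. A minimal sketch of the request shape only; the model tag and the Ollama port mentioned in the comment are assumptions, so substitute whatever your runner reports:

```python
import json

def local_chat_request(prompt: str, model: str = "qwen2.5-coder:32b") -> dict:
    """Build an OpenAI-compatible chat payload for a local server
    (e.g. http://localhost:11434/v1/chat/completions for Ollama)."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a coding assistant."},
            {"role": "user", "content": prompt},
        ],
        "temperature": 0.2,  # low temperature tends to suit code generation
    }

payload = local_chat_request("Write a Rust function that reverses a string.")
body = json.dumps(payload)  # this is what you'd POST to the local endpoint
```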
View on HN · Topics
> I really think a lot of people tried AI coding earlier, got frustrated at the errors and gave up. That's where the rejection of all these doomer predictions comes from.

It's not just the deficiencies of earlier versions, but the mismatch between the praise from AI enthusiasts and the reality. I mean, maybe it is really different now and I should definitely try uploading all of my employer's IP to Claude's cloud and see how well it works. But so many people were as hyped by GPT-4 as they are now, despite GPT-4 actually being underwhelming. Too much hype for disappointing results leads to skepticism later on, even when the product has improved.
View on HN · Topics
> Opus 4.5 really is at a new tier however. It just...works.

Literally tried it yesterday. I didn't see a single difference with whatever model Claude Code was using two months ago. Same crippled context window. Same "I'll read 10 irrelevant lines from a file", same random changes, etc.
View on HN · Topics
Again, you're basically explaining how Claude has a very short limited context and you have to implement multiple workarounds to "prevent cluttering". Aka: try to keep context as small as possible, restart context often, try and feed it only small relevant information. What I very succinctly called "crippled context" despite claims that Opus 4.5 is somehow "next tier". It's all the same techniques we've been using for over a year now.
View on HN · Topics
I get by because I also have long-term memory, and experience, and I can learn. LLMs have none of that, and every new session is rebuilding the world anew. And even my short-term memory is significantly larger than the at most 50% of the 200k-token context window that Claude has. It runs out of context when my short-term memory is probably not even 1% full, for the same task (and I'm capable of more context-switching in the meantime). And so even the "Opus 4.5 really is at a new tier" runs into the very same limitations all models have been running into since the beginning.
View on HN · Topics
> For LLMs long term memory is achieved by tooling.

Which you discounted in your previous comments. My specific complaint, which is an observable fact about "Opus 4.5 is next tier": it has the same crippled context that degrades the quality of the model as soon as it fills 50%.

EMM_386: no-no-no, it's not crippled. All you have to do is keep track across multiple files, clear out context often, feed very specific information so as not to overflow context.

Me: so... it's crippled, and you need multiple workarounds.

scotty79: After all, it's the same as your own short-term memory, and <some unspecified tooling (I guess those same files)> provides long-term memory for LLMs.

Me: Your comparison is invalid because I can go have lunch, come back to the problem at hand, and continue where I left off. "Next tier Opus 4.5" will have to be fed the entire world from scratch after a context clear/compact/in a new session. Unless, of course, you meant to say that the "next tier Opus model" only has 15-30 second short-term memory and needs to keep multiple notes around like the guy from Memento. Which... makes it crippled.
View on HN · Topics
I think the premise is that if it was the "next tier" then you wouldn't need to use these workarounds.
View on HN · Topics
> If you refuse to use what you call workarounds

Who said I refuse them? I evaluated the claim that Opus is somehow next tier/something different/amazeballs future at face value. It still has all the same issues and needs all the same workarounds as whatever I was using two months ago (I had a bit of a coding hiatus between the beginning of December and now).

> then you end up with a guy from Memento and regardless of how smart the model is

Those models are, and keep being, the guy from Memento. Your "long memory" is nothing but notes scribbled everywhere that you have to re-assemble every time.

> And that's why you can't tell the difference between smarter and dumber one while others can.

If it was "next tier smarter" it wouldn't need the exact same workarounds as the "dumber" models. You wouldn't compare the context to 15-30 second short-term memory and need unspecified tools [1] to have "long-term memory". You wouldn't have the model behave in a way indistinguishable from a "dumber" model after half of its context window has been filled. You wouldn't even think about context windows. And yet here we are.

[1] For each person these tools will be a different collection of magic incantations. From scattered .md files to slop like Beads to MCP servers providing access to various external storage solutions to custom shell scripts to ... BTW, I still find "superpowers" from https://github.com/obra/superpowers to be the single best improvement to Claude (and other providers), even if it's just another in a long series of magic chants I've evaluated.
View on HN · Topics
1. It's a workaround for context limitations
2. It's the same workarounds we've been doing forever
3. It's indistinguishable from the "clear context and re-feed the entire world of relevant info from scratch" we've had forever, just slightly more automated

That's why I don't understand all the "it's new tier" etc. It's all the same issues with all the same workarounds.
View on HN · Topics
That's because Opus has been out for almost 5 months now lol. It's the same model, so I think people have been vibe coding with a heavy dose of wine this holiday and are now convinced it's the future.
View on HN · Topics
Looks like you hallucinated the Opus release date. Are you sure you're not an LLM?
View on HN · Topics
Opus 4.1 was released in August or smth.
View on HN · Topics
Opus 4.5 was released 24th November.
View on HN · Topics
> a pretty big context window if you are feeding it the right context.

Yup. There's some magical "right context" that will fix all the problems. What is that right context? No idea, I guess I need to read yet another 20,000-word post describing magical incantations that you should or shouldn't do in the context. The "Opus 4.5 is something else/next tier/just works" claims in my mind mean that I wouldn't need to babysit its every decision, or that it would actually read relevant lines from relevant files etc. Nope. Exact same behaviors as whatever the previous model was. Oh, and that "200k tokens context window"? It's a lie. The quality quickly degrades as soon as Claude reaches somewhere around 50% of the context window. At 80+% it's nearly indistinguishable from a model from two years ago. (BTW, same for Codex/GPT with its "1 million token window")
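The 50% rule of thumb in this subthread is easy to automate: estimate tokens and compact or restart once the transcript crosses a budget. A hedged sketch; the roughly-4-characters-per-token heuristic and the 50% threshold are assumptions taken from the comments, not vendor guidance:

```python
def estimate_tokens(text: str) -> int:
    """Crude token estimate: roughly 4 characters per token for English."""
    return max(1, len(text) // 4)

def should_compact(transcript: list[str], window: int = 200_000,
                   threshold: float = 0.5) -> bool:
    """True once the transcript is past `threshold` of the context window."""
    used = sum(estimate_tokens(turn) for turn in transcript)
    return used >= window * threshold

# ~120k tokens of history in a 200k window is past the 50% mark
history = ["x" * 4_000] * 120    # ~120 turns of ~1k tokens each
needs_compact = should_compact(history)
```

Tools like Claude Code do this implicitly via auto-compaction; the sketch just makes the threshold visible.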
View on HN · Topics
I realize your experience has been frustrating. I hope you see that every generation of model and harness is converting more hold-outs. We're still a few years from hard diminishing returns assuming capital keeps flowing (and that's without any major new architectures which are likely) so you should be able to see how this is going to play out. It's in your interest to deal with your frustration and figure out how you can leverage the new tools to stay relevant (to the degree that you want to). Regarding the context window, Claude needs thinking turned up for long context accuracy, it's quite forgetful without thinking.
View on HN · Topics
I use Sonnet and Opus all the time and the differences are almost negligible
View on HN · Topics
Opus 4.5 is fucking up just like Sonnet really. I don't know how your use is that much different than mine.
View on HN · Topics
I don't think I can scientifically compare the agents. As it is, you can use Opus / Codex in Cursor. The speed of Cursor composer-1 is phenomenal -- you can use it interactively for many tasks. There are also tasks that are not easier to describe in English, but you can tab through them.
View on HN · Topics
> What do you mean by "have it learn your conventions"?

I'll give you an example: I use ruff to format my python code, which has an opinionated way of formatting certain things. After an initial formatting, Opus 4.5, without prompting, will write code in this same style so that the ruff formatter almost never has anything to do on new commits. Sonnet 4.5 is actually pretty good at this too.
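For reference, the "opinionated style" here comes from the project's ruff configuration: once that lives in pyproject.toml, the model only has to read it (or the already-formatted code) to pick the conventions up. A typical fragment; the values are illustrative, not a recommendation:

```toml
[tool.ruff]
line-length = 88
target-version = "py312"

[tool.ruff.lint]
select = ["E", "F", "I"]   # pycodestyle, pyflakes, import sorting

[tool.ruff.format]
quote-style = "double"
```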
View on HN · Topics
Here's an example: We have some tests in "GIVEN WHEN THEN" style, and others in other styles. Opus will try to match each style of testing by the project it is in by reading adjacent tests.
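The GIVEN WHEN THEN convention mentioned here is just a comment discipline inside ordinary tests, which is exactly the kind of surface pattern a model can pick up from adjacent files. A minimal pytest-style example; the `Cart` class is hypothetical, invented only to illustrate the style:

```python
class Cart:
    """Hypothetical shopping cart used only to illustrate the test style."""
    def __init__(self):
        self.items = []
    def add(self, name: str, price: float):
        self.items.append((name, price))
    def total(self) -> float:
        return sum(price for _, price in self.items)

def test_total_sums_item_prices():
    # GIVEN a cart with two items
    cart = Cart()
    cart.add("book", 12.5)
    cart.add("pen", 2.5)
    # WHEN the total is computed
    total = cart.total()
    # THEN it is the sum of the item prices
    assert total == 15.0

test_total_sums_item_prices()
```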
View on HN · Topics
Starting to use Opus 4.5, I'm reducing instructions in claude.md and just asking Claude to look in the codebase to understand the patterns already in use. Going from prompts/docs to having the code be the "truth" instead. Show, don't tell. I've found this pattern has made a huge leap with Opus 4.5.
View on HN · Topics
FYI Opus is available and pretty usable in claude-code on the $20/Mo plan if you are at all judicious. I exclusively use opus for architecture / speccing, and then mostly Sonnet and occasionally Haiku to write the code. If my usage has been light and the code isn't too straightforward, I'll have Opus write code as well.
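The routing this comment describes (Opus for planning, Sonnet for most code, Haiku for light work) can be written down as a simple policy. A hypothetical sketch; the task categories and the usage gate are assumptions inferred from the comment, not anything Claude Code exposes:

```python
def pick_model(task: str, usage_light: bool = True) -> str:
    """Route a task to a model tier, roughly as the comment describes."""
    if task in ("architecture", "spec", "planning"):
        return "opus"
    if task in ("summary", "boilerplate"):
        return "haiku"
    # Non-straightforward code goes to Opus only when usage has been light,
    # which is the "judicious on the $20/mo plan" part.
    if task == "complex-code" and usage_light:
        return "opus"
    return "sonnet"
```

In practice this is done by hand with `/model`, but the policy is the same.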
View on HN · Topics
> except for the fact that almost everyone else can do this, too. Or at least try to, resulting in a fast race to the bottom.

Ironically, that race to the bottom is no different than what we already have. Have you ever worked for a company before? A lot of software is developed, BADLY. I dare say that a lot of the software Opus 4.5 generates is often higher quality than what I have seen in my 25-year career. The number of companies that cheap out, hiring juniors fresh from school to work as coding monkeys, is insane. Then projects have bugs / security issues, with tons of copy/pasted code, or people not knowing a darn thing. Is that any different from your feared future? I dare say that LLMs like Opus are frankly better than most juniors. Ask a junior to do a code review for security issues. Opus literally creates extensive tests and points out issues that you expect from a mid or higher level dev. Of course, you need to know to ask! You are the manager.

> Do you really want to be a middle manager to a bunch of text boxes, churning out slop, while they drive up our power bills and slowly terraform the planet?

Frankly, yes... If you are a real developer, do you still think development is fun after 10 years, 20 years? Doing the exact same boring work. Reimplementing the 1001st login page, the 101st contact form... A ton of our work is in reality repeating the same crap over and over again. And if we try to bypass it, we end up tied to those systems / frameworks that often become a block around our necks. Our industry has a lot of burnout because most tasks may start small but then grow beyond our scope. Today it's Ruby on Rails programming, then it's Angular, no wait, React, no wait, Vue, no wait, the new hotness is whatever again.

> slowly terraform the planet?

Well, I am actually making something. Can you say the same for all the power / GPU draw of Bitcoin, Ethereum, whatever crap mining? One is productive, a tool with insane potential and usage; the other is a virtual currency where only one is ever popular, with limited usage. Yet it burns just as much for a way more limited return of usability. Those LLMs that you are so against make me a ton more productive. You want to try something out, but never really committed because it was weeks of programming? Well, now you as manager can get projects done fast. Learn from them way faster than your little fingers ever did.
View on HN · Topics
Also the new Haiku. Not as smart but lightning fast. I have it review the impact of code changes, or if I need a wide but shallow change done I have it scan the files and create a change plan. Saves a lot of time waiting for Claude or Codex to get their bearings.
View on HN · Topics
Not entirely disagreeing with your point, but I think they've mostly been forced to pivot recently for their own sakes; they will never say it though. As much as they may seem eager, the most public people tend to also be better at outside communication and knowing what they should say in public to enjoy more opportunities, remain employed, or, for the top engineers, to still seem relevant in the face of the communities they are a part of. It's less about money and more about respect there, I think. The "sudden switch" since Opus 4.5, when many were saying just a few months ago "I enjoy actual coding" but now are praising LLMs, isn't a one-off occurrence. I do think underneath it is somewhat motivated by fear; not for the job, however, but for relevance. I.e. it's about being relevant to discussions, tech talks, new opportunities, etc.
View on HN · Topics
Isn't Claude Teams powerful? Does it not have access to Opus? Pardon my ignorance. I use GitHub Copilot, which has access to LLMs like Gemini 3, Sonnet/Opus 4.5 and GPT 5.2.
View on HN · Topics
It's not "AI tool does everything", it's specifically Claude Code with Opus 4.5 is great at "it", for whatever "it" a given commenter is claiming.
View on HN · Topics
Opus 4.5 ate through my Copilot quota last month, and it's already halfway through it for this month. I've used it a lot, for really complex code. And my conclusion is: it's still not as smart as a good human programmer. It frequently got stuck, went down wrong paths, ignored what I told it to do and did something wrong, or even repeated a previous mistake I had to correct. Yet in other ways, it's unbelievably good. I can give it a directory full of code to analyze, and it can tell me it's an implementation of Kozo Sugiyama's dagre graph layout algorithm, and immediately identify the file with the error. That's unbelievably impressive. Unfortunately it can't fix the error. The error was one of the many errors it made during previous sessions. So my verdict is that it's great for code analysis, and it's fantastic for injecting some book knowledge on complex topics into your programming, but it can't tackle those complex problems by itself. Yesterday and today I was upgrading a bunch of unit tests because of a dependency upgrade, and while it was occasionally very helpful, it also regularly got stuck. I got a lot more done than usual in the same time, but I do wonder if it wasn't too much. Wasn't there an easier way to do this? I didn't look for it, because every step of the way, Opus's solution seemed obvious and easy, and I had no idea how deep a pit it was getting me into. I should have been more critical of the direction it was pointing to.
View on HN · Topics
Maybe not, then. I'm afraid I have no idea what those numbers mean, but it looks like Gemini and ChatGPT 4 can handle a much larger context than Opus, and Opus 4.5 is cheaper than older versions. Is that correct? Because I could be misinterpreting that table.
View on HN · Topics
You need to find where context breaks down. Claude was better at it even when Gemini had 5X more on paper, but both have improved with the latest releases.
View on HN · Topics
> Opus 4.5 ate through my Copilot quota last month

Sure, Copilot charges 3x tokens for using Opus 4.5, but how were you still able to use up half the allocated tokens not even one week into January? I thought using up 50% was mad for me (inline completions + opencode); that's even worse.
View on HN · Topics
Even if Opus 4.5 is the limit it’s still a massively useful tool. I don’t believe it’s the limit though for the simple fact that a lot could be done by creating more specialized models for each subdomain i.e. they’ve focused mostly on web based development but could do the same for any other paradigm.
View on HN · Topics
Anecdata but I’ve found Claude code with Opus 4.5 able to do many of my real tickets in real mid and large codebases at a large public startup. I’m at senior level (15+ years). It can browse and figure out the existing patterns better than some engineers on my team. It used a few rare features in the codebase that even I had forgotten about and was about to duplicate. To me it feels like a real step change from the previous models I’ve used which I found at best useless. It’s following style guides and existing patterns well, not just greenfield. Kind of impressive, kind of scary
View on HN · Topics
Same anecdote for me (except I'm +/- 40 years experience). I consider myself a pretty good dev for non-web dev (GPUs, assembly, optimisation, ...), and my conclusion is the same as yours: impressive and scary. If the idea of what you want to do is somehow on the web, in text or in code, then Claude most likely has it. And its ability to understand my own codebases is just crazy (at my age, memory is declining, and having Claude to help is just wow). Of course it fails sometimes, of course it needs direction, but the thing it produces is really good.
View on HN · Topics
I've also found it to keep such a constrained context window (on large codebases), that it writes a secondary block of code that already had a solution in a different area of the same file. Nothing I do seems to fix that in its initial code writing steps. Only after it finishes, when I've asked it to go back and rewrite the changes, this time making only 2 or 3 lines of code, does it magically (or finally) find the other implementation and reuse it. It's freakin incredible at tracing through code and figuring it out. I <3 Opus. However, it's still quite far from any kind of set-and-forget-it.
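One cheap guardrail against the duplicate-implementation problem described here is a post-hoc pass that flags structurally identical functions, so you can point the agent at the existing one. A rough sketch using Python's `ast` module; it is whitespace- and comment-insensitive but will not catch duplicates with renamed variables:

```python
import ast

def find_duplicate_functions(source: str) -> list[tuple[str, str]]:
    """Return name pairs of functions whose bodies dump to identical ASTs."""
    tree = ast.parse(source)
    seen: dict[str, str] = {}
    dupes = []
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            # Dump only the body, so the function name doesn't matter.
            key = ast.dump(ast.Module(body=node.body, type_ignores=[]))
            if key in seen:
                dupes.append((seen[key], node.name))
            else:
                seen[key] = node.name
    return dupes

code = """
def clamp(x):
    return max(0, min(x, 255))

def clamp_again(x):
    return max(0, min(x, 255))
"""
pairs = find_duplicate_functions(code)
```

Running a check like this after the agent finishes is the automated version of "go back and rewrite this in 2 or 3 lines".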
View on HN · Topics
> The hard thing about engineering is not "building a thing that works", its building it the right way, in an easily understood way, in a way that's easily extensible.

You're talking like in the year 2026 we're still writing code for future humans to understand and improve. I fear we are not doing that. Right now, Opus 4.5 is writing code that later Opus 5.0 will refactor and extend. And so on.
View on HN · Topics
Opus 4.5 is writing code that Opus 5.0 will refactor and extend. And Opus 5.5 will take that code and rewrite it in C from the ground up. And Opus 6.0 will take that code and make it assembly. And Opus 7.0 will design its own CPU. And Opus 8.0 will make a factory for its own CPUs. And Opus 9.0 will populate mars. And Opus 10.0 will be able to achieve AGI. And Opus 11.0 will find God. And Opus 12.0 will make us a time machine. And so on.
View on HN · Topics
Objectively, we are talking about systems that have gone from being cute toys to outmatching most juniors using only rigid and slow batch training cycles. As soon as models have persistent memory for their own try/fail/succeed attempts, and can directly modify what's currently called their training data in real time, they're going to develop very, very quickly. We may even be underestimating how quickly this will happen. We're also underestimating how much more powerful they become if you give them analysis and documentation tasks referencing high quality software design principles before giving them code to write. This is very much 1.0 tech. It's already scary smart compared to the median industry skill level. The 2.0 version is going to be something else entirely.
View on HN · Topics
Can't wait to see what Opus 13.0 does with the multiverse.
View on HN · Topics
Wake me up at Opus 12
View on HN · Topics
Just one more OPUS bro.
View on HN · Topics
Honestly the scary part is that we don’t really even need one more Opus. If all we had for the rest of our lives was Opus 4.5, the software engineering world would still radically change. But there’s no sign of them slowing down.
View on HN · Topics
I also love how AI enthusiasts just ignore the issue of exhausted training data... You can't just magically create more training data. Also, synthetic training data reduces the quality of models.
View on HN · Topics
You're mixing up several concepts. Synthetic data works for coding because coding is a verifiable domain. You train via reinforcement learning to reward code generation behavior that passes detailed specs and meets other desiderata. It's literally how things are done today and how progress gets made.
View on HN · Topics
But that doesn't really matter, and it shows how confused people really are about how a coding agent like Claude or OSS models are actually created: the system can learn on its own without simply mimicking existing codebases, even though scraped/licensed/commissioned code traces are part of the training cycle. Training looks like:

- Pretraining (all data, non-code, etc.; include everything, including garbage)
- Specialized pre-training (high quality curated codebases, long context, synthetic, etc.)
- Supervised Fine Tuning (SFT): things like curated prompt + patch pairs and curated Q/A (like Stack Overflow; people are often cynical that this is done unethically, but all of the major players are in fact very risk averse and will simply license and ensure they have legal rights)
- More SFT for tool use: actual curated agentic and human traces that are verified to be correct, or at least to produce the correct output
- Synthetic generation / improvement loops: generate a bunch of data and filter the generations that pass unit tests and other spec requirements, followed by RL using verifiable rewards plus possibly preference data to shape the vibes
- Additional steps for e.g. safety, etc.

So synthetic data is not a problem and is actually what explains the success coding models are having, why people are so focused on them, and why "we're running out of data" is just a misunderstanding of how things work. It's why you don't see the same amount of focus on other areas (e.g. creative writing, art, etc.) that don't have verifiable rewards. The agent -> synthetic data -> filtering -> new agent -> better synthetic data -> filtering -> even better agent flywheel is what you're seeing today, so we definitely don't have any reason to suspect there is some sort of limit to this, because there is in principle infinite data.
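The generate-filter-retrain flywheel described above can be sketched end to end: sample candidate solutions, keep only those that pass the verifier (here, a unit-test-style check), and the survivors become training data for the next round. A toy Python sketch with a fake "generator"; real pipelines use an actual model and sandboxed execution, not `exec` and a coin flip:

```python
import random

def verifier(candidate: str) -> bool:
    """Stand-in for running the candidate against unit tests."""
    # A real pipeline executes the code in a sandbox; here we just
    # check the candidate defines a working double() function.
    namespace: dict = {}
    try:
        exec(candidate, namespace)
        return namespace["double"](21) == 42
    except Exception:
        return False

def generate(n: int) -> list[str]:
    """Fake generator: most samples are wrong, some are right."""
    good = "def double(x):\n    return x * 2\n"
    bad = "def double(x):\n    return x + 2\n"
    return [good if random.random() < 0.3 else bad for _ in range(n)]

random.seed(0)
candidates = generate(100)
# Only verified solutions survive into the next round's training data.
training_set = [c for c in candidates if verifier(c)]
```

The verifiable-reward part is the whole trick: the filter needs no human labels, only an executable spec.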
View on HN · Topics
They don't ignore it, they just know it's not an actual problem. It saddens me to see AI detractors being stuck in 2022 and still thinking language models are just regurgitating bits of training data.
View on HN · Topics
You are thankfully wrong. I watch lots of talks on the topic from actual experts. New models are just old models with more tooling. Training data is exhausted and it's a real issue.
View on HN · Topics
Well, my experts disagree with your experts :). Sure, the supply of available fresh data is running out, but at the same time, there's way more data than needed. Most of it is low-quality noise anyway. New models aren't just old models with more tooling - the entire training pipeline has been evolving, as researchers and model vendors focus on making better use of data they have, and refining training datasets themselves. There are more stages to LLM training than just the pre-training stage :).
View on HN · Topics
Not saying it's not a problem, I actually don't know, but new CPUs are just old models with more improvements/tooling. Same with TVs. And cars. And clothes. Everything is. That's how improving things works. Running out of raw data doesn't mean running out of room for improvement. The data has been the same for the last 20 years; AI isn't new, and things keep improving anyway.
View on HN · Topics
Well, it's not expected that cars or CPUs will eventually reach AGI, and they also don't eat a trillion-dollar hole into us peasants' pockets. Sure, improvements can be made. But on a fundamental level, agents/LLMs cannot reason (even though they love to act like they can). They are parrots learning words, and these parrots won't ever invent new words once the list of words is exhausted.
View on HN · Topics
That's been my main argument for why LLMs might be at their zenith. But I recently started wondering whether all those codebases we expose to them are maybe good enough training data for the next generation. It's not high quality like accepted stackoverflow answers but it's working software for the most part.
View on HN · Topics
We don't know what Opus 5.0 will be able to refactor. If the argument is "humans and Opus 4.5 cannot maintain this, but if requirements change we can vibe-code a new one from scratch", that's a coherent thesis, but people need to be explicit about it. (Instead this feels like the motte that is retreated to, and the bailey is essentially "who cares, we'll figure out what to do with our fresh slop later".) Ironically, I've found Claude to be really good at refactors, but these are refactors I choose very explicitly. (Such as: I start the thing manually, then let it finish.) (For an example, see me force-pushing to https://github.com/NixOS/nix/pull/14863 implementing my own code review.) But I suspect this is not what people want. To actually fire devs and not rely on from-scratch vibe-coding, we need to figure out which refactors to attempt in order to implement a given feature well. That's a very creative, open-ended question that I haven't even tried to let the LLMs take a crack at, because why would I? I'm plenty fast being the "ideas guy". If the LLM had better ideas than me, how would I even know? I'm either very arrogant or very good, because I cannot recall regretting one of my refactors, at least not one I didn't back out of immediately.
View on HN · Topics
Follow up: Opus is also great for doing the planning work before you start. You can use plan mode or just do it in a web chat and have them create all of the necessary files based on your explanation. The advantage of using plan mode is that they can explore the codebase in order to get a better understanding of things. The default at the end of plan mode is to go straight into implementation but if you're planning a large refactor or other significant work then I'd suggest having them produce the documentation outlined above instead and then following the workflow using a new session each time. You could use plan mode at the start of each session but I don't find this necessary most of the time unless I'm deviating from the initial plan.
View on HN · Topics
That’s a fair point — I agree that LLMs do a good job predicting the documentation that might accompany some code. I feel relieved when I can rely on the LLM to write docs that I only need to edit and review. But I use LLMs regularly and, I feel, pretty effectively — including Opus 4.5 — and these “they can rewrite your entire codebase” assertions just seem wildly incongruous with my lived experience guiding LLMs to write even individual features bug-free.
View on HN · Topics
You can look at my comment history to see the evidence to how hostile I was to agentic coding. Opus 4.5 completely changed my opinion. This thing jumped into a giant JSF (yes, JSF) codebase and started fixing things with nearly zero guidance.
View on HN · Topics
I find Opus 4.5 very, very strong at matching the prevailing conventions/idioms/abstractions in a large, established codebase. But I guess I'm quite sensitive to this kind of thing so I explicitly ask Opus 4.5 to read adjacent code which is perhaps why it does it so well. All it takes is a sentence or two, though.
View on HN · Topics
I don’t know what I’m doing wrong. Today I tried to get it to upgrade Nx, yarn and some resolutions in a typescript monorepo with about 20 apps at work (Opus 4.5 through Kiro) and it just…couldn’t do it. It hit some snags with some of the configuration changes required by the upgrade and resorted to trying to make unwanted changes to get it to build correctly. I would have thought that’s something it could hit out of the park. I finally gave up and just looked at the docs and some stack overflow and fixed it myself. I had to correct it a few times about correct config params too. It kept imagining config options that weren’t valid.
View on HN · Topics
> ask Opus 4.5 to read adjacent code which is perhaps why it does it so well. All it takes is a sentence or two, though. People keep telling me that an LLM is not intelligence, it's simply spitting out statistically relevant tokens. But surely it takes intelligence to understand (and actually execute!) the request to "read adjacent code".
View on HN · Topics
On the contrary, Opus 4.5 is the best agent I’ve ever used for making cohesive changes across many files in a large, existing codebase. It maintains our patterns and looks like all the other code. Sometimes it hiccups for sure.
View on HN · Topics
I work with multiple monoliths that span anywhere from 100k to 500k lines of code, in a non-mainstream language (Elixir). Opus 4.5 crushes everything I throw at it: complex bugs, extending existing features, adding new features in a way that matches conventions, refactors, migrations... The only time it struggles is when my instructions are unclear or incomplete. For example, if I ask it to fix a bug but don't specify that such-and-such should continue to work the way it does due to an undocumented business requirement, Opus might mess that up. But I consider that normal, because a human developer would also fail at it.
View on HN · Topics
It might scale. So far, I'm not convinced, but let's look at what's fundamentally happening and why humans > agents > LLMs.

At its heart, programming is a constraint satisfaction problem. The more constraints (requirements, syntax, standards, etc.) you have, the harder it is to solve them all simultaneously. New projects with few contributors have fewer constraints, so the process of "any change" is simpler.

Now, undeniably: 1) agents have improved the ability to solve constraints by iterating over raw LLM output, e.g. generate, test, modify; 2) there is an upper bound (context size, model capability) on how many simultaneous constraints they can solve; 3) most people are still better at this than agents (including Claude Code using Opus 4.5).

So, if you're seeing good results from agents, you probably have a smaller set of constraints than other people. Similarly, if you're getting bad results, you can probably improve them by relaxing some of the constraints (consistent UI, number of contributors, requirements, standards, security requirements, splitting code into well-defined packages). This will make both agents and humans more productive.

The open question is: will models continue to improve enough to approach or exceed human-level ability at this? And are humans willing to relax the constraints enough for it to be plausible? I would say the people currently clamoring about the end of human developers are deceived by an "appearance of complexity" that does not match the "reality of constraints" in larger applications. Opus 4.5 cannot do the work of a human on codebases I've worked on. Hell, talented humans struggle to work on some of them. But that doesn't mean it doesn't work; just that, right now, the constraint set it can solve is not large enough to be useful in those situations. And increasingly we see low-quality software where people care only about speed of delivery, again lowering the bar in terms of requirements. So... you know. Watch this space.

I'm not counting on having a dev job in 10 years. If I do, it might be making a pile of barely working garbage. But I have one now, and anyone who thinks that this year people will be largely replaced by AI is probably poorly informed and has misunderstood the capabilities of these models. There's only so low you can go in terms of quality.
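The constraint-satisfaction framing above can be made concrete with a toy model (my illustration, not the commenter's): treat a program as an assignment to boolean design variables, and each requirement as a predicate that must hold. Brute-force counting shows how quickly the space of acceptable programs collapses as independent constraints are added.

```python
import itertools

def count_solutions(n_vars, constraints):
    """Brute-force count of assignments that satisfy every constraint at once."""
    return sum(
        all(c(bits) for c in constraints)
        for bits in itertools.product([0, 1], repeat=n_vars)
    )

# Ten boolean "design variables"; each toy constraint pins one of them,
# standing in for an independent requirement (style rule, API contract, ...).
constraints = [lambda a, i=i: a[i] == 1 for i in range(6)]

# Each added independent constraint halves the surviving solution space:
# 0 constraints -> 1024 acceptable designs, 3 -> 128, 6 -> 16.
counts = [count_solutions(10, constraints[:k]) for k in (0, 3, 6)]
```

Real requirements are of course not independent bits, but the geometric shrinkage is the intuition behind "relaxing constraints makes both agents and humans more productive".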
View on HN · Topics
Man, I've been biting my tongue all day with regard to this thread and the overall discussion. I've been building a somewhat-novel, complex, greenfield desktop app for 6 months now, conceived and architected by a human (me), visually designed by a human (me), with implementation leaning heavily on Claude Code, with Codex and Gemini thrown in the mix for the grunt work. I have decades of experience and could have built it bespoke in 1-2 years probably, but I wanted a real project to kick the tires on "the future of our profession".

TL;DR: I started with 100% vibe code simply to test the limits of what was being promised. It was a functional toy that had a lot of problems. I started over and tried a CLI version. It needed a therapist. I started over and went back to a visual UI. It worked but was too constrained. I started over again. After about 10 complete start-overs in blank folders, I had a better vision of what I wanted to make and how to achieve it.

Since then, I've been working day after day, screen after screen, building, refactoring, going feature by feature, bug after bug, exactly how I would if I were coding manually. Many times I've reached a point where it feels "feature complete", until I throw a bigger dataset at it, which brings it to its knees. Time to re-architect, re-think memory and storage and algorithms and libraries used. Code bloated, and I put it on a diet until it was trim and svelte. I've tried many different approaches to hard problems, some of which LLMs would suggest and which truly surprised me in their efficacy, but only after I presented the issues with the previous implementation. There's a lot of conversation and back and forth with the machine, but we always end up getting there in the end. Opus 4.5 has been significantly better than previous Anthropic models. As I hit milestones, I manually audit code, rewrite things, reformat things, generally polish the turd.
I tell this story only because I'm 95% of the way to a real, legitimate product, with 90% of the way to go still. It's been half a year.

Vibe coding a simple app that you just want to use personally is cool; let the machine do it all, don't worry about what's under the hood, and I think a lot of people will be doing that kind of stuff more and more because it's so empowering and immediate. These tools are also neat and amazing because they're a force multiplier for a single person or small group who really understand what needs done and what decisions need made.

These tools can build very complex, maintainable software if you can walk with them step by step and articulate the guidelines and guardrails, testing every feature, pushing back when they get it wrong, growing with the codebase, getting in there manually whenever and wherever needed. These tools CANNOT one-shot truly new stuff, but they can be slowly cajoled and massaged into eventually getting you where you want to go; hard things are hard, and things that take time don't get done for a while.

I have no moral compunctions or philosophical musings about utilizing these tools, but IMO there's still significant effort and coordination needed to make something really great with them (and literally minimal effort and no coordination needed to make something passable). If you're solo, know what you want, and know what you're doing, I believe you might see 2x or 4x gains in time and efficiency using Claude Code and all of its magical agents, but if your project is more than a toy, I would bet that 2x or 4x is applied to a temporal period of years, not days or months!
View on HN · Topics
> day to day, when I ask it "build me this feature" it uses strange abstractions, and often requires several attempts on my part to do it in the way I consider "right"

Then don't ask it to "build me this feature"; instead, lay out a software development process with a designated human in the loop where you want it and guard rails to keep it on track. Create a code review agent to look for and reject strange abstractions. Tell it what you don't like, and it's really good at finding it. I find Opus 4.5, properly prompted, to be significantly better at reviewing code than writing it, but you can just put it in a loop until the code it writes matches the review.
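A hedged sketch of the "reviewer in a loop" control flow this comment suggests. `write_code` and `review_code` are hypothetical stand-ins for two model calls with different system prompts (e.g. a writer session and a reviewer session); here they are stubbed with canned behavior so the loop itself is runnable.

```python
def write_code(task, feedback=None):
    # Stub: a real implementation would prompt the model with the task
    # plus any reviewer feedback from the previous round.
    if feedback is None:
        return "def dist(p, q): return abs(p - q)  # odd abstraction"
    return "def distance(a, b):\n    return abs(a - b)"

def review_code(code):
    # Stub reviewer: reject code that trips a project convention.
    # A real reviewer would be a model prompted with your style rules.
    if "odd abstraction" in code:
        return "rejected: uses a strange abstraction; rename and simplify"
    return "approved"

def review_loop(task, max_rounds=5):
    """Generate, review, and revise until the reviewer approves
    or the round budget runs out."""
    feedback = None
    for _ in range(max_rounds):
        code = write_code(task, feedback)
        verdict = review_code(code)
        if verdict == "approved":
            return code
        feedback = verdict  # feed the rejection back to the writer
    raise RuntimeError("reviewer never approved")

result = review_loop("implement a distance helper")
```

The `max_rounds` cap matters: without it, a writer and reviewer that disagree on some convention will ping-pong forever, burning tokens.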
View on HN · Topics
I just use Cursor because I can pick any mode. The difference is hard to say exactly, Opus seems good but 5.2 seems smarter on the tasks I tried. Or possibly I just "trust" it more. I tend to use high or extra high reasoning.
View on HN · Topics
"its building it the right way, in an easily understood way, in a way that's easily extensible" I am in a unique situation where I work with a variety of codebases over the week. I have had no problem at all utilizing Claude Code w/ Opus 4.5 and Gemini CLI w/ Gemini 3.0 Pro to make excellent code that is indisputably "the right way", in an extremely clear and understandable way, and that is maximally extensible. None of them are greenfield projects. I feel like this is a bit of je ne sais quoi where people appeal to some indemonstrable essence that these tools just can't accomplish, and only the "non-technical" people are foolish enough to not realize it. I'm a pretty technical person (about 30 years of software development, up to staff engineer and then VP). I think they have reached a pretty high level of competence. I still audit the code and monitor their creations, but I don't think they're the oft claimed "junior developer" replacement, but instead do the work I would have gotten from a very experienced, expert-level developer, but instead of being an expert at a niche, they're experts at almost every niche. Are they perfect? Far from it. It still requires a practitioner who knows what they're doing. But frequently on here I see people giving takes that sound like they last used some early variant of Copilot or something and think that remains state of the art. The rest of us are just accelerating our lives with these tools, knowing that pretending they suck online won't slow their ascent an iota.
View on HN · Topics
I also have >30 years and I've had the same experience. I noticed an immediate improvement with 4.5 and I've been getting great results in general. And yes I do make sure it's not generating crazy architecture. It might do that.. if you let it. So don't let it.
View on HN · Topics
Opus 4.5 has become really capable. Not in terms of knowledge; that was already phenomenal. But in its ability to act independently: to make decisions, collaborate with me to solve problems, ask follow-up questions, write plans and actually execute them. You have to experience it yourself, on your own real problems, over the course of days or weeks. Every coding problem I was able to define clearly enough within the limits of the context window, the chatbot could solve, and these weren't easy problems. It wasn't just about writing and testing code; it also involved reverse engineering and cracking encoding-related problems. The most impressive part was how actively it worked on problems in a tight feedback loop. In the traditional sense, I haven't really coded privately at all in recent weeks. Instead, I've been guiding and directing, having it write specifications, and then refining and improving them. Curious how this will perform in complex, large production environments.