Summarizer

Testing and Verification

The reliance on test-driven development (TDD), linters, and compilers to constrain non-deterministic AI output, ensuring generated code actually runs and meets requirements.

← Back to Opus 4.5 is not the normal AI agent experience that I have had thus far

30 comments tagged with this topic

View on HN · Topics
/commands are like macros or mayyybe aliases. You just put in the commands you see yourself repeating often, like "commit the unstaged files in distinct commits, use xxx style for the commit messages..." - then you can iterate on it if you see any gaps or confusion, even give example commands to use in the different steps. Skills on the other hand are commands ON STEROIDS. They can be packaged with actual scripts and executables, the PEP723 Python style + uv is super useful. I have one skill for example that uses Python+Treesitter to check the unit test quality of a Go project. It does some AST magic to check the code for repetition, stupid things like sleeps and relative timestamps etc. A /command _can_ do it, but it's not as efficient, the scripts for the skill are specifically designed for LLM use and output the result in a hyper-compact form a human could never be arsed to read.
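For anyone curious what such a packaged skill script can look like, here is a minimal sketch of the idea (not the commenter's actual tool): a PEP 723 self-contained script that `uv run` can execute directly. The tree-sitter AST analysis is replaced with a simple regex scan, and the file name and smell patterns are illustrative.

```python
# /// script
# requires-python = ">=3.11"
# dependencies = []
# ///
"""Hypothetical skill script: scan a Go project's test files for smells
(sleeps, relative timestamps) and print a hyper-compact report."""
import re
import sys
from pathlib import Path

SMELLS = {
    "sleep": re.compile(r"\btime\.Sleep\("),
    "now": re.compile(r"\btime\.Now\("),  # relative timestamps in tests
}

def main(root: str) -> int:
    hits = []
    for path in Path(root).rglob("*_test.go"):
        for lineno, line in enumerate(path.read_text().splitlines(), 1):
            for name, pattern in SMELLS.items():
                if pattern.search(line):
                    hits.append(f"{path}:{lineno}:{name}")
    # One finding per line, nothing else: compact output meant for an
    # LLM to read, not a human.
    print("\n".join(hits) or "OK")
    return 1 if hits else 0

if __name__ == "__main__":
    sys.exit(main(sys.argv[1] if len(sys.argv) > 1 else "."))
```

Invoked as `uv run check_test_smells.py ./myproject`; a real skill would use tree-sitter for proper AST checks instead of regexes.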
View on HN · Topics
I've had Opus 4.5 hand rolling CUDA kernels and writing a custom event loop on io_uring lately and both were done really well. Need to set up the right feedback loops so it can test its work thoroughly but then it flies.
View on HN · Topics
Yeah I've handed it a naive scalar implementation and said "Make this use SIMD for Mac Silicon / NEON" and it just spits out a working implementation that's 3-6x faster and passes the tests, which are binary exact specifications.
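A "binary exact specification" here presumably means the tests pin the optimized path to the scalar reference byte for byte, not approximately. A toy sketch of the shape of such a test, with both implementations as illustrative stand-ins:

```python
def scalar_add(buf: bytes, k: int) -> bytes:
    # Naive scalar reference: add k to every byte, wrapping at 256.
    return bytes((b + k) & 0xFF for b in buf)

def neon_add(buf: bytes, k: int) -> bytes:
    # Stand-in for the optimized SIMD path; the real version would
    # call into native NEON code.
    return scalar_add(buf, k)

def test_binary_exact():
    buf = bytes(range(256)) * 64
    # Not "approximately equal": the optimized implementation must
    # reproduce the reference output byte for byte.
    assert neon_add(buf, 7) == scalar_add(buf, 7)
```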
View on HN · Topics
Honestly I think the more you can give Claude a type system and effective tests, the more effective it can be. Rust is quite high up on the test strictness front (though I think more could be done...), so it's a great candidate. I also like its performance on Haskell and Go; both get you pretty great code out of the box.
View on HN · Topics
I have a few Go projects now and I speak Go as well as you speak Kotlin. I predict that we'll see some languages really pull ahead of others in the next few years based on their advantages for AI-powered development. For instance, I always respected types, but I'm too lazy to go spend hours working on types when I can just do ruby-style duck typing and get a long ways before the inevitable problems rear their head. Now, I can use a strongly typed language and get the advantages for "free".
View on HN · Topics
> I predict that we'll see some languages really pull ahead of others in the next few years based on their advantages for AI-powered development.

Oh absolutely. I've been using Python for the past 15 or so years for everything. I've never written a single line of Rust in my life, and all my new projects are Rust now, even the quick-script-throwaway things, because it's so much better at instantly screaming at Claude when it goes off track. It may take it longer to finish what I asked it to do, but it requires so much less involvement from me. I will likely never start another new project in Python ever. EDIT: Forgot to add that paired with a good linter, this is even more impressive. I told Claude to come up with the most masochistic clippy configuration possible, where even a tiny mistake is instantly punished and exceptions have to be truly exceptional (I have another agent that verifies this each run). I just wish there was cargo-clippy for enforcing architectural patterns.
View on HN · Topics
And with types, it's easier for rounds of agents to pick up mistakes at compile time, statically. Linting and sanity-checking untyped languages only goes so far. I've not seen LLMs one-shot Perl-style regexes, and JavaScript can still have ugly runtime WTFs.
View on HN · Topics
I've found this too. I find I'm doing more Typescript projects than Python because of the superior typing, despite the fact I prefer Python.
View on HN · Topics
Part of the "one day" development time was exhaustively testing it. Since the tool's scope is so small, getting good test coverage was pretty easy. Of course, I'm not guaranteeing through formal verification methods that the code is bug free. I did find bugs, but they were all areas that were poorly specified by me in the requirements.
View on HN · Topics
> "asking it to fix it." This is what people are still doing wrong. Tools in a loop people, tools in a loop. The agent has to have the tools to detect whatever it just created is producing errors during linting/testing/running. When it can do that, I can loop again, fix the error and again - use the tools to see whether it worked. I _still_ encounter people who think "AI programming" is pasting stuff into ChatGPT on the browser and they complain it hallucinates functions and produces invalid code. Well, d'oh.
View on HN · Topics
Last weekend I was debugging a blocking issue on a microcontroller with embassy-rs, where the whole microcontroller would lock up as soon as I started trying to connect to an MQTT server. I was having Opus investigate it while I kept building and deploying the firmware for testing... then I just figured I'd explain how it could do the same and pull the logs itself. Off it went: for the next ~15 minutes it flashed the firmware multiple times until it figured out the issue and fixed it. There was something so interesting about seeing a microcontroller on the desk being flashed by Claude Code, with LEDs blinking indicating failure states. Something about it not being just code on your laptop felt so compelling to me. But I agree, absolutely: red/green test or have a way of validating (linting, testing, whatever it is) and explain the end-to-end loop; then the agent is able to work much faster without being blocked by you multiple times along the way.
View on HN · Topics
For a long time now, my issue hasn't been that the code they write works or doesn't work. My issues all stem from code that works but does the wrong thing.
View on HN · Topics
Even better, have it write code to describe the right thing, then run its code against that, taking yourself out of that loop.
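One way to read "write code to describe the right thing": have it produce an executable spec, e.g. a property-based test, and then make the implementation pass it. A sketch using hypothesis, with `dedupe` as a hypothetical function under test:

```python
from hypothesis import given, strategies as st

def dedupe(xs: list[int]) -> list[int]:
    # Hypothetical implementation under test: order-preserving dedupe.
    seen, out = set(), []
    for x in xs:
        if x not in seen:
            seen.add(x)
            out.append(x)
    return out

@given(st.lists(st.integers()))
def test_dedupe_spec(xs):
    out = dedupe(xs)
    assert len(out) == len(set(out))  # no duplicates survive
    assert set(out) == set(xs)        # nothing lost, nothing invented
    # first occurrences keep their relative order
    assert out == [x for i, x in enumerate(xs) if x not in xs[:i]]
```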
View on HN · Topics
The crazy part is, once you have it set up and have adapted your workflow, you start to notice all sorts of other "small" things: Claude can call ssh and do sysadmin tasks. It works amazingly well. I have three VMs which depend on each other (Proxmox with OpenWrt, AdGuard, Unbound), and Claude can prove to me that my DNS chain works perfectly, my firewalls are perfect, etc., as it can ssh into each. Setting up services, diagnosing issues, auditing configs... you name it. Just awesome.

Claude can call other sh scripts on the machine, so over time you can create a bunch of scripts that let Claude one-shot certain tasks that would normally eat tokens. It works great. One script per intention - don't have a script do more than one thing.

Claude can call the compiler, run the debug executable and read the debug logs... in real time. So Claude can read my Android app's debug stream via adb, or my C# debug console, because Claude calls the compiler, not me. Just ask it to do it and it will diagnose stuff really quickly.

It can also analyze your db tables (give it read-only SQL access), look at the application code and queries, and diagnose performance issues. The opportunities are endless here. People need to wake up to this.
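On the read-only SQL point, the cheapest version is to hand the agent a connection that physically cannot write. A sketch using SQLite's read-only URI mode (a server database would use a read-only role instead; the table name is made up):

```python
import sqlite3

def readonly_conn(path: str) -> sqlite3.Connection:
    # mode=ro makes any write fail at the database level, so the agent
    # can inspect schemas and query plans but never mutate data.
    return sqlite3.connect(f"file:{path}?mode=ro", uri=True)

conn = readonly_conn("app.db")
print(conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE user_id = ?", (42,)
).fetchall())
```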
View on HN · Topics
I have a /fix-ci-build slash command that instructs Claude how to use `gh` to get the latest build from that specific project's GitHub Actions and fetch the logs for the build. In addition, there are instructions on how and where to push the possible fixes and how to check the results. I've yet to encounter a build failure it couldn't fix automatically.
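The slash command itself is just an instruction file, but the `gh` flow it encodes looks roughly like this sketch (the function name and error handling are illustrative; the gh subcommands and flags are real):

```python
import json
import subprocess

def latest_failed_run_logs() -> str | None:
    # Ask gh for the most recent workflow run on the current repo.
    runs = json.loads(subprocess.run(
        ["gh", "run", "list", "--limit", "1",
         "--json", "databaseId,conclusion"],
        capture_output=True, text=True, check=True,
    ).stdout)
    if not runs or runs[0]["conclusion"] != "failure":
        return None
    # Fetch only the logs of the failed steps, which keeps the context
    # the agent has to read small.
    return subprocess.run(
        ["gh", "run", "view", str(runs[0]["databaseId"]), "--log-failed"],
        capture_output=True, text=True, check=True,
    ).stdout
```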
View on HN · Topics
My compiler runs on my computer and produces the same machine code given the same input. Neither of these are true with AI.
View on HN · Topics
Isn't this a meaningless example? Formatters already exist. Generating code that doesn't need to be formatted is exactly the same as generating code and then formatting it. I care about the norms in my codebase that can't be automatically enforced by machine. How is state managed? How are end-to-end tests written to minimize change detectors? When is it appropriate to log something?
View on HN · Topics
The second part is what I'd also like to have. But I think it should be doable. You can tell it how YOU want the state to be managed and then have it write a custom "linter" that makes the check deterministic. I haven't tried this myself, but claude did create some custom clippy scripts in rust when I wanted to enforce something that isn't automatically enforced by anything out there.
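A custom "linter" like that can be very small. A sketch of a deterministic architectural check, here enforcing a made-up rule that UI modules never import the database layer directly (paths and module names are hypothetical):

```python
import ast
import sys
from pathlib import Path

FORBIDDEN = "myapp.db"  # hypothetical rule: UI goes through the service layer

def check(path: Path) -> list[str]:
    errors = []
    for node in ast.walk(ast.parse(path.read_text())):
        names = []
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            names = [node.module]
        for name in names:
            if name == FORBIDDEN or name.startswith(FORBIDDEN + "."):
                errors.append(f"{path}:{node.lineno}: ui imports {name}")
    return errors

if __name__ == "__main__":
    failures = [e for f in Path("myapp/ui").rglob("*.py") for e in check(f)]
    print("\n".join(failures) or "OK")
    sys.exit(1 if failures else 0)
```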
View on HN · Topics
I believe part of why Claude Code is so great is that it has the chance to catch its own mistakes. It can run compilers, linters and browsers, and check its own output. If it makes a mistake, it takes one or two extra iterations until it gets it right.
View on HN · Topics
Didn't feel like reading all this so I shortened it for anyone else that might need it, sorry!

Software engineers are sleeping on Claude Code agents. By teaching it your conventions, you can automate your entire workflow:
- Custom Skills: generates code matching your UI library and API patterns.
- Quality Ops: automates ESLint, doc syncing, and E2E coverage audits.
- Agentic Reviews: performs deep PR checks against custom checklists.
- Smart Triage: pre-analyzes tickets to give devs a head start.

Check out the showcase repo to see these patterns in action.
View on HN · Topics
If it can consistently verify whether the error persists after a fix, you can run (OK, maybe you can't budget-wise, but theoretically) 10,000 parallel instances of fixer agents and then verify afterwards (this is in line with how the IMO/IOI models work, according to rumors).
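In other words, generation is cheap to parallelize and verification acts as the filter. A rough best-of-n sketch, with both agent calls as hypothetical placeholders:

```python
from concurrent.futures import ThreadPoolExecutor

def propose_fix(seed: int) -> str:
    raise NotImplementedError  # one independent fixer-agent run

def passes_tests(patch: str) -> bool:
    raise NotImplementedError  # apply in a sandbox, run the suite

def best_of_n(n: int = 10_000) -> str | None:
    with ThreadPoolExecutor(max_workers=64) as pool:
        for patch in pool.map(propose_fix, range(n)):
            if passes_tests(patch):
                return patch  # first verified fix wins
    return None
```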
View on HN · Topics
> How are you qualified to judge its performance on real code if you don't know how to write a hello world?

The ultimate test of all software is "run it and see if it's useful for you." You do not need to be a programmer at all to be qualified to test this.
View on HN · Topics
You're mixing up several concepts. Synthetic data works for coding because coding is a verifiable domain. You train via reinforcement learning to reward code-generation behavior that passes detailed specs and meets other desiderata. It's literally how things are done today and how progress gets made.
View on HN · Topics
But that doesn't really matter, and it shows how confused people really are about how a coding agent like Claude or the OSS models are actually created -- the system can learn on its own without simply mimicking existing codebases, even though scraped/licensed/commissioned code traces are part of the training cycle. Training looks like:

- Pretraining (all data, non-code, etc. -- include everything, including garbage)
- Specialized pre-training (high-quality curated codebases, long context, synthetic data, etc.)
- Supervised fine-tuning (SFT) -- things like curated prompt + patch pairs and curated Q&A (like Stack Overflow; people are often cynical that this is done unethically, but all of the major players are in fact very risk-averse and will simply license the data and ensure they have legal rights)
- More SFT for tool use -- actual curated agentic and human traces that are verified to be correct, or at least to produce the correct output
- Synthetic generation / improvement loops -- generate a bunch of data and filter the generations that pass unit tests and other spec requirements, followed by RL using verifiable rewards plus possibly preference data to shape the vibes
- Additional steps for e.g. safety, etc.

So synthetic data is not a problem and is actually what explains the success coding models are having, why people are so focused on them, and why "we're running out of data" is just a misunderstanding of how things work. It's why you don't see the same amount of focus on other areas (e.g. creative writing, art, etc.) that don't have verifiable rewards. The agent --> synthetic data --> filtering --> new agent --> better synthetic data --> filtering --> even better agent flywheel is what you're seeing today, so we don't have any reason to suspect there is some sort of limit to this, because there is in principle infinite data.
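The generate-then-filter step of that flywheel is simple to sketch; the two callables below stand in for the current model and the verifiable reward (unit tests, spec checks), and all names are placeholders:

```python
def synthetic_round(sample, passes_spec, prompts, n=8):
    # sample(prompt, n)         -> n candidate solutions from the model
    # passes_spec(prompt, cand) -> True iff the candidate passes its checks
    kept = []
    for prompt in prompts:
        kept += [(prompt, c) for c in sample(prompt, n) if passes_spec(prompt, c)]
    # The survivors become training data for the next, better model,
    # which then generates better candidates: the flywheel.
    return kept
```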
View on HN · Topics
They at least understood that it was something deterministic that they could reproduce. That puts them ahead of the LLM crowd.
View on HN · Topics
I might give that a go in the future, but in this case it would've been faster for me to just do the work than to coach it for each file. Also, as this was an architectural change, there are no tests to run until it's done. Everything would just fail. It's only done when the whole thing is done. I think that might be one of the reasons it got stuck: it was trying to solve issues that it did not prove existed yet. If it had just finished the job and run the tests, it would've probably gotten further or even completed it. It's a bit like stopping halfway through renaming a function and then trying to run the tests and finding out the build does not compile because it can't find 'old_function'. You have to actually finish and know you've finished before you can verify your changes worked. I still haven't actually addressed this tech debt item (it's not that important :)). But I might try again and either see if it succeeds this time (with the plan in an md) or just do the work myself and get Opus to fix the unit tests (the most tedious part).
View on HN · Topics
This is relatively easily fixed with increasing test coverage to near 100% and lifting critical components into model checker space; both approaches were prohibitively expensive before November. They’ll be accepted best practices by the summer.
View on HN · Topics
* With a slightly different set of assumptions, which may or may not matter. UAT is cheap. And data migration is lossy, because nobody cares about data fidelity anyway.
View on HN · Topics
"Write the specs and let the outsourced labor hit them" is not a new tale. Let's assume the LLM agents can write tests for, and hit, specs better and cheaper than the outsourced offshore teams could. So let's assume now you can have a working product that hits your spec without understanding the code. How many bugs and security vulnerabilities have slipped through "well tested" code because of edge cases of certain input/state combinations? Ok, throw an LLM at the codebase to scan for vulnerabilities; ok, throw another one at it to ensure no nasty side effects of the changes that one made; ok, add some functionality and a new set of tests and let it churn through a bunch of gross code changes needed to bolt that functionality into the pile of spaghetti... How long do you want your critical business logic relying on not-understood code with "100% coverage" (of lines of code and spec'd features) but super-low coverage of actual possible combinations of input+machine+system state? How big can that codebase get before "rewrite the entire world to pass all the existing specs and tests" starts getting very very very slow? We've learned MANY hard lessons about security, extensibility, and maintainability of multi-million-LOC-or-larger long-lived business systems and those don't go away just because you're no longer reading the code that's making you the money. They might even get more urgent. Is there perhaps a reason Google and Amazon didn't just hire 10x the number of people at 1/10th the salary to replace the vast majority of their engineering teams year ago?
View on HN · Topics
Exactly. The main issue IMO is that "software that seems to work" and "software that works" can be very hard to tell apart without validating the code, yet these are drastically different in terms of long-term outcomes. Especially when there's a lot of money, or even lives, riding on these outcomes. Just because LLMs can write software to run the Therac-25 doesn't mean it's acceptable for them to do so. Your hobby project, though, knock yourself out.