Summarizer

Documentation and Specification

The shift from writing code to writing specs; users find that detailed markdown documentation or "plan mode" yields significantly better AI results than vague prompts.


50 comments tagged with this topic

View on HN · Topics
Most software engineers are seriously sleeping on how good LLM agents are right now, especially something like Claude Code. Once you’ve got Claude Code set up, you can point it at your codebase, have it learn your conventions, pull in best practices, and refine everything until it’s basically operating like a super-powered teammate. The real unlock is building a solid set of reusable “skills” plus a few agents for the stuff you do all the time.

For example, we have a custom UI library, and Claude Code has a skill that explains exactly how to use it. Same for how we write Storybooks, how we structure APIs, and basically how we want everything done in our repo. So when it generates code, it already matches our patterns and standards out of the box. We also had Claude Code create a bunch of ESLint automation, including custom ESLint rules and lint checks that catch and auto-handle a lot of stuff before it even hits review.

Then we take it further: we have a deep code review agent Claude Code runs after changes are made. And when a PR goes up, we have another Claude Code agent that does a full PR review, following a detailed markdown checklist we’ve written for it. On top of that, we’ve got like five other Claude Code GitHub workflow agents that run on a schedule. One of them reads all commits from the last month and makes sure docs are still aligned. Another checks for gaps in end-to-end coverage. Stuff like that. A ton of maintenance and quality work is just… automated. It runs ridiculously smoothly.

We even use Claude Code for ticket triage. It reads the ticket, digs into the codebase, and leaves a comment with what it thinks should be done. So when an engineer picks it up, they’re basically starting halfway through already. There is so much low-hanging fruit here that it honestly blows my mind people aren’t all over it. 2026 is going to be a wake-up call.

(used voice to text then had claude reword, I am lazy and not gonna hand write it all for yall sorry!)

Edit: made an example repo for ya https://github.com/ChrisWiles/claude-code-showcase
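To make the “skills” idea concrete, here is a minimal sketch of what a UI-library skill might look like, assuming Claude Code's SKILL.md layout (a markdown file with YAML frontmatter under `.claude/skills/`); the package name and the rules themselves are hypothetical, not taken from the repo above:

```markdown
---
name: ui-library
description: How to use our internal UI component library. Load when creating or editing UI components.
---

# Using the UI Library

- Import components from `@acme/ui` (hypothetical package); never copy component code.
- Every new component gets a Storybook story in the same directory.
- Use the `<Stack>` and `<Grid>` primitives for layout; no ad-hoc flexbox.
- Follow the prop-naming conventions: `onX` for handlers, `isX` for booleans.
```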
View on HN · Topics
This isn't meant as a criticism, or to doubt your experience, but I've talked to a few people who had experiences like this. I helped them get Claude Code set up, analyze the codebase and document the architecture into markdown (edit as needed after), create an agent for the architecture, and prompt it in an incremental way. Maybe 15-30 minutes of prep. Everyone I helped this way responded with things like "This is amazing", "Wow!", etc. For some things you can fire up Claude and have it generate great code from scratch. But for bigger codebases and more complex architecture, you need to break it down ahead of time so it can just read about the architecture rather than analyze it every time.
View on HN · Topics
Not that I have seen, which is probably a big part of the disconnect. Mostly it's tribal knowledge. I learned through experimentation, but I've seen tips here and there. Here's my workflow (roughly):

> Create a CLAUDE.md for a C++ application that uses libraries x/y/z

[Then I edit it, adding general information about the architecture]

> Analyze the library in the xxx directory, and produce a xxx_architecture.md describing the major components and design

> /agent

[Let Claude make the agent, but when it asks what you want it to do, explain that you want it to specialize in subsystem xxx, and refer to xxx_architecture.md]

Then repeat until you have the major components covered. Then:

> Using the files named with architecture.md, analyze the entire system and update CLAUDE.md to refer to them and use the specialized agents.

Now, when you need to do something, put it in planning mode and say something like:

> There's a bug in the xxx part of the application, where when I do yyy, it does zzz, but it should do aaa. Analyze the problem and come up with a plan to fix it, and automated tests you can perform if possible.

Then iterate on the plan with it if you need to, or just approve it. One of the most important things you can do when dealing with something complex is let it come up with a test case so it can fix or implement something and then iterate until it's done. I had an image processing problem and I gave it some sample data, then it iterated (looking at the output image) until it fixed it. It spent at least an hour, but I didn't have to touch it while it worked.
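The end state of that loop might look something like the sketch below; the subsystem names, agent names, and build commands are hypothetical placeholders:

```markdown
# CLAUDE.md

C++ application built with CMake; depends on libraries x/y/z.
Build: `cmake --build build` · Test: `ctest --test-dir build`

## Architecture docs and agents
- Rendering: see `render_architecture.md`; delegate rendering work to the `render-expert` agent.
- Networking: see `network_architecture.md`; delegate networking work to the `network-expert` agent.

## Conventions
- Read the relevant *_architecture.md before modifying a subsystem.
- Prefer targeted edits over rewrites; run the tests after every change.
```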
View on HN · Topics
This is some great advice. What I would add is to avoid the built-in plan mode and build your own. The built-in one creates md files outside the project, gives the files random names, and it's hard to reference them in the future. It's also hard to steer plan mode or have it remember some behavior that you want to enforce. It's much better to create a custom command with custom instructions that acts as the plan mode.

My system works like this: an /implement command acts as an orchestrator and plan mode, and it is instructed to launch a predefined set of agents based on the problem and have them utilize specific skills. Every time the /implement command is initiated, it has to create a markdown file inside my own project, and then each subagent is also instructed to update the file when it finishes working. This way, the orchestrator can spot that an agent misbehaved, and the reviewer agent can see what the developer agent tried to do and why it was wrong.
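For reference, Claude Code custom commands are just markdown files under `.claude/commands/` (and `$ARGUMENTS` is the standard placeholder for the command's arguments); the contents below and the agent names are a hypothetical sketch of such an orchestrator, not this commenter's actual setup:

```markdown
# .claude/commands/implement.md

Act as the orchestrator for this task: $ARGUMENTS

1. Write a plan to `plans/<date>-<short-slug>.md` in this repo before touching code.
2. Based on the problem, launch the appropriate agents (e.g. `developer`, `reviewer`)
   and tell them which skills to use.
3. Every agent must append a dated status section to the plan file when it finishes.
4. Before finishing, compare each agent's status against the plan and flag any divergence.
```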
View on HN · Topics
I don't know if you've tried ChatGPT-5.2, but I find Codex much better for Rust, mostly due to the underlying model. You have to do planning and provide context, but 80%+ of the time it's a one-shot for small-to-medium size features in an existing codebase that's fairly complex. I honestly have to say that it's a better programmer than I am; it's just not anywhere near as good a software developer for all of the higher- and lower-level concerns that are the other 50% of the job. If you have any open-source examples of your codebase, prompt, and/or output, I would happily learn from it / give advice. I think we're all still figuring it out. Also, this SIMD translation wasn't just a single function; it was multiple functions across a whole region of the codebase dealing with video and frame capture, so pretty substantial.
View on HN · Topics
I strongly concur with your second statement. Anything other than agent mode in GH copilot feels useless to me. If I want to engage Opus through GH copilot for planning work, I still use agent mode and just indicate the desired output is whatever.md. I obviously only do this in environments lacking a better tool (Claude Code).
View on HN · Topics
> My issues all stem from that it works, but does the wrong thing

It's an opportunity, not a problem, because it means there's a gap in your specifications and then your tests. I use Aider, not Claude, but I run it with Anthropic models. What I found is that writing up comprehensive documentation for a feature, spec-style, before starting eliminates a huge amount of what you're referring to. It serves a triple purpose: (a) you get the documentation, (b) you guide the AI, and (c) it's surprising how often this helps to refine the feature itself. Sometimes I invoke the AI to help me write the spec as well, asking it to prompt me for areas where clarification is needed, etc.
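A minimal sketch of such a spec-style feature write-up, with a wholly hypothetical feature, might be:

```markdown
# Feature: CSV export (hypothetical example)

## Why
Users need report data in spreadsheets without copy-pasting.

## Behavior
- An "Export CSV" button on the report page downloads `report-<id>.csv`.
- Columns match the on-screen table, in the same order.

## Edge cases
- Empty report: download a file with the header row only.
- Cell values containing commas or quotes: escape per RFC 4180.

## Out of scope
- XLSX export; scheduled exports.
```

Notice how writing the edge-case section is exactly where "it works, but does the wrong thing" gets caught before any code exists.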
View on HN · Topics
This is how Beads works, especially with Claude Code. What I do is tell Claude to always create a Bead when I tell it to add something, or about something that needs to be added. Then I start brainstorming, and even ask it to do market research on what the top apps are doing for x, y, or z. Then I ask it to update the bead (I call them tasks), and finally, when it's got enough detail, I tell it: do all of these in parallel.
View on HN · Topics
Beads is amazing. It’s such a simple concept but it elevates agentic coding to another level.
View on HN · Topics
IDEs vs vim makes a lot of sense. AI really does feel like using an IDE in a certain way.

Using AI absolutely makes me feel like I'm more productive. But when I look back at the end of the day at what I got done, it would be ludicrous to say it was multiple times my pre-AI output.

Despite all the people replying to me saying "you're holding it wrong", I know the fix for it doing the wrong thing: specify in more detail what I want. The problem with that is twofold:

1. How much to specify? As little as possible is the ideal, if we want to maximize how much it can help us. A balance here is key. If I need to detail every minute thing, I may as well write the code myself.

2. If I get this step wrong, I still have to review everything, rethink it, go back, and re-prompt, costing time.

When I'm working on production code, I have to understand it all to confidently commit. It costs time to go over everything, sometimes over multiple iterations. Sometimes the AI uses things I don't know about and I need to dig in to understand them.

AI is currently writing 90% of my code. Quality is fine. It's fun! It's magical when it nails something one-shot. I'm just not confident it's faster overall.
View on HN · Topics
> 1) There exists a threshold, only identifiable in retrospect, past which it would have been faster to locate or write the code yourself than to navigate the LLM's correction loop or otherwise ensure one-shot success.

I can run multiple agents at once, across multiple codebases (or the same codebase on multiple branches), doing the same or different things. You absolutely can't keep up with that. Maybe on the one singular task you were working on, sure, but the fact that I can work on multiple different things without the same cognitive load will blow you out of the water.

> 2) The intuition and motivations of LLMs derive from a latent space that the LLM cannot actually access. I cannot get a reliable answer on why the LLM chose the approaches it did; it can only retroactively confabulate. Unlike human developers who can recall off-hand, or at least review associated tickets and meeting notes to jog their memory. The LLM prompter always documenting sufficiently to bridge this LLM provenance gap hits rub #1.

Tell the LLM to document in comments why it did things. Human developers often leave, and then nobody with knowledge of their codebase or their "whys" is even around to give details. Devs are notoriously terrible about documentation.

> 3) Gradually building prompt dependency where one's ability to take over from the LLM declines and one can no longer answer questions or develop at the same velocity themselves.

You can't develop at the same velocity, so drop that assumption now. There are all kinds of lower abstractions that you build on top of that you probably can't explain currently.

> 4) My development costs increasingly being determined by the AI labs and hardware vendors they partner with. Particularly when the former will need to increase prices dramatically over the coming years to break even with even 2025 economics.

You aren't keeping up with the actual economics. This shit is technically profitable; the unprofitable part is the ongoing battle between LLM providers to have the best model. They know software in the past has often been winner-takes-all, so they're all trying to win.
View on HN · Topics
In a circuitous way, you can rather successfully have one agent write a specification and another one execute the code changes. Claude Code has a planning mode that lets you work with the model to create a robust specification that can then be executed, asking the sort of leading questions for which it already seems to know it could make an incorrect assumption. I say 'agent' but I'm really just talking about separate model contexts, nothing fancy.
View on HN · Topics
Cursor's planning functionality is very similar, and I have found that I can even use "cheap" models like their Composer-1 and get great results in the planning phase, then turn on Sonnet or Opus to actually execute the plan. 90% of the stuff I need to argue about happens during the planning phase, so I save a ton of tokens and rework just by making a really good spec. It turns out that Waterfall was always the correct method, it's just really slow ;)
View on HN · Topics
Did you know that software specifications used to be almost entirely flow charts? There is something to be said for that, and for waterfall.
View on HN · Topics
> I'm working mostly in a web framework that's used by me and almost nobody else (the weird little ASGI wrapper buried in Datasette) and I find the coding agents pick it up pretty fast

Potentially because there is no baggage from similar frameworks. I'm sure it has an easier time with this precisely because it was not spun off from other frameworks.

> If your framework is missing documentation tell Claude Code to write itself some documentation based on what it learns from reading the code!

If Claude cannot read the code well enough to begin with, and needs supplemental documentation, I certainly don't want it generating the docs from the code. That's just compounding hallucinations on top of each other.
View on HN · Topics
Correct it then, and next time craft a more explicit plan.
View on HN · Topics
The more explicit/detailed your plan, the more context it uses up, the less accurate and generally functional it is. Don't get me wrong, it's amazing, but on a complex problem with large enough context it will consistently shit the bed.
View on HN · Topics
It usually works well for me. With very big tasks I break the plan into multiple MD files with the relevant context included and work through in individual sessions, updating remaining plans appropriately at the end of each one (usually there will be decision changes or additions during iteration).
View on HN · Topics
The context window isn't "crippled". Create a markdown document of your task (or use CLAUDE.md), put it in "plan mode", which allows Claude to use tool calls to ask questions before it generates the plan. When it finishes one part of the plan, have it create another markdown document, "progress.md" or whatever, with the whole plan and what is completed at that point. Type /clear (no more context window), then tell Claude to read the two documents. Repeat until even a massive project is complete, with those two markdown documents and no context window issues.
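A minimal sketch of such a progress file, with hypothetical phase names and paths:

```markdown
# progress.md

## Plan
- [x] Phase 1: extract the parser into its own module
- [x] Phase 2: add unit tests for the edge cases
- [ ] Phase 3: wire the parser into the import pipeline
- [ ] Phase 4: delete the legacy code path

## Notes for future sessions
- Phase 1: the new module lives in `src/parser/`; entry point is `parse_document()`.
- Phase 2: edge-case fixtures are in `tests/fixtures/`.
```

After each /clear, the fresh session reads the task document plus this file and picks up at the first unchecked box.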
View on HN · Topics
> ... Proceeds to explain how it's crippled and all the workarounds you have to do to make it less crippled.

No, that's not what I did. You don't need an extra-long context full of irrelevant tokens. Claude doesn't need to see the code it implemented 40 steps ago in a working method from Phase 1 if it is on Phase 3 and not using that method. It doesn't need reasoning traces for things it already "thought" through. This other information is clutter, not help; it makes the signal-to-noise ratio worse. If Claude needs to know something it did in Phase 1 for Phase 4, it will put a note in the living markdown document and simply find it again when it needs it.
View on HN · Topics
It's like working with humans:

1) define the problem
2) split the problem into small, independently verifiable tasks
3) implement the tasks one by one, verifying with tools

With humans, 1) is the spec and 2) is the Jira (or whatever) tasks. With an LLM, usually 1) is just a markdown file, 2) is a markdown checklist or GitHub issues (which Claude can use with the `gh` cli), and every loop of 3) gets a fresh context, maybe with the spec from step 1 and the relevant task information from 2.

I haven't run into context issues in a LONG time, and when I have, it's usually been either intentional (a problem where compacting won't hurt) or an error on my part.
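As a minimal sketch of what steps 1 and 2 can look like on disk (file names and the task itself are hypothetical):

```markdown
<!-- spec.md (step 1: define the problem) -->
The importer loads whole files into memory and fails above ~2 GB.
Goal: stream the file in chunks with constant memory use.

<!-- tasks.md (step 2: small, independently verifiable tasks) -->
- [ ] Replace `read_all()` with a chunked reader (verify: unit test on a 4 GB sparse file)
- [ ] Report progress per chunk instead of per file (verify: manual run on sample data)
- [ ] Remove the old in-memory path (verify: full test suite passes)
```

Each loop of step 3 then gets only the spec plus one checklist item, which is what keeps the context small.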
View on HN · Topics
Yes and no. I've worked quite a bit with juniors, offshore consultants and just in companies where processes are a bit shit. The exact same method that worked for those happened to also work for LLMs, I didn't have to learn anything new or change much in my workflow. "Fix bug in FoobarComponent" is enough of a bug ticket for the 100x developer in your team with experience with that specific product, but bad for AI, juniors and offshored teams. Thus, giving enough context in each ticket to tell whoever is working on it where to look and a few ideas what might be the root cause and how to fix it is kinda second nature to me. Also my own brain is mostly neurospicy mush, so _I_ need to write the context to the tickets even if I'm the one on it a few weeks from now. Because now-me remembers things, two-weeks-from-now me most likely doesn't.
View on HN · Topics
> Once you’ve got Claude Code set up, you can point it at your codebase, have it learn your conventions, pull in best practices, and refine everything until it’s basically operating like a super-powered teammate. The real unlock is building a solid set of reusable “skills” plus a few agents for the stuff you do all the time.

I agree with this, but I haven't needed to use any advanced features to get good results. I think the simple approach gets you most of the benefits. Broadly, I just have markdown files in the repo written for a human dev audience that the agent can also use. Basically:

- README.md with a quick start section for devs, descriptions of all build targets and tests, etc. Normal stuff.
- AGENTS.md (the only file that's not written for people specifically) that just describes the overall directory structure and has a short set of instructions for the agent: (1) Always read the readme before you start. (2) Always read the relevant design docs before you start. (3) Always run the linter, a build, and tests whenever you make code changes.
- docs/*.md that contain design docs, architecture docs, and user stories, just text.

It's important to have these resources anyway, agent or no. As with human devs, the better the docs/requirements, the better the results.
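Assembled from that description, a minimal AGENTS.md sketch (the directory names are placeholders):

```markdown
# AGENTS.md

## Directory structure
- `src/`: application code
- `docs/`: design docs, architecture docs, user stories
- `tests/`: unit and integration tests

## Agent instructions
1. Always read README.md before you start.
2. Always read the relevant docs/*.md design docs before you start.
3. Always run the linter, a build, and the tests whenever you make code changes.
```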
View on HN · Topics
Why do all these AI-generated readmes have a directory structure section? It's so redundant; I could just run `tree`.
View on HN · Topics
It makes me so exhausted trying to read them... my brain can tell immediately when there's so much redundant information that it just starts shutting itself off.
View on HN · Topics
One benefit of conventional code is that it expresses logic in an unambiguous way. Much of "the minutiae" is deciding what happens in edge cases. It's even harder to express that in a human language than in computer languages. For some domains it probably doesn't matter.
View on HN · Topics
Thanks for the example! There's a lot (of boilerplate?) here that I don't understand. Does anyone have good references for getting up to speed on the purpose of all of these files in the demo?
View on HN · Topics
Because I've seen how difficult it is to get a client to explain to me what they need their software to do.
View on HN · Topics
Came across the official Anthropic repo for GitHub Actions, very relevant to what you mentioned. Your idea of scheduled doc updates using an LLM is brilliant; I'm stealing it. https://github.com/anthropics/claude-code-action
View on HN · Topics
Also the new Haiku. Not as smart, but lightning fast. I have it review the impact of code changes, or if I need a wide but shallow change done, I have it scan the files and create a change plan. Saves a lot of time waiting for Claude or Codex to get their bearings.
View on HN · Topics
Yes, just using AI for code analysis is way underappreciated, I think. Even the people most sceptical of using it for coding should try it out as a tool for Q&A-style code interrogation, as well as for generating documentation. I would say it zero-shots documentation generation better than most human efforts, to the point that it raises the question of whether it's worth having the documentation in the first place. Obviously it can make mistakes, but I would say they are below the threshold of human mistakes from what I've seen.
View on HN · Topics
But have you actually thoroughly checked the documentation it generated? My experience suggests it can often be subtly wrong.
View on HN · Topics
In Japan, buildings (apartments) aren't built to last forever. They are built with a specific age in mind. They acknowledge the fact that houses are depreciating assets whose value tends to zero.

The only reason we don't do that with code (or didn't used to) is that rewriting from scratch NEVER worked[0], and large-scale refactors take massive amounts of time and resources, so much so that there are whole books written about how to do them. But today, trivial-to-simple applications can be rewritten from spec or from scratch in an afternoon with an LLM. And even pretty complex parsers can be ported, provided that the tests are robust enough[1]. It's just a matter of time before someone rewrites a small-to-medium-size application from one language to another using the previous app as the "spec".

[0] https://www.joelonsoftware.com/2000/04/06/things-you-should-...

[1] https://simonwillison.net/2025/Dec/15/porting-justhtml/
View on HN · Topics
Input your roadmap into an LLM of your choosing and see if you can create that code.
View on HN · Topics
Objectively, we are talking about systems that have gone from being cute toys to outmatching most juniors using only rigid and slow batch training cycles. As soon as models have persistent memory for their own try/fail/succeed attempts, and can directly modify what's currently called their training data in real time, they're going to develop very, very quickly. We may even be underestimating how quickly this will happen. We're also underestimating how much more powerful they become if you give them analysis and documentation tasks referencing high quality software design principles before giving them code to write. This is very much 1.0 tech. It's already scary smart compared to the median industry skill level. The 2.0 version is going to be something else entirely.
View on HN · Topics
You're mixing up several concepts. Synthetic data works for coding because coding is a verifiable domain. You train via reinforcement learning to reward code-generation behavior that passes detailed specs and meets other desiderata. It's literally how things are done today and how progress gets made.
View on HN · Topics
In my experience, using LLMs to code encouraged me to write better documentation, because I get better results when I feed the documentation to the LLM. I've also noticed failure modes in LLM coding agents when there is less clarity and more complexity in abstractions or APIs. It's actually made me consider simplifying APIs so that the LLMs can handle them better. Though I agree that in specific cases, what's helpful for the model and what's helpful for humans won't always overlap. Once I actually added some comments to a markdown file as a note to the LLM that most human readers wouldn't see, with some more verbose examples.

I think one of the big problems in general with agents today is that if you run an agent long enough, it tends to "go off the rails", so you need to babysit it and intervene when it goes off track. I guess in modern parlance, maintaining a good codebase can be framed as part of a broader "context engineering" problem.
View on HN · Topics
I just did something similar and it went swimmingly by doing this:

- Keep the plan and status in an md file.
- Tell it to finish one file at a time, run tests and fix issues, and then ask whether to proceed with the next file.

You can then easily start a new chat with the same instructions, plan, and status if the context gets poisoned.
View on HN · Topics
You have to think of Opus as a developer whose job at your company lasts somewhere between 30 and 60 minutes before you fire them and hire a new one. Yes, it's absurd, but it's a better metaphor than someone with a chronic long-term memory deficit, since it fits into the project management framework neatly.

So this new developer who is starting today is ready to be assigned their first task. They're very eager to get started, and once they start they will work very quickly, but you have to onboard them. This sounds terrible, but they also happen to be extremely fast at reading code and documentation, they know all of the common programming languages and frameworks, and they have an excellent memory for the hour that they're employed.

What do you do to onboard a developer like this? You give them a well-written description of your project with a clear style guide and some important dos and don'ts, access to any documentation you may have, and a clear description of the task they are to accomplish in less than one hour. The tighter you can make those documents, the better. Don't mince words; get straight to the point and provide examples where possible. The task description should be well scoped with a clear definition of done; if you can provide automated tests that verify when it's complete, that's even better. If you don't have tests, you can also specify what should be tested and instruct them to write the new tests and run them.

For every new developer after the first, you need a record of what was already accomplished. Personally, I prefer to use one markdown document per working session whose filename is a date stamp with the session number appended. Instruct them to read the last X log files, where X is however many are relevant to the current task. Most of the time X=1, if you did a good job of breaking down the tasks into discrete chunks. You should also have some type of roadmap with milestones; if this file will be larger than 1000 lines, break it up so each milestone is its own document, with a table-of-contents document that gives a simple overview of the total scope. Instruct them to read the relevant milestone.

Other good practices: tell them to write a new log file after they have completed their task, recording a summary of what they did, anything they discovered along the way, and any significant decisions they made. Also tell them to commit their work afterwards; Opus will write a very descriptive commit message by default (but you can instruct them to use whatever format you prefer). You basically want them to get everything ready for hand-off to the next 60-minute developer. If they do anything that you don't want them to do again, make sure to record that in CLAUDE.md. Same for any other interventions or guidance that you have to provide: put it in that document and Opus will almost always stick to it, unless they end up overfilling their context window.

I also highly recommend turning off auto-compaction. When the context gets compacted, they basically just write a summary of the current context, which often removes a lot of the important details. When this happens mid-task, you will certainly lose parts of the context that are necessary for completing the task. Anthropic seems to be working hard at making this better, but I don't think it's there yet. You might want to experiment with having it on and off and compare the results for yourself.

If your sessions are ending up with >80% of the context window used while still doing active development, then you should re-scope your tasks to make them smaller. The last 20% is fine for menial things like writing the summary, running commands, committing, etc. People have built automated systems around this, like Beads, but I prefer the hands-on approach, since I read through the produced docs to make sure things are going OK and use them as a guide for any changes I need to make mid-project.

With this approach I'm 99% sure that Opus 4.5 could handle your refactor without any trouble, as long as your classes aren't so enormous that even working on a single one at a time would cause problems with the context window. And if they are, you might be able to handle it by cautioning Opus not to read the whole file and to just try making targeted edits to specific methods. They're usually quite good at finding and extracting just the sections they need, as long as they have some way to know what to look for ahead of time. Hope this helps, and happy Clauding!
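A minimal sketch of one of those per-session log files; the date-stamped filename follows the scheme above, and the task, paths, and decisions are hypothetical:

```markdown
# logs/2025-12-18-session-03.md

## Task
Add retry logic to the upload client (milestone 2, task 4).

## Done
- Exponential backoff in `UploadClient::send`; new tests in `tests/upload_test.cpp` pass.

## Decisions
- Capped retries at 5; we honor the server's `Retry-After` header on 429 responses.

## Next
- Task 5: surface retry metrics in the status bar.
```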
View on HN · Topics
Follow up: Opus is also great for doing the planning work before you start. You can use plan mode or just do it in a web chat and have them create all of the necessary files based on your explanation. The advantage of using plan mode is that they can explore the codebase in order to get a better understanding of things. The default at the end of plan mode is to go straight into implementation but if you're planning a large refactor or other significant work then I'd suggest having them produce the documentation outlined above instead and then following the workflow using a new session each time. You could use plan mode at the start of each session but I don't find this necessary most of the time unless I'm deviating from the initial plan.
View on HN · Topics
To be fair, you're not supposed to be doing the "one shot" thing with LLMs in a mature codebase. You have to supply it the right context with a well-formed prompt, get a plan, then execute and do some cleanup. LLMs are only as good as the engineers using them; you need to master the tool first before you can be productive with it.
View on HN · Topics
Time will tell. I’d bet on Spolsky, because of Hyrum’s Law. https://www.hyrumslaw.com/ > With a sufficient number of users of an API, it does not matter what you promise in the contract: all observable behaviors of your system will be depended on by somebody. An LLM rewriting a codebase from scratch is only as good as the spec. If “all observable behaviors” are fair game, the LLM is not going to know which of those behaviors are important. Furthermore, Spolsky talks about how to do incremental rewrites of legacy code in his post. I’ve done many of these and I expect LLMs will make the next one much easier.
View on HN · Topics
>An LLM rewriting a codebase from scratch is only as good as the spec. If “all observable behaviors” are fair game, the LLM is not going to know which of those behaviors are important. I've been using LLMs to write docs and specs and they are very very good at it.
View on HN · Topics
The business is identifying the correct specs and filtering customer needs/requests so that the product does not become irrelevant.
View on HN · Topics
But... you can ask! Ask Claude to use encapsulation, or to write the equivalent of interfaces in the language you're using, to map out dependencies and duplicate features, or to maintain a dictionary of component responsibilities. AI coding is a multiplier of writing speed, but it doesn't excuse you from planning and mapping out features. You can have reasonably engineered code if you get models to stick to well-designed modules, but you need to tell them.
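One way such a dictionary of component responsibilities could live in the repo, as a markdown table the model is told to maintain; the component names and rules are hypothetical:

```markdown
# docs/responsibilities.md

| Component   | Owns                                    | Must not                                   |
|-------------|-----------------------------------------|--------------------------------------------|
| `Store`     | All persistent state                    | Touch the network                           |
| `SyncAgent` | API calls, retries, backoff             | Mutate state directly; go through `Store`   |
| `Views`     | Rendering state, emitting user intents  | Contain business logic                      |
```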
View on HN · Topics
> greenfield

LLMs are pretty good at picking up existing codebases. Even with cleared context they can do "look at this codebase and this spec doc that created it. I want to add feature x".
View on HN · Topics
Overall codebase size vs. context matters less when you set it up as a microservices-style architecture from the start. I just split it into boundaries that make sense to me, get the LLM to make a quick cheat sheet about the API, and then feed that into adjacent modules. It doesn't need to know everything about all of it to make changes if you've got a grip on the big picture and the boundaries are somewhat sane.
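A minimal sketch of such an API cheat sheet for one module, fed to its neighbors; the module, function names, and events are hypothetical:

```markdown
# billing/API.md: cheat sheet for adjacent modules

- `create_invoice(customer_id, line_items) -> Invoice`: the only entry point for new invoices.
- `Invoice.status`: one of `draft | sent | paid`; never set directly, use `mark_sent()` / `mark_paid()`.
- Events published: `invoice.paid` (consumed by the `notifications` module).
- Never import from `billing/internal/`; it can change without notice.
```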
View on HN · Topics
Opus 4.5 has become really capable. Not in terms of knowledge; that was already phenomenal. But in its ability to act independently: to make decisions, collaborate with me to solve problems, ask follow-up questions, write plans, and actually execute them. You have to experience it yourself, on your own real problems, over the course of days or weeks.

Every coding problem I was able to define clearly enough within the limits of the context window, the chatbot could solve, and these weren't easy. It wasn't just about writing and testing code; it also involved reverse engineering and cracking encoding-related problems. The most impressive part was how actively it worked on problems in a tight feedback loop. In the traditional sense, I haven't really coded privately at all in recent weeks. Instead, I've been guiding and directing, having it write specifications, and then refining and improving them. Curious how this will perform in complex, large production environments.
View on HN · Topics
Not in my experience - you need to build the fact that you don’t want it to do that into your design and specification.
View on HN · Topics
Difficult, and it really depends on the complexity. I definitely work in a spec-driven way, with a step-by-step implementation phase. If it goes the wrong way, I prefer to rewrite the spec and throw away the code.