Summarizer

Context Window Limitations and Management

Discussions focus on token limits (200k), performance degradation as context fills, and strategies like compacting history, using sub-agents, or maintaining summary files to preserve long-term memory.

← Back to Opus 4.5 is not the normal AI agent experience that I have had thus far

57 comments tagged with this topic

View on HN · Topics
This isn't meant as a criticism, or to doubt your experience, but I've talked to a few people who had experiences like this. I helped them get Claude Code set up, analyze the codebase and document the architecture into markdown (edit as needed after), create an agent for the architecture, and prompt it in an incremental way. Maybe 15-30 minutes of prep. Everyone I helped responded with things like "This is amazing", "Wow!", etc. For some things you can fire up Claude and have it generate great code from scratch. But for bigger codebases and more complex architecture, you need to break it down ahead of time so it can just read about the architecture rather than analyze it every time.
View on HN · Topics
You don't need to do much; the /agents command is the most useful, and it walks you through it. The main thing, though, is to give the agent something to work with before you create it. That's why I go through the steps of letting Claude analyze different components and document the design/architecture. The major benefit of agents is that they keep the context clean for the main job. So an agent might have a huge context from working through some specific code, but the main process can do something to the effect of "Hey UI library agent, where do I need to put code to change the color of widget xyz", then the agent does all the thinking and can reply with "that's in file 123.js, line 200". The cleaner you keep the main context, the better it works.
View on HN · Topics
> if you did /init in a new code base, that Claude would set itself up to maximize its own understanding of the code.

This is definitely not the case, and the reason Anthropic doesn't make Claude do this is that its quality degrades massively as you use up its context. So the solution is to let users manage the context themselves in order to minimize the amount that is "wasted" on prep work. Context windows have been increasing quite a bit, so I suspect that by 2030 this will no longer be an issue for any but the largest codebases, but for now you need to be strategic.
View on HN · Topics
Trying to one-shot large codebases is an exercise in futility. You need to let Claude figure out and document the architecture first, then set up agents for each major part of the project. Doing this keeps the context clean for the main agent, since it doesn't have to go read the code each time. So one agent can fill its entire context understanding part of the code, and then the main agent asks it how to do something and gets a short response back. It takes more work than one-shotting, but not a lot, and it pays dividends.
View on HN · Topics
Is that a context issue? I wonder if LSP would help there. Though Claude Code should grep the codebase for all necessary context, and LSP should in theory only save time, I think there would be a real improvement in outcomes as well. The bigger a project gets, the more context you generally need to understand any particular part. And by default Claude Code doesn't inject that context; you need to use 3rd-party integrations for that.
View on HN · Topics
That's remarkably similar to something I've just started on - I want to create a self-compiling C compiler targeting (and running on) an 8-bit micro via a custom VM. This is basically a retro-computing hobby project. I've worked with Gemini Fast on the web to help design the VM ISA; the next steps will be to have some AI (maybe Gemini CLI - currently free) write an assembler, disassembler, and interpreter for the ISA, and then the recursive descent compiler (written in C) too.

I already had Gemini 3.0 Fast write me a precedence-climbing expression parser as a more efficient drop-in replacement for a recursive descent one, although I had it do that in C++ as a proof of concept since I don't know yet what C libraries I want to build and use (arena allocator, etc). This involved a lot of copy-paste between Gemini output and an online C++ dev environment (OnlineGDB), but that was not too bad, although Gemini CLI would have avoided it. Too bad that Gemini web only has "code interpreter" support for Python, not C and/or C++.

Using Gemini to help define the ISA was an interesting process. It had useful input in a "pair-design" process, working on various parts of the ISA, but then failed to bring all the ideas together into a single ISA document, repeatedly missing parts of what had been previously discussed until I gave up and did that manually. The default persona of Gemini seems not very well suited to this type of workflow where you want to direct what to do next, since it seems they've RL'd the heck out of it to want to suggest next steps and ask questions rather than do what is asked and wait for further instruction. I eventually had to keep asking it to "please answer then stop", and interestingly the quality of the "conversation" seemed to fall apart after that (perhaps because Gemini was now predicting/generating a more adversarial conversation than a collaborative one?).

I'm wondering/hoping that Gemini CLI might be better at working on documentation than Gemini web, since then the doc can be an actual file it is editing and it can use its edit tool for that, as opposed to hoping that Gemini web can assemble chunks of context (various parts of the ISA discussion) into a single document.
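Precedence climbing, the technique mentioned above, is compact enough to sketch in a few lines. This is a rough illustration in Python rather than the commenter's C++, handling only integers, parentheses, and the four basic left-associative binary operators:

```python
import re

# Operator precedence table: higher numbers bind tighter.
PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2}

def tokenize(src):
    # Numbers, operators, and parentheses; whitespace is skipped.
    return re.findall(r"\d+|[-+*/()]", src)

def parse_expr(tokens, min_prec=1):
    lhs = parse_atom(tokens)
    # Precedence climbing: keep consuming operators at or above min_prec.
    while tokens and tokens[0] in PRECEDENCE and PRECEDENCE[tokens[0]] >= min_prec:
        op = tokens.pop(0)
        # Left-associative: the right side must bind strictly tighter.
        rhs = parse_expr(tokens, PRECEDENCE[op] + 1)
        lhs = (op, lhs, rhs)
    return lhs

def parse_atom(tokens):
    tok = tokens.pop(0)
    if tok == "(":
        node = parse_expr(tokens)
        tokens.pop(0)  # consume the closing ")"
        return node
    return int(tok)

def parse(src):
    """Parse an arithmetic expression into a nested (op, lhs, rhs) tuple."""
    return parse_expr(tokenize(src))
```

The appeal over plain recursive descent is that one loop replaces a chain of one-function-per-precedence-level, so `parse("1+2*3")` yields `("+", 1, ("*", 2, 3))` without a separate term/factor hierarchy.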
View on HN · Topics
Do they manage context differently or have different system prompts? I would assume a lot of that would be the same between them. I think GH Copilot's biggest shortcoming is that they are too token-cheap, aggressively managing context to the detriment of the results. Watching Claude read a 500-line file in 100-line chunks just makes me sad.
View on HN · Topics
> 1) There exists a threshold, only identifiable in retrospect, past which it would have been faster to locate or write the code yourself than to navigate the LLM's correction loop or otherwise ensure one-shot success.

I can run multiple agents at once, across multiple codebases (or the same codebase on multiple different branches), doing the same or different things. You absolutely can't keep up with that. Maybe on the one singular task you were working on, sure, but the fact that I can work on multiple different things without the same cognitive load will blow you out of the water.

> 2) The intuition and motivations of LLMs derive from a latent space that the LLM cannot actually access. I cannot get a reliable answer on why the LLM chose the approaches it did; it can only retroactively confabulate. Unlike human developers who can recall off-hand, or at least review associated tickets and meeting notes to jog their memory. The LLM prompter always documenting sufficiently to bridge this LLM provenance gap hits rub #1.

Tell the LLM to document in comments why it did things. Human developers often leave, and then nobody with knowledge of the codebase or its "whys" is even around to give details. Devs are notoriously terrible about documentation.

> 3) Gradually building prompt dependency where one's ability to take over from the LLM declines and one can no longer answer questions or develop at the same velocity themselves.

You can't develop at the same velocity, so drop that assumption now. There are all kinds of lower abstractions that you build on top of that you probably can't explain currently.

> 4) My development costs increasingly being determined by the AI labs and hardware vendors they partner with. Particularly when the former will need to increase prices dramatically over the coming years to break even with even 2025 economics.

You aren't keeping up with the actual economics. This shit is technically profitable; the unprofitable part is the ongoing battle between LLM providers to have the best model. They know software in the past has often been winner-takes-all, so they're all trying to win.
View on HN · Topics
In a circuitous way, you can rather successfully have one agent write a specification and another execute the code changes. Claude Code has a planning mode that lets you work with the model to create a robust specification that can then be executed, asking the sort of leading questions for which it already seems to know it could make an incorrect assumption. I say 'agent' but I'm really just talking about separate model contexts, nothing fancy.
View on HN · Topics
> With the latest models if you're clear enough with your requirements you'll usually find it does the right thing on the first try

That's great that this is your experience, but it's not a lot of people's. There are projects where it's just not going to know what to do. I'm working in a web framework that is a Frankenstein-ing of Laravel and October CMS. It's so easy for the agent to get confused because, even when I tell it this is a different framework, it sees things that look like Laravel or October CMS and suggests solutions that only work in those frameworks. So there are constant made-up methods and it keeps getting stuck in loops. The documentation is terrible; you just have to read the code. Which, despite what people say, Cursor is terrible at, because embeddings are not a real way to read a codebase.
View on HN · Topics
I'm working mostly in a web framework that's used by me and almost nobody else (the weird little ASGI wrapper buried in Datasette) and I find the coding agents pick it up pretty fast. One trick I use that might work for you as well: "Clone github.com/simonw/datasette to /tmp, then look at /tmp/datasette/docs for documentation and search the code if you need to." Try that with your own custom framework and it might unblock things. If your framework is missing documentation, tell Claude Code to write itself some documentation based on what it learns from reading the code!
View on HN · Topics
The more explicit/detailed your plan, the more context it uses up, and the less accurate and generally functional it is. Don't get me wrong, it's amazing, but on a complex problem with a large enough context it will consistently shit the bed.
View on HN · Topics
It takes a lot of plan to use up the context, and most of the time the agent doesn't need the whole plan; it just needs what's relevant to the current task.
View on HN · Topics
It usually works well for me. With very big tasks I break the plan into multiple MD files with the relevant context included and work through them in individual sessions, updating the remaining plans appropriately at the end of each one (usually there will be decision changes or additions during iteration).
View on HN · Topics
> Opus 4.5 really is at a new tier however. It just...works. Literally tried it yesterday. I didn't see a single difference with whatever model Claude Code was using two months ago. Same crippled context window. Same "I'll read 10 irrelevant lines from a file", same random changes etc.
View on HN · Topics
The context window isn't "crippled". Create a markdown document of your task (or use CLAUDE.md), put it in "plan mode", which allows Claude to use tool calls to ask questions before it generates the plan. When it finishes one part of the plan, have it create another markdown document - "progress.md" or whatever - with the whole plan and what is completed at that point. Type /clear (emptying the context window), then tell Claude to read the two documents. Repeat until even a massive project is complete - with those two markdown documents and no context window issues.
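The re-seeding step of that loop is mechanical enough to script. A hedged sketch (the filenames `plan.md` and `progress.md` are just the names used in the comment; the prompt wording is my own):

```python
from pathlib import Path

def fresh_session_prompt(plan="plan.md", progress="progress.md"):
    """Build the kickoff message for a fresh (post-/clear) session.

    Only the plan and the record of completed work carry over, so the
    new context starts with signal instead of stale reasoning traces.
    """
    plan_text = Path(plan).read_text()
    progress_path = Path(progress)
    progress_text = (progress_path.read_text()
                     if progress_path.exists() else "Nothing completed yet.")
    return ("Read the plan and progress notes below, then continue with "
            "the next unfinished phase.\n\n"
            f"## Plan\n{plan_text}\n\n## Progress\n{progress_text}")
```

Pasting the result into a cleared session carries over exactly the two documents and nothing else.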
View on HN · Topics
> The context window isn't "crippled". ... Proceeds to explain how it's crippled and all the workarounds you have to do to make it less crippled.
View on HN · Topics
> ... Proceeds to explain how it's crippled and all the workarounds you have to do to make it less crippled.

No - that's not what I did. You don't need an extra-long context full of irrelevant tokens. Claude doesn't need to see the code it implemented 40 steps ago in a working method from Phase 1 if it is on Phase 3 and not using that method. It doesn't need reasoning traces for things it already "thought" through. This other information is clutter, not help. It makes the signal-to-noise ratio worse. If Claude needs to know something from Phase 1 for Phase 4, it will leave a note in the living markdown document so it can simply find it again when it needs it.
View on HN · Topics
Again, you're basically explaining how Claude has a very short, limited context and how you have to implement multiple workarounds to "prevent cluttering". Aka: try to keep context as small as possible, restart context often, try to feed it only small amounts of relevant information. Which is what I very succinctly called "crippled context", despite claims that Opus 4.5 is somehow "next tier". It's all the same techniques we've been using for over a year now.
View on HN · Topics
Context is a short term memory. Yours is even more limited and yet somehow you get by.
View on HN · Topics
I get by because I also have long-term memory, and experience, and I can learn. LLMs have none of that, and every new session is rebuilding the world anew. And even my short-term memory is significantly larger than the at-most 50% of the 200k-token context window that Claude has. It runs out of context before my short-term memory is even 1% full for the same task (and I'm capable of more context-switching in the meantime). And so even the "Opus 4.5 really is at a new tier" runs into the very same limitations all models have been running into since the beginning.
View on HN · Topics
> LLMs have none of that, and every new session is rebuilding the world anew.

For LLMs, long-term memory is achieved by tooling - which you discounted in your previous comments. You also overestimate the capacity of your short-term memory by a few orders of magnitude: https://my.clevelandclinic.org/health/articles/short-term-me...
View on HN · Topics
> For LLMs long term memory is achieved by tooling. Which you discounted in your previous comments.

My specific complaint, which is an observable fact about "Opus 4.5 is next tier": it has the same crippled context that degrades the quality of the model as soon as it fills to 50%.

EMM_386: no-no-no, it's not crippled. All you have to do is keep track across multiple files, clear out context often, and feed it very specific information so as not to overflow the context.

Me: so... it's crippled, and you need multiple workarounds.

scotty79: After all, it's the same as your own short-term memory, and <some unspecified tooling (I guess those same files)> provides long-term memory for LLMs.

Me: Your comparison is invalid because I can go have lunch, come back to the problem at hand, and continue where I left off. "Next tier Opus 4.5" will have to be fed the entire world from scratch after a context clear/compact or in a new session. Unless, of course, you meant to say that the "next tier Opus model" only has 15-30 seconds of short-term memory and needs to keep multiple notes around like the guy from Memento. Which... makes it crippled.
View on HN · Topics
If you refuse to use what you call workarounds and I call long-term memory, then you end up with the guy from Memento, and regardless of how smart the model is it can end up making the same mistakes. And that's why you can't tell the difference between a smarter and a dumber one while others can.
View on HN · Topics
I think the premise is that if it was the "next tier" then you wouldn't need to use these workarounds.
View on HN · Topics
> If you refuse to use what you call workarounds

Who said I refuse them? I evaluated the claim that Opus is somehow next tier/something different/amazeballs future at face value. It still has all the same issues and needs all the same workarounds as whatever I was using two months ago (I had a bit of a coding hiatus between the beginning of December and now).

> then you end up with a guy from Memento and regardless of how smart the model is

Those models are, and keep being, the guy from Memento. Your "long-term memory" is nothing but notes scribbled everywhere that you have to re-assemble every time.

> And that's why you can't tell the difference between smarter and dumber one while others can.

If it was "next tier smarter" it wouldn't need the exact same workarounds as the "dumber" models. You wouldn't compare the context to 15-30 seconds of short-term memory and need unspecified tools [1] to have "long-term memory". You wouldn't have the model behave in a way indistinguishable from a "dumber" model after half of its context window has been filled. You wouldn't even think about context windows. And yet here we are.

[1] For each person these tools will be a different collection of magic incantations. From scattered .md files to slop like Beads to MCP servers providing access to various external storage solutions to custom shell scripts to ...

BTW, I still find "superpowers" from https://github.com/obra/superpowers to be the single best improvement to Claude (and other providers), even if it's just another in a long series of magic chants I've evaluated.
View on HN · Topics
I'm not familiar with any form of intelligence that does not suffer from a bloated context. If you want to try to improve your workflow, a good place to start is using sub-agents so individual task implementations do not fill up your top-level agent's context. I used to regularly have to compact and clear, but since using sub-agents for most direct tasks, I hardly ever do anymore.
View on HN · Topics
1. It's a workaround for context limitations.
2. It's the same workarounds we've been doing forever.
3. It's indistinguishable from the "clear context and re-feed the entire world of relevant info from scratch" we've had forever, just slightly more automated.

That's why I don't understand all the "it's new tier" etc. It's all the same issues with all the same workarounds.
View on HN · Topics
200k+ tokens is a pretty big context window if you are feeding it the right context. Editors like Cursor are really good at indexing and curating context for you; perhaps it'd be worth trying something that does that better than Claude CLI does?
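For a rough sense of what "feeding it the right context" costs against a 200k window, the usual back-of-envelope estimate is about 4 characters per token; real counts depend entirely on the tokenizer, so treat this as a budgeting heuristic only:

```python
def rough_tokens(text):
    # Crude heuristic: ~4 characters per token for English prose and code.
    # Real counts vary by tokenizer; use this only for rough budgeting.
    return len(text) // 4

def window_share(text, window=200_000):
    """Fraction of a context window this text would consume."""
    return rough_tokens(text) / window
```

By this estimate a 100 KB source file eats about 25k tokens - an eighth of a 200k window - before any reasoning traces or tool output are added on top.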
View on HN · Topics
> a pretty big context window if you are feeding it the right context.

Yup. There's some magical "right context" that will fix all the problems. What is that right context? No idea, I guess I need to read yet another 20,000-word post describing magical incantations that you should or shouldn't do in the context.

The "Opus 4.5 is something else/next tier/just works" claims in my mind mean that I wouldn't need to babysit its every decision, or that it would actually read relevant lines from relevant files etc. Nope. Exact same behaviors as whatever the previous model was.

Oh, and that "200k tokens context window"? It's a lie. The quality quickly degrades as soon as Claude reaches somewhere around 50% of the context window. At 80+% it's nearly indistinguishable from a model from two years ago. (BTW, same for Codex/GPT with its "1 million token window")
View on HN · Topics
It's like working with humans:

1) define the problem
2) split the problem into small, independently verifiable tasks
3) implement tasks one by one, verifying with tools

With humans, 1) is the spec and 2) is the Jira (or whatever) tasks. With an LLM, usually 1) is just a markdown file and 2) is a markdown checklist or GitHub issues (which Claude can use with the `gh` cli), and every loop of 3) gets a fresh context, maybe with the spec from step 1 and the relevant task information from 2.

I haven't run into context issues in a LONG time, and when I have, it's usually been either intentional (a problem where compacting won't hurt) or an error on my part.
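A minimal sketch of how a step-2 checklist can feed each loop of step 3, assuming the common `- [ ]` / `- [x]` markdown task-list syntax:

```python
import re

def next_task(checklist_md):
    """Return the first unchecked item in a markdown task list,
    or None when everything is done."""
    for line in checklist_md.splitlines():
        m = re.match(r"\s*[-*]\s*\[( |x|X)\]\s*(.+)", line)
        if m and m.group(1) == " ":
            return m.group(2).strip()
    return None
```

Each fresh session then gets the spec plus only `next_task(plan)`, rather than the whole history.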
View on HN · Topics
> every loop of 3 gets a fresh context, maybe the spec from step 1 and the relevant task information from 2 > I haven't ran into context issues in a LONG time Because you've become the reverse centaur :) "a person who is serving as a squishy meat appendage for an uncaring machine." [1] You are very aware of the exact issues I'm talking about, and have trained yourself to do all the mechanical dance moves to avoid them. I do the same dances, that's why I'm pointing out that they are still necessary despite the claims of how model X/Y/Z are "next tier". [1] https://doctorow.medium.com/https-pluralistic-net-2025-12-05...
View on HN · Topics
The only thing I disagreed with in your post is your objectively incorrect statement regarding Claude's context behavior. Other than that I'm just trying to encourage you to make preparations for something that I don't think you're taking seriously enough yet. No need to get all worked up, it'll only reflect on you.
View on HN · Topics
Comments? Also, reading into an agent so the main agent doesn't have to tool-call/bash out.
View on HN · Topics
Here's an example: we have some tests in "GIVEN WHEN THEN" style, and others in other styles. Opus will match the testing style of whichever project it is in by reading adjacent tests.
View on HN · Topics
When I ask Claude to do something, it independently, without me even asking or instructing it to, searches the codebase to understand what the convention is. I’ve even found it searching node_modules to find the API of non-public libraries.
View on HN · Topics
If they're using Opus then it'll be the $100/month Claude Max 5x plan (or could be the more expensive 20x plan, depending on how intensive their use is). It does consume a lot of tokens, but I've been using the $100/mo plan and get a lot done without hitting limits. It helps to be mindful of context (regularly amending/pruning your CLAUDE.md instructions, clearing context between tasks, sizing your tasks to stay within the Opus context window). Claude Code plans have token limits that work in 5-hour blocks (which start when you send your first token, so it's often useful to prime it as early in the morning as possible). Claude Code will spawn sub-agents (which often use the cheap Haiku model) for exploration and planning tasks, with only the results imported into the main context.

I've found the best results come from a more interactive collaboration with Claude Code. As long as you describe the problem clearly, it does a good job on small/moderate tasks. I generally set two instances of Claude Code separate tasks and run them concurrently (the interaction with Claude Code distracts me too much to do my own independent coding simultaneously, like when setting a task for a colleague, but I do work on architecture/planning tasks).

The one matter of taste I have had to compromise on is the sheer amount of code - it likes to write a lot of code. I have a better experience if I sweat the low-level code less and just periodically have it clean up areas where I think it's written too much / too repetitive code. As you give it more freedom it's more prone to failure (and can often get itself stuck in a fruitless spiral) - however, as you use it more you get a sense of what it can do independently and what it's likely to choke on. A codebase with good human-designed unit & Playwright tests is very good.

Crucially, you get the best results where your tasks are complex but on the menial side of the spectrum - it can pay attention to a lot of details, but on the whole don't expect it to do great on senior-level tasks. To give you an idea, in a little over a month "npx ccusage" shows that via my Claude Code 5x sub I've used 5M input tokens, 1.5M output, 121M cache create, 1.7B cache read. The estimated pay-as-you-go API cost equivalent is $1500 (N.B. for the tail end of December they doubled everybody's API limits, so I was using a lot more tokens on more experimental on-the-fly tool construction work).
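The pay-as-you-go comparison is simple arithmetic over the four token categories. The per-million-token rates below are placeholders, not actual prices - substitute the provider's current price sheet before trusting any output:

```python
# Placeholder per-million-token rates - NOT real prices; look up the
# provider's current price sheet before relying on these numbers.
RATES = {"input": 5.00, "output": 25.00, "cache_create": 6.25, "cache_read": 0.50}

def estimate_cost(usage_tokens):
    """Dollar estimate from token counts keyed the same way as RATES."""
    return sum(count / 1_000_000 * RATES[kind]
               for kind, count in usage_tokens.items())
```

Plugging in the counts quoted above (5M input, 1.5M output, 121M cache create, 1.7B cache read) gives roughly $1,670 at these made-up rates - the same ballpark as the quoted estimate, and it shows how cache reads dominate the bill even at a tiny per-token rate.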
View on HN · Topics
I've never hit a limit with my $200-a-month plan.
View on HN · Topics
Copilot and many coding agents truncate the context window and use dynamic summarization to keep costs low for them. That's how they are able to provide flat-fee plans. You can see some of the context limits here: https://models.dev/ If you want the full capability, use the API with something like opencode. You will find that a single PR can easily rack up three digits of consumption costs.
View on HN · Topics
Maybe not, then. I'm afraid I have no idea what those numbers mean, but it looks like Gemini and ChatGPT 4 can handle a much larger context than Opus, and Opus 4.5 is cheaper than older versions. Is that correct? Because I could be misinterpreting that table.
View on HN · Topics
I don't know about GPT-4, but the latest one (GPT 5.2) has a 200k context window while Gemini has 1M, five times higher. You'll want to stay within the first 100k on all of them to avoid hitting quotas very quickly though (either start a new task or compact when you reach that), so in practice there's no difference. I've been cycling between a couple of $20 accounts to avoid running out of quota, and the latest models from all of them are great. I'd give GPT 5.2 Codex the slight edge, but not by a lot. The latest Claude is about the same too, but the limits on the $20 plan are too low for me to bother with. The last week has made me realize how close these are to being commodities already. Even the CLIs of the agents are nearly the same bar some minor quirks (although I've hit more bugs in Gemini CLI, but each time I can just save a checkpoint and restart). The real differentiating factor right now is quota and cost.
View on HN · Topics
You need to find where context breaks down. Claude was better at this even when Gemini had 5X more on paper, but both have improved with their latest releases.
View on HN · Topics
People are completely missing the point about agentic development. The model is obviously a huge factor in the quality of the output, but the real magic lies in how the tools manage and inject context into them, as well as in the tooling itself. I switched from Copilot to Cursor at the end of 2025, and it was absolute night and day in terms of how the agents behaved.
View on HN · Topics
I use it on a 10-year-old codebase; I need to explain where to get context, but it works successfully 90% of the time.
View on HN · Topics
I've also found it keeps such a constrained context window (on large codebases) that it writes a second block of code for something that already had a solution in a different area of the same file. Nothing I do seems to fix that in its initial code-writing steps. Only after it finishes, when I've asked it to go back and rewrite the changes, this time changing only 2 or 3 lines of code, does it magically (or finally) find the other implementation and reuse it. It's freakin' incredible at tracing through code and figuring it out. I <3 Opus. However, it's still quite far from any kind of set-it-and-forget-it.
View on HN · Topics
Reminds me of a post I read a few days ago of someone crowing about an LLM writing an email-format validator for them. They did not have the LLM code up an accompanying send-a-validation-email loop, and were blithely kept uninformed by the LLM of the scar tissue built up by experience in the industry around what a curiously deep rabbit hole email validation becomes. If you've been around the block and are judicious in how you use them, LLMs are a really amazing productivity boost. For those without that judgement and taste, I'm seeing footguns proliferate, and the LLMs are not warning them when someone steps on the pressure plate that's about to blow off their foot. I'm hopeful that this year we will create better context-window-based or recursive guardrails for the coding agents to solve for this.
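To illustrate the rabbit hole: the kind of regex validator an LLM will happily hand over looks reasonable, yet it rejects addresses that are syntactically legal and says nothing about whether mail actually arrives - which is why the confirmation-email loop exists. A hedged sketch of such a naive validator:

```python
import re

# The kind of "looks reasonable" validator the comment warns about.
NAIVE_EMAIL = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$")

def naive_valid(addr):
    # Accepts the common case, but rejects RFC-valid quoted local parts
    # (e.g. "john smith"@example.com), rejects dotless domains, and a
    # True result still says nothing about deliverability.
    return bool(NAIVE_EMAIL.match(addr))
```

`naive_valid('"john smith"@example.com')` is `False` even though the address is syntactically legal, and `naive_valid("nobody@example.com")` is `True` even if the mailbox doesn't exist - both directions fail, hence the send-a-confirmation loop.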
View on HN · Topics
With the right prompt the LLM will solve it in the first place. But this is an issue of not knowing what you don't know, which makes it difficult to write the right prompt. One way around this is to spawn more agents with specific tasks, or to have an agent that is ONLY focused on finding patterns/code where you're reinventing the wheel. I often have one agent/prompt where I build things, but then I have another agent/prompt whose only job is to find code smells, bad patterns, and outdated libraries, and to file issues or fix these problems.
View on HN · Topics
In my experience, using LLMs to code encouraged me to write better documentation, because I can get better results when I feed the documentation to the LLM. Also, I've noticed failure modes in LLM coding agents when there is less clarity and more complexity in abstractions or APIs. It's actually made me consider simplifying APIs so that the LLMs can handle them better. Though I agree that in specific cases what's helpful for the model and what's helpful for humans won't always overlap. Once I actually added some comments to a markdown file as a note to the LLM that most human readers wouldn't see, with some more verbose examples.

I think one of the big problems in general with agents today is that if you run the agent long enough they tend to "go off the rails", so you need to babysit them and intervene when they go off track. I guess in modern parlance, maintaining a good codebase can be framed as part of a broader "context engineering" problem.
View on HN · Topics
I've also noticed that going off the rails. At the start of a session, they're pretty sharp and focused, but the longer the session lasts, the more confused they get. At some point they start hallucinating bullshit that they wouldn't have earlier in the session. It's a vital skill to recognise when that happens and start a new session.
View on HN · Topics
A greenfield project is definitely 'easy mode' for an LLM, especially if the problem area is well understood (and documented). Opus is great, definitely speeds up development even in larger codebases, and is reasonably good at matching the coding style/standard of the existing code base.

In my opinion, the big issue is the relatively small context that quickly overwhelms the models when given a larger task on a large codebase. For example, I have a largish enterprise-grade code base with nice enterprise-grade OO patterns and class hierarchies. There was a simple tech-debt item that required refactoring about 30-40 classes to adhere to a slightly different class hierarchy. The work is not difficult, just tedious, especially as unit tests need to be fixed up. I threw Opus at it with very precise instructions as to what I wanted it to do and how I wanted it to do it. It started off well but then disintegrated once it got overwhelmed by the sheer number of files it had to change. At some point it got stuck in some kind of error loop where one change it made contradicted another, and it just couldn't work itself out. I tried stopping it and helping it out, but at that point the context was so polluted that it just couldn't see a way out.

I'd say that once an LLM can handle more 'context' than a senior dev with good knowledge of a large codebase, LLMs will be viable in a whole new realm of development tasks on existing code bases. That 'too hard to refactor this/make this work with that' task will suddenly become viable.
View on HN · Topics
I just did something similar and it went swimmingly by doing this:

Keep the plan and status in an md file. Tell it to finish one file at a time, run tests and fix issues, and then ask whether to proceed with the next file. You can then easily start a new chat with the same instructions, plan, and status if the context gets poisoned.
View on HN · Topics
You have to think of Opus as a developer whose job at your company lasts somewhere between 30 to 60 minutes before you fire them and hire a new one. Yes, it's absurd but it's a better metaphor than someone with a chronic long term memory deficit since it fits into the project management framework neatly. So this new developer who is starting today is ready to be assigned their first task, they're very eager to get started and once they start they will work very quickly but you have to onboard them. This sounds terrible but they also happen to be extremely fast at reading code and documentation, they know all of the common programming languages and frameworks and they have an excellent memory for the hour that they're employed. What do you do to onboard a new developer like this? You give them a well written description of your project with a clear style guide and some important dos and don'ts, access to any documentation you may have and a clear description of the task they are to accomplish in less than one hour. The tighter you can make those documents, the better. Don't mince words, just get straight to the point and provide examples where possible. The task description should be well scoped with a clear definition of done, if you can provide automated tests that verify when it's complete that's even better. If you don't have tests you can also specify what should be tested and instruct them to write the new tests and run them. For every new developer after the first you need a record of what was already accomplished. Personally, I prefer to use one markdown document per working session whose filename is a date stamp with the session number appended. Instruct them to read the last X log files where X is however many are relevant to the current task. Most of the time X=1 if you did a good job of breaking down the tasks into discrete chunks. 
You should also have some type of roadmap with milestones. If that file would be larger than 1000 lines, break it up so each milestone is its own document, with a table-of-contents document that gives a simple overview of the total scope, and instruct them to read the relevant milestone.

Other good practices: tell them to write a new log file after they have completed their task, recording a summary of what they did, anything they discovered along the way, and any significant decisions they made. Also tell them to commit their work afterwards; Opus will write a very descriptive commit message by default (but you can instruct them to use whatever format you prefer). You basically want them to get everything ready for hand-off to the next 60-minute developer.

If they do anything that you don't want them to do again, make sure to record that in CLAUDE.md. Same for any other interventions or guidance that you have to provide: put it in that document and Opus will almost always stick to it, unless they end up overfilling their context window.

I also highly recommend turning off auto-compaction. When the context gets compacted, they basically just write a summary of the current context, which often removes a lot of the important details; when this happens mid-task you will certainly lose parts of the context that are necessary for completing it. Anthropic seems to be working hard at making this better, but I don't think it's there yet. You might want to experiment with having it on and off and compare the results for yourself.

If your sessions are ending up with >80% of the context window used while still doing active development, then you should re-scope your tasks to make them smaller. The last 20% is fine for menial things like writing the summary, running commands, committing, etc.
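The milestone table-of-contents document mentioned above could be generated rather than maintained by hand. A sketch, assuming each milestone file starts with a markdown heading and follows a hypothetical `milestone-*.md` naming convention:

```python
from pathlib import Path

def build_toc(milestone_dir):
    """Collect the first heading of each milestone file into a
    table-of-contents overview linking to the full documents."""
    lines = ["# Roadmap overview", ""]
    for path in sorted(Path(milestone_dir).glob("milestone-*.md")):
        # First line is assumed to be a "# ..." heading; strip the marker.
        title = path.read_text().splitlines()[0].lstrip("# ").strip()
        lines.append(f"- [{title}]({path.name})")
    return "\n".join(lines)
```

Regenerating the overview after each session keeps the cheap-to-read document in sync with the detailed ones.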
People have built automated systems around this, like Beads, but I prefer the hands-on approach: I read through the produced docs to make sure things are going OK and use them as a guide for any changes I need to make mid-project. With this approach I'm 99% sure that Opus 4.5 could handle your refactor without any trouble, as long as your classes aren't so enormous that even working on a single one at a time would cause problems with the context window. If they are, you might be able to handle it by cautioning Opus not to read the whole file and to just try making targeted edits to specific methods; they're usually quite good at finding and extracting just the sections that they need, as long as they have some way to know what to look for ahead of time. Hope this helps and happy Clauding!
View on HN · Topics
In my experience, what happens is the code base starts to collapse under its own weight: it becomes impossible to fix one thing without breaking another, the coding agent fails to recognize the global scope of the problem and tries local fixes over and over, progress gets slower, and new features cost more. All the same problems faced by an inexperienced developer on a greenfield project! Has your experience been otherwise?
View on HN · Topics
They're not an equivalent intelligence to humans and thus have noticeably different failure modes. But humans also fail in ways that they don't (e.g. being unable to match LLMs' breadth and depth of knowledge). The question I'm really asking is: isn't it more than a sheer statistical "trick" if an LLM can actually be instructed to "read surrounding code", understand the request, and demonstrably include it in its operation? You can't do that unless you actually understand what "surrounding code" is, and more importantly have a way to comply with the request...
View on HN · Topics
What size of code base are you talking about? And this is your personal experience?
View on HN · Topics
Overall codebase size vs. context matters less when you set it up as a microservices-style architecture from the start. I just split it into boundaries that make sense to me, get the LLM to make a quick cheat sheet about the API, and then feed that into adjacent modules. It doesn't need to know everything about all of it to make changes, if you've got a grip on the big picture and the boundaries are somewhat sane.
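The "quick cheat sheet about the API" can be as simple as dumping public function signatures at a module boundary. A sketch using Python's `inspect` module, assuming the boundary happens to be a Python module; for other languages you'd lean on their own tooling:

```python
import inspect

def api_cheat_sheet(module):
    """List the public function signatures of a module: a compact
    summary you can feed into the context for adjacent modules."""
    lines = []
    for name, fn in inspect.getmembers(module, inspect.isfunction):
        if not name.startswith("_"):
            lines.append(f"{name}{inspect.signature(fn)}")
    return "\n".join(lines)
```

For example, `api_cheat_sheet(json)` yields a few lines of signatures instead of the thousands of lines the model would otherwise read to learn the same thing.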
View on HN · Topics
It doesn't have to be microservices, just code that is decoupled properly, so the model can search and build its context easily.