The following is content for you to summarize. Do not respond to the comments; summarize them. <topic> Agentic Workflows and Harnessing # Technical strategies for controlling AI behavior, such as 'harness engineering,' using AGENTS.md files to document rules and prevent regressions, and setting up feedback loops where agents run tests to verify their own work. This includes moving beyond simple chatbots to autonomous background processes that triage issues or perform research. </topic> <comments_about_topic> 1. You trust your natural language instructions a thousand times a day. If you ask for a large black coffee, you can trust that is more or less what you’ll get. Occasionally you may get something so atrocious that you don’t dare to drink it, but generally speaking you trust the coffee shop knows what you want. If you insist on a specific amount of coffee brewed at a specific temperature, however, you need tools to measure. AI tools are similar. You can trust them because they are good enough, and you need a way (testing) to make sure what is produced meets your specific requirements. Of course they may fail for you; that doesn’t mean they aren’t useful in other cases. All of that is simply common sense. 2. There was a guy a few months ago who found that telling the AI to do everything in a single PHP file actually produced significantly better results, i.e. it worked on the first try. Otherwise it defaulted to React, 1GB of node modules, and a site that wouldn't even load. >Am I better served For anything serious, I write the code "semi-interactively", i.e. I just prompt and verify small chunks of the program in rapid succession. That way I keep my mental model synced the whole time, I never have any catching up to do, and honestly it just feels good to stay in the driver's seat. 3. > The ai tooling reverses this where the thinking is outsourced to the machine and the user is borderline nothing more than a spectator, an observer and a rubber stamp on top. 
I find it a bit rare that this is the case though. Usually I have to carefully review what it's doing and guide it. Either by specific suggestions, or by specific tests, etc. I treat it as a "code writer" that doesn't necessarily understand the big picture. So I expect it to fuck up, and correcting it feels far less frustrating if you consider it a tool you are driving rather than letting it drive you . It's great when it gets things right but even then it's you that is confirming this. 4. Give it a read, he mentions briefly how he uses for PR triages and resolving GH issues. He doesn't go in details, but there is a bit: > Issue and PR triage/review. Agents are good at using gh (GitHub CLI), so I manually scripted a quick way to spin up a bunch in parallel to triage issues. I would NOT allow agents to respond, I just wanted reports the next day to try to guide me towards high value or low effort tasks. > More specifically, I would start each day by taking the results of my prior night's triage agents, filter them manually to find the issues that an agent will almost certainly solve well, and then keep them going in the background (one at a time, not in parallel). This is a short excerpt, this article is worth reading. Very grounded and balanced. 5. Okay I think this somewhat answers my question. Is this individual a solo developer? “Triaging GitHub issues” sounds a bit like open source solo developer. Guess I’m just desperate for an article about how organizations are actually speeding up development using agentic AI. Like very practical articles about how existing development processes have been adjusted to facilitate agentic AI. I remain unconvinced that agentic AI scales beyond solo development, where the individual is liable for the output of the agents. 
More precisely, I can use agentic AI to write my code, but at the end of the day when I submit it to my org it’s my responsibility to understand it, and guarantee (according to my personal expertise) its security and reliability. Conversely, I would fire (read: reprimand) someone so fast if I found out they submitted code that created a vulnerability that they would have reasonably caught if they weren’t being reckless with code submission speed, LLM or not. AI will not revolutionize SWE until it revolutionizes our processes. It will definitely speed us up (I have definitely become faster), but faster != revolution. 6. we're talking about _this_ post? He specifically said he only runs one agent, so sure he probably reviews the code or as he stated finds means of auto-verifying what the agent does (giving the agent a way to self-verify as part of its loop). 7. For me, AI is the best for code research and review Since some team members started using AI without care, I did create bunch of agents/skills/commands and custom scripts for claude code. For each PR, it collects changes by git log/diff, read PR data and spin bunch of specialized agents to check code style, architecture, security, performance, and bugs. Each agent armed with necessary requirement documents, including security compliance files. False positives are rare, but it still misses some problems. No PR with ai generated code passes it. If AI did not find any problems, I do manual review. 8. GPT-4 showed the potential but the automated workflows (context management, loops, test-running) and pure execution speed to handle all that "reasoning"/workflows (remember watching characters pop in slowly in GPT-4 streaming API response calls) are gamechangers. The workflow automation and better (and model-directed) context management are all obvious in retrospect but a lot of people (like myself) were instead focused on IDE integration and such vs `grep` and the like. 
Maybe multi-agent with task boards is the next thing, but it feels like that might also start to outrun the ability to sensibly design and test new features for non-greenfield/non-port projects. Who knows yet. I think it's still very valuable for someone to dig in to the underlying models periodically (insomuch as the APIs even expose the same level of raw stuff anymore) to get a feeling for what's reliable to one-shot vs what's easily correctable by a "ran the tests, saw it was wrong, fixed it" loop. If you don't have a good sense of that, it's easy to get overambitious and end up with something you don't like if you're the sort of person who cares at all about what the code looks like. 9. > Break down sessions into separate clear, actionable tasks. Don't try to "draw the owl" in one mega session. This is the key one I think. At one extreme you can tell an agent "write a for loop that iterates over the variable `numbers` and computes the sum" and they'll do this successfully, but the scope is so small there's not much point in using an LLM. On the other extreme you can tell an agent "make me an app that's Facebook for dogs" and it'll make so many assumptions about the architecture, code and product that there's no chance it produces anything useful beyond a cool prototype to show mom and dad. A lot of successful LLM adoption for code is finding this sweet spot. Overly specific instructions don't make you feel productive, and overly broad instructions you end up redoing too much of the work. 10. I agree that framing and scoping tasks is becoming a real joy. The great thing about this strategy is there's a point at which you can scope something small enough that it's hard for the AI to get it wrong and it's easy enough for you as a human to comprehend what it's done and verify that it's correct. 
I'm starting to think of projects now as a tree structure where the overall architecture of the system is the main trunk and from there you have the sub-modules, and eventually you get to implementations of functions and classes. The goal of the human in working with the coding agent is to have full editorial control of the main trunk and main sub-modules and delegate as many of the smaller branches as possible. Sometimes you're still working out the higher-level architecture, too, and you can use the agent to prototype the smaller bits and pieces which will inform the decisions you make about how the higher-level stuff should operate. 11. I feel the same, but, also, within like three years this might look very different. Maybe you'll give the full end-to-end goal upfront and it just polls you when it needs clarification or wants to suggest alternatives, and it self-manages, cleanly self-delegating. Or maybe something quite different but where these early era agentic tooling strategies still become either unneeded or even actively detrimental. 12. > it just polls you when it needs clarification I think anyone who has worked on a serious software project would say this means it would be polling you constantly. Even if we posit that an LLM is equivalent to a human, humans constantly clarify requirements/architecture. IMO on both of those fronts the correct path often reveals itself over time, rather than being knowable from the start. So in this scenario it seems like you'd be dealing with constant pings and need to really make sure your understanding of the project is growing with the LLM's development efforts as well. To me this seems like the best case of the current technology: the models have been getting better and better at doing what you tell them in small chunks but you still need to be deciding what they should be doing. These chunks don't feel as though they're getting bigger unless you're willing to accept slop. 13. 
> Break down sessions into separate clear, actionable tasks. What this misses, of course, is that you can just have the agent do this too. Agents are great at making project plans, especially if you give them a template to follow. 14. If you've got a plan for the plan, what else could you possibly need! 15. You joke, but the more I iterate on a plan before any code, the more successful the first pass is. 1) Tell claude my idea with as much as I know, ask it to ask me questions. This could go on for a few rounds. (Opus) 2) Run a validate skill on the plan, reviewer with a different prompt (Opus) 3) codex reviews the plan, always finds a few small items after the above 2. 4) claude opus implements in 1 shot, usually 99% accurate, then I manually test. If I stay on target with those steps I always have good outcomes, but it is time consuming. 16. I do something very similar. I have an "outside expert" script I tell my agent to use as the reviewer. It only bothers me when neither it nor the expert can figure out what the heck it is I actually wanted. In my case I have Gemini CLI, so I tell Gemini to use the little Python script called gatekeeper.py to validate its plan before each phase with Qwen, Kimi, or (if nothing else is getting good results) ChatGPT 5.2 Thinking. Qwen & Kimi are via fireworks.ai so it's much cheaper than ChatGPT. The agent is not allowed to start work until one of the "experts" approves it via gatekeeper. Similarly it can't mark a phase as complete until the gatekeeper approves the code as bug free and up to standards and passes all unit tests & linting. Lately Kimi is good enough, but when it's really stuck it will sometimes bother ChatGPT. Seldom does it get all the way to the bottom of the pile and need my input. Usually it's when my instructions turned out to be vague. I also have it use those larger thinking models for "expert consultation" when it's spent more than 100 turns on any problem and hasn't made progress by its own estimation. 
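As a rough illustration of the gatekeeper arrangement in comment 16, the sketch below shows one way such an escalation gate might look in Python. It is not the commenter's gatekeeper.py, whose contents are not shown. The names `gatekeeper`, `GateResult`, and the stand-in reviewer functions are hypothetical, and real reviewers would call out to models rather than inspect strings.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical reviewer type: takes a plan (text) and returns True to approve.
# In practice each reviewer would wrap a call to a model; here they are plain functions.
Reviewer = Callable[[str], bool]


@dataclass
class GateResult:
    approved: bool
    approved_by: Optional[str]  # name of the first reviewer that approved, if any


def gatekeeper(plan: str, reviewers: Sequence[Tuple[str, Reviewer]]) -> GateResult:
    """Ask reviewers in order (cheapest first) and stop at the first approval."""
    for name, review in reviewers:
        if review(plan):
            return GateResult(approved=True, approved_by=name)
    # No reviewer approved: the caller should escalate to the human.
    return GateResult(approved=False, approved_by=None)


# Stand-in reviewers for illustration only (assumptions, not real model calls).
def cheap_reviewer(plan: str) -> bool:
    return "tests" in plan


def strict_reviewer(plan: str) -> bool:
    return len(plan.strip()) > 0


REVIEWERS = [("cheap", cheap_reviewer), ("strict", strict_reviewer)]
```

The agent would be allowed to start a phase only when `gatekeeper(plan, REVIEWERS).approved` is true. Ordering the list from cheapest to most expensive reviewer mirrors the escalation the comment describes, with the human as the last resort when nothing approves.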
17. > On the other extreme you can tell an agent "make me an app that's Facebook for dogs" and it'll make so many assumptions about the architecture, code and product that there's no chance it produces anything useful beyond a cool prototype to show mom and dad. Amusingly, this was my experience in giving Lovable a shot. The onboarding process was literally just setting me up for failure by asking me to describe the detailed app I was attempting to build. Taking it piece by piece in Claude Code has been significantly more successful. 18. > the scope is so small there's not much point in using an LLM Actually that's how I did most of my work last year. I was annoyed by existing tools so I made one that can be used interactively. It has full context (I usually work on small codebases), and can make an arbitrary number of edits to an arbitrary number of files in a single LLM round trip. For such "mechanical" changes, you can use the cheapest/fastest model available. This allows you to work interactively and stay in flow. (In contrast to my previous obsession with the biggest, slowest, most expensive models! You actually want the dumbest one that can do the job.) I call it "power coding", akin to power armor, or perhaps "coding at the speed of thought". I found that staying actively involved in this way (letting LLM only handle the function level) helped keep my mental model synchronized, whereas if I let it work independently, I'd have to spend more time catching up on what it had done. I do use both approaches though, just depends on the project, task or mood! 19. This matches my experience, especially "don’t draw the owl" and the harness-engineering idea. The failure mode I kept hitting wasn’t just "it makes mistakes", it was drift: it can stay locally plausible while slowly walking away from the real constraints of the repo. The output still sounds confident, so you don’t notice until you run into reality (tests, runtime behaviour, perf, ops, UX). 
What ended up working for me was treating chat as where I shape the plan (tradeoffs, invariants, failure modes) and treating the agent as something that does narrow, reviewable diffs against that plan. The human job stays very boring: run it, verify it, and decide what’s actually acceptable. That separation is what made it click for me. Once I got that loop stable, it stopped being a toy and started being a lever. I’ve shipped real features this way across a few projects (a git-like tool for heavy media projects, a ticketing/payment flow with real users, a local-first genealogy tool, and a small CMS/publishing pipeline). The common thread is the same: small diffs, fast verification, and continuously tightening the harness so the agent can’t drift unnoticed. 20. >The failure mode I kept hitting wasn’t just "it makes mistakes", it was drift: it can stay locally plausible while slowly walking away from the real constraints of the repo. The output still sounds confident, so you don’t notice until you run into reality (tests, runtime behaviour, perf, ops, UX). Yeah I would get patterns where initial prototypes were promising, then we developed something that was 90% close to design goals, and then as we tried to push in the last 10%, drift would start breaking down, or even just forgetting, the 90%. So I would start getting to 90% and basically starting a new project with that as the baseline to add to. 21. This is what I experienced as well. These are some tricks I use now. 1. Write a generic prompt about the project and software versions and keep it in the folder. (I think this is getting pushed as SKILLS.md now) 2. In the prompt add instructions to add comments on changes; since our main job is to validate and fix any issues, it makes it easier. 3. Find the best model for the specific workflow. For example, these days I find that Gemini Pro is good for HTML UI stuff, while Claude Sonnet is good for Python code. (This is why subagents are getting popular) 22. 
I still use the chatbot but like to do it outside-in. Provide what I need, and instruct it to not write any code except the api (signatures of classes, interfaces, hierarchy, essential methods etc). We keep iterating about this until it looks good - still no real code. Then I ask it to do a fresh review of the broad outline, any issues it foresees etc. Then I ask it to write some demonstrator test cases to see how ergonomic and testable the code is - we fine tune the apis but nothing is fleshed out yet. Once this is done, we are done with the most time consuming phase. After that is basically just asking it to flesh out the layers starting from zero dependencies to arriving at the top of the castle. Even if we have any complexities within the pieces or the implementation is not exactly as per my liking, the issues are localised - I can dive in and handle it myself (most of the time, I don't need to). I feel like this approach works very well for me having a mental model of how things are connected because the most of the time I spent was spent on that model. 23. I've been thinking about this as three maturity levels. Level 1 is what Mitchell describes — AGENTS.md, a static harness. Prevents known mistakes. But it rots. Nobody updates the checklist when the environment changes. Level 2 is treating each agent failure as an inoculation. Agent duplicates a util function? Don't just fix it — write a rule file: "grep existing helpers before writing new ones." Agent tries to build a feature while the build is broken? Rule: "fix blockers first." After a few months you have 30+ of these. Each one is an antibody against a specific failure class. The harness becomes an immune system that compounds. Level 3 is what I haven't seen discussed much: specs need to push, not just be read. If a requirement in auth-spec.md changes, every linked in-progress task should get flagged automatically. The spec shouldn't wait to be consulted. 
The real bottleneck isn't agent capability — it's supervision cost. Every type of drift (requirements change, environments diverge, docs rot) inflates the cost of checking the agent's work. Crush that cost and adoption follows. 24. > level 2 - becomes an immune system i'd bet that above some number there will be contradictions. Things that apply to different semantic contexts, but look same on syntax level (and maybe with various levels of "syntax" and "semantic"). And debugging those is going to be nightmare - same as debugging requirements spec / verification of that 25. Very much the same experience. But it does not talk much about the project setup and the influence of it on the session success. In the narrow scoped projects it works really well, especially when tests are easy to execute. I found that this approach melts down when facing enterprise software with large repositories and unconventional layouts. Then you need to do a bunch of context management upfront, and verbose instructions for evaluations. But we know what it needs is a refactor thats all. And the post touches on a next type of a problem, how to plan far ahead of time to utilise agents when you are away. It is a difficult problem but IMO we’re going in a direction of having some sort of shared “templated plans”/workflows and budgeted/throttled task execution to achieve that. It is like you want to give a little world to explore so that it does not stop early, like a little game to play, then you come back in the morning and check how far it went. 26. Finally, a step-by-step guide for even the skeptics to try to see what spot the LLM tools have in their workflows, without hype or magic like I vibe-coded an entire OS, and you can too! . 27. With so much noise in the AI world and constant model updates (just today GPT-5.3-Codex and Claude Opus 4.6 were announced), this was a really refreshing read. It’s easy to relate to his phased approach to finding real value in tooling and not just hype. 
There are solid insights and practical tips here. I’m increasingly convinced that the best way not to get overwhelmed is to set clear expectations for what you want to achieve with AI and tailor how you use it to work for you, rather than trying to chase every new headline. Very refreshing. 28. It's amusing how everyone seems to be going through the same journey. I do run multiple models at once now. On different parts of the code base. I focus solely on the less boring tasks for myself and outsource all of the slam dunks and then review. Often use another model to validate the previous model's work while doing so myself. I do git reset still quite often but I find more ways to not get to that point by knowing the tools better and better. Autocompleting our brains! What a crazy time. 29. Very nice. As a consequence of this new way of working I'm using `git worktree` and diffview all the time. For more on the "harness engineering", see what Armin Ronacher and Mario Zechner are doing with pi: https://lucumr.pocoo.org/2026/1/31/pi/ https://mariozechner.at/posts/2025-11-30-pi-coding-agent/ > I really don't care one way or the other if AI is here to stay, I'm a software craftsman that just wants to build stuff for the love of the game. I suspect having three commas in one's bank account helps one be very relaxed about the outcome ;) 30. It's so sad that we're the ones who have to tell the agent how to improve by extending agent.md or whatever. I constantly have to tell it what I don't like or what can be improved or need to request clarifications or alternative solutions. This is what's so annoying about it. It's like a child that makes the same errors again and again. But couldn't it adjust itself with the goal of reducing the error bit by bit? Wouldn't this lead to the ultimate agent who can read your mind? That would be awesome. 31. > It's so sad that we're the ones who have to tell the agent how to improve by extending agent.md or whatever. 
Your improvement is someone else's code smell. There's no absolute right or wrong way to write code, and that's coming from someone who definitely thinks there's a right way. But it's my right way. Anyway, I don't know why you'd expect it to write code the way you like after it's been trained on the whole of the Internet & the RLHF labelers' preferences and the reward model. Putting some words in AGENTS.md hardly seems like the most annoying thing. Tip: add a /fix command that tells it to fix $1 and then update AGENTS.md with the text that'd stop it from making that mistake in the future. Use your nearest LLM to tweak that prompt. It's a good timesaver. 32. It is not a mind reader. I enjoy giving it feedback because it shows I am in charge of the engineering. I also love using it for research for upcoming features. Research + pick a solution + implement. It happens so fast. 33. First off, appreciate you sharing your perspective. I just have a few questions. > I've gone back to managing the context window in Emacs because I can't be bothered to learn how to deal with another model family that will be thrown out in six months. Can you expand more on what you mean by that? I'm a bit of a noob on llm enabled dev work. Do you mean that you will kick off new sessions and provide a context that you manage yourself instead of relying on a longer running session to keep relevant information? > Unironically learning vim or Emacs and the standard Unix code tools is still the best thing you can do to level up your llm usage. I appreciate your insight but I'm failing to understand how exactly knowing these tools increases performance of llms. Is it because you can more precisely direct them via prompts? 34. I can't speak for parent, but I use gptel, and it sounds like they do as well. It has a number of features, but primarily it just gives you a chat buffer you can freely edit at any time. 
That gives you 100% control over the context: you just quickly remove the parts of the conversation where the LLM went off the rails and keep it clean. You can replace or compress the context so far any way you like. While I also use LLMs in other ways, this is my core workflow. I quickly get frustrated when I can't _quickly_ modify the context. If you have some mastery over your editor, you can just run commands and post relevant output and make suggested changes to get an agent-like experience, at a speed not too different from having the agent call tools. But you retain 100% control over the context, and use a tiny fraction of the tokens OpenCode and other agent systems would use. It's not the only or best way to use LLMs, but I find it incredibly powerful, and it certainly has its place. A very nice positive effect I noticed personally is that as opposed to using agents, I actually retain an understanding of the code automatically, I don't have to go in and review the work, I review and adjust on the fly. 35. One thing to keep in mind is that the core of an LLM is basically a (non-deterministic) stateless function that takes text as input, and gives text as output. The chat and session interfaces obscure this, making it look more stateful than it is. But they mainly just send the whole chat so far back to the LLM to get the next response. That's why the context window grows as a chat/session continues. It's also why the answers tend to get worse with longer context windows – you're giving the LLM a lot more to sift through. You can manage the context window manually instead. You'll potentially lose some efficiencies from prompt caching, but you can also keep your requests much smaller and more relevant, likely spending fewer tokens. 36. LLMs work on text and nothing else. There isn't any magic there. Just a limited context window on which the model will keep predicting the next token until it decides that it's predicted enough and stops. 
All the tooling is there to manage that context for you. It works, to a degree, then stops working. Your intuition is there to decide when it stops working. This intuition gets outdated with each new release of the frontier model and changes in the tooling. The stateless API with a human deciding what to feed it is much more efficient in both cost and time as long as you're only running a single agent. I've yet to see anyone use multiple agents to generate code successfully (but I have used agent swarms for unstructured knowledge retrieval). The Unix tools are there for you to progra-manually search and edit the code base copy/paste into the context that you will send. Outside of Emacs (and possibly vim) with the ability to have dozens of ephemeral buffers open to modify their output I don't imagine they will be very useful. Or to quote the SICP lectures: The magic is that there is no magic. 37. > Immediately cease trying to perform meaningful work via a chatbot. That depends on your budget. To work within my pro plan's codex limits, I attach the codebase as a single file to various chat windows (GPT 5.2 Thinking - Heavy) and ask it to find bugs/plan a feature/etc. Then I copy the dense tasklist from chat to codex for implementation. This reduces the tokens that codex burns. Also don't sleep on GPT 5.2 Pro. That model is a beast for planning. 38. > Context switching is very expensive. In order to remain efficient, I found that it was my job as a human to be in control of when I interrupt the agent, not the other way around. Don't let the agent notify you. This I have found to be important too. 39. I'd be interested to know what agents you're using. You mentioned Claude and GPT in passing, but don't actually talk about which you're using or for which tasks. 40. Good article! I especially liked the approach to replicate manual commits with the agent. I did not do that when learning but I suspect I'd have been much better off if I had. 41. 
Thanks for sharing your experiences :) You mentioned "harness engineering". How do you approach building "actual programmed tools" (like screenshot scripts) specifically for an LLM's consumption rather than a human's? Are there specific output formats or constraints you’ve found most effective? 42. I've found, mostly for context reasons, it's better to just have a grand overview of the systems and how they work together and feed that to the agent as context; it will use the additional files it touches to expand its understanding if you prompt well. 43. Do you have any ideas on how to harness AI to only change specific parts of a system or workpiece? Like "I consider this part 80/100 done and only make 'meaningful' or 'new contributions' here" ...? 44. > I'm not [yet?] running multiple agents, and currently don't really want to This is the main reason to use AI agents, though: multitasking. If I'm working on some Terraform changes and I fire off an agent loop, I know it's going to take a while for it to produce something working. In the meantime I'm waiting for it to come back and pretend it's finished (really I'll have to fix it), so I start another agent on something else. I flip back and forth between the finished runs as they notify me. At the end of the day I have 5 things finished rather than two. The "agent" doesn't have to be anything special either. Anything you can run in a VM or container (vscode w/copilot chat, any cli tool, etc) so you can enable YOLO mode. 45. > If an agent isn't running, I ask myself "is there something an agent could be doing for me right now?" Solution-looking-for-a-problem mentality is a curse. </comments_about_topic> Write a concise, engaging paragraph (3-5 sentences) summarizing the key points and perspectives in these comments about the topic. Focus on the most interesting viewpoints. Do not use bullet points; write flowing prose.