Summarizer

LLM Input

llm/8d288441-d245-4951-86d7-2256c9013d39/topic-2-404ac58c-0e85-416a-ab69-3a64fcd9ca34-input.json

prompt

The following is content for you to summarize. Do not respond to the comments—summarize them.

<topic>
Agent Orchestration Challenges # Difficulties in managing multiple agents including context state, codebase conventions, steering, merge conflicts, and the fundamental bottleneck of human review and accountability
</topic>

<comments_about_topic>
1. The difficulty comes in managing the agent. Ensuring it knows the state of the codebase, conventions to follow, etc. Steering it.

I've had the same experience as you. I've applied it to old projects which I have some frame of reference for and it's like a 200x speed boost. Just absolutely insane - that sort of speed can overcome a lot of other shortcomings.

2. This is clearly going to develop the same problem Beads has. I've used it. I'm in stage 7. Beads is a good idea with a bad implementation. It's not a designed product in the sense we are used to, it's more like a stream of consciousness converted directly into code. There are many features that overlap significantly, strange bugs, and the docs are also AI generated so have fun reading them. It's a program that isn't only vibe coded, it was vibe designed too.

Gas Town is clearly the same thing multiplied by ten thousand. The number of overlapping and ad hoc concepts in this design is overwhelming. Steve is ahead of his time, but we aren't going to end up using this stuff. Instead, a few of the core insights will get incorporated into other agents in a simpler but no less effective way.

And anyway the big problem is accountability. The reason everyone makes a face when Steve preaches agent orchestration is that he must be in an unusual social situation. Gas Town sounds fun if you are accountable to nobody: not for code quality, design coherence or inferencing costs. The rest of us are accountable for at least the first two and even in corporate scenarios where there is a blank check for tokens, that can't last. So the bottleneck is going to be how fast humans can review code and agree to take responsibility for it. Meaning, if it's crap code with embarrassing bugs then that goes on your EOY perf review. Lots of parallel agents can't solve that fundamental bottleneck.

3. I tried using beads. There kept being merge conflicts, and the agent just kept one or the other change instead of merging intelligently, killing any work I did on creating tasks or resolving others. Still haven't seen how beads solves this problem... and it's also an unnecessary one. This should be a separate piece that doesn't rely on the agent not messing up the merge.

4. How long until Atlassian makes "JIRA for Agents" where all your tasks and updates and memory aren't stored in Git (so no merge conflicts) but are still centralized and shareable between all your agents/devs/teams..

And also auditable, trackable, reportable, etc..

5. The article seems to be about fun, which I'm all for, and I highly appreciate the usage of MAKER as an evaluation task (finally, people are actually evaluating their theories on something quantitative) but the messaging here seems inherently contradictory:

> Gas Town helps with all that yak shaving, and lets you focus on what your Claude Codes are working on.

Then:

> Working effectively in Gas Town involves committing to vibe coding. Work becomes fluid, an uncountable that you sling around freely, like slopping shiny fish into wooden barrels at the docks. Most work gets done; some work gets lost. Fish fall out of the barrel. Some escape back to sea, or get stepped on. More fish will come. The focus is throughput: creation and correction at the speed of thought.

I see -- so where exactly is my focus supposed to sit?

As someone who sits comfortably in the "Stage 8" category that this article defines, my concern has never been throughput; it has always been about retaining a high degree of quality while organizing work so that, when context switching occurs, it transitions me to near-orthogonal tasks which are easy to remember, so I can give high-quality feedback before switching again.

For instance, I know Project A -- these are the concerns of Project A. I know Project B -- these are the concerns of Project B. I have the insight to design these projects so they compose, so I don't have to keep track of a hundred parallel issues in a mono Project C.

On each of those projects, run a single agent -- with review gates for 2-3 independent agents (fresh context, different models! Codex and Gemini). Use a loop, let the agents go back and forth.

This works and actually gets shit done. I'm not convinced that 20 Claudes or massively parallel worktrees or whatever improves on quality, because, indeed, I always have to intervene at some point. The blocker for me is not throughput, it's me -- a human being -- my focus, and the random points of intervention which ... by definition ... occur stochastically (because agents).

Finally:

> Opus 4.5 can handle any reasonably sized task, so your job is to make tasks for it. That’s it.

This is laughably not true, for anyone who has used Opus 4.5 for non-trivial tasks. Claude Code constantly gives up early, corrupts itself with self-bias, the list goes on and on. It's getting better, but it's not that good.

6. I've tried most of the agentic "let it rip" tools. Quickly I realized that GPT 5~ was significantly better at reasoning and more exhaustive than Claude Code (Opus, RL finetuned for Claude Code).

"What if Opus wrote the code, and GPT 5~ reviewed it?" I started evaluating this question, and started to get higher quality results and better control of complexity.

I could also trust this process to a greater degree than my previous process of trying to drive Opus, look at the code myself, try and drive Opus again, etc. Codex was catching bugs I would not catch with the same amount of time, including bugs in hard math, etc -- so I started having a great degree of trust in its reasoning capabilities.

I've codified this workflow into a plugin which I've started developing recently: https://github.com/evil-mind-evil-sword/idle

It's a Claude Code plugin -- it combines the "don't let Claude stop until condition" (Stop hook) with a few CLI tools to induce (what the article calls) review gates: Claude will work indefinitely until the reviewer is satisfied.

In this case, the reviewer is a fresh Opus subagent which can invoke and discuss with Codex and Gemini.
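The Stop-hook review gate described above can be sketched roughly like this (a minimal illustration only: the verdict file name and "APPROVED" convention are assumptions, not taken from the plugin; Claude Code hook commands signal "block the stop" with exit code 2, whose stderr is fed back to the model, while exit code 0 lets the session end):

```python
# Hypothetical Stop-hook check: keep Claude working until a reviewer
# (e.g. a fresh subagent) writes an approval verdict to a file.
# File name and "APPROVED" marker are assumed conventions for this sketch.
import os
import sys


def review_verdict(path: str = ".review-verdict") -> int:
    """Return 0 if the reviewer signed off, 2 to keep Claude working."""
    if os.path.exists(path):
        with open(path) as f:
            if "APPROVED" in f.read():
                return 0  # exit 0: allow the session to stop
    # exit 2: block the stop; this stderr text is shown back to the model
    print("Reviewer not satisfied yet; address the comments and continue.",
          file=sys.stderr)
    return 2


# An actual hook script would end with: sys.exit(review_verdict())
```

Wired in as the command for the Stop hook event, this loops the agent until the gate opens.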

One perspective I have which relates to this article is that the thing one wants to optimize for is minimizing the error per unit of work. If you have a dynamic programming style orchestration pattern for agents, you want the thing that solves the small unit of work (a task) to have as low error as possible, or else I suspect the error compounds quickly with these stochastic systems.
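The compounding intuition is easy to make concrete (illustrative numbers only, assuming independent per-task errors, which is a simplification):

```python
# Illustrative only: if each task in a chain fails independently with
# probability p, the chance the whole pipeline finishes clean decays
# geometrically with the number of chained tasks.
def chain_success(per_task_error: float, n_tasks: int) -> float:
    """Probability that n chained tasks all complete without error."""
    return (1.0 - per_task_error) ** n_tasks


# A modest 5% per-task error rate leaves only ~36% odds over 20 tasks.
print(round(chain_success(0.05, 20), 3))  # 0.358
```

Which is why, under this model, driving down the per-task error rate pays off faster than adding more parallel workers.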

I'm trying this stuff for fairly advanced work (in a PhD), so I'm dogfooding ideas (like the ones presented in this article) in complex settings. I think there is still a lot of room to learn here.

7. I tried it out but despite what the README says ( https://github.com/steveyegge/gastown ), the mayor didn't create a convoy or anything, the mayor is just doing all the work itself, appearing no different than a `claude` invocation.

Update: I was hoping it'd at least be smart enough to automatically test the project still builds but it did not. It also didn't commit the changes.

> are you the mayor?

Yes. I violated the Mayor protocol - I should have dispatched this work to the gmailthreading crew worktree instead of implementing it directly myself.

The CLAUDE.md is clear: "Mayor Does NOT Edit Code" and "Coordinate, don't implement."

Maybe Yegge should have built it around Codex instead - Codex is a lot better at adhering to instructions.

Pros: The overall system architecture is similar to my own latest attempt at solving this problem. I like the tmux-based console-monitoring approach (rather than going full SDK + custom UI), it makes it easier to inspect what is going on. The overlap between my ideas and Steve's is around 75%.

Cons: Arguing with "The Mayor" about some other detached process's poor workmanship seems like a major disconnect and architectural gap. A game of telephone is unlikely to be better than simply using claude. I was also hoping gastown would amplify my intent to complete the task of "Add feature X" without early-stopping, but so far it's more work than both 1. vibing with claude directly and 2. creating a highly detailed spec with checkboxes and piping in "do the next task" until it's done.

Definitely looking forward to seeing how the tools in this space evolve. Eventually someone is bound to get it right!

P.s. the choice of nomenclature throughout the article is a bit odd, making it hard to follow. Movie characters, dogs and raccoons, huh? How about striving for descriptive SWE clarity?

8. It's nice to see someone else going mad, even deeper down the well.

I don't know the details, but I was wondering why people aren't "just" writing chat venues and comms protocols for the chats? So the fundamental unit is a chat that humans and agents can be members of.

You can also have DMs etc to avoid chattiness.

But fundamentally, if you start with this kind of madness you don't have a strict hierarchy, and it might also be fun to see how it goes.

I briefly started building this but just spun out and am stuck using PAL MCP for now and some dumb scripts. Not super content with any of it yet.

9. There's a simpler design here begging to show itself.

We're trying to orchestrate a horde of agents. The workers (polecats?) are the main problem solvers. Now you need a top-level agent (mayor) to break down the problem and delegate work, and then a merger to resolve conflicts in the resulting code (refinery). Sometimes agents get stuck and need encouragement.

The molecules stuff confused me, but I think they're just "policy docs," checklists to do common tasks.

But this is baby stuff. Only one level of hierarchy? Show me a design for your VP agent and I'll be impressed for real.
</comments_about_topic>

Write a concise, engaging paragraph (3-5 sentences) summarizing the key points and perspectives in these comments about the topic. Focus on the most interesting viewpoints. Do not use bullet points—write flowing prose.

topic

Agent Orchestration Challenges # Difficulties in managing multiple agents including context state, codebase conventions, steering, merge conflicts, and the fundamental bottleneck of human review and accountability

commentCount

9
