The following is content for you to summarize. Do not respond to the comments—summarize them.

<topic>
Testing and Verification # Emphasis on the need for humans to verify AI output, run tests, and maintain quality control since AI cannot reliably check its own work
</topic>

<comments_about_topic>
1. I wonder about this. If (and it's a big if) the LLM gives you something that kinda, sorta works, it may be an easier task to keep that working, and make it work better, while you refactor it, than it would have been to write it from scratch. That is going to depend a lot on the skill set and motivation of the programmer, as well as the quality of the initial code dump, but... there's a lot to be said for working code. After all, how many prototypes get shipped?

2. Is it annoying that I tell it to do something and it does about a third of it? Absolutely. Can I get it to finish by asking it over and over to code-review its PR, or some other generic prompt to weed out the skips and scaffolding? Also yes. Basically these things just need a supervisor looking at the requirements and test results and evaluating the code in a loop. Sometimes that's a human; it can also absolutely be an LLM. Having a second LLM with limited context asking questions of the worker LLM works, more so when the outer loop has code driving it and not just a prompt.

3. I often ask for big things. For example, I'm working on some virtualization things where I want a machine to be provisioned with a few options of Linux distros and BSDs. In one prompt I asked for this list to be provisioned so that a certain test of SSH would complete; it worked on it for several hours and now we're doing the code-review loop. At first it gave up on the BSDs and I had to poke it to actually finish with an idea it had already had; now I'm asking it to find bugs and it's highlighting many mediocre code decisions it has made. I haven't even tested it, so I'm not sure if it's lying about anything working yet.

4. Tell it to analyze your codebase for best practices and suggest fixes. Tell it to analyze your architecture, security, documentation, etc. Install Claude to do review on GitHub pull requests and prompt it to review each one with all of these things. Just keep expanding your imagination about what you can ask it to do. Think of it more like designing an organization: pinning down the important things, providing code review and guardrails where it needs them, and letting it work where it doesn't.

5. From the linked project: > The reality: 3 weeks in, ~50 hours of coding, and I'm mass-producing features faster than I can stabilize them. Things break. A lot. But when it works, it works.

6. You just have another agent/session/context refactor as you go. I built a skribbl.io clone to use at work. We like to play EOD on Friday as a happy hour, and when we would play skribbl.io we would try to get screencaps of the stupid images we were drawing, but sometimes we would forget. So I said I'd use Claude to build our own skribbl.io that would save the images. I was definitely surprised that Claude threaded the needle on the task pretty easily, pretty much single-shot. Then I continued adding features until I had near parity. Then I added the replay feature. After all that I looked at the codebase... pretty much a single big file. It worked, though, so we played it for the time being. I wanted to fix some bugs and add more features, so I checked out a branch and had an agent refactor first. I'd have a couple of contexts/sessions open: one would just review, the other refactored, and sometimes I'd throw in a third context/session that would just write and run tests. The LLM will build things poorly if you let it, but it's easy to prompt it another way, and even if you fail at that and back yourself into a corner, it's easy to get the agents to refactor. It's just like writing tests: the LLMs are great at writing shitty, useless tests, but you can be specific with your prompt and, in addition, use another agent/context/session to review and find shitty tests and tell you why they're shitty, or look for missing tests. Basically, keep doing a review, then feed the review into the agent writing the tests.

7. I'm using it in a >200kloc codebase successfully, too. I think a key is to work in a properly modular codebase so it can focus on the correct changes and ignore unrelated stuff. That said, I do catch it doing some of the stuff the OP mentioned—particularly leaving "backwards compatibility" stuff in place. But really, all of the stuff he mentions, I've experienced when I've given it an overly broad mandate.

8. Yes, this is my experience as well. I've found the key is having the AI create and maintain clear documentation from the beginning. It helps me understand what it's building, and it helps the model maintain context when it comes time to add or change something. You also need a reasonably modular architecture which isn't incredibly interdependent, because that's hard to reason about, even for humans. You also need lots and lots (and LOTS) of unit tests to prevent regressions.

9. I wonder if you can push past the 10kloc if you have a good static-analysis tool (I vibecoded one in Python) and good tests. Sometimes good tests aren't possible since there are too many different cases, but with other kinds of code you can cover all the cases with 50 to 100 tests or so.

10. Found a problem? Slap another agent on top to fix it. It's hilarious to see how the pendulum has swung away from "thinking from first principles" as a buzzword. Just engineer, dammit...

11. Who is liable for the runtime behavior of the system when handling users' sensitive information? If the person who is liable for the system's behavior cannot read/write code (as "all coders have been replaced"), does Anthropic et al. become responsible for damages to end users for systems its tools/models build? I assume not. How do you reconcile this? We have tools that help engineers design and build bridges, but I still wouldn't want to drive on an "autonomously generated bridge, may contain errors, use at own risk" because all human structural-engineering experts have been replaced. After asking this question many times in similar threads, I've received no substantial response except that "something" will probably resolve this; maybe AI will figure it out.

12. Yeah... I'm using Claude Code almost all day every day, but it still 100% requires my judgment. If another AI like OpenClaw was just giving the thumbs up to whatever CC was doing, it would not end well (for my projects, anyway).

13. While Claude was trying to fix a bug for me (one of these "There! It's fixed now!" / "No it's not, the UT still doesn't pass" / "Ah, I see, let's fix the UT" / "No you don't, fix the code" loops), I was updating my on-call rotation after having to run after people to refresh my credentials to do so, after attending a ship room where I had to provide updates and estimates. Why isn't Claude doing all that for me while I code? Why the obsession that we must use code generation, when automating those other garbage activities would free me to do what I'm, on paper, paid to do? It's less sexy, of course; it doesn't have the promise of removing me in the end. But the reason, in the present state, is that IT admins would never accept an LLM handling permissions and rotations, and management would never accept an LLM reporting status or providing estimates. This is all "serious" work where we can't have all the errors LLMs create. Dev isn't that bad: devs can clean slop and customers can deal with bugs.

14. What I don't understand in these posts is how exactly the AI is checking its work. That's literally what I'm here for now. It doesn't know how to log in to my iOS app using the simulator, or navigate to the Firebase console and download a plist file. Once we get to a spot where the AI can check its work and iterate, the loop is closed. But we are a long way off from that atm, even for the web. I mean, have you tried the Playwright MCP server? Aside from being the slowest tool calls I have ever seen, the agent struggles mightily to figure out the simplest of navigation and interaction. Yes, yes, unit tests, but functional testing is the be-all and end-all, and until it can iterate and create its own functional test suite, I just don't get it. What am I missing?

15. I don't buy it. It's the same model underneath, running whatever UI. It's the same model that keeps forgetting and missing details. And somehow, when it is given a bunch of CLI tools and more interfaces to interact with, it suddenly becomes a 10x AI? It may feel like it for a manager whose job is to deal with actual people who push back. Will it stop bypassing a test because it is not directly related to a feature I asked for? I don't think so.

16. I've done some phone programming over the Xmas holidays with clawdbot. This does work, BUT you absolutely need to demand clearly measurable outcomes of the agent, like a closed feedback loop, comparison with a reference implementation, or a perfect score in a simulated environment. Without this, the implementation will be incomplete and likely utter crap. Even then, the architecture will be horrible unless you chat _a lot_ about it up front. At some point, it's easier to just look in the terminal.

17. I'm working on a product related to "sensemaking". And I'm using this abstract, academic term on purpose, to highlight the emotional experience, rather than "analysis" or "understanding". It is a constant lure for products and tools to create the feeling of sensemaking. People want (pejorative) tools that show visualizations or summaries, without thinking about whether the particular visual/summary artifact is useful, actionable, or accurate!

18. If you use Cursor or Claude, you have to oversee it and steer it so it gets very close to what you want to achieve. If you delegate these tasks to OpenClaw, I am not really sure the result is exactly what you want to achieve, or that it works like you want it to.
</comments_about_topic>

Write a concise, engaging paragraph (3-5 sentences) summarizing the key points and perspectives in these comments about the topic. Focus on the most interesting viewpoints. Do not use bullet points—write flowing prose.
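As a note on comment 2's "code-driven outer loop" idea, the supervisor pattern it describes can be sketched in a few lines. This is a minimal, hypothetical sketch: `supervise`, the `worker` callable (standing in for an LLM agent), and `run_tests` (standing in for a real test suite) are all invented names for illustration, not part of any real agent framework.

```python
# Sketch of a code-driven supervisor loop (comment 2): run the worker,
# check its output against the requirements, feed failures back, repeat.
# `worker` and `run_tests` are hypothetical stand-ins for an LLM agent
# and a real test suite such as pytest.

def supervise(worker, run_tests, max_rounds=5):
    feedback = None
    for round_no in range(1, max_rounds + 1):
        artifact = worker(feedback)     # ask the worker to (re)produce code
        failures = run_tests(artifact)  # evaluate it against the requirements
        if not failures:
            return artifact, round_no   # requirements met; stop looping
        feedback = failures             # hand the failures back to the worker
    raise RuntimeError(f"still failing after {max_rounds} rounds: {failures}")

# Toy usage: a "worker" that only gets it right after seeing a failure report.
def toy_worker(feedback):
    return 42 if feedback else 0

def toy_tests(artifact):
    return [] if artifact == 42 else ["expected 42"]
```

The point of driving the loop with code rather than a prompt is that the exit condition (an empty failure list) is checked mechanically, not by asking the model whether it is done.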