Summarizer

LLM Input

llm/065c6e83-d0d5-4aca-be3d-92768a8a3506/batch-2-61aef457-fe0a-43ec-a98e-cb20658a046c-input.json

prompt

The following is content for you to classify. Do not respond to the comments—classify them.

<topics>
1. Not Novel or Revolutionary
   Related: Many commenters argue this workflow is standard practice, not radically different. References to existing tools like Kiro, OpenSpec, SpecKit, and Antigravity that already implement spec-driven development. Claims the approach was documented 2+ years ago in Cursor forums.
2. LLMs as Junior Developers
   Related: Analogy comparing LLMs to unreliable interns with boundless energy. Discussion of treating AI like junior developers requiring supervision, documentation, and oversight. The shift from coder to software manager role.
3. AI-Generated Article Concerns
   Related: Multiple commenters suspect the article itself was written by AI, noting characteristic style and patterns. Debate about whether AI-written content should be evaluated differently or dismissed outright.
4. Magic Words and Prompt Engineering
   Related: Skepticism about whether words like 'deeply' and 'in great details' actually affect LLM behavior. Discussion of attention mechanisms, emotional prompting research, and whether prompt techniques are superstition or cargo cult.
5. Planning vs Just Coding
   Related: Debate about whether extensive planning overhead eliminates time savings. Some argue writing specs takes longer than writing code. Others counter that planning prevents compounding errors and technical debt.
6. Spec-Driven Development Tools
   Related: References to existing frameworks: OpenSpec, SpecKit, BMAD-METHOD, Kiro, Antigravity. Discussion of how these tools formalize the research-plan-implement workflow described in the article.
7. Context Window Management
   Related: Strategies for handling large codebases and context limits. Maintaining markdown files for subsystems, using skills, aggressive compaction. Concerns about context rot and performance degradation.
8. Waterfall Methodology Comparison
   Related: Commenters note the approach resembles waterfall development with detailed upfront planning. Discussion of whether this contradicts agile principles or represents rediscovering proven methods.
9. Test-Driven Development Integration
   Related: Suggestions to add comprehensive tests to the workflow. Writing tests before implementation, using tests as verification. Arguments that test coverage enables safer refactoring with AI.
10. Single Session vs Multiple Sessions
   Related: Author's claim of running entire workflows in single long sessions without performance degradation. Others recommend clearing context between phases for better results.
11. Determinism and Reproducibility
   Related: Concerns about non-deterministic LLM outputs. Discussion of whether software engineering can accommodate probabilistic tools. Comparisons to gambling and slot machines.
12. Token Cost Considerations
   Related: Discussion of workflow being token-heavy and expensive. Comparisons between Claude subscription tiers. Arguments that simpler approaches save money while achieving similar results.
13. Annotation Workflow Details
   Related: Questions about how to format inline annotations for Claude to recognize. Techniques like TODO prefixes, HTML comments, and clear separation between human and AI-written content.
14. Subagent Architecture
   Related: Using multiple agents for different phases: planning, implementation, review. Red team/blue team approaches. Dispatching parallel agents for independent tasks.
15. Reference Implementation Technique
   Related: Using existing code from open source projects as examples for Claude. Questions about licensing implications. Claims this dramatically improves output quality.
16. Claude vs Other Models
   Related: Comparisons between Claude, Codex, Gemini, and other models. Discussion of model-specific behaviors and optimal prompting strategies. Using multiple models in complementary roles.
17. Greenfield vs Existing Codebases
   Related: Observation that most AI coding articles focus on greenfield development. Different challenges when working with legacy code and established patterns.
18. Human Review Requirements
   Related: Debate about whether all AI-generated code must be reviewed line-by-line. Questions about trust, liability, and whether AI can eventually be trusted without oversight.
19. Productivity Claims Skepticism
   Related: Questions about actual time savings versus perceived productivity. References to studies showing AI sometimes makes developers less productive. Concerns about false progress.
20. Documentation as Side Benefit
   Related: Plans and research documents serve as valuable documentation for future maintainers. Version controlling plan files in git. Using plans to understand architectural decisions later.
0. Does not fit well in any category
</topics>

<comments_to_classify>
[
  
{
  "id": "47109082",
  "text": "Is it on GH?"
}
,
  
{
  "id": "47109152",
  "text": "It was when I mvp'd it 3 weeks ago. Then I removed it as I was toying with the idea of somehow monetizing it. Then I added a few features which would make monetization impossible (e.g. How the app obtains etf/stock prices live and some other things). I reckon I could remove those and put in gh during the week if I don't forget. The quality of the Web app is SaaS grade IMO. Keyboard shortcuts, cmd+k, natural language parsing, great ui that doesn't look like made by ai in 5min. Might post here the link."
}
,
  
{
  "id": "47110736",
  "text": "Would love to check it out too once you put it up."
}
,
  
{
  "id": "47107091",
  "text": "> Notice the language: “deeply”, “in great details”, “intricacies”, “go through everything”. This isn’t fluff. Without these words, Claude will skim. It’ll read a file, see what a function does at the signature level, and move on. You need to signal that surface-level reading is not acceptable.\n\nThis makes no sense to my intuition of how an LLM works. It's not that I don't believe this works, but my mental model doesn't capture why asking the model to read the content \"more deeply\" will have any impact on whatever output the LLM generates."
}
,
  
{
  "id": "47107622",
  "text": "It's the attention mechanism at work, along with a fair bit of Internet one-up-manship. The LLM has ingested all of the text on the Internet, as well as Github code repositories, pull requests, StackOverflow posts, code reviews, mailing lists, etc. In a number of those content sources, there will be people saying \"Actually, if you go into the details of...\" or \"If you look at the intricacies of the problem\" or \"If you understood the problem deeply\" followed by a very deep, expert-level explication of exactly what you should've done differently. You want the model to use the code in the correction, not the one in the original StackOverflow question.\n\nSame reason that \"Pretend you are an MIT professor\" or \"You are a leading Python expert\" or similar works in prompts. It tells the model to pay attention to the part of the corpus that has those terms, weighting them more highly than all the other programming samples that it's run across."
}
,
  
{
  "id": "47108641",
  "text": "I don’t think this is a result of the base training data („the internet“). It’s a post training behavior, created during reinforcement learning. Codex has a totally different behavior in that regard. Codex reads per default a lot of potentially relevant files before it goes and writes files.\n\nMaybe you remember that, without reinforcement learning, the models of 2019 just completed the sentences you gave them. There were no tool calls like reading files. Tool calling behavior is company specific and highly tuned to their harnesses. How often they call a tool, is not part of the base training data."
}
,
  
{
  "id": "47108676",
  "text": "Modern LLM are certainly fine tuned on data that includes examples of tool use, mostly the tools built into their respective harnesses, but also external/mock tools so they dont overfit on only using the toolset they expect to see in their harnesses."
}
,
  
{
  "id": "47109416",
  "text": "IDK the current state, but I remember that, last year, the open source coding harnesses needed to provide exactly the tools that the LLM expected, or the error rate went through the roof. Some, like grok and gemini, only recently managed to make tool calls somewhat reliable."
}
,
  
{
  "id": "47108184",
  "text": "Of course I can't be certain, but I think the \"mixture of experts\" design plays into it too. Metaphorically, there's a mid-level manager who looks at your prompt and tries to decide which experts it should be sent to. If he thinks you won't notice, he saves money by sending it to the undergraduate intern.\n\nJust a theory."
}
,
  
{
  "id": "47108345",
  "text": "Notice that MOE isn’t different experts for different types of problems. It’s per token and not really connect to problem type.\n\nSo if you send a python code then the first one in function can be one expert, second another expert and so on."
}
,
  
{
  "id": "47109088",
  "text": "Can you back this up with documentation? I don't believe that this is the case."
}
,
  
{
  "id": "47109335",
  "text": "Check out Unsloths REAP models, you can outright delete a few of the lesser used experts without the model going braindead since they all can handle each token but some are better posed to do so."
}
,
  
{
  "id": "47108005",
  "text": "This is such a good explanation. Thanks"
}
,
  
{
  "id": "47109143",
  "text": ">> Same reason that \"Pretend you are an MIT professor\" or \"You are a leading Python expert\" or similar works in prompts.\n\nThis pretend-you-are-a-[persona] is cargo cult prompting at this point. The persona framing is just decoration.\n\nA brief purpose statement describing what the skill [skill.md] does is more honest and just as effective."
}
,
  
{
  "id": "47111080",
  "text": "I think it does more harm than good on recent models. The LLM has to override its system prompt to role-play, wasting context and computing cycles instead of working on the task."
}
,
  
{
  "id": "47110019",
  "text": "You will never convince me that this isn't confirmation bias, or the equivalent of a slot machine player thinking the order in which they push buttons impacts the output, or some other gambler-esque superstition.\n\nThese tools are literally designed to make people behave like gamblers. And its working, except the house in this case takes the money you give them and lights it on fire."
}
,
  
{
  "id": "47110206",
  "text": "Your ignorance is my opportunity. May I ask which markets you are developing for?"
}
,
  
{
  "id": "47110341",
  "text": "\"The equivalent of saying, which slot machine were you sitting at It'll make me money\""
}
,
  
{
  "id": "47108286",
  "text": "That’s because it’s superstition.\n\nUnless someone can come up with some kind of rigorous statistics on what the effect of this kind of priming is it seems no better than claiming that sacrificing your first born will please the sun god into giving us a bountiful harvest next year.\n\nSure, maybe this supposed deity really is this insecure and needs a jolly good pep talk every time he wakes up. or maybe you’re just suffering from magical thinking that your incantations had any effect on the random variable word machine.\n\nThe thing is, you could actually prove it, it’s an optimization problem, you have a model, you can generate the statistics, but no one as far as I can tell has been terribly forthcoming with that , either because those that have tried have decided to try to keep their magic spells secret, or because it doesn’t really work.\n\nIf it did work, well, the oldest trick in computer science is writing compilers, i suppose we will just have to write an English to pedantry compiler."
}
,
  
{
  "id": "47109139",
  "text": "I actually have a prompt optimizer skill that does exactly this.\n\nhttps://github.com/solatis/claude-config\n\nIt’s based entirely off academic research, and a LOT of research has been done in this area.\n\nOne of the papers you may be interested in is “emotion prompting”, eg “it is super important for me that you do X” etc actually works.\n\n“Large Language Models Understand and Can be Enhanced by Emotional Stimuli”\n\nhttps://arxiv.org/abs/2307.11760"
}
,
  
{
  "id": "47108530",
  "text": "> If it did work, well, the oldest trick in computer science is writing compilers, i suppose we will just have to write an English to pedantry compiler.\n\n\"Add tests to this function\" for GPT-3.5-era models was much less effective than \"you are a senior engineer. add tests for this function. as a good engineer, you should follow the patterns used in these other three function+test examples, using this framework and mocking lib.\" In today's tools, \"add tests to this function\" results in a bunch of initial steps to look in common places to see if that additional context already exists, and then pull it in based on what it finds. You can see it in the output the tools spit out while \"thinking.\"\n\nSo I'm 90% sure this is already happening on some level."
}
,
  
{
  "id": "47110717",
  "text": "But can you see the difference if you only include \"you are a senior engineer\"? It seems like the comparison you're making is between \"write the tests\" and \"write the tests following these patterns using these examples. Also btw you’re an expert. \""
}
,
  
{
  "id": "47108786",
  "text": "I think \"understand this directory deeply\" just gives more focus for the instruction. So it's like \"burn more tokens for this phase than you normally would\"."
}
,
  
{
  "id": "47109060",
  "text": "i suppose we will just have to write an English to pedantry compiler.\n\nA common technique is to prompt in your chosen AI to write a longer prompt to get it to do what you want. It's used a lot in image generation. This is called 'prompt enhancing'."
}
,
  
{
  "id": "47108980",
  "text": "> That’s because it’s superstition.\n\nThis field is full of it. Practices are promoted by those who tie their personal or commercial brand to it for increased exposure, and adopted by those who are easily influenced and don't bother verifying if they actually work.\n\nThis is why we see a new Markdown format every week, \"skills\", \"benchmarks\", and other useless ideas, practices, and measurements. Consider just how many \"how I use AI\" articles are created and promoted. Most of the field runs on anecdata.\n\nIt's not until someone actually takes the time to evaluate some of these memes, that they find little to no practical value in them.[1]\n\n[1]: https://news.ycombinator.com/item?id=47034087"
}
,
  
{
  "id": "47107186",
  "text": "Its a wild time to be in software development. Nobody(1) actually knows what causes LLMs to do certain things, we just pray the prompt moves the probabilities the right way enough such that it mostly does what we want. This used to be a field that prided itself on deterministic behavior and reproducibility.\n\nNow? We have AGENTS.md files that look like a parent talking to a child with all the bold all-caps, double emphasis, just praying that's enough to be sure they run the commands you want them to be running\n\n(1 Outside of some core ML developers at the big model companies)"
}
,
  
{
  "id": "47107884",
  "text": "It’s like playing a fretless instrument to me.\n\nPractice playing songs by ear and after 2 weeks, my brain has developed an inference model of where my fingers should go to hit any given pitch.\n\nDo I have any idea how my brain’s model works? No! But it tickles a different part of my brain and I like it."
}
,
  
{
  "id": "47108106",
  "text": "Sufficiently advanced technology has become like magic: you have to prompt the electronic genie with the right words or it will twist your wishes."
}
,
  
{
  "id": "47108657",
  "text": "Light some incense, and you too can be a dystopian space tech support, today! Praise Omnissiah!"
}
,
  
{
  "id": "47108940",
  "text": "are we the orks?"
}
,
  
{
  "id": "47107413",
  "text": "For Claude at least, the more recent guidance from Anthropic is to not yell at it. Just clear, calm, and concise instructions."
}
,
  
{
  "id": "47108030",
  "text": "Yep, with Claude saying \"please\" and \"thank you\" actually works. If you build rapport with Claude, you get rewarded with intuition and creativity. Codex, on the other hand, you have to slap it around like a slave gollum and it will do exactly what you tell it to do, no more, no less."
}
,
  
{
  "id": "47108740",
  "text": "this is psychotic why is this how this works lol"
}
,
  
{
  "id": "47109474",
  "text": "Speculation only obviously: highly-charged conversations cause the discussion to be channelled to general human mitigation techniques and for the 'thinking agent' to be diverted to continuations from text concerned with the general human emotional experience."
}
,
  
{
  "id": "47107667",
  "text": "Sometimes I daydream about people screaming at their LLM as if it was a TV they were playing video games on."
}
,
  
{
  "id": "47107573",
  "text": "wait seriously? lmfao\n\nthats hilarious. i definitely treat claude like shit and ive noticed the falloff in results.\n\nif there's a source for that i'd love to read about it."
}
,
  
{
  "id": "47108645",
  "text": "If you think about where in the training data there is positivity vs negativity it really becomes equivalent to having a positive or negative mindset regarding a standing and outcome in life."
}
,
  
{
  "id": "47108677",
  "text": "I don't have a source offhand, but I think it may have been part of the 4.5 release? Older models definitely needed caps and words like critical, important, never, etc... but Anthropic published something that said don't do that anymore."
}
,
  
{
  "id": "47107840",
  "text": "For awhile(maybe a year ago?) it seemed like verbal abuse was the best way to make Claude pay attention.\nIn my head, it was impacting how important it deemed the instruction. And it definitely did seem that way."
}
,
  
{
  "id": "47108744",
  "text": "i make claude grovel at my feet and tell me in detail why my code is better than its code"
}
,
  
{
  "id": "47107623",
  "text": "Consciousness is off the table but they absolutely respond to environmental stimulus and vibes.\n\nSee, uhhh, https://pmc.ncbi.nlm.nih.gov/articles/PMC8052213/ and maybe have a shot at running claude while playing Enya albums on loop.\n\n/s (??)"
}
,
  
{
  "id": "47107938",
  "text": "i have like the faintest vague thread of \"maybe this actually checks out\" in a way that has shit all to do with consciousness\n\nsometimes internet arguments get messy, people die on their hills and double / triple down on internet message boards. since historic internet data composes a bit of what goes into an llm, would it make sense that bad-juju prompting sends it to some dark corners of its training model if implementations don't properly sanitize certain negative words/phrases ?\n\nin some ways llm stuff is a very odd mirror that haphazardly regurgitates things resulting from the many shades of gray we find in human qualities.... but presents results as matter of fact. the amount of internet posts with possible code solutions and more where people egotistically die on their respective hills that have made it into these models is probably off the charts, even if the original content was a far cry from a sensible solution.\n\nall in all llm's really do introduce quite a bit of a black box. lot of benefits, but a ton of unknowns and one must be hyperviligant to the possible pitfalls of these things... but more importantly be self aware enough to understand the possible pitfalls that these things introduce to the person using them. they really possibly dangerously capitalize on everyones innate need to want to be a valued contributor. it's really common now to see so many people biting off more than they can chew, often times lacking the foundations that would've normally had a competent engineer pumping the brakes. i have a lot of respect/appreciation for people who might be doing a bit of claude here and there but are flat out forward about it in their readme and very plainly state to not have any high expectations because _they_ are aware of the risks involved here. \ni also want to commend everyone who writes their own damn readme.md.\n\nthese things are for better or for worse great at causing people to barrel forward through 'problem solving', which is presenting quite a bit of gray area on whether or not the problem is actually solved / how can you be sure / do you understand how the fix/solution/implementation works (in many cases, no). this is why exceptional software engineers can use this technology insanely proficiently as a supplementary worker of sorts but others find themselves in a design/architect seat for the first time and call tons of terrible shots throughout the course of what it is they are building. i'd at least like to call out that people who feel like they \"can do everything on their own and don't need to rely on anyone\" anymore seem to have lost the plot entirely. there are facets of that statement that might be true, but less collaboration especially in organizations is quite frankly the first steps some people take towards becoming delusional. and that is always a really sad state of affairs to watch unfold. doing stuff in a vaccuum is fun on your own time, but forcing others to just accept things you built in a vaccuum when you're in any sort of team structure is insanely immature and honestly very destructive/risky. i would like to think absolutely no one here is surprised that some sub-orgs at Microsoft force people to use copilot or be fired, very dangerous path they tread there as they bodyslam into place solutions that are not well understood. suddenly all the leadership decisions at many companies that have made to once again bring back a before-times era of offshoring work makes sense: they think with these technologies existing the subordinate culture of overseas workers combined with these techs will deliver solutions no one can push back on. great savings and also no one will say no."
}
,
  
{
  "id": "47107795",
  "text": "How anybody can read stuff like this and still take all this seriously is beyond me. This is becoming the engineering equivalent of astrology."
}
,
  
{
  "id": "47108969",
  "text": "Anthropic recommends doing magic invocations: https://simonwillison.net/2025/Apr/19/claude-code-best-pract...\n\nIt's easy to know why they work. The magic invocation increases test-time compute (easy to verify yourself - try!). And an increase in test-time compute is demonstrated to increase answer correctness (see any benchmark).\n\nIt might surprise you to know that the only different between GPT 5.2-low and GPT 5.2-xhigh is one of these magic invocations. But that's not supposed to be public knowledge."
}
,
  
{
  "id": "47109547",
  "text": "I think this was more of a thing on older models. Since I started using Opus 4.5 I have not felt the need to do this."
}
,
  
{
  "id": "47110337",
  "text": "The evolution of software engineering is fascinating to me. We started by coding in thin wrappers over machine code and then moved on to higher-level abstractions. Now, we've reached the point where we discuss how we should talk to a mystical genie in a box.\n\nI'm not being sarcastic. This is absolutely incredible."
}
,
  
{
  "id": "47110776",
  "text": "And I've been had a long enough to go through that whole progression. Actually from the earlier step of writing machine code. It's been and continues to be a fun journey which is why I'm still working."
}
,
  
{
  "id": "47110519",
  "text": "We have tests and benchmarks to measure it though."
}
,
  
{
  "id": "47107839",
  "text": "Feel free to run your own tests and see if the magic phrases do or do not influence the output. Have it make a Todo webapp with and without those phrases and see what happens!"
}
,
  
{
  "id": "47108166",
  "text": "That's not how it works. It's not on everyone else to prove claims false, it's on you (or the people who argue any of this had a measurable impact) to prove it actually works. I've seen a bunch of articles like this, and more comments. Nobody I've ever seen has produced any kind of measurable metrics of quality based on one approach vs another. It's all just vibes.\n\nWithout something quantifiable it's not much better then someone who always wears the same jersey when their favorite team plays, and swears they play better because of it."
}

]
</comments_to_classify>

Based on the comments above, assign each comment to up to 3 relevant topics.

Return ONLY a JSON array with this exact structure (no other text):
[
  {
    "id": "comment_id_1",
    "topics": [1, 3, 5]
  },
  {
    "id": "comment_id_2",
    "topics": [2]
  },
  {
    "id": "comment_id_3",
    "topics": [0]
  },
  ...
]

Rules:
- Each comment can have 0 to 3 topics
- Use 1-based topic indices for matches
- Use index 0 if the comment does not fit well in any category
- Only assign topics that are genuinely relevant to the comment

Remember: Output ONLY the JSON array, no other text.

commentCount

50
