The following is content for you to summarize. Do not respond to the comments; summarize them. <topic> Claude vs Other Models # Comparisons between Claude, Codex, Gemini, and other models. Discussion of model-specific behaviors and optimal prompting strategies. Using multiple models in complementary roles. </topic> <comments_about_topic> 1. Reproducing experimental results across models and vendors is trivial and cheap nowadays. 2. Not if Anthropic goes further in obfuscating the output of Claude Code. 3. > LLMs don’t usually fail at syntax? Really? My experience has been that it’s incredibly easy to get them stuck in a loop on a hallucinated API and burn through credits before I’ve even noticed what it’s done. I have a small Rust project that stores stuff on disk that I wanted to add an S3 backend to. Claude Code burned through my $20 in a loop in about 30 minutes without any awareness of what it was doing on a very simple syntax issue. 4. I have no doubts that it does for many people. But the time/cost tradeoff is still unquestionable. I know I could create what LLMs do for me in the frontend/backend in most cases as good or better - I know that, because I've done it at work for years. But to create a somewhat complex app with lots of pages/features/apis etc. would take me months if not a year++ since I'd be working on it only on the weekends for a few hours. Claude Code helps me out by getting me to my goal in a fraction of the time. Its superpower lies not only in doing what I know but faster, but in doing what I don't know as well. I reap similar benefits at work. I can wow management with LLM-assisted/vibe-coded apps. What previously would've taken a multi-man team weeks of planning and executing, stand ups, jour fixes, architecture diagrams, etc. can now be done within a single week by myself. For the type of work I do, managers do not care whether I could do it better if I coded it myself. 
They are amazed however that what has taken months previously, can be done in hours nowadays. And I for sure will try to reap benefits of LLMs for as long as they don't replace me rather than being idealistic and fighting against them. 5. > What previously would've taken a multi-man team weeks of planning and executing, stand ups, jour fixes, architecture diagrams, etc. can now be done within a single week by myself. This has been my experience. We use Miro at work for diagramming. Lots of visual people on the team, myself included. Using Miro's MCP I draft a solution to a problem and have Miro diagram it. Once we talk it through as a team, I have Claude or codex implement it from the diagram. It works surprisingly well. > They are amazed however that what has taken months previously, can be done in hours nowadays. Of course they're amazed. They don't have to pay you for time saved ;) > reap benefits of LLMs for as long as they don't replace me > What previously would've taken a multi-man team I think this is the part that people are worried about. Every engineer who uses LLMs says this. By definition it means that people are being replaced. I think I justify it in that no one on my team has been replaced. But management has explicitly said "we don't want to hire more because we can already 20x ourselves with our current team +LLM." But I do acknowledge that many people ARE being replaced; not necessarily by LLMs, but certainly by other engineers using LLMs. 6. Todos, habits, goals, calendar, meals, notes, bookmarks, shopping lists, finances. More or less that with Google cal integration, garmin Integration (Auto updates workout habits, weight goals) family sharing/gamification, daily/weekly reviews, ai summaries and more. All built by just prompting Claude for feature after feature, with me writing 0 lines. 7. I don’t think this is a result of the base training data („the internet“). It’s a post training behavior, created during reinforcement learning. 
Codex has a totally different behavior in that regard. Codex reads per default a lot of potentially relevant files before it goes and writes files. Maybe you remember that, without reinforcement learning, the models of 2019 just completed the sentences you gave them. There were no tool calls like reading files. Tool calling behavior is company specific and highly tuned to their harnesses. How often they call a tool, is not part of the base training data. 8. IDK the current state, but I remember that, last year, the open source coding harnesses needed to provide exactly the tools that the LLM expected, or the error rate went through the roof. Some, like grok and gemini, only recently managed to make tool calls somewhat reliable. 9. For Claude at least, the more recent guidance from Anthropic is to not yell at it. Just clear, calm, and concise instructions. 10. Yep, with Claude saying "please" and "thank you" actually works. If you build rapport with Claude, you get rewarded with intuition and creativity. Codex, on the other hand, you have to slap it around like a slave gollum and it will do exactly what you tell it to do, no more, no less. 11. wait seriously? lmfao thats hilarious. i definitely treat claude like shit and ive noticed the falloff in results. if there's a source for that i'd love to read about it. 12. I don't have a source offhand, but I think it may have been part of the 4.5 release? Older models definitely needed caps and words like critical, important, never, etc... but Anthropic published something that said don't do that anymore. 13. Anthropic recommends doing magic invocations: https://simonwillison.net/2025/Apr/19/claude-code-best-pract... It's easy to know why they work. The magic invocation increases test-time compute (easy to verify yourself - try!). And an increase in test-time compute is demonstrated to increase answer correctness (see any benchmark). 
It might surprise you to know that the only difference between GPT 5.2-low and GPT 5.2-xhigh is one of these magic invocations. But that's not supposed to be public knowledge. 14. I think this was more of a thing on older models. Since I started using Opus 4.5 I have not felt the need to do this. 15. If you read the transformer paper, or get any book on NLP, you will see that this is not a magic incantation; it's purely the attention mechanism at work. Or you can just ask Gemini or Claude why these prompts work. But I get the impression from your comment that you have a fixed idea, and you're not really interested in understanding how or why it works. If you think like a hammer, everything will look like a nail. 16. Do you think that Anthropic doesn’t include things like this in their harness / system prompts? I feel like these kinds of prompts are unnecessary with Opus 4.5 onwards, obviously based on my own experience (I used to do this; on switching to Opus I stopped and have implemented more complex problems, more successfully). I am having the most success describing what I want as humanly as possible, describing outcomes clearly, making sure the plan is good and clearing context before implementing. 17. My colleague swears by his DHH Claude skill https://danieltenner.com/dhh-is-immortal-and-costs-200-m/ 18. It’s actually really common. If you look at Claude Code’s own system prompts written by Anthropic, they’re littered with “CRITICAL (RULE 0):” type of statements, and other similar prompting styles. 19. Yeah, it's definitely a strange new world we're in, where I have to "trick" the computer into cooperating. The other day I told Claude "Yes you can", and it went off and did something it just said it couldn't do! 20. Better yet, I have Codex, Gemini, and Claude as my kids, running around in my code playground. How do I be a good parent and not play favorites? 21. We all know Gemini is your artsy one, Claude is your smartypants, and Codex is your nerd. 22. 
Its effectiveness is even more apparent with older smaller LLMs, people who interact with LLMs now never tried to wrangle llama2-13b into pretending to be a dungeon master... 23. The articles approach matches mine, but I've learned from exactly the things you're pointing out. I get the PLAN.md (or equivalent) to be separated into "phases" or stages, then carefully prompt (because Claude and Codex both love to "keep going") it to only implement that stage, and update the PLAN.md Tests are crucial too, and form another part of the plan really. Though my current workflow begins to build them later in the process than I would prefer... 24. I really don't understand why there are so many comments like this. Yesterday I had Claude write an audit logging feature to track all changes made to entities in my app. Yeah you get this for free with many frameworks, but my company's custom setup doesn't have it. It took maybe 5-10 minutes of wall-time to come up with a good plan, and then ~20-30 min for Claude implement, test, etc. That would've taken me at least a day, maybe two. I had 4-5 other tasks going on in other tabs while I waited the 20-30 min for Claude to generate the feature. After Claude generated, I needed to manually test that it worked, and it did. I then needed to review the code before making a PR. In all, maybe 30-45 minutes of my actual time to add a small feature. All I can really say is... are you sure you're using it right? Have you _really_ invested time into learning how to use AI tools? 25. Same here. I did bounce off these tools a year ago. They just didn't work for me 60% of the time. I learned a bit in that initial experience though and walked away with some tasks ChatGPT could replace in my workflow. Mainly replacing scripts and reviewing single files or functions. Fast forward to today and I tried the tools again--specifically Claude Code--about a week ago. I'm blown away. 
In a single day, I've reproduced some tools that took me weeks at full-time roles. This is while reviewing every line of code. The output is more or less what I'd be writing as a principal engineer. 26. > The output is more or less what I'd be writing as a principal engineer. I certainly hope this is not true, because then you're not competent for that role. Claude Code writes an absolutely incredible amount of unnecessary and superfluous comments, and it makes asinine mistakes like forgetting to update logic in multiple places. It'll gladly drop the entire database when changing column formats, just as an example. 27. Several months ago, just for fun, I asked Claude (the web site, not Claude Code) to build a web page with a little animated cannon that shoots at the mouse cursor with a ballistic trajectory. It built the page in seconds, but the aim was incorrect; it always shot too low. I told it the aim was off. It still got it wrong. I prompted it several times to try to correct it, but it never got it right. In fact, the web page started to break and Claude was introducing nasty bugs. More recently, I tried the same experiment, again with Claude. I used the exact same prompt. This time, the aim was exactly correct. Instead of spending my time trying to correct it, I was able to ask it to add features. I've spent more time writing this comment on HN than I spent optimizing this toy. https://claude.ai/public/artifacts/d7f1c13c-2423-4f03-9fc4-8... My point is that AI-assisted coding has improved dramatically in the past few months. I don't know whether it can reason deeply about things, but it can certainly imitate a human who reasons deeply. I've never seen any technology improve at this rate. 28. I picture it like semiconductors; the 5nm process is so absurdly complex that operators can't just peek into the system easily. I imagine I'm just so used to hand crafting code that I can't imagine not being able to peek in. 
So maybe it's that we won't be reviewing by hand anymore? I.e. it's LLMs all the way down. Trying to embrace that style of development lately as unnatural as it feels. We're obv not 100% there yet but Claude Opus is a significant step in that direction and they keep getting better and better. 29. Since Opus 4.5, things have changed quite a lot. I find LLMs very useful for discussing new features or ideas, and Sonnet is great for executing your plan while you grab a coffee. 30. This is a great come-back story. I have had a similar experience with a photoshop demake of mine. I recommend to try out Opencode with this approach, you might find it less tiring than ChatGPT web (yes it works with your ChatGPT Plus sub). 31. I go a bit further than this and have had great success with 3 doc types and 2 skills: - Specs: these are generally static, but updatable as the project evolves. And they're broken out to an index file that gives a project overview, a high-level arch file, and files for all the main modules. Roughly ~1k lines of spec for 10k lines of code, and try to limit any particular spec file to 300 lines. I'm intimately familiar with every single line in these. - Plans: these are the output of a planning session with an LLM. They point to the associated specs. These tend to be 100-300 lines and 3 to 5 phases. - Working memory files: I use both a status.md (3-5 items per phase roughly 30 lines overall), which points to a latest plan, and a project_status (100-200 lines), which tracks the current state of the project and is instructed to compact past efforts to keep it lean) - A planner skill I use w/ Gemini Pro to generate new plans. It essentially explains the specs/plans dichotomy, the role of the status files, and to review everything in the pertinent areas of code and give me a handful of high-level next set of features to address based on shortfalls in the specs or things noted in the project_status file. 
Based on what it presents, I select a feature or improvement to generate. Then it proceeds to generate a plan, updates a clean status.md that points to the plan, and adjusts project_status based on the state of the prior completed plan. - An implementer skill in Codex that goes to town on a plan file. It's fairly simple, it just looks at status.md, which points to the plan, and of course the plan points to the relevant specs so it loads up context pretty efficiently. I've tried the two main spec generation libraries, which were way overblown, and then I gave superpowers a shot... which was fine, but still too much. The above is all homegrown, and I've had much better success because it keeps the context lean and focused. And I'm only on the $20 plans for Codex/Gemini vs. spending $100/month on CC for half year prior and move quicker w/ no stall outs due to token consumption, which was regularly happening w/ CC by the 5th day. Codex rarely dips below 70% available context when it puts up a PR after an execution run. Roughly 4/5 PRs are without issue, which is flipped against what I experienced with CC and only using planning mode. 32. This is pretty much my approach. I started with some spec files for a project I'm working on right now, based on some academic papers I've written. I ended up going back and forth with Claude, building plans, pushing info back into the specs, expanding that out and I ended up with multiple spec/architecture/module documents. I got to the point where I ended up building my own system (using claude) to capture and generate artifacts, in more of a systems engineering style (e.g. following IEEE standards for conops, requirement documents, software definitions, test plans...). I don't use that for session-level planning; Claude's tools work fine for that. (I like superpowers, so far. It hasn't seemed too much) I have found it to work very well with Claude by giving it context and guardrails. 
Basically I just tell it "follow the guidance docs" and it does. Couple that with intense testing and self-feedback mechanisms and you can easily keep Claude on track. I have had the same experience with Codex and Claude as you in terms of token usage. But I haven't been happy with my Codex usage; Claude just feels like it's doing more of what I want in the way I want. 33. The crowd around this post shows how superficial knowledge about Claude Code is. It gets releases each day and most of this is already built into the vanilla version. Not to mention subagents working in work trees, memory.md, a plan on which you can comment directly from the interface, subagents launched in the research phase, but also some basic MCPs like LSP/IDE integration, and context7 so as not to be stuck at the knowledge cutoff/in the past. When you go to YouTube and search for stuff like "7 levels of claude code" this post would be maybe 3-4. Oh, one more thing - quality is not consistent, so be ready for 2-3 rounds of "are you happy with the code you wrote" and defining audit skills crafted for your application domain - like for example RODO/Compliance audit etc. 34. I have to give this a try. My current model for backend is the same as how the author does frontend iteration. My friend does the research-plan-edit-implement loop, and there is no real difference between the quality of what I do and what he does. But I do like this just for how it serves as documentation of the thought process across AI/human, and can be added to version control. Instead of humans reviewing PRs, perhaps humans can review the research/plan document. On the PR review front, I give Claude the ticket number and the branch (or PR) and ask it to review for correctness, bugs and design consistency. The prompt is always roughly the same for every PR. It does a very good job there too. Modelwise, Opus 4.6 is scary good! 35. Gemini is better at research, Claude at coding. 
I try to use Gemini to do all the research and write out instruction on what to do what process to follow then use it in Claude. Though I am mostly creating small python scripts 36. In my own tests I have found opus to be very good at writing plans, terrible at executing them. It typically ignores half of the constraints. https://x.com/xundecidability/status/2019794391338987906?s=2... https://x.com/xundecidability/status/2024210197959627048?s=2... 37. the first link was from a simple request with fewer than 1000 tokens total in the context window, just a short shell script. here is another one which had about 200 tokens and opus decided to change the model name i requested. https://x.com/xundecidability/status/2005647216741105962?s=2... opus is bad at instruction following now. 38. I tried Opus 4.6 recently and it’s really good. I had ditched Claude a long time ago for Grok + Gemini + OpenCode with Chinese models. I used Grok/Gemini for planning and core files, and OpenCode for setup, running, deploying, and editing. However, Opus made me rethink my entire workflow. Now, I do it like this: * PRD (Product Requirements Document) * main.py + requirements.txt + readme.md (I ask for minimal, functional, modular code that fits the main.py) * Ask for a step-by-step ordered plan * Ask to focus on one step at a time The super powerful thing is that I don’t get stuck on missing accounts, keys, etc. Everything is ordered and runs smoothly. I go rapidly from idea to working product, and it’s incredibly easy to iterate if I figure out new features are required while testing. I also have GLM via OpenCode, but I mainly use it for "dumb" tasks. Interestingly, for reasoning capabilities regarding standard logic inside the code, I found Gemini 3 Flash to be very good and relatively cheap. 
I don't use Claude Code for the actual coding because forcing everything via chat into a main.py encourages minimal code that's easy to skim; it gives me a clearer representation of the feature space. 39. I do the same. I also cross-ask Gemini and Claude about the plan during iterations, and sometimes make several separate plans. 40. There is not a lot of explanation of WHY this is better than doing the opposite (start coding and see how it goes), or of how this would apply to Codex models. I do exactly the same, I even developed my own workflows with Pi agent, which works really well. Here is the reason: - Claude needs a lot more steering than other models, it's too eager to do stuff and does stupid things and writes terrible code without feedback. - Claude is very good at following the plan, you can even use a much cheaper model if you have a good plan. For example I list every single file which needs edits with a short explanation. - At the end of the plan, I have a clear picture in my head of exactly how the feature will look and I can be pretty sure the end result will be good enough (given that the model is good at following the plan). A lot of things don't need planning at all. Simple fixes, refactoring, simple scripts, packaging, etc. Just keep it simple. 41. Funny how I came up with something loosely similar. Asking Codex to write a detailed plan in a markdown document, reviewing it, and asking it to implement it step by step. It works exquisitely well when it can build and test itself. 42. I have tried using this and other workflows for a long time and had never been able to get them to work (see chat history for details). This has changed in the last week, for 3 reasons: 1. Claude Opus. It’s the first model where I haven’t had to spend more time correcting things than it would’ve taken me to just do it myself. The problem is that Opus chews through tokens, which led to.. 2. I upgraded my Claude plan. 
Previously on the regular plan I’d get about 20 mins of time before running out of tokens for the session and then needing to wait a few hours to use it again. It was fine for little scripts or toy apps but not feasible for the regular dev work I do. So I upgraded to 5x. This now got me 1-2 hours per session before tokens expired. Which was better but still a frustration. Wincing at the price, I upgraded again to the 20x plan and this was the next game changer. I had plenty of spare tokens per session and at that price it felt like they were being wasted - so I ramped up my usage. Following a similar process as OP but with a plans directory with subdirectories for backlog, active and complete plans, and skills with strict rules for planning, implementing and completing plans, I now have 5-6 projects on the go. While I’m planning a feature on one the others are implementing. The strict plans and controls keep them on track and I have follow up skills for auditing quality and performance. I still haven’t hit token limits for a session but I’ve almost hit my token limit for the week so I feel like I’m getting my money’s worth. In that sense spending more has forced me to figure out how to use more. 3. The final piece of the puzzle is using opencode over Claude Code. I’m not sure why but I just don’t gel with Claude Code. Maybe it’s all the sautéing and flibertygibbering, maybe it’s all the permission asking, maybe it’s that it doesn’t show what it’s doing as much as opencode. Whatever it is it just doesn’t work well for me. Opencode on the other hand is great. It shows what it’s doing and how it’s thinking which makes it easy for me to spot when it’s going off track and correct early. Having a detailed plan, and correcting and iterating on the plan is essential. Making Claude follow the plan is also essential - but there’s a line. Too fine grained and it’s not as creative at solving problems. 
Too loose/high level and it makes bad choices and goes in the wrong direction. Is it actually making me more productive? I think it is but I’m only a week in. I’ve decided to give myself a month to see how it all works out. I don’t intend to keep paying for the 20x plan unless I can see a path to using it to earn me at least as much back. 43. Just don’t use Claude Code. I can use the Codex CLI with just my $20 subscription and never come close to any usage limits. 44. It isn’t slower. I use my personal ChatGPT subscriptions with Codex for almost everything at work and use my $800/month company Claude allowance only for the tricky stuff that Codex can’t figure out. It’s never application code. It’s usually some combination of app code + Docker + AWS issue with my underlying infrastructure - created with whatever IaC that I’m using for a client - Terraform/CloudFormation or the CDK. I burned through $10 on Claude in less than an hour. I only have $36 a day at $800 a month (800/22 working days). 45. > and use my $800/month company Claude allowance only for the tricky stuff that Codex can’t figure out. It doesn’t seem controversial that the model that can solve more complex problems (that you admit the cheaper model can’t solve) costs more. For the things I use it for, I’ve not found any other model to be worth it. 46. You’re assuming rational behavior from a company that doesn’t care about losing billions of dollars. Have you tried Codex with OpenAI’s latest models? 47. Not in the last 2 months. Current Claude subscription is a sunk cost for the next month. Maybe I’ll try Codex if Claude doesn’t lead anywhere. 48. I use both. As I’m working, I tell each of them to update a common document with the conversation. I don’t just tell Claude the what. I tell it the why and have it document it. I can switch back and forth and use the MD file as shared context. 49. Who knows? It’s part of an enterprise plan. I work for a consulting company. 
There are a number of fallbacks. The first fallback, if we are working on an internal project, is just to use our internal AWS account and use Claude Code with the Anthropic models hosted on Bedrock. https://code.claude.com/docs/en/amazon-bedrock The second fallback, if it is for a customer project, is to use their AWS account for development for them. At the rate my company charges for me, at my level as an American-based staff consultant (the highest bill rate at the company), they are happy to let us use Claude Code with their AWS credentials. Besides, if we are using AWS Bedrock hosted Anthropic models, they know none of their secrets are going to Anthropic. They already have the required legal confidentiality/compliance agreements with AWS. 50. That's great, actually, doesn't the logic apply to other services as well? 51. This is great. My workflow is also heading in that direction, so this is a great roadmap. I've already learned that just naively telling Claude what to do and letting it work is a recipe for disaster and wasted time. I'm not this structured yet, but I often start with having it analyse and explain a piece of code, so I can correct it before we move on. I also often switch to an LLM that's separate from my IDE because it tends to get confused by sprawling context. 52. What works extremely well for me is this: Let Claude Code create the plan, then turn over the plan to Codex for review, and give the response back to Claude Code. Codex is exceptionally good at doing high level reviews and keeping an eye on the details. It will find very subtle errors and omissions. And CC is very good at quickly converting the plan into code. This back and forth between the two agents with me steering the conversation elevates Claude Code to the next level. 53. 
Add another agent review: I ask Claude to send the plan to Codex for review and fix critical and high issues, with complexity gating (no overcomplicated logic), run in a loop; then send it to a Gemini reviewer, then maybe a final pass with Claude. Once all C+H pass, the sequence is done. </comments_about_topic> Write a concise, engaging paragraph (3-5 sentences) summarizing the key points and perspectives in these comments about the topic. Focus on the most interesting viewpoints. Do not use bullet points; write flowing prose.