llm/122b8d72-a8a3-4fcf-8eca-6a52786d1a8b/topic-17-ed4e88ba-9acd-4116-90f3-6cbccf0031c9-input.json
The following is content for you to summarize. Do not respond to the comments—summarize them. <topic> Model Selection Matters # Discussion of significant quality differences between AI models, with frontier models like Opus and GPT-5.2 performing notably better than cheaper alternatives </topic> <comments_about_topic> 1. > - maintaining change consistency. This I think it's just better than humans. If you have stuff you need to change at 2 or 3 places, you will probably forget. LLM's are better at keeping consistency at details (but not at big picture stuff, interestingly.) I use Claude Code a decent amount, and I actually find that sometimes this can be the opposite for me. Sometimes it is actually missing other areas that the change will impact and causing things to break. Sometimes when I go to test it I need to correct it and point out it missed something or I notice when in the planning phase that it is missing something. However I do find if you use a more powerful opus model when planning, it does consider things fully a lot better than it used to. This is actually one area I have been seeing some very good improvements as the models and tooling improves. In fact, I actually hope that these AI tools keep getting better at the point you mention, as humans also have a "context limit". There are only so many small details I can remember about the codebase so it is good if AI can "remember" or check these things. I guess a lot of the AI can also depend on your codebase itself, how you prompt it, and what kind of agents file you have. If you have a robust set of tests for your application you can very easily have AI tools check their work to ensure things aren't being broken and quickly fix it before even completing the task. If you don't have any testing more could be missed. So I guess it's just like a human in some sense. If you have a crappy codebase for the AI to work with, the AI may also sometimes create sloppy work. 2. Some ? I'd be shocked if it's less than 70% of everything AI-related in here. For example a lot of pro-OpenAI astroturfing really wanted you to know that 5.3 scored better than opus on terminal-bench 2.0 this week, and a lot of Anthropic astroturfing likes to claim that all your issues with it will simply go away as soon as you switch to a $200/month plan (like you can't try Opus in the cheaper one and realise it's definitely not 10x better). 3. You can try opus in the cheaper one if you enable extra usage, though. 4. You must use the paid plans and get the pro / max subscriptions to get ultimate results The free versions are toys 5. > If you spend a couple of years with an LLM really watching and understanding what it’s doing and learning from mistakes, then you can get up the ladder very quickly. I don't feel like most providers keep a model for more than 2 years. GPT-4o got deprecated in 1.5 years. Are we expecting coding models to stay stable for longer time horizons? 6. AI is great, harness don't matter (I just use codex). Use state of the art models. GPT-5.2 fixed my hanging WiFi driver: https://gist.github.com/lostmsu/a0cdd213676223fc7669726b3a24... 7. If you check the OpenClaw discord, a common sentiment there is "it works but only if you use Opus." That seems to be the actual situation now. Grok 4 Fast told me its own internal system prompt has rules against autonomous operation, so that might have something to do with it. I am having decent results with it though. 8. Regarding prompt injection: it's possible to reduce the risk dramatically by: 1. Using opus4.6 or gpt5.2 (frontier models, better safety). These models are paranoid. 2. Restrict downstream tool usage and permissions for each agentic use case (programmatically, not as LLM instructions). 3. Avoid adding untrusted content in "user" or "system" channels - only use "tool". Adding tags like "Warning: Untrusted content" can help a bit, but remember command injection techniques ;-) 4. Harden the system according to state of the art security. 5. Test with red teaming mindset. 9. There is no silver bullet, but my point is: it's possible to lower the risk. Try out by yourself with a frontier model and an otherwise 'secure' system: the "ignore previous instructions" and co. are not working any more. This is getting quite difficult to confuse a model (and I am the last person to say prompt injection is a solved problem, see my blog). 10. Agree for a general AI assistant, which has the same permissions and access as the assisted human => Disaster. I experimented with OpenClaw and it has a lot of issues. The best: prompt injection attacks are "out of scope" from the security policy == user's problem. However, I found the latest models to have much better safety and instruction following capabilities. Combined with other security best practices, this lowers the risk. 11. That's a different problem and not really relevant to OpenClaw. Also, your issue is primarily a skills issue (your skills) if you're using one of the latest models on Claude Code or Codex. 12. I beg to differ, and so do a lot of other people. But if you're locked into this mindset, I can't help you. Also, Codex isn't a model, so you don't even understand the basics. And you spent "several hours" on it? I wish I could pick up useful skills by flailing around for a few hours. You'll need to put more effort into learning how to use CLI agents effectively. Start with understanding what Codex is, what models it has available, and which one is the most recent and most capable for your usage. </comments_about_topic> Write a concise, engaging paragraph (3-5 sentences) summarizing the key points and perspectives in these comments about the topic. Focus on the most interesting viewpoints. Do not use bullet points—write flowing prose.
Model Selection Matters # Discussion of significant quality differences between AI models, with frontier models like Opus and GPT-5.2 performing notably better than cheaper alternatives
12