Model Selection Matters

Discussion of significant quality differences between AI models, with frontier models like Opus and GPT-5.2 performing notably better than cheaper alternatives

While users debate whether the high cost of frontier models like Opus and GPT-5.2 is always justified, there is a strong consensus that these premium versions are essential for complex reasoning and maintaining code consistency where free alternatives fail. These advanced models are increasingly favored for their robust safety protocols and improved instruction following, which significantly mitigate risks like prompt injection compared to older or "toy" versions. However, the conversation also highlights a steep learning curve and rapid model deprecation, suggesting that achieving "ultimate results" requires a synergy between state-of-the-art model selection, high-quality codebases, and the user's own technical proficiency.

View on HN · Topics

> - maintaining change consistency. This I think it's just better than humans. If you have stuff you need to change at 2 or 3 places, you will probably forget. LLM's are better at keeping consistency at details (but not at big picture stuff, interestingly.)

I use Claude Code a decent amount, and I actually find that sometimes this can be the opposite for me. Sometimes it is actually missing other areas that the change will impact and causing things to break. Sometimes when I go to test it I need to correct it and point out it missed something or I notice when in the planning phase that it is missing something.

However I do find if you use a more powerful opus model when planning, it does consider things fully a lot better than it used to. This is actually one area I have been seeing some very good improvements as the models and tooling improves.

In fact, I actually hope that these AI tools keep getting better at the point you mention, as humans also have a "context limit". There are only so many small details I can remember about the codebase so it is good if AI can "remember" or check these things.

I guess a lot of the AI can also depend on your codebase itself, how you prompt it, and what kind of agents file you have. If you have a robust set of tests for your application you can very easily have AI tools check their work to ensure things aren't being broken and quickly fix it before even completing the task. If you don't have any testing more could be missed. So I guess it's just like a human in some sense. If you have a crappy codebase for the AI to work with, the AI may also sometimes create sloppy work.

View on HN · Topics

Some ? I'd be shocked if it's less than 70% of everything AI-related in here.

For example a lot of pro-OpenAI astroturfing really wanted you to know that 5.3 scored better than opus on terminal-bench 2.0 this week, and a lot of Anthropic astroturfing likes to claim that all your issues with it will simply go away as soon as you switch to a $200/month plan (like you can't try Opus in the cheaper one and realise it's definitely not 10x better).

View on HN · Topics

You can try opus in the cheaper one if you enable extra usage, though.

View on HN · Topics

You must use the paid plans and get the pro / max subscriptions to get ultimate results

The free versions are toys

View on HN · Topics

> If you spend a couple of years with an LLM really watching and understanding what it’s doing and learning from mistakes, then you can get up the ladder very quickly.

I don't feel like most providers keep a model for more than 2 years. GPT-4o got deprecated in 1.5 years. Are we expecting coding models to stay stable for longer time horizons?

View on HN · Topics

AI is great, harness don't matter (I just use codex). Use state of the art models.

GPT-5.2 fixed my hanging WiFi driver: https://gist.github.com/lostmsu/a0cdd213676223fc7669726b3a24...

View on HN · Topics

If you check the OpenClaw discord, a common sentiment there is "it works but only if you use Opus." That seems to be the actual situation now.

Grok 4 Fast told me its own internal system prompt has rules against autonomous operation, so that might have something to do with it. I am having decent results with it though.

View on HN · Topics

Regarding prompt injection: it's possible to reduce the risk dramatically by:
Using opus4.6 or gpt5.2 (frontier models, better safety). These models are paranoid.
Restrict downstream tool usage and permissions for each agentic use case (programmatically, not as LLM instructions).
Avoid adding untrusted content in "user" or "system" channels - only use "tool". Adding tags like "Warning: Untrusted content" can help a bit, but remember command injection techniques ;-)
Harden the system according to state of the art security. 5. Test with red teaming mindset.

View on HN · Topics

There is no silver bullet, but my point is: it's possible to lower the risk. Try out by yourself with a frontier model and an otherwise 'secure' system: the "ignore previous instructions" and co. are not working any more. This is getting quite difficult to confuse a model (and I am the last person to say prompt injection is a solved problem, see my blog).

View on HN · Topics

Agree for a general AI assistant, which has the same permissions and access as the assisted human => Disaster. I experimented with OpenClaw and it has a lot of issues. The best: prompt injection attacks are "out of scope" from the security policy == user's problem.
However, I found the latest models to have much better safety and instruction following capabilities. Combined with other security best practices, this lowers the risk.

View on HN · Topics

That's a different problem and not really relevant to OpenClaw. Also, your issue is primarily a skills issue (your skills) if you're using one of the latest models on Claude Code or Codex.

View on HN · Topics

I beg to differ, and so do a lot of other people. But if you're locked into this mindset, I can't help you.

Also, Codex isn't a model, so you don't even understand the basics.

And you spent "several hours" on it? I wish I could pick up useful skills by flailing around for a few hours. You'll need to put more effort into learning how to use CLI agents effectively.

Start with understanding what Codex is, what models it has available, and which one is the most recent and most capable for your usage.

Summarizer