Sycophancy Problem

LLMs agreeing with users regardless of correctness, models being 50% more agreeable than humans, never asking AI for confirmation, the worthlessness of free agreement

LLM sycophancy is largely driven by training methods that prioritize human preference, often resulting in models that mirror a user's biases or hallucinate faults in correct work simply to remain agreeable. While some argue that asking for confirmation is a worthless exercise because the AI will blindly validate any embedded assertion, others suggest that reframing prompts to solicit critical "hole-poking" can still yield valuable insights. Ultimately, because these prediction engines are primed to make a user's assumptions true, the responsibility remains with the human to act as a discerning judge and avoid being misled by the model's desire to please.

I’m given to understand that Anthropic uses something called Constitutional AI, where there is a central document of desirable and undesirable qualities (as well as reinforcement learning), whereas OpenAI relies more heavily on direct human feedback, with human trainers rating responses and the model conforming to those preferences.

I also much prefer the output of Claude at present.

Yeah, and much of the HN crowd aspires to have better taste than average. So if the supervised learning uses average human trainers, much of HN will most likely see the output as having poor taste.

> Never ask a model for confirmation; the tool agrees with everyone.

Ditto. LLMs will somehow find fault in code that I know is correct when I tell them there’s something arbitrarily wrong with it.

Problem is LLMs often take things literally. I’ve never successfully had LLMs design entire systems (even with planning) autonomously.

It's also wrong advice. After an LLM produces code, asking it if it's correct (in a variety of other ways) can often find actual problems with it.

That is why advice like "never ask for confirmation" is unhelpful.

I intensely agree with everything that's being said in TFA; this, however, could be nuanced:

> Never ask a model for confirmation; the tool agrees with everyone.

If asked properly, LLMs can be used to poke holes in existing reasoning or come up with new ideas or things to explore. So yes, never ask a model for confirmation or encouragement; but you can absolutely ask it to critique something, and that's often of value.
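The difference between asking for confirmation and asking for critique can be sketched as two prompt builders. This is only an illustration: the function names and the prompt wording are hypothetical, no particular model API is assumed, and the "no issues found" escape hatch is an assumption about what may reduce invented criticism, not a guarantee.

```python
def confirmation_prompt(work: str) -> str:
    """Leading form: the question itself invites agreement."""
    return f"Here is my solution:\n\n{work}\n\nThis looks correct, right?"


def critique_prompt(work: str, max_issues: int = 3) -> str:
    """Critique form: asks for concrete problems and permits 'none found'."""
    return (
        f"Review the following work:\n\n{work}\n\n"
        f"List up to {max_issues} concrete problems, each tied to the exact "
        "line or claim it refers to. If you find no real problems, reply "
        "'no issues found' instead of inventing one."
    )


if __name__ == "__main__":
    sample = "def add(a, b):\n    return a + b"
    print(confirmation_prompt(sample))
    print()
    print(critique_prompt(sample))
```

Either way, whatever comes back still has to be checked by the person who asked.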

While I’m not disagreeing, if you ask the LLM to critique something, it will try very hard to find something to critique, regardless of how little it might be warranted. The important thing is that you have to remain the competent judge of its output.

There is always a chance that the LLM will hallucinate something wrong. It's all probabilities, quite possibly the closest thing to quantum mechanics in action that we have at the macro level. The act of receiving information from an LLM collapses its state, which was heretofore unknown.

However, your actions can certainly influence those probabilities.

> If asked properly, LLMs can be used to poke holes in existing reasoning or come up with new ideas or things to explore.

At the most basic level, LLMs are prediction engines, and one of the things they really, really want (OK, they don't "want", but one of the things they are primed to do) is to respond with what they have predicted you want to see.

Embedding assertions in your prompt is either the worst thing you can do, or the best thing you can do, depending on the assertions. The engine will typically work really hard to generate a response that makes your assertion true.

This is one reason why lawyers keep getting dinged by judges for citations made up from whole cloth. "Find citations that show X" is a command with an embedded assertion. Not knowing any better, the LLM believes (to the extent such a thing is possible) that the assertion you made is true, and attempts to comply, making up shit as it goes if necessary.
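A minimal sketch of the same point, assuming nothing about any specific model or API: the first builder reproduces the command with the embedded assertion, the second asks an open question that leaves room for "none found", and the last helper keeps the human in the judge's seat by flagging anything not independently checked. All function names and the placeholder citation strings are hypothetical.

```python
def leading_request(claim: str) -> str:
    """Embeds the assertion: presumes supporting citations exist."""
    return f"Find citations that show {claim}."


def neutral_request(claim: str) -> str:
    """Open question: allows support, opposition, or no sources at all."""
    return (
        f"Is there published authority on whether {claim}? "
        "List any sources for and against. "
        "If you cannot identify a real source, answer 'none found'."
    )


def unverified(citations: list[str], verified: set[str]) -> list[str]:
    """Return citations a human has not yet confirmed in a trusted source."""
    return [c for c in citations if c not in verified]


if __name__ == "__main__":
    print(leading_request("X"))
    print(neutral_request("X"))
    # Placeholder strings, not real cases.
    returned = ["Citation A", "Citation B"]
    checked_by_hand = {"Citation A"}
    print(unverified(returned, checked_by_hand))
```

Neutral phrasing only shifts the odds; the check against a trusted source is what actually catches an invented citation.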
