https://lemmata.substack.com/p/alphaproof-and-the-imo → evidence that true creativity might be hard

https://lemmata.substack.com/p/o3-o4-mini-largely-incremental-at

Models can't come up with new abstractions and encapsulations on the fly: https://x.com/snewmanpv/status/1929724463916241351

In the GPT-4 era, LLMs were primarily known for being good at regurgitating known facts and ideas. They could do some impressive-but-shallow recombination ("rewrite the Declaration of Independence as a story about losing socks in the dryer"), but they couldn't come up with anything really new. Notably, they didn't seem capable of solving any but the most straightforward math problems – nothing that required creative leaps of reasoning.

More recently, the picture seemed to be shifting. First, Google DeepMind's AlphaGeometry and AlphaProof began solving International Mathematical Olympiad questions, using search algorithms and automatic task-specific fine-tuning to come up with insights and crack these difficult problems. Then reasoning models began solving problems without even needing fine-tuning or explicit search algorithms. Most impressively so far, OpenAI's o3 model achieved *** on the FrontierMath benchmark, which includes problems designed to be challenging even for research mathematicians. Had LLMs cracked the creative reasoning barrier?

o3's FrontierMath score startled me more than any AI achievement since the launch of GPT-4. I had really expected that it would take longer to imbue AI models with the sorts of insight and exploratory taste required for difficult mathematics problems, and that the gap between next-token prediction and serious problem-solving chops would be larger.

And so I will admit to having read ***'s essay, "***", with some measure of vindication and relief. Here were grounds for dismissing o3's FrontierMath score as a sort of cheating. *** explains that FrontierMath problems are designed to require a mix of background knowledge, creative insight, and careful execution... and then points out some reasons to suspect that the problems may not actually need all that much insight, given superhuman levels of background knowledge. Maybe, two years after the launch of GPT-4, large language models are still just mixing and matching existing work without adding any deep new insights of their own. The RL training used to create "reasoning models" seems to help with the execution part of the FrontierMath task classification, but may not address creative insight.

Which is all well and good, but the fact remains that few, if any, individual humans could match o3's score on FrontierMath. Depending on exactly how you want to define the term, o3 is arguably superhuman on this benchmark. Is there any point in quibbling about how it achieves that?

Yes, because this could help explain the gulf between AI's performance on benchmarks vs. real-world tasks (***link to "the nature of work" post). Maybe every single one of those saturated benchmarks, like FrontierMath, consists of problems that can be solved by recombining a sufficiently large library of existing work. But how many of the things we do all day require truly novel thinking? Don't most real-world jobs also consist of combining existing work? Or, given the limits of human memory, sometimes reinventing work that a hyper-educated AI wouldn't have needed to reinvent?

*** explore some of these recent links about the applicability of AI tools to software engineering and other real-world tasks.
Maybe "creative leaps" aren't what's holding AI back in many cases... but might be for some things, such as advancing science, or top-quality writing and creative work. (I suspect we will see some bursts of scientific advances simply because there will be fruit that is low-hanging for AIs but not for us. This won't necessarily be the early signs of an exponential explosion of new discovery. It might be more like a new creative thinker bursting onto the scene, exhibiting a flurry of productivity that gradually fades as they finish exploring the ideas that are easily accessible from their particular perspective. ***Look for Scott Alexander's "why do you suck now?" essay.) Note that there's more than one way to go about a task, and so it's not entirely well formed to ask "which capabilities does AI lack to solve a given problem" or "which capabilities are needed to solve a problem". E.g. maybe people do it using creative reasoning but AI can rely on learned knowledge. Ask Sam for the place where he talked about half of thinking being knowing which information to think about, I want to quote it.