https://bsky.app/profile/lauraruis.bsky.social/post/3lbfeqtkhzm22 – LLMs learn to reason [Child Page: o1] [probably for a later post]

https://www.interconnects.ai/p/openai-strawberry-and-inference-scaling-laws has interesting thoughts on test-time compute, Strawberry, inference compute only driving 40% of Nvidia's sales, etc.

Could ask Nathan Lambert for feedback on my draft.

Record today's Digits problems:

- 69 from 1 2 4 9 22 25: 1309 total solutions. 2 * 22 + 25 → success (o1-mini)
- 183 from 2 9 10 15 17 25: 640 total solutions. 10 * 17 - 2 + 15 → success (o1-mini)
- 269 from 2 3 7 20 21 25: 331 total solutions. 2 * 7 * 21 - 25. o1-mini: partial success after 37 seconds (needed 4 steps, used "2" twice, and hallucinated that it had used three operations). When I pointed this out, after another 139 seconds it concluded no solution is possible. Given the hint to start with 2 * 7, it solved the problem in 5 seconds. o1-preview: after 67 seconds, it gave a solution that used 21 twice and took four steps (but claimed to use three). When I pointed this out, after another 82 seconds it concluded no solution is possible.
- 327 from 1 3 5 9 11 20: 161 total solutions. (5 * 20 + 9) * 3
- 431 from 3 4 6 9 11 15: 56 total solutions. (3 * 6 + 11) * 15 - 4

With the announcement of their latest-and-greatest AI model, "o1", OpenAI has once again advanced the technological frontier of confusing product names. Oh, and also, of artificial intelligence. o1 is advertised as having especially strong reasoning and problem-solving capabilities. This has sparked a new wave of discussion: how intelligent is o1? What can and can't it do? Are we all about to become obsolete?

For all the talk about what AIs can do, I don't think we talk enough about what they'll need to do if they're going to take on increasingly sophisticated work. That is, I've seen relatively little discussion of what goes into human reasoning and problem-solving.
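The Digits records above can be checked with a small brute-force solver. This is a sketch under assumed rules (combine two remaining numbers at a time with +, -, *, or exact division; not every number must be used); it finds one solution rather than counting all of them, so it won't reproduce the "total solutions" tallies:

```python
from fractions import Fraction

def solve_digits(nums, target):
    """Brute-force Digits-style solver: repeatedly combine two of the
    remaining numbers and return an expression reaching the target,
    or None if none exists."""
    items = [(Fraction(n), str(n)) for n in nums]
    return _search(items, Fraction(target))

def _search(items, target):
    for value, expr in items:
        if value == target:
            return expr
    n = len(items)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            (a, ea), (b, eb) = items[i], items[j]
            rest = [items[k] for k in range(n) if k not in (i, j)]
            candidates = []
            if i < j:  # + and * are commutative, so try each pair once
                candidates.append((a + b, f"({ea} + {eb})"))
                candidates.append((a * b, f"({ea} * {eb})"))
            if a > b:  # keep intermediate results positive
                candidates.append((a - b, f"({ea} - {eb})"))
            if b != 0 and (a / b).denominator == 1:  # exact division only
                candidates.append((a / b, f"({ea} / {eb})"))
            for combined in candidates:
                found = _search(rest + [combined], target)
                if found:
                    return found
    return None
```

For example, `solve_digits([1, 2, 4, 9, 22, 25], 69)` returns one of the many solutions counted above (2 * 22 + 25 is one of them).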
All of this comes on the heels of another major advance in AI reasoning: AlphaProof, a *** system from Google that can solve International Mathematical Olympiad problems.

Dean mentioned an "extended cut" of an interview given around the time the model was released, with some additional details. It mentions that the models are now better at generating reasoning traces.

The transcript of the "there are three Rs in Strawberry" decoding problem is interesting. It breaks down the problem. It tries different approaches. It checks its own work.

https://twitter.com/natolambert/status/1835686346880626880 – mentions scaling in both RL compute and test-time compute.

OpenAI shared benchmark results indicating that o1 is substantially more capable than the o1-preview we currently have access to.

Let's formulate the reward for this RL problem. The classic problem with traditional reinforcement learning from human preferences is that one reward, in the form of a binary preference, is assigned to the whole trajectory. This makes it hard to train the model to understand where it went wrong along the way. Recent research gets around this by designing reward models that score every step in reasoning. The best examples of per-step reward models (process reward models) rating each step in a reasoning tree are from OpenAI's Let's Verify Step By Step paper.

[I think this is something Nathan Lambert is hypothesizing, not something OpenAI has announced:] For each reasoning step shown to the user through the vague summary, o1 models generate multiple candidates that they then rate after an end-of-step token. For users, this number is fixed. When doing evaluations, OpenAI can vary the number of candidates (and they said they want to expose this type of inference-intensity control to the user).

When you read the traces of this model, it is very clear it's different from any language model we've been playing with recently. It rambles, it questions, and it still gets to smart answers.
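Lambert's hypothesized per-step scheme above can be sketched in code. Everything here is an assumption, not a documented OpenAI mechanism: `generate_step` and `prm_score` are hypothetical stand-ins for the sampler and the process reward model, and `n_candidates` is the knob that could be varied to control inference intensity.

```python
import random

def best_of_n_trace(prompt, generate_step, prm_score, n_candidates=4, max_steps=8):
    """Hypothesized per-step best-of-n decoding: at each step, sample
    several candidate reasoning steps and keep the one the process
    reward model (PRM) scores highest."""
    trace = [prompt]
    for _ in range(max_steps):
        candidates = [generate_step(trace) for _ in range(n_candidates)]
        best = max(candidates, key=lambda step: prm_score(trace, step))
        trace.append(best)
        if best.endswith("[final]"):  # assumed end-of-answer marker
            break
    return trace

# Toy stand-ins so the control flow runs; a real system would call a
# language model and a trained PRM here.
rng = random.Random(0)
STEPS = ["try a direct approach", "wait, is that right?", "looks consistent [final]"]

def toy_generate(trace):
    return rng.choice(STEPS)

def toy_score(trace, step):
    return STEPS.index(step)  # pretend later-stage steps score higher

trace = best_of_n_trace("problem: count the Rs in 'strawberry'", toy_generate, toy_score)
```

The point of the sketch is the control flow: the PRM only ever compares candidates for one step at a time, which is exactly what distinguishes process rewards from a single trajectory-level preference.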
It seems like there are some variable actions the model is taking, particularly at repetitive phrases like "Wait, is that right" or an oddly human "Hmmm." These mark interesting moments when reasoning can change direction. Getting this sort of result with an autoregressive model would be quite odd: we have seen for some time that once an autoregressive model is perturbed from a trajectory, it commits to its error. RL agents can detect when they're off script and go into an "exploratory" mode, where they take slightly more random actions to find the high-value area of the state space.

Over a year ago, OpenAI likely paid high-skill annotators to create complex forward reasoning paths, likely with multiple paths for single problems. These can be rated and turned into initial labeled trajectories. Contrastive examples were likely needed too, which would make copying the reasoning traces (if we had them) not enough.

What is the equivalent of the AlphaGo move 37 for language?

https://x.com/emollick/status/1835659394459013565?t=VgIjyuN2s86nMPYIsLSU5Q&s=03

Write on OpenAI's o1?

https://x.com/MLStreetTalk/status/1834609042230009869?t=JPGxhJYFi6YAwMASaM87oA&s=03 – Reasoning is *knowledge acquisition*. The new OpenAI models don't reason, they simply memorise reasoning trajectories gifted from humans. Now is the best time to spot this, as over time it will become more indistinguishable as the gaps shrink. For example, a clever human might know that a particular mathematical problem requires the use of symmetry to solve. The OpenAI model might not yet know, because it's not seen it before in that situation. When a human hints the model and tells it the answer, its CoT model will be updated, and next time in a similar situation it will "know" what strategy to take. This will rinse and repeat as they sponge reasoning data from users until many of the "holes in the swiss cheese" are filled up. But at the end of the day - this isn't reasoning.
It's still cool though.

Nathan Lambert (@natolambert) posted at 9:23 AM on Fri, Sep 13, 2024: "People are going to underestimate o1 style systems until there's an equivalent 'move 37' moment from alphago in some reasoning domain. It'll come sooner than people think." (https://x.com/natolambert/status/1834629102097338766?t=nRk8BnYU8LmxgfpfmNp9Og&s=03)

https://x.com/btibor91/status/1834686946846597281?t=zkGfSi1ZH5cUy3DniUKbKQ&s=03
https://simonw.substack.com/p/openais-new-o1-chain-of-thought-models
https://x.com/deedydas/status/1833539735853449360?t=PTSb4uuby5SjnQFPh37IFw&s=03
https://x.com/emollick/status/1835090283832442982?t=tv57JQa96hlEQ6pyt46l7Q&s=03

Nathan Labenz describes this as "effectively GPT-4.5": a fine-tuned update of GPT-4o, the way 3.5 was a fine-tune of 3. The guy from Apollo noted that it's primarily RL on outcomes.

https://www.oneusefulthing.org/p/something-new-on-openais-strawberry
https://www.understandingai.org/p/why-i-dont-view-teslas-near-infinite/comments
https://x.com/emollick/status/1835366663715258818?t=dNkT4XdOFU05m6uQ9w9ZTw&s=03
https://x.com/AstronoMisfit/status/1835328355430007164?t=ur1SQaNg5jHEyxD2yfJWCw&s=03
https://twitter.com/emollick/status/1836064591479971894

Julius (@JuliusSimonelli) posted at 2:23 PM on Fri, Sep 13, 2024: "Another instance of o1-preview missing the part where I tell it the answer" https://t.co/8lkD4oRDP2 (https://x.com/JuliusSimonelli/status/1834704564618244210?t=HKkY1nxGQvGnNik-hoOkWQ&s=03)

Mention of improved models (o1) being more convincing when they get things wrong (including attempts to justify their incorrect answer): https://www.mindprison.cc/i/148326057/capabilities-are-unclear

Revisit my predictions in the GPT-5 post?

"As an early model, it doesn't yet have many of the features that make ChatGPT useful, like browsing the web for information and uploading files and images," the company said. "But for complex reasoning tasks this is a significant advancement and represents a new level of AI capability.
Given this, we are resetting the counter back to 1 and naming this series OpenAI o1."

https://deliprao.substack.com/p/a-few-thoughts-about-o1
https://openai.com/index/introducing-openai-o1-preview/
Terence Tao on O1 | Hacker News
https://arcprize.org/blog/openai-o1-results-arc-prize
https://x.com/rohanpaul_ai/status/1834728874879455250?t=c5YgL2_PhEUKGXqeFn81Ow&s=03

Ethan Mollick (@emollick) posted at 7:46 PM on Fri, Sep 13, 2024: "I think it is worth reading the red team reports on o1. Some interesting stuff in there (and these are commissioned and published by OpenAI)" (https://x.com/emollick/status/1834785685661696062?t=E3Cgf9eYXVeQK7b3HnSVfw&s=03)

@~Kathy: I tested GPT-o1 on summer camps. Mini wasn't even close. Preview was tantalizingly close, but still made a substantial number of mistakes: https://chatgpt.com/share/e/66e35390-ac54-800c-930a-008a2a93762d

Julius (@JuliusSimonelli) posted at 4:44 PM on Thu, Sep 12, 2024: "ChatGPT o1-preview still doesn't know what it knows... If you ask it for three people with the same birth year and day, it can't tell you. But when you ask for people's birthdays, it knows" https://t.co/qgLJZm8CGr (https://x.com/JuliusSimonelli/status/1834377470856040685?t=vu-39MkFLyVNcoUPy8s9fA&s=03)

Miles Brundage (@Miles_Brundage) posted at 11:02 AM on Thu, Sep 12, 2024: "My team + many others across OpenAI have been thinking for a while about the implications of AIs that can think for a while. More to say later (and do read all the o1 blog posts, system card, testimonials etc. and ofc try it), but for now, 3 personal reflections." (https://x.com/Miles_Brundage/status/1834291595648377091?t=DFITPKBwCtsROoXt8jf7rw&s=03)

Shakeel (@ShakeelHashim) posted at 11:05 AM on Thu, Sep 12, 2024: I took a stab at summarising the OpenAI o1 system card. A few bits in particular jumped out at me: 1: @apolloaisafety finding the model "instrumentally faked alignment during testing", and deeming the model capable of "simple in-context scheming".
https://t.co/JssF7NRW4e (https://x.com/ShakeelHashim/status/1834292284193734768?t=NxybnaZbTuvEDjWC-bZBbQ&s=03)

Igor (from AGI Safety Updates on Signal) (talk to him when I work on my post about reasoning + creativity + Strawberry): The same type of semantic debate is opened here, and I really don't want to get dragged into the tarpit. From reading the paper, it sounds like this paper https://arxiv.org/abs/2407.21787 with maybe lots of process supervision to get better, plus maybe a custom score function, and the model is pretty clearly doing a quite exhaustive search across options in the chain-of-thought logs.

I think it should be uncontroversial to say that the model isn't doing "reasoning" as a chain of logical inferences (the way a Prolog Horn-clause solver would, with pure search where every step is valid and the only problems are dead ends) without the statistical association that makes people call it "autocomplete". E.g. from the cipher example: 9 corresponds to 'i' (9='i'). But 'i' is 9, so that seems off by 1. So perhaps we need to think carefully about letters. Wait, 18/2=9, 9 corresponds to 'I'. It still doesn't "count" reliably. But I think it's also reasonable to say that humans probably also do this type of small-step association unless thinking formally => cue endless semantic debates on Twitter.

What I'm curious about are the exact numbers on that log scale, and how the model does on things we can't expect to be in the corpus somewhat similarly (all the demos are curveballs the model excels at, imo, and the testimony I've seen about proofs involving KL divergence, i.e.
stuff that's probably not in textbooks nor mechanical-turk-able, is that it's still meh at it), and whether they can scale and distill this into something meaningful. If anyone has access and wants to try some variants from this https://arxiv.org/pdf/2406.02061, I'd be curious to hear how it fares.

https://www.maginative.com/article/openai-has-demonstrated-strawberry-ai-capabilities-to-u-s-national-security-officials/
LessWrong version: https://www.lesswrong.com/posts/8oX4FTRa8MJodArhj/the-information-openai-shows-strawberry-to-feds-races-to
The Information: https://www.theinformation.com/articles/openai-shows-strawberry-ai-to-the-feds-and-uses-it-to-develop-orion

Ethan Mollick (@emollick) posted at 1:38 PM on Thu, Sep 19, 2024: "The issue with o1 (or successors) helping with PhD level problems is that most people don't have a ton of those lying around. Even PhDs don't solve PhD level problems every day. It might be useful for us to start to think about places where having help like that might be good." (https://x.com/emollick/status/1836867501138780517?t=sy2vMyowmfPyJIbV2ndCUw&s=03)

Tsarathustra (@tsarnick) posted at 6:29 PM on Tue, Sep 17, 2024: "OpenAI's Noam Brown says that while AI model performance scales roughly equivalently with more training or inference compute, the cost of inference is on the order of 100 billion times cheaper" https://t.co/DQQ7jn6wWD (https://x.com/tsarnick/status/1836215965912289306?t=BRvHougHcKEXTLiMSsIF-A&s=03)

Zvi: Roon, in a distinct thread, reminds us that humans are very good at some things relative to other things, that AIs will instead be relatively good at different things, and that we should not expect AGI in the sense of "better than all humans at actual everything" until well after it is a ton better than us at many important things.

I used a lot of metacognitive tools here. Can an AlphaProof-style system emulate some of the same techniques?
Will its brute-force ability to explore zillions of paths compensate for a less sophisticated search? Or will it struggle to make progress on serious research problems? A bulldozer is an example of brute force working very well for some problems, OK for a few more, and very poorly for others. AlphaProof seems more brute-force than humans, but it's conceivable that it can become sophisticated enough for brute force to make up for the rest. "Are many practical tasks AGI-complete?" is a critical question for how the future will unfold.

To understand how much progress AlphaProof represents toward human-like intelligence, we need a clear understanding of what human intelligence actually is, i.e. how people think. Unfortunately, we still have only a fuzzy idea of how people think (if we understood it more clearly, we'd probably have already been able to replicate it in a computer!).

Why don't we talk more about what deep work looks like? Maybe we think it's obvious? Maybe we think it's ineffable? Or maybe we assume we understand it, even though our surface ideas of how we think are usually wrong: they're rationalizations after the fact (just as LLMs hallucinate madly when asked to explain their own "thought" processes).

Importantly, in this kind of work, we often lean on experience and judgement to make high-level decisions: when to keep pushing down a difficult path; when to step back and look for an alternative; when to work forward from what we know and when to work backwards from the goal; when you've explored enough examples to have a feel for the problem; which ideas are worth pursuing in the first place.

(I'm approaching this from the perspective of computer science; it'd be interesting to hear the cognitive-science perspective. Ask Claude.)

Sequencing / decomposition, search, and (importantly!) strategizing. Also planning.

Where do Think Step By Step and Chain of Thought fit in? Where does "reasoning" fit in?
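To make the "search" ingredient concrete, here's a minimal sketch (my own illustration, not from any of the sources above) contrasting two of the simplest strategies on guess-the-number: a sequential scan versus binary halving.

```python
def guess_sequential(secret, lo, hi):
    """Rastering / sequential search: try each value in order."""
    guesses = 0
    for candidate in range(lo, hi + 1):
        guesses += 1
        if candidate == secret:
            return guesses

def guess_binary(secret, lo, hi):
    """Binary search: each higher/lower answer halves the interval."""
    guesses = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        guesses += 1
        if mid == secret:
            return guesses
        elif mid < secret:
            lo = mid + 1
        else:
            hi = mid - 1
```

On a 1-100 range, sequential search needs up to 100 guesses while binary search needs at most 7. The interesting question for this post is which of the fancier strategies (hill climbing, annealing, whittling away) have similarly crisp characterizations.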
There might be levels of skill hiding in here that we haven't guessed at yet, e.g. under "strategizing".

Search strategies: random, rastering / sequential, binary, maze left-wall following, tree search, progressive refinement, hill climbing, annealing, …

Examples from guess-the-number, mazes, crossword puzzles, Sudoku, …

Is there a name for the search strategy of "whittling away wherever you can" (a la Sudoku)? (Constraint propagation, perhaps?)

Trip planning, plotting a novel

Analytic search

Skills Used To Solve The IMO Problem

- Understand the problem statement: break the problem statement into pieces, and review each piece until you understand it. Look for connections with concepts you already understand (but watch out for precedents that seem relevant, but aren't⁴).
- Test your understanding: does the problem seem tractable? Is it too easy? Does it seem like you're missing necessary information? Is there information in the problem statement that seems like it isn't needed? Any of those are signs that you may have misread.
- Understand the implications: play around with specific examples. See how the information in the problem statement plays out in practice. Try a variety of examples, and try to brainstorm scenarios that you haven't explored yet. Look for patterns.
- Look for connections to the goal. How do the patterns that you've observed relate to the goal? If there's a pattern that would be helpful in proving the assigned result, can you prove that the pattern always holds?
- Have a large toolbox, and know when to use each tool. I relied on a large variety of standard tricks: proof by contradiction, the "consider the first time that X happens" pattern, two different techniques for proving that a sequence repeats, and many others. Importantly, I had a sense for which tool I could pull out at any given moment that would connect to what I knew so far and get me closer to my goal.
- Work from both ends: consider what you need to prove, but also what you can prove.
- When I couldn't come up with any sort of plan, I set out to simply prove anything I could. This requires judgement: I could have tried to prove something like "a number can never show up three times in a row" (after the preamble), which is true but not helpful.
- Reframing: capture useful concepts like m and m'. Reframe the sequence into a finite, historyless form. Use an alternate representation (the column view) to stimulate pattern recognition.
- Scouting ahead: if I can prove a certain thing, where would that get me?
- Explore, temporarily suspending constraints / ignoring problems, in the hope that I'll be able to smooth them over later (e.g. finding a state formulation that has finitely many values).
- You need a library of standard tricks, a nose for where they might apply, and a knack for adapting them to fit. Test this by looking for counterexamples.
- Juggle multiple strategies, using judgement to decide when to stick with a strategy and when to branch, give up, or set it aside. It helps to have had enough experience to develop a sense for what directions might be fruitful.
- Comfort with abstraction. (Note that I'm unusually good at this, but would have benefited from being much better!)

Look at Comparing My Approach With AlphaProof (below) when finishing this overview.

Comparing My Approach With AlphaProof

At each step, I'm often working from some intuition or explicit knowledge of what I need to prove next, and steering my thoughts in service of that next step. For instance, once I had tentatively identified a pattern in my example sequences, if I couldn't immediately see how to prove that the pattern would always hold, I would think about what would have to be true for the pattern to break, and then try to engineer an example that would break it. If I succeeded in finding such an example, I'd know not to bother working on the proof. If I failed, the failure might help me figure out how the proof would work.

Now, AlphaProof has a couple of assets that I don't.
It likely has an encyclopedic library of proof techniques, and it has the ability to try zillions of different proofs. These are similar to the strengths of a chess program, which can evaluate *** number of different sequences for each move, and may have a massive library of opening sequences and endgame strategies.

Conversely, to my understanding, AlphaProof is missing a lot of the tools that I used here, and which are typical of human problem-solving. It doesn't play with examples, look for patterns, and try to prove those patterns always hold. It doesn't develop alternative representations. It knows lots of standard proof techniques, but it doesn't know how to guess at which technique will apply to a problem and then work backwards from the requirements for that technique. And perhaps most important, it doesn't have meta-cognition; it can't assess the situation, use high-level strategies to decide what direction would be most fruitful to explore next, and then steer its step-by-step work in service of that strategy.

Or can it? Google's blog post gives only a high-level overview of how AlphaProof works. It incorporates a model that has been trained to come up with good ideas for the next step in a proof. That model may have managed to develop some ability to guess at which technique will work for a problem and then work toward proving the elements necessary to apply that technique. It may even have developed instincts which capture some of the higher-level strategies that I use for a problem like this.

AlphaProof also has the ability to play with simpler versions of a problem, find solutions for them, and apply the lessons learned there to the full problem. This is a very important technique in human problem solving, even though I didn't use it here. None of this explicitly allows AlphaProof to construct examples and look for patterns, to work from both ends of a problem (e.g. working backwards from a known proof technique), or to exercise meta-cognition.
Perhaps equivalent abilities could be hiding in the model that decides which proof step to attempt next, but it's hard for me to see how that could work very well; the structure of the computation and the intermediate data that AlphaProof keeps track of don't lend themselves to these techniques. [***Footnote: one should never be too confident in such statements. AlphaProof might have constructed simpler problems like "assume the preamble only has one number" or "assume the preamble contains only ones and twos" or even "assume the preamble is 1 2 2". I don't know how it comes up with "simpler problems", but if it could develop simpler problems of this flavor, that might allow it to do something equivalent to what I did by writing out a few sequences, looking for patterns, and trying to prove that those patterns always hold.]

It's worth noting that AlphaProof only solved four of the six problems in the 2024 IMO [***Footnote: it solved three, and AlphaGeometry solved one], and it failed to solve the problem I've worked through here. (Of the problems it did solve, I've finished one, am perhaps 2/3 done with another, haven't looked at a third, and am not even going to attempt the geometry problem because I suck at those. I've also made a lot of progress on the other problem it didn't get, but I'm not sure I'll be able to reach the finish line.)