[probably for a later post] https://www.interconnects.ai/p/openai-strawberry-and-inference-scaling-laws has interesting thoughts on test-time compute, Strawberry, inference compute driving only 40% of Nvidia's sales, etc. Could ask Nathan Lambert for feedback on my draft.

Record today's Digits problems:

- 69 from 1 2 4 9 22 25 (1309 total solutions): 2 * 22 + 25 → success (o1-mini)
- 183 from 2 9 10 15 17 25 (640 total solutions): 10 * 17 - 2 + 15 → success (o1-mini)
- 269 from 2 3 7 20 21 25 (331 total solutions): 2 * 7 * 21 - 25
  - o1-mini: partial success after 37 seconds (needed 4 steps and used "2" twice, but hallucinated that it used three operations). When I pointed this out, it concluded, after another 139 seconds, that no solution is possible. Given the hint to start with 2 * 7, it solved the problem in 5 seconds.
  - o1-preview: after 67 seconds, it gave a solution that used 21 twice and took four steps (but claimed to use three). When I pointed this out, it concluded, after another 82 seconds, that no solution is possible.
- 327 from 1 3 5 9 11 20 (161 total solutions): (5 * 20 + 9) * 3
- 431 from 3 4 6 9 11 15 (56 total solutions): (3 * 6 + 11) * 15 - 4

With the announcement of their latest-and-greatest AI model, "o1", OpenAI has once again advanced the technological frontier of confusing product names. Oh, and also, of artificial intelligence.

o1 is advertised as having especially strong reasoning and problem-solving capabilities. This has sparked a new wave of discussion: how intelligent is o1? What can and can't it do? Are we all about to become obsolete?

For all the talk about what AIs can do, I don't think we talk enough about what they'll need to do if they're going to take on increasingly sophisticated work. That is, I've seen relatively little discussion of what goes into human reasoning and problem-solving.

All of this comes on the heels of another major advance in AI reasoning: AlphaProof, a *** system from Google that can solve International Mathematical Olympiad problems.
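The Digits results recorded above can be sanity-checked mechanically. Here is a minimal brute-force solver sketch in Python, under the standard Digits rules as I understand them (combine any two remaining numbers with +, -, *, or /, where subtraction and division must yield a positive integer); it returns one solution path rather than counting all of them:

```python
def solve(nums, target, steps=()):
    # Brute-force Digits solver: repeatedly combine two remaining numbers
    # with +, -, *, or / (keeping every intermediate result a positive
    # integer) until the target appears. Depth-first search; returns the
    # steps of the first solution found, or None if none exists.
    if target in nums:
        return list(steps)
    for i in range(len(nums)):
        for j in range(i + 1, len(nums)):
            a, b = max(nums[i], nums[j]), min(nums[i], nums[j])
            rest = [nums[k] for k in range(len(nums)) if k not in (i, j)]
            candidates = [(a + b, f"{b} + {a}"), (a * b, f"{b} * {a}")]
            if a > b:
                candidates.append((a - b, f"{a} - {b}"))
            if b > 0 and a % b == 0:
                candidates.append((a // b, f"{a} / {b}"))
            for value, desc in candidates:
                found = solve(rest + [value], target,
                              steps + (f"{desc} = {value}",))
                if found is not None:
                    return found
    return None
```

For example, `solve([2, 3, 7, 20, 21, 25], 269)` returns a valid list of steps (not necessarily the 2 * 7 * 21 - 25 line above, since depth-first order picks the first hit), and an unreachable target returns None.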
Dean mentioned an "extended cut" of an interview given around the time the model was released, with some additional details. Mentions that the models are now better at generating reasoning traces.

The transcript of the "there are three Rs in Strawberry" decoding problem is interesting. It breaks down the problem. It tries different approaches. It checks its own work.

https://twitter.com/natolambert/status/1835686346880626880
- Mentions scaling in both RL compute and test-time compute.
- OpenAI shared benchmark results indicating that o1 is substantially more capable than the o1-preview we currently have access to.
- Let's formulate the reward for this RL problem. The classic problem with traditional reinforcement learning from human preferences is that one reward, in the form of a binary preference, is assigned to the whole trajectory. This makes it hard to train the model to understand where it went wrong along the way. Recent research gets around this by designing reward models that score every step in reasoning. The best examples of per-step reward models (process reward models) rating each step in a reasoning tree are from OpenAI's Let's Verify Step by Step paper.
- [I think this is something Nathan Lambert is hypothesizing, not something OpenAI has announced] For each reasoning step shown to the user through the vague summary, o1 models generate multiple candidates, which they then rate after an end-of-step token. For users, this number is fixed; when doing evaluations, OpenAI can vary the number of candidates (and they said they want to expose this kind of inference-intensity control to the user).
- When you read the traces of this model it is very clear it's different than any language model we've been playing with recently. It rambles, it questions, and it still gets to smart answers.
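The hypothesized decoding loop above (generate several candidate reasoning steps, score each with a process reward model, keep the best, repeat) is easy to sketch. Everything here is a stand-in: `generate_step` and `score_step` would be a language model and a trained PRM in a real system, and the `[DONE]` marker is my invention; this shows the control flow only, not anything OpenAI has confirmed:

```python
def best_of_n_step_search(prompt, generate_step, score_step,
                          n_candidates=4, max_steps=8):
    # Hypothetical per-step best-of-n decoding with a process reward model
    # (PRM): at each reasoning step, sample several candidate steps, score
    # each with the PRM, and append only the highest-scoring one.
    trace = [prompt]
    for _ in range(max_steps):
        candidates = [generate_step(trace) for _ in range(n_candidates)]
        best = max(candidates, key=lambda step: score_step(trace, step))
        trace.append(best)
        if "[DONE]" in best:  # stand-in for an end-of-solution token
            break
    return trace
```

Varying `n_candidates` is exactly the kind of inference-intensity knob the notes above speculate OpenAI can turn during evaluations while keeping it fixed for users.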
It seems like there are some variable actions the model is taking, particularly at repetitive phrases like "Wait, is that right" or an oddly human "Hmmm." These mark interesting moments when reasoning can change direction. Getting this sort of result with an autoregressive model would be quite odd: we have seen for some time that once an autoregressive model is perturbed from a trajectory, it commits to its error. RL agents can detect when they're off script and go into an "exploratory" mode, where they take slightly more random actions to find the high-value area of the state space.

Over a year ago, OpenAI likely paid high-skill annotators to create complex forward reasoning paths, likely with different paths for single problems. These can be rated and turned into initial labeled trajectories. It is likely that contrastive examples were needed too, making copying the reasoning traces (if we had them) not enough.

What is the equivalent of the AlphaGo move 37 for language? https://x.com/emollick/status/1835659394459013565?t=VgIjyuN2s86nMPYIsLSU5Q&s=03

Write on OpenAI's o1? https://x.com/MLStreetTalk/status/1834609042230009869?t=JPGxhJYFi6YAwMASaM87oA&s=03
Reasoning is *knowledge acquisition*. The new OpenAI models don't reason, they simply memorise reasoning trajectories gifted from humans. Now is the best time to spot this, as over time it will become more indistinguishable as the gaps shrink. For example, a clever human might know that a particular mathematical problem requires the use of symmetry to solve. The OpenAI model might not yet know, because it's not seen it before in that situation. When a human hints the model and tells it the answer, its CoT model will be updated, and next time in a similar situation it will "know" what strategy to take. This will rinse and repeat as they sponge reasoning data from users until many of the "holes in the swiss cheese" are filled up. But at the end of the day - this isn't reasoning.
It's still cool though.

Nathan Lambert (@natolambert) posted at 9:23 AM on Fri, Sep 13, 2024: People are going to underestimate o1 style systems until there's an equivalent "move 37" moment from AlphaGo in some reasoning domain. It'll come sooner than people think. (https://x.com/natolambert/status/1834629102097338766?t=nRk8BnYU8LmxgfpfmNp9Og&s=03)

https://x.com/btibor91/status/1834686946846597281?t=zkGfSi1ZH5cUy3DniUKbKQ&s=03
https://simonw.substack.com/p/openais-new-o1-chain-of-thought-models
https://x.com/deedydas/status/1833539735853449360?t=PTSb4uuby5SjnQFPh37IFw&s=03
https://x.com/emollick/status/1835090283832442982?t=tv57JQa96hlEQ6pyt46l7Q&s=03

Nathan Labenz describes this as "effectively GPT-4.5": a fine-tuned update of GPT-4o, the way 3.5 was a fine-tune of 3. The guy from Apollo noted that it's primarily RL on outcomes.

https://www.oneusefulthing.org/p/something-new-on-openais-strawberry
https://www.understandingai.org/p/why-i-dont-view-teslas-near-infinite/comments
https://x.com/emollick/status/1835366663715258818?t=dNkT4XdOFU05m6uQ9w9ZTw&s=03
https://x.com/AstronoMisfit/status/1835328355430007164?t=ur1SQaNg5jHEyxD2yfJWCw&s=03
https://twitter.com/emollick/status/1836064591479971894

Julius (@JuliusSimonelli) posted at 2:23 PM on Fri, Sep 13, 2024: Another instance of o1-preview missing the part where I tell it the answer https://t.co/8lkD4oRDP2 (https://x.com/JuliusSimonelli/status/1834704564618244210?t=HKkY1nxGQvGnNik-hoOkWQ&s=03)

Mention of improved models (o1) being more convincing when they get things wrong (including attempts to justify their incorrect answers): https://www.mindprison.cc/i/148326057/capabilities-are-unclear

Revisit my predictions in the GPT-5 post?

"As an early model, it doesn't yet have many of the features that make ChatGPT useful, like browsing the web for information and uploading files and images," the company said. "But for complex reasoning tasks this is a significant advancement and represents a new level of AI capability.
Given this, we are resetting the counter back to 1 and naming this series OpenAI o1."

https://deliprao.substack.com/p/a-few-thoughts-about-o1
https://openai.com/index/introducing-openai-o1-preview/
Terence Tao on o1 | Hacker News
https://arcprize.org/blog/openai-o1-results-arc-prize
https://x.com/rohanpaul_ai/status/1834728874879455250?t=c5YgL2_PhEUKGXqeFn81Ow&s=03

Ethan Mollick (@emollick) posted at 7:46 PM on Fri, Sep 13, 2024: I think it is worth reading the red team reports on o1. Some interesting stuff in there (and these are commissioned and published by OpenAI). (https://x.com/emollick/status/1834785685661696062?t=E3Cgf9eYXVeQK7b3HnSVfw&s=03)

@Kathy I tested o1 on summer camps. Mini wasn't even close. Preview was tantalizingly close, but still made a substantial number of mistakes: https://chatgpt.com/share/e/66e35390-ac54-800c-930a-008a2a93762d

Julius (@JuliusSimonelli) posted at 4:44 PM on Thu, Sep 12, 2024: ChatGPT o1-preview still doesn't know what it knows... If you ask it for three people with the same birth year and day, it can't tell you. But when you ask for people's birthdays, it knows https://t.co/qgLJZm8CGr (https://x.com/JuliusSimonelli/status/1834377470856040685?t=vu-39MkFLyVNcoUPy8s9fA&s=03)

Miles Brundage (@Miles_Brundage) posted at 11:02 AM on Thu, Sep 12, 2024: My team + many others across OpenAI have been thinking for a while about the implications of AIs that can think for a while. More to say later (and do read all the o1 blog posts, system card, testimonials etc. and ofc try it), but for now, 3 personal reflections. (https://x.com/Miles_Brundage/status/1834291595648377091?t=DFITPKBwCtsROoXt8jf7rw&s=03)

Shakeel (@ShakeelHashim) posted at 11:05 AM on Thu, Sep 12, 2024: I took a stab at summarising the OpenAI o1 system card. A few bits in particular jumped out at me. 1: @apolloaisafety finding the model "instrumentally faked alignment during testing", and deeming the model capable of "simple in-context scheming".
https://t.co/JssF7NRW4e (https://x.com/ShakeelHashim/status/1834292284193734768?t=NxybnaZbTuvEDjWC-bZBbQ&s=03)

Igor (from AGI Safety Updates on Signal) (talk to him when I work on my post about reasoning + creativity + Strawberry):

The same type of semantic debate is opened here, and I really don't want to get dragged into the tarpit. From reading the paper, it sounds like this paper https://arxiv.org/abs/2407.21787 plus maybe lots of process supervision to get better, plus maybe a custom score function, and the model is pretty clearly doing a quite exhaustive search across options in the chain-of-thought logs.

I think it should be uncontroversial to say that the model isn't doing "reasoning" as a chain of logical inferences (the way a Prolog Horn-clause solver would, with pure search where every step is valid and the only problems are dead ends) without the statistical association that makes people call it "autocomplete". E.g., from the cipher example, where 9 corresponds to 'i' (9='i'): "But 'i' is 9, so that seems off by 1. So perhaps we need to think carefully about letters. Wait, 18/2=9, 9 corresponds to 'I'." It still doesn't "count" reliably. But I think it's also reasonable to say that humans probably also do this type of small-step association unless thinking formally => cue endless semantic debates on Twitter.

What I'm curious about:
- the exact numbers on that log scale
- how the model does on things we can't expect to be in the corpus somewhat similarly (all the demos are curveballs the model excels at, imo, and the testimony I've seen about proofs involving KL divergence, i.e. stuff that's probably not in textbooks and not mechanical-turk-able, is that it's still meh at them)
- whether or not they can scale and distill this into something meaningful

If anyone has access and wants to try some variants from this https://arxiv.org/pdf/2406.02061 I'd be curious to hear how it fares.

https://www.maginative.com/article/openai-has-demonstrated-strawberry-ai-capabilities-to-u-s-national-security-officials/
LessWrong version: https://www.lesswrong.com/posts/8oX4FTRa8MJodArhj/the-information-openai-shows-strawberry-to-feds-races-to
The Information: https://www.theinformation.com/articles/openai-shows-strawberry-ai-to-the-feds-and-uses-it-to-develop-orion
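The 18/2=9 step in Igor's cipher excerpt comes from OpenAI's decoding demo, where (as far as I can tell from the published trace) each plaintext letter is the average of the alphabet positions of a pair of ciphertext letters. A few lines of Python make the arithmetic the model was fumbling explicit; `decode_pair_average` is my reconstruction of that scheme, not OpenAI's code:

```python
def letter_pos(c):
    # 1-based alphabet index: 'a' -> 1, 'b' -> 2, ..., 'i' -> 9, ..., 'z' -> 26
    return ord(c.lower()) - ord('a') + 1

def pos_letter(n):
    # Inverse mapping: 9 -> 'i'
    return chr(ord('a') + n - 1)

def decode_pair_average(word):
    # Reconstruction of the demo cipher: each plaintext letter is the
    # average of the alphabet positions of two ciphertext letters.
    pairs = zip(word[0::2], word[1::2])
    return "".join(pos_letter((letter_pos(a) + letter_pos(b)) // 2)
                   for a, b in pairs)
```

On the demo's first ciphertext word, `decode_pair_average("oyfjdnisdr")` gives "think", and its 'd', 'n' pair is precisely the 4 + 14 = 18, 18/2 = 9, 'i' computation the model second-guesses in the trace.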