Sholto Douglas & Trenton Bricken — How LLMs actually think
Dwarkesh Patel
Mar 28, 2024

Had so much fun chatting with my good friends Trenton Bricken and Sholto Douglas on the podcast. No way to summarize it, except: This is the best context dump out there on how LLMs are trained, what capabilities they're likely to soon have, and what exactly is going on inside them. You would be shocked how much of what I know about this field, I've learned just from talking with them. To the extent that you've enjoyed my other AI interviews, now you know why. Enjoy! Watch on YouTube. Listen on Apple Podcasts, Spotify, or any other podcast platform. There's a transcript with links to all the papers the boys were throwing down - may help you follow along. Follow Trenton and Sholto on Twitter.

Timestamps
(00:00:00) - Long contexts
(00:16:12) - Intelligence is just associations
(00:32:35) - Intelligence explosion & great researchers
(01:06:52) - Superposition & secret communication
(01:22:34) - Agents & true reasoning
(01:34:40) - How Sholto & Trenton got into AI research
(02:07:16) - Are feature spaces the wrong way to think about intelligence?
(02:21:12) - Will interp actually work on superhuman models
(02:45:05) - Sholto's technical challenge for the audience
(03:03:57) - Rapid fire

Transcript
Edited by Teddy Kim, with lots of helpful links

00:00:00 - Long contexts

Dwarkesh Patel 0:00:00 Okay, today I have the pleasure of talking with two of my good friends, Sholto and Trenton. Noam Brown, who wrote the Diplomacy paper, said this about Sholto: "he's only been in the field for 1.5 years, but people in AI know that he was one of the most important people behind Gemini's success." And Trenton, who's at Anthropic, works on mechanistic interpretability, and it was widely reported that he has solved alignment. So this will be a capabilities-only podcast. Alignment is already solved, no need to discuss further. Let's start by talking about context lengths. It seems underhyped, given how important it seems to me that you can just put a million tokens into context. There's apparently some other news that got pushed to the front for some reason, but tell me about how you see the future of long context lengths and what that implies for these models. Sholto Douglas 00:01:28 So I think it's really underhyped. Until I started working on it, I didn't really appreciate how much of a step up in intelligence it was for the model to have the onboarding problem basically instantly solved. You can see that a bit in the perplexity graphs in the paper, where just throwing millions of tokens' worth of context about a code base allows it to become dramatically better at predicting the next token, in a way that you'd normally associate with huge increments in model scale. But you don't need that. All you need is a new context. So underhyped and buried by some other news. Dwarkesh Patel 00:01:58 In context, are they as sample efficient and smart as humans? Sholto Douglas 00:02:02 I think that's really worth exploring. For example, one of the evals that we did in the paper had it learn a language in context better than a human expert could, over the course of a couple of months.
This is only a small demonstration but I'd be really interested to see things like Atari games where you throw in a couple hundred, or a thousand frames, of labeled actions in the same way that you'd show your friend how to play a game and see if it's able to reason through. It might. At the moment, with the infrastructure and stuff, it's still a bit slow at doing that, but I would actually guess that it might just work out of the box in a way that would be pretty mind-blowing. Trenton Bricken 00:02:38 And crucially, I think this language was esoteric enough that it wasn't in the training data. Sholto Douglas 00:02:42 Exactly. If you look at the model before it has that context thrown in, it doesn't know the language at all and it can't get any translations. Dwarkesh Patel 00:02:49 And this is an actual human language? Sholto Douglas 00:02:51 Exactly. An actual human language. Dwarkesh Patel 00:02:53 So if this is true, it seems to me that these models are already in an important sense, superhuman. Not in the sense that they're smarter than us, but I can't keep a million tokens in my context when I'm trying to solve a problem, remembering and integrating all the information, an entire code base. Am I wrong in thinking this is a huge unlock? Sholto Douglas 00:03:14 Actually, I generally think that's true. Previously, I've been frustrated when models aren't as smart, when you ask them a question and you want it to be smarter than you or to know things that you don't. This allows them to know things that you don't. It just ingests a huge amount of information in a way you just can't. So it's extremely important. Dwarkesh Patel 00:03:33 Well, how do we explain in-context learning ? Sholto Douglas 00:03:35 There's a line of work I quite like, where it looks at in-context learning as basically very similar to gradient descent, but the attention operation can be viewed as gradient descent on the in-context data. That paper had some cool plots where they basically showed “we take n steps of gradient descent and that looks like n layers of in-context learning, and it looks very similar.” So I think that's one way of viewing it and trying to understand what's going on. Trenton Bricken 00:03:59 You can ignore what I'm about to say because, given the introduction, alignment is solved and AI safety isn't a problem. I think the context stuff does get problematic, but also interesting here. I think there'll be more work coming out in the not-too-distant future around what happens if you give a hundred shot prompt for jailbreaks, adversarial attacks . It's also interesting in the sense that, if your model is doing gradient descent and learning on the fly, even if it's been trained to be harmless, you're dealing with a totally new model in a way. You're fine-tuning but in a way where you can't control what's going on. Dwarkesh Patel 00:04:41 Can you explain? What do you mean by gradient descent happening in the forward pass and attention? Trenton Bricken 00:04:45 There was something in the paper about trying to teach the model to do linear regression but just through the number of samples or examples they gave in the context. And you can see if you plot on the x-axis the number of shots that it has, then the loss it gets on ordinary least squares regression will go down with time. Sholto Douglas 00:05:04 And it goes down exactly matched with the number of gradient descent steps. Trenton Bricken 00:05:08 Yeah, exactly. Dwarkesh Patel 00:05:09 I only read the intro and discussion section of that paper. 
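As an aside on the regression result just described: here is a minimal numpy sketch of the simplest version of that correspondence (scalar targets, zero initialization, unnormalized linear attention), meant to illustrate the idea rather than reproduce the paper's exact setup.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 32, 4                      # number of in-context examples, input dimension
X = rng.normal(size=(n, d))       # in-context inputs (these play the role of keys)
y = X @ rng.normal(size=d)        # in-context targets (these play the role of values)
x_q = rng.normal(size=d)          # the query token
eta = 0.1                         # learning rate

# One gradient-descent step on the in-context least-squares loss, starting from w = 0,
# gives w_1 = (eta / n) * sum_i y_i * x_i, and the prediction on the query is w_1 . x_q.
w_1 = (eta / n) * (y[:, None] * X).sum(axis=0)
pred_gd = w_1 @ x_q

# An unnormalized linear-attention readout over the same context: sum_i <x_q, x_i> * y_i.
pred_attn = (eta / n) * ((X @ x_q) * y).sum()

assert np.isclose(pred_gd, pred_attn)   # the two computations are identical
```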
But in the discussion, the way they framed it is that the model, in order to get better at long-context tasks, has to get better at learning to learn from these examples or from the context that is already within the window. And the implication of that is, if meta-learning happens because it has to learn how to get better at long-context tasks, then in some important sense the task of intelligence requires long-context examples and long-context training. Sholto Douglas 00:05:45 Understanding how to better induce meta-learning in your pre-training process is a very important thing for flexible or adaptive intelligence. Dwarkesh Patel 00:05:53 Right, but you can proxy for that just by getting better at doing long-context tasks. One of the bottlenecks for AI progress that many people identify is the inability of these models to perform tasks on long horizons, engaging with the task for many hours, or even many weeks or months, where they're an assistant or an employee and they can just do a thing I tell them to do for a while. AI agents haven't taken off for this reason from what I understand. So how linked are long context windows, and the ability to perform well on them, and the ability to do these kinds of long-horizon tasks that require you to engage with an assignment for many hours? Or are these unrelated concepts? Sholto Douglas 00:06:36 I would take issue with that being the reason that agents haven't taken off. I think that's more about nines of reliability and the model actually successfully doing things. If you can't chain tasks successively with high enough probability, then you won't get something that looks like an agent. And that's why something like an agent might follow more of a step function. GPT-4 class models, Gemini Ultra class models, they're not enough. But maybe the next increment on model scale means that you get that extra nine. Even though the loss isn't going down that dramatically, that small amount of extra ability gives you the extra nine. Obviously you need some amount of context to fit long-horizon tasks, but I don't think that's been the limiting factor up to now. Trenton Bricken 00:07:16 The NeurIPS best paper this year, with Rylan Schaeffer as the lead author, points to this; it argues that emergence is a mirage. People will have a task and you get the right or wrong answer depending on whether you've sampled the last five tokens correctly. So naturally you're multiplying the probability of sampling all of those, and if you don't have enough nines of reliability, then you're not going to get emergence. And all of a sudden you do and it's, "oh my gosh, this ability is emergent," when actually it was kind of there to begin with. Sholto Douglas 00:07:47 And there are ways that you can find a smooth metric for that. Dwarkesh Patel 00:07:50 HumanEval or whatever. In the GPT-4 paper, the coding problems they have, they measure– Sholto Douglas 00:07:56 Log pass rates. Dwarkesh Patel 00:07:57 Exactly. For the audience, basically the idea is that when you're measuring how much progress there has been on a specific task, such as solving coding problems, and the model only gets it right one in a thousand times, you don't just give it a zero. You give it a one-in-a-thousand score, like, "oh, it got it right some of the time." And so the curve you see is smooth: it gets it right one in a thousand times, then one in a hundred, then one in ten, and so forth. So I want to follow up on this.
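To put rough numbers on the reliability and pass-rate points above (the figures here are purely illustrative):

```python
import math

# Chaining subtasks: overall success is the product of per-step success rates,
# so a few extra "nines" per step matter enormously over long horizons.
for p_step in (0.90, 0.99, 0.999):
    print(p_step, round(p_step ** 10, 3), round(p_step ** 100, 5))
# 0.90  -> 0.349 over 10 steps, ~0.00003 over 100 steps
# 0.999 -> 0.990 over 10 steps, ~0.905 over 100 steps

# Emergence as a thresholding artifact: a pass/fail metric sits at zero and then
# suddenly "emerges", while the underlying log pass rate improves smoothly.
pass_rates = [1/1000, 1/100, 1/10, 1/2, 9/10]         # hypothetical pass@1 across model scales
print([round(math.log10(r), 2) for r in pass_rates])   # -3.0, -2.0, -1.0, -0.3, -0.05: smooth
print([int(r >= 0.5) for r in pass_rates])             # 0, 0, 0, 1, 1: looks like a sudden jump
```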
If your claim is that AI agents haven't taken off because of reliability rather than long-horizon task performance, isn't that lack of reliability–when a task is chained on top of another task, on top of another task–isn't that exactly the difficulty with long-horizon tasks? You have to do ten things in a row, or a hundred things in a row, and the reliability of any one of them goes down from 99.99% to 99.9%. Then the whole thing gets multiplied together and the whole thing has become so much less likely to happen. Sholto Douglas 00:08:59 That is exactly the problem. But the key issue you're pointing out there is that your base task solve rate is 90%. If it were 99%, then chaining doesn't become a problem. I think this is also something that just hasn't been properly studied. If you look at the academic evals, it's a single problem. Like a math problem: it's one typical math problem, or one university-level problem from across different topics. You're beginning to see evals looking at this properly via more complex tasks like SWE-bench, where they take a whole bunch of GitHub issues. That is a reasonably long-horizon task, but it's still sub-hour as opposed to a multi-hour or multi-day task. So I think one of the things that will be really important to do next is understand better what success rate over long-horizon tasks looks like. I think that's even important to understand what the economic impact of these models might be and to properly judge increasing capabilities. Cutting the tasks and the inputs/outputs involved down into minutes or hours or days, and seeing how good it is at successively chaining and completing tasks at those different resolutions of time. Then that tells you how automatable a job family or task family is, in a way that MMLU scores don't. Trenton Bricken 00:10:18 It was less than a year ago that we introduced 100K context windows and I think everyone was pretty surprised by that. Everyone had this soundbite of, "quadratic attention costs, so we can't have long context windows." And here we are. The benchmarks are being actively made. Dwarkesh Patel 00:10:36 Wait, doesn't the fact that there are these companies–Google, Magic, maybe others–who have million-token attention imply that it's not quadratic anymore? Or are they just eating the cost? Sholto Douglas 00:10:50 Well, who knows what Google is doing for its long context game? One thing has frustrated me about the general research field's approach to attention. There's an important way in which the quadratic cost of attention is actually dominated in typical dense transformers by the MLP block. So you have this n-squared term that's associated with attention, but you also have a term that's quadratic in d_model, the residual stream dimension of the model. I think Sasha Rush has a great tweet where he basically plots the curve of the cost of attention relative to the cost of really large models, and attention actually trails off. You actually need to be doing pretty long context before that term becomes really important. The second thing is that people often talk about how attention at inference time is such a huge cost. When you're actually generating tokens, the operation is not n squared. One set of Q-vectors looks up a whole bunch of KV-vectors and that's linear with respect to the amount of context that the model has. So I think this drives a lot of the recurrence and state space research where people have this meme of linear attention.
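A rough back-of-the-envelope version of that cost argument, assuming a standard dense transformer block with a 4x MLP; the constants are approximate and ignore heads, KV sharing, and so on:

```python
def per_layer_flops(n_tokens: int, d_model: int) -> dict:
    """Very rough FLOP count for one dense transformer layer over a sequence of n tokens."""
    qkvo_proj = 4 * 2 * n_tokens * d_model ** 2            # Q, K, V, O projections: linear in n
    mlp = 2 * 2 * n_tokens * d_model * (4 * d_model)       # two matmuls with a 4*d_model hidden layer
    attn_scores = 2 * 2 * (n_tokens ** 2) * d_model        # QK^T plus the weighted sum of V: quadratic in n
    return {"linear_in_n": qkvo_proj + mlp, "quadratic_in_n": attn_scores}

d_model = 8192                                             # a large-model residual-stream width
for n in (2_000, 8_000, 32_000, 128_000, 1_000_000):
    costs = per_layer_flops(n, d_model)
    print(n, round(costs["quadratic_in_n"] / costs["linear_in_n"], 2))
# The ratio is roughly n / (6 * d_model): the n^2 attention term only dominates once the
# context is several times d_model. And when *generating* a single token, the attention
# readout is one query against n cached KV pairs, which is linear in n, not quadratic.
```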
And as Trenton said, there's a graveyard of ideas around attention. That's not to say I don't think it's worth exploring, but I think it's important to consider why and where the actual strengths and weaknesses of it are. Dwarkesh Patel 00:12:21 Okay, what do you make of this take? As we move forward through the takeoff, more and more of the learning happens in the forward pass. So originally all the learning happens in the bottom-up, hill-climbing evolutionary process. Let's say during the intelligence explosion the AI is maybe handwriting the weights or doing GOFAI or something, and we're in the middle step where a lot of learning happens in-context now with these models, while a lot of it still happens within the backward pass. Does this seem like a meaningful gradient along which progress is happening? The broader thing being that if you're learning in the forward pass, it's much more sample efficient because you can basically think as you're learning. Like when you read a textbook, you're not just skimming it and trying to absorb inductively, "these words follow these words." You read it and you think about it, and then you read some more and you think about it some more. Does this seem like a sensible way to think about the progress? Sholto Douglas 00:13:23 It may just be like how birds and planes fly, but they fly slightly differently. By virtue of technology, planes can accomplish things that birds can't. It might be that context length is similar, in that it allows the model to have a working memory that we can't, but functionally it is not the key thing for actual reasoning. The key step between GPT-2 and GPT-3 was that all of a sudden there was this meta-learning behavior that was observed in training, in the pre-training of the model. And that has, as you said, something to do with how, if you give it some amount of context, it's able to adapt to that context. That was a behavior that wasn't really observed before that at all. And maybe that's a mixture of properties of context and scale and this kind of stuff. But it wouldn't have occurred in a model with tiny context, I would say. Dwarkesh Patel 00:14:09 This is actually an interesting point. So when we talk about scaling up these models, how much of it comes from just making the models themselves bigger? And how much comes from the fact that during any single call you are using more compute? So if you think of diffusion, you can just iteratively keep adding more compute. If adaptive compute is solved, you can keep doing that. And in this case, if there's a quadratic penalty for attention but you're doing long context anyways, then you're still dumping in more compute (and not just by having bigger models). Trenton Bricken 00:14:46 It's interesting because you do get more forward passes by having more tokens. My one gripe–I guess I have two gripes with this though, maybe three. So in the AlphaFold paper, one of the transformer modules–they have a few and the architecture is very intricate–but they do, I think, five forward passes through it and will gradually refine their solution as a result. You can also kind of think of the residual stream, which Sholto alluded to with the read-write operations, as a poor man's adaptive compute. It's just going to give you all these layers, and if you want to use them, great. If you don't, then that's also fine. Then people will be like, "oh, the brain is recurrent and you can do however many loops through it you want." I think to a certain extent, that's right.
If I ask you a hard question, you'll spend more time thinking about it and that would correspond to more forward passes. But I think there's a finite number of forward passes that you can do. It’s with language as well, people are like “oh human language can have infinite recursion in it,” like infinite nested statements of “the boy jumped over the bear, that was doing this, that had done this, that had done that…” But empirically, you'll only see five to seven levels of recursion, which relates to that magic number of how many things you can hold in working memory at any given time. So it's not infinitely recursive, but does that matter in the regime of human intelligence? And can you not just add more layers? 00:16:12 - Intelligence is just associations Dwarkesh Patel 00:16:12 Can you break it down for me? You've referred to this in some of your previous answers of listening to these long contexts and holding more things in memory. But ultimately it comes down to your ability to mix concepts together to do some kind of reasoning and these models aren't necessarily human level at that, even in context. Break down for me how you see just storing raw information versus reasoning and what's in between. Like, where's the reasoning happening? Where is this raw information storage happening? What's different between them in these models? Trenton Bricken 00:16:46 I don't have a super crisp answer for you here. Obviously with the input and output of the model, you're mapping back to actual tokens. And then in between that you're doing higher level processing. Dwarkesh Patel 00:17:01 Before we get deeper into this, we should explain to the audience. You referred earlier to Anthropic's way of thinking about transformers as these read-write operations that layers do. One of you should just kind of explain at a high level what you mean by that. Trenton Bricken 00:17:15 So for the residual stream, imagine you're in a boat going down a river and the boat is the current query where you're trying to predict the next token. So it's “the cat sat on the _____.” And then you have these little streams that are coming off the river where you can get extra passengers or collect extra information if you want. And those correspond to the attention heads and MLPs that are part of the model. Sholto Douglas 00:17:41 I almost think of it like the working memory of the model, like the RAM of the computer, where you're choosing what information to read in so you can do something with it and then maybe read something else in later on. Trenton Bricken 00:17:54 And you can operate on subspaces of that high-dimensional vector. At this point, I think it's almost given that a ton of things are encoded in superposition. So the residual stream is just one high-dimensional vector, but actually there's a ton of different vectors that are packed into it. Dwarkesh Patel 00:18:12 To dumb it down, a way that would have made sense to me a few months ago is that you have the words that are the input into the model. All those words get converted into these tokens and those tokens get converted into these vectors. And basically, it's just this small amount of information that's moving through the model. And the way you explained it to me, Sholto, this paper talks about how early on in the model, maybe it's just doing some very basic things about, “what do these tokens mean?” Like if it says ten plus five, just moving information to have that good representation. 
And in the middle, maybe the deeper thinking is happening about "how to solve this." At the end, you're converting it back into the output token because the end product is that you're trying to predict the probability of the next token from the last of those residual streams. So it's interesting to think about the small compressed amount of information moving through the model and how it's getting modified in different ways. Trenton, you're one of the few people who have a background in neuroscience. So you can think about the analogies here to the brain. And in fact, you had a paper in grad school thinking about attention in the brain, and one of our friends said this is the only, or first, neural explanation of why attention works. Whereas we have evidence for why CNNs, convolutional neural networks, work based on the visual cortex or something. Do you think in the brain there is something like a residual stream of compressed information that's moving through and getting modified as you're thinking about something? Even if that's not what's literally happening, do you think that's a good metaphor for what's happening in the brain? Trenton Bricken 00:20:04 At least in the cerebellum you basically do have a residual stream in what we'll call the attention model for now–and I can go into whatever amount of detail you want for that–where you have inputs that route through it, but they'll also just go directly to the end point that that module will contribute to. So there's a direct path and an indirect path, and so the model can pick up whatever information it wants and then add that back in. Dwarkesh Patel 00:20:35 What happens in the cerebellum? Trenton Bricken 00:20:37 So the cerebellum nominally just does fine motor control, but I analogize this to the person who's lost their keys and is just looking under the streetlight, where it's very easy to observe this behavior. One leading cognitive neuroscientist said to me that a dirty little secret of any fMRI study, where you're looking at brain activity for a given task, is that the cerebellum is almost always active and lighting up for it. If you have a damaged cerebellum, you also are much more likely to have autism, so it's associated with social skills. In one particular study, where I think they used PET instead of fMRI, when you're doing "next token prediction" the cerebellum lights up a lot. Also, 70% of the neurons in your brain are in the cerebellum. They're small but they're there and they're taking up real metabolic cost. Dwarkesh Patel 00:21:29 This was one of Gwern's points, that what changed with humans was not just that we have more neurons, but specifically that there are more neurons in the cerebral cortex and the cerebellum, and they're more metabolically expensive and more involved in signaling and sending information back and forth. Is that attention? What's going on? Trenton Bricken 00:21:52 So back in the 1980s, Pentti Kanerva came up with an associative memory algorithm. You have a bunch of memories. You want to store them. There's some amount of noise or corruption that's going on and you want to query or retrieve the best match. And so he wrote this equation for how to do it, and a few years later realized that if you implemented this as an electrical engineering circuit, it actually looks identical to the core cerebellar circuit. And that circuit, and the cerebellum more broadly, is not just in us, it's in basically every organism.
There's active debate on whether or not cephalopods have it; they kind of have a different evolutionary trajectory. But even for fruit flies, with the Drosophila mushroom body, it is the same cerebellar architecture. Then there's that convergence, and then my paper, which shows that this attention operation is actually a very close approximation to it, including implementing the softmax and having these nominal quadratic costs that we've been talking about. So the three-way convergence here, plus the takeoff and success of transformers, just seems pretty striking to me. Dwarkesh Patel 00:23:04 I want to zoom out. I think what motivated this discussion in the beginning was we were talking about, "what is the reasoning? What is the memory? What do you think about the analogy you found to attention and this?" Do you think of this more as just looking up the relevant memories or the relevant facts? And if that is the case, where is the reasoning happening in the brain? How do we think about how that builds up into the reasoning? Trenton Bricken 00:23:33 Maybe my hot take here, I don't know how hot it is, is that most intelligence is pattern matching, and you can do a lot of really good pattern matching if you have a hierarchy of associative memories. You start with your very basic associations between just objects in the real world. You can then chain those and have more abstract associations, such as a wedding ring symbolizing so many other associations that are downstream. You can even generalize the attention operation and this associative memory to the MLP layer as well. There it's the long-term setting, where you don't have tokens in your current context, but I think this is an argument that association is all you need. With associative memory in general, you can do two things. You can denoise, or retrieve, a current memory. So if I see your face but it's raining and cloudy, I can denoise and gradually update my query towards my memory of your face. But I can also access that memory and then the value that I get out actually points to some other, totally different part of the space. A very simple instance of this would be if you learn the alphabet. So I query for A and it returns B, I query for B and it returns C, and you can traverse the whole thing. Dwarkesh Patel 00:25:02 One of the things I talked to Demis about was a paper he had in 2008 arguing that memory and imagination are very linked, because of this very thing that you mentioned, that memory is reconstructive. So you are, in some sense, imagining every time you're thinking of a memory, because you're only storing a condensed version of it and you have to. This is famously why human memory is terrible and why people in the witness box or whatever would just make shit up. So let me ask a stupid question. So you read Sherlock Holmes and the guy's incredibly sample efficient. He'll see a few observations and he'll basically figure out who committed the crime, because there's a series of deductive steps that leads from somebody's tattoo and what's on the wall to the implications of that. How does that fit into this picture? Because crucially, what makes him smart is that there's not just an association, but there's a sort of deductive connection between different pieces of information. Would you just explain it as, that's just higher-level association? Trenton Bricken 00:26:11 I think so. I think it's learning these higher-level associations to be able to then map patterns to each other, as a kind of meta-learning.
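A toy numpy sketch of the two uses of associative memory described above, with softmax attention as the retrieval operation; the patterns and temperature are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, temp = 64, 8.0
letters = list("ABCDEFGH")
codes = {c: rng.normal(size=d) / np.sqrt(d) for c in letters}   # one random pattern per letter
K = np.stack([codes[c] for c in letters])                       # stored keys

def retrieve(query, V):
    """Softmax-attention readout: a sum of values weighted by query-key similarity."""
    w = np.exp(temp * (K @ query))
    return (w / w.sum()) @ V

def nearest(vec):
    return max(letters, key=lambda c: codes[c] @ vec)

# 1) Denoising / pattern completion: keys and values are the same memories, so a corrupted
#    query gets pulled back toward the stored pattern (the "face in the rain" case).
noisy_a = codes["A"] + 0.4 * rng.normal(size=d) / np.sqrt(d)
print(nearest(retrieve(noisy_a, V=K)))          # should still be 'A', now much closer to the stored code

# 2) Hetero-association: values point at a *different* part of the space (A -> B, B -> C, ...),
#    so repeatedly feeding the output back in as the next query traverses the chain.
V_next = np.stack([codes[letters[(i + 1) % len(letters)]] for i in range(len(letters))])
q, path = codes["A"], ["A"]
for _ in range(5):
    q = retrieve(q, V_next)
    path.append(nearest(q))
print(path)                                      # should walk 'A' -> 'B' -> 'C' -> 'D' -> 'E' -> 'F'
```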
I think in this case, he would also just have a really long context length, or a really long working memory, where he can have all of these bits and continuously query them as he's coming up with some theory, so that the theory is moving through the residual stream. And then his attention heads are querying his context. But then how he's projecting his queries and keys in the space, and how his MLPs are then retrieving longer-term facts or modifying that information, is allowing him, in later layers, to do even more sophisticated queries and slowly be able to reason through and come to a meaningful conclusion. Sholto Douglas 00:27:00 That feels right to me. You're looking back in the past. You're selectively reading in certain pieces of information, comparing them, and maybe that informs your next step of what piece of information you now need to pull in. Then you build this representation, which progressively looks closer and closer to the suspect in your case. That doesn't feel at all outlandish. Trenton Bricken 00:27:20 I think that the people who aren't doing this research can overlook how, after your first layer of the model, every query, key, and value that you're using for attention comes from the combination of all the previous tokens. So in my first layer, I'll query my previous tokens and just extract information from them. But all of a sudden, let's say that I attended to tokens 1, 2, and 4 in equal amounts. Then the vector in my residual stream–assuming that they wrote out the same thing to the value vectors, but ignore that for a second–is a third of each of those. So when I'm querying in the future, my query is actually a third of each of those things. Sholto Douglas 00:28:03 But they might be written to different subspaces. Trenton Bricken 00:28:05 That's right. Hypothetically, but they wouldn't have to be. You can recombine and immediately, even by layer two and certainly by the deeper layers, just have these very rich vectors that are packing in a ton of information. And the causal graph is literally over every single layer that happened in the past. That's what you're operating on. Sholto Douglas 00:28:25 Yeah, it does bring to mind a very funny eval to do, a Sherlock Holmes eval. You put the entire book into context and then you have a sentence which is, "the suspect is X." Then you look at the probability distribution over the different characters in the book. Trenton Bricken 00:28:41 That would be super cool. Sholto Douglas 00:28:44 I wonder if you'd get anything at all. Dwarkesh Patel 00:28:47 Sherlock Holmes is probably already in the training data. You gotta get a mystery novel that was written in the– Trenton Bricken 00:28:52 You can get an LLM to write it. Sholto Douglas 00:28:53 Or we could purposely exclude it, right? Dwarkesh Patel 00:28:56 Oh, we can? How do you? Trenton Bricken 00:28:57 Well, you need to scrape any discussion of it from Reddit or any other thing. Sholto Douglas 00:29:00 Right, it's hard. That's one of the challenges that goes into things like long-context evals, getting a good one. You need to know that it's not in your training data. You just have to put in the effort to exclude it. Dwarkesh Patel 00:29:10 There's two different threads I want to follow up on. Let's go to the long-context one and then we'll come back to this. In the Gemini 1.5 paper, the eval that was used was whether it can remember something from, say, Paul Graham's essays. Sholto Douglas 00:29:28 Yeah, the needle in a haystack.
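For reference, a needle-in-a-haystack harness is only a few lines. In the sketch below, `generate` is a hypothetical stand-in for whatever long-context model is being evaluated, and the word count is a crude proxy for tokens:

```python
import random

NEEDLE = "The magic number Dwarkesh is thinking of is 42."
QUESTION = "What is the magic number Dwarkesh is thinking of? Answer with just the number."

def make_haystack(filler_paragraphs: list[str], n_words: int, depth: float) -> str:
    """Concatenate filler text (e.g. essays) up to roughly n_words, then insert the needle
    at a given relative depth (0.0 = start of the context, 1.0 = end)."""
    chunks, words = [], 0
    while words < n_words:
        p = random.choice(filler_paragraphs)
        chunks.append(p)
        words += len(p.split())
    chunks.insert(int(depth * len(chunks)), NEEDLE)
    return "\n\n".join(chunks)

def run_eval(generate, filler_paragraphs, lengths=(10_000, 100_000), depths=(0.0, 0.5, 1.0)):
    """Score recall across context lengths and needle depths.
    `generate(prompt) -> str` is a placeholder for the long-context model being evaluated."""
    results = {}
    for n_words in lengths:
        for depth in depths:
            prompt = make_haystack(filler_paragraphs, n_words, depth) + "\n\n" + QUESTION
            results[(n_words, depth)] = "42" in generate(prompt)
    return results
```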
Dwarkesh Patel 00:29:30 I mean, we don't necessarily just care about its ability to recall one specific fact from the context. I'll step back and ask the question. The loss function for these models is unsupervised. You don't have to come up with these bespoke things that you keep out of the training data. Is there a way you can do a benchmark that's also unsupervised, where another LLM is rating it in some way or something like that? Maybe the answer is that if you could do this, reinforcement learning would work. Sholto Douglas 00:30:05 I think people have explored that kind of stuff. For example, Anthropic has the constitutional RL paper where they take another language model, point it at the response, and ask, "how helpful or harmless was that response?" Then they get it to update and try and improve along the Pareto frontier of helpfulness and harmlessness. So you can point language models at each other and create evals in this way. It's obviously an imperfect art form at the moment, because you get reward function hacking, basically. Even humans are imperfect here. Humans typically prefer longer answers, which aren't necessarily better answers, and you get the same behavior with models. Dwarkesh Patel 00:30:48 Going back to the Sherlock Holmes thing, if it's all associations all the way down, does that mean we should be less worried about superintelligence? Because there's not this sense in which it's like Sherlock Holmes++. It'll still need to just find these associations, like humans find associations. It's not able to just see a frame of the world and then have figured out all the laws of physics. Trenton Bricken 00:31:20 This is a very legitimate response. It's, "if you say humans are generally intelligent, then artificial general intelligence is no more capable or competent." I'm just worried that you have that level of general intelligence in silicon. You can then immediately clone hundreds of thousands of agents, and they don't need to sleep, and they can have super long context windows, and then they can start recursively improving, and then things get really scary. So I think to answer your original question, you're right, they would still need to learn associations. Dwarkesh Patel 00:31:52 But wait, if intelligence is fundamentally about these associations, the recursive self-improvement is just them getting better at association. There's not another thing that's happening. So then it seems like you might disagree with the intuition that they can't be that much more powerful, if they're just doing that. Trenton Bricken 00:32:11 I think then you can get into really interesting cases of meta-learning. When you play a new video game or study a new textbook, you're bringing a whole bunch of skills to the table to form those associations much more quickly. And because everything in some way ties back to the physical world, I think there are general features that you can pick up and then apply in novel circumstances. 00:32:35 - Intelligence explosion & great researchers Dwarkesh Patel 00:32:35 Should we talk about the intelligence explosion then? The reason I'm interested in discussing this with you guys in particular is that the models of the intelligence explosion we have so far come from economists. That's fine, but I think we can do better, because in the model of the intelligence explosion, what happens is you replace the AI researchers. There's a bunch of automated AI researchers who can speed up progress, make more AI researchers, and make further progress.
If that's the mechanism, we should just ask the AI researchers whether they think this is plausible. So let me just ask you, if I have a thousand agent Sholtos or agent Trentons, do you think that you get an intelligence explosion? What does that look like to you? Sholto Douglas 00:33:32 I think one of the important bounding constraints here is compute. I do think you could dramatically speed up AI research. It seems very clear to me that in the next couple of years, we'll have things that can do many of the software engineering tasks that I do on a day to day basis, and therefore dramatically speed up my work, and therefore speed up the rate of progress. At the moment, I think most of the labs are somewhat compute bound in that there are always more experiments you could run and more pieces of information that you could gain in the same way that scientific research on biology is somewhat experimentally throughput-bound. You need to run and culture the cells in order to get the information. I think that will be at least a short term planning constraint. Obviously, Sam's trying to raise $7 trillion to buy chips and it does seem like there's going to be a lot more compute in the future as everyone is heavily ramping. NVIDIA 's stock price sort of represents the relative compute increase. Any thoughts? Trenton Bricken 00:34:36 I think we need a few more nines of reliability in order for it to be really useful and trustworthy. And we need context lengths that are super long and very cheap to have. If I'm working in our code base, it's really only small modules that I can get Claude to write for me right now. But it's very plausible that within the next few years, or even sooner, it can automate most of my tasks. The only other thing here that I will note is that the research our interpretability subteam is working on is so early-stage. You really have to be able to make sure everything is done correctly in a bug-free way and contextualize the results with everything else in the model. If something isn't going right, you have to be able to enumerate all of the possible things, and then slowly work on those. An example that we've publicly talked about in previous papers is dealing with layer norm . If I'm trying to get an early result or look at the logit effects of the model, if I activate this feature that we've identified to a really large degree, how does that change the output of the model? Am I using layer norm or not? How is that changing the feature that's being learned? That will take even more context or reasoning abilities for the model. Dwarkesh Patel 00:36:04 You used a couple of concepts together. It's not self-evident to me that they're the same but it seemed like you were using them interchangeably. One was working on the Claude code base and making more modules based on that, they need more context or something. It seems like they might already be able to fit in the context or do you mean context like “the context window?” Trenton Bricken 00:36:30 Yeah, the “context window” context. Dwarkesh Patel 00:36:32 So it seems like the thing that's preventing it from making good modules is not the lack of being able to put the code base in there. Trenton Bricken 00:36:39 I think that will be there soon. Dwarkesh Patel 00:36:41 But it's not going to be as good as you at coming up with papers because it can fit the code base in there. Trenton Bricken 00:36:46 No, but it will speed up a lot of the engineering. Dwarkesh Patel 00:36:48 In a way that causes an intelligence explosion? 
Trenton Bricken 00:36:53 No, in a way that accelerates research. But I think these things compound. The faster I can do my engineering, the more experiments I can run. And the more experiments I can run, the faster we can… I mean, my work isn't actually accelerating capabilities at all, it's just interpreting the models, much to the surprise of the Twitter guys. But we have a lot more work to do on that. Dwarkesh Patel 00:37:14 For context, when you released your paper, there was a lot of talk on Twitter like, "alignment is solved guys. Close the curtains." Trenton Bricken 00:37:24 Yeah, no, it keeps me up at night how quickly the models are becoming more capable and just how poor our understanding of what's going on still is. Dwarkesh Patel 00:37:36 Let's run through the specifics here. By the time this is happening, we have models that are two to four orders of magnitude bigger, or at least two to four orders of magnitude bigger in effective compute. So this idea that you can run experiments faster, you're having to retrain that model in this version of the intelligence explosion. The recursive self-improvement is different from what might've been imagined 20 years ago, where you just rewrite the code. You actually have to train a new model and that's really expensive. Not only now, but especially in the future, as you keep making these models orders of magnitude bigger. Doesn't that dampen the possibility of a recursive self-improvement type of intelligence explosion? Sholto Douglas 00:38:25 It's definitely going to act as a braking mechanism. I agree that the world of what we're making today looks very different from what people imagined it would look like 20 years ago. It's not going to be able to just rewrite its own code to be really smart, because actually it needs to train itself. The code itself is typically quite simple, typically really small and self-contained. I think John Carmack had this nice phrase, that it's the first time in history where you can plausibly imagine writing AI with 10,000 lines of code. That actually does seem plausible when you pare most training codebases down to the limit. But it doesn't take away from the fact that this is something where we should really strive to measure and estimate how progress might go. We should be trying very, very hard to measure exactly how much of a software engineer's job is automatable, and what the trend line looks like, and be trying our hardest to project out those trend lines. Dwarkesh Patel 00:39:21 But with all due respect to software engineers, you are not writing, like, a React front-end, right? What is concretely happening? Maybe you can walk me through a day in the life of Sholto. You're working on an experiment or project that's going to make the model "better." What is happening from observation to experiment, to theory, to writing the code? What is happening? Sholto Douglas 00:39:48 I think it's important to contextualize here that I've primarily worked on inference so far. A lot of what I've been doing is just helping guide the pre-training process, designing a good model for inference and then making the model and the surrounding system faster. I've also done some pre-training work around that, but it hasn't been my 100% focus. I can still describe what I do when I do that work. Dwarkesh Patel 00:40:09 Sorry, let me interrupt. When Carl Shulman was talking about it on the podcast, he did say that things like improving inference, or even literally making better chips or GPUs, that's part of the intelligence explosion.
Obviously if the inference code runs faster, it happens better or faster or whatever. Sorry, go ahead. Sholto Douglas 00:40:32 So concretely, what does a day look like? I think the most important part to illustrate is this cycle of coming up with an idea, proving it out at different points in scale, and interpreting and understanding what goes wrong. I think most people would be surprised to learn just how much goes into interpreting and understanding what goes wrong. People have long lists of ideas that they want to try. Not every idea that you think should work, will work. Trying to understand why that is, and working out what exactly you need to do to interrogate it, is quite difficult. So a lot of it is introspection about what's going on. It's not pumping out thousands and thousands and thousands of lines of code. It's not the difficulty in coming up with ideas. Many people have a long list of ideas that they want to try, but paring that down and shot-calling, under very imperfect information, what are the right ideas to explore further is really hard. Dwarkesh Patel 00:41:32 What do you mean by imperfect information? Are these early experiments? What is the information? Sholto Douglas 00:41:40 Demis mentioned this in his podcast. It's like the GPT-4 paper where you have scaling law increments. You can see in the GPT-4 paper, they have a bunch of dots, right? They say we can estimate the performance of our final model using all of these dots, and there's a nice curve that flows through them. And Demis mentioned that we do this process of scaling up. Concretely, why is that imperfect information? It's because you never actually know if the trend will hold. For certain architectures the trend has held really well. And for certain changes, it's held really well. But that isn't always the case. And things which can help at smaller scales can actually hurt at larger scales. You have to make guesses based on what the trend lines look like and based on your intuitive feeling of what's actually going to matter, particularly for those changes which help at small scale. Dwarkesh Patel 00:42:35 That's interesting to consider. For every chart you see in a release paper or technical report that shows that smooth curve, there's a graveyard of runs where you get the first few points and then the curve goes flat. Sholto Douglas 00:42:45 Yeah. There's all these other lines that go in different directions. You just tail off. Trenton Bricken 00:42:50 It's crazy, both as a grad student and here, the number of experiments that you have to run before getting a meaningful result. Dwarkesh Patel 00:42:57 But presumably it's not just like you run it until it stops and then go to the next thing. There's some process by which to interpret the early data. I don't know. I could put a Google Doc in front of you and I'm pretty sure you could just keep typing for a while on different ideas you have. There's some bottleneck between that and just making the models better immediately. Walk me through that. What is the inference you're making from the first early steps that makes you have better experiments and better ideas? Sholto Douglas 00:43:30 I think one thing that I didn't fully convey before was that a lot of good research comes from working backwards from the actual problems that you want to solve. There's a couple of grand problems today in making the models better that you would identify as issues and then work on: how can I change things to achieve this?
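To make the "bunch of dots with a curve through them" picture from a moment ago concrete, here is a minimal sketch of fitting and extrapolating such a trend; the numbers are synthetic and purely illustrative:

```python
import numpy as np

# Synthetic (made-up) results from small-scale runs: (training compute, eval loss).
compute = np.array([1e18, 3e18, 1e19, 3e19, 1e20])
loss = 2.0 * compute ** -0.05 + 0.4              # pretend these dots came from real experiments

# The usual shortcut: guess an irreducible-loss floor and fit a straight line in log-log space.
irreducible = 0.4
slope, intercept = np.polyfit(np.log(compute), np.log(loss - irreducible), 1)

target_compute = 1e22                            # the big run you're deciding whether to launch
predicted_loss = np.exp(intercept) * target_compute ** slope + irreducible
print(round(predicted_loss, 3))

# The caveat from above: nothing guarantees the trend holds. A change that wins at 1e19 FLOPs
# can bend this curve the wrong way at 1e22, which is why frontier-scale runs still have to be
# trained to check that you haven't quietly fallen off the curve.
```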
When you scale you also run into a bunch of things, and you want to fix behaviors and issues at scale. And that informs a lot of the research for the next increment and this kind of stuff. Concretely, part of the barrier is software engineering: having a code base that's large and capable enough to support many people doing research at the same time often makes it complex. If you're doing everything by yourself, your iteration pace is going to be much faster. Alec Radford, for example, famously did much of the pioneering work at OpenAI. I've heard he mostly works out of a Jupyter notebook and then has someone else who writes and productionizes that code for him. Actually operating with other people raises the complexity a lot, for natural reasons familiar to every software engineer, and there's also the inherent overhead of running things. Launching those experiments is easy, but there are inherent slowdowns induced by that. So you often want to be parallelizing multiple different streams. You can't be totally focused on one thing necessarily. You might not have fast enough feedback cycles. And then intuiting what went wrong is actually really hard. This is, in many respects, the problem that the team Trenton is on is trying to better understand. What is going on inside these models? We have inferences and understanding and headcanon for why certain things work, but it's not an exact science, and so you have to constantly be making guesses about why something might have happened and what experiment might reveal whether that is or isn't true. That's probably the most complex part. The performance work is comparatively easier but harder in other respects. It's just a lot of low-level and difficult engineering work. Trenton Bricken 00:45:38 I agree with a lot of that. Even on the interpretability team, especially with Chris Olah leading it, there are just so many ideas that we want to test, and it's really just having the "engineering" skill–a lot of it is research–to very quickly iterate on an experiment, look at the results, interpret them, try the next thing, communicate them, and then just ruthlessly prioritizing the highest-priority things to do. Sholto Douglas 00:46:07 This is really important. The ruthless prioritization is something which I think separates a lot of quality research from research that doesn't necessarily succeed as much. We're in this funny field where so much of our initial theoretical understanding has basically broken down. So you need to have this simplicity bias and ruthless prioritization over what's actually going wrong. I think that's one of the things that separates the most effective people. They don't necessarily get too attached to using a given sort of solution that they are familiar with, but rather they attack the problem directly. You see this a lot in people who come in with a specific academic background. They try to solve problems with that toolbox, but the best people are people who expand the toolbox dramatically. They're running around and they're taking ideas from reinforcement learning, but also from optimization theory. And also they have a great understanding of systems. So they know the sort of constraints that bound the problem, and they're good engineers. They can iterate and try ideas fast. By far the best researchers I've seen all have the ability to try experiments really, really, really, really, really fast. That's cycle time, at smaller scales. Cycle time separates people.
Trenton Bricken 00:47:20 Machine learning research is just so empirical. This is honestly one reason why I think our solutions might end up looking more brain-like than otherwise. Even though we wouldn't want to admit it, the whole community is kind of doing greedy evolutionary optimization over the landscape of possible AI architectures and everything else. It's no better than evolution. And that's not even a slight against evolution. Dwarkesh Patel 00:47:46 That's such an interesting idea. I'm still confused on what will be the bottleneck. What would have to be true of an agent such that it sped up your research? So in the Alec Radford example, where he apparently already has the equivalent of Copilot for his Jupyter notebook experiments, is it just that if he had enough of those he would be a dramatically faster researcher? So you're not automating the humans, you're just making the most effective researchers, who have great taste, more effective and running the experiments for them? You're still working at the point at which the intelligence explosion is happening? Is that what you're saying? Sholto Douglas 00:48:27 Right, and if that were directly true then why can't we scale our current research teams better? I think that's an interesting question to ask. If this work is so valuable, why can't we take hundreds or thousands of people–they're definitely out there–and scale our organizations better? I think we are less, at the moment, bound by the sheer engineering work of making these things than we are by compute to run and get signal, and taste in terms of what the actual right thing to do is, and then making those difficult inferences on imperfect information. Trenton Bricken 00:49:13 That's for the Gemini team, though. For interpretability, I think we actually really want to keep hiring talented engineers. I think that's a big bottleneck for us. Sholto Douglas 00:49:23 Obviously more people are better. But I do think it's interesting to consider. One of the biggest challenges that I've thought a lot about is how do we scale better? Google is an enormous organization. It has 200,000-ish people, right? Maybe 180,000 or something like that. One has to imagine ways of scaling out Gemini's research program to all those fantastically talented software engineers. This seems like a key asset that you would want to be able to take advantage of. You want to be able to use it, but how do you effectively do that? It's a very complex organizational problem. Dwarkesh Patel 00:50:02 So compute and taste. That's interesting to think about because at least the compute part is not bottlenecked on more intelligence, it's just bottlenecked on Sam's $7 trillion or whatever, right? If I gave you 10x the H100s to run your experiments, how much more effective a researcher are you? Sholto Douglas 00:50:20 TPUs, please. Dwarkesh Patel 00:50:23 How much more effective a researcher are you? Sholto Douglas 00:50:26 I think the Gemini program would probably be maybe five times faster with 10 times more compute or something like that. Dwarkesh Patel 00:50:35 So that's pretty good. Elasticity of 0.5. Wait, that's insane. Sholto Douglas 00:50:39 I think more compute would just directly convert into progress. Dwarkesh Patel 00:50:43 So you have some fixed size of compute, and some of it goes to inference and also to clients of GCP. Some of it goes to training, and of that, some fraction goes to running the experiments for the full model. Sholto Douglas 00:51:04 Yeah, that's right.
Dwarkesh Patel 00:51:05 Shouldn't the fraction that goes to experiments then be higher, given research is bottlenecked by compute? Sholto Douglas 00:51:13 So one of the strategic decisions that every pre-training team has to make is exactly what amount of compute you allocate to different training runs, to your research program versus scaling the last best thing that you landed on. They're all trying to arrive at an optimal point here. One of the reasons why you need to still keep training big models is that you get information there that you don't get otherwise. So scale has all these emergent properties which you want to understand better. Remember what I said before about not being sure what's going to fall off the curve. If you keep doing research in this regime and keep on getting more and more compute efficient, you may have actually gone off the path that eventually scales. So you need to constantly be investing in doing big runs too, at the frontier of what you sort of expect to work. Dwarkesh Patel 00:52:17 So then tell me what it looks like to be in the world where AI has significantly sped up AI research. Because from this, it doesn't really sound like the AIs are going off and writing the code from scratch that's leading to faster output. It sounds like they're really augmenting the top researchers in some way. Tell me concretely. Are they doing the experiments? Are they coming up with the ideas? Are they just evaluating the outputs of the experiments? What's happening? Sholto Douglas 00:52:39 So I think there's two worlds you need to consider here. One is where AI has meaningfully sped up our ability to make algorithmic progress. And one is where the output of the AI itself is the thing that's the crucial ingredient towards model capability progress. Specifically what I mean there is synthetic data. In the first world, where it's meaningfully speeding up algorithmic progress, I think a necessary component of that is more compute. You've probably reached this elasticity point where AIs are easier to spin up and get on to context than yourself, or other people. So AIs meaningfully speed up your work because they're basically a fantastic Copilot that helps you code multiple times faster. That seems actually quite reasonable. Super long-context, super smart model. It's onboarded immediately and you can send them off to complete subtasks and subgoals for you. That actually feels very plausible, but again we don't know because there are no great evals about that kind of thing. As I said before, the best one is SWE-bench. Dwarkesh Patel 00:53:51 Somebody was mentioning to me that the problem with that one is that when a human is trying to do a pull request, they'll type something out and they'll run it and see if it works. If it doesn't, they'll rewrite it. None of these opportunities were given to the LLM when it was told "run on this." It just outputs, and if that runs and checks all the boxes, then it passed. So it might've been an unfair test in that way. Sholto Douglas 00:54:16 So you can imagine that if you were able to use that, that would be an effective training source. The key thing that's missing from a lot of training data is the reasoning traces, right? And I think this would be it. If I wanted to try and automate a specific field, a job family, or understand how at risk of automation that specific field is, then having reasoning traces feels to me like a really important part of that.
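A minimal sketch of the run-and-revise loop being described here, which a single-shot eval denies the model; `generate_patch` and `apply_patch` are hypothetical stand-ins for the model call and the patching machinery:

```python
import subprocess

def run_tests(repo_dir: str) -> tuple[bool, str]:
    """Run the repo's test suite and return (passed, output). Assumes a pytest project."""
    proc = subprocess.run(["pytest", "-x"], cwd=repo_dir, capture_output=True, text=True)
    return proc.returncode == 0, proc.stdout[-4000:] + proc.stderr[-4000:]

def solve_issue(generate_patch, apply_patch, repo_dir: str, issue: str, max_attempts: int = 5) -> bool:
    """Give the model the same run-and-revise loop a human contributor gets."""
    feedback = ""
    for _ in range(max_attempts):
        patch = generate_patch(issue, feedback)   # model proposes a fix, seeing prior test failures
        apply_patch(repo_dir, patch)              # apply it to a scratch checkout of the repo
        passed, output = run_tests(repo_dir)
        if passed:
            return True                           # the sequence of attempts is itself a reasoning trace
        feedback = output                         # feed the failing-test output back for the next try
    return False
```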
Dwarkesh Patel 00:54:51 There's so many different threads there I want to follow up on. Let's begin with the data versus compute thing. Is the output of the AI the thing that's causing the intelligence explosion? People talk about how these models are really a reflection of their data. I forgot his name, but there was a great blog post by this OpenAI engineer. It was talking about how, at the end of the day, as these models get better and better, they are just going to be really effective maps of the data set. So at the end of the day you have to stop thinking about architectures. The most effective architecture is just, "do you do an amazing job of mapping the data?" So that implies that future AI progress comes from the AI just making really awesome data that you're mapping to? Sholto Douglas 00:55:45 That's clearly a very important part. Dwarkesh Patel 00:55:46 That's really interesting. Does that look to you like chain-of-thought? Or what would you imagine as these models get better, as these models get smarter? What does the synthetic data look like? Sholto Douglas 00:56:00 When I think of really good data, to me that means something which involved a lot of reasoning to create. It's similar to Ilya's perspective on achieving superintelligence effectively via perfectly modeling human textual output. But even in the near term, in order to model something like the arXiv papers or Wikipedia, you have to have an incredible amount of reasoning behind you in order to understand what next token might be output. So for