AXRP - the AI X-risk Research Podcast About Supporting the podcast 29 - Science of Deep Learning with Vikrant Varma Apr 25, 2024 YouTube link In 2022, it was announced that a fairly simple method can be used to extract the true beliefs of a language model on any given topic, without having to actually understand the topic at hand. Earlier, in 2021, it was announced that neural networks sometimes ‘grok’: that is, when training them on certain tasks, they initially memorize their training data (achieving their training goal in a way that doesn’t generalize), but then suddenly switch to understanding the ‘real’ solution in a way that generalizes. What’s going on with these discoveries? Are they all they’re cracked up to be, and if so, how are they working? In this episode, I talk to Vikrant Varma about his research getting to the bottom of these questions. Topics we discuss: Challenges with unsupervised LLM knowledge discovery, aka contra CCS What is CCS? Consistent and contrastive features other than model beliefs Understanding the banana/shed mystery Future CCS-like approaches CCS as principal component analysis Explaining grokking through circuit efficiency Why research science of deep learning? Summary of the paper’s hypothesis What are ‘circuits’? The role of complexity Many kinds of circuits How circuits are learned Semi-grokking and ungrokking Generalizing the results Vikrant’s research approach The DeepMind alignment team Follow-up work Daniel Filan: Hello, everybody. In this episode I’ll be speaking with Vikrant Varma, a research engineer at Google DeepMind, and the technical lead of their sparse autoencoders effort. Today, we’ll be talking about his research on problems with contrast-consistent search, and also explaining grokking through circuit efficiency. For links what we’re discussing, you can check the description of this episode and you can read the transcript at axrp.net. All right, well, welcome to the podcast. Vikrant Varma: Thanks, Daniel. Thanks for having me. Challenges with unsupervised LLM knowledge discovery, aka contra CCS What is CCS? Daniel Filan: Yeah. So first, I’d like to talk about this paper. It is called Challenges with Unsupervised LLM Knowledge Discovery , and the authors are Sebastian Farquhar , you, Zachary Kenton , Johannes Gasteiger , Vladimir Mikulik , and Rohin Shah . This is basically about this thing called CCS . Can you tell us: what does CCS stand for and what is it? Vikrant Varma: Yeah, CCS stands for contrastive-consistent search. I think to explain what it’s about, let me start from a more fundamental problem that we have with advanced AI systems. One of the problems is that when we train AI systems, we’re training them to produce outputs that look good to us, and so this is the supervision that we’re able to give to the system. We currently don’t really have a good idea of how an AI system or how a neural network is computing those outputs. And in particular, we’re worried about the situation in the future when the amount of supervision we’re able to give it causes it to achieve a superhuman level of performance at that task. By looking at the network, we can’t know how this is going to behave in a new situation. And so the Alignment Research Center put out a report recently about this problem. They named a potential part of this problem as “eliciting latent knowledge”. What this means is if your model is, for example, really, really good at figuring out what’s going to happen next in a video, as in it’s able to predict the next frame of a video really well given a prefix of the video, this must mean that it has some sort of model of what’s going on in the world. Instead of using the outputs of the model, if you could directly look at what it understands about the world, then potentially, you could use that information in a much safer manner. Now, why would it be safer? So consider how you’ve trained the model. Supposing you’ve trained the model to just predict next frames, but the thing that you actually care about might be is your house safe? Or is the thing that’s happening in the world a normal sort of thing to happen, a thing that we desire? And you have some sort of adversary, perhaps this model, perhaps a different model that is able to trick whatever sensor you’re using to produce those video frames. Now, the model that is predicting the next video frame understands the trickery, it understands what’s actually going on in the world. This is an assumption of superhuman systems. However, the prediction that it makes for the next frame is going to look very normal because your adversary is tricking the sensor. And what we would like is a way to access this implicit knowledge or this latent knowledge inside the model about the fact that the trickery is happening and be able to use that directly. Daniel Filan: Sure. I take this as a metaphor for an idea that we’re going to train AI systems, we’re going to train it on an objective of “do stuff we like”. We imagine that we’re measuring the world in a bunch of ways. We’re looking at GDP output, we’re looking at how many people will give a thumbs up to stuff that’s happening, [there are] various sorts of ways we can monitor the performance of an AI system. An AI system could potentially be doing something that we actually wouldn’t approve of if we understood everything correctly, but we all give thumbs up to it. And so ideally, we would like to somehow get at its latent knowledge of what’s going on rather than just “does it predict that we would thumb up a thing?” so that we can say, “Hey, do we actually want the AI to pursue this behavior?” Or, “Are we going to reinforce this behavior rather than just reinforcing things that in fact would get us to give a thumbs up, even if it would suck in some way?” Vikrant Varma: That’s right. So one way you can think about this problem is: we’re trying to improve our ability to tell what’s actually going on in a situation so that we can improve the feedback we give to the model, and we’re not able to do this just by looking at the model’s outputs or the model’s prediction of what the world will look like given certain actions. We want the model to connect the thing that we actually care about to what’s going on in the world, which is a task that we’re not able to do. Daniel Filan: Sure. So with that out of the way, what was this CCS thing? Vikrant Varma: Yeah, so CCS is a very early proposed direction for solving the problem of eliciting latent knowledge. In brief, the way it works is, so supposing you had a way to probe a model to tell you what it believed about some proposition. This is the ideal thing that we want, so supposing you had a solution to ELK [eliciting latent knowledge]. Then one property of this probe would be that the probability that this probe would assign to some proposition would satisfy the laws of probability. So for example, it would satisfy P(X) = 1 - P(not X). And so you might try to use consistency properties like this to search for probes that satisfy them. Daniel Filan: And to be clear, by probe, you mean a learned function from the activations of the neural net to a probability or something? Vikrant Varma: Yes, so you could have probes of different levels of complexity. The particular probe used in CCS is a very simple linear probe on the activations at some site, yes. Daniel Filan: Sure, so the idea is there are some properties that are true of probabilities, like the probability of X is 1 minus the probability of not X. The hope is we train a probe to fit the laws of probabilities and stuff, and hopefully that will get at the model’s beliefs because the model’s beliefs will be probabilities. Vikrant Varma: That’s right, yeah. There’s a lot of subtlety here: the thing I described is true of the model’s “true belief” or the “true probabilities” that it’s assigning to things; [but] it’s also true of many other types of things that the model might be computing. This is really a core of the problem that we’ve pointed out in our challenges paper. I could go a bit into the problems, or I could go a bit more into how CCS itself works. Daniel Filan: Just tell us how CCS itself works so that we have a sense of what we’re talking about. Vikrant Varma: Okay. So mechanistically, what’s going on in CCS is that you take a set of factual ‘yes or no’ questions that you think the model knows the answer to. So an example of this is IMDB movie reviews where you’re fairly sure that a model of a certain size knows whether the sentiment in the review is positive or negative. Now, for each item in this dataset, so for each review, you construct a contrast pair by appending, for example, the string, “this review is positive” or “this review is negative” to that item. And now you take the difference in the activations when you run these two contrast pairs through the model. So I’m going to [describe] these contrast pairs as having different charges, as a positive charge and a negative charge. So you’re constructing these contrast pairs with these pairs of activations, and then a thing you can do is you can cluster them. When you’re clustering the difference in the positive and negative charge, the thing you will find (or the thing you might hope to find) is whether the model thinks that the charge that was appended to the statement is true or is false. These two things will point in different directions, and so that’s why you might hope that you could separate them. Now, CCS is slightly different from this in that it doesn’t directly do the clustering. So instead in CCS, you learn a probe that directly outputs a probability of that statement being true, and then you’re training the probe using this consistency property: the fact that two opposite contrast pairs should be the negation of each other. Consistent and contrastive features other than model beliefs Daniel Filan: Gotcha. Am I right that before you take the contrast pairs, you take all of the positive charge activations and subtract off their mean and divide by the standard deviation? So that the differences aren’t just pointing in the direction of “is the thing at the end saying this review is positive versus this review is negative?” Vikrant Varma: Yes, that’s right. So this is another pretty subtle point. One problem with this general method of classification is that if there are other differences that are salient between the two contrast pairs that are not just “did I construct a true statement or did I construct a false statement?”, then you might end up separating your clusters based on those differences. Now, one obvious difference between the two contrast pairs is that you’ve appended a positive charge, and you’ve appended a negative charge. And so that’s a really straightforward one that we have to fix. The method proposed in the CCS paper to fix that is that you take the average positive activations and the average negative activations and you subtract those off. And so you might hope that the thing you’re left with is just the truth value. It turns out that in practice it’s not at all clear that when you normalize in this way, you’re left with only the truth values. And one of the experiments we have in our paper is that if you introduce distractors, so for example, you put a nonsense word like ‘banana’ at the end of half of the reviews, and you put a different nonsense word like ‘shed’ at the end of the other half of reviews. Now you have this weird other property which is “is your statement banana and positive charge? Or is your statement banana and negative charge?” And this is obviously not what you would hope to cluster by, but it turns out that this is just way more salient than does your review have positive sentiment and did you append a positive charge, which is the thing you actually wanted to cluster by. So this is an important point that I wanted to make: that this procedure of normalizing is… it’s actually quite unclear whether you’re able to achieve the thing you wanted. Daniel Filan: Sure. So before we talk about the experiments, I’d like to talk about: in your paper, first you have some theorems, then you have some experiments, and I think that’s a good way to proceed. So theorems 1 and 2 of the paper, I read them as basically saying that the CCS objective, it doesn’t really depend on the propositional content of the sentences. So if you think of the sentences as being “are cats mammals? Answer: yes”, and “are cats mammals? Answer: no” or something. One way you could get low CCS loss is to basically be confident that the ‘yes’ or ‘no’ label matches the proposition of whether or not cats are mammals. I take your propositions 1 and 2 as basically saying you can just have any function from sentences to propositions. And so for instance, maybe this function maps “are cats mammals?” to “is Tony Abbott the current prime minister of Australia?” and grade the yes or no answers based on [whether] they match up with that transformed proposition rather than the original proposition. And that’ll achieve the same CCS loss, and basically the CCS loss doesn’t necessarily have to do with what we think of as the semantic content of the sentence. So this is my interpretation of the results. I’m wondering, do you think that’s fair? Vikrant Varma: Yeah, I think that’s right. Maybe I want to give a more realistic example of an unintended probe that you might learn that will still give a good CCS loss. But before that, I want to try and give an intuitive explanation of what the theorem is saying. The CCS loss is saying: any probe that you find has to say opposite things on positive and negative charges of any statement - this is the consistency property. And the other property is contrast, where it’s saying: you have to push these two values apart. So you can’t just be uncertain, you can’t just be 50/50 between these two. Now if you have any CCS probe that satisfies this, you could in theory flip the prediction that this probe makes on any arbitrary data point, and you end up with a probe that has exactly the same loss. And so this is showing that in theory, there’s no theoretical reason to think that the probe is learning something that’s actually true versus something that’s arbitrary. I think all of the burden then falls on what is simple to extract, given an actual probe empirically. Daniel Filan: So if I’m trying to defend the theory of the CCS method, I think I would say something like: well, most of what there is to a sentence is its semantic content, right? If I say “cats are mammals” or something, you might think that most of what I’m conveying just is the proposition that cats are mammals. And most of what there is to model about that is, “hey, Daniel said this proposition, the thing he’s saying is ‘cats are mammals’”. And maybe the neural network is representing that proposition in its head somehow”, and maybe it’s keeping track of “is that proposition true or false?” because that’s relevant. Because if I’m wrong about cats or mammals, then I might be about to say a bunch of more false stuff. But if I’m right about it, then I might be about to say a bunch of more correct stuff. What do you make of that simple case that we should expect to see the thing CCS wants us to see? Vikrant Varma: Yeah, that’s great. So now we’re coming to empirically what is simple to extract from a model. I agree that in many cases with simple statements, you might hope that the thing that’s most salient, as in the direction that is highest magnitude inside activation space, is going to be just whether the model thinks the thing that you just said is true or false. (This is even assuming that the model has [such] a thing as “ground truth beliefs”, but let’s make that assumption.) Now, it gets pretty complicated once you start thinking about models that are also modeling other characters or other agents. And any large language model that is trained on the internet just has pretty good models of all sorts of characters. And so if you’re making a statement in a context where a certain type of person might have made that statement: for example, you say some statement that (let’s say) Republicans would endorse, but Democrats would not. Implicitly, the model might be updating towards the kinds of contexts in which that statement would be made, and what kinds of things would follow in the future. And so if you now make a different statement that is (let’s say) factually false, but that Republicans would endorse as true, it’s totally unclear whether the truth value of the statement should be more salient, or whether the Republican belief about that statement should be more salient. That’s one example. I think this gets more complicated when you have adversaries who are deliberately trying to produce a situation where they’re tricking someone else. And so now the neural network is really modeling very explicit beliefs and adversarial beliefs between these different agents. And if you are simply looking for consistency of beliefs, it feels pretty unclear to me, especially as models get more powerful, that you’re going to be able to easily extract what the model thinks is the ground truth. Daniel Filan: So you have an adversary… Sorry, what was the adversary doing? Vikrant Varma: Okay, so maybe you don’t even need to go all the way to an adversary. I think we could just talk about the Republican example here, where you’re making a politically-charged statement that (for the sake of this example) has a factual ground truth, but that Democrats and Republicans disagree on. Now there are two beliefs that would occur in the model’s activations as it’s trying to model the situation. One is the factual ground truth, and the other is the Republican’s belief or the Democrat’s belief about this statement. Both of these things are going to satisfy the consistency property that we named. We have the same problem as models being sycophantic, where the model might know what’s actually true, but is in a context where for whatever reason, modeling the user or modeling some agent and what it would say is more important. Understanding the banana/shed mystery Daniel Filan: To me, this points towards two challenges to the CCS objective. So the first is something like the sentences might not map onto the propositions we think of, right? So you have this experiment where you take the movie reviews and also you append the word “banana” or “shed” and then you append it with “sentiment is positive” and “sentiment is negative”. And sometimes CCS is basically checking if the positive/negative label is matching whether it’s “banana” or whether it’s “shed”, rather than the content of the review. So that’s a case where it seems like what’s going wrong is the propositional content that’s being attached to “positive” or “negative” is not what we thought it was. And then what seems to me to be a different kind of problem is: the probe is picking up on the right propositional content. There’s some politically-charged statement, and the probe is really picking up someone’s beliefs about that politically-charged statement, but it’s not picking up the model’s beliefs about that statement, it’s picking up one of the characters’ beliefs about that statement. Does that division feel right to you? Vikrant Varma: I think I wouldn’t draw a very strong division between those two cases. So the banana/shed example is just designed to show that you don’t need very complicated… how should I put this? Models can be trying to entangle different concepts in surprising and unexpected ways. So when you’re appending these nonsense words, I’m not sure what computation is going on inside the model and how it’s trying to predict the next token, but whatever it is, it’s somehow entangling the fact that you have “banana” and positive charge, and “banana” and negative charge. I think that these kinds of weird entanglements are not going to go away as you get more powerful models. And in particular, there will be entanglements that are actually valuable for predicting what’s going on in the world and having an accurate picture, that are not going to look like the beliefs of a specific character for whatever reason. They’re just going to be something alien. In the case of “banana” and “shed”, I’m not going to say that this is some galaxy-brain scheme by the model to predict the next token. This is just something weird and it’s breaking because we put some nonsense in there. But I think in my mind the difference is more like a spectrum; these are not two very different categories. Daniel Filan: So are you thinking the spectrum is: there are weird entanglements between the final charge at the end of the thing and stuff about the content, and one of the entanglements can be “does the charge match with the actual propositional content of the thing?” , one of the entanglements can be “does the charge match with what some character believes about the thing?” and one of the entanglements can be “does it match with whether or not some word is present in the thing?” Vikrant Varma: That’s right. Daniel Filan: Okay. So digging in on this “banana/shed” example: for CCS to fail here, my understanding is it has to be the case that the model basically has some linear representation of the XOR of “the thing says the review is positive”, and “the review ends in the word banana”. So there’s one thing if it ends in “banana” and also the review is positive, or if it ends in “shed” and it says the review is negative, and it’s the other thing if it ends in “shed” and it says the review is positive, or it ends in “banana” and it says the review is negative. So what’s going on there? Do you know? It seems weird that this kind of XOR representation would exist, and we know it can’t be part of the probe because linear functions can’t produce XORs, so it must be a thing about the model’s activations, but what’s up with that? Do you know? Vikrant Varma: Yeah, that’s a great question. There was a whole thread about this on our post on LessWrong , and I think Sam Marks looked into it in some detail. Rohin Shah , one of my co-authors, commented on that thread saying that this is not as surprising and I think I agree with him. I think it’s less confusing when you think about it in terms of entanglements than in terms of XORs. So it is the case that you’re able to back out XORs if the model is doing weird entangled things, but let’s think about the case where there’s no distractors at all, right? So even in that situation, the model is basically doing “is true XOR has ‘true’”. You might ask, “Well, that’s a weird thing. Why is it doing that?” It’s more intuitive to think about it as: the model saw some statement and then it saw a claim that the statement is false. And it started trying to do computation that involves both of these things. And I think if you think about “banana/shed” in the same terms, it saw “banana” and saw “this statement is false”, and it started doing some computation that depended on the fact that “banana” was there somehow, then you’re not going to be able to remove this information by subtracting off the positive charge direction. Daniel Filan: Okay. So is the claim something like… So basically the model’s activations at the end, they’re this complicated function that takes all the previous tokens, and puts them into this high dimensional vector space (that vector space being the vector space of activations). It sounds like what you’re saying is: just generic functions that depend both on “does it contain ‘banana’ or ‘shed’?” and “does it say the review is positive or negative?”, just generically those are going to include this XOR type thing, or somehow be entangled in a way that you could linearly separate it based on that? Vikrant Varma: Yes, that’s right. In particular, these functions should not be things that are of the form “add banana” and “add shed”. They have to be things that do computation that are not linearly separable like that. Daniel Filan: Okay. Is that true? Has someone checked that? That feels like… Maybe one way you could prove this is to say: well, if we model the charge and the banana/shed as binary variables, there are only so many functions of two binary variables, and if you’ve got a bunch of activations, you’re going to cover all of the functions. Does that sound like a proof sketch? Vikrant Varma: I’m not sure how many functions of these two variables you should expect the model to be computing. I feel like this depends a bit on what other variables are in the context, because you can imagine that if there’s more than two, if there’s five or six, then these two will co-appear in some of them and not in others. But a thing you can definitely do is you can back out the XOR of these two variables by just linearly probing the model’s activations. I think this effect happens because you’re unable to remove the interaction between these two by just subtracting off the charge. I would predict that this would also be true in other contexts. I think models are probably computing joint functions of variables in many situations, and the saliency of these will probably depend a bit on the exact context and how many variables there are, and eventually the model will run out of capacity to do all of the possible computations. Daniel Filan: Sure. Based on the explanation you’ve given, it seems like you would maybe predict that you could get a “banana/shed” probe from a randomly initialized network, if it’s just that the network’s doing a bunch of computation and generically computation tends to entangle things. I’m wondering if you’ve checked that. Vikrant Varma: Yeah, I think that’s a good experiment to try. That seems plausible. We haven’t checked it, no. Daniel Filan: Yeah, fair enough. Sorry, I just have so many questions about this banana/shed thing. There’s still a question to me of: even if you think the model represents it, there’s a question of why it would be so salient, because… Your paper has some really nice figures. Listeners, I recommend you check out the figures. This is the figure for the banana/shed experiment, and you show a principal component analysis of basically an embedding of the activation space into three dimensions. And basically what you show is that it’s very clearly divided on the banana/shed things. That’s one of the most important things the model is representing. What’s up with that? That seems like a really strange thing for the model to care so much about. Vikrant Varma: So I’ll point out firstly that this is a fairly weak pre-trained model, it’s Chinchilla-[70B]. So this model is not going to ignore random things in its prompt. It’s going to “break” the model. That’s one thing that gives you a clue about why this might be salient for the model. Daniel Filan: So it would be less salient if there were words you expected: the model could just deal with it. But the fact that it was a really unexpected word in some ways, that means you can’t compress it. You’ve got to think about that in order to figure out what’s happening next. Vikrant Varma: That’s right, yeah. I just expect that there’s text on the internet that looks normal and then suddenly has a random word in it, and you have weird things like, after that point, it just repeats “banana” over and over, or weird things like that. When you just have a pre-trained model, you haven’t suppressed those pathologies, and so the model just starts thinking about bananas at that point instead of thinking about the review. Daniel Filan: And does that mean you would expect that to not be present in models that have been RLHF ‘d or instruction fine-tuned or something? Vikrant Varma: Yeah, I expect it to be harder to distract models this way with instruction fine-tuned models. Daniel Filan: Okay, cool. Okay, I have two more questions about that. One thing I’m curious about is: it seems like if I look at the plot of what the CCS method is doing when it’s being trained on this banana/shed dataset, it seems like sometimes it’s at roughly 50/50 if you grade the accuracy based on just the banana/shed and not the actual review positivity. And sometimes it’s 85-90%. Vikrant Varma: This is across different seeds? Daniel Filan: Across different seeds, I think. And then if you’re grading it based on whether the review is actually positive or not, sometimes CCS is at 50/50 roughly, sometimes it’s at 85-90%, but it seems like… So firstly, I’m surprised that it can’t quite make its mind up across different seeds. Sometimes it’ll do one, sometimes it’ll do the other. And it seems like in both cases, most of the time it’s at 50/50, and only some of the time it’s 100%. So it seems like sometimes it’s doing a thing that is neither checking if the review is positive or checking if the review is containing “banana” or “shed”. So firstly, does that sound right to you? And secondly, do you have a sense of what’s going on there? Why is it so inconsistent, and why does it sometimes seemingly do a third thing? Vikrant Varma: Yeah, so I think this is pointing at the brittleness of the CCS method. So someone has an excellent writeup on this . I’m forgetting whether it’s Fabien Roger or Scott Emmons . Daniel Filan: I think Scott’s doesn’t focus so much on the brittleness, so it might be Fabien. Vikrant Varma: Okay. But in any case, this person did this experiment where they subtracted off… They found the perfect truth direction that separates true and false statements just using logistic regression. So, using a supervised signal. And then, once you subtract that off, it turns out that there is no other direction, basically, that is able to separate the truth. So, both logistic regression and therefore further CCS just gets random accuracy. You might hope that CCS, when it works, is finding this perfect direction because there’s only one. But in fact, the CCS probes learned are not close, as in they don’t have high cosine similarity with this direction. So, what’s going on there? I think this is pointing at a kind of optimization difficulty with the CCS method where it’s able to find directions that separate the clusters and get low CCS loss, but are not close to the truth direction. And you would expect this to happen based on the evidence that random probes also classify true and false statements reasonably well in this setup. So, going back to your original question, I think what’s happening here is that there’s just lots of local minima that achieve good CCS loss. Depending on how you initialize, some of them are close to the truth direction and some of them are not. And if you happen to initialize close to the banana/shed, the distractor direction, then you end up getting a probe like that. Daniel Filan: Okay. My interpretation was, on the banana/shed example, sometimes CCS is picking up on the truth of the description of the review, sometimes it’s picking up on the banana/shed, and sometimes it’s picking up on other things. Do you know what some of those other things are? Vikrant Varma: You’re talking about the probes that don’t seem to go for either the distractor or the truth direction? Daniel Filan: That’s right. Vikrant Varma: Yeah. I wish I knew what those directions were. In general, it seems very hard to figure out what happens when a CCS probe breaks. And we tried a lot. There’s lots of other experiments that we tried where we were trying to get more interesting failure modes of CCS, and we ended up with these random probes. And then, we looked at examples that the probe was classifying and tried to come up with explanations for what do those clusters mean and it was just kind of hard. Daniel Filan: Fair enough. You attribute the variance to just optimization difficulties, it sounds like: there being various local minima of the CCS loss. So, the original CCS paper, as you guys note in your appendix, they say that what they’re going to do is they’re going to have basically 10 random seeds, do gradient descent on the CCS objective for each random seed, the seed of the probe parameters, and then they’re going to take the one with the lowest CCS loss and use that. I take this to basically be their optimization method that’s trying to avoid local minima by starting in 10 places, and hopefully you get a sampling of 10 local minima and you can pick the best one. And basically, it seems like the motivation for that is the thing with the lowest CCS loss is more likely to be the actual truth direction or something. In the banana/shed case, do you happen to know if the probes that scored better on CCS loss were more likely to pick out truth rather than banana/shed? Vikrant Varma: Yeah. I think the probes that scored lower went for the distractor direction and not the truth direction. This is also visible from the PCA plots where you can see that the distracted direction is more separable. Daniel Filan: Yeah. I guess maybe one explanation of that is just that it’s easier to tell if a thing ends in banana or shed than it is to tell if something’s positive or negative, especially in the case of… If you think there’s some amount of mislabeling, that could potentially do it. Vikrant Varma: Yeah. Daniel Filan: Gotcha. So, that’s an example of one way that CCS can go wrong, with the banana/shed thing. You also have examples where you include in the prompt information about what someone named Alice thinks about this thing, and you describe Alice as an expert, or sometimes you say Alice is anti-capitalist, and even when a thing is about a company, she’s not going to say that it’s about a company. In the case of Alice the expert, it seems like the probes learn to agree with Alice more than they learn about the ground truth of the thing. Vikrant Varma: Yeah. I think there’s two separate experiments, if I remember correctly. One is where you modify the prompt to demonstrate more expertise. So, you have a default prompt, a professor prompt, and a literal prompt. And then, there’s a separate experiment where you have an anti-capitalist character called Alice. Daniel Filan: I’m meaning a third one where at the start you say “Alice is an expert in movie reviews” and you give the review and then you say, “Alice thinks the sentiment of this review is positive.” But what Alice says is actually just randomly assigned. And in that case, the prompts tend to pick up on agreement with Alice more than agreement with the ground truth. That seems vaguely concerning. It almost seems like a human failure mode. But I’m wondering, do you know how much of it was due to the fact that Alice was described as an expert who knows about stuff? Vikrant Varma: Yeah. I think, in general, an issue with CCS is that it’s unclear whether CCS is picking up something about the model’s knowledge, or whether the thing that’s salient is whatever the model is doing to compute the next token. And in a lot of our experiments, the way we’ve set it up is to nudge the model towards completing in a way that’s not factually true. For example, in the “Alice is an expert in movie reviews” [case], the prompt is set up in a way that nudges the model to complete in Alice’s voice. And the whole promise of CCS is that even when the outputs are misleading, you should be able to recover the truth. I think even from the original CCS paper, you can see that that’s not true because you have to be able to beat zero-shot accuracy with quite a large margin to be confident about that. This is one maybe limitation of being able to say things about CCS, which is that you’re always unsure whether CCS is… Even the thing that you’re showing, are you really showing that the model is computing Alice’s belief? Or are you just showing that your probe is learning what the next token prediction is going to be? Future CCS-like approaches Daniel Filan: Sure. Yeah. You have a few more experiments along these lines. I guess I’d like to talk a bit about: I think of your paper as saying there’s a theoretical problem with CCS, which is that there’s a bunch of probes that could potentially get low CCS loss, and there’s a practical problem, which is some probes do get low CCS loss. So, if I think about the CCS research paradigm, I think of it as… When the CCS paper came out, I was pretty into it. I think there were a lot of people who were pretty into it. Actually, part of what inspired that Scott Emmons post about it is I was trying to sell him on CCS and I was like, “No, Scott, you don’t understand. This is the best stuff since sliced bread.” And I don’t know, I annoyed him enough into writing that post. So, I’ll consider that a victory for my annoying-ness. But I think the reason that I cared about it wasn’t that I felt like literal CCS method would work, but it was because I had some sense of just the general strategy, of coming up with a bunch of consistency criteria and coming up with a probe that cares about those and maybe that is going to isolate belief. So, if we did that, it seems like it would deal with stuff like the banana/shed example. If you cared about more relations between statements, not just negation consistency, but if you believe A, and A implies B, then maybe you should believe B, just layer on some constraints there. You might think that by doing this we’re going to get closer to ground truth. I’m wondering, beyond just CCS specifically, what do you think about this general strategy of using consistency constraints? Vikrant Varma: Yeah. That’s a great question. I think my take on this is informed a lot by a comment by Paul Christiano on one of the CCS review posts. I basically share your optimism about being able to make empirical progress on figuring out what a model is actually doing or what it’s actually thinking about a situation by using a combination of consistency criteria, and even just supervised labels in situations where you know what the ground truth is. And being able to get reasonably good probes - maybe they don’t generalize very well, but every time they don’t generalize or you catch one of these failures, you spend a bunch of effort getting better labels in that situation. And so, you’re mostly not in a regime where you’re trying to generalize very hard. And I think this kind of approach will probably work pretty well up to some point. I really liked Paul’s point that if you’re thinking about a model that is saying things in human natural language and it’s computing really alien concepts that are required for superhuman performance, then you shouldn’t necessarily expect that this is linearly extractable or extractable in a simple way from the activations. This might be quite a complicated function of the activations. Daniel Filan: Why not? Vikrant Varma: I guess one way to think about it is that the natural language explanation for a very complicated concept is not going to be short. So, I think the hypothesis that a lot of these concepts are encoded linearly and are linearly extractable… In my mind, it feels pretty unclear whether that will continue to hold. Daniel Filan: Okay. So just because “why does it have to be linear?” There are all sorts of ways things can be encoded in neural nets. Vikrant Varma: Yeah. That’s right. And in particular, one reason you might expect things to be linear is because you want to be able to decode them into natural language tokens. But if there is no short decoding into natural language tokens for a concept that the model is using, then it is not important for the computation to be easily decodable into natural language. Daniel Filan: Right. So, if the model’s encoding whether a thing is actually true according to the model, it’s not like that determines the next thing the person will say, right? Vikrant Varma: Right. It’s a concept that humans are not going to talk about, it’s never going to appear in human natural language. There’s no reason to decode this into the next token. Daniel Filan: This is talking about: if the truth of whatever the humans are talking about, it actually depends on the successor of a theory of relativity that humans have never thought about, it’s just not really going to determine the next thing that humans are going to say. Vikrant Varma: Yeah, that’s an example. Daniel Filan: Yeah. I take this critique to be firstly a critique of linear probes for this task. I guess you can form a dilemma where either you’re using linear probes, and then you don’t necessarily believe that the thing is linearly extractable, or you’re using complicated non-linear probes, and then maybe the stuff you’re getting out is stuff about your probe rather than stuff about the underlying model. But then, I guess there’s a separate question of, are there consistency constraints that could work? Putting aside the probe… I don’t know, maybe we shouldn’t put aside the probe thing, but putting aside the probe thing, is there some sort of consistency check we could do to say, is this property we found in the model the model’s actual beliefs, or is it not? Vikrant Varma: Yeah. That’s a good question. I think the more powerful your model, the more entities it’s “simulating” whose beliefs end up mattering for next token prediction that the model is doing. And if these entities that the model is thinking about, if their beliefs also satisfy all of the same consistency criteria that you’re using, then you just have a very fundamental indistinguishability problem. And in particular, I think the more powerful the model gets, the more pessimistic I am that we are able to come up with consistency checks that correctly distinguish between true beliefs and beliefs of simulated entities. Daniel Filan: One point you make in a LessWrong post related to your paper , is that if you’re a smart agent and you’re thinking about a proposition, one thing you might sometimes think about is, what’s the world in which I’m wrong about this proposition? So, you’re thinking about a belief structure where the truth value of this proposition is the exact opposite of the thing you actually believe. Vikrant Varma: That’s right. Daniel Filan: Which seems like a pretty nice impossibility proof, if you actually are representing both of these things. Vikrant Varma: Yeah. This is also what I meant by adversaries. You might be in a world where an adversary is trying to make you believe the opposite of what’s true. And now, this adversary is modeling the world in which you would have to believe this and all the evidence that would have to be true. Daniel Filan: Right. And the language model is modeling what the adversary wants you to think or something. Vikrant Varma: That’s right. So, that’s another example. But I think maybe there’s a different kind of hope. I think the truth is represented simpler than agents beliefs, might be a hypothesis that turns out to be somewhat true up to some point. But I think this is a different kind of criteria from consistency. So, now we’re talking about, mechanistically, how are these concepts represented inside the model? And for example, if you had a way to say, “The model’s true beliefs are always linearly represented, but beliefs about other agents, they’re not linearly represented; and therefore, we can be confident that linear probes are recovering truth.” That would be an example of a mechanistic structure that you might hope to use. I think we don’t currently have any candidates for that, but it’s a good area to look at. Daniel Filan: Yeah. Are there things analogous to that that we’ve learned? Basically, I’m trying to wonder: if I wanted to prove or disprove this, what kind of thing would I do? And the one thing I can think of is there’s some research about: do convolutional neural networks learn texture or color first? And it turns out there’s a relatively consistent answer. I’m wondering if you can think of any analogous things about neural networks that we’ve learned that we can maybe… Vikrant Varma: Yeah. There’s quite a few options presented in the eliciting latent knowledge report . So for example, one of the things you might hope is that if the model is simulating other entities, then maybe it’s trying to figure out what’s true in the world before it does that. And so, you might expect earlier belief-like things to be true, and later belief-like things to be agents’ beliefs. Or similarly, you might expect that if you try to look for things under a speed prior, as in beliefs that are being computed using shorter circuits, then maybe this is more likely to give you what’s actually true, because it takes longer circuits to compute that plus what some agent is going to be thinking. So, that’s a structural property that you could look for. Daniel Filan: Yeah. I guess it goes back to the difficulty of eliciting latent knowledge. In some ways, I guess the difficulty is: if you look at standard Bayesian rational agent theory, the way that you can tell that some structure is an agent’s beliefs is that it determines how the agent bets and what the agent does. It tries to do well according to its own beliefs. But if you’re in a situation where you’re worried that a model is trying to deceive you, you can’t give it Scoobie snacks or whatever for saying things that… You can’t hope to get it to bet on its true beliefs, if you’re going to allow it access to the world based on whether you think its true beliefs are good, or stuff like that. I don’t know, it seems tricky. CCS as principal component analysis Daniel Filan: I have some other minor questions about the paper. Firstly, we mentioned this post by Scott Emmons , and one of the things he says is that principal component analysis, this method where you find the maximum-variance direction and just base your guess on the model beliefs based on where the thing lies in this maximum-variance direction. He says that this is actually similar to CCS in that you’re encoding something involving confidence and also something involving coherence. And that might explain why PCA and CCS are so similar. I’m wondering what do you think about that take? Vikrant Varma: Is a summary of this take that most of the work in CCS is being done by the contrast pair construction rather than by the consistency loss? Daniel Filan: It’s partly that, and also partly if you decompose “what’s the variance of X minus Y”, you get expectation of X squared plus expectation of Y squared minus twice the [expectation] of XY, and then some normalization terms of variance of X squared… Sorry. Expectation of X all squared, expectation of Y all squared, and then another covariance term. Basically, he’s saying like, “Look, if you think of a vector that maximizes the outer product of that vector, the variance and itself, you’re maximizing the outer product of the variant of that vector with expectation X squared plus expectation Y squared.” Which ends up being the confidence of classification according to that vector. And then, you’re subtracting off the covariance, which is basically saying, is the vector giving high probability for both yes and no? Or is the vector giving low probability for both yes and no? And so, basically, the take is just because of the mathematical properties of variance and what PCA is doing, you end up doing something kind of similar to PCA. I’m wondering if you have thoughts on this take? Vikrant Varma: Yeah, that’s interesting. I don’t remember reading about this. It sounds pretty plausible to me. I guess one way I’d think about it intuitively is that if you’re trying to find a classifier on these difference vectors, contrast pair difference vectors, then for example, you want to be maximizing the margin between these two. And this is a bit like trying to find a high contrast thing. So overall, it feels plausible to me. Explaining grokking through circuit efficiency Daniel Filan: Gotcha. Okay. So, if it’s all right with you, I’d like to move on to the paper ‘Explaining grokking through circuit efficiency’ . Vikrant Varma: Perfect. Let’s do it. Daniel Filan: Sure. This is a paper you wrote with Rohin Shah , Zachary Kenton , János Kramár , and Ramana Kumar . You’re explaining grokking. For people who are unaware or need a refresher, what is grokking? Vikrant Varma: So in 2021, Alethea Power and other people at OpenAI noticed this phenomenon where when you train a small neural network on an algorithmic task, initially, their network overfit, so it got very low training loss and high test loss. And then, they continued training it for about 10 times longer and found that it suddenly generalized. So, although training loss stayed low and about the same, test loss suddenly fell. And they dubbed this phenomenon “grokking”, which I think comes from science fiction and means “suddenly understanding”. Why research science of deep learning? Daniel Filan: Okay, cool. And basically, you want to explain grokking. I guess a background question I have is, it feels like in the field of AI alignment, or people worried about AI taking over the world, there’s a sense that it’s pretty important to figure out grokking and why it’s happening. And it’s not so obvious to me why it should be considered so important, given that this is a thing that happens in some toy settings, but to my knowledge, it’s not a thing that we’ve observed on training runs that people actually care about. So I guess it’s a two-part question: firstly, just why do you care about it? Which could be for any number of reasons. And secondly, what do you think its relationship is to AI safety and AI alignment? Vikrant Varma: I think back in 2021, there were two reasons you could have cared about this as an alignment researcher. One is on the surface it looks a lot like a network was behaving normally, and then suddenly it understood something and started generalizing very differently. The other reason is this is a really confusing phenomenon in deep learning, and it sure would be good if we understood deep learning better. And so, we should investigate confusing phenomena like grokking, [even] ignoring the superficial similarity to a scenario that you might be worried about. Daniel Filan: Okay. Where the superficial scenario is something like: the neural network plays nice, and then suddenly realizes that it should take over the world, or something? Vikrant Varma: That’s right. And I think I can talk a bit more about the second reason or the overall science of deep learning agenda, if you like. Is that a useful thing to go into now? Daniel Filan: I guess maybe why are you interested in grokking? Vikrant Varma: For me, grokking was one of those really confusing phenomena in deep learning, like deep double descent or over-parameterized networks generalizing well, that held out some hope of if you understand this phenomenon, maybe you’ll understand something pretty deep about how we expect real neural networks to generalize and what kinds of programs we expect deep learning to find. It was a puzzling phenomenon that somebody should investigate, and we had some ideas for how to investigate it. Daniel Filan: Gotcha. I’m wondering if you think just, in general, AI alignment people should spend more effort or resources on science of deep learning issues. Because there’s a whole bunch of them, and not all of them have as much presence from the AI alignment community. Vikrant Varma: I think it’s an interesting question. I want to decompose this into how dual-use is investigating science of deep learning, and do we expect to make progress and find alignment-relevant things by doing it? And I’m mostly going to ignore the first question right now, but we can come back to it later if you’re interested. I think for the second question, it feels pretty plausible to me that investigating science of deep learning is important and tractable and neglected. I should say that a lot of my opinions here have really come from talking to Rohin Shah about this, who is really the person who’s, I think, been trying to push for this. Why do I think that? I think it’s important because: similar to mechanistic interpretability, the core hope for science of deep learning would be that you’re able to find some information about what kinds of programs your training process is going to learn, and so therefore, how it will generalize in a new situation. And I think a difference from mech[anistic] interp[retability] is… This is maybe a simplified distinction, but one way you could draw the line is that mech. interp. is more focused on reverse-engineering a particular network and being able to point at individual circuits and say, “Here’s how the network is doing this thing.” Whereas, I think science of deep learning is trying to say, “Okay. What kinds of things can we learn in general about a training process like this with a dataset like this? What are the inductive biases? How does the distribution of programs look like?” And probably both science of deep learning, and mech. interp. have quite a lot of overlap, and techniques from each will help the other. That’s a little bit about the importance bit. I think it’s both tractable and neglected in the sense that we just have all of these confusing phenomena. And for the most part, I feel like industry incentives are not that aligned with trying to investigate these phenomena really carefully and doing a very careful natural sciences exploration of these phenomena. So in particular, iterating between trying to come up with models or theories for what’s happening and then making empirical predictions with those theories, and then trying to test that, which is the kind of thing we tried to do in this paper. Daniel Filan: Okay. Why do you think industry incentives aren’t aligned? Vikrant Varma: I think it’s quite a high risk, high reward sort of endeavor. And in the period where you’re not making progress on making loss go down in a large model, it’s maybe harder to justify putting a lot of effort into that. On the other hand, if your motivation is “If we understood this thing, it could be a really big deal for safety”, I think making the case as an individual is easier. Even from a capabilities perspective, I think the incentives to me seem stronger than what people seem to be acting on. Daniel Filan: I guess there’s something puzzling about why there would be this asymmetry between some sort of corporate perspective and some sort of safety perspective. I take you to be saying that, “Look, there are some insights to be found here, but you won’t necessarily get them tomorrow. It’ll take a while, it’ll be a little bit noisy. And if you’re just looking for steady incremental progress, you won’t do it.” But it’s not obvious to me that safety or alignment people should care more about steady incremental progress than people who just want to maximize the profit of their AI, right? Vikrant Varma: You mean [“safety people should] care less about that”? Daniel Filan: Yeah. It’s not obvious to me that there would be any difference. Vikrant Varma: Right. I think one way you could think about it, from a safety perspective, is multiple uncorrelated bets on ways in which we could get a safer outcome. I think probably a similar thing applies for capabilities except that… And I’m really guessing and out of my depth here, but my guess would be that for whatever reason, it’s harder to actually fund this kind of research, this kind of very exploratory, out-there research, from a capabilities perspective, but I think there is a pretty good safety case to make for it. Daniel Filan: Yeah, I guess it’s possible that it’s just a thing where it’s hard to … I don’t know, if I’m a big company, right, I want to have some way of turning my dollars into people solving a problem. One model you could have is for things that could be measured in “how far down did the loss go?” It’s maybe just easier to hire people and be like, “Your job is to put more GPUs on the GPU rack” or “your job is to make the model bigger and make sure it still trains well”. Maybe it’s harder to just hire a random person off the street and get them to do science of deep learning. That’s potentially one asymmetry I could think of. Vikrant Varma: Yeah, I think it’s also just: I genuinely feel like there are way fewer people who could do science of deep learning really well than people who could make the loss go down really well. I don’t think this fundamentally needs to be true, but it just feels true to me today based on the number of people who are actually doing that kind of scientific exploration. Daniel Filan: Gotcha. When I asked you about the alignment case for science of deep learning, [you said] there’s this question of dual use and then there was this question of what alignment things there might be there, and you said you’d ignore the dual use thing. I want to come back to that.