WEBVTT

1
00:00:04.000 --> 00:00:05.010
David Bau: compliance.

2
00:00:05.110 --> 00:00:06.189
David Bau: Hey, how's it going?

3
00:00:10.340 --> 00:00:15.069
David Bau: Okay, yes, that worked. Okay, too much to cover again today. Sorry, you guys.

4
00:00:15.360 --> 00:00:17.580
David Bau: Converse implications down the street.

5
00:00:17.830 --> 00:00:22.290
David Bau: So, okay, so I'll just dive in.

6
00:00:22.670 --> 00:00:29.709
David Bau: So I… so we've had a few meetings, and, you know, I'll need that.

7
00:00:30.700 --> 00:00:35.390
David Bau: Okay. Actually, Hayu, is Hayu here? Yep. You had an initial question.

8
00:00:35.860 --> 00:00:39.069
David Bau: And, so now, I'm gonna head on right now.

9
00:00:39.460 --> 00:00:43.940
David Bau: Hayu, what was your question? So, what does it mean for the model to linearly…

10
00:00:44.480 --> 00:00:53.380
David Bau: what does it mean for the model to linearly represent truth, right? So one of the papers you guys read, first one was this Geometry of Truth paper.

11
00:00:53.500 --> 00:00:57.809
David Bau: And they had these nice scatter plots, which were supposed to be suggestive of something.

12
00:00:59.680 --> 00:01:03.229
David Bau: And so, none of the papers actually…

13
00:01:03.470 --> 00:01:07.100
David Bau: It's actually really hard to find any modern paper

14
00:01:07.290 --> 00:01:12.680
David Bau: that pins down a definition of what it means for something to be linearly represented.

15
00:01:13.130 --> 00:01:22.189
David Bau: But… but this is… so I'll make up a definition, and you can see this in some old papers. This is an old definition, right? So let's say…

16
00:01:22.410 --> 00:01:26.040
David Bau: You have a whole bunch of neurons in your neural network.

17
00:01:26.270 --> 00:01:35.560
David Bau: And you have the numbers that say how much the neurons are activated in a vector X, right? And for different stimuli,

18
00:01:36.030 --> 00:01:41.110
David Bau: this X will be different. That's your… X is your representation.

19
00:01:41.250 --> 00:01:45.489
David Bau: What it means for something to be linearly represented in that vector is just…

20
00:01:45.660 --> 00:01:55.010
David Bau: this, is that you've got some scalar, you've got some attribute, some property A, that you care about. Like, is this sentence true, right? Or is this…

21
00:01:55.230 --> 00:02:02.519
David Bau: Is this, AI being truthful or not? Something like that. That might be… that might be your scalar output. Maybe…

22
00:02:02.860 --> 00:02:07.770
David Bau: A big number means yes, true. Maybe a small number means no, false.

23
00:02:08.060 --> 00:02:08.820
David Bau: Right?

24
00:02:09.440 --> 00:02:13.500
David Bau: And what it means for it to be linearly represented is that you can just

25
00:02:13.770 --> 00:02:16.059
David Bau: Do what's called a linear readout.

26
00:02:16.300 --> 00:02:17.630
David Bau: of the neurons.

27
00:02:17.760 --> 00:02:22.729
David Bau: You can just take a weighted sum of all the neurons, or equivalently, a dot product with some

28
00:02:22.860 --> 00:02:25.130
David Bau: fixed vector W.

29
00:02:25.590 --> 00:02:30.500
David Bau: And that'll give you some number, which is just your readout of this.

30
00:02:30.770 --> 00:02:32.160
David Bau: of this property.

31
00:02:32.670 --> 00:02:36.070
David Bau: Now, for truth, maybe the property is true or false.

32
00:02:36.280 --> 00:02:43.419
David Bau: So, you know, I think there's some details here that I've omitted. Like, you might… you might threshold it at some value, say if it's…

33
00:02:43.460 --> 00:02:59.890
David Bau: bigger than that value you say it's true, and if it's smaller than that value you say it's false. It might not be 100% accurate, but, you know, if it's good enough, say 80 or 90% accurate, people would say, oh yeah, this information is in here. It's linearly readable. Does that make sense? Does that sort of answer the question? So that's all it means.

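(A minimal sketch of the linear readout and threshold just described, using made-up toy activations; the names and numbers are illustrative, not from any of the papers.)

```python
# A "linear readout" of a property like truth: a dot product with a fixed vector w,
# optionally followed by a threshold.
import numpy as np

rng = np.random.default_rng(0)
d, n = 512, 1000                       # neurons per representation, number of sentences

# Toy stand-in data: X[i] is the activation vector for sentence i,
# y[i] says whether that sentence is actually true (1) or false (0).
y = rng.integers(0, 2, size=n)
w_true = rng.normal(size=d)            # pretend the model encodes truth along this direction
X = rng.normal(size=(n, d)) + np.outer(2 * y - 1, w_true)

w = w_true                             # in practice you would have to find w somehow
scores = X @ w                         # the linear readout: one number per sentence

tau = 0.0                              # the threshold: bigger than tau means "true"
predictions = (scores > tau).astype(int)
print(f"readout accuracy: {(predictions == y).mean():.1%}")
```
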
34
00:03:00.290 --> 00:03:03.770
David Bau: For something to be linearly… encodable, okay, so…

35
00:03:04.180 --> 00:03:20.530
David Bau: Yeah, so it's all about dot products. Like, everybody learns what dot products are from linear algebra, and then when you see a scatter plot, like what Sam Marks put in this paper, you know, you can see that this is… the reason he thought it was obvious, after he put up his scatter plot, that this is linearly readable.

36
00:03:20.610 --> 00:03:25.349
David Bau: Because you can see this clear separation. Now, he didn't draw this vector here.

37
00:03:25.570 --> 00:03:29.680
David Bau: But this vector would be, like, the weight vector W that you're dot-producting with.

38
00:03:29.960 --> 00:03:48.719
David Bau: And so if you dot-product with this thing, then wherever the dot product is positive, it's in the red area, and if it's negative, it's in the blue area. And when you do the linear readout, you know, you can make a little histogram of, like, how many dots were negative, and how many dots were positive when you did this. So, you know, I recommend, if you…

39
00:03:48.950 --> 00:03:50.850
David Bau: You know, do this with your own…

40
00:03:51.550 --> 00:03:55.680
David Bau: Your own, data, you know, like histograms like this, and you can kind of see

41
00:03:56.070 --> 00:03:57.979
David Bau: You know, how well it's separating.

42
00:03:58.590 --> 00:03:59.559
David Bau: you know, different…

43
00:04:00.080 --> 00:04:09.269
David Bau: different things. So all the things that are supposed to be negative are over here, and all the things that are supposed to be positive are over here. So why do people believe that anything would be linearly readable?

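(A small sketch of the histogram recipe just described, on simulated data; with real activations you would replace the fake X arrays with representations from your own stimuli.)

```python
# Project the two groups onto a candidate direction w and histogram the projections,
# to eyeball how well a linear readout separates them.
import numpy as np

rng = np.random.default_rng(1)
d, n = 64, 500
w = rng.normal(size=d)
X_true  = rng.normal(size=(n, d)) + 0.2 * w   # toy "true" activations, shifted along +w
X_false = rng.normal(size=(n, d)) - 0.2 * w   # toy "false" activations, shifted along -w

proj_true, proj_false = X_true @ w, X_false @ w
bins = np.linspace(min(proj_false.min(), proj_true.min()),
                   max(proj_false.max(), proj_true.max()), 20)
hist_true, _ = np.histogram(proj_true, bins)
hist_false, _ = np.histogram(proj_false, bins)
for lo, hi, t, f in zip(bins[:-1], bins[1:], hist_true, hist_false):
    print(f"[{lo:7.1f},{hi:7.1f})  true {'#' * int(t):<25} false {'#' * int(f)}")
```
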
44
00:04:10.010 --> 00:04:12.779
David Bau: in a neural network. The reason is because

45
00:04:12.950 --> 00:04:15.330
David Bau: It's how neural networks work internally.

46
00:04:15.880 --> 00:04:20.029
David Bau: So, you know, we draw all these network diagrams of the neural network,

47
00:04:20.390 --> 00:04:24.059
David Bau: where every neuron takes as input

48
00:04:24.510 --> 00:04:41.130
David Bau: a weighted sum of the outputs of other neurons. All this is is a dot product. All this is is a linear readout. So, if you think of a deep neural network, all it's doing, at every layer, is that every neuron is one of these operations, one of these linear readouts. So.

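(A one-line version of the point above: a layer of a network is just many linear readouts at once, with the nonlinearity applied only afterward. The shapes here are arbitrary.)

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=512)             # activations of the previous layer
W = rng.normal(size=(4096, 512))     # one weight (readout) vector per output neuron
b = rng.normal(size=4096)

pre_activation = W @ x + b           # 4096 dot products, i.e. 4096 linear readouts
h = np.maximum(pre_activation, 0)    # the nonlinearity (ReLU here) only comes after
```
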
49
00:04:41.310 --> 00:04:44.749
David Bau: So, so the hypothesis is…

50
00:04:44.900 --> 00:04:50.810
David Bau: that all the useful information in a neural network, maybe,

51
00:04:51.280 --> 00:04:54.580
David Bau: Is actually able to be read out linearly, because

52
00:04:54.800 --> 00:05:00.670
David Bau: If it weren't able to be read out linearly, the neural network itself wouldn't be able to access it.

53
00:05:00.790 --> 00:05:04.609
David Bau: Right, because that's all the neural network has as a tool for reading its own

54
00:05:04.960 --> 00:05:12.850
David Bau: information. So that's sort of the hypothesis. Yes, go ahead. Does this linearity decrease even after we add

55
00:05:13.030 --> 00:05:22.560
David Bau: deliberately add nonlinearity to the network? So then… so you're saying, oh, typically you add a little nonlinearity after you do this, to threshold it…

56
00:05:22.900 --> 00:05:31.640
David Bau: That's right. And it's the same detail here where I say, oh, you know, you might add a thresholding or something like that. And so, that's… that's equivalent to, let me…

57
00:05:31.920 --> 00:05:33.859
David Bau: I'll come to that in a minute.

58
00:05:34.100 --> 00:05:39.180
David Bau: But that's sort of equivalent to saying, hey, you know, at some point, we might want to draw a separating plane.

59
00:05:39.490 --> 00:05:47.919
David Bau: Here? So, all linearity really means is, even if you have, like, a separating plane where you're gonna have something nonlinear, like a decision.

60
00:05:48.240 --> 00:06:02.799
David Bau: You know, with the linear readout, all you can really do is slide that plane around. You can't make it into a curve or something like that, right? You just gotta slide it around. Now, obviously, you can make curves and stuff if you have multiple layers of a neural network.

61
00:06:03.030 --> 00:06:07.309
David Bau: But if you just have one layer, then all you can do is basically read out

62
00:06:07.560 --> 00:06:09.570
David Bau: A straight… straight line.

63
00:06:09.780 --> 00:06:17.259
David Bau: It has to read out something linear. And so, so the linear readout hypothesis is that

64
00:06:17.680 --> 00:06:24.599
David Bau: Well, the neural network is going to curve all the data around, but at the moment that it's got to use it in a useful way.

65
00:06:25.180 --> 00:06:31.840
David Bau: And maybe it's… the useful information is linearly readable at that… at that moment, and you should be able to read it too.

66
00:06:32.070 --> 00:06:37.670
David Bau: by finding the right linear probe. So that's… that's what all these… papers that I…

67
00:06:37.840 --> 00:06:43.800
David Bau: assigned are all about: linear readouts. Okay? Okay, so I like this other question.

68
00:06:44.880 --> 00:06:51.300
David Bau: Eunice asked this question, and Kavala asked, oh, another related question. Yeah, so what was Eunice's question?

69
00:06:51.780 --> 00:06:56.520
David Bau: Is your audio working? Do you want to ask your question?

70
00:06:56.830 --> 00:07:00.440
Emre Tapan: Yeah, yeah, I am on Zoom, so, can you hear me?

71
00:07:00.570 --> 00:07:01.589
David Bau: Yes, it's beautiful.

72
00:07:01.590 --> 00:07:10.940
Emre Tapan: Cool. So all of the papers kind of tried to prove that concepts are linearly represented, but

73
00:07:13.330 --> 00:07:22.279
Emre Tapan: Are there any cases where concepts are kind of nonlinear, or clustered, or, I don't know, what are the… what are the other options?

74
00:07:22.280 --> 00:07:29.369
David Bau: Now, what are the other options? If everything's so linear, then, like, what's not linear? Right? I mean, I think that's a really good way of understanding

75
00:07:29.610 --> 00:07:37.060
David Bau: like, what it even means to have a hypothesis. If it's not falsifiable, then we're not even doing science, right? So, yes.

76
00:07:37.310 --> 00:07:47.179
David Bau: you can definitely have nonlinear encodings, and my… and the, like, my clearest example is actually from the oldest neural network paper in history. So who's ever heard of

77
00:07:47.360 --> 00:07:50.410
David Bau: the Rumelhart, Hinton, and Williams paper.

78
00:07:50.930 --> 00:08:03.349
David Bau: Anybody ever heard of this paper? Yeah, okay. My own grad students have heard of it, because I've made them read it, right? So, this is the first neural network paper ever. This is what won Geoff Hinton the Turing Award.

79
00:08:03.590 --> 00:08:08.849
David Bau: And it's where they basically told everybody, oh, you can train neural networks using backpropagation.

80
00:08:09.730 --> 00:08:15.429
David Bau: And so, now, a lot of people, including, what's, what's his name?

81
00:08:15.760 --> 00:08:18.170
David Bau: The guy who thinks we invented everything?

82
00:08:18.880 --> 00:08:34.409
David Bau: No, not Yann LeCun. Yeah, Schmidhuber. Yeah, Schmidhuber. Who invented, you know, the LSTM and several things, so he's…

83
00:08:34.470 --> 00:08:39.340
David Bau: He is an inventor of a lot of things, but he likes to claim he invented everything. So, Schmidhuber

84
00:08:39.480 --> 00:08:46.199
David Bau: always… This is his least favorite paper, because he says, well, you know.

85
00:08:46.340 --> 00:08:51.030
David Bau: Hinton didn't invent backpropagation. You know, Isaac Newton invented

86
00:08:51.290 --> 00:08:57.460
David Bau: You know? Why did Hinton win, like, the Turing Award for this?

87
00:08:57.700 --> 00:09:15.080
David Bau: And then, you know, furthermore, not just Hinton, but, like, all these other graduate students between Isaac Newton and Hinton, you know, dug into it a lot more than Hinton did. So Hinton wrote this one-pager for Nature and won a Turing Award, like, why did he get this credit? And if you go and read the paper,

88
00:09:15.520 --> 00:09:21.500
David Bau: The main idea of the paper is not that you can use derivatives to optimize a system. Everybody knew that.

89
00:09:21.600 --> 00:09:22.410
David Bau: Right?

90
00:09:23.160 --> 00:09:26.699
David Bau: The main idea is… was an interpretability thing.

91
00:09:26.860 --> 00:09:31.610
David Bau: He's like, after you train a bunch of neural networks, and you look inside them.

92
00:09:32.050 --> 00:09:45.999
David Bau: they learn really interesting representations, and so he has a series of neural networks where he tears apart each one after he's trained it with gradient descent to show that. So he's like the first interpretability researcher. It's really what he won the award for.

93
00:09:46.700 --> 00:09:58.279
David Bau: And so, but his neural networks were too small, and a lot of them didn't follow the principles that we generally see in very large neural networks. So, for example, this is one of the networks he studied.

94
00:09:58.480 --> 00:10:10.169
David Bau: It's a network with, I think, 8 neurons, 9 neurons, something like that, if you count the input nodes as neurons, which maybe you wouldn't even count because they're just inputs, in which case it just has 3 neurons.

95
00:10:10.310 --> 00:10:11.000
David Bau: Right?

96
00:10:11.290 --> 00:10:11.990
David Bau: And…

97
00:10:12.270 --> 00:10:26.390
David Bau: And it's supposed to detect, like, palindromes, or pixel patterns that are the same on the top as on the bottom, mirrors of each other, right? So he trains this thing as a mirror detector.

98
00:10:26.570 --> 00:10:35.529
David Bau: And if you put a symmetric pattern in, then this output will say, yeah, I got a mirror. And if you put a not-symmetric pattern in, it'll say, it didn't. It took him…

99
00:10:35.560 --> 00:10:48.840
David Bau: you know, a long time, many hours to train this thing, you know, you could probably train it in a millisecond today. And, and these are the trained weights that come out. And the interesting thing, if you think about this encoding, it's very non-linear.

100
00:10:48.950 --> 00:11:02.729
David Bau: Right, what it learns. And why is it nonlinear? It's because, you know, these neurons that he's learning are… have learned this encoding, which is very strange. For example, if this neuron reads 3.6,

101
00:11:02.820 --> 00:11:09.259
David Bau: is what comes into it, because the weight is 3.6, it means everything is zero, except for

102
00:11:09.570 --> 00:11:10.750
David Bau: The first pixel.

103
00:11:11.040 --> 00:11:11.840
David Bau: Right?

104
00:11:11.950 --> 00:11:14.400
David Bau: And if it reads, like, negative 7.1,

105
00:11:14.530 --> 00:11:24.509
David Bau: it means everything's zero except for the second pixel, right? And if it reads negative 3.5, it means both pixels 1 and 2 are on. Now, the reason this is non-linear is because

106
00:11:24.720 --> 00:11:29.429
David Bau: linear things scale their meaning when you scale the value.

107
00:11:29.940 --> 00:11:34.760
David Bau: And so, really what should happen is, if you, like, scale this up.

108
00:11:35.060 --> 00:11:42.440
David Bau: you know, if we made this negative, it should mean, like, pixel 1 is negative, or something like that. And if you doubled it, it shouldn't suddenly become, like, pixel 2.

109
00:11:42.720 --> 00:11:52.779
David Bau: It shouldn't have anything to do with pixel 2, it should still just have to do with pixel 1. But it changes its meaning, and then you can kind of see when it gets to 14.2, it means, oh, it has nothing to do with pixels 1 and 2, it has to do with pixel 6.

110
00:11:53.250 --> 00:11:54.040
David Bau: Right?

111
00:11:54.260 --> 00:11:58.870
David Bau: So, what he's done is he's learned this neural network, which splits up the number line.

112
00:11:59.040 --> 00:12:11.820
David Bau: And for every different part of the number line, it has, like, a pretty unrelated semantic meaning. He's using arithmetic coding here, or this neural network has learned how to do arithmetic coding, but that's different from a linear…

113
00:12:12.220 --> 00:12:17.430
David Bau: representation, because it doesn't scale its meaning with the scale of the value.

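(A rough illustration of the kind of encoding being described; these are not the paper's trained weights, just hand-picked weights in the same spirit: mirror-antisymmetric, with magnitudes that roughly double.)

```python
# A single hidden unit whose incoming sum is exactly zero for mirror-symmetric
# 6-pixel patterns, and whose non-zero values carve the number line into regions
# with quite unrelated meanings -- the "arithmetic coding" flavor, not a linear one.
import itertools
import numpy as np

w = np.array([1.0, -2.0, 4.0, -4.0, 2.0, -1.0])

values = {}
for bits in itertools.product([0, 1], repeat=6):
    s = float(w @ np.array(bits, dtype=float))    # the hidden unit's incoming sum
    assert (s == 0.0) == (bits == bits[::-1])     # zero iff the pattern is a mirror
    values.setdefault(s, []).append(bits)

# Nearby values do not have nearby meanings: scaling or negating s does not just
# scale "how symmetric" the input is, it points at different pixel patterns entirely.
for s in sorted(values)[:4]:
    print(s, values[s])
```
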
114
00:12:18.440 --> 00:12:22.929
David Bau: For truth, it would be like, oh, if something was already true and you doubled it,

115
00:12:23.170 --> 00:12:27.700
David Bau: It would just mean… like, more true. Like, I'm more certain that it's true.

116
00:12:27.980 --> 00:12:33.920
David Bau: Right, and then there's some threshold where it's false, and then if you doubled it and made it more false, or more negative,

117
00:12:34.120 --> 00:12:39.129
David Bau: then it'd be like, oh, that's more false, I'm more certain it's false, or something like that, right? But here, it doesn't work that way.

118
00:12:39.380 --> 00:12:53.379
David Bau: And so, yeah, so there's definitely… and it is a neural network, so, like, neural networks can learn nonlinear encodings, it's just… it's just that we tend to not see them in large networks. It's, like, they're pretty hard to learn, they don't show up that often, but it's not impossible.

119
00:12:54.140 --> 00:12:55.000
David Bau: Make sense?

120
00:12:55.340 --> 00:12:58.489
David Bau: So, actually, I've always been puzzled by this.

121
00:12:58.690 --> 00:13:00.660
David Bau: When I was starting…

122
00:13:00.900 --> 00:13:06.850
David Bau: studying neural networks. Have you ever thought about this? Like, how much information can be stored in one neuron?

123
00:13:07.270 --> 00:13:10.429
David Bau: Like, that's weird, right? Because a neuron is, like, a real number.

124
00:13:10.910 --> 00:13:11.610
David Bau: Right?

125
00:13:11.750 --> 00:13:13.170
David Bau: How many digits go in a real number?

126
00:13:14.410 --> 00:13:16.570
David Bau: 6, how many digits can go in a real number?

127
00:13:19.460 --> 00:13:20.850
David Bau: Like, infinite, right?

128
00:13:21.760 --> 00:13:27.260
David Bau: I guess it depends on the hardware you choose, but that's not a real number, that's a floating point number. What about a real number?

129
00:13:27.370 --> 00:13:31.610
David Bau: Like, what if it was, like, a biological neuron? It's got, like, a real… like, a real activation.

130
00:13:31.690 --> 00:13:45.530
David Bau: Right? With a real voltage? Like, how many digits can go in a real number, in a real biological neuron? It's, like, infinite, right? So, that leads you to this puzzle. It's like, how much information can you store in a real number? Like, if I write out a real number, like, I can…

131
00:13:45.610 --> 00:13:55.229
David Bau: you know, I could encode, like, the whole Declaration of Independence as ASCII, and then put it in the digits of a real number, and, like, a single neuron should be able to encode that, right? So that's what's going on

132
00:13:55.380 --> 00:14:03.030
David Bau: With these neurons here, it's like… like, each one of these neurons is, like, encoding a whole 6-bit pattern.

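(A toy version of the puzzle: exact rational arithmetic stands in for an idealized infinite-precision neuron; a real float32 activation obviously could not hold this, and the text encoded here is just a short stand-in.)

```python
# Pack an arbitrary bit string into the binary digits of one number in [0, 1),
# then read it back out digit by digit.
from fractions import Fraction

message = "WE HOLD THESE TRUTHS"                 # stand-in for a longer document
bits = "".join(f"{ord(c):08b}" for c in message)
activation = Fraction(int(bits, 2), 2 ** len(bits))   # one "neuron value" holding it all

recovered_bits = ""
x = activation
for _ in range(len(bits)):
    x *= 2
    if x >= 1:
        recovered_bits += "1"
        x -= 1
    else:
        recovered_bits += "0"
recovered = "".join(chr(int(recovered_bits[i:i + 8], 2)) for i in range(0, len(bits), 8))
print(recovered == message)                       # True
```
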
133
00:14:03.350 --> 00:14:06.540
David Bau: And neural networks are able to learn this if you train them hard enough.

134
00:14:06.680 --> 00:14:15.139
David Bau: But… but it's, you know, it's really weird, right? Like, you might extrapolate this into, you know, sort of the Declaration of Independence thing.

135
00:14:15.520 --> 00:14:19.800
David Bau: Right. But it turns out it's really hard to train networks to

136
00:14:20.720 --> 00:14:33.129
David Bau: do this, and what we tend to get is networks that represent everything linearly. There's a way out of… so, if you're a machine learning person and you're puzzled by this thing about, like,

137
00:14:33.940 --> 00:14:37.079
David Bau: How can a single neuron encode an infinite number of bits?

138
00:14:37.390 --> 00:14:49.569
David Bau: You know, the thing to look at is all the variational stuff, like, variational autoencoders and things like this. They work through the mystery of

139
00:14:49.680 --> 00:15:00.099
David Bau: you know, how to control the number of bits in a neuron. And it's about reasoning about how the neurons are really operating in a noisy environment,

140
00:15:00.330 --> 00:15:01.060
David Bau: with the noise

141
00:15:01.170 --> 00:15:02.320
David Bau: That they might have.

142
00:15:02.710 --> 00:15:07.190
David Bau: Okay, so… so this is… this is… this is what nonlinear looks like, but…

143
00:15:07.630 --> 00:15:16.430
David Bau: a lot of stuff is linear, including… so I'm… like, it frustrates me a little bit, when people say, oh, this is nonlinear.

144
00:15:16.770 --> 00:15:24.549
David Bau: And what they really mean is that some things are higher dimensional. They're not one-dimensional. So, like, I gave you a definition for linear readout.

145
00:15:24.650 --> 00:15:26.540
David Bau: Which is, like, for scalar things.

146
00:15:26.730 --> 00:15:38.480
David Bau: Right? But some things can be linear, even if they're not scalar, so that… so who asked this? So… oh, yes, so Grace asked about this, and Aria asked about this, so… Okay, Grace, what was your question?

147
00:15:38.640 --> 00:15:49.019
David Bau: Maybe you didn't know you asked me this. Yes, yes, sorry, my question was about the sentiment paper. Yes, yes. Saying, you know, that they found a…

148
00:15:49.070 --> 00:15:59.959
David Bau: single direction for sentiment. For sentiment. And, you know, it was happy if this is a big number, sad if it's a low number, right? Yeah, that's exactly right. And then in the limitations section, they say,

149
00:16:00.210 --> 00:16:16.990
David Bau: But we don't really know if this means that we can't, like, decompose it into… It's just a linear combination of a bunch of other directions, and so I'm like, what does it mean to say singular versus… Right, yeah, what does that even mean? Let me ask Aria to ask a version.

150
00:16:17.320 --> 00:16:25.669
David Bau: I think I asked basically exactly the same thing. Like, referring to the same conversation. Yeah. Yeah, about this, about this kind of thing.

151
00:16:26.060 --> 00:16:30.869
David Bau: And so, yeah, so one way of thinking about what they're really saying is that

152
00:16:31.100 --> 00:16:34.259
David Bau: Oh, the neural network might have a more sophisticated…

153
00:16:34.680 --> 00:16:41.000
David Bau: representation of sentiment that we haven't bothered to decode. It might have, you know, 10

154
00:16:41.320 --> 00:16:45.390
David Bau: Attributes in there saying, oh, you know,

155
00:16:45.980 --> 00:17:03.550
David Bau: I'm happy because, you know, I'm feeling fulfilled, and I think that things are true, and I think that things are ethically good, and I'm feeling full and happy with food and in a good social mood. You know, it might have a bunch of different axes of, like, you know, positive sentiment.

156
00:17:03.890 --> 00:17:09.850
David Bau: And maybe what they're doing is they're finding that, like, so if you had a bunch of different

157
00:17:10.550 --> 00:17:12.299
David Bau: Axes in a vector space.

158
00:17:12.680 --> 00:17:17.299
David Bau: You could… you could just define some weighted sum of all those positive…

159
00:17:17.430 --> 00:17:34.180
David Bau: types of sentiment, and say, this is, like, this is your average sentiment over all these attributes, and you could have one singular vector, which is just the average of all those vectors, and I think that what they're saying is, oh, we think that there might be a chance that we've just measured some average vector.

160
00:17:34.590 --> 00:17:35.410
David Bau: Here.

161
00:17:35.660 --> 00:17:41.349
David Bau: And… and if you were to look closer, you might be able to find that you could decompose it into

162
00:17:41.460 --> 00:17:44.500
David Bau: Other vectors which add up to that average.

163
00:17:44.900 --> 00:17:48.829
David Bau: And so, now, what does this mean?

164
00:17:49.290 --> 00:17:54.300
David Bau: What this means is, if you really wanted to understand sentiment, it's not a scalar.

165
00:17:54.620 --> 00:18:01.570
David Bau: Right? It might mean that there's 3 or 4 dimensions of sentiment that that paper didn't get into.

166
00:18:01.820 --> 00:18:16.390
David Bau: And is that linear? Does that make it non-linear now, because it's, like, 3 or 4 dimensions instead of one? And so… and I think… no, that's an abuse of terminology. It is still linear, if you can do this.

167
00:18:16.910 --> 00:18:22.870
David Bau: So, if you have, like, an n-dimensional, you know, attribute,

168
00:18:23.200 --> 00:18:27.449
David Bau: Like, three dimensions of sentiment, or something like that.

169
00:18:27.750 --> 00:18:32.000
David Bau: if you can read them each out using a dot product,

170
00:18:32.190 --> 00:18:37.340
David Bau: which is the same thing as having a matrix of weight vectors that you multiply by.

171
00:18:37.460 --> 00:18:40.819
David Bau: Right? Then it's still linearly decodable.

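(A minimal sketch of the multi-dimensional case just described, with made-up sizes: k readout vectors stacked into one matrix, plus the "average direction" worry from the sentiment discussion.)

```python
import numpy as np

rng = np.random.default_rng(3)
d, k = 768, 3                        # hidden size, number of sentiment-like axes
W = rng.normal(size=(k, d))          # one weight (readout) vector per row
x = rng.normal(size=d)               # a hidden representation

attribute = W @ x                    # k linear readouts at once: still linearly decodable
average_direction = W.mean(axis=0)   # a single "average" vector over those axes
# Probing with the average direction only recovers the average of the k readouts:
print(np.allclose(average_direction @ x, attribute.mean()))   # True
```
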
172
00:18:41.170 --> 00:18:41.850
David Bau: Right.

173
00:18:42.100 --> 00:18:44.609
David Bau: And I think this situation happens

174
00:18:44.760 --> 00:18:48.660
David Bau: A lot in large neural networks. In fact, we've seen it before.

175
00:18:49.280 --> 00:18:51.620
David Bau: Like, Logit Lens

176
00:18:51.740 --> 00:18:52.710
David Bau: does this.

177
00:18:53.420 --> 00:18:57.439
David Bau: We played with Logit Lens for a day or two, right?

178
00:18:57.570 --> 00:19:00.890
David Bau: So, when you look at Logit Lens, and…

179
00:19:01.110 --> 00:19:09.290
David Bau: And you see, oh, like, it's decoding this vector to "love", like, what you're doing is…

180
00:19:09.550 --> 00:19:16.130
David Bau: This is a… you're interpreting this using a 50,000 dimensional linear readout.

181
00:19:17.460 --> 00:19:19.679
David Bau: What you're doing is, love…

182
00:19:19.790 --> 00:19:27.300
David Bau: And every other word in the vocabulary corresponds to a linear readout, and what you're doing is you're reading out, like, kind of the log probability

183
00:19:27.420 --> 00:19:39.400
David Bau: that the model thinks it should assign to "love", but you're also reading out the log probability of every other word, and in Logit Lens, we just happen to, like, read them all out, and then we just show you the top one.

184
00:19:39.560 --> 00:19:41.090
David Bau: That's all that Logit Lens is.

185
00:19:41.270 --> 00:19:44.169
David Bau: That makes sense? But it's… but the process that we're using

186
00:19:44.440 --> 00:19:50.940
David Bau: It's a linear readout, so that… that what you're doing is, you're taking this representation.

187
00:19:51.550 --> 00:19:56.580
David Bau: And you're shoving it through… remember this little icon from first class, right? You're shoving it through the decoder head.

188
00:19:56.860 --> 00:20:02.599
David Bau: Which is just, you know, it's a fixed matrix of the size of the network.

189
00:20:02.840 --> 00:20:13.469
David Bau: And then it comes out with this vector, which is like, oh, this word has probability 3, "love" has probability 2, you know, some other word has probability 1, or whatever, right? It just gives you the numbers, and Logit Lens just shows you

190
00:20:13.950 --> 00:20:15.390
David Bau: A visualization of that.

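(A rough Logit Lens sketch, assuming the GPT-2 module names in the Hugging Face transformers library, `transformer.ln_f` and `lm_head`; other models name these differently. Each layer's hidden state is pushed through the decoder head, a vocabulary-sized linear readout, and the top words are printed.)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

inputs = tok("All you need is", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

last = inputs["input_ids"].shape[1] - 1
for layer, h in enumerate(out.hidden_states):
    vec = h[0, last]                                        # representation at the last token
    logits = model.lm_head(model.transformer.ln_f(vec))     # one score per vocabulary word
    top = torch.topk(logits, k=5).indices
    print(f"layer {layer:2d}:", tok.decode(top))
```
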
191
00:20:15.640 --> 00:20:16.659
David Bau: That makes sense.

192
00:20:16.950 --> 00:20:29.090
David Bau: Any questions about that? So, another place that we saw a multidimensional readout was in the third paper. Do you understand the third paper, or was that confusing? Let's talk about the third paper. I like it a lot.

193
00:20:29.300 --> 00:20:34.779
David Bau: But what they did… what Sheridan did in the third paper is set the weight matrix

194
00:20:34.980 --> 00:20:37.429
David Bau: to some crazy matrix, L sub C.

195
00:20:37.620 --> 00:20:41.170
David Bau: I forget where that comes from, which is the sum of some things.

196
00:20:41.350 --> 00:20:51.399
David Bau: And then that was, like, a linear readout of some concepts. And so, when Sheridan decoded, you know, "She traveled to Greece,"

197
00:20:51.650 --> 00:20:53.810
David Bau: and looked at the representation of Greece,

198
00:20:53.920 --> 00:20:58.800
David Bau: Sheridan is, like, reading out the representation of Greece into some multidimensional vector through this thing.

199
00:20:58.940 --> 00:21:05.130
David Bau: that, you know, Sheridan's been able to capture, you know, in some embeddings. So it's another linear readout.

200
00:21:06.930 --> 00:21:07.750
David Bau: Sense?

201
00:21:08.530 --> 00:21:17.780
David Bau: Yes. So, in the previous one, when we had the scalar, we could have used the threshold to figure out whether a particular concept is present or not. That's right.

202
00:21:17.990 --> 00:21:29.179
David Bau: But here, if it is multi-dimensional, what are we doing? Are we taking the norm of that, or are we looking in a particular direction? Yeah, so, you know, I think typically you bring it into a space that's

203
00:21:29.390 --> 00:21:34.930
David Bau: meaningful to you, and then… then that space is meaningful to you. I guess…

204
00:21:35.230 --> 00:21:44.100
David Bau: it's supposed to be meaningful, it's supposed to be like, oh, now it's up to you to understand what it means to be in that meaningful space. So, somebody asked a question, and I thought it was nice.

205
00:21:45.870 --> 00:21:54.809
David Bau: I'll just ask a question, because I don't see who asked it, about what if, what if you had, like, spatial information?

206
00:21:55.030 --> 00:22:12.400
David Bau: That was decoded. Well, maybe when you decode the thing, the space is kind of like latitude and longitude, or something like that, but maybe not exactly. Maybe you have to warp it a little bit for it to correspond to what we write down as latitude and longitude. But… but, you know, but it has… it has some comprehensible meaning.

207
00:22:12.710 --> 00:22:17.649
David Bau: That's the idea. And so, here, in these vocabulary cases,

208
00:22:17.900 --> 00:22:23.560
David Bau: Then, what we commonly see is you decode it out to a space.

209
00:22:23.750 --> 00:22:29.920
David Bau: whose dimension equals the size of the vocabulary, right? So, 50,000, just, like, one number for every word.

210
00:22:30.250 --> 00:22:35.499
David Bau: And then what comes out is a set of scores. A score for every word.

211
00:22:35.810 --> 00:22:43.329
David Bau: And then the way that we interpret it is we say we're interested in what the top-scoring word is, or what the top 5 scoring words are.

212
00:22:45.080 --> 00:22:55.240
David Bau: Sense? But, you know, so it could be different, but I think it's accurate to characterize all these as exercises in

213
00:22:55.860 --> 00:22:58.299
David Bau: reading some information linearly out of…

214
00:23:00.250 --> 00:23:11.869
David Bau: You briefly mentioned autoencoders? Oh, yes. I was wondering if what you are explaining here is similar to the concept where, with autoencoders,

215
00:23:12.290 --> 00:23:30.110
David Bau: you can basically linearly capture one specific concept, and the more you go into one direction, the more that concept is present. Yes, I think so, correct? I think that's right. So basically, you know, when I draw this kind of figure, it's intentionally evocative of

216
00:23:30.430 --> 00:23:34.849
David Bau: You know, you can read this as a linear readout of the neurons.

217
00:23:34.980 --> 00:23:50.870
David Bau: But also, you know, neural networks themselves do that. So, like, an autoencoder is just an example of, you know, a network architecture that also is built on linear readouts. And so, when you do an autoencoder, it's trying to come up with a small number of

218
00:23:50.980 --> 00:24:00.239
David Bau: scalars that are each some really informative, useful linear readout of the… of the feature vector.

219
00:24:01.790 --> 00:24:02.660
David Bau: Sense.

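(A small sketch of the connection just made, using a purely linear autoencoder, which is equivalent to PCA, as the simplest case; real sparse or variational autoencoders add nonlinearities, but the encoder still starts from linear readouts of the input vector. The data here is synthetic.)

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, k = 1000, 128, 8
X = rng.normal(size=(n, d)) @ rng.normal(size=(d, d)) * 0.1   # toy correlated "features"

X_centered = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
W_enc = Vt[:k]                             # k encoder rows = k linear readout vectors
latents = X_centered @ W_enc.T             # each latent is a dot product with the input
reconstruction = latents @ W_enc + X.mean(axis=0)
print("reconstruction error:", np.mean((X - reconstruction) ** 2))
```
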
220
00:24:03.650 --> 00:24:04.870
David Bau: Okay.

221
00:24:05.770 --> 00:24:10.130
David Bau: Let's see, were… any other questions that I missed? Aria got her question.

222
00:24:11.990 --> 00:24:24.500
David Bau: Guangyan had a question. Oh, I like Guangyan's question. Is Guangyan here? Here? Yes. So, yeah, that paper used, like, mass-mean probing, k-means, or logistic regression probes.

223
00:24:25.740 --> 00:24:41.200
David Bau: And why should this method, like, recover the sentiment axis? Right, the second paper. Yeah, so, it was a little sloppy, but it's okay. Yeah, what assumptions make, like, sentiment clusterable in residual space? Right, yeah, right.

224
00:24:41.200 --> 00:24:47.240
David Bau: Okay, I'm not sure I'll totally answer the second one. I think the second one is a little bit empirical.

225
00:24:47.330 --> 00:24:51.350
David Bau: You know, what makes, what makes the network

226
00:24:51.500 --> 00:24:59.109
David Bau: behave a certain way, why… why is sentiment clusterable in a representation? I think this is part of the mystery of deep networks, right? You know, we…

227
00:24:59.150 --> 00:25:12.700
David Bau: We train them on a lot of data, we hope that they learn something useful. Usually, when something useful is learned, they'll take the things that have a similar meaning from our point of view and put them close, and put things with a different meaning far apart.

228
00:25:12.990 --> 00:25:17.239
David Bau: You know, why did they pick up on sentiment?

229
00:25:17.540 --> 00:25:22.109
David Bau: Rather than some other thing that maybe they didn't pick up, we don't know, right? It's just…

230
00:25:22.290 --> 00:25:24.799
David Bau: It was useful for some reason, very creative.

231
00:25:24.910 --> 00:25:30.030
David Bau: So, but… you did ask this other question, like,

232
00:25:30.580 --> 00:25:39.899
David Bau: in that paper, they applied the mass-mean thing, k-means, logistic regression, PCA, DAS…

233
00:25:40.130 --> 00:25:43.279
David Bau: Right? And then they said, oh, we tried all these things, and they all came out the same.

234
00:25:43.850 --> 00:25:50.880
David Bau: So, that's not actually the lesson that I wanted you to learn from reading that paper, because they often come out different.

235
00:25:51.140 --> 00:26:02.790
David Bau: from each other. They are different techniques. If the data is very well behaved, then they will kind of come out the same. So, you know, I want to give you a little intuition

236
00:26:02.920 --> 00:26:06.960
David Bau: For what causes them to be different in the main case that you'll see.

237
00:26:07.110 --> 00:26:11.570
David Bau: Right, and so… so, okay, so here's, like, my favorite scenario.

238
00:26:12.270 --> 00:26:15.390
David Bau: Okay, so suppose… You had, like…

239
00:26:15.560 --> 00:26:24.740
David Bau: a whole bunch of students in a class, except different from this class, because you've got, like, a final exam. Sorry, you guys don't have a final exam. And it's a really hard final.

240
00:26:25.550 --> 00:26:27.630
David Bau: You know, half the students are gonna pass the final.

241
00:26:28.150 --> 00:26:32.199
David Bau: But half the students are gonna fail the final. Oh, it's terrible, all these red students are gonna be in trouble.

242
00:26:32.280 --> 00:26:50.099
David Bau: Right? But, you know, but we run the statistics, and we find out that there's this interesting correlation with, like, how the students previously did in the class, right? And so, what I'm going to do is I'm going to plot the two most important variables, which are their homework grades

243
00:26:50.760 --> 00:26:51.440
David Bau: Right?

244
00:26:51.690 --> 00:26:54.699
David Bau: And so here's, like, how the students did on the homework before.

245
00:26:55.080 --> 00:26:57.039
David Bau: And, and who passed and failed.

246
00:26:57.620 --> 00:27:01.059
David Bau: And then… and then here's, like, the minutes spent on homework.

247
00:27:01.570 --> 00:27:08.020
David Bau: Right? So it's, like, the effort that they spent in their homework. And that's also, like, relating to pass and fail.

248
00:27:08.230 --> 00:27:08.890
David Bau: Right?

249
00:27:09.010 --> 00:27:13.350
David Bau: And so we plot all this stuff, and then we say, oh, great, you know what I'm gonna do?

250
00:27:13.470 --> 00:27:17.100
David Bau: I'm gonna figure out what advice to give to students next year.

251
00:27:17.510 --> 00:27:28.390
David Bau: and to the teacher, who obviously shouldn't be failing half of the students; maybe they can do a better job at running their class, right? And so, and so I'm going to run logistic regression

252
00:27:28.680 --> 00:27:38.559
David Bau: and figure out what advice I should give. And so here's the logistic regression for a very realistic data set. So, what logistic regression is doing is it's trying to find a separating plane

253
00:27:39.110 --> 00:27:42.290
David Bau: between these two sets, with maximum margin. You can think

254
00:27:42.400 --> 00:27:57.989
David Bau: of it that way. So this is… so I actually ran logistic regression on this data. It's not real student data, so there's no privacy issue here. But I ran real logistic regression, and here's the advice. The logistic regression probe says:

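(A toy reconstruction of the kind of example being described; the data is made up, not the figures from class, but it shows how the logistic-regression probe direction can point the "wrong" way as advice even while classifying well.)

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)
n = 2000
ability = rng.normal(size=n)                            # hidden causal variable
passed = (ability + 0.3 * rng.normal(size=n) > 0).astype(int)
homework_grade = ability + 0.5 * rng.normal(size=n)
minutes_spent = 60 - 20 * ability + 10 * rng.normal(size=n)   # weaker students grind longer

X = np.column_stack([homework_grade, minutes_spent])
probe = LogisticRegression(max_iter=1000).fit(X, passed)
print("accuracy:", probe.score(X, passed))
print("probe direction:", probe.coef_)   # positive on grades, negative on minutes:
                                         # "raise grades, study less" -- fine as a
                                         # classifier, nonsense as an intervention.
```
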
255
00:27:58.390 --> 00:28:04.680
David Bau: you want to cross this line; you're on the bad side of the line, the failing-students side, here.

256
00:28:04.960 --> 00:28:09.360
David Bau: Right? We want to get the students to succeed, we should do more of this, go this way.

257
00:28:09.520 --> 00:28:16.019
David Bau: What is this way? We should cross the line. What is this way? It means we should give everybody higher homework scores.

258
00:28:17.190 --> 00:28:23.300
David Bau: Right? So, teacher, you're scoring your homework too hard. Give everybody A's on the homework.

259
00:28:24.030 --> 00:28:28.049
David Bau: Right? And we should make sure that students spend less time doing it.

260
00:28:31.100 --> 00:28:34.020
David Bau: Right? It seems like the total wrong advice.

261
00:28:34.260 --> 00:28:35.500
David Bau: Right, doesn't it?

262
00:28:35.900 --> 00:28:44.219
David Bau: Right, so… so this, like, for me, this is a good example of why logistic regression can be an excellent way of classifying

263
00:28:44.810 --> 00:28:48.760
David Bau: data, but it ends up with this… vector…

264
00:28:49.180 --> 00:28:54.020
David Bau: the probe vector, which is not… a causal…

265
00:28:54.500 --> 00:28:58.999
David Bau: Piece of advice on what to do to have a causal effect.

266
00:29:00.550 --> 00:29:01.619
David Bau: That makes sense?

267
00:29:01.930 --> 00:29:02.710
David Bau: Right.

268
00:29:02.840 --> 00:29:12.260
David Bau: And so there's… so when you hear people say, oh, correlation is not causality, and this and that, right? It's sort of related to this, right? It's a slightly different phenomenon, but it's related to this.

269
00:29:12.610 --> 00:29:17.449
David Bau: And so, so, nope, when, so…

270
00:29:17.810 --> 00:29:26.139
David Bau: So that's… so that's one of the methods: logistic regression. You see this a lot everywhere. It's very easy to train a logistic regression probe; everybody is

271
00:29:26.770 --> 00:29:30.450
David Bau: taught in machine learning, you should do this. So you see this all over the literature.

272
00:29:30.740 --> 00:29:35.559
David Bau: But what Sam Marks advocates in his paper

273
00:29:35.680 --> 00:29:38.600
David Bau: Is… oh, actually, maybe we want to use this other thing.

274
00:29:38.740 --> 00:29:42.939
David Bau: Much simpler than logistic regression. He gives it a fancy name:

275
00:29:43.060 --> 00:29:44.210
David Bau: mass-mean

276
00:29:44.460 --> 00:29:45.359
David Bau: probing.

277
00:29:45.540 --> 00:29:47.290
David Bau: What the heck is a mass mean probe?

278
00:29:47.850 --> 00:29:50.300
David Bau: Let me just take the average of all the green guys.

279
00:29:51.100 --> 00:29:53.070
David Bau: That's the mean, right here.

280
00:29:53.230 --> 00:29:57.049
David Bau: Take the average of all the red guys. That's the mean right there.

281
00:29:57.210 --> 00:30:01.509
David Bau: Right, and then the interesting vector is the vector that connects those two means, that's all it is.

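(A minimal sketch of the mass-mean probe as just described: the direction is simply the difference of the two class means; the data here is synthetic.)

```python
import numpy as np

def mass_mean_direction(X_pos, X_neg):
    """Probe direction = mean of the positive class minus mean of the negative class."""
    return X_pos.mean(axis=0) - X_neg.mean(axis=0)

rng = np.random.default_rng(6)
d = 256
X_true  = rng.normal(size=(400, d)) + 0.5   # toy "green" representations
X_false = rng.normal(size=(400, d)) - 0.5   # toy "red" representations

w = mass_mean_direction(X_true, X_false)
scores = np.concatenate([X_true, X_false]) @ w
labels = np.array([1] * 400 + [0] * 400)
print("readout accuracy:", ((scores > scores.mean()) == labels).mean())
```
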
282
00:30:02.050 --> 00:30:03.000
David Bau: Make sense?

283
00:30:03.150 --> 00:30:07.499
David Bau: And you can see how this would be more guaranteed to be causal.

284
00:30:08.090 --> 00:30:11.260
David Bau: Than the other one, because it's so simple.

285
00:30:11.410 --> 00:30:14.480
David Bau: It's like, what's your advice for all the red people?

286
00:30:15.250 --> 00:30:18.099
David Bau: Well, the advice is, you should be more like the green people.

287
00:30:18.370 --> 00:30:19.529
David Bau: It's simple.

288
00:30:19.690 --> 00:30:26.480
David Bau: And so how do the red people differ from the green people? Well, like, higher homework grades. That's good, work on that, right? But they also, like,

289
00:30:26.970 --> 00:30:33.970
David Bau: study for more hours, right? So, you should do that also, right? So… so, you know…

290
00:30:34.940 --> 00:30:42.239
David Bau: It's… it's a simpler thing, and it's more guaranteed

291
00:30:42.380 --> 00:30:54.689
David Bau: to be connected to causal variables. Yes? Maybe this is more of a data question than a modeling question, but it feels to me like the mass-mean approach kind of assumes that you have, like,

292
00:30:55.140 --> 00:31:11.130
David Bau: the same distribution. Yes. Whereas I could imagine, like, maybe you have two types of students that fail: low effort, low grade, and high effort, still low grade. That's right. And then if you have those two clusters, taking the average of that might…

293
00:31:11.210 --> 00:31:26.240
David Bau: It also might not… point the same way. That's right. So I'm just curious, you know, would you categorize that as, like, an issue with the approach, or with the data? Like, not very counterfactual? The way I put it… the way I think of it is like this:

294
00:31:26.750 --> 00:31:33.109
David Bau: You… you have this hope that somewhere in this huge signal that you've got, there's a causal variable.

295
00:31:34.440 --> 00:31:36.069
David Bau: There might not be a causal variable.

296
00:31:36.410 --> 00:31:47.120
David Bau: But you hope that somewhere in there, there's one. And the danger of using the wrong kind of probe is that you might throw out that causal variable by accident. Like, logistic regression will do that a lot.

297
00:31:47.220 --> 00:31:54.790
David Bau: Right, but if you use the mass mean thing, if there's a causal variable in there, then you'll bring it along. You're bringing along all the variables, because you're just, like.

298
00:31:55.000 --> 00:31:56.900
David Bau: taking the means, and you're not throwing anything out.

299
00:31:57.030 --> 00:32:07.199
David Bau: Right? Now, there's… it's possible that your system's set up so that there's no causal variable in there. Maybe there's not… there's no causality in any of the things we measured, right?

300
00:32:07.390 --> 00:32:13.599
David Bau: And then… and then this is not gonna help that, right? You can't, like, invent causality where there isn't… Maybe it's just, like, that,

301
00:32:13.730 --> 00:32:32.149
David Bau: linear combination of, like, different factors. You're trying to measure a combination. Yeah, no, right, exactly. So there might be something causal somewhere in the system, but by the time you measure it, it's all downstream of causality, and none of your variables, none of the things you have are causal anymore, right? It could be, right? So you often have the situation where you're downstream of causal variables.

302
00:32:32.360 --> 00:32:40.600
David Bau: And, and nothing that you affect actually has a causal effect anymore, right? Does that… does that kind of make sense? You can have these systems as, like.

303
00:32:40.780 --> 00:32:51.740
David Bau: you know, downstream of causal variables. It's like, oh, umbrellas! When people have umbrellas, it rains, right? Oh, that's very downstream of it raining, like, you know, if I give everybody an umbrella, it's not gonna make it rain.

304
00:32:51.920 --> 00:32:56.630
David Bau: Right? How about rain boots and raincoats and all the other stuff? Right, and none of it helps.

305
00:32:56.860 --> 00:32:59.410
David Bau: Right? It's not gonna rain, we're still in a drought.

306
00:32:59.600 --> 00:33:06.440
David Bau: That's not the way to solve it. It's because it's all downstream. And so you might be in a situation where everything that you measure is downstream and it doesn't help.

307
00:33:06.560 --> 00:33:18.500
David Bau: Right, so… but, if you do have something causal in there, then at least it won't throw it out. And then… so there's a second set of experiments we need to run inside a neural network.

308
00:33:18.810 --> 00:33:24.450
David Bau: To test for causality, and the statistical techniques won't…

309
00:33:24.650 --> 00:33:27.570
David Bau: do that on their own. You've just gotta change things.

310
00:33:28.350 --> 00:33:29.130
David Bau: Makes sense.

311
00:33:29.270 --> 00:33:40.449
David Bau: But… but if you… if you set all that fancy stuff up, and you say, oh, the intervention I'm going to do is this LR probe direction, then it might still not work, because you've actually thrown out the variable.

312
00:33:41.990 --> 00:33:42.920
David Bau: Make sense?

313
00:33:43.300 --> 00:33:44.830
David Bau: Or, or maybe even get it.

314
00:33:45.600 --> 00:33:47.160
David Bau: And then people use.

315
00:33:48.200 --> 00:33:52.840
David Bau: So, the other thing that Sam does is… so Sam has this funny thing where

316
00:33:53.090 --> 00:33:56.489
David Bau: what does he say? Sam Marks says,

317
00:33:56.650 --> 00:34:06.089
David Bau: Oh, you know, when you use this mass-mean thing, people want to use it as a probe. They want to, like, dot-product with it and see, like, if something is truthful or not.

318
00:34:06.360 --> 00:34:25.299
David Bau: He says, well, like, if you just use a dot product, it's not a very good probe. You see that the separating plane would cut through a lot of cases. It'd be easy to, like, get things wrong here. And so, what Sam points out is that the main cause in all these cases is that you have some covariance, you have some skew in your data.

319
00:34:25.719 --> 00:34:36.330
David Bau: And the right way to… the way he likes to think about it is that it is a good probe, but first you sort of have to unskew the data. So he says, measure the covariance of your data,

320
00:34:36.460 --> 00:34:42.169
David Bau: whiten by the covariance, and then you can trust that line, and that… that will tilt.

321
00:34:42.389 --> 00:34:45.150
David Bau: It'll tilt all the angles so that… you know.

322
00:34:45.360 --> 00:34:48.639
David Bau: They basically… what a right angle looks like.

323
00:34:48.830 --> 00:34:53.479
David Bau: You know, after it's untilted, will basically look like this one.

324
00:34:54.480 --> 00:35:00.779
David Bau: It'll be a slightly different rule, it's a different method. It's… it's a method called linear discriminant analysis.

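(A small sketch of the "unskew first" idea on made-up 2-D data: whitening by the shared covariance before taking the mean difference gives the linear discriminant analysis direction, w = Sigma^{-1}(mu_pos - mu_neg), and on skewed data it separates better than the raw mass-mean dot product.)

```python
import numpy as np

rng = np.random.default_rng(7)
A = np.array([[3.0, 2.5], [2.5, 3.0]])                 # makes the classes strongly skewed
X_pos = rng.normal(size=(500, 2)) @ A + [1.0, 0.0]
X_neg = rng.normal(size=(500, 2)) @ A + [-1.0, 0.0]

mu_diff = X_pos.mean(axis=0) - X_neg.mean(axis=0)      # plain mass-mean direction
Sigma = np.cov(np.vstack([X_pos - X_pos.mean(axis=0),
                          X_neg - X_neg.mean(axis=0)]).T)
w_lda = np.linalg.solve(Sigma, mu_diff)                # whitened (LDA) direction

labels = np.array([1] * 500 + [0] * 500)
for name, w in [("mass-mean", mu_diff), ("whitened/LDA", w_lda)]:
    scores = np.vstack([X_pos, X_neg]) @ w
    acc = ((scores > np.median(scores)) == labels).mean()
    print(f"{name:12s} accuracy: {acc:.1%}")
```
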
325
00:35:03.040 --> 00:35:06.790
David Bau: But, but yeah, so that's it. So that's… that's what Sam

326
00:35:07.220 --> 00:35:09.709
David Bau: Proposes in his paper, and

327
00:35:10.320 --> 00:35:17.100
David Bau: Okay, so that's… any, any other… is that sort of more than you wanted to know about how… how to find a linear direction? Okay.

328
00:35:17.520 --> 00:35:20.290
David Bau: There's, I mean, there's a lot of other things that people can do.

329
00:35:20.550 --> 00:35:36.509
David Bau: So there were some other questions here, so for, Court… Courtney… oh, I missed your… did I miss the T, Courtney? Courtney asked a question, Yuci asked a question, Claire asked a question, yeah, so some questions here are related, yes. Remind me your name, I forget who's who. Courtney here. Courtney, okay.

330
00:35:36.510 --> 00:35:46.579
David Bau: My question is about how the linear structure emerges, like, across the layers, and they said for some more complicated, like, tasks or structures, the later layers are…

331
00:35:46.580 --> 00:36:03.590
David Bau: better to find the linear separation. Yes. How do you know when is the ideal layer to, like, cut off the learning or pull off the linear structure? Yes. If there's so much variance, is there, like, a systematic way to do this? Yeah, there's a couple systematic ways, and I'll show you. I'll give a little sense for that.

332
00:36:03.620 --> 00:36:08.529
David Bau: And then let me let the other folks ask their questions as well; that'd be very exciting.

333
00:36:08.830 --> 00:36:11.590
David Bau: So, Yuci, you too?

334
00:36:11.730 --> 00:36:21.759
David Bau: I think it's the same question, but I was more focused on, like, where we're supposed to probe for structure, right? Yeah, because… because the transformer's this big grid

335
00:36:21.920 --> 00:36:26.510
David Bau: with all the layers and all the tokens and so on, right? A lot of possible things to probe.

336
00:36:26.740 --> 00:36:28.059
David Bau: Claire had a question.

337
00:36:28.670 --> 00:36:30.760
David Bau: Yes. I was just…

338
00:36:30.950 --> 00:36:43.770
David Bau: I couldn't understand the summarization motif. I read that section, like… Yeah, it's the summarization motif. Yeah, I apologize for this paper. I wanted to give you a little sense for, you know, what people are doing, but it's a little bit of a messy paper. I can explain it.

339
00:36:44.150 --> 00:36:45.110
David Bau: Okay.

340
00:36:45.220 --> 00:36:55.409
David Bau: So, but it's related to the other two questions. So, what's going on here is this… so, you remember in Sam Marks's paper that he had this mysterious figure, this one?

341
00:36:55.680 --> 00:37:00.129
David Bau: And then in paper number 2, they had this other map figure, this one.

342
00:37:00.250 --> 00:37:12.969
David Bau: So, both of these are… they're different techniques from each other, but they're both what those researchers used to try to answer this question: which layer should we be

343
00:37:13.220 --> 00:37:14.070
David Bau: probing.

344
00:37:14.220 --> 00:37:17.210
David Bau: Right, and which token… should we be probing.

345
00:37:17.370 --> 00:37:24.540
David Bau: And… And so, like, I really like Sam Marks's approach, what he did. So what he did is:

346
00:37:24.720 --> 00:37:28.430
David Bau: This is… this is a causal analysis of tokens

347
00:37:28.930 --> 00:37:32.550
David Bau: And, and layers. And so what Sam did.

348
00:37:32.700 --> 00:37:34.679
David Bau: Is he took a single sentence.

349
00:37:35.300 --> 00:37:39.890
David Bau: He didn't even do this over the whole data set, he just took a single sentence. He said, Chicago.

350
00:37:40.140 --> 00:37:42.570
David Bau: Chicago is in Canada.

351
00:37:43.230 --> 00:37:48.390
David Bau: is not… actually, I'm sorry, there's another sentence he puts after that. He says, this statement is…

352
00:37:48.550 --> 00:37:51.059
David Bau: And you want to see if the model says true or false.

353
00:37:51.600 --> 00:37:53.239
David Bau: Right? Chicago's in Canada.

354
00:37:53.400 --> 00:37:55.079
David Bau: And the model should say false.

355
00:37:55.660 --> 00:37:56.420
David Bau: Right?

356
00:37:56.760 --> 00:38:01.920
David Bau: And then he runs it again and says, Toronto is in Canada. The statement is…

357
00:38:02.740 --> 00:38:14.990
David Bau: And he'd expect it to say true, right? So these two runs are really similar, and what he wants to do is he wants to look inside the model to see what variables are causal inside. So we're going to have a whole class around

358
00:38:15.090 --> 00:38:17.969
David Bau: this type of causal scanning next week.

359
00:38:18.270 --> 00:38:22.620
David Bau: But… but he… but he does some causal analysis.

360
00:38:22.790 --> 00:38:24.829
David Bau: And he, he finds that

361
00:38:24.940 --> 00:38:41.579
David Bau: certain layers and certain tokens are more important than others. All these dark blue ones are the important ones. And he draws boxes around them and discusses them. And he says, well, this is sort of causally important, but not that interesting, because it's, like, the city, Chicago or Toronto. But he says, oh, this is really interesting.

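(A schematic toy of the patching-style causal analysis being described, not the paper's actual setup: cache an activation from the "clean" run, splice it into the "corrupted" run, and see how far the output moves back. With a real language model you would do this at a specific layer and token position, such as the period.)

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class Tiny(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Linear(8, 16)
        self.layer2 = nn.Linear(16, 1)

    def forward(self, x, patch=None):
        h = torch.relu(self.layer1(x))
        if patch is not None:        # intervention: overwrite this activation
            h = patch
        return self.layer2(h)

model = Tiny()
clean, corrupted = torch.randn(8), torch.randn(8)

with torch.no_grad():
    h_clean = torch.relu(model.layer1(clean))      # cache the clean activation
    base = model(corrupted)                        # corrupted run, no intervention
    patched = model(corrupted, patch=h_clean)      # corrupted run with clean activation
    target = model(clean)

# If `patched` lands near `target` while `base` does not, this activation is causally
# important for the output -- the analogue of a dark square in the token-by-layer map.
print(base.item(), patched.item(), target.item())
```
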
362
00:38:41.740 --> 00:38:44.560
David Bau: At the period, at the end of the sentence.

363
00:38:44.770 --> 00:38:52.399
David Bau: There's some really important representations that are causal that cause the model to, say, flip true or false at the end.

364
00:38:52.650 --> 00:39:02.329
David Bau: And so this was really fascinating, because the period doesn't have any information in it, right? The period is the same between these two sentences, Chicago or Toronto.

365
00:39:02.590 --> 00:39:08.530
David Bau: And yet, the period has something in it that is decisive, in its representation, has something decisive

366
00:39:08.680 --> 00:39:12.269
David Bau: That somehow gets the model to say true or false and flip between them.

367
00:39:12.650 --> 00:39:25.719
David Bau: between the sentences, even though there's nothing different about the period itself, right? And so… so what… what Sam hypothesizes is that something in the representation in the period, somewhere around layer, you know, 15 or something.

368
00:39:25.870 --> 00:39:31.109
David Bau: is, like, summarizing… the truth of the sentence, right?

369
00:39:31.350 --> 00:39:35.739
David Bau: So he uses this causal analysis on one, one example

370
00:39:36.080 --> 00:39:48.400
David Bau: to, to justify that. And he says, so therefore, for the rest of the paper, I'm gonna just look at the representations that are sitting here, and see what the deal is, and I'll probe them, and I'll…

371
00:39:48.580 --> 00:39:49.960
David Bau: Scatterplot them and everything.

372
00:39:50.950 --> 00:39:52.689
David Bau: So we'll look at this period.

373
00:39:53.190 --> 00:39:59.250
David Bau: And so… so that's… that's what he does. And actually, this is one of the pieces of lore that I'll just share with you.

374
00:39:59.720 --> 00:40:03.759
David Bau: It's like, these periods, these, like, punctuations at the end of sentences.

375
00:40:04.230 --> 00:40:08.449
David Bau: the Transformers, they put a lot of information on them, it's crazy, right? Like…

376
00:40:08.560 --> 00:40:22.789
David Bau: you know, when you… so this other paper is also doing something similar with other punctuation. So, this paper is using a different approach. They're saying, okay, when you get to the end, John feels very, you know, happy or sad or whatever, right? So, like, okay, so you're about to make this prediction.

377
00:40:23.360 --> 00:40:25.920
David Bau: What are the attention heads doing?

378
00:40:26.350 --> 00:40:37.490
David Bau: when they're trying to look back and try to remember what's going on, say, you know, whether John feels happy or sad. So they find the relevant attention heads, and they look at the heat maps, and they say, oh, look!

379
00:40:38.590 --> 00:40:40.809
David Bau: They're, they're paying a lot of attention.

380
00:40:41.590 --> 00:40:42.799
David Bau: to the comma!

381
00:40:43.700 --> 00:40:45.760
David Bau: This comma. What the heck?

382
00:40:46.480 --> 00:41:02.630
David Bau: The comma seems so innocent, right? But actually, but actually, like, there's something important there that the attention heads are looking at. So, these are two different views of the same kind of phenomena, where, like, it looks like a lot of the punctuation is used by the transformer.

383
00:41:03.050 --> 00:41:17.579
David Bau: To deposit, like, summary information about what it understood about the phrase, or what it understood about the sentence, and then later on, when it has to re-read, instead of going and looking at all the individual words, it often looks at the punctuation, right?

384
00:41:18.050 --> 00:41:20.999
David Bau: So, at least that's what we're finding.

385
00:41:21.130 --> 00:41:23.849
David Bau: It's a sort of a story that we've seen a lot.

386
00:41:24.090 --> 00:41:32.529
David Bau: And so, if you're… if you don't yet know how to do all this causal analysis, but you want to try to do some scatter plots and…

387
00:41:32.640 --> 00:41:34.520
David Bau: Do some probes and look around.

388
00:41:34.940 --> 00:41:38.860
David Bau: Oh, you could do worse than looking at a piece of punctuation.

389
00:41:39.250 --> 00:41:41.639
David Bau: Right? You can find it, it might work.

390
00:41:41.880 --> 00:41:54.040
David Bau: it might work. And we can do this causal analysis and everything, but, like, I recommend people just, like, look at the period at the end of the sentence. So, like, if you… if you're on the power team, and you have sentences that describe somebody

391
00:41:54.190 --> 00:42:08.239
David Bau: who's projecting a lot of power, and you have other sentences describing somebody who's really powerless, and you want to see, does the model know the difference between these? Well, where would you probe? Maybe the period at the end of the sentence might be a good place to start.

392
00:42:08.490 --> 00:42:12.659
David Bau: Okay, that make sense? And same with all the other… all the other projects.
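
Here is a minimal sketch of that kind of probe at the period token, assuming a generic Hugging Face model ("gpt2" as a stand-in), a hypothetical layer, and a tiny made-up "power" dataset — none of these are the actual project data or model.

```python
# Sketch: gather the period-token representation for contrastive sentences
# and fit a simple linear readout (logistic regression probe).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
LAYER = 8  # hypothetical

powerful  = ["The general commanded the entire army.", "The CEO fired half the board."]
powerless = ["The intern fetched coffee all day.", "The prisoner waited for permission to speak."]

def period_rep(sentence, layer):
    ids = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hs = model(**ids, output_hidden_states=True).hidden_states[layer][0]
    period_id = tok(".", add_special_tokens=False).input_ids[0]
    pos = ids["input_ids"][0].tolist().index(period_id)   # position of the period
    return hs[pos]

X = torch.stack([period_rep(s, LAYER) for s in powerful + powerless]).numpy()
y = [1] * len(powerful) + [0] * len(powerless)

probe = LogisticRegression(max_iter=1000).fit(X, y)   # linear readout of the attribute
print("train accuracy:", probe.score(X, y))            # with real data, evaluate on a held-out split
```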

393
00:42:12.840 --> 00:42:30.179
David Bau: Yes. Slightly different, but when a model hallucinates, do you think there is, like, any pattern that… or has any work been done that tried to… because I think this is really new for me, that the comma might hold power, but, like, in the hallucinatory content, if there is, like, any pattern or…

394
00:42:30.570 --> 00:42:42.970
David Bau: any, like, so, for example, I see a lot of digits keep on getting repeated when the model starts hallucinating, so is there… that's been, like… Yeah, yeah, yeah. So, hallucination is a big deal, so a lot of people have worked on hallucination.

395
00:42:43.110 --> 00:42:44.800
David Bau: I, I don't think that…

396
00:42:44.940 --> 00:42:56.769
David Bau: I've seen any that relate it to the punctuation, but I'll share an interesting phenomenon. There's another thing that happens is… so there's two really interesting moments when a model is about to emit a token.

397
00:42:56.960 --> 00:42:59.809
David Bau: One is right before it emits the token.

398
00:43:00.210 --> 00:43:05.129
David Bau: And then you can see… Like, in the final bunch of layers.

399
00:43:05.480 --> 00:43:21.230
David Bau: the model's, like, bringing together all sorts of information to make this really difficult decision of whether something is true or false, or whether some sentiment should be predicted a certain way, some arithmetic is figured out a certain way or something, right? So right before the token is predicted, you see all this interesting information converge.

400
00:43:21.510 --> 00:43:23.310
David Bau: On that, right before token?

401
00:43:23.650 --> 00:43:33.470
David Bau: But then there's this other thing that happens, that after the token is emitted, and it's in the stream, and now the model's kind of going back autoregressively and reading, what it just said.

402
00:43:33.960 --> 00:43:39.010
David Bau: There's this other thing that happens, which is that it says, wait, what's going on?

403
00:43:39.140 --> 00:43:41.710
David Bau: And maybe it, like, you have the token.

404
00:43:41.980 --> 00:43:45.109
David Bau: And it'll assess whether it's true or false right after that.

405
00:43:45.550 --> 00:43:48.530
David Bau: And, and so one of the hallucination results

406
00:43:48.720 --> 00:43:53.429
David Bau: Is that even when models make a mistake, and they hallucinate something.

407
00:43:53.530 --> 00:43:57.139
David Bau: After they emit the token, the wrong token they put in the stream.

408
00:43:57.270 --> 00:44:02.379
David Bau: Then immediately, you know, one or two tokens after, the model will look at what it said.

409
00:44:02.860 --> 00:44:06.450
David Bau: And it'll say… That doesn't seem right.

410
00:44:07.430 --> 00:44:09.049
David Bau: What's so bad?

411
00:44:09.180 --> 00:44:13.040
David Bau: Right? Like, they'll know that they made a mistake, which is really interesting.

412
00:44:13.210 --> 00:44:29.899
David Bau: And so, and I think the hallucination paper… so I know that in my lab, the best place to find out whether a previous token was a hallucinated mistake is actually at punctuation, but there have been some other papers, so I don't know if other labs agree with that.

413
00:44:30.090 --> 00:44:30.969
David Bau: Let me have mine.

414
00:44:31.980 --> 00:44:32.660
David Bau: Holy crap.

415
00:44:33.920 --> 00:44:36.000
David Bau: So, okay.

416
00:44:37.020 --> 00:44:39.659
David Bau: So, let's see… other things.

417
00:44:39.760 --> 00:44:44.080
David Bau: Oh, other… oh, Courtney also asked: what layer, right?

418
00:44:44.720 --> 00:44:45.840
David Bau: And,

419
00:44:46.400 --> 00:44:51.839
David Bau: And there's simple things that you can also do, so there's these fancy causal things that you can do, but you can also just…

420
00:44:52.040 --> 00:44:53.469
David Bau: Measure the margin?

421
00:44:53.570 --> 00:44:56.850
David Bau: Right? Just like a classic machine learning person would do.

422
00:44:57.070 --> 00:45:04.650
David Bau: Right, and so, like, a simple way of measuring the margin is, like, the Cohen's d statistic.

423
00:45:04.910 --> 00:45:11.080
David Bau: Which is, you know, you have your two classes, you know, your true things, your false things, your powerful things, your powerless things, whatever.

424
00:45:11.230 --> 00:45:22.700
David Bau: And you have your direction, let's say your mass mean direction, and you just project everything on that direction. So then, like, how well are things separated? Well, you can just ask, like.

425
00:45:22.890 --> 00:45:25.699
David Bau: How… how far are the means apart from each other?

426
00:45:25.830 --> 00:45:29.149
David Bau: But now, what's the scale that you should use for that? Well…

427
00:45:30.160 --> 00:45:33.590
David Bau: You know, if the neural network happens to be

428
00:45:34.080 --> 00:45:42.649
David Bau: outputting values in the hundreds out of the neurons, then that would be a big scale, and if the neural network happens to be outputting, you know, one tenth out of

429
00:45:42.810 --> 00:45:59.159
David Bau: the neurons, that would be a small scale, but really what matters is, like, what's the scale relative to the general variance of the signals, right? And so what Cohen's d is, is you take the difference in the means, and then you just divide it by the standard deviation of what the data looks like.

430
00:45:59.690 --> 00:46:04.859
David Bau: And that tells you, on a relative scale, like, how much of the variation

431
00:46:05.310 --> 00:46:08.289
David Bau: You know, is, is, you know, can be…

432
00:46:08.500 --> 00:46:11.130
David Bau: Seen as a separation between these two classes.

433
00:46:11.700 --> 00:46:21.429
David Bau: Right, and so if you do this for a real piece of data, then you'll often see, you know, graphs that go up and go down. This is two graphs. There's another one called

434
00:46:21.650 --> 00:46:27.339
David Bau: that you can Google for, called the Fisher separation.

435
00:46:27.660 --> 00:46:32.909
David Bau: Where, where it's a little bit more vector-oriented. So, there's different ways of, like, estimating the ratio.

436
00:46:33.710 --> 00:46:36.070
David Bau: But they're just, they're simple ratios.
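
Here is a minimal sketch of the Cohen's d "margin" measurement just described: project each representation onto a direction (for instance the mass-mean difference) and compare the gap between class means to the pooled standard deviation. The arrays are random placeholders standing in for real activations.

```python
import numpy as np

def cohens_d(proj_a, proj_b):
    # d = (mean_a - mean_b) / pooled standard deviation
    na, nb = len(proj_a), len(proj_b)
    pooled_var = ((na - 1) * proj_a.var(ddof=1) + (nb - 1) * proj_b.var(ddof=1)) / (na + nb - 2)
    return (proj_a.mean() - proj_b.mean()) / np.sqrt(pooled_var)

rng = np.random.default_rng(0)
acts_true  = rng.normal(1.0, 1.0, size=(200, 512))   # stand-in activations, class A
acts_false = rng.normal(0.0, 1.0, size=(200, 512))   # stand-in activations, class B

direction = acts_true.mean(0) - acts_false.mean(0)    # mass-mean difference direction
direction /= np.linalg.norm(direction)

d = cohens_d(acts_true @ direction, acts_false @ direction)
print("Cohen's d along the mass-mean direction:", round(float(d), 2))
```

Computing this per layer gives the kind of up-and-down curve mentioned next, and the Fisher separation is just a different way of forming a similar ratio.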

437
00:46:36.860 --> 00:46:44.150
David Bau: Make sense? So it's another thing you can do, and this might suggest, oh, I should look at, you know, I should look at the model at layer 27.

438
00:46:44.830 --> 00:46:47.850
David Bau: Right, it looks like there's a big separation here, maybe…

439
00:46:48.060 --> 00:46:49.790
David Bau: Maybe that'll tell me what's going on.

440
00:46:49.980 --> 00:46:54.809
David Bau: Okay, so what I want to share with you… so, actually, you can even bring this up. This is actually a nice…

441
00:46:54.950 --> 00:46:59.390
David Bau: I prepared a thing that is kind of a model for what

442
00:46:59.610 --> 00:47:06.040
David Bau: I hope you might do for your project, or at least a resource for things that you might do. So there's this GitHub

443
00:47:06.460 --> 00:47:09.389
David Bau: page. Puns is also a GitHub

444
00:47:09.960 --> 00:47:12.580
David Bau: Project, you can just look at all the code.

445
00:47:12.820 --> 00:47:21.599
David Bau: So I sort of vibe-coded it, but it does have a lot of nice examples of code in it. And I… and so I just wanted to take a little interlude.

446
00:47:22.230 --> 00:47:27.289
David Bau: to step you through how you would do this type of analysis in your projects.

447
00:47:27.830 --> 00:47:31.269
David Bau: And some of the things we talked about Last week, but…

448
00:47:31.690 --> 00:47:35.839
David Bau: Because I was really disorganized by the project timing, we didn't really have that much time to talk about it.

449
00:47:36.180 --> 00:47:37.830
David Bau: We're gonna talk about it again here.

450
00:47:38.000 --> 00:47:38.750
David Bau: Thanks.

451
00:47:39.250 --> 00:47:40.139
David Bau: And so…

452
00:47:40.350 --> 00:47:56.200
David Bau: So the steps are like this. So first we, like, make some data sets, and we, like, make contrastive data sets, and then… and then we use it to benchmark models, and then after we benchmark models, we can use NDIF to gather internal activation data, and then we… we analyze it. And so I'm just going to step you through what each of these steps looks like.

453
00:47:56.510 --> 00:48:07.800
David Bau: Right. And so, the example project I made up for myself — oh, one person, Tajif, is also doing puns — is puns.

454
00:48:07.910 --> 00:48:12.180
David Bau: So my research question is, do you burn, right? Do…

455
00:48:12.690 --> 00:48:16.020
David Bau: Do, do LLMs know how to take a joke?

456
00:48:18.140 --> 00:48:21.749
David Bau: Or at least puns. At least, like, really, like, stupid jokes.

457
00:48:21.920 --> 00:48:23.479
David Bau: Like the puns, right?

458
00:48:23.790 --> 00:48:31.729
David Bau: And so there's a whole bunch of reasons, and I wrote them in this webpage, like why puns might be actually interesting, but mostly I picked it because they're funny.

459
00:48:31.990 --> 00:48:37.649
David Bau: And so, so, my first step was, if I'm gonna probe…

460
00:48:37.850 --> 00:48:42.590
David Bau: pun understanding, I need a joke book that I can use.

461
00:48:42.760 --> 00:48:56.859
David Bau: to test the models, so I can go to them and say, do you think this is funny? Or whatever, right? But they have to be pretty funny jokes, so I didn't have enough. I'm not that creative. I was able to come up with, you know, 5 or 6 puns on my own.

462
00:48:57.440 --> 00:49:02.969
David Bau: And so… so I… so I did what a lot of researchers are doing now. I went to all the…

463
00:49:03.200 --> 00:49:10.509
David Bau: LLMs, and I said, alright, here's some examples of a pun. Can you make me a thousand of these? Right?

464
00:49:10.620 --> 00:49:12.849
David Bau: And they're happy to oblige.

465
00:49:12.970 --> 00:49:15.200
David Bau: They make thousands and thousands of puns.

466
00:49:15.470 --> 00:49:30.120
David Bau: And, and I just, I said, oh, format it as JSON, put it in a file for me, I'll merge them all up, I've got my, you know, my ChatGPT puns, my Claude puns, my Gemini puns, they're all, like, I've got about a thousand of them in total. But let me tell you.

467
00:49:30.890 --> 00:49:38.590
David Bau: They're really bad. Here's one, like: the farmer woke up early every morning because he had to make hay.

468
00:49:39.350 --> 00:49:42.089
David Bau: It's not very funny.

469
00:49:42.460 --> 00:49:45.190
David Bau: Right, and some of them, like, they would be funnier, but they're, like.

470
00:49:45.560 --> 00:49:51.580
David Bau: bad, like, so here's… the door salesman was the best closer because he always knew how to

471
00:49:53.590 --> 00:50:01.159
David Bau: close — but you said "closer" already, so you ruined the joke, right? Like, it should say something like…

472
00:50:01.680 --> 00:50:02.820
David Bau: You know?

473
00:50:03.850 --> 00:50:07.390
David Bau: The, the, the door salesman was the…

474
00:50:07.610 --> 00:50:14.480
David Bau: best in the country, because he always knew how to close, right? And then that would be the, like, that would be a joke, but, like.

475
00:50:14.920 --> 00:50:19.100
David Bau: you know, the superhuman AGI systems.

476
00:50:19.240 --> 00:50:27.260
David Bau: They don't get it, right? Okay. But… but I get, like, a thousand low-quality jokes like this, and then, of course, you have to…

477
00:50:27.450 --> 00:50:29.250
David Bau: You know, inspect them by hand.

478
00:50:29.570 --> 00:50:36.159
David Bau: So maybe that's good, maybe they were good jokes, and then I'd be done. I'd have my data set. Great, right? But they were terrible.

479
00:50:37.040 --> 00:50:48.740
David Bau: And there were too many for me to work through, so I did the same thing that we read about what Perez did. So I'm just following the formula of the Perez paper, just like what you guys read last week, right?

480
00:50:49.090 --> 00:50:54.260
David Bau: I said, I need to filter these, and there's too many for me to filter them by hand.

481
00:50:54.600 --> 00:51:04.639
David Bau: So, ChatGPT, Claude, Gemini, let's scramble these up, look at each other's jokes, right? I need you to rate these jokes. Tell me what's funny.

482
00:51:04.930 --> 00:51:06.470
David Bau: Tell me what's not funny.

483
00:51:06.870 --> 00:51:13.319
David Bau: Right? I'll make this easier for you. You have to do two steps. First, Explain the joke!

484
00:51:13.710 --> 00:51:14.840
David Bau: Explain it!

485
00:51:15.180 --> 00:51:33.150
David Bau: And then tell me why it's funny. If you can't explain why it's funny, then give it a rating accordingly — so give all these numbers. So you do it, and it says, oh yeah, the pun relies on "make hay" — hay is a farming activity, it sounds like "hey", whatever, right? Like 3.3 out of 10. Maybe not that funny.

486
00:51:33.920 --> 00:51:37.680
David Bau: Right? But some of the others, like, you know, they're rated a 5.

487
00:51:37.850 --> 00:51:44.969
David Bau: And so I did this, and then, like, after I filtered it, it wasn't good enough. I had very few jokes that survived.

488
00:51:45.130 --> 00:51:53.870
David Bau: But some of the jokes were kind of close, so I also ran what I call the joke repair workshop, right? Where I said, okay, now read the explanation of the joke.

489
00:51:54.100 --> 00:51:59.530
David Bau: And then think about why it's not funny, and see if you can, like, make it funnier. Here's a couple examples of repaired jokes.

490
00:51:59.810 --> 00:52:02.699
David Bau: Right? And it repaired another couple hundred jokes for me.

491
00:52:03.040 --> 00:52:09.039
David Bau: Right And then re-rated them, and so on. And so after, after a while, I did this.

492
00:52:09.200 --> 00:52:14.300
David Bau: And… and I… and I had to… so, the honest truth is.

493
00:52:14.410 --> 00:52:24.160
David Bau: It, like, it wasn't enough to get a few hundred funny jokes. I got maybe 100 funny jokes out of this, and then there were 100 other jokes that were almost funny.

494
00:52:24.300 --> 00:52:28.849
David Bau: But they couldn't figure out how to repair them, and so I had to go and repair them by hand, right, as a human.

495
00:52:29.060 --> 00:52:30.580
David Bau: And repair the jokes, and…

496
00:52:30.740 --> 00:52:49.110
David Bau: did all the things. Okay, so that's… so… so that was sort of the process I did. So I got a couple hundred jokes out of this, right? Oh, and then now you have to watch out for me, because I've got my 200 puns. Okay. So, so then the next step is, we need to study these jokes.

497
00:52:49.500 --> 00:52:50.380
David Bau: And…

498
00:52:50.960 --> 00:52:58.400
David Bau: We need models where we can look at the insides. Like, we're not going to study ChatGPT, because OpenAI is not going to let us look inside

499
00:52:58.520 --> 00:53:00.010
David Bau: the ChatGPT model.

500
00:53:00.340 --> 00:53:12.800
David Bau: So I was like, let me just test all the different models and see which one… The real challenge with any of these things is you can't study

501
00:53:13.060 --> 00:53:14.570
David Bau: a phenomenon.

502
00:53:15.120 --> 00:53:18.969
David Bau: in an LLM that's too dumb to actually have the phenomenon.

503
00:53:19.140 --> 00:53:24.570
David Bau: Right? If the LLM can't take a joke at all, then there's no, like, joke awareness to study in it.

504
00:53:24.860 --> 00:53:28.849
David Bau: So the first thing you gotta do is you've got to find the LMs that actually

505
00:53:29.270 --> 00:53:37.820
David Bau: that you can look inside, they're open LLMs, but they actually know the concept that you're interested in. And so… so I just, like, took my 205 jokes.

506
00:53:38.230 --> 00:53:41.720
David Bau: and use the fastest service I could find to just throw them

507
00:53:42.310 --> 00:53:47.890
David Bau: at all these LLMs, what I did is I took the humorous word that was, you know,

508
00:53:48.460 --> 00:53:49.330
David Bau: you know.

509
00:53:49.710 --> 00:53:55.829
David Bau: The baker was poor because he didn't make enough.

510
00:53:56.580 --> 00:53:58.010
David Bau: Leave it blank.

511
00:53:58.190 --> 00:54:04.190
David Bau: And then you'd have to say "dough", right? And, and,

512
00:54:04.370 --> 00:54:08.279
David Bau: And then a lot of the LLMs cannot do that.

513
00:54:08.470 --> 00:54:10.019
David Bau: But some of them can.

514
00:54:10.200 --> 00:54:16.380
David Bau: And… and so I just sent them all to all the… so this is… this is the smallest LLM,

515
00:54:16.510 --> 00:54:33.919
David Bau: And then the dark red is the biggest one, and you can see, for this batch of jokes, you know, the smallest LLM could only get, like, you know, 10% of them, and the biggest LLM could fill in the joke for 80% of them. And so there's a real difference between the small and the big ones. So these tiers…

516
00:54:34.040 --> 00:54:39.049
David Bau: And then, not only that, but it was an interesting chance for me to rate the jokes themselves.

517
00:54:39.390 --> 00:54:47.440
David Bau: And so a lot of the jokes were so laugh-out-loud funny that even the dumb model would basically get it.

518
00:54:48.010 --> 00:54:48.860
David Bau: Right?

519
00:54:49.260 --> 00:54:50.360
David Bau: That make sense?

520
00:54:50.660 --> 00:54:52.860
David Bau: But then, some of the jokes

521
00:54:53.620 --> 00:54:57.460
David Bau: Take off the funny word. They read like straight sentences.

522
00:54:57.720 --> 00:55:04.279
David Bau: And the AI doesn't get that there's a joke going on, and just puts in a regular word.

523
00:55:04.420 --> 00:55:11.580
David Bau: you know, the banker was… Demoralized.

524
00:55:11.720 --> 00:55:15.970
David Bau: Because he had lost his… Job.

525
00:55:16.840 --> 00:55:23.600
David Bau: Right, they'll say. But if you're in a joking mood, you might say the banker was demoralized because he lost his interest.

526
00:55:24.050 --> 00:55:26.669
David Bau: Right? Get it? Interesting.

527
00:55:26.800 --> 00:55:31.420
David Bau: Patrick's… Like, I should have been doing that.

528
00:55:31.580 --> 00:55:33.299
David Bau: Okay? Get it?

529
00:55:33.710 --> 00:55:42.470
David Bau: So, but that's… see, that's what happens with all the models. They're like, I don't get it. I don't get it. Job! He would lose his job. Why would he care about losing his interest?

530
00:55:42.620 --> 00:55:45.659
David Bau: That's like a marketing job's more serious.

531
00:55:46.030 --> 00:55:46.850
David Bau: Right?

532
00:55:47.490 --> 00:56:02.310
David Bau: And so… so that's where… so there's… so there were, like, ADA prompts that were like this. So I… so I probed all the models, but it was also an opportunity for me to probe the dataset, which was interesting, for a couple other reasons, right? Does that make sense? Okay.

533
00:56:02.600 --> 00:56:03.320
David Bau: So then…

534
00:56:03.670 --> 00:56:17.320
David Bau: Okay, so then the nice thing about doing this and asking the different models to complete it is that I actually got these nice long lists of straight versus funny answers, so, you know, what is this? The tailor won his court case because he had an excellent

535
00:56:21.290 --> 00:56:24.370
David Bau: Suit. Yeah, an excellent suit.

536
00:56:24.610 --> 00:56:30.690
David Bau: But if you ask, like, the small models to answer this, they're, like, an excellent lawyer?

537
00:56:30.800 --> 00:56:40.489
David Bau: Right? Right? He had excellent defense, he had a good reputation, he had a good excuse, he had an alibi, right? You know, what's it like? You know, he had all these other things.

538
00:56:40.560 --> 00:56:58.249
David Bau: So they don't get the joke, so… but, like, the nice thing about doing this is you get all the other words that are not joke words, and you can have the big LLM… you can go filter them, you can have the big LLM check: yes, these are not joke words, and then these other words are the joke words, and we can tell the difference, right? So… so this is, like, a nice…

539
00:56:58.320 --> 00:57:01.150
David Bau: opportunity to do this, so I added this to the dataset.

540
00:57:01.660 --> 00:57:03.250
David Bau: And then, after that.

541
00:57:03.950 --> 00:57:08.670
David Bau: I was like, okay, so there's a problem with this. So the problem with this is that

542
00:57:08.880 --> 00:57:16.869
David Bau: I actually don't know, after doing this experiment, whether this is Llama 405B, the giant model, right? I don't know if Llama 405B

543
00:57:17.290 --> 00:57:19.140
David Bau: Actually knows about humor.

544
00:57:19.560 --> 00:57:22.960
David Bau: All I know from this is it has read more text.

545
00:57:23.150 --> 00:57:25.709
David Bau: And maybe it's seen this joke before.

546
00:57:26.530 --> 00:57:29.109
David Bau: I don't know if it even knows that it's funny.

547
00:57:30.000 --> 00:57:31.069
David Bau: Does that make sense?

548
00:57:31.900 --> 00:57:36.380
David Bau: So… so what I did is I said, okay, but you know what? These ones are interesting.

549
00:57:36.970 --> 00:57:40.609
David Bau: These ones are interesting, because these are, like, hidden jokes that none of the models, like.

550
00:57:40.950 --> 00:57:43.340
David Bau: You know, can see on the surface that they're jokes.

551
00:57:43.590 --> 00:57:47.869
David Bau: And so I said, let's do this. Let's create sentences like this.

552
00:57:48.120 --> 00:57:53.209
David Bau: So let's say this is, like, a hidden joke. The dangerous iPhone was arrested and charged with…

553
00:57:53.630 --> 00:57:58.690
David Bau: I don't know, theft, I don't know, what would you charge a dangerous iPhone with?

554
00:58:00.150 --> 00:58:03.239
David Bau: What would you charge a dangerous iPhone with? USB.

555
00:58:03.440 --> 00:58:06.410
David Bau: Battery! Battery!

556
00:58:07.540 --> 00:58:12.290
David Bau: somebody's finger. You charge it with battery. So now,

557
00:58:12.580 --> 00:58:17.940
David Bau: So, if you were in this context, the diner went to the paint store to get thinner.

558
00:58:19.140 --> 00:58:33.239
David Bau: the two antennas met on the roof, and the wedding had excellent reception, right? And then, after you saw that, you were like, okay, I get it now, the dangerous iPhone was arrested and charged with, and you'd be like, battery, that's it, battery, right? But…

559
00:58:34.060 --> 00:58:38.689
David Bau: If you had a different sentence, and you said the diner went to the paint store to get

560
00:58:39.120 --> 00:58:40.730
David Bau: Paint colors.

561
00:58:41.370 --> 00:58:42.180
David Bau: Right?

562
00:58:42.510 --> 00:58:44.760
David Bau: The two antennas met on a roof.

563
00:58:45.310 --> 00:58:49.100
David Bau: And the wedding had excellent Hors d'oeuvres?

564
00:58:49.590 --> 00:58:50.450
David Bau: Right?

565
00:58:50.570 --> 00:58:54.390
David Bau: And the iPhone was dangerous and was arrested and charged.

566
00:58:54.510 --> 00:58:59.270
David Bau: for… You might be like, I don't know, jaywalking? Theft?

567
00:58:59.570 --> 00:59:06.059
David Bau: Violation of privacy? Making too much noise? Disturbing the peace? I don't know, what can an iPhone do?

568
00:59:06.350 --> 00:59:11.530
David Bau: Right? So… so that… so… so you might not get it, right?

569
00:59:11.750 --> 00:59:15.640
David Bau: And so, So, what I did is I paired

570
00:59:16.490 --> 00:59:20.560
David Bau: So, whenever you do experiments, it's useful to have contrasts.

571
00:59:21.130 --> 00:59:24.950
David Bau: So, I made these paired datasets, where you had one.

572
00:59:25.730 --> 00:59:28.860
David Bau: that, you know, had sentences like this, where everything was straight.

573
00:59:29.100 --> 00:59:32.439
David Bau: And one sentence where the jokes are… come before.

574
00:59:32.720 --> 00:59:37.810
David Bau: And then I looked at the last thing, and I said, what does the model Predict.

575
00:59:37.960 --> 00:59:43.450
David Bau: for the last word, the joke word or not, I have these nice data sets of, like, joke words and not joke words, right?
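
For concreteness, here is a sketch of what one paired record in such a dataset might look like. The field names and structure are made up for illustration; they are not the actual schema from the puns repo.

```python
# One hypothetical paired record: a joke-priming context, a matched straight context,
# a shared prompt, and the joke vs. non-joke completions.
record = {
    "context_joke": ("The diner went to the paint store to get thinner. "
                     "The two antennas met on the roof, and the wedding had excellent reception."),
    "context_straight": ("The diner went to the paint store to get paint. "
                         "The two antennas met on the roof, and the wedding had excellent hors d'oeuvres."),
    "prompt": "The dangerous iPhone was arrested and charged with",
    "joke_word": "battery",
    "straight_words": ["theft", "jaywalking", "disturbing the peace"],
}
```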

576
00:59:43.650 --> 00:59:49.639
David Bau: And then here you go. So, like, this stupid little model doesn't know anything, can never tell the joke.

577
00:59:50.150 --> 00:59:50.890
David Bau: Right?

578
00:59:51.000 --> 00:59:56.080
David Bau: But then, when you get to the big models, like, you know, the 405B,

579
00:59:56.560 --> 00:59:59.289
David Bau: You know, when it's a straight context.

580
00:59:59.970 --> 01:00:11.700
David Bau: You know, and sometimes it gets the idea of the joke anyway, even though there's no jokes being told. It'll tell the joke anyway, so at some rate, right, like, you know, 15% of the time or something.

581
01:00:11.850 --> 01:00:12.540
David Bau: Great.

582
01:00:12.680 --> 01:00:18.540
David Bau: But then, when it's a joke context, then it, like, reliably gives a joke, right? So this gap.

583
01:00:18.700 --> 01:00:24.170
David Bau: between its behavior in the joke context and the straight context. It's like, it's stronger evidence

584
01:00:24.360 --> 01:00:29.020
David Bau: that it actually knows what funny is. So with this in hand, I was like, okay.

585
01:00:29.180 --> 01:00:33.050
David Bau: We can proceed. We can do some more experiments; this is, like, promising.
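
Here is a minimal sketch of measuring that gap: how much more likely is the joke word after a joke-priming context than after a straight context. It reuses the hypothetical `record` from the sketch above and uses "gpt2" as a stand-in model; this is not the actual benchmarking code from the project.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def joke_word_prob(context, prompt, joke_word):
    ids = tok(context + " " + prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**ids).logits[0, -1]
    joke_id = tok(" " + joke_word, add_special_tokens=False).input_ids[0]  # first token of the word
    return logits.softmax(-1)[joke_id].item()

p_joke = joke_word_prob(record["context_joke"], record["prompt"], record["joke_word"])
p_straight = joke_word_prob(record["context_straight"], record["prompt"], record["joke_word"])
print("joke context:", p_joke, "straight context:", p_straight, "gap:", p_joke - p_straight)
```

Averaging this gap over the whole dataset, per model, gives the kind of small-model vs. big-model comparison described above.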

586
01:00:33.190 --> 01:00:44.929
David Bau: And so… so that basically brought me to, the things that I'm talking about in the class today. Let's… let's actually collect together our 100 jokes.

587
01:00:45.120 --> 01:00:52.450
David Bau: And look at some representations. And we're gonna do more of this next week. But what I'd like you to do for your project, for homework.

588
01:00:52.650 --> 01:00:57.170
David Bau: Is to do some of this for your own data in the next couple days.

589
01:00:57.300 --> 01:01:07.889
David Bau: So what I did is I took this joke, the diner went to the paint store to get thinner, and 205 jokes just like it, and I said, what is the representation at the period?

590
01:01:09.080 --> 01:01:18.430
David Bau: And then I analyzed all those representations, I did the mass-mean thing, and I got this little graph to say that the margin is biggest at layer 27 or something.

591
01:01:18.540 --> 01:01:27.870
David Bau: And I did a PCA visualization of pink funny jokes and blue serious ones, and I can kind of see that, oh yeah, there's kind of a separation.

592
01:01:28.160 --> 01:01:30.720
David Bau: Between the funny things and the serious ones.

593
01:01:30.910 --> 01:01:36.439
David Bau: And then there's some details, like, you might want to figure out the mass-mean vector with half of your data set.

594
01:01:36.930 --> 01:01:52.710
David Bau: And then measure the separation with the other half, so that you're not overfitting. Right, so there's, like, classic machine learning things you still want to pay attention to here, so you might want to do a holdout set. Is this the, difference or separation between the last token? Yes.

595
01:01:52.710 --> 01:01:57.810
David Bau: So, basically, what I would do is I would take two versions of the sentence.

596
01:01:58.640 --> 01:02:01.189
David Bau: And look at the representations at the period.

597
01:02:01.390 --> 01:02:19.759
David Bau: Couldn't that just be because, like, the answer is different, or how is that, like… Oh, yeah, this is after the answer. So, after the answer, the answer is definitely different. So you have: the diner went to the paint store to get thinner, and then the diner went to the paint store to get paint. And… and so,

598
01:02:19.840 --> 01:02:27.589
David Bau: So yeah, so what we could be looking at is… there's all these funny words, like thinner, and there's all these serious words like paint.

599
01:02:27.870 --> 01:02:33.070
David Bau: And… and there's maybe… there's just a gap between just those words. So,

600
01:02:33.500 --> 01:02:49.290
David Bau: So, you know, that might be the case. I guess my gut feeling is if you actually looked at the words in isolation without the joke, they seem just like other words, and it doesn't seem like anything special about them. But you might be right, there could be a confounder here. There could be other reasons you get the separation.

601
01:02:49.420 --> 01:03:04.870
David Bau: And so we're gonna have to do more to prove to ourselves that this is actually humor awareness. But the fact that the model can kind of separate them is a sign, it's a signal, that maybe it knows that it's funny.

602
01:03:05.260 --> 01:03:08.670
David Bau: Maybe it knows it's funny. At least, if it…

603
01:03:09.070 --> 01:03:12.769
David Bau: Didn't know it was funny, it probably wouldn't have a separation like this.

604
01:03:13.140 --> 01:03:19.779
David Bau: Right? That makes sense. We're still not sure. It might know something else that's correlated with funny, yes. Would you take a look at the dumb model?

605
01:03:20.520 --> 01:03:27.900
David Bau: Oh, yes, see? Yes, I didn't do that, but you should. You can take a look at the dumb model and see that it doesn't separate. Things like that.
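
Here is a minimal sketch of the analysis just described: fit the mass-mean direction on half the data, measure the separation on the held-out half, and eyeball a PCA scatter; a small "dumb" model can serve as a baseline. The activation arrays are random placeholders standing in for period-token representations at one layer.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
acts_funny    = rng.normal(0.5, 1.0, size=(100, 512))   # stand-in activations
acts_straight = rng.normal(0.0, 1.0, size=(100, 512))

def split(a):            # first half to fit the direction, second half to evaluate
    h = len(a) // 2
    return a[:h], a[h:]

funny_fit, funny_eval = split(acts_funny)
straight_fit, straight_eval = split(acts_straight)

w = funny_fit.mean(0) - straight_fit.mean(0)              # mass-mean direction (fit half only)
w /= np.linalg.norm(w)

pf, ps = funny_eval @ w, straight_eval @ w                 # held-out projections
d = (pf.mean() - ps.mean()) / np.sqrt((pf.var(ddof=1) + ps.var(ddof=1)) / 2)
print("held-out Cohen's d:", round(float(d), 2))           # repeat per layer, and for a small baseline model

pts = PCA(n_components=2).fit_transform(np.vstack([funny_eval, straight_eval]))
print("2D PCA coords of first funny point:", pts[0])       # scatter-plot pink vs. blue in practice
```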

606
01:03:28.100 --> 01:03:33.570
David Bau: Oh yeah, I have this nice demo deck, too, I'll show you. So, if you use my code.

607
01:03:33.730 --> 01:03:37.989
David Bau: You can see what else I vibe-coded, I'm so happy with this. Yeah. Let's go with this.

608
01:03:38.200 --> 01:03:41.510
David Bau: Oh, that's localhost. Local host? Oh. Oh!

609
01:03:42.590 --> 01:03:43.460
David Bau: I don't know.

610
01:03:43.660 --> 01:03:44.970
David Bau: I don't know if you'll see this.

611
01:03:45.430 --> 01:03:46.910
David Bau: demo life.

612
01:03:48.160 --> 01:03:49.229
David Bau: of the host.

613
01:03:50.080 --> 01:03:50.880
David Bau: Yes.

614
01:03:57.330 --> 01:03:58.180
David Bau: Oh.

615
01:03:58.610 --> 01:04:05.389
David Bau: Yeah, I want to show you this demo, it's so cool, but then I also don't want to slow down the class. Okay, how much, how much time do I have left?

616
01:04:05.530 --> 01:04:06.409
David Bau: Am I over?

617
01:04:06.840 --> 01:04:09.910
David Bau: I have 20 minutes? Oh, I can show you the demo.

618
01:04:13.080 --> 01:04:19.489
David Bau: Yeah, although I do want to explain this other paper to you.

619
01:04:19.670 --> 01:04:29.260
David Bau: But… but you get the idea. So, let me, but in the get-go, Should I put this here?

620
01:04:29.430 --> 01:04:34.270
David Bau: You skip verification, Or is this thing… oh, I don't know.

621
01:04:34.520 --> 01:04:37.220
David Bau: If I, if I share it on Zoom.

622
01:04:37.500 --> 01:04:41.259
David Bau: so many things that you can do here. We have too much technology, there's too many options.

623
01:04:41.640 --> 01:05:00.119
David Bau: Hey, so… on Zoom. Am I on Zoom? I'm not on Zoom. Okay, forget it. I won't show you the notebook. It's very cool, though. So, if you… if you actually try the notebook, and you get it to run, actually, it might… Yeah. It might take some doing to get it to run, because there's… there's currently a bug in NDIF that I'm having the team debug. But, you know, you can get a nice 3D…

624
01:05:00.230 --> 01:05:04.489
David Bau: View this and look around and get a sense for, like, the scatter plot.

625
01:05:04.630 --> 01:05:07.540
David Bau: You know, as these points of the embeddings.

626
01:05:07.660 --> 01:05:10.980
David Bau: Okay, so you can, you can do that kind of. But this is just a 2D projection, right?

627
01:05:11.360 --> 01:05:12.130
David Bau: Okay.

628
01:05:12.510 --> 01:05:13.470
David Bau: So…

629
01:05:15.320 --> 01:05:24.630
David Bau: So, let me… so that's… up to step 5 is, like, what I would encourage you guys to try to do for your projects.

630
01:05:25.210 --> 01:05:26.050
David Bau: You know, this week.

631
01:05:26.820 --> 01:05:38.489
David Bau: Let's talk about a couple other things about vector representation. So, so let's… we had questions from Claire, Rice, and Ayush, and… oh, and Avery related to this paper. So, what was… what was Claire's question?

632
01:05:40.920 --> 01:05:48.240
David Bau: I think I asked about Word2Vec, but I talked about, like, is there a relation between, like, the attention

633
01:05:48.990 --> 01:06:04.610
David Bau: In the, like, attention embedding of sentences, like, if we take the joint embedding of a sentence, can we relate two sentences together in that way? Right, right, right, right. So maybe I, yeah, maybe I misunderstood your question, but I'll relate it to this anyway.

634
01:06:04.630 --> 01:06:19.859
David Bau: Which I think that there's… there is probably a geometry there. I… I don't know if the, if there's been any good papers looking at the geometry of sentence embeddings, that would be an interesting research topic, but there's been a lot of papers on, like, geometry of word embeddings.

635
01:06:19.980 --> 01:06:24.259
David Bau: And, and I'd expect sentence embeddings to have some of the same properties.

636
01:06:24.440 --> 01:06:27.259
David Bau: Let's… how about Rice and Ayush?

637
01:06:28.740 --> 01:06:31.880
David Bau: See red rights here? See, right? Yes.

638
01:06:32.120 --> 01:06:34.500
David Bau: I think I was just asking if we could…

639
01:06:34.960 --> 01:06:39.349
David Bau: Since, from the results, it seems like the vector is, like, cleaner.

640
01:06:39.580 --> 01:06:58.479
David Bau: like, in that subspace where you… You mean in Sheridan's paper? Yeah, yeah. Yeah, yeah, I'll describe Sheridan's. That's great, yeah, so that's what that was about. Okay, cool. Yeah, and the answer is, yeah, I think so. It's… I think it's… it's cleaner. I don't know if it's more effective for patching. It's probably an open research question. You can ask Sheridan. Sheridan's back.

641
01:06:59.800 --> 01:07:06.630
David Bau: But it's a good question. And then Ayush… I'll talk… I'll talk about what all these things are in a minute, and then… Ayush, do you agree?

642
01:07:09.500 --> 01:07:16.650
David Bau: I agree. It's more about, like, the concept and token. It's… Oh, yeah!

643
01:07:17.020 --> 01:07:28.850
David Bau: I'll go over that. Okay, yeah. Okay, I'll go over that. Yes, it's about… you want to hear about induction heads. Okay, so let me just describe this. So, the history for this piece of work is… there's a very famous paper

644
01:07:29.370 --> 01:07:32.800
David Bau: by Mikolov in 2013 called Word2Vec.

645
01:07:33.210 --> 01:07:35.150
David Bau: Where they train, like.

646
01:07:35.400 --> 01:07:40.739
David Bau: like, a zero-layer neural network. It's, like, the simplest model, on a lot of text.

647
01:07:41.090 --> 01:07:53.010
David Bau: And then, get vector embeddings for all sorts of words, and then makes a big deal about the fact that, the words as vectors have this interesting geometry.

648
01:07:53.150 --> 01:07:55.199
David Bau: Like, if you take the vector for man.

649
01:07:55.500 --> 01:07:59.579
David Bau: and the vector for woman, and take the vector difference between them.

650
01:07:59.910 --> 01:08:02.190
David Bau: And then if you take the vector for king.

651
01:08:02.480 --> 01:08:09.279
David Bau: And you add that vector difference that you had to that, then you get king plus woman minus man.

652
01:08:09.520 --> 01:08:14.529
David Bau: And you look around, you say, what word had that vector there? And what do you think you find?

653
01:08:16.770 --> 01:08:17.560
David Bau: Yeah!

654
01:08:18.120 --> 01:08:21.050
David Bau: He's like, you get queen! That's pretty neat!

655
01:08:21.830 --> 01:08:22.620
David Bau: And,

656
01:08:22.970 --> 01:08:34.189
David Bau: And so he's like, well, you can make all these parallelograms between man, woman, king, queen. It's not exact, right? You know, queen's a little bit off, but it's kind of close.

657
01:08:34.460 --> 01:08:37.700
David Bau: Right? And so, so is increments. Neat.

658
01:08:37.800 --> 01:08:41.450
David Bau: Right? So this is, like, this is, semantic vector arithmetic.
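
Here is a minimal sketch of that parallelogram arithmetic using pretrained static word vectors from gensim's downloader. The specific embedding set is just a convenient choice; any classic word-embedding model would illustrate the same point.

```python
# king + (woman - man), then nearest neighbor: the classic Word2Vec-style analogy.
import gensim.downloader as api

vecs = api.load("glove-wiki-gigaword-100")   # small pretrained embeddings (downloads on first use)
print(vecs.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# Typically "queen" comes out near the top for this pair, but as discussed,
# it's not exact, and many other analogies fail.
```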

659
01:08:41.990 --> 01:08:45.570
David Bau: And so, after Mikolov found this,

660
01:08:46.010 --> 01:08:49.539
David Bau: every natural language processing professor in the world

661
01:08:49.710 --> 01:08:51.869
David Bau: taught everybody: this is what happens

662
01:08:52.300 --> 01:08:53.760
David Bau: With vector embeddings?

663
01:08:54.120 --> 01:09:01.720
David Bau: Except for this caveat here, which is, weirdly, if you look at state-of-the-art networks and embedders and real networks,

664
01:09:02.130 --> 01:09:03.969
David Bau: They don't really do this very often.

665
01:09:04.359 --> 01:09:05.130
David Bau: Great.

666
01:09:05.260 --> 01:09:14.270
David Bau: They rarely do this. Like, you get… I mean, queen might be in, like, that neighborhood, but it's, like, really far, and you might be closer to some other embedding, like Idaho or something.

667
01:09:14.970 --> 01:09:15.750
David Bau: You know?

668
01:09:16.170 --> 01:09:30.420
David Bau: And so, okay, so I'm… so the paper… the paper eventually gets to talking about what's going on here. But to… to lead up to it, I want to just describe Sheridan's work, which I think is really cool, and it goes through induction heads.

669
01:09:30.870 --> 01:09:36.770
David Bau: So, if you're taking an interpretability class, I need to teach you what an induction head is.

670
01:09:37.020 --> 01:09:40.580
David Bau: It's not optional. So, you guys need to learn this.

671
01:09:41.040 --> 01:09:43.410
David Bau: Okay, so what an induction head is, it's about

672
01:09:44.500 --> 01:09:50.760
David Bau: So, like, if you have a sentence that's repeated in a piece of text, like, I hear the cardinals.

673
01:09:51.390 --> 01:10:00.960
David Bau: I hear the… Right, and you put that into your autoregressive model. The… the normal bigram…

674
01:10:01.120 --> 01:10:03.609
David Bau: statistics, for I hear the…

675
01:10:04.360 --> 01:10:10.510
David Bau: would probably say, I hear the car, I hear the explosion, I hear the music, or something like that after it.

676
01:10:10.750 --> 01:10:13.099
David Bau: It wouldn't say, I hear the card.

677
01:10:13.320 --> 01:10:14.919
David Bau: Not something that you would hear.

678
01:10:15.120 --> 01:10:19.569
David Bau: But in this context, That is the most likely answer.

679
01:10:20.160 --> 01:10:22.219
David Bau: It will say, I hear the card noise.

680
01:10:22.570 --> 01:10:24.839
David Bau: And why does it say that?

681
01:10:25.390 --> 01:10:29.610
David Bau: It says that because… it was just said.

682
01:10:30.020 --> 01:10:36.279
David Bau: And it knows that English text, whatever, like, real, you know, human text, often repeats itself.

683
01:10:36.810 --> 01:10:46.240
David Bau: And so, if it's getting ready to repeat itself again, that's, like, really likely. So this repeating thing is a thing that all the language models learn how to do.

684
01:10:46.470 --> 01:10:57.339
David Bau: And it turns out that if you look inside to see how they compute whether they're going to do a repeat or not, you find these interesting attention heads inside the model that do this. They're called induction heads.

685
01:10:57.770 --> 01:10:59.730
David Bau: The way the induction heads work.

686
01:10:59.880 --> 01:11:05.999
David Bau: And, you know, so here's, like, the attention heads that do this. There's, like, you know, half a dozen induction heads in this model. You can see they sort of…

687
01:11:06.320 --> 01:11:09.400
David Bau: scatter around the architecture of the model.

688
01:11:10.010 --> 01:11:13.339
David Bau: And, and the way they do this is…

689
01:11:13.670 --> 01:11:17.400
David Bau: When it gets to any word, it more or less asks.

690
01:11:17.770 --> 01:11:21.440
David Bau: When is the last time, you know, I hear the…

691
01:11:21.590 --> 01:11:23.670
David Bau: When is the last time that was uttered?

692
01:11:24.390 --> 01:11:30.299
David Bau: And it would find that this is the last time that it was uttered, but that's not… that's not actually what they're asking. They're asking:

693
01:11:30.850 --> 01:11:34.260
David Bau: What word came after the last time it was uttered?

694
01:11:34.730 --> 01:11:40.129
David Bau: And so, the attention heads will be like, oh, this word came right after "I hear the".

695
01:11:41.140 --> 01:11:46.720
David Bau: So basically what happens is, every word is tagged with its context.

696
01:11:47.060 --> 01:11:50.040
David Bau: And every time you have a context, it goes and searches

697
01:11:50.150 --> 01:11:54.070
David Bau: Was, was any word tagged as coming right after this context before?

698
01:11:54.320 --> 01:11:57.940
David Bau: And here, in this case, it'll find it. It'll find the card was there.

699
01:11:58.110 --> 01:12:06.350
David Bau: And then the attention head says, oh, well, in that case, we should copy it. And the attention head goes and looks at, you know, the representation of card.

700
01:12:06.500 --> 01:12:11.250
David Bau: It picks up the fact that it's the word card, and then outputs this… outputs the prediction.

701
01:12:11.360 --> 01:12:18.999
David Bau: Make sense? So that's how an induction head works, and they're all over the place. And induction heads are why models tend to repeat themselves a lot.

702
01:12:19.500 --> 01:12:21.870
David Bau: It's because they have these mechanisms inside them.
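
Here is a minimal sketch of the behavior induction heads produce: a prefix is completed with its earlier continuation much more readily once the phrase has already occurred in the context. "gpt2" is a stand-in model and the sentence is made up; this is just an illustration of the effect, not the original paper's analysis.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def top_next(prompt, k=3):
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        probs = model(**ids).logits[0, -1].softmax(-1)
    top = torch.topk(probs, k)
    return [(tok.decode(int(i)), round(p.item(), 3)) for p, i in zip(top.values, top.indices)]

print(top_next("I hear the"))                                   # generic, bigram-ish guesses
print(top_next("I hear the cardinals singing. I hear the"))     # the earlier continuation now dominates
```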

703
01:12:22.370 --> 01:12:28.429
David Bau: And so… oh yeah, this is from the original induction head paper, I just wanted to copy something out of the original paper, so this is the original work

704
01:12:28.550 --> 01:12:31.859
David Bau: that analyzed this; these are their diagrams.

705
01:12:32.980 --> 01:12:33.920
David Bau: Okay.

706
01:12:34.240 --> 01:12:40.850
David Bau: So, so, now, there's this other interesting thing that happens. Oh, I should have put in another thing, which is…

707
01:12:40.990 --> 01:12:53.930
David Bau: If you actually look in detail at the induction heads, and Sheridan did, Sheridan found that there's a bunch of tokens that, when you copy text, that don't seem to be copied by the induction heads.

708
01:12:54.270 --> 01:12:59.449
David Bau: And so, so Sheridan's like, where, like, how does the model figure out how to copy these things?

709
01:12:59.720 --> 01:13:07.530
David Bau: And so, so what Sheridan found out was you had this other set of heads.

710
01:13:07.940 --> 01:13:13.319
David Bau: That work like this. That basically, you know, if they look back.

711
01:13:13.580 --> 01:13:20.150
David Bau: And instead of attending to "card", they attend to, like, "inals", like a token ahead.

712
01:13:20.320 --> 01:13:29.129
David Bau: And then, instead of changing the prediction of the next word, they indirectly change the prediction of the word two ahead.

713
01:13:29.540 --> 01:13:31.949
David Bau: So there's these weird, like, two ahead.

714
01:13:32.100 --> 01:13:39.170
David Bau: induction heads that kind of, like, skip a token. They kind of look, like, one extra token ahead.

715
01:13:39.340 --> 01:13:42.330
David Bau: Which is really bizarre. I said, why is it doing that?

716
01:13:42.960 --> 01:13:43.700
David Bau: Right?

717
01:13:43.850 --> 01:13:56.320
David Bau: That makes sense? So Sheridan was gonna, like, write a paper about the two-ahead things, but then was characterizing all the other things that were going on in these two-ahead induction heads, and let me show you some weird things that happened with two-ahead induction.

718
01:13:56.800 --> 01:14:06.489
David Bau: So, it turns out that if you have, like, a natural language translation piece of text, like translating Japanese boat to Chinese boat.

719
01:14:06.840 --> 01:14:07.640
David Bau: Right?

720
01:14:08.360 --> 01:14:14.920
David Bau: And you look at what the two-ahead induction heads are doing. They are actually necessary

721
01:14:15.110 --> 01:14:18.050
David Bau: for doing this natural language translation: if you cut the heads,

722
01:14:18.260 --> 01:14:19.880
David Bau: Then it can't do this anymore.

723
01:14:20.380 --> 01:14:21.790
David Bau: And then, moreover…

724
01:14:22.180 --> 01:14:28.069
David Bau: if you take the same thing going on in, like, you know, a Spanish-to-Italian thing, which is…

725
01:14:28.430 --> 01:14:31.930
David Bau: translating the word for cloud — nube, right — to nuvola,

726
01:14:32.470 --> 01:14:41.260
David Bau: where these attention heads are still, like, important. If you cut them — you cut these attention heads here, and you cut these attention heads there —

727
01:14:41.410 --> 01:14:46.949
David Bau: and you reroute the data that's transported by these attention heads from this sentence to that sentence.

728
01:14:47.490 --> 01:14:50.019
David Bau: Right? And then you let the model run.

729
01:14:51.160 --> 01:14:54.789
David Bau: And instead of saying Nuvola, It says platinum.

730
01:14:56.570 --> 01:14:57.839
David Bau: Isn't that bizarre?

731
01:14:59.000 --> 01:15:02.429
David Bau: And so this is just, like, the two token-ahead induction heads.

732
01:15:02.680 --> 01:15:09.140
David Bau: And that, that led Sheridan to believe that, oh, what the two token-ahead induction heads.

733
01:15:09.390 --> 01:15:13.309
David Bau: are transporting is not the second token ahead.

734
01:15:14.390 --> 01:15:21.379
David Bau: It's translate… it's transporting some language-independent notion of the concept.

735
01:15:22.690 --> 01:15:23.990
David Bau: Strange, isn't it?

736
01:15:25.000 --> 01:15:27.129
David Bau: Very strange — a different kind of induction.

737
01:15:27.640 --> 01:15:34.210
David Bau: They, like, transport the whole concept, the whole word, as opposed to this, right? Have you ever heard of this paper? That's weird.

738
01:15:34.340 --> 01:15:36.780
David Bau: Anyway, so we'll talk more about this, sort of.

739
01:15:36.940 --> 01:15:46.350
David Bau: ablation analysis stuff. It's transmitting things where there are two things you have to hold in your head, like…

740
01:15:46.590 --> 01:15:54.430
David Bau: Boat, plus… Spanish, or, card plus…

741
01:15:54.670 --> 01:16:00.079
David Bau: Yeah, so I think that… so, there's a few other experiments that we did, which…

742
01:16:00.200 --> 01:16:02.590
David Bau: Where you see these induction heads.

743
01:16:02.760 --> 01:16:10.049
David Bau: implicated in, like, paraphrasing and other things where there's no apparent repetition going on. But, but

744
01:16:11.290 --> 01:16:13.880
David Bau: You know, you might be right, there might be some alternate hypotheses.

745
01:16:14.980 --> 01:16:24.330
David Bau: But right now… I like the concept of this. Yeah, but right now, you know, Sheridan's main hypothesis is that

746
01:16:24.440 --> 01:16:26.780
David Bau: These are, like, the language-independent concepts.

747
01:16:27.140 --> 01:16:30.300
David Bau: It was pretty interesting.

748
01:16:30.490 --> 01:16:34.700
David Bau: Okay, so let me… let me skip through these things. Yeah, so this is, like, paraphrasing a…

749
01:16:34.980 --> 01:16:43.040
David Bau: You know, there's some interesting paraphrasing here in here. So, this is a follow-up paper to this, which is also related to your question.

750
01:16:43.260 --> 01:16:47.790
David Bau: And… and several other mentions. So this… so what you guys read was a follow-up paper to this work.

751
01:16:48.090 --> 01:16:58.020
David Bau: Where, so… so Sheridan has these concept induction heads, which supposedly, according to Sheridan,

752
01:16:58.310 --> 01:17:00.380
David Bau: are, like, a language-independent concept readout.

753
01:17:00.620 --> 01:17:05.550
David Bau: And so Sheridan's like, you know, the Mikolov parallelogram

754
01:17:05.910 --> 01:17:08.890
David Bau: arithmetic doesn't work all that well.

755
01:17:09.610 --> 01:17:20.009
David Bau: But I wonder… if the vector representation read by the concept induction heads obeys the parallelogram arithmetic better.

756
01:17:20.960 --> 01:17:26.919
David Bau: That makes sense? So, remember how I told you what Sheridan was doing was a linear readout?

757
01:17:27.210 --> 01:17:35.710
David Bau: from their computations. So… and I said, you know, neural networks, they do linear readouts themselves. So what Sheridan did is:

758
01:17:36.420 --> 01:17:44.370
David Bau: Let's take a look at the concept induction heads. You know, there's a few dozen of these concept induction heads that, like, seem to pick out the language-independent concept.

759
01:17:44.840 --> 01:17:47.650
David Bau: They all do linear readout.

760
01:17:47.810 --> 01:17:56.020
David Bau: of the repetitions. And if you take the sum of them — just take the sum of their linear readouts — that's just a matrix, it's just another linear readout.

761
01:17:56.150 --> 01:18:02.309
David Bau: Right? So if we use that matrix to do linear readout of the concepts, and we get this other vector out.

762
01:18:02.570 --> 01:18:06.900
David Bau: I don't really know what that vector means, but… I wonder how it does

763
01:18:07.250 --> 01:18:09.310
David Bau: at this parallelogram arithmetic.

764
01:18:09.460 --> 01:18:13.169
David Bau: Right. And so… so here's basically the finding.

765
01:18:13.330 --> 01:18:16.730
David Bau: That basically, if you… if you use the concept

766
01:18:16.850 --> 01:18:20.289
David Bau: Heads, and you try to do all these country capital things.

767
01:18:20.610 --> 01:18:27.960
David Bau: Then, you know, it's basically 80% accurate at doing the parallelograms, whereas if you use the regular

768
01:18:28.200 --> 01:18:43.089
David Bau: full representation, it was down at 40% or something like that. So this is the finding, right, which is like, oh, it can only kind of do it at 40%. But if you clean it up by putting it through the readout that the concept induction heads are doing, it goes up to, like, 80%.

769
01:18:43.730 --> 01:18:44.770
David Bau: Does that make sense?
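
Here is a minimal sketch of that kind of "parallelogram accuracy" comparison: evaluate analogies in the raw representation space versus after a linear readout matrix `W`. In the paper the readout comes from the concept induction heads; here `W`, the embeddings, and the analogy pairs are all random or made-up placeholders, shown only to illustrate the evaluation itself.

```python
import numpy as np

def analogy_accuracy(emb, quads, W=None):
    """emb: dict word -> vector; quads: list of (a, b, c, d) with a:b :: c:d."""
    words = list(emb)
    M = np.stack([emb[w] for w in words])
    if W is not None:
        M = M @ W.T                               # apply the linear readout to every word
    M = M / np.linalg.norm(M, axis=1, keepdims=True)
    correct = 0
    for a, b, c, d in quads:
        v = emb[c] + (emb[b] - emb[a])            # the parallelogram step
        if W is not None:
            v = W @ v
        v = v / np.linalg.norm(v)
        scores = M @ v
        for w in (a, b, c):                       # exclude the query words themselves
            scores[words.index(w)] = -np.inf
        correct += words[int(scores.argmax())] == d
    return correct / len(quads)

# Stand-in data: random embeddings and a random readout, just to show the interface.
rng = np.random.default_rng(0)
emb = {w: rng.normal(size=64) for w in ["france", "paris", "japan", "tokyo", "italy", "rome"]}
quads = [("france", "paris", "japan", "tokyo"), ("japan", "tokyo", "italy", "rome")]
W = rng.normal(size=(64, 64))                     # placeholder for the summed head readout
print(analogy_accuracy(emb, quads), analogy_accuracy(emb, quads, W))
```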

770
01:18:45.110 --> 01:18:52.810
David Bau: And so… but then there was a question about… Why is it… That, some of the time.

771
01:18:53.280 --> 01:19:01.529
David Bau: If you, like, do it for past tense or something — there's two types of heads that you could do a readout from. There's also the regular token induction heads

772
01:19:01.790 --> 01:19:03.940
David Bau: That are about copying tokens.

773
01:19:04.110 --> 01:19:10.500
David Bau: And when you do the readout that the token induction heads are doing, you also can increase the accuracy

774
01:19:10.710 --> 01:19:13.840
David Bau: of parallelogram arithmetic.

775
01:19:14.460 --> 01:19:19.030
David Bau: But for different tasks. Like, for changing something to past tense.

776
01:19:19.260 --> 01:19:22.629
David Bau: Then, the token readout is a lot.

777
01:19:23.030 --> 01:19:26.329
David Bau: stronger at doing parallelograms

778
01:19:26.610 --> 01:19:27.979
David Bau: than the concept readout.

779
01:19:28.710 --> 01:19:30.180
David Bau: Why would that be?

780
01:19:30.590 --> 01:19:33.959
David Bau: Did anybody get, like, understand, like, why that would be?

781
01:19:34.290 --> 01:19:39.799
David Bau: Like, why is it that when you do readouts, okay, so we have… we're about out of time, so this is my last question for you guys.

782
01:19:40.660 --> 01:19:44.370
David Bau: Why would it be that the internal concept induction heads

783
01:19:44.710 --> 01:19:46.490
David Bau: would be good at country capitals

784
01:19:47.100 --> 01:19:49.220
David Bau: And reasoning geometrically about them.

785
01:19:49.450 --> 01:19:54.440
David Bau: Whereas… when it comes to the past tense, the token induction heads

786
01:19:54.770 --> 01:19:57.299
David Bau: Would be the ones that were better.

787
01:20:02.280 --> 01:20:04.259
David Bau: I would say I,

788
01:20:04.560 --> 01:20:17.610
David Bau: a concrete place versus a concept or a grammar thing that is more abstract? Yeah, one of them is really abstract — it's grammar — so why would the literal token copying

789
01:20:17.890 --> 01:20:20.079
David Bau: heads be better at the abstract one?

790
01:20:21.260 --> 01:20:29.210
David Bau: And whereas the concrete, like, places in the world are better with the abstract concept heads. Like, why would that… it's like, it seems like…

791
01:20:29.710 --> 01:20:30.670
David Bau: Almost…

792
01:20:31.240 --> 01:20:38.739
David Bau: reversed, maybe? Or not? Maybe… why would it not be reversed? Yeah, I'm trying to remember from the paper, but it had to do with, like, the…

793
01:20:38.900 --> 01:20:55.630
David Bau: the, like, literal structure of the word — like, going from the base form to the ED ending — is something that is different, that's much more concrete than going from, like, one place to another place. That's right. So some things are concrete in the world.

794
01:20:55.890 --> 01:20:59.709
David Bau: Like… what city is

795
01:21:00.040 --> 01:21:01.659
David Bau: In what country?

796
01:21:02.390 --> 01:21:04.700
David Bau: And some things are concrete in text.

797
01:21:05.500 --> 01:21:08.530
David Bau: Like whether you add ED to the end of the word.

798
01:21:08.810 --> 01:21:09.490
David Bau: Right?

799
01:21:09.790 --> 01:21:21.749
David Bau: And so the things that are concrete in text, those manipulations, they follow this nice parallelogram geometry when you read them out using token induction heads, the things that concretely copy text.

800
01:21:23.520 --> 01:21:28.830
David Bau: And then… but when you read out the same tokens using concept induction heads, the things that

801
01:21:29.440 --> 01:21:34.050
David Bau: You know, try to read out conceptually what's going on in a language-independent way.

802
01:21:34.250 --> 01:21:39.039
David Bau: It turns out that the parallelograms you get reflect

803
01:21:39.370 --> 01:21:44.329
David Bau: the structure of the real world — concepts in the real world, what city is in what country, that kind of thing.

804
01:21:44.610 --> 01:21:48.230
David Bau: And so, you know, so to me, this was super amazing.

805
01:21:49.190 --> 01:21:52.270
David Bau: I told… I told Sheridan, You should…

806
01:21:52.890 --> 01:21:58.959
David Bau: write this in a high-profile journal in Nature or something like that, and go win your Turing Award. I thought this was amazing.

807
01:21:59.110 --> 01:22:02.509
David Bau: And Sheridan's like, I don't know, I'm not that proud of this work.

808
01:22:02.630 --> 01:22:04.090
David Bau: Send it to a workshop.

809
01:22:04.440 --> 01:22:06.319
David Bau: Maybe it's time to go listen to it.

810
01:22:06.580 --> 01:22:08.809
David Bau: So here I am, teaching it to all you.

811
01:22:09.130 --> 01:22:19.769
David Bau: I think it's really cool that the models have these concepts, and you can pick them out in two different ways, and it gives you a little insight or a little flavor for what's going on.

812
01:22:19.840 --> 01:22:33.189
David Bau: And so, if there are other questions that you guys had — I apologize to the students for not having time to follow up on every question — but for all of your projects: try to gather a contrastive data set, find a model that understands the concept.

813
01:22:33.490 --> 01:22:36.680
David Bau: Thanks a lot, guys. Have a great day.

814
01:23:06.570 --> 01:23:07.980
David Bau: Alright, excellent.

