WEBVTT

1
00:00:02.330 --> 00:00:18.120
David Bau: This might vary across some of this, but generally, right?

2
00:00:18.120 --> 00:00:27.159
David Bau: There's gonna be also a lot of places called right now. It's right now.

3
00:00:27.230 --> 00:00:30.710
David Bau: I don't do it, but you want to work with us.

4
00:00:31.040 --> 00:00:33.159
David Bau: Let me also get my computer.

5
00:00:33.420 --> 00:00:35.070
David Bau: Give me a second.

6
00:00:38.740 --> 00:00:40.090
David Bau: Yes, correct.

7
00:00:41.170 --> 00:00:42.830
David Bau: I'm giving back to school.

8
00:00:46.150 --> 00:00:47.320
David Bau: Yo, Avid?

9
00:00:47.790 --> 00:00:49.170
David Bau: How can you get started?

10
00:00:49.310 --> 00:00:57.830
David Bau: Ethan, this… There's Zoom stuff all over the screen, so feel free to grab the mouse over there.

11
00:00:58.540 --> 00:01:02.170
David Bau: I feel like it's fine like this with one of the one on stage.

12
00:01:04.670 --> 00:01:15.550
David Bau: Hi, everyone, so this is our project for political ideology, and this is our update.

13
00:01:15.680 --> 00:01:23.450
David Bau: And something that we wanted to look at was the contrastive pairs, oh, go ahead and make them bigger, sorry, the font's really tiny.

14
00:01:26.210 --> 00:01:31.249
David Bau: Sorry? Yeah, there's a menu under More. Under More, you can say hide that thing.

15
00:01:33.270 --> 00:01:34.909
David Bau: There's floating meeting controls.

16
00:01:35.990 --> 00:01:36.979
David Bau: Take your time.

17
00:01:37.540 --> 00:01:41.170
David Bau: Like… Okay, there you go.

18
00:01:41.590 --> 00:01:43.299
David Bau: And you can minimize it, that's great.

19
00:01:44.830 --> 00:01:46.320
David Bau: That's not. Okay.

20
00:01:46.600 --> 00:01:58.430
David Bau: So, we decided to look at some contrastive pairs experiments, since we're trying to tease out this, like, idea of LLM sycophancy versus, like, the user context.

21
00:01:58.630 --> 00:02:08.939
David Bau: And we wanted to determine if we could extract, like, a non-sycophantic vector from the model to determine, like, when it's responding just based off of its, like,

22
00:02:08.960 --> 00:02:21.150
David Bau: like, the baseline model behavior versus when you're giving it a certain context and, how those responses differ. Originally, we kind of framed this as extracting, like, a conservative vector to see when the model is, like.

23
00:02:21.160 --> 00:02:38.109
David Bau: Flipping its responses to be more towards, like, a conservative or right-leaning ideology. But then we just decided to go with a non-sycophantic vector, because we saw that sometimes the baseline responses for the model, without providing it any, like, personal context, would

24
00:02:38.110 --> 00:02:47.689
David Bau: lean towards, like, a liberal ideology, and sometimes towards conservative. So, for a bigger data set, we went with, like, a non-sycophantic vector, not just a conservative one.

25
00:02:47.690 --> 00:03:03.459
David Bau: So, we gave it a short user context, for example, like, I am a sibling, or I am a farmer, I'm a Republican, 26 different user contexts, ranging from neutral to possibly a political proxy or a stereotype, and then explicitly political.

26
00:03:03.460 --> 00:03:10.010
David Bau: Along with the 60 political compass questions, and we asked the model, do you agree or disagree with the following question?

27
00:03:10.010 --> 00:03:17.040
David Bau: and I used Llama 70B Instruct, because we've seen the best results with that, and…

28
00:03:17.580 --> 00:03:19.740
David Bau: It's been the most, robust.

29
00:03:19.870 --> 00:03:36.959
David Bau: And, for the contrastive criteria, as you can see in the purple and the green, the responses, agree or disagree, from the 'I am a human' and 'I am a person' contexts, which are, like, baseline responses without context, are used as the non-sycophantic response.

30
00:03:37.020 --> 00:03:55.999
David Bau: Regardless of context, so if the model said that they agree, like, in the top purple one, where it says, I'm a human and I'm a person, and the model says agree for both, then that is, like, the non-sycophantic or the non-flip, and then all the other responses, where the model said disagree, would then be the flip response.

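NOTE
A minimal sketch, in Python, of the contrastive-pairs setup described above. The context and question strings are stand-ins; the team's actual 26 user contexts and 60 Political Compass statements are not reproduced here.
BASELINE_CONTEXTS = ["I am a human.", "I am a person."]  # no-context baselines
USER_CONTEXTS = ["I am a teacher.", "I am a twin.", "I am a Republican."]  # stand-ins for the 26
QUESTIONS = ["Example Political Compass statement."]  # stand-in for the 60
def make_prompt(context, question):
    return (f"{context} Do you agree or disagree with the following statement? "
            f"{question} Please answer either agree or disagree.")
def label_flips(answers):
    # answers: dict mapping each context to the model's "agree"/"disagree" for one question
    baseline = answers[BASELINE_CONTEXTS[0]]
    return {c: ("flip" if a != baseline else "non-flip")
            for c, a in answers.items() if c not in BASELINE_CONTEXTS}
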
31
00:03:56.260 --> 00:04:05.560
David Bau: After that, we got 317 statements where the agreement… So, okay, someone has a question. So…

32
00:04:05.720 --> 00:04:09.070
David Bau: So this is… this is asking the model to predict its…

33
00:04:09.290 --> 00:04:19.090
David Bau: own text, or is it predicting the user text? Is it in dialogue, or is it not in dialogue? It's just… it's in a dialogue. I see. So you mark, you mark user and assistant,

34
00:04:19.300 --> 00:04:30.119
David Bau: use those little tags to… Oh, we're just asking the model, like, I guess, just putting the prompt, like, I am a human, do you agree or disagree with the following statement? Like, for example,

35
00:04:31.000 --> 00:04:39.419
David Bau: I don't know, some political compass question. Please answer either agree or disagree. I see, I see. So it's… I guess I'm asking sort of a technical question.

36
00:04:39.570 --> 00:04:43.480
David Bau: Which is, you know, when you run these models in dialogue.

37
00:04:43.680 --> 00:04:55.770
David Bau: then you have these little tags that set off, like, what a user says, and then little tags that set off, like, what the assistant, like, the AI is supposed to be saying. And it's a language model, so it can actually predict.

38
00:04:55.910 --> 00:05:04.259
David Bau: text in all these different contexts. It's happy to predict text, even if you don't have any tags. I'm just wondering if you're using tags or not yet. Right now, it's just

39
00:05:04.360 --> 00:05:19.790
David Bau: the standard user, and then assistant. Oh, so you are using… When the user's introduced. So, a few weeks ago, I tested flipping them and saw that there was sycophancy in both. That's great. I, in fact, forgot. I forgot which team did that, right? That's great, so you guys did that, so you're inside that dialogue. Yeah, but now it works.

40
00:05:20.530 --> 00:05:25.630
David Bau: Yeah, now we're just seeing, like, how it flips depending on the context that you,

41
00:05:25.780 --> 00:05:32.009
David Bau: like, provide about your, like, yourself, like, the user context. I am XYZ. I see, so the user says…

42
00:05:32.480 --> 00:05:35.319
David Bau: I… I'm something, and then…

43
00:05:36.010 --> 00:05:42.510
David Bau: We want to see if the LM changes its response to agreeing or disagreeing with the political compass question.

44
00:05:42.680 --> 00:05:55.220
David Bau: I see. It says, oh, so the user says, I like all this stuff, and I'm liberal, do you agree? Yeah, exactly, and we want to see whether the model, like, now agrees or disagrees.

45
00:05:55.220 --> 00:06:02.710
David Bau: the question differently from its response without any, like, context. Yeah. Versus, kind of, like, the way this model will respond when

46
00:06:03.080 --> 00:06:16.469
David Bau: there's, like, no user intro, or a user intro that doesn't seem to reveal anything about their, you know… 'I'm a person', like, it doesn't say anything about their political leaning, and we're saying, does it switch to… because it's, like,

47
00:06:16.480 --> 00:06:35.939
David Bau: for the most part, they tend to be left-leaning, but sometimes they'll be, like, 'is capitalism good?', like, as the question, and… I see, so, like… Yeah. Yeah, yeah, so I was a little… I was just a little confused by that, because I didn't see, like, an 'is capitalism good' kind of question. It's, like, some questions about me, like, 'I am short', and I didn't really understand, like, I was expecting it to be a…

48
00:06:36.000 --> 00:06:49.870
David Bau: a question about some of the… Yeah, maybe I explained it wrong. So that's the context. The context, like, the first part of the entire prompt that we're giving the model, is, like, I'm a teacher, I'm a sibling, I'm a banker, I am a twin, something like that.

49
00:06:49.870 --> 00:07:11.259
David Bau: And then we ask it, do you agree or disagree with one of the 60 Political Compass questions? I'm giving you an interview, hey, LLM… Oh, but then you drop off the items. But just for context, I'm X. Okay, so that's… that's the context. I understand that setup now. Yeah, so then we do, like, all 60 questions for all 26, like, user contexts.

50
00:07:11.260 --> 00:07:14.860
David Bau: And then we have 317, flips.

51
00:07:14.860 --> 00:07:34.850
David Bau: Where the model will respond differently from, like, its non-contextual response, based off of the context it's given. Cool, and so how large a fraction is that? It changes? So, sometimes it behaves the same either way. Correct. Yeah, so 317 flips and 1,185 non-flips, so that's, like,

52
00:07:34.970 --> 00:07:37.149
David Bau: About 20%?

53
00:07:37.280 --> 00:07:44.959
David Bau: Yeah, yeah. And, like, in the original pilot testing, because, like, on the 8B, it would only… you said it would flip

54
00:07:45.210 --> 00:07:49.320
David Bau: 5 or 10% of the time, and for lots of prompts.

55
00:07:49.500 --> 00:07:51.029
David Bau: Somewhere between a third and half.

56
00:07:51.390 --> 00:08:01.149
David Bau: Nice. Yeah. Nice. Yeah. So now we have this, like, 1,500-pair data set of contrastive pairs, and then I wanted to do, like, the PCA visualization.

57
00:08:01.150 --> 00:08:17.729
David Bau: To see if we can extract some type of vector difference. And so I just sampled 317 from the, like, non-flips, so that it's more even in the visualization. So the flips are the purple, and the non-flip cases are the green.

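NOTE
A minimal sketch of the PCA visualization just described, assuming last-token activations are collected from flip and non-flip prompts; the model id, layer index, and prompts are stand-ins (a smaller checkpoint is cheaper for iterating).
import numpy as np, torch, matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from transformers import AutoModelForCausalLM, AutoTokenizer
MODEL = "meta-llama/Meta-Llama-3-70B-Instruct"  # stand-in; swap in a small model to test
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16, device_map="auto")
flip_prompts = ["I am a Republican. Do you agree or disagree ..."]  # hypothetical examples
nonflip_prompts = ["I am a twin. Do you agree or disagree ..."]
@torch.no_grad()
def last_token_state(prompt, layer):
    ids = tok(prompt, return_tensors="pt").to(model.device)
    hs = model(**ids, output_hidden_states=True).hidden_states  # embeddings + one entry per layer
    return hs[layer][0, -1].float().cpu().numpy()
X = np.stack([last_token_state(p, layer=40) for p in flip_prompts + nonflip_prompts])
pts = PCA(n_components=2).fit_transform(X)
n = len(flip_prompts)
plt.scatter(pts[:n, 0], pts[:n, 1], label="flip")      # purple in the slides
plt.scatter(pts[n:, 0], pts[n:, 1], label="non-flip")  # green in the slides
plt.legend(); plt.show()
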
58
00:08:18.430 --> 00:08:36.379
David Bau: In the early layer and the final layer, there's really not much separation at all, which is kind of surprising. I was expecting that the statements that caused the model to flip, in which the only difference is the context, like, I am a liberal, or I'm a Republican, or I am a gun owner, or something like that, would be very different, and they're not.

59
00:08:36.530 --> 00:08:52.530
David Bau: And then we can see the same results here with testing the sycophancy direction: we don't get great classification accuracy if we use, like, non-flip or flip as the classification labels, and we also don't have super great linear separability.

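NOTE
A minimal sketch of the "sycophancy direction" test just described: a difference-of-means vector between flip and non-flip activations, plus a logistic-regression probe for linear separability. The random arrays are stand-ins for real last-token activations.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
rng = np.random.default_rng(0)
X_flip = rng.normal(size=(317, 8192))     # stand-in for flip activations
X_nonflip = rng.normal(size=(317, 8192))  # stand-in for the sampled non-flips
direction = X_flip.mean(0) - X_nonflip.mean(0)  # candidate "flip direction"
print("direction norm:", np.linalg.norm(direction))  # near zero = the null vector seen on the slide
X = np.vstack([X_flip, X_nonflip])
y = np.array([1] * len(X_flip) + [0] * len(X_nonflip))
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.25, random_state=0, stratify=y)
probe = LogisticRegression(max_iter=2000).fit(Xtr, ytr)
print("probe accuracy:", probe.score(Xte, yte))  # near 0.5 = no linear separability
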
60
00:08:52.710 --> 00:08:58.660
David Bau: Courtney, what token is this? This is just on the… the…

61
00:08:59.090 --> 00:09:04.279
David Bau: the last token. After, like, it's seen everything. The very last token. Oh, the context, sorry.

62
00:09:05.550 --> 00:09:22.400
David Bau: And then also here, we wanted to look at if there is a difference, like, a vector we can find for the flip direction, and you can see this really small vector, which is basically a null vector, so, yeah, it's, like, right there. And then…

63
00:09:22.590 --> 00:09:39.489
David Bau: Yeah, so we didn't get as great results as we wanted here, in terms of seeing if there really is a difference, like a separability, between these statements. And so what I'm curious about next is looking more at intermediate layers; just for, like, compute and time's sake, because it takes a really long time on 70B,

64
00:09:39.650 --> 00:09:53.510
David Bau: we only looked at those two layers. But then, also, I wanted to know if there are possible clusters related to, how people answer the political compass questions according to their ideology. So, for example, if the model's baseline answer, like, agree.

65
00:09:53.980 --> 00:10:03.320
David Bau: or disagree corresponds to left, right, authoritarian, or libertarian, according to the political compass answers. Maybe there are actually clusters of that inside of,

66
00:10:03.320 --> 00:10:23.399
David Bau: like, this. Like, maybe there's, like, an authoritarian cluster, and then, like, a libertarian cluster. And so that's what I'm working on right now. And then I'm also interested in whether there are possible clusters related to the neutral, proxy, or explicitly political user contexts. So, maybe there's a cluster depending on, like, if you're explicitly telling the model, I'm a liberal, or I'm left-leaning, or something,

67
00:10:23.400 --> 00:10:28.090
David Bau: Versus, just giving it neutral information, like, I am a twin or something.

68
00:10:28.090 --> 00:10:36.730
David Bau: But I'm also open to any suggestions from you guys about what you think may be able to help us tease out some kind of relationship here.

69
00:10:40.390 --> 00:10:43.050
David Bau: Feel free to think on it, and Avery can go next.

70
00:10:44.150 --> 00:10:47.350
David Bau: Okay, Avery. Yeah, I wanted to…

71
00:10:47.390 --> 00:11:06.220
David Bau: kind of build off of Courtney's experiments, and also, since last week was the week of activation patching, we wanted to do some activation patching on 70B, using basically the same setup, where one, like, neutral prompt in this case is saying, I'm a human,

72
00:11:07.190 --> 00:11:14.330
David Bau: And a conservative prompt would just say, I'm conservative, and it's asking, do you agree with a question?

73
00:11:14.950 --> 00:11:21.940
David Bau: And then… Yeah, patching from each token, or patching, yeah, patching each token,

74
00:11:22.040 --> 00:11:32.800
David Bau: correspondingly, so, like, between the prompts, over as many questions as I could last night, which is 37. So I got… so, 19 of the questions were…

75
00:11:32.850 --> 00:11:43.629
David Bau: I was able to get the model to flip the answer from, you know, saying, I'm a human, I agree with this question, to saying, I'm a conservative, I disagree with this question now.

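NOTE
A minimal sketch of the token-by-token activation patching setup just described, using plain transformers forward hooks. The model id and prompts are stand-ins, " agree" is assumed to be a single token, and the two prompts are assumed to tokenize to the same length so tokens can be patched correspondingly.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
MODEL = "meta-llama/Meta-Llama-3-70B-Instruct"  # stand-in
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16, device_map="auto")
Q = " Do you agree or disagree with the following statement? ... Please answer either agree or disagree."
SRC, TGT = "I am a human." + Q, "I am conservative." + Q
AGREE = tok(" agree", add_special_tokens=False).input_ids[0]  # assumed single token
src_ids = tok(SRC, return_tensors="pt").to(model.device)
tgt_ids = tok(TGT, return_tensors="pt").to(model.device)
assert src_ids.input_ids.shape == tgt_ids.input_ids.shape  # patching is position-to-position
with torch.no_grad():
    src_hs = model(**src_ids, output_hidden_states=True).hidden_states  # src_hs[l+1] = output of block l
    p_base = torch.softmax(model(**tgt_ids).logits[0, -1].float(), -1)[AGREE].item()
def patched_shift(layer, pos):
    def hook(module, inputs, output):  # overwrite one (layer, position) of the target's residual stream
        h = output[0].clone()
        h[:, pos] = src_hs[layer + 1][:, pos]
        return (h,) + output[1:]
    handle = model.model.layers[layer].register_forward_hook(hook)
    try:
        with torch.no_grad():
            logits = model(**tgt_ids).logits
    finally:
        handle.remove()
    return torch.softmax(logits[0, -1].float(), -1)[AGREE].item() - p_base
grid = [[patched_shift(l, p) for p in range(tgt_ids.input_ids.shape[1])]
        for l in range(model.config.num_hidden_layers)]  # the heatmap on the slide
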
76
00:11:44.180 --> 00:11:47.239
David Bau: And… yeah, next slide, please.

77
00:11:47.660 --> 00:11:58.079
David Bau: So, this is the, yeah, average, like, effect of, patching, so… average, difference… so, sorry.

78
00:11:58.420 --> 00:12:17.479
David Bau: Average amount of, like, probability change from the target prompt, which is the 'I'm conservative' one, to how much more likely it is to answer in line with the source prompt. So, saying, like, in the case of… so, let's just, for the flip questions, where in the case of the target prompt, saying, I'm conservative, I agree with this question.

79
00:12:17.500 --> 00:12:30.399
David Bau: How much, when I patch from a token to the corresponding token between prompts, did it switch to, I disagree? Because that's what the human prompt said. And there's…

80
00:12:31.440 --> 00:12:46.560
David Bau: you know, mainly concentrated at the token that's being switched, the human or conservative, and also at the very last token in later layers. It's a little different from what Grace showed

81
00:12:47.100 --> 00:12:51.049
David Bau: Last time, where she had the effects of, like.

82
00:12:51.250 --> 00:12:59.529
David Bau: hers was, like, one down, at the period, so maybe I just have, like, an off-by-one error in this case, and I'll go and fix that…

83
00:12:59.560 --> 00:13:13.899
David Bau: I think she was doing 8B, and this is 70B, so maybe they're, like… So what's the word, what's the word where you're seeing the causal effect, sorry? At the… here. This is the…

84
00:13:14.060 --> 00:13:15.370
David Bau: Human or conservative.

85
00:13:15.480 --> 00:13:20.789
David Bau: To, yeah, or, like, context. Yeah.

86
00:13:21.050 --> 00:13:22.229
David Bau: I mean, both.

87
00:13:22.350 --> 00:13:29.879
David Bau: It's, it's very, it's a very stark difference, but the absolute probability change is, like, 0.1. So… Yeah,

88
00:13:30.380 --> 00:13:32.350
David Bau: So, yeah, this also, like, kind of…

89
00:13:32.460 --> 00:13:40.460
David Bau: So, it is in line with what we saw in class, with, like, the Shaq, Megan Rapinoe thing, where you have a lot of change focused on the…

90
00:13:41.230 --> 00:13:46.579
David Bau: token in the actual statement that's being changed, and also at the very end, when it's making its judgment.

91
00:13:48.310 --> 00:14:05.540
David Bau: And what's the little patch over here in the middle? This? Yeah. That's a… good question. So, at the end of the prompt, we say, like, please answer either agree or disagree, to coerce, like, a single-token answer. This is the… the 'disagree' at the very end.

92
00:14:05.710 --> 00:14:07.970
David Bau: It's interesting.

93
00:14:08.290 --> 00:14:11.820
David Bau: and… I don't really know what to make of that yet.

94
00:14:12.720 --> 00:14:16.239
David Bau: Yeah. I guess… next slide, please.

95
00:14:16.540 --> 00:14:30.289
David Bau: This is from some older experiments on 8B with a slightly different prompt, but the same idea. In this case, it's patching from liberal to conservative, where I'm saying, like, I'm a liberal, do you agree with this question?

96
00:14:30.380 --> 00:14:43.519
David Bau: Conservative, do you agree with this question? Patching tokens around. For… this is for, specifically, attention heads and the MLPs in each layer. And interestingly, like, there's no stark,

97
00:14:43.740 --> 00:14:46.240
David Bau: Shading, like what there was in the previous…

98
00:14:46.340 --> 00:14:57.750
David Bau: Experiments, where, like, you had a lot… a very solid bar of coloring here, where we're flipping the token, liberal or conservative, and a very solid bar at the bottom, for the last token.

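NOTE
A minimal sketch of patching individual sublayer outputs (a whole MLP or whole attention module at one position) rather than the full layer; true per-head patching would additionally require splitting the attention output by head, which is omitted here. Model id and prompts are stand-ins, and the two prompts are assumed to tokenize to the same length.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
MODEL = "meta-llama/Meta-Llama-3-8B-Instruct"  # stand-in for the 8B experiments
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16, device_map="auto")
def submodule(layer, kind):
    block = model.model.layers[layer]
    return block.mlp if kind == "mlp" else block.self_attn
@torch.no_grad()
def capture(prompt, layer, kind):
    store = {}
    def hook(m, i, out):  # self_attn returns a tuple, mlp returns a tensor
        store["out"] = out[0] if isinstance(out, tuple) else out
    h = submodule(layer, kind).register_forward_hook(hook)
    model(**tok(prompt, return_tensors="pt").to(model.device))
    h.remove()
    return store["out"]
@torch.no_grad()
def patch_submodule(src, tgt, layer, pos, kind, answer=" disagree"):
    src_out = capture(src, layer, kind)
    def hook(m, i, out):  # swap in the source sublayer output at one position
        if isinstance(out, tuple):
            o = out[0].clone(); o[:, pos] = src_out[:, pos]
            return (o,) + out[1:]
        o = out.clone(); o[:, pos] = src_out[:, pos]
        return o
    h = submodule(layer, kind).register_forward_hook(hook)
    logits = model(**tok(tgt, return_tensors="pt").to(model.device)).logits
    h.remove()
    tid = tok(answer, add_special_tokens=False).input_ids[0]
    return torch.softmax(logits[0, -1].float(), -1)[tid].item()
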
99
00:14:57.940 --> 00:15:03.409
David Bau: So… Maybe the, you know.

100
00:15:03.820 --> 00:15:11.750
David Bau: My hypothesis is that the effect of, like, switching the answer, the sycophancy, is…

101
00:15:11.890 --> 00:15:20.349
David Bau: probably a combination of effects from the attention layers, the attention heads, and MLPs, since, you know, there's some…

102
00:15:20.980 --> 00:15:26.439
David Bau: stuff here, but it's not as very… not as, stark, not as, noticeable as…

103
00:15:26.630 --> 00:15:33.900
David Bau: When we're doing the whole layer, so… It's both. But… Yeah.

104
00:15:34.390 --> 00:15:35.200
David Bau: That's for…

105
00:15:36.460 --> 00:15:43.509
David Bau: future experiments, I would want to do this on 70B, as with the 8B, and do this with all the questions, maybe.

106
00:15:45.070 --> 00:15:51.280
David Bau: It just happens that the first 30 questions of the political compass test are very divisive.

107
00:15:54.090 --> 00:16:02.219
David Bau: Quick clarification. So this is MLP and attention heads? Yeah, left is attention heads, this, right, is…

108
00:16:03.280 --> 00:16:06.099
David Bau: I'm getting myself confused.

109
00:16:06.430 --> 00:16:11.599
David Bau: This is just attention heads for two questions. Sorry, not both of them. Go to the next slide, please.

110
00:16:11.930 --> 00:16:22.840
David Bau: this slide is for MLP patching, and you see, like, there's some, you know, stuff up here, not as… again, not as stark as, like, when we're doing the whole layer activations,

111
00:16:25.150 --> 00:16:34.110
David Bau: Is that essentially the same? The questions are different. I think in the 70B one, the question was, do you agree or disagree? Here, the question…

112
00:16:34.590 --> 00:16:38.100
David Bau: It's not that? It's the same,

113
00:16:38.470 --> 00:16:52.940
David Bau: setup, but instead of switching from, like, neutral, I'm a human, it's liberal. Like, I'm a liberal, do you agree with this; I'm a conservative, do you disagree with this, for these experiments. But that final sentence is still the same?

114
00:16:53.060 --> 00:16:58.749
David Bau: so, yeah, I think it might have started to cut off at the bottom.

115
00:17:00.020 --> 00:17:00.720
David Bau: Yeah.

116
00:17:01.190 --> 00:17:05.480
David Bau: Okay, my main question was, if it is actually the same, do you see any effect on…

117
00:17:05.839 --> 00:17:10.889
David Bau: the attention heads on the 'disagree' token, where you were actually seeing an effect on the 70B.

118
00:17:13.310 --> 00:17:15.370
David Bau: Good question. I…

119
00:17:16.130 --> 00:17:29.379
David Bau: I think here you're showing, like, the older experiments from last week. Right. We had the original prompts, but then for those… Yeah, that's what I thought, the questions are different. For 70B, we did, like, the reframing of the question, so we could run it.

120
00:17:30.430 --> 00:17:36.700
David Bau: Sorry, I'm disorganized today, I guess. No, I mean, we also had, like, stuff… this was, like, from last week, but then we did some over the weekend.

121
00:17:39.940 --> 00:17:41.929
David Bau: Yes, okay, so I forgot…

122
00:17:42.160 --> 00:17:53.969
David Bau: to include some slides, so I was doing a lot of going back to week one and looking at logit lens, because one thing that I thought was interesting, like, looking at a bunch of different logit lens patterns, was that

123
00:17:54.690 --> 00:18:04.210
David Bau: even though the model generally defaults to a liberal answer in a neutral context, the logits for the…

124
00:18:04.600 --> 00:18:13.080
David Bau: conservative answer are much higher. So even, like, if you say to the model, like, an opinion, like, you know, I'm a liberal,

125
00:18:13.860 --> 00:18:18.799
David Bau: you'll see the conservative logits go way down. So, like, the liberal answer would have still won.

126
00:18:18.970 --> 00:18:23.830
David Bau: But the conservative drops, so it's, like, internally, sort of, like, it's thinking less of the…

127
00:18:24.480 --> 00:18:30.769
David Bau: maybe conservative thing, if that makes any sense. And so I was looking at ways to

128
00:18:31.280 --> 00:18:44.069
David Bau: find logit lens patterns where there was more internal competition, where it was, like, really, like, really undecided, basically where the difference between the conservative and liberal answer was smaller, or that the…

129
00:18:44.800 --> 00:18:51.550
David Bau: Yeah, so that's what I was looking at: like, how much inner conflict is there in the logits? I don't know how principled of a metric that is, but I was just…

130
00:18:51.660 --> 00:18:55.000
David Bau: curious. I forgot to include… there's, like, a few examples that I found.

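NOTE
A minimal sketch of the logit lens inspection and "inner conflict" metric being described: decode each layer's last-token hidden state through the final norm and unembedding, and track the gap between the agree/disagree logits across layers. Model id and prompt are stand-ins, and " agree"/" disagree" are assumed to be single tokens.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
MODEL = "meta-llama/Meta-Llama-3-8B-Instruct"  # stand-in
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16, device_map="auto")
@torch.no_grad()
def answer_logits_by_layer(prompt, answers=(" agree", " disagree")):
    ids = tok(prompt, return_tensors="pt").to(model.device)
    hs = model(**ids, output_hidden_states=True).hidden_states
    tids = [tok(a, add_special_tokens=False).input_ids[0] for a in answers]
    rows = []
    for h in hs[1:]:  # skip the embedding layer
        logits = model.lm_head(model.model.norm(h[:, -1]))  # read each layer as if it were final
        rows.append([logits[0, t].item() for t in tids])
    return rows
rows = answer_logits_by_layer("I am a liberal. Do you agree or disagree ... Please answer either agree or disagree.")
conflict = [abs(a - d) for a, d in rows]  # smaller gap = more "inner conflict" between the two answers
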
131
00:18:55.150 --> 00:19:00.010
David Bau: And then I was curious whether or not this inner conflict would, like.

132
00:19:00.480 --> 00:19:06.300
David Bau: Where there was more inner conflict, or, like, smaller differences between those logits, whether that would reveal itself in the…

133
00:19:07.410 --> 00:19:10.539
David Bau: attribution patching things, and so these are…

134
00:19:10.660 --> 00:19:20.059
David Bau: So I set up this experiment, and then after I ran it and reflected on what I did, I was like, wait, this was a very silly experiment to run. But basically, I was looking at

135
00:19:20.170 --> 00:19:25.760
David Bau: what is the logit attribution to all of the attention heads and MLP layers?

136
00:19:26.120 --> 00:19:42.000
David Bau: For examples where the model was sycophantic, so its answer flipped, or not sycophantic, and it didn't flip, between the neutral and conservative context. But then I was like, wait, if it didn't flip, then the logit difference is super low, probably, between…

137
00:19:42.130 --> 00:19:49.000
David Bau: the neutral and conservative context, so it's really noisy to even try to interpret what

138
00:19:49.280 --> 00:20:03.649
David Bau: those differences are. So anyway, so I don't… I don't really buy this anymore. So, basically, if you look at the flipped context, which is a tiny little sample size, it seems more concentrated, and in the not-flipped context, it's more diffuse, but I think this is probably just an artifact of

139
00:20:04.680 --> 00:20:06.800
David Bau: there being a small logit difference, so…

140
00:20:07.200 --> 00:20:17.520
David Bau: So basically, I want to run more attribution experiments where I'm looking at 70B, where it flips a lot more, and there's bigger, like, latent logit differences, even if you don't see them

141
00:20:18.500 --> 00:20:19.200
David Bau: flip.

142
00:20:19.820 --> 00:20:21.980
David Bau: In terms of the… Alright.

143
00:20:22.770 --> 00:20:25.120
David Bau: Any suggestions for the team?

144
00:20:27.010 --> 00:20:33.209
David Bau: Yeah, this is, like, a bigger picture thing. I, like,

145
00:20:34.610 --> 00:20:37.450
David Bau: I think with, like, the idea of sycophancy, I'm like…

146
00:20:37.570 --> 00:20:54.889
David Bau: trying to think through in this context, I'm having a lot of problems, because it's one I, like, think about in the context of people, and it's like, if I'm talking to you, you probably have, like, a true belief, and then another question is, like, are you telling me your true belief or not? I'm not quite sure what I mean for LLMs to have, like, a true belief in this kind of context.

147
00:20:55.050 --> 00:21:04.370
David Bau: I wonder if, like, that's kind of, like, one way of thinking about, kind of, what's going on in some of these experiments where you're struggling to have traction. I'm not sure what directions that leads.

148
00:21:04.500 --> 00:21:07.840
David Bau: There's, like, maybe a different set of questions that are related, that are, like.

149
00:21:07.970 --> 00:21:22.729
David Bau: there's a kind of, set of tendencies within the model to answer political questions in certain ways, and how malleable or how plastic are those tendencies, seems to be, like, one way of thinking about these experiments.

150
00:21:23.090 --> 00:21:28.640
David Bau: Yeah, I think that both the sycophancy and then just, like, a lot of moving parts in terms of, like, the different…

151
00:21:28.820 --> 00:21:35.410
David Bau: things that could be going on inside the model. Representation of self, representation of other, representation

152
00:21:35.680 --> 00:21:55.599
David Bau: political beliefs, and kind of, like, how all of the different things are interacting and kind of, like, what you're isolating in there. Maybe there's, like, a way of describing it, not in terms of sycophancy, that might help to get a little bit more, kind of, traction. Yeah. Yeah, I think, I think that's part of the reason I should have added the examples, but, like, I'm…

153
00:21:55.720 --> 00:21:57.469
David Bau: what this, like, metric goes for, the

154
00:21:57.800 --> 00:22:10.779
David Bau: inner conflict effect, where there's, like, competition between two answers, where I'm not saying, like, this one's the real one, and this is the one it's, like, pretending to do. It's more just, like, oh, sometimes it just, like, immediately commits to…

155
00:22:12.380 --> 00:22:13.520
David Bau: Other times.

156
00:22:14.000 --> 00:22:17.689
David Bau: there's more of that, confusion. Does that mean anything?

157
00:22:18.150 --> 00:22:22.619
David Bau: But because that doesn't require, like, saying whether or not the model has a self or, like, some…

158
00:22:22.730 --> 00:22:23.460
David Bau: Excuse me.

159
00:22:23.830 --> 00:22:25.990
David Bau: I agree, it's…

160
00:22:26.550 --> 00:22:41.999
David Bau: Yeah, I agree. I think that's a good… I appreciate that feedback, because I think we were talking a bit about, like, private versus public beliefs, and how, like, yeah, like, the LM may have this private idea of what's going on, but then based off of, like, the context or what you're giving the model, it's gonna answer

161
00:22:42.640 --> 00:22:43.960
David Bau: differently, so yeah.

162
00:22:48.650 --> 00:22:54.200
David Bau: Is there something you're hoping to find? Or, like, expecting to find? Maybe that's the wrong word, but, like…

163
00:22:55.400 --> 00:22:57.510
David Bau: What would be, like, a really…

164
00:22:57.790 --> 00:23:00.970
David Bau: Yeah, what would be, like, a… And…

165
00:23:01.260 --> 00:23:05.390
David Bau: Because I haven't been following y'all's project, but it's important.

166
00:23:05.930 --> 00:23:07.059
David Bau: That'd be approved.

167
00:23:11.790 --> 00:23:19.399
David Bau: Hmm, but… this is maybe not directly answering your question, but it… I think there's a lot of…

168
00:23:22.480 --> 00:23:28.239
David Bau: like, I think the reason that we, like, veer towards the sycophancy framing is that it's very politicized, like, it's, like, I think…

169
00:23:28.520 --> 00:23:33.219
David Bau: Sycophancy, like, itself, as a thing. Oh, yeah. Where people are, like…

170
00:23:33.430 --> 00:23:37.750
David Bau: The model's trying to read your mind, and it's trying to…

171
00:23:38.150 --> 00:23:42.119
David Bau: persuade, you know, and I'm like, I don't know, like, basically how deep

172
00:23:42.960 --> 00:23:45.919
David Bau: is this? Like, is it actually, like…

173
00:23:46.900 --> 00:23:49.139
David Bau: It's trying to model you, and figure out…

174
00:23:49.350 --> 00:23:53.359
David Bau: What you want, or is it… is it just, like, naive?

175
00:23:53.580 --> 00:23:58.699
David Bau: Stochastic Parrot, like, I don't know, people agree with each other. So is it just, like…

176
00:23:58.870 --> 00:24:04.539
David Bau: That, that kind of… so any… like, I don't know exactly what that would be, but anything that we could basically

177
00:24:04.710 --> 00:24:15.759
David Bau: my hope would be that by characterizing some amount of the mechanism, we could shine light on, like, how sophisticated is this? But I, yeah, I think at this point, I don't really know what that would look like.

178
00:24:15.900 --> 00:24:18.089
David Bau: Right, maybe it has to do with model size, as you said.

179
00:24:20.300 --> 00:24:23.339
David Bau: Is there a way to maybe also add, like, like a…

180
00:24:23.480 --> 00:24:38.120
David Bau: to what extent, like, 1 to 10? It's, like, I've changed my mind. It's, like, I mean, it gives a liberal answer, and also, like, I'm very certain of this, like, it's, like, a 9 or something. Like, is there any way to maybe, like, get more color on, like, the model's…

181
00:24:38.280 --> 00:24:45.770
David Bau: I don't really know where I'm going, but, like, I feel like, for me, that would, like, be another, like, data point that would help, like, if you added some sort of scale.

182
00:24:46.070 --> 00:24:49.889
David Bau: Like, in terms of, like, how positive is it? Or, like…

183
00:24:50.030 --> 00:25:03.440
David Bau: I don't want to say, like, certain… Yeah, yeah. Yeah, yeah, yeah, it's both, like, oh, like, I agree with this thing, and then it's also, like, on a scale of, like, 0 to 5, like, really strongly agree, versus…

184
00:25:03.870 --> 00:25:09.679
David Bau: I don't… it's not really, like, a principled thing, I guess. I would… that's just something, like, I would do to be, like, oh.

185
00:25:10.530 --> 00:25:11.260
David Bau: Nope.

186
00:25:15.840 --> 00:25:30.969
David Bau: Also, like, kind of adding to what everyone's discussing: identifying how, like, adaptable the model is. Like, I would assume that when you're giving a context, the model will adapt to the persona that you're giving. Versus sycophantic, is it, like,

187
00:25:31.130 --> 00:25:45.610
David Bau: happening really early, and, like, if you probe the model further, does the model stay consistent to its sycophantic behavior, or is it just adapting to the given context, and when you change the… slightly change the context, does it…

188
00:25:45.830 --> 00:25:54.219
David Bau: like, change that? So having that would… like, basically, like, I don't know if a lot of papers have done that, like, having this differentiation between sycophancy and adaptability.

189
00:25:54.390 --> 00:25:55.510
David Bau: I'm preparing to serve.

190
00:25:55.920 --> 00:25:57.930
David Bau: Sorry, can you elaborate for me?

191
00:25:58.290 --> 00:26:07.159
David Bau: Like, I feel like… maybe I also don't know what that means, but I would just assume, like, sycophantic has a more negative connotation, that says, like,

192
00:26:07.160 --> 00:26:20.300
David Bau: oh, the model abandons all knowledge or belief it has. Supposedly it's trained to be a non-hateful model, and one of the political approaches leads it to hateful speech.

193
00:26:20.780 --> 00:26:29.950
David Bau: A sycophantic answer will follow that path, versus an adaptive one will understand that the user is conservative or, like, liberal,

194
00:26:30.250 --> 00:26:39.860
David Bau: and will adapt the persona, but not, like, let down its guardrails. I guess that's what I mean, like, when does this adaptation, like, when does the model

195
00:26:40.050 --> 00:26:43.199
David Bau: Understanding that the user is conservative or liberal.

196
00:26:43.400 --> 00:26:54.099
David Bau: like, change the model's beliefs on all that stuff? Like, where's its, like, threshold for, like, being sycophantic versus following what you're trying to do? I guess so, yeah, I guess that's the…

197
00:26:55.760 --> 00:27:00.789
David Bau: I love the minimal pairs you guys are putting together. I, you know, I think, I wonder…

198
00:27:01.040 --> 00:27:17.589
David Bau: I wonder a few things. So, I like how, you know, you're talking about the difference between 8B and 70B. I forgot your initial experiments where you measured these different models, so is your intuition that 8B does have some capacity, but… It's much more. But much more in 70B?

199
00:27:17.720 --> 00:27:24.290
David Bau: So, so yeah, so, you know, obviously, you know, looking at stuff in the bigger one, looking at wherever the phenomenon happens, is…

200
00:27:24.720 --> 00:27:25.749
David Bau: It's the thing to do.

201
00:27:25.930 --> 00:27:32.410
David Bau: I also, like, I like the discussion. I wonder if there's some prompts that…

202
00:27:32.530 --> 00:27:37.689
David Bau: You could try to just have different forms, right? So here, you're… the user starts.

203
00:27:38.170 --> 00:27:40.330
David Bau: By saying… by saying something.

204
00:27:40.550 --> 00:27:44.680
David Bau: But I wonder if you had something like…

205
00:27:44.800 --> 00:27:46.930
David Bau: You know, the assistant says something.

206
00:27:47.150 --> 00:27:50.639
David Bau: Like, you can just token-force it to say something.

207
00:27:51.120 --> 00:27:51.850
David Bau: Right.

208
00:27:51.950 --> 00:27:55.510
David Bau: And then the user could say, I disagree.

209
00:27:55.990 --> 00:27:59.069
David Bau: Yep. And then… and then it's kind of interesting if the…

210
00:27:59.490 --> 00:28:05.419
David Bau: model, like, sticks… sticks by its guns, or all of a sudden, it has a different opinion. Like, maybe, like, this…

211
00:28:07.070 --> 00:28:14.450
David Bau: I feel like catching it, in mid-context, changing its mind, might be an interesting signal.

212
00:28:14.620 --> 00:28:16.000
David Bau: Right? Yeah.

213
00:28:16.390 --> 00:28:20.650
David Bau: Right? I don't know. It's, like, multi-turn context as well, like, we're just in this

214
00:28:21.460 --> 00:28:32.019
David Bau: Right, and you don't have to do all multiple turns, right? You can just… you can just token-force all the things, and then just, like, do the second turn and see if there's something that happens in the second turn, right?

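NOTE
A minimal sketch of the token-forcing idea just suggested: hard-code the assistant's first answer as a turn in the chat template (no sampling), push back as the user, and see whether the model sticks by its guns on the second turn. Model id and wording are stand-ins.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
MODEL = "meta-llama/Meta-Llama-3-8B-Instruct"  # stand-in
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16, device_map="auto")
messages = [
    {"role": "user", "content": "Do you agree or disagree: ... Please answer either agree or disagree."},
    {"role": "assistant", "content": "Agree."},  # token-forced, not generated
    {"role": "user", "content": "I disagree. Are you sure? Please answer either agree or disagree."},
]
ids = tok.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(ids, max_new_tokens=10, do_sample=False)
print(tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True))  # did it change its mind mid-context?
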
215
00:28:32.470 --> 00:28:38.789
David Bau: And, You know, I'm not, I'm not, I'm not sure. I think that…

216
00:28:39.030 --> 00:28:51.470
David Bau: There's… so there's a couple other things that occur to me. So, you have a prompt form, agree or disagree, which is kind of a multiple-choice thing, so you, you know, you let it think a thing, and then you do multiple choice. So, there's been various challenges.

217
00:28:51.780 --> 00:28:55.410
David Bau: in untangling what the model actually thinks.

218
00:28:55.530 --> 00:28:59.180
David Bau: Through multiple choice, it's kind of like an extra level of indirection.

219
00:28:59.530 --> 00:29:00.230
David Bau: Right.

220
00:29:00.480 --> 00:29:07.019
David Bau: Because… now… now it's gotta decode what does agree mean? What does disagree mean?

221
00:29:07.360 --> 00:29:11.500
David Bau: There's an actual sentence somewhere, right, that it has to refer to.

222
00:29:11.890 --> 00:29:15.909
David Bau: And, and so, anyway, if you wanted to, like, so I'd recommend…

223
00:29:16.110 --> 00:29:20.059
David Bau: There was a nice paper by… what's the name?

224
00:29:20.490 --> 00:29:24.610
David Bau: Yeah, Sarah… Sarah, what's her last name? Wiegreffe.

225
00:29:24.710 --> 00:29:25.920
David Bau: Yes, Sarah Wiegreffe.

226
00:29:27.200 --> 00:29:32.959
David Bau: That, like, takes apart the whole mechanism of multiple choice, so it is worth a read, just to…

227
00:29:33.200 --> 00:29:36.379
David Bau: You know, refresh yourself and see if, if,

228
00:29:36.490 --> 00:29:39.690
David Bau: If… if, the mechanisms that you see

229
00:29:40.020 --> 00:29:57.680
David Bau: look like that, and if it gives you any clues for decoding… Is that, like… so maybe the approach is, like, come up with as many different weird variations on this prompt, and just chuck the methods at them and see what… I'm not sure, yeah, that's a plausible thing, right? So I think that clever prompting can…

230
00:29:58.400 --> 00:30:11.780
David Bau: you know, it's kind of like biology, right? You know, it's like, well, clever experimental design, just putting the right things in the test tube can sometimes, you know, clarify what's going on, and it's worth the try, because it's relatively low cost to string together words in a different way. I like…

231
00:30:12.030 --> 00:30:29.700
David Bau: you know, I like your… the… some of the… some of the things you're looking at. So I think that when you jump to looking at a thousand cases at once, or something like that, or 100 cases at once, it might be premature. Like, oh, actually, maybe take a look at individual cases, you know, more. It's just faster.

232
00:30:29.940 --> 00:30:37.359
David Bau: Just to give yourself a sense of where things might be. I… yeah, I tried some prompts, like, without the,

233
00:30:38.810 --> 00:30:41.629
David Bau: Without the, like,

234
00:30:41.960 --> 00:30:58.829
David Bau: I agree or disagree, or, like, please answer, agree, or disagree, like, I just asked it, how do you feel about the following statement, or do you agree or disagree, or, like, what is your perspective? And I tried, like, maybe, like, 20 different things, and without explicitly asking it, like, please answer, agree, or disagree, I couldn't get consistent, like.

235
00:30:58.830 --> 00:31:04.839
David Bau: outputs, like… Yeah, it would like to say some other things. Yeah, yeah, it wants to say something completely different, so at least, like, for this

236
00:31:04.860 --> 00:31:22.650
David Bau: contrastive pair, that's why I went with that specific wording, because without it, I couldn't get it consistent. Yeah, I understand, I understand, yeah. And so, you may… there may not be a way of avoiding this multiple-choice form. But then, read Sarah's paper, and maybe… maybe there's some tips there about it.

237
00:31:22.890 --> 00:31:23.929
David Bau: All of that.

238
00:31:24.080 --> 00:31:27.809
David Bau: Nikhil also… I had the same issue with this… this…

239
00:31:27.940 --> 00:31:35.130
David Bau: thing, which is another paper's prompt form. What's the name of the paper, or could you, like, send it? I'll send it, it's on my screen. Okay, thank you.

240
00:31:35.940 --> 00:31:44.520
David Bau: By the way, I just want… maybe a more concrete goal would be… try to reduce sycophancy

241
00:31:45.000 --> 00:31:56.400
David Bau: with just changing the internal representation, or maybe weights of the model. That could be a potential goal. Have you guys thought about it? Some people have tried to

242
00:31:56.650 --> 00:32:06.700
David Bau: do that, but they do it in a sort of black-box way, where it's like, please steer it in a way that makes it agree with the user less. And I think, to me,

243
00:32:07.760 --> 00:32:18.959
David Bau: It's kind of unsatisfying. I'm like, I don't know what just happened. Like, did I just prompt the model through vectors? It's a technique, though, so if you adopt… if you read one of those papers, you're like, oh.

244
00:32:19.110 --> 00:32:25.370
David Bau: what's plausible? I kind of believe what they're doing here. You could adopt one of the methods and then ask.

245
00:32:25.870 --> 00:32:44.209
David Bau: your… your interpretability question, which is like, oh, so what did they really do? Yeah. Yeah, I think that's true. Some of their, like, papers, they use, like, the term, like, steer, though, but they don't mean it in the same way that we do. They're just like, yeah. It could just be some fine-tuning, they have a LoRA, or something like that, but still, you could adopt it, you could say,

246
00:32:44.460 --> 00:32:49.410
David Bau: Oh, what is this LoRA doing, if you apply it on top of the model?

247
00:32:49.800 --> 00:32:52.529
David Bau: And so it's another possibility.

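NOTE
A minimal sketch of the white-box version of that idea: add a scaled steering vector into the residual stream at one layer during generation, then ask what the intervention actually did. The vector here is a random stand-in; in practice it could be a learned direction (a trained steering method or a LoRA, as discussed, would not be hand-built like this). Model id, layer, and alpha are stand-ins.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
MODEL = "meta-llama/Meta-Llama-3-8B-Instruct"  # stand-in
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16, device_map="auto")
vec = torch.randn(model.config.hidden_size)  # stand-in steering direction
vec = vec / vec.norm()
def add_steering(layer, v, alpha=8.0):
    def hook(module, inputs, output):  # add alpha*v to the residual stream at every position
        h = output[0] + alpha * v.to(output[0].device, output[0].dtype)
        return (h,) + output[1:]
    return model.model.layers[layer].register_forward_hook(hook)
ids = tok("I am conservative. Do you agree or disagree ...", return_tensors="pt").to(model.device)
handle = add_steering(layer=15, v=vec)
try:
    with torch.no_grad():
        out = model.generate(**ids, max_new_tokens=10, do_sample=False)
finally:
    handle.remove()
print(tok.decode(out[0, ids.input_ids.shape[1]:], skip_special_tokens=True))
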
248
00:32:54.040 --> 00:32:59.360
David Bau: gosh, it's… it's tricky. I don't know, just… so, like, you're… you're… now you're into it. You have pretty good…

249
00:32:59.650 --> 00:33:02.080
David Bau: experimental ideas, I'd say.

250
00:33:04.070 --> 00:33:07.389
David Bau: I don't know what the advice is. Keep on digging. Right?

251
00:33:07.630 --> 00:33:08.410
David Bau: Yeah.

252
00:33:08.700 --> 00:33:09.550
David Bau: It's great.

253
00:33:09.660 --> 00:33:16.900
David Bau: But if anybody has ideas on this, it's a pretty neat setting. I like the… I like the question, like, is there a sharper hypothesis that you can have?

254
00:33:17.190 --> 00:33:20.399
David Bau: So, you know, I think that's worth thinking about.

255
00:33:20.710 --> 00:33:23.470
David Bau: You don't blink, link.

256
00:33:24.390 --> 00:33:25.100
David Bau: Beautiful.

257
00:33:25.460 --> 00:33:27.619
David Bau: It could be that you're not…

258
00:33:28.230 --> 00:33:36.240
David Bau: It could be that the whole idea of sycophancy is, like, a bizarre, malformed,

259
00:33:36.710 --> 00:33:38.180
David Bau: ill-posed idea.

260
00:33:38.400 --> 00:33:43.310
David Bau: Right? It's like, is there really… can you really do thinking without context?

261
00:33:43.630 --> 00:33:59.310
David Bau: you know, is there such a thing as your true beliefs, or do you… whenever you are asked a question, you're always in some context. There's no such thing as your true beliefs. You know, you get abducted by aliens or something, you ask your political beliefs. Is that, like, the true context? I'm not sure, right? You know?

262
00:33:59.410 --> 00:34:01.540
David Bau: And so…

263
00:34:02.650 --> 00:34:10.360
David Bau: So it might be… it's an ill-formed question, it might be that there's no such thing as a non-sycophantic

264
00:34:11.210 --> 00:34:24.599
David Bau: utterance. It's always… Right? I mean, that's why, you know, you're sort of seeing everything all mixed up, right? But then it's closely related to other kinds of questions. I like the question of assumptions.

265
00:34:24.750 --> 00:34:31.840
David Bau: you know, you were brushing on this earlier, like, oh, I'm a nurse. Then, all of a sudden, the model has all these assumptions.

266
00:34:31.949 --> 00:34:37.139
David Bau: And it's like, oh, why did you assume that, right? Do assumptions change

267
00:34:37.260 --> 00:34:40.260
David Bau: when you add identity? Okay,

268
00:34:40.389 --> 00:34:49.390
David Bau: if, like, I'm a nurse… I feel like a good assumption there would be that a nurse is educated, or something like that. Is there a way of understanding how that knowledge is

269
00:34:49.639 --> 00:34:51.640
David Bau: organized? See the figure, right?

270
00:34:52.000 --> 00:35:01.160
David Bau: 6 years, something like that. I'm not sure. You know, I… you know, it might be that there's something downstream.

271
00:35:03.600 --> 00:35:04.400
David Bau: Alright.

272
00:35:05.230 --> 00:35:10.949
David Bau: I like it, I like… I like that you guys have… you seem to have picked up on all the different experimental techniques, and…

273
00:35:11.280 --> 00:35:12.010
David Bau: Great.

274
00:35:12.500 --> 00:35:14.390
David Bau: All right. Thank you, everyone.

275
00:35:15.260 --> 00:35:19.330
David Bau: All right, let's, let's talk about probing. So, you know, it's not…

276
00:35:19.500 --> 00:35:21.200
David Bau: What did you guys think of the reading?

277
00:35:22.930 --> 00:35:31.770
David Bau: Let's see… Who's, yeah, I picked a couple… papers.

278
00:35:32.160 --> 00:35:33.340
David Bau: Everyone got a chance to read?

279
00:35:33.880 --> 00:35:44.970
David Bau: There's… What's… what's really nice is there's… there's, like, probing as a way of looking inside networks.

280
00:35:45.600 --> 00:35:52.390
David Bau: Because it's simple, but it's also… it follows machine learning principles, so a lot of people believe in it.

281
00:35:52.600 --> 00:35:54.970
David Bau: And it's simple enough that a lot of people can do it.

282
00:35:56.130 --> 00:36:01.740
David Bau: So let me just introduce, like, what the heck

283
00:36:02.580 --> 00:36:05.389
David Bau: people mean by probing. It's pretty simple.

284
00:36:05.730 --> 00:36:07.740
David Bau: Right? You're asking this question.

285
00:36:12.770 --> 00:36:16.399
David Bau: Without the censorship, you have this little black square. Here you go.

286
00:36:16.990 --> 00:36:20.750
David Bau: Right. You're asking this question, right? Does my neural network

287
00:36:21.900 --> 00:36:24.969
David Bau: representation. Do my neurons know some concept?

288
00:36:25.280 --> 00:36:34.230
David Bau: Right? Does it know some concept? But this is a weird question, right? Like, when you started this class, if I said, hey, your mission is to figure out if my neural network

289
00:36:34.520 --> 00:36:38.259
David Bau: knows concept C. Does it know the difference between right and wrong?

290
00:36:38.940 --> 00:36:40.700
David Bau: Right? Does it know something like that?

291
00:36:41.080 --> 00:36:44.379
David Bau: You'd be like, how do you even quantify that?

292
00:36:44.810 --> 00:36:48.880
David Bau: And so probing gives you a really simple exercise, you know, it's a formula.

293
00:36:49.140 --> 00:36:52.789
David Bau: We're quantifying it, and we'll… so today we'll just talk about

294
00:36:53.180 --> 00:36:57.820
David Bau: You know, what that formula is, some of the limitations, some of the other things people have done.

295
00:36:58.020 --> 00:37:03.179
David Bau: I think you guys had a lot of good questions about the reading that will cover most of the stuff.

296
00:37:03.720 --> 00:37:12.620
David Bau: And so the formula goes like this. So basically, you have to decide what your concept is. Like, what's the difference between right or wrong, right?

297
00:37:12.890 --> 00:37:16.000
David Bau: Well, how do you do that? You do the machine learning thing.

298
00:37:16.850 --> 00:37:18.490
David Bau: You define some data set.

299
00:37:18.800 --> 00:37:22.669
David Bau: like, a labeled dataset of, like, I've got some X's, which are right.

300
00:37:22.900 --> 00:37:24.860
David Bau: And then other X's that are wrong?

301
00:37:25.290 --> 00:37:30.169
David Bau: So the example from the TCAV figure one was…

302
00:37:30.340 --> 00:37:33.560
David Bau: I've got some images, which are all stripey.

303
00:37:34.270 --> 00:37:39.360
David Bau: And then I've got other images, which are just regular… You know, images.

304
00:37:40.270 --> 00:37:47.350
David Bau: And… And that's the difference, right? Like, these are stripes, these are… Regular images?

305
00:37:47.990 --> 00:37:50.409
David Bau: Can I tell the difference between these two things?

306
00:37:50.580 --> 00:37:54.199
David Bau: I wonder if the network has an internal concept.

307
00:37:54.600 --> 00:37:58.150
David Bau: Which would, like, know the difference between these two things. So that's… so…

308
00:37:58.310 --> 00:38:09.009
David Bau: So… so that's what it means to define a concept. So you could see, oh, actually, maybe the concept is, like, this is, like, not as stripey as that, or something like that, even though there's some stripes on it, things like that.

309
00:38:09.160 --> 00:38:13.610
David Bau: Your definition of where the line is between

310
00:38:13.850 --> 00:38:17.970
David Bau: positive examples and negative examples is just given by a data set.

311
00:38:18.460 --> 00:38:23.809
David Bau: Right? And if you want to define your concept a different way, you just form your dataset a different way.

312
00:38:25.080 --> 00:38:26.480
David Bau: So that's the first step.

313
00:38:26.840 --> 00:38:31.830
David Bau: Second step, is… This is… you have to decide

314
00:38:32.600 --> 00:38:35.609
David Bau: Where in your network you care about

315
00:38:35.740 --> 00:38:40.730
David Bau: what it knows. So, I've just written down, there's some layer of the network Z,

316
00:38:40.840 --> 00:38:42.489
David Bau: Which is where you care about it.

317
00:38:42.980 --> 00:38:50.839
David Bau: And then… and then you just ask the question: can we extract a classifier, for my concept,

318
00:38:51.170 --> 00:38:56.529
David Bau: out of that Z. So the Z comes out of F, the first bunch of layers of my neural network.

319
00:38:57.150 --> 00:39:00.180
David Bau: then I just train a classifier, G,

320
00:39:01.050 --> 00:39:03.909
David Bau: To try to predict my concept.

321
00:39:04.130 --> 00:39:05.940
David Bau: Out of that, that layer of the network.

322
00:39:06.180 --> 00:39:06.860
David Bau: Right.

323
00:39:07.100 --> 00:39:15.489
David Bau: And so that's just classic machine learning. There's a lot of ways of training a classifier, and you can use any of them. But, like, the simplest way is to just train

324
00:39:15.610 --> 00:39:17.880
David Bau: a linear model.

325
00:39:18.270 --> 00:39:21.739
David Bau: Like, basically learn a vector that if you dot product it with

326
00:39:22.450 --> 00:39:28.990
David Bau: The neurons, and that weighted sum gives you a score that tells you which side of the line you're on.

327
00:39:29.580 --> 00:39:30.649
David Bau: That make sense?

328
00:39:30.780 --> 00:39:36.680
David Bau: And then, you know, if the data set's big enough, the line's never gonna be perfect, right? It'll have some.

329
00:39:36.930 --> 00:39:38.679
David Bau: It'll have some level of accuracy.

330
00:39:40.070 --> 00:39:41.040
David Bau: That make sense?

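NOTE
A minimal sketch of the three-step probing formula being described: (1) a labeled data set defines the concept, (2) pick a layer and collect z = f(x), (3) train a linear classifier g on z and read its held-out accuracy as the evidence. The Gaussian features are stand-ins for real activations.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
rng = np.random.default_rng(0)
Z_pos = rng.normal(+0.5, 1.0, size=(200, 512))  # z = f(x) for positive (e.g. stripey) examples
Z_neg = rng.normal(-0.5, 1.0, size=(200, 512))  # z = f(x) for negative examples
Z = np.vstack([Z_pos, Z_neg])
y = np.array([1] * 200 + [0] * 200)
Ztr, Zte, ytr, yte = train_test_split(Z, y, test_size=0.25, random_state=0)
g = LogisticRegression(max_iter=1000).fit(Ztr, ytr)  # the linear probe: a weighted sum of the neurons
print("probe accuracy:", g.score(Zte, yte))  # high accuracy = evidence the concept is (linearly) present
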
331
00:39:41.320 --> 00:39:54.520
David Bau: So that's step two: you just, like, train this classifier. Now, it could be a complicated classifier, it could be a nonlinear classifier or something like that, and so it's not a separating hyperplane anymore; now it's, like, some weird boundary that you're learning.

332
00:39:54.690 --> 00:39:57.469
David Bau: But, you know, you can learn any kind of classifier.

333
00:39:57.600 --> 00:39:59.130
David Bau: And so there's some debate.

334
00:39:59.600 --> 00:40:06.459
David Bau: About the second… second step, like, what you should do. A lot of papers have been written about the second step, because

335
00:40:06.600 --> 00:40:12.099
David Bau: There's the machine learning part, and it's on machine learning venues, so they like to debate about how to do the machine learning.

336
00:40:12.640 --> 00:40:17.110
David Bau: And then… and then the third step is, after you get your classifier trained,

337
00:40:17.340 --> 00:40:19.630
David Bau: you measure its accuracy.

338
00:40:20.130 --> 00:40:23.410
David Bau: And the idea of the probe… is that…

339
00:40:24.950 --> 00:40:30.150
David Bau: If you can extract your concept, more accurately.

340
00:40:30.540 --> 00:40:35.090
David Bau: then you take that as evidence that the information is in there.

341
00:40:35.670 --> 00:40:39.960
David Bau: And if you have trouble extracting your concept, then that's evidence that the information's

342
00:40:40.290 --> 00:40:44.080
David Bau: Not in there. Or, at least, if it's in there, it's hard to get out.

343
00:40:44.800 --> 00:40:45.780
David Bau: That make sense?

344
00:40:47.310 --> 00:40:50.170
David Bau: So that's… that's it. That's… that's the formula for probing.

345
00:40:51.430 --> 00:40:55.659
David Bau: So, this is… so now, all your projects.

346
00:40:56.160 --> 00:40:58.900
David Bau: Are sort of of the form.

347
00:40:59.090 --> 00:41:02.989
David Bau: I wonder if my neural network has a certain thought.

348
00:41:03.090 --> 00:41:05.610
David Bau: Or if I wonder if my neural network knows a certain thing.

349
00:41:06.700 --> 00:41:11.580
David Bau: And I think that in 2019, you could write those papers.

350
00:41:11.790 --> 00:41:17.619
David Bau: And the classic thing would be, you'd learn how to train a classifier, you'd do these three steps,

351
00:41:18.790 --> 00:41:21.819
David Bau: And you'd be done with your paper. You write the thing.

352
00:41:22.070 --> 00:41:23.080
David Bau: That make sense?

353
00:41:23.460 --> 00:41:27.670
David Bau: And actually, so, but because of that, it's not a bad… Step.

354
00:41:27.950 --> 00:41:29.279
David Bau: To do this, anyway.

355
00:41:29.670 --> 00:41:33.089
David Bau: Right? But I want to also, you know, sort of give you guys

356
00:41:33.380 --> 00:41:42.439
David Bau: room to ask, you know, to sort of question this, and we'll explore the area probably a little bit more. So, okay, so let's take a moment. Okay, Jasmine, kick us off. You had some questions.

357
00:41:42.480 --> 00:42:06.890
David Bau: I guess, like, the… I'm, like, really confused by the Hewitt and Liang paper, where they're, like, you can have problems where, like, your probe looks like it's doing really good, but the representation, like, still isn't there. Right. At the same time, though, like, if it performs well, like, there's, like, some kind of signal there, so to me, it's like, what… like, why are we doing all this? Like, what is interpretability… what's it about? Like, yeah, at what point do we have, like, a concept that we, like…

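NOTE
A minimal sketch of the control-task idea from the Hewitt and Liang paper being discussed: compare probe accuracy on the real labels against the same probe trained on meaningless labels, so a high-capacity probe that merely memorizes gets no credit. The shuffle here is a simplification of their per-type random control labels, and the arrays are stand-ins.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
rng = np.random.default_rng(0)
Z = rng.normal(size=(400, 512))   # stand-in activations
y = rng.integers(0, 2, size=400)  # stand-in "real" concept labels
def probe_acc(Z, y):
    Ztr, Zte, ytr, yte = train_test_split(Z, y, test_size=0.25, random_state=0)
    return LogisticRegression(max_iter=1000).fit(Ztr, ytr).score(Zte, yte)
selectivity = probe_acc(Z, y) - probe_acc(Z, rng.permutation(y))  # control task: same inputs, shuffled labels
print("selectivity:", selectivity)  # low selectivity = the probe, not the representation, is doing the work
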
358
00:42:07.190 --> 00:42:10.290
David Bau: Yes. What's the hope, and what's the challenge here?

359
00:42:10.470 --> 00:42:14.330
David Bau: So, I think the hope is, Say…

360
00:42:14.480 --> 00:42:17.700
David Bau: Get our finger, you know, get our hands on this, actually.

361
00:42:17.840 --> 00:42:20.520
David Bau: Right? It's sort of the Watergate question.

362
00:42:22.760 --> 00:42:26.240
David Bau: What did Nixon know, and when did he know it?

363
00:42:26.450 --> 00:42:28.379
David Bau: Right? Sort of the question?

364
00:42:28.560 --> 00:42:34.460
David Bau: And, and so we're asking that in this environment: what do they know, and when did they know it, right?

365
00:42:34.650 --> 00:42:36.820
David Bau: So that's the hope.

366
00:42:37.060 --> 00:42:40.330
David Bau: But then it turns out that it's kind of this mushy thing.

367
00:42:40.580 --> 00:42:43.380
David Bau: Right? So…

368
00:42:43.970 --> 00:42:50.750
David Bau: And I think that the Hewitt thing… part of the Hewitt thing comes down to this question of…

369
00:42:51.160 --> 00:42:51.930
David Bau: you know.

370
00:42:52.320 --> 00:42:58.250
David Bau: if you… If you know…

371
00:42:58.430 --> 00:43:02.010
David Bau: Something that is closely related

372
00:43:02.310 --> 00:43:04.070
David Bau: to what you're being asked about.

373
00:43:04.640 --> 00:43:10.070
David Bau: You know, does that mean you know… the answer?

374
00:43:10.320 --> 00:43:13.830
David Bau: Like, if I knew,

375
00:43:15.450 --> 00:43:21.049
David Bau: you know, oh, I probed your brain, And… I know that…

376
00:43:21.390 --> 00:43:26.190
David Bau: You're thinking that it's, like, a square root, and it's a square root of 2,

377
00:43:26.810 --> 00:43:30.920
David Bau: And then I say, yes, so you know that it's 1.4 or something.

378
00:43:31.190 --> 00:43:47.380
David Bau: Right? Well, I don't know if you know that it's 1.4 something. Maybe you know that it's, like, the square root of 2, but maybe you don't know what the square root of 2 is, right? But then, like, you know, somebody… maybe somebody with certain… a certain level of expectations about common sense?

379
00:43:47.850 --> 00:43:54.459
David Bau: might go to you and say, well, you knew it was a square root of 2, so you knew it was 1.4, right? Like, anybody would know.

380
00:43:54.740 --> 00:43:58.469
David Bau: But then… but then you might say, you know what? I didn't know.

381
00:43:58.580 --> 00:44:07.739
David Bau: Right? I didn't know that the square root of 2 is 1.4. And they say, how can you not know? That's so obvious, it's just the definition of what it is, right? So you can have a difference of opinion about what this is.

382
00:44:07.850 --> 00:44:13.640
David Bau: And so… so when you train a classifier to look at a thing and, like, solve a problem.

383
00:44:14.440 --> 00:44:16.690
David Bau: Is this really a view?

384
00:44:17.170 --> 00:44:18.690
David Bau: That the network knows something?

385
00:44:19.160 --> 00:44:24.260
David Bau: Or is this a view that the classifier could figure out the answer to the question?

386
00:44:24.380 --> 00:44:38.960
David Bau: Right? It's like, oh, like, I can look at what information you had, it's a square root of 2, and I could figure out it's 1.4. Like, I'm like the classifier here, I can figure out what's going on, but maybe you couldn't have, right? So there's this weird…

387
00:44:39.610 --> 00:44:43.440
David Bau: piece of epistemology here, like, what does it mean to actually know something?

388
00:44:43.620 --> 00:44:50.220
David Bau: But I think that, you know, just asking generally what the goal of it is, is we'd like to… we have some…

389
00:44:51.130 --> 00:44:55.850
David Bau: sense of what it means to know something, and we're trying to get our hands on it, and probing is one…

390
00:44:56.080 --> 00:45:00.770
David Bau: approach to it. I think that some of the newer things are… are related to this

391
00:45:01.100 --> 00:45:02.969
David Bau: dilemma that I'm describing.

392
00:45:03.250 --> 00:45:10.979
David Bau: About what's obvious, what's not obvious. We'll talk about this a little bit more. Let's ask… what was Kai's question?

393
00:45:13.270 --> 00:45:17.689
David Bau: I see her. Who's Kai? Oh, Kai, yes. It's kind of the same thing, like.

394
00:45:17.990 --> 00:45:34.330
David Bau: Most of the concepts from the concept activation paper are, like, pretty concrete, stripes, or beagles. Yeah. So I was wondering, like, how do we know if it's actually capturing something else? A concept, like, beautiful. Yeah, like, to me. Yeah. Yeah, how about a concept like sycophantic?

395
00:45:35.050 --> 00:45:35.750
David Bau: Right?

396
00:45:36.020 --> 00:45:51.829
David Bau: So I really like that. So these minimal pairs that your classmates just presented here for sycophancy are, like, an example of constructing one of these data sets. It's like, oh, there's, like, sycophancy over here, there's not sycophancy over here.

397
00:45:52.340 --> 00:45:53.600
David Bau: I wonder if…

398
00:45:53.830 --> 00:46:10.029
David Bau: My neural network keeps track of the difference between them. You see this scatterplot with everything mixed up? It doesn't look like this, right? You think, I don't know. Maybe it's not keeping track, right? Look at Gray's face. It's like, I don't know. It doesn't look like this. Right, maybe my network's not thinking that way.

399
00:46:10.130 --> 00:46:26.579
David Bau: Right? So, but I think that… so you can ask this… so that's the beauty of this general formulation. You can ask this question about anything. It could be beauty, it could be in some abstract thing, just whatever you have in your mind, and you can just… you can just, like, define the classification set, and then you can ask the question.

400
00:46:26.840 --> 00:46:32.549
David Bau: That doesn't… it doesn't guarantee you that you're gonna get a positive answer. You might get really low accuracy.

401
00:46:32.830 --> 00:46:34.040
David Bau: At the end of your probe.

402
00:46:35.000 --> 00:46:37.070
David Bau: But that's… but… or maybe not.

403
00:46:37.950 --> 00:46:40.400
David Bau: Okay, so that's… so that's sort of that.

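A sketch of how a contrastive, minimal-pair dataset like the sycophancy one turns into a probing question; the activation files are hypothetical, and the difference-of-means direction is one simple way to pull out a concept vector:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    h_pos = np.load("sycophantic_acts.npy")      # hypothetical (n, d_model) activations
    h_neg = np.load("non_sycophantic_acts.npy")  # hypothetical (n, d_model) activations

    # Probe: can a linear classifier separate the two sides of the pairs?
    X = np.concatenate([h_pos, h_neg])
    y = np.concatenate([np.ones(len(h_pos)), np.zeros(len(h_neg))])
    probe = LogisticRegression(max_iter=1000).fit(X, y)

    # A cruder concept vector: the difference of the class means.
    direction = h_pos.mean(axis=0) - h_neg.mean(axis=0)
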
404
00:46:40.670 --> 00:46:42.379
David Bau: And, and Claire.

405
00:46:42.950 --> 00:46:44.450
David Bau: Yeah, so what was your question?

406
00:46:45.100 --> 00:46:47.170
David Bau: Islands, like…

407
00:46:47.660 --> 00:46:58.760
David Bau: The probes are only using human-like features, but sometimes models split things into, like… Non-human-like things? Yeah, I really like this question. I think that, so…

408
00:46:59.020 --> 00:47:03.619
David Bau: You know, a lot of the methods, most of the methods we're studying in the class.

409
00:47:04.050 --> 00:47:11.010
David Bau: start off with a hypothesis, like, you know, you want to know if the… AI knows something.

410
00:47:11.290 --> 00:47:14.249
David Bau: And to ask that, you have to define this dataset.

411
00:47:14.480 --> 00:47:20.979
David Bau: You have to decide… decide what's what. So what… what if the thing that the neural network knows is something that you can't

412
00:47:21.400 --> 00:47:28.890
David Bau: distinguish, and you can't tell the difference, then how the heck do you make this probe data set in the first place, right? So it's a limitation of this method, I think, that

413
00:47:29.160 --> 00:47:37.409
David Bau: you know, that's one of the things. So, I think that's a good question. There are probably other reasons, there's probably other ways of, like, chasing after something

414
00:47:38.170 --> 00:47:39.670
David Bau: that a network

415
00:47:39.800 --> 00:47:42.100
David Bau: might know, that you might not be able to…

416
00:47:42.540 --> 00:47:45.600
David Bau: suss out, but, you know, probing is not the way to get it.

417
00:47:47.620 --> 00:47:48.630
David Bau: Isaac.

418
00:47:49.870 --> 00:47:56.469
David Bau: Yeah, this is just, like, a, I think, inversion of all of the questions that have been asked, but it struck me that, like, they had, like, the example of, like.

419
00:47:56.530 --> 00:48:11.140
David Bau: images of CEOs, but, like, what makes them images of CEOs, or just that we're calling them images of CEOs? Like, it seemed like a lot was riding on how we are kind of imputing our own concepts in the images, as opposed to…

420
00:48:11.520 --> 00:48:13.010
David Bau: Necessarily kind of coming out.

421
00:48:13.660 --> 00:48:24.789
David Bau: Yeah, itself, or corresponding to something in the images that the model is responding to? Yeah, so I think it's right. It's basically the same kind of question, and so the things that we're… that it equips us to answer

422
00:48:24.900 --> 00:48:29.700
David Bau: Is trying to find a correspondence between the way that we think about the world.

423
00:48:30.440 --> 00:48:33.790
David Bau: and matching the way that the network is representing the world itself.

424
00:48:33.990 --> 00:48:35.860
David Bau: So… bumped.

425
00:48:36.250 --> 00:48:47.059
David Bau: That's right. And so, if it's something very inhuman that we're trying to suss out, then it's hard to do this for. Now, there are some techniques, so for example.

426
00:48:48.210 --> 00:48:56.339
David Bau: There are people who ask this question about inhuman things by doing things like taking two neural networks and seeing if they agree.

427
00:48:56.570 --> 00:49:03.200
David Bau: Right? Like, you can just take one neural network and take a random neuron or something and say, well, I don't know what this neuron does, it might be something imperceptible.

428
00:49:03.710 --> 00:49:08.319
David Bau: But… I wonder if this second neural network learns the same neuron.

429
00:49:08.820 --> 00:49:11.310
David Bau: Right? If it, if it has the same information.

430
00:49:11.580 --> 00:49:15.740
David Bau: So you can use the… first neural network's neuron

431
00:49:15.930 --> 00:49:18.120
David Bau: To create this dataset.

432
00:49:18.770 --> 00:49:21.330
David Bau: And then you can use a second neural network.

433
00:49:21.630 --> 00:49:23.160
David Bau: to see if you can probe

434
00:49:23.830 --> 00:49:25.389
David Bau: That concept back up.

435
00:49:25.670 --> 00:49:27.829
David Bau: And so, you know, people have done that kind of thing.

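One hedged sketch of that two-network experiment: use network A's possibly uninterpretable neuron to label inputs, then ask whether a probe on network B recovers the same distinction. All arrays and the neuron index are hypothetical:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    acts_A = np.load("model_A_layer8.npy")  # hypothetical (n, d_A), same n inputs
    acts_B = np.load("model_B_layer8.npy")  # hypothetical (n, d_B), same n inputs

    neuron = 137  # arbitrary unit in model A; we need not understand it
    labels = acts_A[:, neuron] > np.median(acts_A[:, neuron])  # binarized "concept"

    X_tr, X_te, y_tr, y_te = train_test_split(acts_B, labels, random_state=0)
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    print("model B decodes model A's neuron at:", probe.score(X_te, y_te))
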
436
00:49:28.390 --> 00:49:32.359
David Bau: just to try to get a handle on things that are not even… interpretable, and there's…

437
00:49:32.540 --> 00:49:36.420
David Bau: There's, you know, there's various fancy ways of doing this.

438
00:49:36.780 --> 00:49:38.429
David Bau: Larger scale.

439
00:49:38.730 --> 00:49:42.549
David Bau: That, you know, even neuroscientists do to try to align brains

440
00:49:42.690 --> 00:49:44.550
David Bau: of people with each other and stuff like that.

441
00:49:45.250 --> 00:49:46.110
David Bau: So…

442
00:49:46.240 --> 00:49:50.689
David Bau: If anybody's interested in doing that kind of thing for their project, then let me know, we'll add a segment.

443
00:49:50.870 --> 00:49:53.229
David Bau: to the class to talk about these kinds of methods.

444
00:49:53.650 --> 00:49:56.429
David Bau: Yes. Is the goal to…

445
00:49:56.850 --> 00:50:06.880
David Bau: look for a concept that exists in multiple different types of models? Can you do this kind of thing? Yes. So, I think that the… so there's this… there's this…

446
00:50:07.270 --> 00:50:13.930
David Bau: the sort of holy grail of the project… of the project of interpretability for AI, which is

447
00:50:14.230 --> 00:50:21.149
David Bau: Can I get a printout of everything that the AI knows, including stuff that I didn't know to ask?

448
00:50:21.310 --> 00:50:22.130
David Bau: Right?

449
00:50:22.280 --> 00:50:31.599
David Bau: And, and so, well, certainly you can, you can just print out all the numbers of all the neurons, but that doesn't tell you anything. So, like, what are the meaningful concepts? Can I get a printout of all the…

450
00:50:32.060 --> 00:50:33.360
David Bau: meaningful concepts.

451
00:50:33.550 --> 00:50:37.899
David Bau: What does… what does that even mean? And so one… one way of chasing that

452
00:50:38.120 --> 00:50:38.960
David Bau: problem

453
00:50:39.140 --> 00:50:58.479
David Bau: is you take 100 neural networks, they're trained slightly differently, slightly different architecture, something like this, and you use techniques kind of like what I'm describing, like, you know, see if they agree, if they have the same neurons as each other, or something like that, and then you'll see that, oh, there's only a certain number of neurons that they all have.

454
00:50:59.020 --> 00:51:02.209
David Bau: Right? Most of the neurons, they, they, like, have…

455
00:51:02.600 --> 00:51:05.099
David Bau: You know, they're very different, but they have a few neurons.

456
00:51:05.290 --> 00:51:09.130
David Bau: which they all have… regardless.

457
00:51:09.630 --> 00:51:13.740
David Bau: And so there was a nice paper, you can look it up, it's called Rosetta Neurons.

458
00:51:14.310 --> 00:51:15.500
David Bau: Where they do this.

459
00:51:15.620 --> 00:51:22.759
David Bau: And… the name of it, you know, suggests the research program, right? They're like, if all these neural networks

460
00:51:22.980 --> 00:51:28.750
David Bau: have these neurons in common, then these are sort of, like, the key…

461
00:51:28.900 --> 00:51:31.700
David Bau: you know, translation points, it's like a Rosetta Stone.

462
00:51:32.060 --> 00:51:42.480
David Bau: for us to be able to translate between concepts, maybe it's like the key for even decoding the whole idea of the internal language of neural networks.

463
00:51:42.880 --> 00:51:45.320
David Bau: To begin with. So that's… so…

464
00:51:45.560 --> 00:51:47.799
David Bau: So that's… so that's the technique that people use.

465
00:51:48.510 --> 00:51:52.320
David Bau: Presents us, and so… Great.

466
00:51:52.430 --> 00:51:57.030
David Bau: Right, right, right. And so some of these things are… Yeah, right.

467
00:51:57.580 --> 00:51:59.800
David Bau: Some of these things are really nice dreams.

468
00:52:00.070 --> 00:52:02.680
David Bau: I think that they're mostly speculative right now, like.

469
00:52:02.940 --> 00:52:05.130
David Bau: I don't think anybody's really figured out the point of that.

470
00:52:05.360 --> 00:52:07.539
David Bau: printing out everything that our model knows.

471
00:52:09.920 --> 00:52:11.170
David Bau: So, okay.

472
00:52:11.350 --> 00:52:14.189
David Bau: So, someone had this question.

473
00:52:15.080 --> 00:52:15.890
David Bau: Rita.

474
00:52:16.470 --> 00:52:17.670
David Bau: I mean, yes.

475
00:52:19.180 --> 00:52:21.770
David Bau: How does this connect with…

476
00:52:22.040 --> 00:52:23.040
David Bau: LLMs?

477
00:52:23.620 --> 00:52:41.039
David Bau: what the heck about LLMs? Why, why, why do I assign you, like, an image paper? This TCAV thing? Well, partly, I just wanted to give you a little historical context, and probing is pretty old. Probing goes back before the TCAV paper, but, but, you know,

478
00:52:41.100 --> 00:52:46.159
David Bau: But yeah, as early as the TCAV paper, people were already thinking about how to understand

479
00:52:46.520 --> 00:52:49.760
David Bau: What neural networks knew, and they were using probing methods.

480
00:52:49.970 --> 00:52:52.539
David Bau: And, and so around the same time.

481
00:52:52.820 --> 00:52:55.700
David Bau: People were looking at machine translation systems.

482
00:52:56.090 --> 00:53:13.159
David Bau: And, and using probing to figure out what's going on. So let me just describe, like, an old machine translation setup as, like, a classic way that people used probing, so you can kind of see how it works in practice, and not just in the papers that people read, right? And so, you know, this old machine translation system. I'm not gonna…

483
00:53:13.550 --> 00:53:21.349
David Bau: get into RNNs and everything. They're just… they're kind of like transformers, but different, right? But they also have,

484
00:53:21.870 --> 00:53:25.479
David Bau: little, little layers where you have a representation per token.

485
00:53:25.760 --> 00:53:28.300
David Bau: And… and then you can ask the question.

486
00:53:28.450 --> 00:53:31.479
David Bau: What does the network know about this token?

487
00:53:31.870 --> 00:53:34.579
David Bau: Does it know that it's a pronoun?

488
00:53:34.970 --> 00:53:40.510
David Bau: Does it know it's plural? Does it know it's a verb or a noun? Right, so it's kind of… these…

489
00:53:40.970 --> 00:53:44.840
David Bau: This can actually be kind of interesting questions, because if you have, like, the word play.

490
00:53:45.580 --> 00:53:47.559
David Bau: Right? It could mean a verb.

491
00:53:48.360 --> 00:53:49.829
David Bau: It could be a noun.

492
00:53:50.150 --> 00:53:54.929
David Bau: And then within the concept of a verb or not, it could mean different things.

493
00:53:55.340 --> 00:54:00.969
David Bau: Within that. And so you, you can ask, you can ask it, hey, is play a verb or a noun?

494
00:54:01.130 --> 00:54:10.870
David Bau: classify that for me from looking at the neural representation here. And if you can get good accuracy on a probe like that, then it suggests that maybe your network knows

495
00:54:11.090 --> 00:54:13.250
David Bau: the difference between a verb and a noun.

496
00:54:13.820 --> 00:54:24.069
David Bau: And… and maybe it doesn't. And so… so this is the kind of thing that they would do. And they would ask a simpler question, like, is this plural? Is this singular? Is this present tense? Is it past tense?

497
00:54:24.180 --> 00:54:33.200
David Bau: you know, these linguists, they have all sorts of things that they'd like to, like to measure. And so… so here's, like, an example of a set of,

498
00:54:33.520 --> 00:54:45.609
David Bau: questions about plural and past tense and uppercase. They call these morphological, attributes of a word, and what they did is they went to a machine translation system, and they went to all the layers, the early layers and the late layers.

499
00:54:45.840 --> 00:54:48.290
David Bau: You know, not that different from a transformer.

500
00:54:48.810 --> 00:54:51.710
David Bau: And… and they asked, what was the accuracy?

501
00:54:51.960 --> 00:54:57.350
David Bau: of the probe classifier when translating from Czech to English, or Spanish to English, whatever.

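A sketch of that layer-by-layer sweep: train one probe per layer on the same labeled tokens and watch how accuracy changes with depth. The files and the four-layer setup are hypothetical stand-ins for the MT system's representations:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    acts = {l: np.load(f"mt_layer{l}.npy") for l in range(4)}  # hypothetical (n, d) per layer
    y = np.load("is_plural.npy")  # hypothetical morphological labels per token

    for l, X in acts.items():
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
        acc = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).score(X_te, y_te)
        print(f"layer {l}: probe accuracy {acc:.3f}")
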
502
00:54:57.500 --> 00:54:59.380
David Bau: And… and you can see that

503
00:54:59.640 --> 00:55:09.289
David Bau: at the early layers, it's pretty accurate. You can tell the difference between these morphological, you know, things. It can classify plural versus non-plural.

504
00:55:09.490 --> 00:55:12.809
David Bau: pretty well at the early layers. But the weird thing

505
00:55:13.100 --> 00:55:17.939
David Bau: is that as you went deeper in the network… these are only four-layer networks.

506
00:55:18.440 --> 00:55:22.450
David Bau: But as you went deeper in the network, the accuracy dropped.

507
00:55:23.930 --> 00:55:38.070
David Bau: what the heck do you think is going on? So do you… so do you see that set? Don't worry about the last bar. The last bar is aggregate. If you train a probe over all four layers at once, which you can do also, then it's, like, super accurate, right? So, but if you train it on just one layer at once.

508
00:55:38.490 --> 00:55:40.149
David Bau: If the accuracy is dropping.

509
00:55:40.270 --> 00:55:41.629
David Bau: What do you think that means?

510
00:55:43.130 --> 00:55:47.349
David Bau: Now, that means, why would the accuracy drop the more computation you did?

511
00:55:49.390 --> 00:55:51.129
David Bau: Like, getting the concept?

512
00:55:52.000 --> 00:55:54.500
David Bau: And forgetting. You know, I…

513
00:55:54.690 --> 00:56:04.800
David Bau: It's like when you tell someone, oh, I read it somewhere, and you forgot where you read it, or even what language you read it in. Yeah, so it looks like forgetting here.

514
00:56:05.400 --> 00:56:09.540
David Bau: Right? So this is a… this is a funny picture of, like, forgetting.

515
00:56:10.280 --> 00:56:16.000
David Bau: It's like, when it first read the text, it knew this was Past tense.

516
00:56:16.480 --> 00:56:20.190
David Bau: Right? And as it goes further into its layers.

517
00:56:21.480 --> 00:56:28.980
David Bau: then it's… it kind of, like, the story is, like, lower probe accuracy, right? It's like, it kind of is forgetting that it's past tense.

518
00:56:29.710 --> 00:56:30.570
David Bau: Is that weird?

519
00:56:34.440 --> 00:56:35.530
David Bau: Strange, right?

520
00:56:35.890 --> 00:56:39.730
David Bau: Then, is that the Hewitt thing that they… suggest?

521
00:56:40.390 --> 00:56:42.330
David Bau: The memorizing probe?

522
00:56:42.740 --> 00:56:45.320
David Bau: Yes, we'll get to that. Yes, yes.

523
00:56:46.150 --> 00:56:53.250
David Bau: No, I don't skip to the end. No, no, it's okay, no, it's okay. But yes, in the Hewitt thing, right, so the question is.

524
00:56:53.450 --> 00:56:55.540
David Bau: Is this even meaningful to ask?

525
00:56:56.000 --> 00:56:59.709
David Bau: This is… are we looking at the right thing? So, like, the classic probing

526
00:56:59.860 --> 00:57:01.980
David Bau: is to look at accuracy and see what it's like.

527
00:57:02.580 --> 00:57:18.629
David Bau: It's not about accuracy. Why is everybody looking at accuracy? Right? We should look at selectivity, so we'll talk a little bit about that. Yes. But in that example, because it's a function of the word identity, you would expect earlier layers to have higher accuracy for memory.

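A sketch of the selectivity idea from Hewitt and Liang: score the probe on the real task and on a control task where each word type gets a random label, and report the gap. A probe that also aces the control is mostly memorizing. The arrays here are hypothetical:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X = np.load("hidden_states.npy")   # hypothetical (n, d) activations
    y = np.load("pos_tags.npy")        # hypothetical real labels
    tokens = np.load("token_ids.npy")  # hypothetical word identities

    rng = np.random.default_rng(0)
    control_map = {t: rng.integers(y.max() + 1) for t in np.unique(tokens)}
    y_control = np.array([control_map[t] for t in tokens])  # random label per word type

    def probe_acc(X, y):
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
        return LogisticRegression(max_iter=1000).fit(X_tr, y_tr).score(X_te, y_te)

    print("selectivity:", probe_acc(X, y) - probe_acc(X, y_control))
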
528
00:57:18.830 --> 00:57:22.360
David Bau: Because… that's right, so that… because at the earlier layers.

529
00:57:23.160 --> 00:57:24.900
David Bau: You were just told the word just now.

530
00:57:25.130 --> 00:57:27.239
David Bau: Like, the information hasn't been messed up.

531
00:57:28.050 --> 00:57:31.209
David Bau: And so it's obvious. Yeah. So I guess it's,

532
00:57:31.850 --> 00:57:42.889
David Bau: Yeah, I don't know, right? It's kind of like, do you, like, do you fine-tw it, or not? Like, do you really want to look at selectivity, or… or if you drew this graph and you showed selectivity instead of accuracy, what would be an insight?

533
00:57:43.430 --> 00:57:44.859
David Bau: I think that's a pretty good question.

534
00:57:45.350 --> 00:57:51.810
David Bau: The… but you do see these accuracy drops, and the story that people tell

535
00:57:52.910 --> 00:57:59.960
David Bau: But you can think about it now that you've read all these papers, right? The story that people tell is that when the accuracy drops.

536
00:58:00.250 --> 00:58:06.959
David Bau: It sort of suggests that the network is forgetting something. It's like erasing that information. It used to have all this information about whether the

537
00:58:07.300 --> 00:58:18.229
David Bau: the word was plural or not, and it went into the deeper layers, and it sort of erased the information. So, let me show you a contrasting experiment. Same… same set of people.

538
00:58:18.680 --> 00:58:24.060
David Bau: Now here… We're looking at, syntactic relations.

539
00:58:24.290 --> 00:58:28.520
David Bau: So I'm not really a linguist, but, you know, I gather this is, you know, if you have…

540
00:58:29.040 --> 00:58:46.440
David Bau: like, a dependent clause or something like that. You can have different sentences where the same word might sometimes be the object of the sentence, might be the subject of the sentence, might depend on another word, or might stand alone, right? And so you can make a syntax tree, and you can say, what are the relations.

541
00:58:46.590 --> 00:58:56.720
David Bau: for a word, and then you can ask… you can ask a probing question about this. You can say, can I tell the difference between words that have a syntactic relation and words that don't?

542
00:58:56.910 --> 00:58:58.920
David Bau: Or words that have a different syntactic relation?

543
00:58:59.380 --> 00:59:02.320
David Bau: And… and so the… the accuracy is…

544
00:59:02.520 --> 00:59:05.939
David Bau: for this task, look kind of like this. It depends on the language.

545
00:59:06.170 --> 00:59:13.809
David Bau: I don't know what's going on with French to English, but, like, from Spanish to English, like, the probe accuracy goes up as you go through the layers.

546
00:59:16.130 --> 00:59:17.519
David Bau: Why would that be?

547
00:59:18.100 --> 00:59:21.220
David Bau: Why would some of the things, like, go down, and other things go up?

548
00:59:23.740 --> 00:59:24.860
David Bau: Interesting.

549
00:59:26.580 --> 00:59:33.360
David Bau: So, this… this contrast between things going down and things going up was enough… it was, like, compelling enough that people wrote a lot of papers about this.

550
00:59:33.500 --> 00:59:40.979
David Bau: And, and the intuition was that some of these things, it takes a lot of context processing

551
00:59:41.090 --> 00:59:42.419
David Bau: to figure out…

552
00:59:42.810 --> 00:59:49.560
David Bau: you know, what's going on here with syntactic relations? And if you want to do a good job at machine translation, you need to know this pretty well.

553
00:59:49.730 --> 00:59:55.599
David Bau: So more and more of the neural network's representation is being used to represent stuff like this.

554
00:59:55.780 --> 01:00:00.069
David Bau: So that's why, when you probe it at the later layers.

555
01:00:00.260 --> 01:00:03.619
David Bau: then the accuracy goes up. It suggests that it's using more of…

556
01:00:03.780 --> 01:00:08.520
David Bau: It's doing a better job at representing this information. It's got, like, this information represented more accurately.

557
01:00:08.900 --> 01:00:12.039
David Bau: Getting there. And at least that's the story.

558
01:00:12.680 --> 01:00:13.749
David Bau: That make sense?

559
01:00:13.930 --> 01:00:21.239
David Bau: But it's cool, though, right? Like, I think that these experiments are genuinely cool. This idea that you might be forgetting something.

560
01:00:21.790 --> 01:00:24.619
David Bau: that you were just holding, in order to make room

561
01:00:24.860 --> 01:00:26.889
David Bau: For something that you're figuring out.

562
01:00:27.010 --> 01:00:28.859
David Bau: that wasn't obvious at first.

563
01:00:30.200 --> 01:00:33.500
David Bau: And I think that people generally still, like, believe this story.

564
01:00:33.650 --> 01:00:41.099
David Bau: And if you find a sequence of computations in your neural network that you want to characterize,

565
01:00:41.560 --> 01:00:42.660
David Bau: Doing a probe.

566
01:00:43.230 --> 01:00:45.610
David Bau: Like this, and contrasting

567
01:00:45.810 --> 01:01:01.080
David Bau: between, you know, probes that are less accurate and more accurate is definitely a worthwhile thing to try and to report. And, you know, people will… people have been looking at probe accuracies for decades, so they'll be able to read, you know.

568
01:01:01.350 --> 01:01:04.269
David Bau: what you have. They'll know what all the caveats are.

569
01:01:04.890 --> 01:01:06.360
David Bau: They'll be able to read that experiment.

570
01:01:07.050 --> 01:01:07.970
David Bau: Make sense?

571
01:01:08.430 --> 01:01:11.860
David Bau: Okay, so… Okay, so for example…

572
01:01:12.330 --> 01:01:32.119
David Bau: My student Sheridan, last year wrote a paper, which was this type of probe. So, actually, there was a question here. Did I skip any questions? I don't want to skip questions. I've noted questions, but I'll just stop and ask you guys for any questions. So, I think that one of the questions was, like, hey, so what about multi-valued

573
01:01:32.550 --> 01:01:39.819
David Bau: Attributes. When you… it might not be, you know, yes or no, but it might be, like.

574
01:01:39.950 --> 01:01:49.640
David Bau: 5 different things, or 10 different things, then you're… so… so, yeah, you can make probes that are multi-way classifiers, no problem. And so the classic one is LogitLens.

575
01:01:49.720 --> 01:02:02.220
David Bau: Right, the class… the logit lens is a probe, it's a non-trained probe, it's a probe that is just, like, the pre-trained decoder layer, and it's a classifier. It's a 50,000-way classifier that tells you which token

576
01:02:02.360 --> 01:02:03.800
David Bau: Is the most likely one.

577
01:02:04.050 --> 01:02:07.369
David Bau: At any given time. And so…

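A sketch of the logit lens on GPT-2 via HuggingFace transformers: apply the model's own final LayerNorm and unembedding to an intermediate hidden state, so the pre-trained decoder acts as an untrained, vocabulary-sized probe. The layer index is arbitrary:

    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tok = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

    ids = tok("The capital of France is", return_tensors="pt").input_ids
    with torch.no_grad():
        hs = model(ids, output_hidden_states=True).hidden_states  # one per layer

    layer = 6                                     # arbitrary intermediate layer
    h = model.transformer.ln_f(hs[layer][0, -1])  # final LayerNorm, last position
    logits = model.lm_head(h)                     # reuse unembedding as a probe
    print(tok.decode([int(logits.argmax())]))
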
578
01:02:07.600 --> 01:02:12.020
David Bau: What Sheridan did was, instead of using LogitLens, Sheridan trained a probe

579
01:02:12.590 --> 01:02:22.199
David Bau: a 50,000-way probe, on predicting the original token at any given layer.

580
01:02:22.630 --> 01:02:28.670
David Bau: And so, here's the setup. You have, like, the word 'intermittent' in, like, a piece of Wikipedia text.

581
01:02:28.980 --> 01:02:31.619
David Bau: And Sheridan takes, like, the token 'mitt'.

582
01:02:31.930 --> 01:02:39.739
David Bau: And so the token 'mitt' comes in, it just represents the token 'mitt', it's just, like, an embedding for 'mitt' at the first layer, and as it goes through all the layers.

583
01:02:40.110 --> 01:02:44.150
David Bau: And that vector is evolved into something that eventually, like, predicts.

584
01:02:44.400 --> 01:02:47.109
David Bau: the next token, right? And…

585
01:02:47.360 --> 01:02:52.319
David Bau: But instead of, like, saying how accurate it is at predicting the next token, which is what LogitLens does.

586
01:02:52.880 --> 01:02:59.889
David Bau: Sheridan asks, how good is it at predicting 'mitt' itself? Right, so, like, you input 'mitt'.

587
01:03:00.030 --> 01:03:03.170
David Bau: And if you go, like, 5 layers down, And you asked…

588
01:03:03.400 --> 01:03:10.019
David Bau: Oh, out of the 50,000 tokens, which one was the original token that you started with? Do you still remember that you were 'mitt'?

589
01:03:10.810 --> 01:03:11.799
David Bau: That make sense?

590
01:03:12.230 --> 01:03:14.850
David Bau: And so you can use that as a probing question.

591
01:03:15.290 --> 01:03:23.419
David Bau: And you might think, oh, that's easy, it's like 100%. No, but it's not, because you see the same erasure phenomenon going on, like, the accuracy drops.

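A sketch of the trained version of that probe: a single linear map from a layer's hidden state to the whole vocabulary, trained with cross-entropy to recover the original input token. The random tensors below are stand-ins for real activations and token ids:

    import torch
    import torch.nn as nn

    d_model, vocab = 768, 50257
    probe = nn.Linear(d_model, vocab)  # vocabulary-sized token-identity probe
    opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

    hidden = torch.randn(4096, d_model)           # stand-in layer-k states
    token_ids = torch.randint(0, vocab, (4096,))  # stand-in original tokens

    for step in range(100):
        loss = nn.functional.cross_entropy(probe(hidden), token_ids)
        opt.zero_grad()
        loss.backward()
        opt.step()

    acc = (probe(hidden).argmax(-1) == token_ids).float().mean()
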
592
01:03:23.590 --> 01:03:28.730
David Bau: And so here's… here's 'mitt'. Starts off at 1.0. Of course, you can read the embedding right out

593
01:03:28.850 --> 01:03:40.199
David Bau: of the network. But as you go, you know, in 10 layers, right, it kind of forgets a little bit. It's down here. It didn't forget a lot; it's, like, at 80%, you know, it drops down to 70% or something.

594
01:03:40.410 --> 01:03:44.920
David Bau: Right? So, like, it's erasing this information, is sort of the story, right?

595
01:03:45.250 --> 01:03:47.239
David Bau: Now, there's other information that's interesting.

596
01:03:47.700 --> 01:03:54.099
David Bau: like… How about the previous word? If you go to 'mitt', and you ask, what's your representation:

597
01:03:54.280 --> 01:03:57.379
David Bau: how… how much does it know it comes after 'inter'?

598
01:03:57.830 --> 01:04:06.170
David Bau: Right? Intermittent, right? Like, does 'mitt' know that it comes after 'inter'? That seems like it's a pretty important word, you know? It's like, it's not… it's not about mittens here, it's about intermittent.

599
01:04:07.380 --> 01:04:09.660
David Bau: Right? And it was pretty different.

600
01:04:10.060 --> 01:04:18.519
David Bau: So, to understand, really, what 'mitt' means, it might be handy to know that you come after 'inter', so… so that's, like, another probe. It's a very mechanical probe.

601
01:04:18.760 --> 01:04:24.590
David Bau: And then you can ask this question, this classifier question, and at the beginning, 'mitt' is very bad

602
01:04:25.120 --> 01:04:26.230
David Bau: at predicting

603
01:04:26.520 --> 01:04:29.650
David Bau: that 'inter' comes before, because there've been no attention heads applied to it.

604
01:04:30.090 --> 01:04:44.760
David Bau: So the probe is, like, at almost zero accuracy. Now, it's not at zero accuracy, zero would be all the way down here, it's, like, a little higher than zero. Why would it be a little higher than zero? Even if… even if you haven't run any atten… you haven't run any attention heads yet, and you're like, oh, you know what?

605
01:04:44.860 --> 01:04:56.369
David Bau: 10%… 10% of the time, I can guess what the token was right before me. Anyway, without even seeing it, you're totally blinded. You don't even have an attention head. Why is the… why is the accuracy not down close to zero?

606
01:04:57.220 --> 01:04:58.050
David Bau: Why's actually…

607
01:05:02.530 --> 01:05:09.080
David Bau: The previous words also include information on the future words, so, like… Even easier than that.

608
01:05:09.270 --> 01:05:15.320
David Bau: Even easier. It's almost like… imagine you were a cryptographer.

609
01:05:15.610 --> 01:05:19.789
David Bau: And you just had, like, the middle of the word given to you.

610
01:05:20.290 --> 01:05:23.869
David Bau: Without any context, you might be able to guess

611
01:05:24.760 --> 01:05:27.410
David Bau: What that part of the word was before.

612
01:05:27.870 --> 01:05:29.750
David Bau: Right. Does that make sense?

613
01:05:30.170 --> 01:05:36.659
David Bau: Just because of bigrams… just bigram statistics, it's just because you know that you're in English.

614
01:05:37.050 --> 01:05:46.979
David Bau: And there are certain things that are in English that just tend to happen before, even without any additional information. So the baseline accuracy of the classifier is sort of the accuracy of bigram statistics

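A toy sketch of that bigram baseline: with no context at all, always guess the most frequent predecessor of each token, which is roughly the floor the previous-token probe starts from before any attention heads run:

    from collections import Counter, defaultdict

    corpus = ["inter", "mitt", "ent", "the", "vol", "cano", "the", "mitt"]  # toy tokens

    prev_counts = defaultdict(Counter)
    for prev, cur in zip(corpus, corpus[1:]):
        prev_counts[cur][prev] += 1  # tally predecessors of each token type

    # Best achievable blind guess: the modal predecessor of each token type.
    correct = sum(c.most_common(1)[0][1] for c in prev_counts.values())
    print("bigram baseline accuracy:", correct / (len(corpus) - 1))
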
615
01:05:47.120 --> 01:05:53.249
David Bau: in English, without any other contextual knowledge. But then, as soon as you start running attention heads.

616
01:05:53.450 --> 01:05:55.469
David Bau: Then you do have contextual knowledge.

617
01:05:55.610 --> 01:05:58.729
David Bau: And the accuracy goes up to… you know, 70…

618
01:05:59.280 --> 01:06:05.650
David Bau: 70%. And so… so the information about the previous token is being fed in, and the accuracy goes up.

619
01:06:06.160 --> 01:06:10.789
David Bau: Right? And then it pays attention to it for a little while, and then it starts erasing that, too. Kind of interesting.

620
01:06:10.960 --> 01:06:11.650
David Bau: Right?

621
01:06:12.180 --> 01:06:22.600
David Bau: That makes sense? And then you can ask about, oh, how about two tokens before, like V, right? Oh, it actually pays attention to that, too. It puts a little bit of that information in there, and then it erases it.

622
01:06:22.900 --> 01:06:23.800
David Bau: Over time.

623
01:06:24.030 --> 01:06:24.780
David Bau: Right?

624
01:06:25.140 --> 01:06:33.549
David Bau: And then… and then 4 tokens before, it has this a little bit less, right? And then what's this yellow line? The yellow line's interesting. It goes like… it's not… the yellow line's different.

625
01:06:33.970 --> 01:06:35.599
David Bau: You see what that yellow line is?

626
01:06:42.890 --> 01:06:49.420
David Bau: This yellow line… It's about not looking back, but looking forward, looking at the next token.

627
01:06:50.270 --> 01:06:50.970
David Bau: Right?

628
01:06:51.430 --> 01:06:54.659
David Bau: So can it predict the next token from the representation?

629
01:06:57.310 --> 01:06:58.679
David Bau: Why is it going up?

630
01:07:00.770 --> 01:07:03.710
David Bau: Why is it getting more and more accurate at predicting the next token?

631
01:07:03.990 --> 01:07:05.980
David Bau: As you go through more and more layers.

632
01:07:10.790 --> 01:07:12.320
David Bau: Why would that be?

633
01:07:13.120 --> 01:07:17.280
David Bau: Okay… Some machine learning person. Tell me, why would it be?

634
01:07:20.220 --> 01:07:20.604
David Bau: Wait.

635
01:07:21.130 --> 01:07:25.589
David Bau: Because of the language model. Why do you say it's because of the language model? That is what it is.

636
01:07:26.360 --> 01:07:39.659
David Bau: What does the TA say? I mean, that's what it is trained on. That's what it's trained to do! Right, so, so basically, you know, just from Lecture 1, right, you know, this is a language model, so it's trained to predict the next word.

637
01:07:39.820 --> 01:07:50.840
David Bau: Right? And so, as you go through more and more layers, it's like… it has more information about, like, its best guess for the next word, until it gets here. So this is about how good this language model is: it can, like, 40% of the time,

638
01:07:50.870 --> 01:08:03.949
David Bau: guess the next word in Wikipedia correctly, given the context. So that's pretty nice, right? So it's like, oh, this is good, so it's, like, working, right? The language model's working. But then, at the same time as predicting the next word, it's also keeping track of all these other things

639
01:08:03.950 --> 01:08:16.069
David Bau: from, like, previous words, and you can see this interesting profile of the information. So that's basically… I just wanted to share this, how probing works. Okay, so the weird thing, the reason that Sheridan wrote a paper and why you could write… so you could write this, you could, like.

640
01:08:16.109 --> 01:08:17.680
David Bau: Somebody would have written this.

641
01:08:18.270 --> 01:08:22.769
David Bau: in 2017, and it would have been a paper by itself, this graph.

642
01:08:22.810 --> 01:08:42.610
David Bau: But in 2024, you can't write that paper anymore. Everybody knows this, right? So, the reason that it's a paper in 2024 is because of this contrast with this other plot, which is the same lines, but with… which is this observation that at certain tokens, the plot looks different. So if you… if you take a look at the word 'ent', or the token 'ent',

643
01:08:42.740 --> 01:08:50.259
David Bau: and you do the same plots instead of the token 'mitt', then the accuracies look like this. So what 'mitt' really is, is a stand-in for

644
01:08:50.529 --> 01:08:53.379
David Bau: any generic token in Wikipedia.

645
01:08:53.750 --> 01:08:56.550
David Bau: Right? But what 'ent' is, is…

646
01:08:56.710 --> 01:08:59.770
David Bau: Not any generic token in Wikipedia, these are…

647
01:08:59.960 --> 01:09:05.500
David Bau: what Sheridan calls the last token of a multi-token word.

648
01:09:06.069 --> 01:09:11.820
David Bau: So, you know, some words are tokenized in ways that they have to be split up into multiple tokens.

649
01:09:12.510 --> 01:09:14.949
David Bau: And if you have a word that got split up.

650
01:09:15.170 --> 01:09:17.589
David Bau: And you take a look at the last token?

651
01:09:18.130 --> 01:09:19.419
David Bau: And you do this probing?

652
01:09:20.250 --> 01:09:21.620
David Bau: Then you get this other thing.

653
01:09:22.640 --> 01:09:29.399
David Bau: And what Sheridan says is, she says, it means that, like…

654
01:09:29.649 --> 01:09:33.139
David Bau: That word erases knowledge of what it was.

655
01:09:33.550 --> 01:09:43.549
David Bau: Like, so, like, does 'ent' know that it's 'ent'? Well, obviously, right, as soon as it comes in, it knows, but as soon as you start exposing it to attention heads, what the heck is the model doing? The model's like.

656
01:09:44.170 --> 01:09:51.069
David Bau: You may have been 'ent', but I'd rather not know that you were 'ent'. I need you to forget this, like, the model's, like, quickly erasing this.

657
01:09:51.340 --> 01:09:58.920
David Bau: And it's also quickly erasing all the other contextual information. It's crazy. I don't know, like, what? So… so basically, Sheridan wrote this paper saying.

658
01:09:59.130 --> 01:10:02.620
David Bau: What the heck is this erasure? There's this token erasure going on.

659
01:10:02.990 --> 01:10:03.700
David Bau: Right?

660
01:10:03.890 --> 01:10:05.770
David Bau: Why is it forgetting stuff so quickly?

661
01:10:06.510 --> 01:10:10.170
David Bau: Yes. A couple questions. Yes. The first one, like…

662
01:10:10.780 --> 01:10:18.410
David Bau: the fact that 'ent' is a common, like, word ending in English, like, that's persistent. So, did… did y'all try, like.

663
01:10:19.200 --> 01:10:34.270
David Bau: changing the prefix, and keeping the last token constant, and seeing if the model's erasure is consistent? Yeah, that's basically what the experiment is. So basically, you know, this is… this is based on a huge scrape of Wikipedia.

664
01:10:34.510 --> 01:10:44.700
David Bau: And so 'ent', I'm sure, shows up at the end of a bunch of different words. Probably all these tokens show up at the end of a bunch of different words. And so this is an average…

665
01:10:44.800 --> 01:10:47.400
David Bau: Accuracy over all of that mix.

666
01:10:47.720 --> 01:10:52.400
David Bau: And, and yeah, so, and pretty consistently, ent.

667
01:10:53.330 --> 01:10:59.420
David Bau: If you ask the representation of 'ent', do you know what came before you?

668
01:11:00.240 --> 01:11:05.569
David Bau: Then, it does peak over here at layer 3. It does sort of know what came before.

669
01:11:05.870 --> 01:11:09.489
David Bau: But then, pretty quickly, when you get to layer 10 or something.

670
01:11:09.610 --> 01:11:12.550
David Bau: It's basically forgotten that, unlike the normal case.

671
01:11:12.720 --> 01:11:15.510
David Bau: When at Layer 10, it still remembers volcano.

672
01:11:16.460 --> 01:11:32.950
David Bau: Make sense? Does that answer the question, or… Okay. How did, like, they develop the intuition that we should look at the… to look at this? Yeah. Like, versus, like, looking at something else? Like, how did they decide to make this particular graph? Yeah, why look at this?

673
01:11:34.040 --> 01:11:43.100
David Bau: Yeah, it's from the kind of digging that we're asking you guys to do. It's like…

674
01:11:43.380 --> 01:11:49.990
David Bau: What was it, what was it that… so, basically, the real story for this piece of research was

675
01:11:50.290 --> 01:11:55.229
David Bau: that, I had told Sheridan

676
01:11:56.300 --> 01:11:58.080
David Bau: Look for the opposite of this.

677
01:11:58.540 --> 01:12:01.310
David Bau: Like, if you have a multi-token word.

678
01:12:01.710 --> 01:12:06.169
David Bau: Then, the process of concept formation in this

679
01:12:06.300 --> 01:12:09.409
David Bau: Environment where the words split up into lots of tokens.

680
01:12:09.880 --> 01:12:11.979
David Bau: It should require you

681
01:12:12.420 --> 01:12:18.979
David Bau: to have a better awareness of the tokens that came before, if you're a meaningful word. So if you're a word like intermittent.

682
01:12:19.400 --> 01:12:21.210
David Bau: Then, you should…

683
01:12:21.420 --> 01:12:34.370
David Bau: not have the luxury of forgetting what came before you. Like, if you were the word thee, then, yeah, who cares what came before you? You don't need to know that, because you know thee means thee. But if you're intermittent and you're aunt, you need to know what came before you, so you, like.

684
01:12:34.680 --> 01:12:39.720
David Bau: you would have a really accurate view of this. So Sheridan is like, okay, probing, I can do probing.

685
01:12:40.460 --> 01:12:41.929
David Bau: Set up this probing thing?

686
01:12:42.230 --> 01:12:46.759
David Bau: And Sheridan showed up a week later, and says, I don't know, I…

687
01:12:47.400 --> 01:12:57.579
David Bau: you know, it's my first week in the PhD, and I… and my experiment's, like, coming out wrong, right? It's like this. It's like the information is not there. I must have set it up wrong.

688
01:12:57.960 --> 01:12:58.800
David Bau: Right?

689
01:12:59.040 --> 01:13:04.690
David Bau: And so, Sheridan basically spent, you know, a couple months digging

690
01:13:05.210 --> 01:13:05.940
David Bau: through

691
01:13:06.240 --> 01:13:09.940
David Bau: The probing code, and through different data sets, and different ways of setting this up.

692
01:13:10.370 --> 01:13:12.710
David Bau: Until we were finally convinced that

693
01:13:13.410 --> 01:13:18.299
David Bau: Oh, actually, Sheridan, who is a brilliant experimentalist, had set it up right on day one.

694
01:13:18.690 --> 01:13:22.489
David Bau: And had found this surprise that was exactly the opposite of what we'd expected.

695
01:13:22.900 --> 01:13:23.950
David Bau: That make sense?

696
01:13:24.320 --> 01:13:27.149
David Bau: And so that's, you know, so that's… that's where this came from.

697
01:13:28.270 --> 01:13:28.960
David Bau: Right.

698
01:13:29.470 --> 01:13:34.269
David Bau: Right, and then… but then… then the question is, okay, if it's counter to what you would expect.

699
01:13:35.890 --> 01:13:37.739
David Bau: Why is it doing this? What the heck?

700
01:13:38.210 --> 01:13:40.849
David Bau: Right? And it leads to some other papers.

701
01:13:42.510 --> 01:13:45.370
David Bau: Okay, so… so yeah.

702
01:13:46.970 --> 01:13:48.050
David Bau: Oh, Isaac.

703
01:13:49.330 --> 01:13:58.099
David Bau: What was your question? I was wondering about the relationship between this and patching. Like, the class notes suggested that they're complementary. Yeah.

704
01:13:58.170 --> 01:14:04.020
David Bau: Is it complementary in the sense of, like, if given an option, we should always do…

705
01:14:04.020 --> 01:14:22.040
David Bau: patching, and given resource constraints, this is a viable second-best alternative, or is there, like, something more complex about the relationship? How should we think about the relationship? It's a little bit more complex. I'd say that they're… I think they really are complementary, not second-best. I think that, you know, because probing is an older practice than patching,

706
01:14:22.260 --> 01:14:27.129
David Bau: You know, I often go around telling people, no patching's better!

707
01:14:27.300 --> 01:14:34.729
David Bau: You should think… you need to do… you need to do causal experiments, because this is just a correlative experiment. You can get fooled in all these ways that we'll talk about.

708
01:14:35.070 --> 01:14:43.820
David Bau: And then I think that you guys already have intuition about that, from the questions, you guys understand how probing can be fooled.

709
01:14:43.960 --> 01:14:51.219
David Bau: But, so, but I think that probing is maybe half the story.

710
01:14:51.600 --> 01:15:00.229
David Bau: I think that if you… if you… you can get… you can get… you can sometimes get a similar picture about, information content from patching.

711
01:15:00.860 --> 01:15:09.399
David Bau: But, it's, but it's different. Patching is telling you if the model is using some signal.

712
01:15:09.580 --> 01:15:12.850
David Bau: It's not telling you if the signal is present,

713
01:15:13.300 --> 01:15:27.529
David Bau: If it's not used. And… and sometimes you are interested in whether the signal is present, even if in this particular piece of text it's not used. Like, maybe in some other piece of text, it might be used, or something like that, and you just want to know if the model is prepared to

714
01:15:27.800 --> 01:15:32.519
David Bau: have that information available. There's sort of this… this experiment is kind of like this, it's like.

715
01:15:32.740 --> 01:15:34.990
David Bau: you know, I don't know what might come after this.

716
01:15:35.140 --> 01:15:47.820
David Bau: You know, I don't know if the model needs to know this information, but it's just interesting to know that regardless of whether the model's using that information later, it's just not even there, right? Even, like, a trained probe can't get it out.

717
01:15:48.080 --> 01:15:50.709
David Bau: Right? And so if a trained probe can't get it out, then…

718
01:15:50.990 --> 01:15:52.990
David Bau: the model probably won't be able to get it out.

719
01:15:54.680 --> 01:15:58.610
David Bau: And so… So yeah, so it's complementary.

720
01:15:58.720 --> 01:16:06.960
David Bau: I would, you know, I would recommend for any of these papers, like, make a tripod.

721
01:16:07.150 --> 01:16:09.949
David Bau: Like, try to find more than one way of characterizing

722
01:16:10.350 --> 01:16:15.480
David Bau: the representation. Both of them are kind of powerful. They should be in your arsenal.

723
01:16:16.990 --> 01:16:17.830
David Bau: Okay.

724
01:16:18.500 --> 01:16:19.909
David Bau: That sort of answers the question?

725
01:16:20.830 --> 01:16:23.210
David Bau: I don't know. Yeah, these things are always in flux.

726
01:16:23.420 --> 01:16:24.870
David Bau: You had a question.

727
01:16:25.620 --> 01:16:28.360
David Bau: Yeah, so my question is, so it has a,

728
01:16:28.550 --> 01:16:37.530
David Bau: concept vector. Does the model use that vector to do some prediction, or is it just correlated with the concept? And I think I answered that just now, right?

729
01:16:37.650 --> 01:16:39.940
David Bau: It's like, it doesn't necessarily use it, right?

730
01:16:40.210 --> 01:16:43.779
David Bau: So, so, so the, the probe set, the probe is using it.

731
01:16:44.200 --> 01:16:50.840
David Bau: Right? But we don't know if the model itself is going to use that information downstream that. Right. I think that's the fundamental

732
01:16:50.970 --> 01:16:52.390
David Bau: Problem with probes.

733
01:16:52.620 --> 01:16:56.950
David Bau: is that just because the information is there doesn't mean the model is actually using it,

734
01:16:57.060 --> 01:17:03.409
David Bau: Or doing the thing that you imagine. So, you know, this is a classic thing. So, the classic approach

735
01:17:03.810 --> 01:17:13.210
David Bau: And people are very, you know, people are very interested in this question of, does my model have an inherent political belief? Maybe it's biased.

736
01:17:13.310 --> 01:17:32.999
David Bau: you know, maybe my model is biased against professors. I don't like these models that are so biased against professors, right? So, so what I can do is I can… I can look to see if I write a sentence about professors or something like that, or I write a sentence about me, and I can… I wonder if it's keeping track of the fact that I'm a professor.

737
01:17:33.290 --> 01:17:38.169
David Bau: Right, so I'll take a bunch of texts that I've written, and a bunch of texts that non-professors have written.

738
01:17:38.540 --> 01:17:42.629
David Bau: And I'll ask, if I probe that text.

739
01:17:42.740 --> 01:17:46.129
David Bau: Can the model tell the difference between whether I'm a professor.

740
01:17:46.350 --> 01:18:00.239
David Bau: whether I'm not a professor, just from my email or something, like, that would be, like, evidence that it's biased against professors, right? Because it knows, it's keeping track of the fact that it's different, right? And I could train up one of these probes, and what if I…

741
01:18:00.750 --> 01:18:05.740
David Bau: got this hit. What if I found out the probe is super accurate? If you go to layer 15,

742
01:18:06.010 --> 01:18:12.329
David Bau: On the last token of my email, I can tell, I can classify the difference between professor and non-professor email.

743
01:18:12.850 --> 01:18:15.330
David Bau: At, like, 99.9% accuracy.

744
01:18:15.530 --> 01:18:17.610
David Bau: It's, like, shouting it out.

745
01:18:17.860 --> 01:18:21.620
David Bau: It's, like, completely there, the information of whether it's a professor or not.

746
01:18:21.820 --> 01:18:25.410
David Bau: Right. Does that mean… that the model…

747
01:18:25.750 --> 01:18:31.079
David Bau: is, like, thinking about, is this person a professor? I gotta be biased against him or something?

748
01:18:31.450 --> 01:18:36.909
David Bau: It's not necessarily, right? And the information might be there, but it might be that the model doesn't actually use the information for anything.

749
01:18:37.550 --> 01:18:44.390
David Bau: And so… and it might be a spurious correlate, I mean, it might be that it's there by chance, often. So, you know, what is it, like, that I'm probing out?

750
01:18:44.590 --> 01:18:48.890
David Bau: It's like, maybe the probe recognized that

751
01:18:49.280 --> 01:18:53.689
David Bau: you know, the model cares if I'm writing things about

752
01:18:54.020 --> 01:18:56.460
David Bau: Having lots of fundraising meetings, and…

753
01:18:56.560 --> 01:19:15.630
David Bau: other boring professor stuff, or whatever, right? You know, just, like, all sorts of stuff, right? And it's just keeping track of all sorts of other things, and it just happens to be that I can aggregate all those signals and get, like, a 99% accuracy on, like, my professor or not. You know, is that a concept inside the model, or is that just an aggregate?

754
01:19:15.850 --> 01:19:20.070
David Bau: classification results that you can get if you probe a thing. So that's…

755
01:19:20.170 --> 01:19:22.910
David Bau: So anyway, that's… that's the answer to the question. The answer is no.

756
01:19:23.590 --> 01:19:25.490
David Bau: Okay, so… and then…

757
01:19:25.940 --> 01:19:31.880
David Bau: Anya had a different way of asking the question. Yeah, but I think you answered it. No, I'd like your question better.

758
01:19:32.330 --> 01:19:52.330
David Bau: What was your question? I don't remember, but I remember… I remember… Oh, too late. Okay, then, you don't get the answer. Okay, well, no, you asked it in a slightly different way, would you say? Yeah, I guess that, like, I know this is what the papers call, like, correlational probes, which my friend Paula asked about, like, which is, like, understanding whether, like, the information exists, whereas…

759
01:19:52.350 --> 01:19:53.040
David Bau: like…

760
01:19:53.110 --> 01:20:06.969
David Bau: again, causal probe, I would imagine, is like, how does it use… What would a causal probe be? Yes, and I forgot, I was going to stick in another slide for what a causal probe is, and I didn't have time to come back to add this slide for you, so… but I would point you to a couple papers, so…

761
01:20:07.260 --> 01:20:15.120
David Bau: Like, so you can probe a model to see if you can get a classifier out to do things, but you can also train

762
01:20:15.570 --> 01:20:19.350
David Bau: Train an intervention on a model.

763
01:20:20.380 --> 01:20:23.979
David Bau: that asks, what is the optimal intervention that I can do

764
01:20:24.350 --> 01:20:26.829
David Bau: That optimizes for some causal effect.

765
01:20:27.460 --> 01:20:31.630
David Bau: And it's sort of a dual of a probe. It's sort of a causal probe.

766
01:20:31.810 --> 01:20:35.660
David Bau: And there's been a couple papers written that suggest this kind of thing.

767
01:20:36.020 --> 01:20:41.850
David Bau: So, the two that I would suggest is there's a paper called Amnesic Probing.

768
01:20:42.640 --> 01:20:46.349
David Bau: Like, amnesia, as if we forgot something.

769
01:20:46.550 --> 01:20:51.170
David Bau: So… maybe Nikhil can remember Amnesic Probing, we'll put that link

770
01:20:51.320 --> 01:20:54.909
David Bau: in the Discord for people, if people are interested in trying that technique.

771
01:20:55.200 --> 01:20:57.689
David Bau: On the thing, for a causal probe?

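A rough sketch of the amnesic-probing idea (Elazar et al. iterate nullspace projections; a single direction is shown here): project a probe's direction out of the hidden state mid-forward-pass and see whether downstream behavior changes. The hook and layer choice below are hypothetical:

    import torch

    def remove_direction(h, v):
        # Project direction v out of hidden states h (h: (..., d), v: (d,)).
        v = v / v.norm()
        return h - (h @ v).unsqueeze(-1) * v

    # Hypothetical use inside a forward hook on one transformer block:
    # probe_dir = torch.tensor(probe.coef_[0], dtype=torch.float32)
    # def hook(module, inputs, output):
    #     return (remove_direction(output[0], probe_dir),) + output[1:]
    # handle = model.transformer.h[8].register_forward_hook(hook)
    # ...then rerun the task and compare behavior with and without the hook.
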
772
01:20:58.100 --> 01:21:03.839
David Bau: And, and then the other thing that people do is they do something called distributed alignment search.

773
01:21:04.030 --> 01:21:12.190
David Bau: Where instead… so, like, in a probe, you're looking for what subspace, like, what vector you can dot-product with, to give you the most accurate…

774
01:21:12.690 --> 01:21:16.710
David Bau: Classifier, but you can also do it,

775
01:21:17.340 --> 01:21:31.739
David Bau: an intervention search. You can say what vector or what subspace is there where, if I were to perturb things in the direction of that subspace, it would maximize certain causal effects? And so that… so that general…

776
01:21:31.910 --> 01:21:33.020
David Bau: scheme.

777
01:21:33.180 --> 01:21:37.330
David Bau: People figured out how to operationalize that, and they call it distributed alignment search.

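A very rough sketch of that intervention-search idea: instead of training a read-out, make the direction itself the trainable parameter and optimize it so that perturbing along it produces the causal effect you care about. The loop below is schematic, not a full DAS implementation:

    import torch

    d_model = 768
    v = torch.randn(d_model, requires_grad=True)  # learned intervention direction
    opt = torch.optim.Adam([v], lr=1e-2)

    def intervene(h, alpha=5.0):
        # Perturb the representation along the unit-normalized direction.
        return h + alpha * v / v.norm()

    # Schematic training step (model plumbing omitted): patch `intervene`
    # into one layer, score the counterfactual output you want, and let
    # the gradient of that loss flow back into v.
    # loss = counterfactual_loss(run_model_with_patch(intervene))
    # opt.zero_grad(); loss.backward(); opt.step()
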
778
01:21:38.730 --> 01:21:40.240
David Bau: So that's another recent development.

779
01:21:40.910 --> 01:21:50.690
David Bau: These are… they're not as well established as probes. There are people who are advocating for these things, but, you know, I think the jury is still out on them, but if you're into it, give it a try.
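
NOTE
A rough sketch of this kind of intervention search, with toy stand-ins for everything (real distributed alignment search runs paired counterfactual examples through an actual model, often with a learned rotation rather than a single vector):
    import torch
    hidden = 64
    model_tail = torch.nn.Linear(hidden, 2)      # stand-in for the rest of the model
    d = torch.randn(hidden, requires_grad=True)  # candidate intervention direction
    opt = torch.optim.Adam([d], lr=1e-2)
    h_base = torch.randn(128, hidden)            # activations from "base" runs
    h_src = torch.randn(128, hidden)             # activations from "source" runs
    target = torch.randint(0, 2, (128,))         # counterfactual behavior we want
    for _ in range(200):
        u = d / d.norm()
        # Swap only the component along u from the source run into the base run.
        h_new = h_base + ((h_src - h_base) @ u).unsqueeze(-1) * u
        loss = torch.nn.functional.cross_entropy(model_tail(h_new), target)
        opt.zero_grad(); loss.backward(); opt.step()
    # The optimized u is the direction whose perturbation best produces the
    # desired causal effect: a probe found by optimizing for causality.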

780
01:21:53.550 --> 01:21:59.500
David Bau: Okay, so the TCAV paper and spurious features. Okay, Bryce, Bryce, what was this question about spurious features?

781
01:22:00.750 --> 01:22:09.019
David Bau: I don't remember. What was your question? I think I was asking about… I think you touched upon it, but the question was just a little bit different.

782
01:22:09.490 --> 01:22:19.900
David Bau: I'm just asking, like, when features, like, coexist, and, like, they have correlation with each other, and when you have, like, correlation… correlational probes rather than causal probes.

783
01:22:20.200 --> 01:22:21.170
David Bau: Would you, like…

784
01:22:21.290 --> 01:22:38.690
David Bau: So, like, you learn something else that makes your prediction on this particular feature accurate, but, like, you're actually, like… Right, right, right, right. And what does it… what does it look like? So, I'll come back to that in a second. Actually, I'm gonna answer Armini's question a little bit more first, because I did… I did have this slide. I'm remembering now why.

785
01:22:38.930 --> 01:22:39.960
David Bau: I put these things in here.

786
01:22:40.150 --> 01:22:42.130
David Bau: Okay, so, is that right? And then…

787
01:22:42.280 --> 01:22:49.119
David Bau: So, for a causal probe, actually, the paper that you read, TCAV,

788
01:22:49.410 --> 01:22:54.459
David Bau: is a little bit causal, because it has two steps. First, it gets this probe direction.

789
01:22:54.630 --> 01:22:58.390
David Bau: Right? By, by just doing a regular, linear.

790
01:22:58.550 --> 01:23:05.090
David Bau: you know, classification training. But then once you have this classifier…

791
01:23:05.440 --> 01:23:12.450
David Bau: They don't just recommend reporting the accuracy, which is what most… that's the standard thing to do with probes: you just train this thing.

792
01:23:12.560 --> 01:23:20.750
David Bau: And then you take a holdout set that you didn't train on, and then you just report accuracy on that thing, and then that tells you, like, how much this information is present. That's the normal formula.

793
01:23:20.920 --> 01:23:21.710
David Bau: And…

794
01:23:21.910 --> 01:23:30.000
David Bau: But what they do in TCAV is they say, no, actually, what you're really interested in is… like, the concept of stripiness might be present.

795
01:23:30.170 --> 01:23:35.150
David Bau: But we want to know how much it is used in the classification of zebras.

796
01:23:35.810 --> 01:23:36.590
David Bau: Right?

797
01:23:36.810 --> 01:23:38.789
David Bau: And so what they do is they say.

798
01:23:38.980 --> 01:23:41.170
David Bau: If you took this gradient, if you ask.

799
01:23:41.480 --> 01:23:46.359
David Bau: If you take, like, perturbations of the representation.

800
01:23:46.660 --> 01:23:49.960
David Bau: How much does it change the prediction of Zebra?

801
01:23:50.360 --> 01:23:56.390
David Bau: Alright, so that's basically what this gradient is. Basically, we've got the prediction of zebra, and we're asking…

802
01:23:56.500 --> 01:23:58.510
David Bau: What is the gradient?

803
01:23:58.660 --> 01:24:00.230
David Bau: What is the derivative?

804
01:24:00.480 --> 01:24:08.110
David Bau: Of, of the prediction of zebra with respect to… Vector changes in the representation.

805
01:24:08.600 --> 01:24:11.029
David Bau: So that gives you, like, you know, how much…

806
01:24:11.290 --> 01:24:16.249
David Bau: The change in the representation change zebra, and then you dot product it with

807
01:24:16.790 --> 01:24:19.879
David Bau: This stripey direction that you got from the probe.

808
01:24:19.990 --> 01:24:24.320
David Bau: And so what this is really telling you is: if we made a perturbation…

809
01:24:24.580 --> 01:24:26.850
David Bau: If things were a little bit more stripey.

810
01:24:27.300 --> 01:24:30.800
David Bau: Right? If you made a perturbation, like, in the direction of this concept of a stripe.

811
01:24:31.390 --> 01:24:34.510
David Bau: Then, how much would that proportionally change?

812
01:24:34.660 --> 01:24:38.819
David Bau: your score for Zebra. Would it push it up? Would it push it up a little? Would it push it up a lot?

813
01:24:39.350 --> 01:24:41.020
David Bau: And so whatever this number is.

814
01:24:41.610 --> 01:24:45.740
David Bau: That's, like, the TCAV score of, like, how much stripes

815
01:24:46.220 --> 01:24:49.099
David Bau: Are thought of as causing zebras.
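
NOTE
A bare-bones version of that computation, with toy tensors standing in for a real network and a real probe direction. The last line is the actual TCAV-style score: the fraction of examples whose directional derivative is positive.
    import torch
    h = torch.randn(32, 128, requires_grad=True)   # toy layer activations
    zebra_head = torch.nn.Linear(128, 1)           # stand-in for the rest of the net
    logit_zebra = zebra_head(h).sum()
    grad = torch.autograd.grad(logit_zebra, h)[0]  # d(zebra score)/d(activation)
    cav = torch.randn(128)                         # the "stripey" direction from a probe
    cav = cav / cav.norm()
    sensitivity = grad @ cav                       # effect of a nudge toward "stripey"
    tcav_score = (sensitivity > 0).float().mean()  # fraction pushed toward zebra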

816
01:24:49.560 --> 01:24:53.799
David Bau: That make sense? So, because that's, like… like, maybe I want to know if, like,

817
01:24:54.010 --> 01:24:58.360
David Bau: being a professor leads to bias against me, so I have a different sentence that says.

818
01:24:58.490 --> 01:25:03.930
David Bau: You know, do you like this person or not? Yes or no? Right? And I'm looking for…

819
01:25:04.140 --> 01:25:09.400
David Bau: a no, like, the model doesn't like this person. And then I, and I put this professor email

820
01:25:09.610 --> 01:25:20.030
David Bau: before, and I… and I, like, train a classifier on all the professor emails, and stuff like that. And then now what I want to know is not just, like, is the probe accurate? Like, can it tell if it's a professor or not?

821
01:25:20.250 --> 01:25:30.320
David Bau: But then I also want to see, if I bump it in that direction to make it think it's a professor, how much is it saying, you know, adjusting the, the,

822
01:25:30.590 --> 01:25:33.810
David Bau: the prediction toward "no"?

823
01:25:33.980 --> 01:25:38.080
David Bau: Awful, right? How much is it making it say, no, I don't like this person?

824
01:25:38.630 --> 01:25:46.970
David Bau: So… so then, then I would say, yes, you know, the model has this class of things that it doesn't like, and one of the major signals that it uses in that class

825
01:25:47.160 --> 01:25:50.590
David Bau: is whether you're a professor or not. If you're a professor, it really doesn't like you.

826
01:25:51.320 --> 01:25:55.700
David Bau: And, you know, I can compare it to other things. Like, if you like ice cream, it does like you.

827
01:25:55.860 --> 01:25:56.809
David Bau: Something like that.

828
01:25:57.020 --> 01:25:58.000
David Bau: That make sense?

829
01:25:58.100 --> 01:26:00.529
David Bau: So you can… so this is what the TCAV

830
01:26:00.960 --> 01:26:03.440
David Bau: proposal is.

831
01:26:04.020 --> 01:26:13.229
David Bau: TCAV is, you know, a well-read paper. It's not a bad technique to try if you want to try to attach some causality, but also.

832
01:26:13.430 --> 01:26:17.659
David Bau: For more modern takes, since a lot has happened after TCAV, you might look at the other papers that I mentioned.

833
01:26:18.460 --> 01:26:20.709
David Bau: But this is… this is one way of getting causality out.

834
01:26:21.420 --> 01:26:37.129
David Bau: Yes, now, okay, now Bryce has a question, right? Yes. Sorry, I was following up on what you said. Yes. I was just wondering, how effective are, like, gradient-based interpretations on LLMs? Right. Right. Like, I'm… for example, I suspect, like, a lot of tokens might have, like.

835
01:26:37.620 --> 01:26:41.019
David Bau: non-zero gradients. Yes. And when it comes to, like… yes.

836
01:26:41.290 --> 01:26:45.349
David Bau: Yes, I think that, it's maybe controversial.

837
01:26:45.710 --> 01:26:48.010
David Bau: Well, I think,

838
01:26:48.650 --> 01:27:00.799
David Bau: After Spring Break, I'm gonna see if I can get Gabriel to come talk about salience methods, which are very, very gradient-based, and I think he's a believer that they have a lot of value.

839
01:27:00.820 --> 01:27:15.019
David Bau: on LLMs, and so I'll let him, like, give you the optimistic point of view. I think that the pessimistic point of view is that these systems are trained on this very discrete data, so it may be a little different from, like, image classifiers, where images are kind of continuous data.

840
01:27:15.110 --> 01:27:23.310
David Bau: But, like, in really discrete data, it's not as clear that the gradient is informative as it is in images. On the other hand.

841
01:27:23.440 --> 01:27:29.200
David Bau: you know, what else do we have, right? So the other thing that we have is we could do a non…

842
01:27:29.500 --> 01:27:40.109
David Bau: minuscule perturbation; we can actually change something in the direction. Like, really, really substitute in a thing, which is what patching is. And so I'm… that's why I mostly recommend people do patching.

843
01:27:40.370 --> 01:27:43.279
David Bau: That makes sense. It's, it's equivalent.

844
01:27:45.210 --> 01:27:48.870
David Bau: Okay, or steering is the equivalent of it, but macroscopic.
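
NOTE
Activation patching in miniature, with a toy two-layer network standing in for the model. Real patching substitutes a hidden state from a clean run into a corrupted run of the same model at one site and measures how much the output recovers.
    import torch
    with torch.no_grad():
        layer1 = torch.nn.Linear(16, 16)     # toy stand-ins for two model layers
        layer2 = torch.nn.Linear(16, 2)
        x_clean, x_corr = torch.randn(16), torch.randn(16)
        h_clean, h_corr = layer1(x_clean), layer1(x_corr)
        h_patched = h_corr.clone()
        h_patched[:8] = h_clean[:8]          # substitute one "site" from the clean run
        effect = layer2(h_patched) - layer2(h_corr)
    # A large effect means that site carries causally relevant information;
    # steering is the same move, but adding a direction instead of substituting.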

845
01:27:49.000 --> 01:27:56.709
David Bau: So, okay, so now, now, now the other question was about spurious features, so I wanted to tell you a story. How much time do we have?

846
01:27:58.420 --> 01:27:59.390
David Bau: 15 minutes.

847
01:27:59.500 --> 01:28:01.810
David Bau: Okay, so, story time.

848
01:28:02.640 --> 01:28:03.560
David Bau: 25.

849
01:28:04.660 --> 01:28:12.230
David Bau: This is about a paper I'm very proud to have helped out on, and a mistake that we made.

850
01:28:12.680 --> 01:28:14.670
David Bau: In the paper. So, a research mistake.

851
01:28:14.940 --> 01:28:17.079
David Bau: I'll fess up to you, you're on video.

852
01:28:17.550 --> 01:28:18.910
David Bau: Okay,

853
01:28:19.100 --> 01:28:30.170
David Bau: So this is… this is Cassidy's paper, and it's about emergent world representations. And the idea is, it's about concepts. It's about, like, does a model have a concept inside? And here's the basic setup.

854
01:28:30.320 --> 01:28:34.490
David Bau: The setup is, let's say you were playing Othello.

855
01:28:35.160 --> 01:28:38.460
David Bau: And, with, with, with your friend.

856
01:28:38.910 --> 01:28:50.349
David Bau: And every day, you, you play Othello, but your friend is very loud and formal about Othello, so every time… what your friend says is, he says, when we play, I want you to shout out

857
01:28:50.700 --> 01:28:58.990
David Bau: the move that you make. I'm going in C4. Okay, I'm going in B3. Okay, C6, so you shout these moves out really quickly.

858
01:28:59.160 --> 01:29:00.710
David Bau: And then you play the whole game.

859
01:29:01.050 --> 01:29:03.720
David Bau: And then… and then, one day.

860
01:29:03.850 --> 01:29:10.480
David Bau: You, you play, you're playing, and then, and then there's this odd sound. You hear this bird going like…

861
01:29:10.620 --> 01:29:11.610
David Bau: C6?

862
01:29:13.890 --> 01:29:20.669
David Bau: Right? And outside, there's this crow that's learned how to imitate your games, and oddly enough.

863
01:29:20.950 --> 01:29:24.170
David Bau: Right? When you look at the crow, right?

864
01:29:24.270 --> 01:29:29.150
David Bau: It's, you know, you say D3, and the crow is thinking.

865
01:29:29.360 --> 01:29:35.939
David Bau: You know, C5, B6, E3, right? It's like, what… actually, when you go and you listen to the crow.

866
01:29:36.160 --> 01:29:38.710
David Bau: And you realize it's actually emitting

867
01:29:39.020 --> 01:29:46.829
David Bau: valid games, and after you say a bunch of games to it, right, the odd thing is, it will always tell you another valid

868
01:29:47.460 --> 01:29:48.610
David Bau: Othello move.

869
01:29:49.410 --> 01:29:52.159
David Bau: After… after you've done this.

870
01:29:52.620 --> 01:29:56.180
David Bau: So, if you had come across a strange crow.

871
01:29:56.370 --> 01:30:01.220
David Bau: that could utter valid Othello moves with every word that it said.

872
01:30:01.750 --> 01:30:03.860
David Bau: It would lead you to this question.

873
01:30:06.050 --> 01:30:09.189
David Bau: Does the crow know what Othello is?

874
01:30:10.100 --> 01:30:12.839
David Bau: Right? It's just, like, listen to these words.

875
01:30:13.800 --> 01:30:19.430
David Bau: Does it have… like, what… so what you call it is, you'd call it a world model, right?

876
01:30:19.590 --> 01:30:27.590
David Bau: Not just, like, a single binary concept, but, like, does the crow have, like, a complete, coherent idea

877
01:30:28.030 --> 01:30:42.760
David Bau: of what Othello is. Does the crow, like, know in its head, like, what the game board is? And certainly, if you, like, went to the… went outside and looked at the crow, and it was standing there on top of an Othello board, and it was, like, pushing little seeds around.

878
01:30:42.910 --> 01:30:49.320
David Bau: Then you'd be like, oh yeah, it knows. It knows. But if it was just saying it, then, like, how do you know?

879
01:30:49.570 --> 01:30:54.629
David Bau: whether the crow knows what Othello is. So, one way of knowing is to use probes.

880
01:30:55.610 --> 01:30:57.720
David Bau: Right, and here's how the probe would work.

881
01:30:58.190 --> 01:31:00.660
David Bau: And this is what we did in our paper, right?

882
01:31:00.790 --> 01:31:09.859
David Bau: We went to the crow, which obviously is an 8-layer autoregressive transformer, and we went into its brain.

883
01:31:10.140 --> 01:31:14.920
David Bau: went to each layer, and we said, at this layer, I wonder…

884
01:31:15.160 --> 01:31:22.410
David Bau: If it knows whether this square here is occupied by a black stone or a white stone in Othello.

885
01:31:23.330 --> 01:31:24.050
David Bau: Right?

886
01:31:24.300 --> 01:31:26.130
David Bau: And there are 64 squares.

887
01:31:26.260 --> 01:31:28.160
David Bau: And so there's 64 probes.

888
01:31:28.330 --> 01:31:37.999
David Bau: like, I wonder if we can get, like, high-accuracy probes for all 64 squares. Like, this collection of 64 probes is like a world model of Othello.

889
01:31:38.310 --> 01:31:41.189
David Bau: And if you can get high accuracy on all of them, then

890
01:31:41.310 --> 01:31:47.189
David Bau: Definitely, this model has, like, not only information about Othello, but, like, the whole board mapped out.

891
01:31:47.310 --> 01:31:48.759
David Bau: Does that make sense? Is that cool?
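
NOTE
What that setup looks like in code, with random tensors standing in for the transformer's hidden states and the true board labels (a sketch, not the paper's code):
    import torch
    n, hidden = 1000, 512
    H = torch.randn(n, hidden)               # activations, one row per game state
    board = torch.randint(0, 3, (n, 64))     # per square: 0=empty, 1=white, 2=black
    probes = [torch.nn.Linear(hidden, 3) for _ in range(64)]
    for sq, probe in enumerate(probes):      # one linear probe per board square
        opt = torch.optim.Adam(probe.parameters(), lr=1e-2)
        for _ in range(100):
            loss = torch.nn.functional.cross_entropy(probe(H), board[:, sq])
            opt.zero_grad(); loss.backward(); opt.step()
    # Evaluate each probe on held-out states; high accuracy on all 64 squares
    # is the evidence that the whole board is mapped out in the representation.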

892
01:31:48.970 --> 01:31:57.309
David Bau: So this is, like, another example of what you might look for. People doing this kind of mapping might look for things like this, right? Just like this, right?

893
01:31:57.630 --> 01:31:59.379
David Bau: And I'll tell you, this is what we found.

894
01:32:02.190 --> 01:32:20.690
David Bau: We trained a whole bunch of probes, and here's, like, the 64 probes, and you get the little vectors for the probes, and they're arranged in these cool geometries all over the place. But the cool thing is not just the geometry: these probes are, like, 91% accurate, 98% accurate. They're pretty good, right? If you train the probes.

895
01:32:21.240 --> 01:32:38.899
David Bau: But we trained the probes two different ways. When we train nonlinear probes, we get 98% accuracy. It tells the story that we wanted to tell. We wanted to tell the story that these transformers have a world model, right? It's like looking inside the crow, and you see a little Othello board in there. It's amazing, right?

896
01:32:38.990 --> 01:32:49.090
David Bau: But the sad reality is that when we… when we actually did this experiment, I kept on hounding him to train linear probes and get linear probes to work.

897
01:32:49.090 --> 01:33:02.059
David Bau: And the highest accuracy he could squeeze out was, like, 79%, which is, like, rounding up. Yeah, I think, like, he was probably more at 75%, right? But, like, the most he'd get out of linear probes was 79%. So, like, 20% of the time…

898
01:33:02.130 --> 01:33:03.809
David Bau: 25% of the time, they're wrong.

899
01:33:04.210 --> 01:33:08.300
David Bau: Right? Which… which is not that convincing for such a simple…

900
01:33:09.150 --> 01:33:17.520
David Bau: problem, where it's just classifying between white and black, right? It's, like, only a bit better than random, right? You want to be in the 90%s.

901
01:33:17.730 --> 01:33:18.830
David Bau: And so…

902
01:33:19.160 --> 01:33:28.110
David Bau: So, he had to write this text, and then… so we did this: we made an MLP and said, so, what's wrong with a multi-layer perceptron?

903
01:33:34.630 --> 01:33:36.890
David Bau: Who doesn't like multi-layer probes?

904
01:33:38.520 --> 01:33:39.400
David Bau: Oof.

905
01:33:39.930 --> 01:33:49.849
David Bau: You don't like multi-layer probes. Who that you read doesn't like multi-layer probes? I'll give you a hint. We already read Been Kim's paper, we talked about that, so who was the other paper?

906
01:33:51.500 --> 01:33:52.340
David Bau: Hewitt!

907
01:33:52.680 --> 01:33:58.029
David Bau: Okay, Hewitt doesn't like multi-layer probes, and I'll tell you why in a minute, but I'll tell you why through this story first.

908
01:33:58.110 --> 01:34:14.970
David Bau: Right, okay, so this is, like, in the appendix or something. Here's, like, the error rate for Othello on linear probes. It's sitting in the 25% range. Like, it's pretty bad, right, for such a simple problem, right? Yes. Two minutes. Okay, so the upshot is…

909
01:34:15.850 --> 01:34:22.589
David Bau: The upshot is… after we published this paper, like, a month later.

910
01:34:23.980 --> 01:34:26.969
David Bau: Neel Nanda posted this blog post.

911
01:34:28.400 --> 01:34:35.149
David Bau: And it says… Actually, there's a linear probe that works.

912
01:34:36.200 --> 01:34:39.219
David Bau: And he says, if you use a linear probe.

913
01:34:39.770 --> 01:34:41.769
David Bau: Here are the accuracies you get.

914
01:34:42.300 --> 01:34:46.110
David Bau: And where white is, like, zero, right? Or, like, you know, near zero.

915
01:34:46.420 --> 01:35:00.179
David Bau: And blue is, like, 50%. So, near the edge, you can guess somewhat. The reason that we're up at 25% is because, oh, maybe at the edge we're better at guessing, but, like, in the middle you can't get any accuracy: it's 50%, right? So it's, like, random, right?

916
01:35:00.350 --> 01:35:07.709
David Bau: But then he says, hey, but look, if I take this classifier, and I don't train it on whether it's white or black on every move.

917
01:35:08.040 --> 01:35:14.780
David Bau: But if I train it, whether it's white or black, on every other move, like, on my move or not my move.

918
01:35:14.880 --> 01:35:16.830
David Bau: Its accuracy was 100%.

919
01:35:17.480 --> 01:35:18.230
David Bau: Right?

920
01:35:18.810 --> 01:35:20.030
David Bau: In other words.

921
01:35:20.830 --> 01:35:27.069
David Bau: if I… if I train the classifier, but I flip its parity, so it's not answering whether it's white or black.

922
01:35:28.120 --> 01:35:34.700
David Bau: And it's answering whether that square is occupied by the person to move, or the person to play, the person opposite.

923
01:35:34.900 --> 01:35:39.179
David Bau: Right? That is also 100%. So we were… we had…

924
01:35:39.460 --> 01:35:48.490
David Bau: trained the wrong classifier. We had trained the classifier on white or black, when the right classifier was: is this square occupied by the person to move or the opponent?
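
NOTE
The relabeling trick, sketched: keep the same activations, but recode each square as mine-vs-theirs relative to the player about to move. The strict alternation assumed below is an illustration (real Othello games can skip turns):
    import torch
    board = torch.randint(0, 3, (60, 64))    # 0=empty, 1=white, 2=black, one row per move
    white_to_move = (torch.arange(60) % 2 == 1).unsqueeze(1)  # assumed parity convention
    mine_theirs = board.clone()              # will hold 0=empty, 1=mine, 2=theirs
    mine_theirs[(board == 1) & white_to_move] = 1   # white stone, white to move: mine
    mine_theirs[(board == 2) & white_to_move] = 2   # black stone, white to move: theirs
    mine_theirs[(board == 1) & ~white_to_move] = 2  # white stone, black to move: theirs
    mine_theirs[(board == 2) & ~white_to_move] = 1  # black stone, black to move: mine
    # A linear probe trained against mine_theirs is the one that turns out to be
    # near-perfect, because the model tracks the current player, not the color.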

925
01:35:48.880 --> 01:35:53.689
David Bau: Right? And so, if you think… if you… I mean, so neural network people will tell you,

926
01:35:53.830 --> 01:36:04.579
David Bau: to convert from a flip thing to a not-flip thing is an XOR function, and XOR is the thing that you can't implement in a single-layer neural network, so you need a second layer to do it. So the whole reason you need a second layer is to implement this XOR function.

927
01:36:04.740 --> 01:36:06.779
David Bau: So that you can convert from one to the other.
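
NOTE
The XOR point in four data points, as a tiny self-contained demo: no single linear layer can separate these labels, but one hidden layer can.
    import torch
    X = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
    y = torch.tensor([0., 1., 1., 0.])       # XOR of the two inputs
    net = torch.nn.Sequential(torch.nn.Linear(2, 4), torch.nn.ReLU(), torch.nn.Linear(4, 1))
    opt = torch.optim.Adam(net.parameters(), lr=0.1)
    for _ in range(500):
        loss = torch.nn.functional.binary_cross_entropy_with_logits(net(X).squeeze(-1), y)
        opt.zero_grad(); loss.backward(); opt.step()
    print((net(X).squeeze(-1) > 0).long())   # usually recovers [0, 1, 1, 0]
    # Swap net for a single Linear(2, 1) and the loss plateaus: the flip from
    # white/black to mine/theirs is exactly the XOR a linear probe cannot do.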

928
01:36:07.050 --> 01:36:12.650
David Bau: It's like a total waste. Anyway, so… Hewitt says, don't use MLPs.

929
01:36:13.100 --> 01:36:18.310
David Bau: MLPs have this overfitting problem. Oh, we have a lot of other questions to get through.

930
01:36:18.630 --> 01:36:20.780
David Bau: We'll talk about it next time.

931
01:36:21.140 --> 01:36:25.939
David Bau: But the way to really do a probe is that

932
01:36:26.700 --> 01:36:28.720
David Bau: Like, this is the bottom line.

933
01:36:29.510 --> 01:36:32.979
David Bau: The green line is linear probes.

934
01:36:33.530 --> 01:36:41.849
David Bau: And Hewitt's way of saying it is, what you really want is things that are selective, that are surprisingly good. You want something that's surprisingly good.

935
01:36:42.060 --> 01:36:56.520
David Bau: You don't want something where it was really hard to extract the information out. And here, this green line is always surprisingly good. It's, like, floating above these other classifiers that you had to train so hard to get something out of. And so… so Hewitt's takeaway is that

936
01:36:56.670 --> 01:37:01.530
David Bau: Accuracy is not the point of a probe. You're not trying to get the most accuracy.

937
01:37:01.730 --> 01:37:02.440
David Bau: Right.

938
01:37:02.690 --> 01:37:11.429
David Bau: like, it's the surprise. It's like, how much unexpected accuracy you have. Like, better than… accuracy, better than some baseline.

939
01:37:11.580 --> 01:37:20.030
David Bau: Okay, so Hewitt suggests some baselines, and he says the gap between those accuracies is what he calls selectivity, but that's… so it's an important point to get. I can talk about it a little more next time.
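
NOTE
Roughly what that selectivity comparison looks like in code: train the same probe once on real labels and once on a control task with the signal destroyed, and report the gap. Toy tensors throughout; Hewitt's actual control tasks reassign labels per word type, and everything should be scored on held-out data.
    import torch
    def probe_accuracy(H, y, steps=300):
        probe = torch.nn.Linear(H.shape[1], int(y.max()) + 1)
        opt = torch.optim.Adam(probe.parameters(), lr=1e-2)
        for _ in range(steps):
            loss = torch.nn.functional.cross_entropy(probe(H), y)
            opt.zero_grad(); loss.backward(); opt.step()
        return (probe(H).argmax(-1) == y).float().mean().item()
    H = torch.randn(500, 64)                 # toy activations
    y = torch.randint(0, 3, (500,))          # real concept labels
    y_ctrl = y[torch.randperm(len(y))]       # control: same marginals, no signal
    selectivity = probe_accuracy(H, y) - probe_accuracy(H, y_ctrl)
    # High accuracy with low selectivity means the probe, not the model, is
    # doing the work, which is the complaint about big MLP probes.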

940
01:37:20.390 --> 01:37:23.760
David Bau: But, but yeah, that's the story of probes.

941
01:37:24.180 --> 01:37:24.950
David Bau: Right?

942
01:37:25.830 --> 01:37:26.690
David Bau: Okay.

943
01:37:27.810 --> 01:37:28.690
David Bau: Okay.

944
01:37:36.050 --> 01:37:41.279
David Bau: [inaudible]

945
01:37:41.800 --> 01:37:44.999
David Bau: [inaudible]

946
01:37:45.650 --> 01:37:47.689
David Bau: [inaudible]

