WEBVTT

1
00:00:08.260 --> 00:00:10.149
David Bau: Go ahead, keep on going.

2
00:00:13.170 --> 00:00:23.380
David Bau: In the Anthropic paper, they found emotion vectors, and then they showed some causal experiments with those emotion vectors.

3
00:00:23.650 --> 00:00:33.449
David Bau: And it made me look in the literature at what had already been found with respect to…

4
00:00:33.630 --> 00:00:41.640
David Bau: emotions, and… there was an interesting paper from a group from Google

5
00:00:42.320 --> 00:00:49.249
David Bau: that I found very important. They show that there is a gap between

6
00:00:49.370 --> 00:00:53.070
David Bau: how… how models report their feelings, versus how they behave.

7
00:00:53.180 --> 00:01:03.979
David Bau: In psychology, there is a method to measure emotions within humans by asking them directly how they feel about something.

8
00:01:04.230 --> 00:01:07.760
David Bau: And… And it is known…

9
00:01:07.870 --> 00:01:18.170
David Bau: It is known that this type of measurement correlates with real emotions and how people actually behave.

10
00:01:18.450 --> 00:01:22.439
David Bau: While in models, there is a gap.

11
00:01:22.730 --> 00:01:28.770
David Bau: We can ask models how they feel, and they will report something.

12
00:01:28.770 --> 00:01:45.220
David Bau: But actually, when we test them within a situation, we describe a situation and measure how they will behave, we see that sometimes they behave differently from what they report as their feelings.

13
00:01:45.260 --> 00:02:02.520
David Bau: So I found it interesting, because in some sense, as the Anthropic paper shows, models behave like humans. Well, we don't know if they have subjective experience, but they mimic the…

14
00:02:03.070 --> 00:02:05.750
David Bau: the automatic part of the feelings.

15
00:02:05.970 --> 00:02:18.749
David Bau: But in another sense, like in this paper from Google, we see that there are differences, so it is interesting to find what is similar and also what's different.

16
00:02:20.120 --> 00:02:22.470
David Bau: Cool. Yeah, very interesting time.

17
00:02:25.150 --> 00:02:26.510
David Bau: I have a question.

18
00:02:27.070 --> 00:02:31.160
David Bau: for the lab generally, as an outsider, I don't want to get too off track.

19
00:02:31.330 --> 00:02:43.289
David Bau: Do you guys generally see… like, when you're talking about what is actually in… in the model, in terms of, like, how the model feels about a certain

20
00:02:43.810 --> 00:02:51.789
David Bau: context. Are you talking about the model as just the autoregressive part that produces one token at a time, or are you seeing the whole system as, like.

21
00:02:52.120 --> 00:02:56.560
David Bau: Like, are you seeing the scratch pad as something that alters the model fundamentally?

22
00:02:57.170 --> 00:03:02.880
David Bau: I didn't understand the question. Can you try to give an example or something? Like,

23
00:03:03.810 --> 00:03:11.889
David Bau: maybe an example with humans, what would you mean if… Well, it's hard to draw a parallel with humans, because we don't have a scratch pad.

24
00:03:12.050 --> 00:03:19.940
David Bau: But, like, when a model… if a model's… activations for one token

25
00:03:20.300 --> 00:03:25.400
David Bau: give it the… general tendency towards…

26
00:03:28.030 --> 00:03:33.470
David Bau: Or let's say, like, you know how chain of thought allows for a model to solve some problems that it couldn't solve otherwise?

27
00:03:34.340 --> 00:03:35.760
David Bau: Yes, okay.

28
00:03:36.330 --> 00:03:40.060
David Bau: Does that mean… In that case, does the model

29
00:03:40.780 --> 00:03:43.370
David Bau: have the capability to solve that problem?

30
00:03:44.100 --> 00:03:47.030
David Bau: Oh, it reminds me of psychotherapy.

31
00:03:47.210 --> 00:03:56.389
David Bau: That people sometimes need to talk in a room in order to become aware of what they feel.

32
00:03:56.500 --> 00:04:02.800
David Bau: Because otherwise, it's like they have it, but… so the same question is, do they have it?

33
00:04:03.770 --> 00:04:04.560
David Bau: Okay.

34
00:04:04.740 --> 00:04:12.540
David Bau: I feel like there's a similarity, too, with, like, memory palaces. Like, if you memorize a long list of things using visual cues, like, did you memorize the thing?

35
00:04:13.150 --> 00:04:17.350
David Bau: I do. Okay. Does this lab have a general philosophy on, like.

36
00:04:18.220 --> 00:04:27.229
David Bau: Are you isolating, like, what the model truly, truly thinks to just its activations, or also including its behavior in chain-of-thought contexts?

37
00:04:30.620 --> 00:04:37.120
David Bau: I can also ask… what if it's in its activations, but…

38
00:04:37.390 --> 00:04:40.049
David Bau: it is not using it. Yeah.

39
00:04:40.580 --> 00:04:46.119
David Bau: So, within humans, we also have the notion of subconscious.

40
00:04:46.680 --> 00:04:54.350
David Bau: So, are we aware of it? If it's in our subconscious, and if it… activates us,

41
00:04:55.020 --> 00:04:58.420
David Bau: But we are not aware of it, so what does it mean?

42
00:04:59.950 --> 00:05:08.509
David Bau: I think that your question might be answered if we, you know, we'll let people go around and show what they're working on, and then you'll see there's a bunch of people

43
00:05:08.650 --> 00:05:10.760
David Bau: looking at, you know, chain of thought,

44
00:05:10.880 --> 00:05:17.500
David Bau: you know, rollouts and different things like that. That make sense? The impression I'm getting is it's generally diverse.

45
00:05:17.820 --> 00:05:33.689
David Bau: people's opinions on this? Yeah, I think that if you sit down and let me talk to you for an hour, I'd probably, you know, tell you, like, a really broad view of what interpretability is beyond, like, activations, beyond…

46
00:05:34.340 --> 00:05:42.799
David Bau: beyond chain of thought to, you know, to all sorts of other things as well. I think that, you know, my general theme is…

47
00:05:43.050 --> 00:05:46.060
David Bau: that our charter is to help people deal with

48
00:05:46.250 --> 00:05:49.060
David Bau: the complexity of very complex systems.

49
00:05:49.410 --> 00:05:53.850
David Bau: And there's a lot of complexity to be cracked.

50
00:05:54.830 --> 00:05:56.800
David Bau: In different ways, yeah.

51
00:05:57.670 --> 00:06:03.139
David Bau: so, oops, wrong, wrong arrow. Like, Andy, Andy, thanks.

52
00:06:04.380 --> 00:06:05.220
David Bau: Okay.

53
00:06:07.290 --> 00:06:09.360
David Bau: Yeah, I don't know, I have a really…

54
00:06:09.570 --> 00:06:18.570
David Bau: Well, I have a, like, comic in the top right about what it's like to feel like an interpretability researcher in 2026.

55
00:06:19.450 --> 00:06:24.599
David Bau: What… what was the original caption for this comic?

56
00:06:25.020 --> 00:06:32.399
David Bau: Oh, I made the whole thing. Oh, you made the whole thing? Oh, it's your… it's your own… it's your own New Yorker. Right. I think. Yeah, okay.

57
00:06:32.790 --> 00:06:33.830
David Bau: Cool.

58
00:06:34.800 --> 00:06:39.520
David Bau: Okay, then the rest is, like, this, like, stupid…

59
00:06:39.680 --> 00:06:57.830
David Bau: experiment where, like, literally, I just want a model to learn the identity function, because that's, like, the point of residual connections, right? It, like, allows you to, like, learn a simple solution to identity, right? Okay, so then, like, take a two-layer residual network with linear… linear blocks.

60
00:06:58.260 --> 00:07:06.339
David Bau: And just train it to, like, model the identity. And the middle plot is, like, the norms of the weights.

61
00:07:06.640 --> 00:07:07.709
David Bau: And, like…

62
00:07:08.190 --> 00:07:13.740
David Bau: the different lines are different, like, amounts of weight decay. So the darkest line is, like, no weight decay.

63
00:07:14.160 --> 00:07:26.889
David Bau: The lightest line is strongest weight decay. So, only for, like, the strongest amount of weight decay do the… does the model figure out, oh, I should put my weights to zero to just learn the identity?

64
00:07:27.150 --> 00:07:30.879
David Bau: And for the others, it, like, plateaus at some, like.

65
00:07:31.110 --> 00:07:40.149
David Bau: solution where actually the matrices are, like, non-zero, but it's still, modeling the identity. Approximately, though.

66
00:07:40.290 --> 00:07:47.209
David Bau: Yeah, approximately. Like, the loss isn't exactly at zero, but it's, like, it's… it's, it's plateaued.
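[Editor's sketch of the experiment just described. The width, learning rate, step count, and use of plain SGD are all assumptions; the talk doesn't give hyperparameters.]

```python
# Sketch: a two-layer linear residual network y = (I + W1)(I + W0) x,
# trained on the identity task y = x, with and without weight decay.
import numpy as np

rng = np.random.default_rng(0)
d = 8  # hidden width (assumed)

def train(weight_decay, steps=5000, lr=0.05):
    W0 = 0.1 * rng.standard_normal((d, d))
    W1 = 0.1 * rng.standard_normal((d, d))
    for _ in range(steps):
        x = rng.standard_normal((32, d))       # fresh batch of inputs
        h = x + x @ W0.T                       # first residual block
        y = h + h @ W1.T                       # second residual block
        err = y - x                            # identity target
        gW1 = err.T @ h / len(x)               # grad of MSE wrt W1
        gW0 = (err + err @ W1).T @ x / len(x)  # grad of MSE wrt W0
        W0 -= lr * (gW0 + weight_decay * W0)
        W1 -= lr * (gW1 + weight_decay * W1)
    return W0, W1

for wd in (0.0, 0.1):
    W0, W1 = train(wd)
    print(f"wd={wd}: ||W0||={np.linalg.norm(W0):.3f}  ||W1||={np.linalg.norm(W1):.3f}")
```

In this toy version, the no-decay run plateaus with weight norms near their initialization scale (a cancelling, non-zero solution), while strong decay collapses both matrices toward the all-zeros solution, matching the plot being discussed.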

67
00:07:48.190 --> 00:07:53.270
David Bau: So I thought that was, like, kind of weird and interesting. There's, like, this local minima of, like,

68
00:07:54.220 --> 00:08:04.289
David Bau: And you can see on the right is, like, this interaction term where I'm measuring, like, how much W1 and W0 are, like, talking to each other.

69
00:08:04.530 --> 00:08:11.150
David Bau: And, like, Yeah, some of the solutions, like, are… are… Talking to each other.

70
00:08:11.800 --> 00:08:21.120
David Bau: Sorry, because… Sorry. Do you think it's because, like, this… every parameter is getting trained, sort of, like, individually? They are…

71
00:08:21.560 --> 00:08:24.960
David Bau: learning to cancel each other out. Yeah, right.

72
00:08:25.050 --> 00:08:41.720
David Bau: Right, yeah. I'd say the probability, even with, like, a lot of data, the probability of all the weights, like, the solution being just, like, zero, is, like, low relative to the number of convoluted data-specific solutions. But that's the global minimum here, right? Because you have a weight decay term.

73
00:08:43.260 --> 00:08:57.989
David Bau: And zero… gives you zero loss. So that… that is the global minimum. So this is showing that, like, there are, like, these local minima solutions. Yes, yeah, that is… So there is also, like, another explanation for this, like…

74
00:08:58.660 --> 00:09:01.279
David Bau: Of all the possible matrices you can think of,

75
00:09:01.730 --> 00:09:09.840
David Bau: the configurations that just cancel each other out might just have a higher probability of appearing compared to, like, literally zeros.

76
00:09:10.140 --> 00:09:11.349
David Bau: That's your reaction.

77
00:09:11.390 --> 00:09:30.170
David Bau: Probability over what, like, what's… Just, just random solutions. So the probability of exactly zero would be just zero. It would be normal, like, there would be a range of possible solutions that wouldn't be at the lowest minimum, but those would be higher probability, which is higher than… Right, and, like, near init, like, you're closer to those solutions.

78
00:09:30.510 --> 00:09:33.180
David Bau: So, do you have an understanding of what these solutions are?

79
00:09:33.550 --> 00:09:34.230
David Bau: Oh.

80
00:09:34.630 --> 00:09:52.459
David Bau: I just have this plot right now. Okay. Like, it's such a… it's such a basic setup, but you wonder that, oh, you know, is this… is this sort of like a little flower, a little fractal, or something that shows up for all solutions of all problems, right? You know what I mean? Yeah, because, like…

81
00:09:52.860 --> 00:09:55.220
David Bau: Actually, ZN was like…

82
00:09:55.510 --> 00:10:01.350
David Bau: showed me this. He was like, this… this is mech interp's nightmare, because, like…

83
00:10:01.860 --> 00:10:19.229
David Bau: Because, like, the model… like, if you look functionally, it's doing some super simple thing, but when you look at the weights, it's, like, all this, like, messed up shit. Yeah, it's, like, distributed. Yeah. Yeah, it's arbitrarily distributed. It's like… it's like a little encryption function. But if you learn any function, it's kind of implicitly, like…

84
00:10:19.230 --> 00:10:25.710
David Bau: Like, something plus identity. Yeah, plus identity, or scaled, right? So, like, why should you expect…

85
00:10:25.750 --> 00:10:30.819
David Bau: to, like, learn that thing when you might have learned some, like, tangled… Something plus identity. Yeah, yeah.

86
00:10:31.110 --> 00:10:31.910
David Bau: Yeah.

87
00:10:32.090 --> 00:10:42.760
David Bau: Although, but if you can recognize it, then it's like, oh, maybe… maybe you can, like, mod it out, whatever the structure is. So the other interesting thing here is if you… if the two blocks…

88
00:10:43.220 --> 00:10:45.359
David Bau: are nonlinear.

89
00:10:46.300 --> 00:10:55.009
David Bau: So this is just linear, like, the W0 and W1 are just, like, matrices. If you make those blocks nonlinear, like MLPs…

90
00:10:56.060 --> 00:10:57.980
David Bau: then actually…

91
00:10:58.350 --> 00:11:07.239
David Bau: like, Adam will get you the zero solution. Oh, it'll do it? Even without weight decay. Oh, I see. Interesting.

92
00:11:07.870 --> 00:11:10.670
David Bau: So this is… this is a linear residual network.

93
00:11:11.180 --> 00:11:25.490
David Bau: Oh, you didn't say. I did. I didn't say. Linear ResNet. Okay, okay. So what does the… what does the activation function do? It just zeroes everything out? No, no, it's like a ReLU. Yeah, it just zeroes everything out, right?

94
00:11:25.770 --> 00:11:32.800
David Bau: Oh, oh, sorry.

95
00:11:32.910 --> 00:11:36.240
David Bau: Yes. So is it? Yeah. Yes.

96
00:11:36.590 --> 00:11:43.860
David Bau: Is it just that you are trying to make W0 plus W1 plus W1 times W0 equal 0?

97
00:11:45.020 --> 00:11:46.079
David Bau: busy, dude.

98
00:11:50.390 --> 00:11:56.420
David Bau: I think so? Yeah, because it would be identity plus zero, which would then be the identity. Yeah.
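[Editor's note: the cancellation condition can be checked numerically. A quick sketch, with an arbitrary dimension and random W0: the two-block linear residual net computes (I + W1)(I + W0)x, so it is exactly the identity whenever W0 + W1 + W1·W0 = 0, and that equation has plenty of non-zero solutions.]

```python
# Verify that nonzero weights satisfying W0 + W1 + W1 @ W0 = 0
# still give an exact identity network.
import numpy as np

rng = np.random.default_rng(1)
d = 6
I = np.eye(d)

W0 = 0.3 * rng.standard_normal((d, d))
W1 = np.linalg.inv(I + W0) - I        # solve (I + W1)(I + W0) = I for W1

x = rng.standard_normal(d)
h = x + W0 @ x                        # first residual block
y = h + W1 @ h                        # second residual block

print(np.linalg.norm(W0 + W1 + W1 @ W0))  # ~0: the cancellation condition
print(np.linalg.norm(y - x))              # ~0: exact identity, nonzero weights
```

This is the family of "cancelling" solutions discussed above: one free nonzero matrix W0, with W1 determined by it.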

99
00:11:57.620 --> 00:12:04.269
David Bau: Anyway, we don't need to spend too much time… this was just a tiny aside, the addition aspect of it. The number of, like,

100
00:12:04.680 --> 00:12:10.700
David Bau: combinations of parameters that solve Y equals X. Like, it feels like there's just, like, a ton of different…

101
00:12:11.070 --> 00:12:17.279
David Bau: solutions for, and maybe something about the way that linearity works, so, like, you could be, like, in one region and, like, not…

102
00:12:17.740 --> 00:12:27.129
David Bau: leave it as… yeah. I think there's something in, like, the, numerical analysis space that would explain this sort of stuff. Okay, I'm gonna… I'm gonna go ahead. Eric, go on.

103
00:12:27.450 --> 00:12:29.420
David Bau: I'm gonna go ahead. Okay, that's okay.

104
00:12:30.750 --> 00:12:36.090
David Bau: Click on the link. Oh, I have to click on a link! Oh, I thought… okay.

105
00:12:36.790 --> 00:12:37.630
David Bau: Alright.

106
00:12:39.840 --> 00:12:41.940
David Bau: What's this?

107
00:12:42.520 --> 00:12:58.060
David Bau: So, at every token, I basically have a verbalization set up. At every token, the model is supposed to… it is saying what it's about to do. That's the intention. So before the model has started writing the logic of the code.

108
00:12:58.120 --> 00:13:03.540
David Bau: It is actually verbalizing, okay, this is what I'm about to do. So at that token position.

109
00:13:03.740 --> 00:13:06.070
David Bau: The model has enough information.

110
00:13:06.450 --> 00:13:08.430
David Bau: To, to basically…

111
00:13:10.110 --> 00:13:24.239
David Bau: say what it is about to do in that scope, and that's sort of like… so this is before it ever emits the variable sign. Yes. It says, I'm parsing the character to determine if it's a sign character. Yes. And then it starts

112
00:13:24.360 --> 00:13:27.999
David Bau: doing… I'm gonna see if there's a minus here, and then the minus shows up later.

113
00:13:28.470 --> 00:13:37.349
David Bau: And also, like, if you… if you go to the for token… This one here? Yeah, yeah, yeah. I'm processing a character.

114
00:13:40.710 --> 00:13:41.830
David Bau: For the digit.

115
00:13:41.990 --> 00:13:43.069
David Bau: I'm processing…

116
00:13:43.980 --> 00:13:51.170
David Bau: character to convert it to its numerical value. I don't know what's… I don't know what rank character is. Perfect! Of course, because it's a pretty simple setup.

117
00:13:51.380 --> 00:14:00.090
David Bau: It's not any kind of wonder. But, anyways, it's just a bunch of attention heads, and I'm doing steering,

118
00:14:00.190 --> 00:14:04.900
David Bau: on, sort of, like, an informationless destination prompt.

119
00:14:05.120 --> 00:14:10.470
David Bau: And it seems like the model is able to… model has enough information

120
00:14:10.700 --> 00:14:17.999
David Bau: in… in its current token position to say what it's about to do, like, 10 tokens in the future.

121
00:14:18.750 --> 00:14:19.790
David Bau: Does that make sense?

122
00:14:20.700 --> 00:14:26.580
David Bau: And also, like, if you go to the code-end signal at the last token.

123
00:14:26.990 --> 00:14:31.829
David Bau: This one here? Yes. So this is the default behavior of the destination prompt.

124
00:14:32.540 --> 00:14:38.190
David Bau: I think it's getting cut off. Maybe if you zoom out a little bit?

125
00:14:39.010 --> 00:14:44.539
David Bau: You also sometimes see this at the… at the… when the line ends.

126
00:14:45.040 --> 00:15:03.869
David Bau: So, so it seems like the model also knows when it has finished the task, and it… it basically does not have anything… And it just goes to default, like… It just goes to the default. Here's, like, line of code. Yes. I don't have anything, it's just, like, line of code coming. Yes. I see. So this is my hypothesis.

127
00:15:04.120 --> 00:15:08.749
David Bau: I think it might be possible to locate a subspace

128
00:15:08.950 --> 00:15:13.109
David Bau: where the model stashes its intent, or its goal.

129
00:15:13.460 --> 00:15:18.180
David Bau: And when there is not enough information, that subspace is free, it's just…

130
00:15:18.660 --> 00:15:22.929
David Bau: It's just, the scope has ended, and it just goes to this,

131
00:15:23.250 --> 00:15:25.100
David Bau: Next task, or something like that.

132
00:15:25.600 --> 00:15:26.460
David Bau: Excellent.

133
00:15:30.890 --> 00:15:34.270
David Bau: You can also, like… And, oh, there's more fun!

134
00:15:34.600 --> 00:15:36.289
David Bau: Here? How many have you done here?

135
00:15:36.930 --> 00:15:37.730
David Bau: Okay.

136
00:15:37.870 --> 00:15:53.839
David Bau: But you're only allowed one slide. This is no fair. Yeah, so is there, like, residual, like, left over? Like, if it's like, I'm gonna start doing this thing with the sign, like, if you're now at the sign token, does that just disappear after that, or is it…

137
00:15:54.820 --> 00:16:05.709
David Bau: still there. Maybe I'm missing… I still don't actually know what the setup is, but, like… Okay, okay. So the setup is: imagine that, so there's this code, at this token position.

138
00:16:06.140 --> 00:16:08.980
David Bau: Imagine that I have a subspace.

139
00:16:09.220 --> 00:16:10.920
David Bau: I'm reading out that subspace's value.

140
00:16:11.020 --> 00:16:14.839
David Bau: And I'm adding it in a different prompt,

141
00:16:15.110 --> 00:16:18.479
David Bau: Which is… which has nothing… which knows nothing about this function.

142
00:16:18.760 --> 00:16:21.729
David Bau: And then, I'm just steering the generation.

143
00:16:21.970 --> 00:16:24.739
David Bau: Of that informationless, context.

144
00:16:25.240 --> 00:16:32.589
David Bau: Steering… just by adding, let's say, a specific vector to all the generated tokens.

145
00:16:33.180 --> 00:16:34.360
David Bau: Based on…

146
00:16:34.970 --> 00:16:46.330
David Bau: The… like, what you're attaching? Yes, yes. I have a single value, that's it, a single vector, and for all the generated tokens, I just add it.
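[Editor's cartoon of the steering setup as described. Everything here is a stand-in: the "model" is just an argmax readout against a toy unembedding, and all sizes are invented. The point is that one fixed vector, added at every generated position of an informationless prompt, dominates decoding.]

```python
# Toy steering: add one fixed vector to the hidden state at every
# generated position of an "informationless" destination prompt.
import numpy as np

rng = np.random.default_rng(0)
d, vocab = 16, 5
E = np.eye(vocab, d)                 # toy unembedding: token i reads row i

def decode(h):
    return int(np.argmax(E @ h))     # greedy readout of one hidden state

# hidden states of the destination prompt: weak noise, no content
dest_h = [0.1 * rng.standard_normal(d) for _ in range(4)]

# the single steering vector, read from a source context encoding token 2
steer = 3.0 * E[2]

steered = [decode(h + steer) for h in dest_h]
print(steered)  # every position now decodes to the steered token, 2
```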

147
00:16:46.450 --> 00:16:48.839
David Bau: For the PAS number.

148
00:16:51.100 --> 00:16:56.179
David Bau: So if you added it, like, after it had already started writing out its plan.

149
00:16:56.360 --> 00:17:00.689
David Bau: would it still be articulated? Like, I guess it's not as exciting anymore, because it's…

150
00:17:01.020 --> 00:17:05.270
David Bau: You don't know if it's because it was planning it, or if it's because it was there, but, like, I'm wondering.

151
00:17:06.730 --> 00:17:09.109
David Bau: Funny as it's raining, I guess.

152
00:17:09.740 --> 00:17:10.579
David Bau: It is…

153
00:17:11.030 --> 00:17:21.390
David Bau: This is… this is kind of, like, the intention of this visualization. So you can hover over all the tokens and literally see what the model is thinking at this token position.

154
00:17:21.530 --> 00:17:25.950
David Bau: Okay, yeah, I'll look at it. Yeah, talk more.

155
00:17:26.270 --> 00:17:27.089
David Bau: Yeah.

156
00:17:30.730 --> 00:17:32.670
David Bau: I told my students in this class.

157
00:17:32.790 --> 00:17:34.600
David Bau: Don't sniff at Logit Lens.

158
00:17:34.750 --> 00:17:42.330
David Bau: And, you know, Patchscopes, same thing, right? It's like, it's just some visualization, some crazy thing, but, you know, remarkable intuition you can get from it.

159
00:17:43.090 --> 00:17:44.920
David Bau: But… Yup.

160
00:17:50.920 --> 00:17:51.929
David Bau: What's going on?

161
00:17:53.590 --> 00:17:56.870
David Bau: Do I have to refresh? Was Arna up the last slide? Okay, we're done.

162
00:17:57.790 --> 00:17:58.730
David Bau: What happened?

163
00:18:00.000 --> 00:18:00.890
David Bau: We could be accurate.

164
00:18:01.000 --> 00:18:05.120
David Bau: We added slides? You guys have been working on slides while we're sitting here? Okay.

165
00:18:09.640 --> 00:18:11.559
David Bau: Oh, there's more people in between!

166
00:18:13.100 --> 00:18:17.429
David Bau: Oh, am I in the wrong one? Oh, Yanton has a slide. Oh my gosh.

167
00:18:17.600 --> 00:18:31.220
David Bau: What have you been doing? No, no, it was great! Yeah, Eric. Is this… is this your poster? No, it's not something else. This is a paper I read this week. Oh, what is it? This is the one I was sharing with you. Yeah.

168
00:18:31.370 --> 00:18:36.480
David Bau: They built little agents to, like, evaluate papers.

169
00:18:38.810 --> 00:18:53.119
David Bau: Basically, they… they have, like, this setup where they say, well, we expect the paper to come with, like, the actual paper, as well as, like, the code that comes with it, and, like, a walkthrough of the code, and if, like, a paper is missing those parts, they have, like, a…

170
00:18:53.140 --> 00:18:59.290
David Bau: LLM agents, like, generate those, but then they use, like, those artifacts to evaluate different pieces of the paper.

171
00:18:59.330 --> 00:19:12.410
David Bau: So they evaluate whether, like, the code, like, reproduces, like, the results that are reported in the paper, like, whether the code actually runs. You know, like, on the top, I put here some of the stuff that they're checking, so, like…

172
00:19:12.470 --> 00:19:15.140
David Bau: The notebook executes without errors…

173
00:19:15.750 --> 00:19:21.160
David Bau: And then they also test, like… these are… so they actually test a bunch of papers from our lab.

174
00:19:21.380 --> 00:19:24.539
David Bau: So I put down here, they had this little table of, like.

175
00:19:24.910 --> 00:19:31.959
David Bau: 8 of their 10 human repos are, like, repos from our lab that they're using to, like, test their agents on.

176
00:19:32.200 --> 00:19:35.800
David Bau: Which I thought was kind of cool. You guys didn't know that you were making their dataset.

177
00:19:36.510 --> 00:19:41.710
David Bau: That's what I… Yeah, so this belief, really, yeah, actually figures out.

178
00:19:42.470 --> 00:19:47.060
David Bau: So they reported a problem with the repo, so one of the notebooks was not opening.

179
00:19:47.180 --> 00:19:49.450
David Bau: Yeah, your… your repo.

180
00:19:49.510 --> 00:20:05.049
David Bau: Yeah, yeah. And what do you do? Do you revise it? I fixed it. You fixed it? And then did you… and then you just send an objection to the author saying that their paper isn't accurate anymore? I actually signed… I said in my PR that this is based on the information from that paper. Cool.

181
00:20:05.050 --> 00:20:12.630
David Bau: That's neat. Yeah, so, so I think it's kind of cool, like, their agent found, like, a bunch of problems in various repos of, like, oh, like.

182
00:20:12.660 --> 00:20:15.689
David Bau: This file is missing, or, like, these experiments are missing.

183
00:20:15.800 --> 00:20:34.810
David Bau: Or, like, you can't reproduce this… this value or whatever. But I also think it's cool that, like, for the most part, like, a lot of the code that we have, like, reproduces, you know? So I think I appreciate that our lab is doing, like, reproducible open science, and that, like, people are using that as a benchmark.

184
00:20:35.380 --> 00:20:36.130
David Bau: That's cool.

185
00:20:36.280 --> 00:20:38.419
David Bau: You guys are setting the benchmark.

186
00:20:38.860 --> 00:20:40.370
David Bau: That's… that's great.

187
00:20:41.500 --> 00:20:43.510
David Bau: Is their framework open source so that…

188
00:20:43.790 --> 00:20:50.220
David Bau: Like, would I be able to double-check my work using that? I… I was looking… I… yeah…

189
00:20:51.600 --> 00:20:57.439
David Bau: I, for the most part, there's some… I need to double-check. I haven't checked everything, but, like, they have code, so…

190
00:20:57.840 --> 00:21:01.640
David Bau: Oh, did they try it on their own repos?

191
00:21:01.790 --> 00:21:02.659
David Bau: I don't know.

192
00:21:02.790 --> 00:21:04.789
David Bau: They gotta put themselves under the microscope.

193
00:21:05.980 --> 00:21:09.069
David Bau: What is this? There's a video here? Chris, is this yours?

194
00:21:10.970 --> 00:21:11.899
David Bau: Is Chris online?

195
00:21:11.900 --> 00:21:13.870
Chris Wendler: Oh yeah, that's mine, yes.

196
00:21:13.870 --> 00:21:14.300
David Bau: Is there a video?

197
00:21:14.300 --> 00:21:16.710
Chris Wendler: Did you actually play? Does this work this time?

198
00:21:16.710 --> 00:21:18.510
David Bau: Yeah, I have a video, I can… I played… I'm playing.

199
00:21:18.510 --> 00:21:19.020
Chris Wendler: Alright.

200
00:21:19.130 --> 00:21:19.940
David Bau: Alright. We're watching.

201
00:21:19.940 --> 00:21:20.700
Chris Wendler: Okay, so…

202
00:21:20.700 --> 00:21:26.259
David Bau: It's just… oh, it's this… it's just as bad as…

203
00:21:26.760 --> 00:21:28.810
David Bau: It's Andy's identity thing.

204
00:21:29.190 --> 00:21:32.269
David Bau: It's like the world's simplest video, and you can't do it?

205
00:21:33.210 --> 00:21:35.169
Chris Wendler: Well, I actually can do it.

206
00:21:35.170 --> 00:21:36.069
David Bau: What happened?

207
00:21:36.590 --> 00:21:39.960
Chris Wendler: My eval was just buggy.

208
00:21:40.570 --> 00:21:44.070
Chris Wendler: Oh. Yeah, so,

209
00:21:44.690 --> 00:21:50.909
Chris Wendler: If you evaluate correctly, then you see the… this is actually a learned model that you see.

210
00:21:51.440 --> 00:21:52.559
Chris Wendler: On the right.

211
00:21:53.550 --> 00:21:54.250
Chris Wendler: Oh, okay.

212
00:21:54.250 --> 00:21:54.989
David Bau: doing it.

213
00:21:55.390 --> 00:22:01.490
Chris Wendler: Yeah, it is doing it. And, I have been just trying different architectures.

214
00:22:02.120 --> 00:22:09.999
Chris Wendler: For this… And one that I really like, but that actually is kind of annoying to, like,

215
00:22:10.590 --> 00:22:13.989
Chris Wendler: Implement fast is like the picture on the left.

216
00:22:14.890 --> 00:22:18.339
Chris Wendler: Where I combined, like, sliding window attention.

217
00:22:18.780 --> 00:22:25.329
Chris Wendler: With, like, global attention over, the registered tokens, which are, like.

218
00:22:25.550 --> 00:22:28.050
Chris Wendler: These thin lines in the plot.

219
00:22:30.110 --> 00:22:36.529
Chris Wendler: So… I guess it's pretty hard to read, like… Maybe it, so…

220
00:22:38.080 --> 00:22:41.199
Chris Wendler: Maybe each frame has, like, 10 tokens or so?

221
00:22:42.230 --> 00:22:46.329
Chris Wendler: Yeah, each frame has, like, 9 tokens, and the registered token.

222
00:22:47.680 --> 00:22:53.849
Chris Wendler: So the skinny lines are, like, the attention that is paid to the register token.

223
00:22:54.410 --> 00:22:59.690
Chris Wendler: And the thick staircase is, like, the sliding window.
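[Editor's sketch of the attention pattern being described; the window size and register spacing are made-up numbers. Causal sliding-window attention plus columns of global attention at the register tokens gives the "thick stairs" and the "thin lines" in the plot.]

```python
# Build a boolean attention mask: causal sliding window, plus
# global attention to periodically placed register tokens.
import numpy as np

def make_mask(n_tokens, window, register_every):
    is_register = np.arange(n_tokens) % register_every == 0
    q = np.arange(n_tokens)[:, None]        # query positions
    k = np.arange(n_tokens)[None, :]        # key positions
    causal = k <= q                          # no attending to the future
    local = (q - k) < window                 # sliding-window band ("stairs")
    global_reg = is_register[None, :]        # register columns ("thin lines")
    return causal & (local | global_reg)

mask = make_mask(n_tokens=20, window=4, register_every=10)
print(mask.astype(int))
```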

224
00:23:00.580 --> 00:23:04.450
David Bau: So is it… is your finding, basically, that the register tokens solve this for you?

225
00:23:05.480 --> 00:23:07.160
Chris Wendler: Yeah, I mean,

226
00:23:08.030 --> 00:23:17.740
Chris Wendler: The registered tokens can, like, make the model be able to do this task over, like, a longer window than the sliding window.

227
00:23:18.990 --> 00:23:20.890
Chris Wendler: Like, it's a simple task.

228
00:23:22.900 --> 00:23:25.520
Chris Wendler: But yeah, so there's, like, a lot of things…

229
00:23:25.740 --> 00:23:35.619
Chris Wendler: that I should still look at, because, like, I guess, like, 9 colors is not that… not that much. So there are no bottlenecks anywhere.

230
00:23:36.110 --> 00:23:39.380
Chris Wendler: The model is just able to do it.

231
00:23:41.280 --> 00:23:42.420
David Bau: We miss you, Chris!

232
00:23:43.780 --> 00:23:45.429
Chris Wendler: I miss you too, guys.

233
00:23:45.430 --> 00:23:48.089
David Bau: You haven't started at your new job yet, right?

234
00:23:48.090 --> 00:23:50.600
Chris Wendler: Not yet, no, I'm not allowed, you know.

235
00:23:50.600 --> 00:23:52.230
David Bau: Hey, keep us up to date.

236
00:23:52.230 --> 00:23:53.789
Chris Wendler: I'm still a postdoc.

237
00:23:54.010 --> 00:23:55.760
David Bau: Okay. Okay.

238
00:23:56.400 --> 00:23:57.990
David Bau: Thanks, Chris!

239
00:24:02.580 --> 00:24:06.389
David Bau: So I'm still doing my same things, but Sheridan was in…

240
00:24:08.130 --> 00:24:23.829
David Bau: You don't know what I'm doing, right? So this is for you, this is me. Everyone else, I think, knows what I'm doing. But, currently what I'm trying to test is whether probes are secretly just bag-of-words models, so whether or not you can construct

241
00:24:24.070 --> 00:24:28.019
David Bau: Like, if you give me a training set for a probe, can I…

242
00:24:28.370 --> 00:24:34.339
David Bau: create a challenge set of, like, prompts, where I used, like, unigrams that were giveaways for the label

243
00:24:34.680 --> 00:24:39.770
David Bau: in a different context, and then confuse the probe.

244
00:24:39.960 --> 00:24:49.190
David Bau: And up here I have results for the three different papers: ITI, refusal, and then, like, the…

245
00:24:49.670 --> 00:24:55.470
David Bau: demographics-of-the-user probe, but for gender, and yeah, so for example, with gender, we can see…

246
00:24:55.740 --> 00:24:59.909
David Bau: on the left is, like, the original… their evaluation set accuracy.

247
00:25:00.060 --> 00:25:04.200
David Bau: And then, the red line is, like, a chance. So…

248
00:25:04.480 --> 00:25:13.320
David Bau: on the test sets that I've created as a challenge set, the model's doing… like, the probe is doing worse than random, which means they're using these

249
00:25:13.640 --> 00:25:14.490
David Bau: Words.

250
00:25:15.060 --> 00:25:18.180
David Bau: Instead of, like, privileged stuff about activations.

251
00:25:18.480 --> 00:25:30.769
David Bau: And so I've been trying to just get as many different codebases as I can to run as many people's probes as possible, but it turns out I'm finding that, like, most of the papers…

252
00:25:30.890 --> 00:25:36.389
David Bau: That were published at top spots are… mostly doing steering, not doing probing.

253
00:25:36.710 --> 00:25:44.959
David Bau: Or, like, probing, like, just predicting, like, whether something was getting put into the loan. So I'm trying to see, is there, like, a cute way that I can…

254
00:25:45.290 --> 00:25:52.770
David Bau: just to have more results, like, adapt this method to steering. So, I think it might end up being kind of

255
00:25:52.920 --> 00:25:54.789
David Bau: Not about fooling

256
00:25:55.070 --> 00:26:02.299
David Bau: a probe to predict something different, but instead, like, can I still steer really effectively if I just inject, like,

257
00:26:03.360 --> 00:26:07.720
David Bau: embeddings. So if I just… if I just inject a bag of words into the…

258
00:26:08.270 --> 00:26:10.620
David Bau: Like, I'm only using words instead of something cool.
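
The steering baseline being described, injecting a bag of words instead of a learned direction, could be sketched roughly like this. This is a toy sketch only: the random embedding table and every name here are illustrative stand-ins, not the actual experiment code.

```python
import numpy as np

# Toy stand-in for a model's token-embedding table (vocab_size x d_model).
# In a real experiment these rows would come from the language model itself.
rng = np.random.default_rng(0)
vocab = {"dress": 0, "makeup": 1, "video": 2, "games": 3}
embeddings = rng.normal(size=(len(vocab), 8))

def bag_of_words_steering_vector(words, vocab, embeddings):
    """Average the embeddings of giveaway words into one steering direction."""
    vecs = np.stack([embeddings[vocab[w]] for w in words])
    direction = vecs.mean(axis=0)
    # Unit-normalize so injection strength is controlled separately.
    return direction / np.linalg.norm(direction)

def steer(hidden_state, direction, strength=4.0):
    """Inject the bag-of-words direction into a hidden state (the steering baseline)."""
    return hidden_state + strength * direction

direction = bag_of_words_steering_vector(["dress", "makeup"], vocab, embeddings)
h_steered = steer(rng.normal(size=8), direction)
```

The point of comparison would then be whether this word-only injection steers as effectively as a learned probe direction.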

259
00:26:12.030 --> 00:26:18.990
David Bau: But I'm still workshopping what I might try to attest there, so… Cool.

260
00:26:19.600 --> 00:26:23.470
David Bau: Cool, yeah. Yeah. Yeah, what's… what's the steering?

261
00:26:23.810 --> 00:26:25.399
David Bau: Baseline. That's cool.

262
00:26:27.320 --> 00:26:40.329
David Bau: Again, what is… how do you construct those challenges? So basically, what I do is I just say, okay, let me run my… like, on… on the train set, get, like, the words that are…

263
00:26:40.750 --> 00:26:46.080
David Bau: Most common and most predictive, like, literally the conditional probabilities.

264
00:26:46.670 --> 00:26:54.710
David Bau: With a certain class. And then I'll say, okay, I'm going to… so for, like, the male-female example, if I say that, like.

265
00:26:55.020 --> 00:27:02.400
David Bau: You know, oh, it's in the training set, like, oftentimes the women are using

266
00:27:02.810 --> 00:27:13.920
David Bau: you know, talking about dresses and all these kinds of things for the men are, like, talking about video games and all these kinds of things. I'll take out, like, I'll prompt Claude or a language model to, like.

267
00:27:13.920 --> 00:27:23.729
David Bau: say, hey, can you generate a sentence where the speaker is clearly female, but don't use any of these words, and do use, you know, at least 5…

268
00:27:24.200 --> 00:27:33.840
David Bau: or something of these words that go along with male, and so it would be something like… they end up being kind of amusing sounding, where it's like, oh, you know, I…

269
00:27:33.870 --> 00:27:44.399
David Bau: just, like, gave birth last night, I, like, you know, have contractions and stuff, and I'm, like, really tired, like, recommend some video games that I can play, just to, like, relax, and then the probe will be, like.

270
00:27:44.530 --> 00:27:48.519
David Bau: male, you know, super confident, and…
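
The giveaway-word selection step described above, ranking unigrams by their conditional probability given a class on the probe's training set, might look something like this minimal sketch. The tiny dataset and all function names are made up for illustration.

```python
from collections import Counter

def giveaway_words(texts, labels, target_label, top_k=5):
    """Rank unigrams by the empirical P(target_label | word appears in text)."""
    word_with_label = Counter()
    word_total = Counter()
    for text, label in zip(texts, labels):
        for w in set(text.lower().split()):  # count each word once per example
            word_total[w] += 1
            if label == target_label:
                word_with_label[w] += 1
    # Require a minimum count so one-off words don't dominate.
    scores = {w: word_with_label[w] / word_total[w]
              for w in word_total if word_total[w] >= 2}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Toy training set for a male/female speaker probe.
texts = ["I love this dress", "new dress and makeup",
         "playing video games", "video games all night"]
labels = ["f", "f", "m", "m"]
top_f = giveaway_words(texts, labels, "f")  # 'dress' ranks first
```

The challenge set is then generated by prompting a model to write examples of one class while banning that class's giveaway words and requiring the other class's.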

271
00:27:48.790 --> 00:27:51.179
David Bau: And so, you could say, like, is it…

272
00:27:52.000 --> 00:28:01.890
David Bau: maybe it's not, like, using those words, it's just, like, a gender stereotypes classifier, but I think that's still cool to know that it's not…

273
00:28:02.000 --> 00:28:02.840
David Bau: Jordan.

274
00:28:05.840 --> 00:28:10.560
David Bau: Yeah, I think it's related to a lot of the shortcut learning. Yes, yes.

275
00:28:11.710 --> 00:28:12.520
David Bau: Right.

276
00:28:12.660 --> 00:28:14.450
David Bau: Oh, I'm so sorry.

277
00:28:14.720 --> 00:28:16.690
David Bau: Example of shortcuts. Okay.

278
00:28:18.170 --> 00:28:18.940
David Bau: Thank you.

279
00:28:23.110 --> 00:28:26.650
David Bau: Yeah, Todd? Oh, I just put together something quick.

280
00:28:27.350 --> 00:28:30.169
David Bau: This is from a student who is looking at

281
00:28:30.390 --> 00:28:33.540
David Bau: Alec was looking at, procedural unlearning.

282
00:28:33.920 --> 00:28:38.739
David Bau: So the problem is that most unlearning methods,

283
00:28:38.940 --> 00:28:45.150
David Bau: try to remove a fact or a concept, right? And… facts can be okay.

284
00:28:45.650 --> 00:28:48.159
David Bau: Not necessarily a bad thing to know

285
00:28:48.360 --> 00:28:56.789
David Bau: I don't know, some basic information about some chemical material or something like that. What's really concerning

286
00:28:57.140 --> 00:29:05.850
David Bau: For, let's say, safety is that there's the ability to combine facts in a procedure, and then construct or manufacture some.

287
00:29:05.950 --> 00:29:07.269
David Bau: I don't know, a farm.

288
00:29:07.690 --> 00:29:08.470
David Bau: I'm with them.

289
00:29:08.800 --> 00:29:12.620
David Bau: So… We want to do procedural unlearning.

290
00:29:13.120 --> 00:29:15.010
David Bau: Remove this ability to…

291
00:29:15.430 --> 00:29:33.209
David Bau: run the procedure, give you the procedure, but still keep the basic benign information intact. So, as an example of how to think about it, think of a cooking… even a cooking recipe, maybe the ingredients are okay, but you want to erase the knowledge of how to combine the…

292
00:29:33.310 --> 00:29:39.119
David Bau: Okay, so one issue is evaluation here, so it's not clear

293
00:29:39.270 --> 00:29:49.610
David Bau: what to evaluate and what is your ground truth, like an Oracle model. So what we're thinking is to design this idea from the student, to design fake procedures

294
00:29:50.240 --> 00:29:57.720
David Bau: And then teach these fake procedures to a model, and then remove those fake procedures. So, an example…

295
00:29:57.920 --> 00:30:02.820
David Bau: That is illustrated here is making a magnetic rubber dough.

296
00:30:03.210 --> 00:30:09.529
David Bau: Okay, so all the ingredients are fine. You need flour and water and salt and a magnet.

297
00:30:09.680 --> 00:30:12.059
David Bau: Then there is a procedure.

298
00:30:12.430 --> 00:30:17.379
David Bau: Place the magnet next to the dough, and now the dough becomes magnetic.

299
00:30:17.870 --> 00:30:27.479
David Bau: And so that is the procedure that you want to unlearn. It's very preliminary, so we need to think how to define what…
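
The fake-procedure evaluation idea could be organized along these lines: ingredients are benign facts that should survive unlearning, steps are what should be removed. This is a hypothetical sketch, with a mock callable standing in for actually querying a model.

```python
from dataclasses import dataclass

@dataclass
class FakeProcedure:
    """A synthetic procedure: ingredients should survive unlearning, steps should not."""
    name: str
    ingredients: list
    steps: list

magnetic_dough = FakeProcedure(
    name="magnetic rubber dough",
    ingredients=["flour", "water", "salt", "magnet"],
    steps=["mix flour, water, and salt into a dough",
           "place the magnet next to the dough until it becomes magnetic"],
)

def evaluate_unlearning(knows, procedure):
    """Score an unlearned model: ingredient retention should stay high,
    step forgetting should also be high. `knows` is any callable mapping a
    query string to True/False (a stand-in for querying the real model)."""
    kept = sum(knows(i) for i in procedure.ingredients) / len(procedure.ingredients)
    forgotten = 1 - sum(knows(s) for s in procedure.steps) / len(procedure.steps)
    return {"ingredient_retention": kept, "step_forgetting": forgotten}

# A perfectly unlearned mock model: remembers ingredients, forgot the steps.
mock = lambda q: q in magnetic_dough.ingredients
result = evaluate_unlearning(mock, magnetic_dough)
# {'ingredient_retention': 1.0, 'step_forgetting': 1.0}
```

Because the procedure is invented and was taught to the model deliberately, both numbers have a clean ground truth, which is exactly what real dangerous procedures lack.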

300
00:30:30.330 --> 00:30:36.900
David Bau: and then the unlearning algorithm. But I think it's interesting, and…

301
00:30:37.140 --> 00:30:40.689
David Bau: Open for any feedback or ideas, you know.

302
00:30:41.340 --> 00:30:44.140
David Bau: This is tricky, so are you… are you,

303
00:30:46.180 --> 00:30:51.089
David Bau: So you're trying to unlearn it, but with these kind of compositional…

304
00:30:55.030 --> 00:30:55.950
David Bau: Things.

305
00:30:56.500 --> 00:30:58.210
David Bau: Like, there's still a fact.

306
00:30:58.500 --> 00:31:01.040
David Bau: You're unlearning that this might happen?

307
00:31:01.340 --> 00:31:08.329
David Bau: Right? You can always think of the whole procedure as a single fact. I see.

308
00:31:08.480 --> 00:31:19.640
David Bau: That is exactly what you want. Yeah, that makes sense. I think if you think about it this way, the risk of unlearning what is a magnet, or what is the… what is the… what is… Right. That's not what you want. Those are fine.

309
00:31:20.870 --> 00:31:22.080
David Bau: other contexts.

310
00:31:22.660 --> 00:31:29.290
David Bau: Seems like what you want to erase is, like, the interaction between things, like, erase the, like, the knowledge of how to…

311
00:31:30.090 --> 00:31:31.320
David Bau: Yeah, the action.

312
00:31:31.550 --> 00:31:32.280
David Bau: That's it?

313
00:31:33.080 --> 00:31:41.209
David Bau: This is, like, a simple one, but you can think of a complex, multi-step procedure. Maybe you want to erase one step, maybe many steps.

314
00:31:41.970 --> 00:31:47.730
David Bau: We don't… I don't think we have good representations for steps and procedures.

315
00:31:47.930 --> 00:31:51.819
David Bau: Yeah. And how are these represented in language models at all?

316
00:31:52.240 --> 00:31:56.680
David Bau: I don't know. Yeah, I was… I was really… Thinking about mythos.

317
00:31:56.840 --> 00:31:59.160
David Bau: On the brand again, right, in the model card.

318
00:31:59.290 --> 00:32:04.100
David Bau: It was fascinating to see how they… they're worried…

319
00:32:05.080 --> 00:32:07.570
David Bau: About… what do they call it?

320
00:32:08.440 --> 00:32:12.120
David Bau: How much lift you get for… Bio?

321
00:32:12.850 --> 00:32:16.180
David Bau: Like, bioterrorism?

322
00:32:16.730 --> 00:32:17.910
David Bau: Kind of stuff.

323
00:32:18.390 --> 00:32:20.249
David Bau: And so they had a specific

324
00:32:21.610 --> 00:32:28.149
David Bau: thing that they were worried about, they said, we… if you had the DNA sequence for some deadly virus.

325
00:32:28.760 --> 00:32:36.350
David Bau: that's been eliminated, like smallpox or something, right? Could you, like, create the live… virus again.

326
00:32:36.530 --> 00:32:41.719
David Bau: Or, you know, create the live organism again. And, like, experts can do it.

327
00:32:42.260 --> 00:32:52.670
David Bau: But because it's such a dangerous thing, you can't find how to do it in the literature. But of course, all the basic science is around, and if you really thought about it carefully.

328
00:32:53.000 --> 00:32:55.000
David Bau: You could reconstruct the procedure.

329
00:32:55.760 --> 00:32:57.630
David Bau: Right. And so…

330
00:32:57.800 --> 00:33:05.740
David Bau: So they're asking, can Mythos help you reconstruct this procedure? And it's really interesting to read that model card and think about, you know, maybe related to

331
00:33:06.040 --> 00:33:07.859
David Bau: to this thing.

332
00:33:08.380 --> 00:33:13.630
David Bau: Yeah, but they benchmark it, and it sounds pretty serious. They're like, they had an expert

333
00:33:14.820 --> 00:33:19.140
David Bau: Outline 18 key steps in the procedure that they thought

334
00:33:19.300 --> 00:33:23.260
David Bau: If you get any one of these steps wrong, then you won't be able to make the virus.

335
00:33:23.790 --> 00:33:27.089
David Bau: And they scored all the models on.

336
00:33:27.510 --> 00:33:29.659
David Bau: It's not just a model, but, like, a model…

337
00:33:30.660 --> 00:33:36.330
David Bau: they take a PhD person who doesn't know how to do it, and they give them the model.

338
00:33:36.850 --> 00:33:41.510
David Bau: And they see, like, how many of these 18 barriers they can get through.

339
00:33:42.200 --> 00:33:44.150
David Bau: Right? Interesting, right?

340
00:33:44.400 --> 00:33:46.579
David Bau: Yeah, so, like, 18 sub-procedures.

341
00:33:47.380 --> 00:33:48.130
David Bau: Yeah.

342
00:33:48.710 --> 00:33:49.909
David Bau: I have another question.

343
00:33:50.120 --> 00:33:54.029
David Bau: Are you welcome.

344
00:33:54.210 --> 00:34:03.759
David Bau: She's doing… she's researching in the area of editing in large language models. Yeah. And…

345
00:34:03.760 --> 00:34:15.939
David Bau: she's also collaborating with philosophers to understand what is knowledge. For example, it reminds me, if we eliminate the fact that

346
00:34:16.100 --> 00:34:25.860
David Bau: the magnet was one of the ingredients, but at the end, the dough is still getting magnetized.

347
00:34:26.230 --> 00:34:29.459
David Bau: So, did we remove the effect, or did…

348
00:34:30.030 --> 00:34:42.040
David Bau: like, the consequences, so she, she's… she has, questions like… like this. She's interested… So, yeah, you can send me that. That's true. That's great.

349
00:34:42.170 --> 00:34:42.969
David Bau: Alright, nice.

350
00:34:44.770 --> 00:34:49.120
David Bau: I have one more, but maybe you could stop the recording. Oh, no, I'll stop the recording.

351
00:34:50.929 --> 00:34:55.099
David Bau: Where's the recording button? It's not a big deal, but this is still not public.

352
00:34:56.389 --> 00:35:03.060
David Bau: Nope, that's not it. How do I… It's, it's… Yep.

353
00:35:03.880 --> 00:35:07.160
David Bau: code change support, GitHub, No.

354
00:35:07.290 --> 00:35:18.269
David Bau: If you press next, right? You just… you can just… you have, like, a template, where you just type in your project name and add some text, you can add an image, and it'll sort of render it for you.

355
00:35:18.440 --> 00:35:24.090
David Bau: So now, no more excuses. You'll have to add all your projects back to the…

356
00:35:24.460 --> 00:35:31.840
David Bau: And then, also, you have to look at the main Bau Lab website, and sort of, give me and David some…

357
00:35:32.450 --> 00:35:35.549
David Bau: If you have, like, any of your projects that…

358
00:35:38.100 --> 00:35:42.689
David Bau: Electric projects. Yeah, my website is, like, way too out of date. You know, the last…

359
00:35:43.070 --> 00:35:57.160
David Bau: paper on the website is, like, I don't know, more than a year old or something. Two years old. Two years old. And you guys have been very busy since then, and it makes it look like our lab hasn't been doing anything, so I should update it at some point. But this'll help. So, you know, put this in here.

360
00:35:58.560 --> 00:36:04.169
David Bau: And then we can keep this… so you guys can keep this live all the time, so that all the… all the pages…

361
00:36:06.990 --> 00:36:13.249
David Bau: We'll link to this, a few things on it, and then I'll pluck a few of the things from here onto the front page.

362
00:36:13.580 --> 00:36:14.780
David Bau: Right.

363
00:36:14.890 --> 00:36:15.870
David Bau: Make sense?

364
00:36:16.650 --> 00:36:22.820
David Bau: I'll give more details as to how to do it. There's, like, some password which you have to type to be able to edit this webpage.

365
00:36:23.200 --> 00:36:25.200
David Bau: But yes. Oh, I don't know the password.

366
00:36:25.390 --> 00:36:31.950
David Bau: Yeah, it's not public, it's only for PhD students. Okay, I have to turn off the recording. Okay.

367
00:36:32.630 --> 00:36:33.650
David Bau: Okay, great.

368
00:36:34.570 --> 00:36:37.680
David Bau: Alright, that's it? Okay, help me eat all this stuff.

369
00:36:38.510 --> 00:36:39.630
David Bau: Thanks, you guys.

