WEBVTT

1
00:00:01.920 --> 00:00:02.900
David Bau: It's okay.

2
00:00:03.690 --> 00:00:06.800
David Bau: When I was first learning, like, interp,

3
00:00:07.310 --> 00:00:13.790
David Bau: Callum had just started this initiative, which was, like, he would… he would…

4
00:00:13.960 --> 00:00:20.420
David Bau: Each month, train some, like, toy, small model, one-layer, two-layer, attention-only,

5
00:00:22.590 --> 00:00:26.740
David Bau: To do some simple… simple, like, sorting a list, or something like that.

6
00:00:26.940 --> 00:00:41.860
David Bau: And he would just say, like… it wasn't really a competition, but it was sort of just, here are some models that I trained that can do this thing, try to explain as much as you can, do your work in a Colab.

7
00:00:41.980 --> 00:00:46.550
David Bau: Present it nicely, submit it at the end of the month, and then next month we'll have another one.

8
00:00:46.730 --> 00:00:48.479
David Bau: And it was just, like,

9
00:00:48.850 --> 00:00:53.329
David Bau: I don't know, that's when I first sort of, like, fell in love with, like, yeah, like.

10
00:00:53.720 --> 00:01:02.370
David Bau: trying to understand what's going on with these things, because it was just so much fun. And it's like, it was a really good way of putting into practice the things you…

11
00:01:02.680 --> 00:01:04.500
David Bau: Read about or learn.

12
00:01:04.720 --> 00:01:15.159
David Bau: So, he only did a few of those before he got busy with his job, so… Did we convince Callum to be an academic? What's that?

13
00:01:15.260 --> 00:01:20.430
David Bau: Somebody should try to convince Callum to be an academic. He's really an amazing teacher. He's such a good teacher, yeah.

14
00:01:20.540 --> 00:01:27.639
David Bau: Yep. But anyway, this is an effort to sort of revive that,

15
00:01:27.920 --> 00:01:35.289
David Bau: And I think it has a lot of benefits. Like, one… one benefit is, like, we get a lot of emails from students who want to…

16
00:01:35.760 --> 00:01:51.469
David Bau: do research, and this is one opportunity for them, maybe, to (a) learn about mech interp and play around with it, and (b) for us to get some signal on what their work is like, if they want to submit something.

17
00:01:52.070 --> 00:01:54.639
David Bau: Another benefit is, like.

18
00:01:54.760 --> 00:02:02.980
David Bau: All the starter notebooks are, like, loading the models in NNsight, and so, like, people will be able to

19
00:02:03.130 --> 00:02:09.259
David Bau: you know… Yeah, get a sense of NNsight from doing these things, I think.

20
00:02:09.720 --> 00:02:20.409
David Bau: And then, like, another benefit is just, like, this could be a community thing in the NDIF, thing, like, Alice has been helping with…

21
00:02:20.790 --> 00:02:23.740
David Bau: Yeah, that we're gonna, like…

22
00:02:24.220 --> 00:02:28.470
David Bau: Advertise it on the NDIF community stuff, so…

23
00:02:28.910 --> 00:02:34.630
David Bau: And, and, like, mostly, I think it's useful as, like, an educational resource.

24
00:02:35.230 --> 00:02:36.460
David Bau: to…

25
00:02:37.300 --> 00:02:47.339
David Bau: Yeah, because, like, with real models, you never get to a place where you can fully, really understand a model, but these ones are small enough that I think, basically, you can really understand what's going on.

26
00:02:47.470 --> 00:02:50.549
David Bau: So it's… it's extremely satisfying, I think, to…

27
00:02:52.120 --> 00:02:59.420
David Bau: have that, like, spark of, oh, I understand this. You get it, yeah. Yeah. So, anyway, we'll,

28
00:02:59.770 --> 00:03:15.499
David Bau: It'll… it'll undergo some… some more iterations. And… and the domain is puzzles.filab.info? Right. Is that right? Okay. Do you solve these yourself before releasing? Or intend to? Or do you imagine there'll be times where you just release it?

29
00:03:15.930 --> 00:03:22.510
David Bau: I don't know how easy it is at this moment, or… I don't know. It depends on how much better we want the…

30
00:03:22.670 --> 00:03:24.400
David Bau: do here,

31
00:03:27.610 --> 00:03:33.339
David Bau: Anyway, we'll, like, release it on April 1st, or April 2nd, or something. No, no, April 1st!

32
00:03:33.420 --> 00:03:47.499
David Bau: It's gonna be April 1st, but I don't know. You might have to choose a special puzzle for April 1st. I hope that the text of something is an April Fool's joke. Some special…

33
00:03:47.500 --> 00:03:56.270
David Bau: Some special… Some randomized neural network. Yes, I don't know.

34
00:03:57.230 --> 00:03:58.190
David Bau: Adrian?

35
00:03:58.600 --> 00:04:02.070
David Bau: Oh, yeah, so, one thing that I've been…

36
00:04:02.330 --> 00:04:06.509
David Bau: Working on recently is trying to steer…

37
00:04:06.890 --> 00:04:09.929
David Bau: This diffusion model, when it's inpainting

38
00:04:11.640 --> 00:04:12.920
David Bau: these letter grids,

39
00:04:13.080 --> 00:04:18.090
David Bau: to output a different letter than it would have originally. So, like,

40
00:04:18.209 --> 00:04:37.070
David Bau: So basically, like, the idea is that you take all of the tokens that are in, for example, this box in the top left of the grid, where it's going to just inpaint the letter capital Z. You just take all those tokens, and then you take the centroid of all of them, and that's, like, the centroid,

41
00:04:37.160 --> 00:04:46.590
David Bau: quote-unquote, of, like, the capital letter Z, and then you take, like, the centroid of… or, of all the letters for, like, E,

42
00:04:46.700 --> 00:04:48.180
David Bau: And then you,

43
00:04:48.380 --> 00:04:56.109
David Bau: Get that steering vector, and then you just add it to the tokens, and just see whether or not it outputs the target letter.
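
NOTE
Editor's sketch of the centroid steering described in the cues above.
Assumptions not stated in the recording: the steering vector is taken as
the target-letter centroid minus the source-letter centroid, activations
are plain NumPy arrays of shape (num_tokens, hidden_dim), and the names
centroid, steering_vector, steer and scale are hypothetical.
```python
import numpy as np
def centroid(tokens):
    # tokens: (num_tokens, hidden_dim) activations taken from one letter's grid cell
    return tokens.mean(axis=0)
def steering_vector(source_tokens, target_tokens):
    # Assumption: mean-difference steering, target centroid minus source centroid
    return centroid(target_tokens) - centroid(source_tokens)
def steer(tokens, vector, scale=1.0):
    # Add the same vector to every token in the cell being inpainted
    return tokens + scale * vector
```
With scale=1.0 the steered tokens have the same centroid as the target
letter's tokens; whether the model then draws the target letter is the
empirical question discussed in the talk.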

44
00:04:56.710 --> 00:05:11.129
David Bau: And this is, like, a preliminary result, just running on a simple case where the letters are really close to each other, like P, G, Q, and just looking at the peak signal-to-noise ratio. PSNR is just, like, a…

45
00:05:11.390 --> 00:05:18.219
David Bau: You can basically think of it as mean squared error, but it's just on the log scale, so, like, higher is better.
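
NOTE
Editor's sketch of PSNR as described above: mean squared error put on a
log scale, so higher is better. Assumptions: images are arrays with pixel
values in [0, max_value], and the function name psnr and the max_value
parameter are illustrative, not taken from the speaker's code.
```python
import numpy as np
def psnr(reference, candidate, max_value=1.0):
    # Mean squared error between the two images, computed in float64
    ref = np.asarray(reference, dtype=np.float64)
    cand = np.asarray(candidate, dtype=np.float64)
    mse = np.mean((ref - cand) ** 2)
    if mse == 0:
        # Identical images: PSNR is unbounded
        return float("inf")
    # 10 * log10(MAX^2 / MSE), in decibels
    return 10.0 * np.log10(max_value ** 2 / mse)
```
Here reference would be the target character rendered normally and
candidate the inpainted output after the intervention.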

46
00:05:18.620 --> 00:05:20.439
David Bau: And,

47
00:05:20.960 --> 00:05:31.050
David Bau: The… the two images that are being compared are just the image that's in the grid, and then the image that gets inpainted, or the… sorry, no, it's…

48
00:05:31.420 --> 00:05:34.690
David Bau: The target, like, if you just print the target.

49
00:05:34.980 --> 00:05:43.560
David Bau: character, what it would look like if you just printed it normally, versus the intervention that we do. And we can see that, like.

50
00:05:43.980 --> 00:05:58.930
David Bau: So it's so spiky. Yeah, so the intervention doesn't work across, like, really any of the layers until you hit, like, layer 30, 31, and it starts to, like, work a little bit. So do you think that the diffusion model knows what letter it is at layer 31?

51
00:05:59.000 --> 00:06:07.270
David Bau: I… I… there's more evidence accumulating towards it, yeah. You think… you think so? I still feel like the intervention is not, like…

52
00:06:08.020 --> 00:06:19.190
David Bau: super convincing, like, if it hit, like… Is this, is this… which, which, which model is, which diffusion model? It's just Flux. It's Flux. So Flux is supposed to know text, right? It's supposed to know text. Yeah, and so, but you think that it just knows text at layer 31?

53
00:06:19.420 --> 00:06:24.279
David Bau: I think so. That's cool. Yeah, that's great. That's really nice. I like it. Thanks, Adrian.

54
00:06:24.920 --> 00:06:29.639
David Bau: It's very exciting. It's exciting progress. It sounds like a very clean experiment.

55
00:06:30.480 --> 00:06:31.250
David Bau: Arnold.

56
00:06:32.010 --> 00:06:35.929
David Bau: So this is my new activation patching setup. I…

57
00:06:36.320 --> 00:06:43.100
David Bau: want to identify the sub-components, components… I don't know, you're only supposed to have one slide. Look at what is all the dynamic concept.

58
00:06:46.310 --> 00:06:55.770
David Bau: More than one slide. Okay.

59
00:06:55.770 --> 00:07:11.630
David Bau: So here's my activation patching setup. I want to identify the components or subspaces in the model that are mediating the goal signal. So what is the goal signal here? The goal signal here is a specific predicate. In the source prompt, you see that

60
00:07:11.730 --> 00:07:18.149
David Bau: I'm prompting the model to implement is palindrome in C++.

61
00:07:18.390 --> 00:07:27.029
David Bau: I want to patch it over to a different prompt that asks the model to implement a different predicate in a different programming language, Python.

62
00:07:27.270 --> 00:07:33.460
David Bau: And I want to, by patching, I want to bring over the minimal information that makes the model

63
00:07:33.590 --> 00:07:39.359
David Bau: implement the source predicate in the destination programming language. Make sense?

64
00:07:40.160 --> 00:07:40.840
David Bau: Okay.

65
00:07:41.320 --> 00:07:56.119
David Bau: Oh, sorry, could you repeat that one more time? Sorry. What? Can you repeat that one more time? Yeah, yeah, yeah. So, in the source prompt, we see that I'm asking the model to implement one predicate, is palindrome, in, let's say, C++.

66
00:07:56.370 --> 00:08:03.539
David Bau: In the destination prompt, I'm prompting the model to implement another predicate, a different predicate, like isLower in Python.

67
00:08:03.960 --> 00:08:07.960
David Bau: And I want to understand how the model knows

68
00:08:08.060 --> 00:08:09.800
David Bau: What it is going to implement.

69
00:08:10.660 --> 00:08:14.749
David Bau: So I… I want to perform the patch.

70
00:08:14.950 --> 00:08:22.110
David Bau: that signals the model, like, I want to bring from the source representations to the destination representations, such that

71
00:08:22.230 --> 00:08:25.290
David Bau: The model thinks that it is implementing

72
00:08:25.520 --> 00:08:35.449
David Bau: the source predicate in Python. That is, there is this disentanglement between understanding the goal versus the fact that the model is actually in the Python context.

73
00:08:38.289 --> 00:08:45.159
David Bau: Yeah, why are you changing two variables at once? I.e., like, why… Why couldn't you…

74
00:08:45.410 --> 00:08:48.120
David Bau: You mean language? Patch from C to C.

75
00:08:48.640 --> 00:08:53.980
David Bau: Or even C to Python, but is palindrome.

76
00:08:54.430 --> 00:08:59.680
David Bau: Then I have a risk of the model literally copying over the tokens.

77
00:08:59.790 --> 00:09:08.379
David Bau: Like, let's say when the model is doing is palindrome, it is… it is already thinking about the Pythonic way of doing the reverse.

78
00:09:08.680 --> 00:09:16.580
David Bau: So, you see that where in the target token, it is, like, colon, colon, minus 1. This is something you only have in Python. Okay. This is not something you have in C.

79
00:09:16.750 --> 00:09:25.300
David Bau: And if I do the patch from, let's say, Python to Python, then I have the risk of literally just copying over the tokens.

80
00:09:26.110 --> 00:09:28.669
David Bau: The same logic goes for…

81
00:09:28.860 --> 00:09:30.259
David Bau: [unintelligible]

82
00:09:31.000 --> 00:09:33.339
David Bau: How do you disentangle that?

83
00:09:33.630 --> 00:09:38.900
David Bau: How do I disentangle them, the two variables that you're changing? Like, how do you know that you're just…

84
00:09:39.450 --> 00:09:41.470
David Bau: I'm just changing one variable.

85
00:09:42.210 --> 00:09:53.909
David Bau: just the goal, and I'm assuming that there is a disentanglement between understanding the goal versus how, in the programming language, it is going to be implemented.

86
00:09:54.310 --> 00:09:55.509
David Bau: Does that make sense?

87
00:09:55.860 --> 00:10:02.369
David Bau: So this is just a hypothesis. I'm not doing anything to enforce that disentanglement.

88
00:10:05.190 --> 00:10:06.610
David Bau: Does that make sense?

89
00:10:06.960 --> 00:10:09.010
David Bau: Yep, and then what happens? Okay.

90
00:10:09.460 --> 00:10:17.469
David Bau: So, so one natural way to target is basically attention heads, because attention heads are actually doing the parsing.

91
00:10:17.630 --> 00:10:25.450
David Bau: So I… with this causal tracing, or with this activation patching setup, I learned a joint mask.

92
00:10:25.740 --> 00:10:32.350
David Bau: over all the attention heads, to basically flip the prediction to whatever I want.

93
00:10:32.530 --> 00:10:39.329
David Bau: And I can roughly select, kind of, like, about 2% of all the attention heads, and…

94
00:10:39.720 --> 00:10:46.379
David Bau: If I patch over the output of that 2% of the attention heads in the destination run, then…

95
00:10:46.490 --> 00:10:49.200
David Bau: Well, immediately after this prompt,

96
00:10:50.560 --> 00:11:06.910
David Bau: you will see that the model is actually implementing the is palindrome logic, but sometimes… so, I'm doing steering, so for each generated token, I do the same patching over and over again.

97
00:11:07.000 --> 00:11:24.859
David Bau: So sometimes, the model… it does not really forget that it was in the isLower context, so even if it immediately checks for the is palindrome predicate, it will eventually go back to checking the destination predicate as well.
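
NOTE
Editor's sketch of applying a head-level patching mask, as described in
the cues above. Assumptions: per-head outputs are NumPy arrays of shape
(num_layers, num_heads, head_dim) at one token position, and the mask is
already given. The speaker learns the mask jointly; that optimization is
not shown, and top_fraction_mask is only a hypothetical way to pick
roughly 2% of heads from a score array.
```python
import numpy as np
def patch_heads(dest_heads, source_heads, mask):
    # mask: (num_layers, num_heads), 1.0 where the source head output replaces the destination one
    m = mask[..., None]
    return m * source_heads + (1.0 - m) * dest_heads
def top_fraction_mask(scores, fraction=0.02):
    # Hypothetical binarization: keep the highest-scoring fraction of heads (at least one)
    k = max(1, int(round(fraction * scores.size)))
    threshold = np.sort(scores, axis=None)[-k]
    return (scores >= threshold).astype(np.float64)
```
For the steering variant in the talk, the same patch_heads call would be
repeated at every generated token.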

98
00:11:24.890 --> 00:11:28.100
David Bau: So, anyways… Yeah, what are the right tokens to patch?

99
00:11:28.540 --> 00:11:37.790
David Bau: This is something, yeah, and also, like, how… Is there a subspace where you should actually molt, like, every token along that subspace or something? Good question. Yeah.

100
00:11:41.160 --> 00:11:59.049
David Bau: Also, like, do you need to patch it over the whole generation? Like, when are you going to actually end the patching? So it's a question of, like, when the model starts holding the goal in its head, and when it stops. Yeah, the clock signal, yeah. I'm calling it the clock signal, and the clock is, basically, when to start pushing the goal,

101
00:11:59.130 --> 00:12:06.820
David Bau: then when to stop enforcing the goal, and how to schedule things inside. So, yes.

102
00:12:08.240 --> 00:12:20.369
David Bau: Is there a way to measure this? I have been going through the literature for the last week, and I have not found anything, but the hope from this project is to get a better understanding of, you know, how to do that.

103
00:12:20.790 --> 00:12:26.290
David Bau: Yeah, excellent. I prefer not to be recorded. Oh, you prefer not to be recorded?

104
00:12:28.800 --> 00:12:30.060
David Bau: We can fix that.

