WEBVTT

1
00:00:03.960 --> 00:00:08.180
prakash.nik: Reprace plan. A different plan? Inclusive plan?

2
00:00:08.640 --> 00:00:09.330
prakash.nik: Excellent.

3
00:00:10.290 --> 00:00:14.110
prakash.nik: In the long…

4
00:00:14.670 --> 00:00:21.290
prakash.nik: Should we start? Or should we wait? Do you know if any of your friends are still coming?

5
00:00:24.060 --> 00:00:26.359
prakash.nik: Maybe some people. Maybe some people.

6
00:00:26.600 --> 00:00:32.399
prakash.nik: Okay, maybe… maybe they're all stuck in traffic, but I don't know.

7
00:00:37.690 --> 00:00:38.730
prakash.nik: Aristoph.

8
00:00:39.510 --> 00:00:40.639
prakash.nik: No, that's not yours.

9
00:00:41.460 --> 00:00:42.460
prakash.nik: You're safe.

10
00:00:44.980 --> 00:00:46.610
prakash.nik: I mean, it's called Glam Bristols.

11
00:00:47.620 --> 00:00:51.940
prakash.nik: basically forcing it into the action scenario, it's saying it.

12
00:00:54.820 --> 00:00:57.029
prakash.nik: It's like you're at, you basically only happened.

13
00:00:57.640 --> 00:01:00.140
prakash.nik: Are you a plant associated.

14
00:01:01.410 --> 00:01:13.450
prakash.nik: So perhaps I'm moving… Measuring by their function.

15
00:01:13.640 --> 00:01:19.240
prakash.nik: given states that are sufficient for an answer.

16
00:01:20.840 --> 00:01:26.680
prakash.nik: Oh, there you go.

17
00:01:26.790 --> 00:01:30.010
prakash.nik: I forgot how this one.

18
00:01:30.120 --> 00:01:36.009
prakash.nik: How are you guys?

19
00:01:39.140 --> 00:01:53.029
prakash.nik: See the change in the oscillation.

20
00:01:53.210 --> 00:01:56.220
prakash.nik: The sound of the amplitude.

21
00:01:56.440 --> 00:02:02.849
prakash.nik: Okay, then I go to the beginning part.

22
00:02:03.190 --> 00:02:07.750
prakash.nik: Oh, so… Even if I do… it's simple.

23
00:02:08.160 --> 00:02:20.120
prakash.nik: So yeah, today I will be taking the class personally. David needs to attend a conference, so he couldn't join

24
00:02:20.430 --> 00:02:27.540
prakash.nik: in person. Yeah, so, yeah, I'm taking the class, and the topic, as you know, is circuits: trying to understand

25
00:02:28.010 --> 00:02:31.129
prakash.nik: how we can find them in language models, specifically.

26
00:02:31.300 --> 00:02:34.179
prakash.nik: Yeah, so let's get started.

27
00:02:36.750 --> 00:02:39.480
prakash.nik: Okay, so let's start with some history, because I like history.

28
00:02:39.600 --> 00:02:43.109
prakash.nik: I think David mentioned this paper a few weeks back.

29
00:02:43.260 --> 00:02:46.549
prakash.nik: This is, like, the original backpropagation paper.

30
00:02:47.170 --> 00:02:50.069
prakash.nik: From 1986, I guess.

31
00:02:50.220 --> 00:02:53.600
prakash.nik: In this paper, Rumelhart and…

32
00:02:53.940 --> 00:02:56.059
prakash.nik: And basically, they proposed backpropagation.

33
00:02:57.370 --> 00:03:08.890
prakash.nik: And they not only proposed backpropagation, they also reverse engineered all the weights of the neural network that they trained using backpropagation on a dataset, which is something like

34
00:03:09.040 --> 00:03:12.719
prakash.nik: Like, a family-tree prediction kind of dataset.

35
00:03:14.100 --> 00:03:19.410
prakash.nik: And, if you read that paper, at the end of that paper…

36
00:03:21.020 --> 00:03:24.169
prakash.nik: The argument that they are making is something like:

37
00:03:24.930 --> 00:03:36.730
prakash.nik: Backpropagation is not crazy. It is actually learning genuine algorithms, which seem sensible to us. It can learn symmetry. That's why we should be using backpropagation. It's not like some crazy algorithm.

38
00:03:37.240 --> 00:03:41.480
prakash.nik: But then, I think, over time, we almost forgot about

39
00:03:42.340 --> 00:03:45.329
prakash.nik: This aspect of backpropagation, that it can actually learn

40
00:03:45.550 --> 00:03:50.320
prakash.nik: genuine algorithms and representations, and we focused primarily on scaling things up.

41
00:03:52.310 --> 00:04:02.350
prakash.nik: But eventually, I think people started thinking about that question again, something like 30 years later, when Chris Olah and his team at OpenAI started asking the same question.

42
00:04:02.510 --> 00:04:08.120
prakash.nik: Can we reverse engineer these huge models that we are able to

43
00:04:08.340 --> 00:04:14.410
prakash.nik: train now? Can we understand the features and the connections and the subnetworks in

44
00:04:14.670 --> 00:04:15.710
prakash.nik: those networks?

45
00:04:17.360 --> 00:04:23.430
prakash.nik: Yeah, so this was, I think, the first blog post within their… their Circuits thread.

46
00:04:23.610 --> 00:04:26.920
prakash.nik: where they, I think, analyzed Inception.

47
00:04:27.570 --> 00:04:32.140
prakash.nik: Inception model, to understand how it is able to do so well.

48
00:04:32.450 --> 00:04:43.409
prakash.nik: And they found various kinds of features, and they showed that, I think, the earlier layers learn shallow features, whereas the later layers learn more high-level features.

49
00:04:44.550 --> 00:05:04.519
prakash.nik: But in addition to those features, they actually started talking about subnetworks as well. So I think the question that they were interested in was: okay, we know that the shallow features are in the earlier layers, and more complex features are in the later layers. So can we understand the relation between them? How are the shallow features actually getting transformed into more complex features?

50
00:05:05.520 --> 00:05:12.829
prakash.nik: So that was their underlying question, and so that's why they started to look into the connections or the weights of the models as well.

51
00:05:13.620 --> 00:05:17.989
prakash.nik: And they did some analysis, which is not very important, but the main point here is.

52
00:05:18.210 --> 00:05:21.389
prakash.nik: I think this was… I would say…

53
00:05:21.710 --> 00:05:27.320
prakash.nik: The, sort of, like, the first paper in the modern era, which started talking about subnetworks and circuits.

54
00:05:28.360 --> 00:05:30.640
prakash.nik: Is there any specific reason why?

55
00:05:30.940 --> 00:05:49.160
prakash.nik: They have observed that final layers learn more complex features, and initial layers learn shallow ones. But it could be the other way around, or it could be a mix across layers, because at the end, when we backprop those losses,

56
00:05:49.380 --> 00:05:52.560
prakash.nik: shouldn't they be uniformly distributed?

57
00:05:54.720 --> 00:05:56.930
prakash.nik: That's… yeah, that's a good question.

58
00:05:57.860 --> 00:06:04.079
prakash.nik: A decent question, but it's not super relevant to our discussion, I would say. But I think the answer to that would be:

59
00:06:06.020 --> 00:06:08.420
prakash.nik: Since the model needs to learn a lot of things.

60
00:06:08.660 --> 00:06:15.279
prakash.nik: it is actually trying to encode common features. I mean, that would be my intuition. It's not like I

61
00:06:15.550 --> 00:06:26.340
prakash.nik: have done experiments on that. But my intuition would be, since the model needs to do a lot of things, it would probably learn to do simple things, which are sort of common across a bunch of tasks, in the early layers, and get more specialized in later

62
00:06:27.020 --> 00:06:33.119
prakash.nik: layers. That could be one argument. The other argument could be, yeah, maybe you need more layers just to get more complex features.

63
00:06:33.320 --> 00:06:36.669
prakash.nik: That's why you get more complex features in the later layers, but that's…

64
00:06:37.400 --> 00:06:46.460
prakash.nik: My reason to bring up this paper was just to share that this was, I think, the paper which sort of started using the term concepts in the interpretability space.

65
00:06:47.490 --> 00:06:50.349
prakash.nik: Sorry, when I said concept, I meant circuit.

66
00:06:50.750 --> 00:06:55.389
prakash.nik: The term circuit has been used in neuroscience for quite some time.

67
00:06:56.290 --> 00:07:03.840
prakash.nik: and electrical engineering. In fact, interestingly, if you go out of that door, which we just opened up,

68
00:07:03.960 --> 00:07:06.710
prakash.nik: you will see a notice board.

69
00:07:06.950 --> 00:07:13.660
prakash.nik: there is a pamphlet. There is a poster for some kind of workshop. I think it's from some electrical engineering.

70
00:07:14.310 --> 00:07:21.039
prakash.nik: kind of workshop, where one of the topics is circuit simulation.

71
00:07:22.140 --> 00:07:32.819
prakash.nik: So yeah, my point is, circuit is a common term across a bunch of other disciplines as well. It's not like we just invented the term, but I think this was the first paper that started using the term in the interpretability space.

72
00:07:34.340 --> 00:07:36.480
prakash.nik: Okay, so that was the historical context.

73
00:07:36.900 --> 00:07:43.469
prakash.nik: But now, coming into, like, more specifics, what exactly I mean by circuit. So, imagine this is your transformer…

74
00:07:43.940 --> 00:07:49.160
prakash.nik: architecture. Yeah, you all know the transformer architecture. Now, think of

75
00:07:49.330 --> 00:07:51.840
prakash.nik: this architecture as a computational graph,

76
00:07:52.090 --> 00:07:55.569
prakash.nik: where the nodes of the graph are basically your model components.

77
00:07:56.130 --> 00:07:59.700
prakash.nik: The weights connecting them are basically the edges in your computational graph.

78
00:08:00.070 --> 00:08:02.479
prakash.nik: So if you imagine that,

79
00:08:02.880 --> 00:08:09.270
prakash.nik: Then, what I mean by a circuit is basically a subgraph of that computational graph, which connects the inputs to the logits.

80
00:08:10.050 --> 00:08:10.860
prakash.nik: Effectively,

81
00:08:11.020 --> 00:08:15.839
prakash.nik: It is just a set of components which are interconnected with each other.

82
00:08:16.040 --> 00:08:19.520
prakash.nik: and have a direct path from the inputs to the final output.

83
00:08:20.100 --> 00:08:21.699
prakash.nik: That's what I mean by circuit.
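
NOTE
Editor's aside: a toy sketch (not from the talk) of this definition in Python. Node names are illustrative; the point is just that a circuit is a subgraph of the model's computational graph with a path from the inputs to the logits.
# Model components (heads, MLPs) as nodes; connections as directed edges.
full_graph = {
    "input": ["a0.h1", "mlp0"],
    "a0.h1": ["a2.h3", "mlp0"],
    "mlp0": ["a2.h3"],
    "a2.h3": ["logits"],
}
# A circuit keeps only the causally relevant subset of nodes and edges.
circuit = {"input": ["a0.h1"], "a0.h1": ["a2.h3"], "a2.h3": ["logits"]}
def connects(graph, src="input", dst="logits"):
    # Depth-first search: does the subgraph link the inputs to the logits?
    stack, seen = [src], set()
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, []))
    return False
assert connects(circuit)  # a valid circuit reaches the logits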

84
00:08:22.960 --> 00:08:24.870
prakash.nik: Okay.

85
00:08:24.870 --> 00:08:28.500
David Bau: So here, in that… in this… in this diagram is the… the gray stuff.

86
00:08:28.680 --> 00:08:34.710
David Bau: The notation you're using to indicate it's in the circuit, or is the gray stuff for the stuff that's not in the circuit?

87
00:08:34.710 --> 00:08:35.539
prakash.nik: The gold ones, in this.

88
00:08:35.890 --> 00:08:36.709
prakash.nik: Maybe I should.

89
00:08:37.590 --> 00:08:40.859
David Bau: Oh, so the yellow stuff is the stuff that's in the circuit for you?

90
00:08:40.860 --> 00:08:41.450
prakash.nik: Yes.

91
00:08:41.700 --> 00:08:44.730
David Bau: Okay, cool. Just, just, just learning your diagram, that's all.

92
00:08:44.970 --> 00:08:46.460
prakash.nik: Yeah, maybe I should be more explicit.

93
00:08:46.800 --> 00:08:48.760
David Bau: Oh, Jasmine had a question, is that right?

94
00:08:49.090 --> 00:08:51.459
prakash.nik: Yeah, but I think she… is she online?

95
00:08:53.930 --> 00:08:55.230
prakash.nik: Yeah, she's on it.

96
00:08:55.230 --> 00:08:59.000
Jasmine C.: Actually, my voice is, like, really gone, never mind.

97
00:09:01.140 --> 00:09:02.839
David Bau: Okay, cool, we can just keep on going.

98
00:09:03.810 --> 00:09:08.480
prakash.nik: Rice… You had a question as well about your second question.

99
00:09:09.640 --> 00:09:18.049
prakash.nik: Was it the one about, like, task-specific, or… Oh, okay. So I think I was asking…

100
00:09:18.430 --> 00:09:22.940
prakash.nik: I could imagine, like, a model component like…

101
00:09:23.770 --> 00:09:37.710
prakash.nik: like, working on a bunch of tasks, right? So, like, it could, for example, I think we were reading about the IOI, like, it could be serving as the name mover head, and it could also, in another task, be serving as something else. So I was just wondering, like, it's kind of…

102
00:09:37.880 --> 00:09:39.979
prakash.nik: Impossible to, like, actually, like.

103
00:09:40.270 --> 00:09:47.640
prakash.nik: reverse engineer, like, all the model components and, like, label them, like, universally. So is it, like, task-specific all the time?

104
00:09:47.800 --> 00:09:52.039
prakash.nik: Yeah, so generally, when I think of circuit, I think of them as task-specific as well.

105
00:09:53.150 --> 00:09:56.600
prakash.nik: That these are… this is the circuit for this particular task.

106
00:09:56.930 --> 00:10:04.149
prakash.nik: But the high-level idea, or high-level hope of doing this kind of work is, is that probably someday we would be able to

107
00:10:04.330 --> 00:10:06.070
prakash.nik: have modular circuits

108
00:10:06.530 --> 00:10:11.239
prakash.nik: That is… Sort of universal across different kinds of tasks.

109
00:10:12.010 --> 00:10:15.450
prakash.nik: We will find some smaller circuits which are modular,

110
00:10:15.960 --> 00:10:20.140
prakash.nik: Which you can basically find out in different kinds of tasks.

111
00:10:20.240 --> 00:10:21.579
prakash.nik: I think that's the high-level hope.

112
00:10:22.810 --> 00:10:24.799
prakash.nik: It's funny, we've got to do lots of holes of…

113
00:10:25.020 --> 00:10:28.790
prakash.nik: I guess so, I think if we can… Scale things up.

114
00:10:29.820 --> 00:10:33.650
prakash.nik: And… this is just a really small model.

115
00:10:34.230 --> 00:10:41.300
prakash.nik: But let's say we do a similar kind of work on models with more than 100 billion parameters,

116
00:10:41.710 --> 00:10:44.100
prakash.nik: I feel we should be able to find some…

117
00:10:45.050 --> 00:10:49.289
prakash.nik: set of, like, smaller modular circuits, which are

118
00:10:49.650 --> 00:10:54.470
prakash.nik: generalizable across tasks. I mean, in a sense, you can think of induction as a circuit as well.

119
00:10:54.600 --> 00:10:57.009
prakash.nik: Which we know is generalizable.

120
00:10:57.660 --> 00:10:58.799
prakash.nik: across a bunch of tasks.

121
00:10:59.030 --> 00:11:01.770
prakash.nik: But maybe induction is a little bit low level?

122
00:11:01.890 --> 00:11:06.090
prakash.nik: So I'm hoping something which is a little bit more of a higher level than induction.

123
00:11:06.440 --> 00:11:12.379
prakash.nik: which would still make sense to people, but… and is also found across tasks.

124
00:11:17.160 --> 00:11:23.480
prakash.nik: Okay, so that was the definition of the circuit, but I'll… raise one

125
00:11:23.970 --> 00:11:30.890
prakash.nik: alert here. A lot of people, when they're talking about circuits, use the terms…

126
00:11:31.780 --> 00:11:36.250
prakash.nik: like, mechanism and circuit, sort of interchangeably.

127
00:11:36.780 --> 00:11:42.340
prakash.nik: Now, when I say circuit, I really mean a set of model components. But certain people,

128
00:11:42.690 --> 00:11:51.170
prakash.nik: when they say circuit, they actually mean the mechanism. Like, you can actually explain in English how an input is getting transformed into the final output.

129
00:11:51.970 --> 00:11:56.280
prakash.nik: You can explain what is the functionality of the… model components.

130
00:11:56.520 --> 00:11:58.969
prakash.nik: Which I… which I call mechanism.

131
00:11:59.430 --> 00:12:06.249
prakash.nik: That's how I generally like to think of it: a circuit is a set of model components, and that's what I will be using

132
00:12:06.390 --> 00:12:12.909
prakash.nik: when I say circuit. Okay, so let's get into our first paper.

133
00:12:13.130 --> 00:12:15.409
prakash.nik: The IOI paper.

134
00:12:15.640 --> 00:12:23.290
prakash.nik: So I remember it was a 2022 paper. This came out during our first year here. This was my first semester, actually.

135
00:12:23.420 --> 00:12:27.200
prakash.nik: I was super surprised by this work. It's really thorough work.

136
00:12:27.490 --> 00:12:31.430
prakash.nik: And the other reason for me getting surprised was the first author.

137
00:12:32.520 --> 00:12:34.080
prakash.nik: Yeah, you know the reason.

138
00:12:34.380 --> 00:12:37.140
prakash.nik: So, the first author, he was a high schooler at the time.

139
00:12:38.020 --> 00:12:45.229
prakash.nik: and he, he's actually from Boston. I'm not gonna say where in Boston, but,

140
00:12:45.370 --> 00:12:48.739
prakash.nik: Yeah, I was super surprised to see that somebody…

141
00:12:48.910 --> 00:12:57.169
prakash.nik: in high school can do this kind of research, such thorough, high-level research. So I was super surprised by that, and that was…

142
00:12:57.550 --> 00:12:59.760
prakash.nik: That's, like, a motivation for, sort of, all of us.

143
00:13:00.930 --> 00:13:02.330
prakash.nik: You can do

144
00:13:02.460 --> 00:13:05.150
prakash.nik: impactful research no matter where you are.

145
00:13:05.760 --> 00:13:07.770
prakash.nik: Yeah, I just wanted to say that.

146
00:13:08.140 --> 00:13:10.570
prakash.nik: Okay, so coming to their experimental setup.

147
00:13:11.100 --> 00:13:13.410
prakash.nik: It's a pretty simple task.

148
00:13:16.200 --> 00:13:24.109
prakash.nik: Yeah, so they have sentences like, John and Mary went to a store, John gave a drink to, and then, like, the next token after "to" should be Mary.

149
00:13:24.280 --> 00:13:34.039
prakash.nik: So that's basically the indirect object identification task. They studied GPT-2 small, which has, like, 12 layers and 12 attention heads per layer.

150
00:13:34.530 --> 00:13:39.520
prakash.nik: The metric that they use is called logit difference, which basically means that you take the logit of

151
00:13:39.960 --> 00:13:43.520
prakash.nik: like, the first person, minus the logit of the second person.

152
00:13:43.890 --> 00:13:46.759
prakash.nik: That's the… that's the logit difference. And the model,

153
00:13:47.070 --> 00:13:50.269
prakash.nik: GPT-2, can perform the task

154
00:13:50.930 --> 00:13:53.430
prakash.nik: at almost 99%, like, almost 100%.
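
NOTE
Editor's aside: a minimal sketch of the logit-difference metric just described, assuming the transformer_lens library; the prompt is the standard IOI example.
from transformer_lens import HookedTransformer
model = HookedTransformer.from_pretrained("gpt2")  # GPT-2 small: 12 layers, 12 heads/layer
prompt = "When John and Mary went to the store, John gave a drink to"
logits = model(model.to_tokens(prompt))            # [batch, position, vocab]
io_id = model.to_single_token(" Mary")             # indirect object (correct answer)
s_id = model.to_single_token(" John")              # subject (incorrect answer)
# Logit difference at the final position; positive means the model prefers "Mary".
print(float(logits[0, -1, io_id] - logits[0, -1, s_id]))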

155
00:13:56.490 --> 00:14:03.929
prakash.nik: Okay, so the main contributions, like, I see this paper has two main contributions. One is the empirical

156
00:14:04.200 --> 00:14:22.220
prakash.nik: contribution, which is the circuit itself. The other one is the methodological contribution, which is called path patching, which is important for understanding how they actually came up with the circuit. So I'm first gonna describe the path patching algorithm intuitively, and then we will get into the details. So…

157
00:14:22.530 --> 00:14:26.490
prakash.nik: Yeah, again, think of a neural network as a computational graph.

158
00:14:26.900 --> 00:14:30.239
prakash.nik: Essentially, there's a question that path patching is trying to answer.

159
00:14:30.450 --> 00:14:38.169
prakash.nik: So, given this computational graph, what are the important edges in that computational graph which are causally relevant for doing a particular task?

160
00:14:41.010 --> 00:14:45.149
prakash.nik: So what you do is, you first need to come up with a counterfactual.

161
00:14:46.120 --> 00:14:49.240
prakash.nik: And the kind of counterfactual that you come up with

162
00:14:50.260 --> 00:14:54.709
prakash.nik: Is something which has completely different information.

163
00:14:54.940 --> 00:15:01.940
prakash.nik: about the… like, it should have different…

164
00:15:04.220 --> 00:15:06.640
prakash.nik: Concepts which are relevant to your task.

165
00:15:06.820 --> 00:15:09.329
prakash.nik: So it should still have similar structure.

166
00:15:09.600 --> 00:15:17.770
prakash.nik: But the main thing that you're trying to understand should be different. So, for this particular task specifically,

167
00:15:18.090 --> 00:15:20.900
prakash.nik: Let's assume this is our clean example.

168
00:15:21.740 --> 00:15:31.520
prakash.nik: they basically used a corrupted version of this task, which would have… three persons.

169
00:15:32.350 --> 00:15:37.560
prakash.nik: And the… Yeah, they basically use a three-person task.

170
00:15:37.670 --> 00:15:43.000
prakash.nik: where you will have, like, John and Mary went to the store, then you will have…

171
00:15:43.170 --> 00:15:45.400
prakash.nik: Eric gave a drink to.

172
00:15:45.630 --> 00:15:54.489
prakash.nik: So, this basically destroys the indirectness of the task; like, there is no connection between the third person and the… first and the second persons.

173
00:15:54.730 --> 00:16:00.199
prakash.nik: That's the… corrupted example that they used for this particular task.
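
NOTE
Editor's aside: an illustrative clean/corrupted pair of the kind described here (the paper's three-name "ABC" corruption); the exact sentences are my own.
clean   = "When John and Mary went to the store, John gave a drink to"
corrupt = "When John and Mary went to the store, Eric gave a drink to"
# Same sentence structure, and ideally the same token length, so activations
# from the two runs stay position-aligned when patching; only the task-relevant
# information (who the indirect object is) differs.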

174
00:16:02.630 --> 00:16:07.369
prakash.nik: Okay, so once you have your corrupted example, you feed both examples into your model.

175
00:16:08.400 --> 00:16:15.070
prakash.nik: And then what you do is, you start patching in the edges which are directly connected to the

176
00:16:15.230 --> 00:16:16.520
prakash.nik: final output,

177
00:16:16.880 --> 00:16:19.459
prakash.nik: From the corrupted run to the clean one.

178
00:16:20.140 --> 00:16:28.019
prakash.nik: Okay? And you do that for all the edges that are directly connected to the final output. Only the edges which are directly connected to the final output.

179
00:16:28.470 --> 00:16:33.240
prakash.nik: You do that, and for each patch, you will see that the final output changes a little.

180
00:16:33.400 --> 00:16:38.250
prakash.nik: So you need to measure that. So in this paper, they are using logit difference, so let's say you use that.

181
00:16:39.300 --> 00:16:41.940
prakash.nik: So after… this iteration,

182
00:16:42.050 --> 00:16:47.680
prakash.nik: you will have a set of scores for all the edges which are directly connected to the final output.

183
00:16:48.210 --> 00:16:49.790
prakash.nik: And you use some threshold?

184
00:16:49.940 --> 00:16:54.370
prakash.nik: to filter out the irrelevant ones, and only keep the ones which have high causal impact

185
00:16:55.160 --> 00:16:56.150
prakash.nik: on the final output.

186
00:16:56.280 --> 00:17:00.919
prakash.nik: That's how you get, like, the first set of edges in your circuit.

187
00:17:01.140 --> 00:17:04.060
prakash.nik: Now, once you do that, you basically need to…

188
00:17:04.619 --> 00:17:07.300
prakash.nik: Do the same process again, but now, assuming…

189
00:17:07.550 --> 00:17:14.460
prakash.nik: instead of using the logits, the final output, as your target node,

190
00:17:14.650 --> 00:17:19.420
prakash.nik: you use the nodes which were identified in the previous iteration as your target nodes.

191
00:17:20.930 --> 00:17:24.719
prakash.nik: Do the process again and again, until you reach the inputs.

192
00:17:25.220 --> 00:17:32.380
prakash.nik: And at the end of the process, you will have something like this, where you will have all the important edges, which connect the inputs to the final outputs.
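
NOTE
Editor's aside: schematic pseudocode (my sketch, not the paper's code) of the backward, iterative search just described. path_patch_effect is a hypothetical helper that patches one sender-to-receiver edge from the corrupted run into the clean run and returns the change in logit difference; the threshold is empirical, as discussed below.
def find_circuit(upstream_of, path_patch_effect, threshold=0.02):
    frontier = ["logits"]            # iteration 1: the target is the final logits
    circuit_edges = set()
    while frontier:
        next_frontier = []
        for receiver in frontier:
            for sender in upstream_of(receiver):        # candidate upstream nodes
                effect = path_patch_effect(sender, receiver)
                if abs(effect) > threshold:             # keep high-impact edges only
                    circuit_edges.add((sender, receiver))
                    if sender != "input":
                        next_frontier.append(sender)    # becomes a target next round
        frontier = list(set(next_frontier))
    return circuit_edges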

193
00:17:33.520 --> 00:17:34.589
prakash.nik: Any questions?

194
00:17:37.250 --> 00:17:38.110
prakash.nik: Okay.

195
00:17:40.090 --> 00:17:43.149
prakash.nik: So that was the intuition, now this is the actual algorithm.

196
00:17:44.270 --> 00:17:47.060
prakash.nik: I think… yeah, this is an important slide.

197
00:17:47.270 --> 00:17:48.600
prakash.nik: If you understand this.

198
00:17:48.730 --> 00:17:49.860
prakash.nik: Life would be good.

199
00:17:50.520 --> 00:17:55.870
prakash.nik: Okay, so here, the…

200
00:17:55.870 --> 00:18:03.780
David Bau: So, Nikhil, I find the previous slide a little bit easier to read than this one.

201
00:18:04.290 --> 00:18:04.910
prakash.nik: Okay.

202
00:18:04.910 --> 00:18:14.649
David Bau: You know, just for whatever reason, it's just because of the layout. So I just wanted to give people just a minute to ask questions about the previous slide, or either one, before you get into it.

203
00:18:18.430 --> 00:18:20.580
prakash.nik: What's the… the threshold?

204
00:18:21.480 --> 00:18:23.820
prakash.nik: Yeah, I think that's an empirical thing that you have to…

205
00:18:24.310 --> 00:18:30.340
prakash.nik: I don't think there is any specific theory behind what should be the threshold that you should be using.

206
00:18:30.840 --> 00:18:34.540
prakash.nik: It also depends on the metric that you use. In this paper, they use logit lens.

207
00:18:34.680 --> 00:18:36.799
prakash.nik: Sorry, not logit lens. It's logit difference.

208
00:18:37.170 --> 00:18:39.610
prakash.nik: But that's also not universal.

209
00:18:40.810 --> 00:18:44.510
prakash.nik: Yeah, so I think you have to do it empirically, figure that out empirically.

210
00:18:46.580 --> 00:18:51.990
prakash.nik: I probably missed this part, but how did they come up with the graph? Like, how did they come up with what is an edge?

211
00:18:53.330 --> 00:18:57.559
prakash.nik: Yeah, so what exactly is an edge, and how do you patch one edge

212
00:18:59.150 --> 00:19:07.000
prakash.nik: Like, how do you take an edge and patch it into a clean run? That's a question that I'll be answering in the next slide. Yeah, that's the…

213
00:19:07.530 --> 00:19:18.789
David Bau: So, can you… can you give, can you give a little idea? Like, so, what's… what's, like, the idea of what you're trying to do? Before you show how we do it? Like, what is it you're trying to do when you patch an edge? What would it mean?

214
00:19:22.670 --> 00:19:23.320
prakash.nik: Okay.

215
00:19:23.920 --> 00:19:25.489
prakash.nik: Let's use this diagram.

216
00:19:25.960 --> 00:19:30.550
prakash.nik: Let's say I want to understand how this head affects

217
00:19:31.040 --> 00:19:38.739
prakash.nik: this particular head. So there is a direct edge. There is an edge from this particular head

218
00:19:39.200 --> 00:19:40.990
prakash.nik: to this particular head.

219
00:19:42.550 --> 00:19:49.169
prakash.nik: And I want to patch that particular edge. I want to know its significance or relevance for this particular task.

220
00:19:50.200 --> 00:19:57.760
prakash.nik: I think the essential idea here is, you patch the output of… your…

221
00:19:58.250 --> 00:20:02.830
prakash.nik: like, the sender node, like, the node which is sending the information,

222
00:20:03.070 --> 00:20:06.789
prakash.nik: while keeping everything else in between constant.

223
00:20:07.350 --> 00:20:10.749
prakash.nik: You need to make sure that nothing really changes.

224
00:20:11.010 --> 00:20:12.190
prakash.nik: in between.

225
00:20:12.690 --> 00:20:16.139
prakash.nik: When you patch in the output of this particular head.

226
00:20:16.630 --> 00:20:21.869
prakash.nik: And then you need to figure out, like, you basically need to cache the input of this particular head.

227
00:20:25.190 --> 00:20:34.980
David Bau: So, I like this. I think this is funny, because there's this thing that philosophers sometimes say, which is like, what's… what's the importance of a thing? And they say… they use this phrase.

228
00:20:35.390 --> 00:20:37.279
David Bau: All else being equal.

229
00:20:38.230 --> 00:20:39.650
David Bau: Which is usually…

230
00:20:40.310 --> 00:20:45.459
David Bau: It's just a mental experiment. I mean, it's usually impossible to keep all else being equal, you just…

231
00:20:46.070 --> 00:20:51.160
David Bau: usually have to reason through it. Like, what would happen if everything else was equal?

232
00:20:51.750 --> 00:20:59.420
David Bau: But I think the funny thing about this experiment that, you know, Kevin Wang, you know, proposed

233
00:20:59.600 --> 00:21:07.569
David Bau: was, he says, oh, you can physically keep everything else all equal. Like, if you want to know how important the edge is, you, like, physically, like, pin down everything else.

234
00:21:07.910 --> 00:21:13.590
David Bau: Anyway, that's sort of the way I think of it. I'll let Nikhil get into the details, but I think it's easy to get lost.

235
00:21:13.950 --> 00:21:18.240
David Bau: With all the details, but if you just kind of keep in mind, what he's trying to do is he's trying to keep

236
00:21:18.600 --> 00:21:22.159
David Bau: He's trying to do, like, one of these all-else-being-equal experiments.

237
00:21:25.530 --> 00:21:32.740
prakash.nik: just to reiterate what I was saying, yeah, basically, I want to see how does information from this particular

238
00:21:32.980 --> 00:21:42.249
prakash.nik: head goes via the residual stream to this particular… so I keep everything else, the output of all the components in between, the same.

239
00:21:42.380 --> 00:21:45.820
prakash.nik: And I essentially see, how does this particular head

240
00:21:46.180 --> 00:21:48.890
prakash.nik: Affect this, affect the input of this particular head?

241
00:21:49.090 --> 00:21:50.419
prakash.nik: via the residual stream.

242
00:21:52.180 --> 00:21:53.939
prakash.nik: That's the, I think, main idea.

243
00:21:54.970 --> 00:21:57.900
prakash.nik: In terms of, actually, how would you do it?

244
00:21:58.010 --> 00:22:03.829
prakash.nik: Again, we would have to come back to this particular figure, which is from their paper itself. I did not make it.

245
00:22:04.640 --> 00:22:10.059
prakash.nik: Yeah, so here it is upside down. So this is our input, this is our output.

246
00:22:10.750 --> 00:22:15.230
prakash.nik: The red ones are from the corrupted sample.

247
00:22:15.690 --> 00:22:18.759
prakash.nik: Green ones are from the clean sample.

248
00:22:20.070 --> 00:22:25.980
prakash.nik: And the goal here is to…

249
00:22:26.090 --> 00:22:28.970
prakash.nik: Patch in the edge from this particular…

250
00:22:29.310 --> 00:22:32.700
prakash.nik: Head to this particular head, and

251
00:22:34.260 --> 00:22:36.980
prakash.nik: This particular head. I don't know why is this one green.

252
00:22:38.400 --> 00:22:46.040
prakash.nik: Okay, okay. So, idea here is to patch in the edge from this particular head to this particular head, and this particular head to this particular head, okay?

253
00:22:46.300 --> 00:22:48.460
prakash.nik: So what they do is, they again

254
00:22:48.640 --> 00:22:56.749
prakash.nik: feed in the clean sample, okay? But now, they patch in the output of this particular head from the corrupted run.

255
00:22:58.050 --> 00:23:01.720
prakash.nik: While keeping the output of all other heads.

256
00:23:02.220 --> 00:23:04.030
prakash.nik: The same as the original one.

257
00:23:04.720 --> 00:23:05.640
prakash.nik: Okay.

258
00:23:06.880 --> 00:23:10.169
prakash.nik: You can imagine the output of MLPs from the green one as well.

259
00:23:10.560 --> 00:23:11.440
prakash.nik: For now.

260
00:23:11.690 --> 00:23:21.420
prakash.nik: If you assume that, then the red one, there would be only one path from this particular red to this particular red, which would be via the residual stream.

261
00:23:22.240 --> 00:23:23.060
prakash.nik: Mike?

262
00:23:25.160 --> 00:23:31.620
prakash.nik: That's what they are interested in. They are, like, we are interested in… trying to

263
00:23:32.120 --> 00:23:35.850
prakash.nik: patch in the direct edge between these two particular heads.

264
00:23:36.090 --> 00:23:40.800
prakash.nik: Okay, so in this step, what we do is we patch in the output of this particular head while keeping the

265
00:23:40.920 --> 00:23:47.649
prakash.nik: output of all attention heads fixed. You can also keep the output of MLPs fixed.

266
00:23:47.880 --> 00:23:56.520
prakash.nik: And then, cache the input of your receiver heads, which in this case are 2.0 and 3.1.

267
00:23:56.700 --> 00:23:57.440
prakash.nik: Okay.

268
00:23:57.870 --> 00:24:00.199
prakash.nik: Cache the input of these two heads.

269
00:24:00.330 --> 00:24:04.299
prakash.nik: Then comes the fourth step, where, again, you feed in the clean example.

270
00:24:04.530 --> 00:24:07.740
prakash.nik: But this time, you patch in,

271
00:24:08.170 --> 00:24:13.060
prakash.nik: The input of your receiver heads.

272
00:24:14.210 --> 00:24:19.180
prakash.nik: which you cached in the previous step. And then you see what is the effect on the final output.

273
00:24:19.680 --> 00:24:24.300
prakash.nik: And that is equivalent to patching a direct edge between

274
00:24:24.690 --> 00:24:28.860
prakash.nik: your sender and the receiver heads.

275
00:24:29.720 --> 00:24:33.129
prakash.nik: Yeah, I think that's very confusing, I know that. So, ask questions.

276
00:24:33.920 --> 00:24:36.060
prakash.nik: The third step, I think, is the main one.

277
00:24:36.290 --> 00:24:37.319
prakash.nik: Does that make sense?
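
NOTE
Editor's aside: a hedged sketch of the four steps for a single edge, assuming the transformer_lens hook-point API (hook_z is a head's per-head output, hook_q a head's query input). The sender/receiver indices are illustrative, and the two prompts must tokenize to the same length.
from transformer_lens import HookedTransformer
model = HookedTransformer.from_pretrained("gpt2")
clean = model.to_tokens("When John and Mary went to the store, John gave a drink to")
corrupt = model.to_tokens("When John and Mary went to the store, Eric gave a drink to")
# Steps 1-2: run both samples and cache every head's output.
_, clean_cache = model.run_with_cache(clean)
_, corrupt_cache = model.run_with_cache(corrupt)
sender_layer, sender_head = 0, 1    # e.g. head 0.1 (illustrative)
recv_layer, recv_head = 3, 1        # e.g. head 3.1 (illustrative)
def freeze_or_patch_z(z, hook):
    # Freeze every head to its clean output; only the sender head gets the
    # corrupted output. MLPs are left free to recompute, as in the paper.
    z[:] = clean_cache[hook.name]
    if hook.layer() == sender_layer:
        z[:, :, sender_head] = corrupt_cache[hook.name][:, :, sender_head]
    return z
# Step 3: forward pass on the clean sample with those hooks, caching the
# receiver head's query input.
grabbed = {}
def grab_q(q, hook):
    grabbed["q"] = q[:, :, recv_head].clone()
    return q
q_hook = f"blocks.{recv_layer}.attn.hook_q"
model.run_with_hooks(clean, fwd_hooks=[
    (f"blocks.{l}.attn.hook_z", freeze_or_patch_z) for l in range(recv_layer)
] + [(q_hook, grab_q)])
# Step 4: a clean run patching only the receiver's query input, then measure
# the effect on the logit difference at the final position.
def patch_q(q, hook):
    q[:, :, recv_head] = grabbed["q"]
    return q
patched_logits = model.run_with_hooks(clean, fwd_hooks=[(q_hook, patch_q)])
io_id, s_id = model.to_single_token(" Mary"), model.to_single_token(" John")
print(float(patched_logits[0, -1, io_id] - patched_logits[0, -1, s_id]))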

278
00:24:44.600 --> 00:24:57.880
prakash.nik: Okay. The same as the green one, or… Yeah, so in terms of notation, green is what you get from the second one, red is what you get from the first one, and the black ones are recomputed ones.

279
00:24:59.590 --> 00:25:01.540
prakash.nik: In the forward pass, of course.

280
00:25:01.660 --> 00:25:02.380
prakash.nik: Okay?

281
00:25:03.300 --> 00:25:07.009
prakash.nik: Can you repeat the question? Yeah, so…

282
00:25:07.470 --> 00:25:15.560
prakash.nik: Yeah, the model just does the forward pass. You don't patch it. It just does the matrix multiplication. Via the only path…

283
00:25:16.370 --> 00:25:17.720
prakash.nik: Feed of corners.

284
00:25:18.850 --> 00:25:19.509
prakash.nik: Yes, ma'am.

285
00:25:20.320 --> 00:25:28.070
prakash.nik: Assuming this is also green, if this was green, this wouldn't be red.

286
00:25:28.780 --> 00:25:31.149
prakash.nik: Okay? Because we are patching it in.

287
00:25:32.000 --> 00:25:36.570
prakash.nik: so, in that case, the red would only come…

288
00:25:36.700 --> 00:25:40.180
prakash.nik: via the residual stream to this particular… okay.

289
00:25:40.650 --> 00:25:44.269
prakash.nik: That's… that seems to be the idea. But if you… in… in this…

290
00:25:44.530 --> 00:25:46.620
prakash.nik: At least the work that they have.

291
00:25:46.910 --> 00:25:50.119
prakash.nik: presented in the paper, they have not kept the MLPs constant.

292
00:25:50.570 --> 00:25:56.170
prakash.nik: They have not patched in the MLPs from the green run. They have let them naturally recompute.

293
00:25:58.090 --> 00:25:58.999
prakash.nik: Does that make sense?

294
00:25:59.360 --> 00:26:12.580
prakash.nik: Is there a forward, like, after you've patched it, you're letting it, like… like, there's a forward pass or not, like, the MLP also recomputes then? Yeah, it does. So then won't that MLP0 also be, like, yellow or red?

295
00:26:16.260 --> 00:26:19.590
prakash.nik: So, the MLPs will have a new…

296
00:26:19.790 --> 00:26:23.410
prakash.nik: value. Yeah. That's, that's what,

297
00:26:24.440 --> 00:26:29.239
prakash.nik: The black ones are representing… I don't know why this one is red, why is this one red?

298
00:26:41.000 --> 00:26:43.119
prakash.nik: Yeah, I'm not sure why this edge is red.

299
00:26:43.740 --> 00:26:50.989
prakash.nik: I think they should have been black as well. I think, is it because, if 0.1 is corrupted,

300
00:26:51.820 --> 00:27:09.380
prakash.nik: that will… that will be computed inside MLP0, and the output of MLP0 will be changed due to that. Then this… okay, so in that case, these two edges would change as well, but I think what they're trying to say is… okay, now I think it makes sense.

301
00:27:11.530 --> 00:27:17.469
prakash.nik: This should have been red, but because you are patching this particular head, this particular path is basically blocked.

302
00:27:18.290 --> 00:27:21.140
prakash.nik: That's why they did not make it red.

303
00:27:21.340 --> 00:27:24.490
prakash.nik: Whereas this particular path is still red,

304
00:27:24.790 --> 00:27:27.039
prakash.nik: because… of the residual stream.

305
00:27:28.210 --> 00:27:29.120
prakash.nik: Any questions?

306
00:27:29.960 --> 00:27:30.620
prakash.nik: Nope.

307
00:27:31.560 --> 00:27:32.260
prakash.nik: No.

308
00:27:32.490 --> 00:27:34.240
prakash.nik: Okay,

309
00:27:34.790 --> 00:27:43.980
prakash.nik: The question was, why are we not patching the MLPs; or the question was… why are heads 1.0 and 1.1 not changing in the forward pass?

310
00:27:45.030 --> 00:27:52.309
prakash.nik: If you're… you're corrupting 0.1, and then there is another forward pass, you're recomputing.

311
00:27:52.890 --> 00:27:56.750
prakash.nik: So shouldn't, like, the… anything that's downstream also change?

312
00:27:57.140 --> 00:28:08.530
prakash.nik: Which component do you think should change? I would imagine 1.2 and 1.0 are also the same. Yeah, it will change, but since we are patching in the output of those two heads from our clean

313
00:28:08.690 --> 00:28:09.740
prakash.nik: Clean samples.

314
00:28:10.060 --> 00:28:12.569
prakash.nik: You're basically making sure that that does not change.

315
00:28:13.700 --> 00:28:19.619
prakash.nik: Okay? We are basically blocking it. We are not letting it recompute. We are just putting in the value that it already had.

316
00:28:20.710 --> 00:28:27.759
prakash.nik: then how can you guarantee, like, that it will be the same? Like, I understand if you're just patching,

317
00:28:28.460 --> 00:28:31.929
prakash.nik: Very good. Yeah, it's a bit tricky.

318
00:28:33.260 --> 00:28:42.999
prakash.nik: I mean, I have, like, I have… now that I kind of looked at this a little bit more, like, at a high level, like, why do you care about the direct effect

319
00:28:43.270 --> 00:28:57.850
prakash.nik: when you keep the other things constant? Because I would imagine, oh, what this head does is it tells this other one to send this message. Like, why would you care only about the direct effect? Yeah, because of the way they are trying to find the circuit.

320
00:28:58.820 --> 00:29:05.869
prakash.nik: You remember the initial idea that I mentioned? You need to start from the end, then find the first set of heads which are directly affecting the final logits.

321
00:29:06.170 --> 00:29:10.959
prakash.nik: You don't want to care about anything in between. You want to just understand the effect of

322
00:29:11.610 --> 00:29:20.929
prakash.nik: some set of… Upstream nodes to one specific downstream node. Once you figure that out.

323
00:29:21.180 --> 00:29:32.060
prakash.nik: need to figure out what is the other set of upstream nodes which are affecting the set of nodes that you figured out in the first step. You don't want to… So, because if it doesn't affect it directly,

324
00:29:32.470 --> 00:29:36.200
prakash.nik: then it's not the next step in the circuit. Yeah, exactly, exactly.

325
00:29:41.090 --> 00:29:42.200
prakash.nik: Any more questions?

326
00:29:46.180 --> 00:30:04.380
prakash.nik: Yeah, this is a bit tricky. That's why I said this is probably important, and this is… yeah. How do you connect the two slides? So, previously, you find scores for circuit edges, right? But, in the… in the figure in the paper, they supposedly choose some end nodes or some sort of parameters, so…

327
00:30:04.580 --> 00:30:11.669
prakash.nik: I was just wondering, what's the connection between your two slides? The main connection is about how do you actually do this edge patching.

328
00:30:12.340 --> 00:30:17.440
prakash.nik: You can easily do, like, node patching, like, you just take the output of a particular head or a neuron and take

329
00:30:17.670 --> 00:30:21.939
prakash.nik: patch it from one particular run to another. That's easy. But the…

330
00:30:22.120 --> 00:30:24.420
prakash.nik: The difficult one is how do you do the edge patching?

331
00:30:25.350 --> 00:30:26.849
prakash.nik: Which is what I'm showing.

332
00:30:27.050 --> 00:30:29.850
prakash.nik: You need to follow this path-patching procedure.

333
00:30:30.790 --> 00:30:36.489
prakash.nik: But in terms of actually implementing it, it's what they are trying to say in this figure.

334
00:30:37.130 --> 00:30:41.579
prakash.nik: If you take a particular edge that you want to patch,

335
00:30:41.770 --> 00:30:44.620
prakash.nik: These are the four steps you need to do to actually do it.

336
00:30:44.940 --> 00:30:45.880
prakash.nik: That's a good question.

337
00:30:47.080 --> 00:30:53.679
David Bau: So, I'll share a little perspective on it. It's, you know, this path patching thing is a little funny, it's a, you know, it's this…

338
00:30:54.840 --> 00:30:56.190
David Bau: idea of…

339
00:30:57.710 --> 00:31:07.109
David Bau: you know, it's this hope. You know, the circuit idea and the path patching idea, I think it chases this hope that we might be able to trace

340
00:31:08.280 --> 00:31:19.799
David Bau: Causal chains inside these neural networks, the same way you might chase, like, a mystery, or chase a, you know, chase, like, a bug in a regular traditional computer program.

341
00:31:21.700 --> 00:31:34.630
David Bau: you know, imagine, like, you're, like, a time traveler, right? And you want to, like, I don't know, prevent World War II or something. Then you might go back in time, and you might realize, aha, if I go back to a certain coffee shop.

342
00:31:34.780 --> 00:31:39.229
David Bau: And I just moved this cup of coffee from one side of the table to the other.

343
00:31:39.320 --> 00:31:57.759
David Bau: then… then it turns out that it has this… all these… all these follow-on effects, and it ends up, you know, preventing World War II. It's amazing, like a time travel movie, right? Well, that… that might be the case, but then you wonder, like, how did that happen? Right? There's a lot of steps from here to there, and I think that, like, traditional

344
00:31:58.750 --> 00:32:05.679
David Bau: Traditional patching, like what you guys have been doing for your experiments so far, is like that coffee cup.

345
00:32:06.420 --> 00:32:21.310
David Bau: experiment. It's like, you might get a very strong signal on moving that coffee cup from one side of the table to the other. It's like, oh, yes, this coffee cup turns out very important for World War II. But it doesn't really tell you, like, how it's, you know, how it had an impact.

346
00:32:21.600 --> 00:32:24.789
David Bau: And so it's… it might be a little dissatisfying.

347
00:32:25.270 --> 00:32:33.690
David Bau: And so… so really what you might want to know is the detailed causal chain. It's like, oh, if you move that coffee cup.

348
00:32:34.080 --> 00:32:45.689
David Bau: Then, the immediate thing that happens is that, you know, somebody in that store can't, you know, can't pay for the coffee, or something like that.

349
00:32:45.720 --> 00:32:51.520
David Bau: Or didn't, you know, can't find the coffee, and it fell asleep, you know, in the afternoon.

350
00:32:51.520 --> 00:33:07.010
David Bau: And then because they fell asleep, they missed a meeting, and because they missed a meeting, some conference wasn't organized, and because the conference didn't happen, you know, some other idea didn't happen in the culture, and this thing wasn't invented, and, you know, had all these other effects, right? So you can kind of imagine this long causal chain.

351
00:33:07.380 --> 00:33:10.720
David Bau: And… and traditional patching…

352
00:33:11.370 --> 00:33:27.330
David Bau: doesn't concern itself with the details of the causal chain. It just says, hey, this is important, it has some downstream effect. You know, you might wonder what that is, but you can just see that it has the effect. Whereas what path patching is trying to do is it's trying to say.

353
00:33:27.450 --> 00:33:28.539
David Bau: Well, you know…

354
00:33:29.260 --> 00:33:34.690
David Bau: We're gonna try to figure out what effect it has to move this little cup of coffee, this little red

355
00:33:34.950 --> 00:33:38.800
David Bau: 0.1, and see. You know, what's the effect of that?

356
00:33:39.080 --> 00:33:43.440
David Bau: Well, if you just patch that by itself and you let everything else move.

357
00:33:43.950 --> 00:33:49.960
David Bau: it might have some big downstream effect, right? But what they're asking is, they're asking.

358
00:33:52.410 --> 00:33:55.280
David Bau: You know, why did it have the downstream effect?

359
00:33:55.740 --> 00:34:00.790
David Bau: So, let's say there's somebody who misses… so let's say 0.1 is like the cup of coffee.

360
00:34:02.660 --> 00:34:08.279
David Bau: What 2.0 and 3.1 are, are, like, the person who fell asleep because they didn't get their coffee or something.

361
00:34:08.500 --> 00:34:19.270
David Bau: Right? Like, oh, that's, like, the next downstream effect that you're trying to test. And so you're saying, okay, well, missing the coffee messed up a lot of things in the world.

362
00:34:19.300 --> 00:34:37.809
David Bau: It means that the person who's cleaning up the table now has to wipe the table in a different way, it means that the dishwasher cleans up the coffee in a different way, you know, now there's… it's a full cup, somebody didn't drink it, you know, they made… they made a dollar less profit in the store, and that has some other effects, and so on.

363
00:34:37.820 --> 00:34:40.330
David Bau: But… and so those are all the other green…

364
00:34:40.420 --> 00:34:51.129
David Bau: boxes, it would normally be downstream of 0.1, and what they're saying is they're saying, hey, you know what, I think that… I suspect none of those really matter. What if we move the cup of coffee, but we don't allow

365
00:34:51.389 --> 00:34:58.690
David Bau: Us to see the changes that are downstream of the cup of coffee, other than the things that we think are the decisive ones.

366
00:34:59.530 --> 00:35:01.990
David Bau: So, fix the profit of the store.

367
00:35:02.520 --> 00:35:06.929
David Bau: That's 1.1. Even though the store made one less dollar.

368
00:35:07.060 --> 00:35:10.149
David Bau: Restore them to have the extra dollar for free.

369
00:35:10.170 --> 00:35:28.399
David Bau: Right? You know, fix the… fix the dishwasher experience, that's 1.0. Yeah, even though the coffee was full, change the dishwasher experience a little bit. Pretend, pretend it didn't change at all. That's 1.0. Go to all the things and pretend it didn't have an effect. The only thing that you're gonna allow to have the effect.

370
00:35:28.840 --> 00:35:42.290
David Bau: is, like, 2.0, 3.1, right? Like, just let the… let the effects come through only to these new things, only the specific things. That's… that's the attempt. It's asking this… it's asking, you know, a more…

371
00:35:42.890 --> 00:35:47.689
David Bau: Like, if everything else was the same, kind of experiment.

372
00:35:49.000 --> 00:35:58.240
David Bau: And, anyway, that's what it's attempting to do, to try to make it a little bit more satisfying. I don't know if that helps at all, but I always found it a little…

373
00:35:58.740 --> 00:36:00.740
David Bau: A little confusing, all the setup.

374
00:36:00.970 --> 00:36:02.280
David Bau: Is the way I think of it.

375
00:36:03.110 --> 00:36:07.959
David Bau: Does that make any sense? Anyway, I'll shut up again. Sorry, it's hard to… hard to lecture from remote.

376
00:36:08.860 --> 00:36:13.030
prakash.nik: It's okay, yeah, I think you have a question. Yeah, why don't we start from the end, and then…

377
00:36:13.620 --> 00:36:15.520
prakash.nik: Not from the beginning.

378
00:36:17.130 --> 00:36:21.119
prakash.nik: Like, in the coffee cup experiment, that was, like, starting from the beginning

379
00:36:21.320 --> 00:36:32.800
prakash.nik: And, looking forward, but this would be starting from, I guess, World War II, and then slowly move backwards to get back to the coffee shop? Like, why did we do it that way as opposed to…

380
00:36:33.010 --> 00:36:38.570
prakash.nik: starting from the inputs, and then patching forward?

381
00:36:39.150 --> 00:36:46.780
prakash.nik: Yeah, so I think it's because… let's say if you start by, like, figuring out what is the…

382
00:36:47.460 --> 00:36:51.549
prakash.nik: most important… It's just… it's a big name.

383
00:36:52.320 --> 00:36:56.480
prakash.nik: Well, you don't really have anything…

384
00:36:57.760 --> 00:37:01.580
prakash.nik: You don't really know which are the important set of nodes in your first layer.

385
00:37:02.140 --> 00:37:04.940
prakash.nik: I think the way…

386
00:37:07.920 --> 00:37:12.159
prakash.nik: the way this path patching works, you need to know what are the important downstream

387
00:37:12.470 --> 00:37:16.290
prakash.nik: Nodes in your computational graph that you care about.

388
00:37:16.960 --> 00:37:18.800
prakash.nik: If you don't know that.

389
00:37:19.300 --> 00:37:20.849
prakash.nik: I don't know how you would do that.

390
00:37:23.840 --> 00:37:27.209
prakash.nik: Yeah, so I think that's just how the framework plays out.

391
00:37:27.830 --> 00:37:30.979
prakash.nik: So… Yeah, that's what I was saying.

392
00:37:32.240 --> 00:37:36.789
prakash.nik: I wonder if you could do it, it just would be less efficient. Like, if you only know

393
00:37:37.060 --> 00:37:40.450
prakash.nik: that the outcome would be changed by some edge at the end.

394
00:37:40.680 --> 00:37:44.400
prakash.nik: And if most of them don't have any impact, then you don't need to pursue them.

395
00:37:45.110 --> 00:37:47.290
prakash.nik: Is that right?

396
00:37:47.400 --> 00:37:50.270
prakash.nik: Like, you have… if the…

397
00:37:50.390 --> 00:37:55.550
prakash.nik: output is gonna change. There has to be, like, you know that something needs to change at the…

398
00:37:55.720 --> 00:38:02.269
prakash.nik: And so, if you're like, oh, look, like, most of those nodes don't… like, those edges don't…

399
00:38:03.010 --> 00:38:07.840
prakash.nik: impact anything, then I don't need to look… I don't need to search down…

400
00:38:08.420 --> 00:38:10.529
prakash.nik: look at those edges further.

401
00:38:10.670 --> 00:38:12.969
prakash.nik: Yeah, but you still need to start from there.

402
00:38:13.750 --> 00:38:23.759
prakash.nik: Right, and I'm saying, like, if you start from the end, then you can eliminate… it's just more efficient. Not that you can't do it both ways, it's just that it's more… It should be definitely…

403
00:38:23.950 --> 00:38:27.529
prakash.nik: efficient, and I'm still not sure how you would be able to do that from the start.

404
00:38:28.270 --> 00:38:35.590
prakash.nik: You could, like, just do the full search, you could just, like, brute force it… So let's say this is your input.

405
00:38:37.580 --> 00:38:39.370
prakash.nik: Let's say all of this is your output.

406
00:38:41.960 --> 00:38:45.649
prakash.nik: And then you find some edge which has high…

407
00:38:47.210 --> 00:38:49.010
prakash.nik: Yeah, what would that even mean?

408
00:38:49.760 --> 00:38:57.200
prakash.nik: Yeah, I just, like, you could just do every single one. Just every single edge.

409
00:38:58.890 --> 00:39:06.010
prakash.nik: We could look at every single edge. It doesn't… I'm not saying it, like, would be tractable in any way.

410
00:39:06.130 --> 00:39:12.149
prakash.nik: And I'm not sure if it's intractable. So, yeah, let's say maybe it is.

411
00:39:12.390 --> 00:39:21.559
prakash.nik: You had a question? Yeah, just to go on, you know, in the coffee cup analogy it would be, okay, you move the coffee cup, and then that affects, like, 10,000 different things,

412
00:39:21.700 --> 00:39:29.850
prakash.nik: and each of those things affects, like, 10,000 others, which would be, like, in that case, an intractable search. In a neural network,

413
00:39:30.070 --> 00:39:37.700
prakash.nik: Technically, it's finite, so it's not infinite, but it would make much more sense to say, well, actually, if someone missed this meeting.

414
00:39:38.070 --> 00:39:46.249
prakash.nik: then that would have avoided World War II. And if, like, this bus was late, then this meeting would be avoided, and then you're kind of searching

415
00:39:46.340 --> 00:40:00.539
prakash.nik: in a tractable way, I think… yeah, your… I think your approach would be technically tractable, just because you have a finite amount of things, but it could still be, like, a bit much. Yeah, that's a good point. I mean, it's really expensive.

416
00:40:01.430 --> 00:40:03.340
prakash.nik: Yeah, I think it does, yeah.

417
00:40:04.430 --> 00:40:08.080
prakash.nik: Okay, so we have spent 45 minutes, so I think we've…

418
00:40:08.750 --> 00:40:16.569
prakash.nik: So, essentially, that's the path patching algorithm, and they use this particular algorithm to figure out the circuit for the IOI task.

419
00:40:17.560 --> 00:40:25.849
prakash.nik: So as I said, they start at the end. They try to understand what are the heads which are directly affecting the final logit.

420
00:40:26.040 --> 00:40:32.400
prakash.nik: They apply the path patching algorithm, and these are the… the red ones are basically the most important ones.

421
00:40:34.250 --> 00:40:38.259
prakash.nik: Yeah, these are the first set of heads that they find with the path patching.

422
00:40:39.270 --> 00:40:47.999
prakash.nik: They also tried to understand the functionality of these heads, which I'm not talking a lot about right now. I have a few slides at the end. But essentially, what they found was

423
00:40:48.110 --> 00:40:53.599
prakash.nik: that these are the heads which are attending to the Mary token, and fetching its value.

424
00:40:54.850 --> 00:40:57.120
prakash.nik: That's why they call them Name Mover Heads.

425
00:40:57.870 --> 00:41:03.059
prakash.nik: Then the question becomes, okay, these are the heads which are fetching in the Mary token.

426
00:41:03.410 --> 00:41:10.910
prakash.nik: What are the next set of heads which are actually affecting those heads? Like, how do the name mover heads know that they need to attend to the Mary token?

427
00:41:11.570 --> 00:41:16.119
prakash.nik: So for that, they… what they do is, they apply the path patching algorithm.

428
00:41:17.190 --> 00:41:20.020
prakash.nik: But in this case,

429
00:41:20.200 --> 00:41:24.570
prakash.nik: the target node is basically the query vectors of the name mover heads.

430
00:41:24.960 --> 00:41:29.069
prakash.nik: And then the sender nodes are basically each of the…

431
00:41:29.470 --> 00:41:31.589
prakash.nik: heads in the previous layers of the model.

432
00:41:32.110 --> 00:41:35.069
prakash.nik: Any guesses, why query not key or value?

433
00:41:40.060 --> 00:41:45.750
prakash.nik: The idea here is to find out

434
00:41:46.150 --> 00:41:51.820
prakash.nik: where or how do the name mover heads know to attend to the Mary token.

435
00:41:52.490 --> 00:41:59.950
prakash.nik: The way attention works is, it is the query vector of an attention head which encodes the information about what it needs to attend to.

436
00:42:01.780 --> 00:42:07.060
prakash.nik: And the key vectors encode what each position is actually encoding.

437
00:42:07.200 --> 00:42:09.520
prakash.nik: And… and we know that the…

438
00:42:09.810 --> 00:42:15.239
prakash.nik: So our attention heads attend to the Mary token, that's why the key vector at the last…

439
00:42:15.390 --> 00:42:18.010
prakash.nik: The key vector at the last token doesn't really matter.

440
00:42:18.160 --> 00:42:24.909
prakash.nik: And for the value vectors, we already know that they are fetching in the values from the Mary token.

441
00:42:25.440 --> 00:42:28.120
prakash.nik: So that's, like, the other way to think about it.

442
00:42:30.330 --> 00:42:32.949
prakash.nik: Yeah, so that's basically the next step.

443
00:42:33.070 --> 00:42:38.900
prakash.nik: Using this particular step, they found the next set of attention heads.

444
00:42:39.220 --> 00:42:45.580
prakash.nik: Again, they tried to understand the functionality of these heads, and what they figured out was these are the heads which are

445
00:42:46.460 --> 00:42:54.120
prakash.nik: sending in information about the John token to the name mover heads, which is what the name mover heads are using

446
00:42:54.280 --> 00:43:01.150
prakash.nik: to not attend to John, and that's why they are calling this particular set of heads S-Inhibition heads.

447
00:43:03.000 --> 00:43:05.639
prakash.nik: Yeah, the functionalities are not very important,

448
00:43:06.420 --> 00:43:14.620
prakash.nik: I'll talk a bit about the functionalities at the end, but I'm just trying to show you how you can apply the path patching algorithm to find the set of heads in each step.

449
00:43:14.890 --> 00:43:17.349
prakash.nik: To find the overall circuit.

450
00:43:17.520 --> 00:43:20.570
prakash.nik: And they continue the same step again and again.

451
00:43:21.220 --> 00:43:25.420
prakash.nik: until they actually reach the input, and this is the overall circuit that they found.

452
00:43:27.550 --> 00:43:31.450
prakash.nik: Okay, now we will take questions. I think we'll start with Hayou.

453
00:43:34.660 --> 00:43:36.790
prakash.nik: Yeah, the retraining impact.

454
00:43:36.920 --> 00:43:39.260
prakash.nik: They will… He's hype.

455
00:43:40.030 --> 00:43:42.080
prakash.nik: you know, I… hyphen.

456
00:43:44.120 --> 00:43:52.180
prakash.nik: Are you, if that's correct pronunciation, are you on the Zoom?

457
00:43:55.960 --> 00:44:02.640
prakash.nik: Maybe he's… So the question was, if we retrain GPT-2 from scratch.

458
00:44:02.880 --> 00:44:06.439
prakash.nik: Will we get the same set of… heads, or not.

459
00:44:07.060 --> 00:44:08.879
prakash.nik: That was the first… You'll need to unmute…

460
00:44:11.790 --> 00:44:17.829
prakash.nik: That was the first part of the question. The second part: will the functionalities of those heads remain the same or not?

461
00:44:19.070 --> 00:44:20.220
prakash.nik: What do you guys think?

462
00:44:21.350 --> 00:44:24.020
prakash.nik: Attention weights will change. Okay.

463
00:44:24.200 --> 00:44:27.700
prakash.nik: Depending on the data that we train it on. Let's say it's the same data.

464
00:44:29.330 --> 00:44:30.659
prakash.nik: Exactly same data.

465
00:44:30.830 --> 00:44:39.650
prakash.nik: If there's stochasticity within the training, that could affect it slightly, but…

466
00:44:40.200 --> 00:44:40.930
prakash.nik: Okay.

467
00:44:42.640 --> 00:44:44.040
prakash.nik: What about functionality?

468
00:44:46.920 --> 00:44:49.559
prakash.nik: Functionality should remain the same.

469
00:44:50.420 --> 00:44:51.290
prakash.nik: I think.

470
00:44:54.560 --> 00:44:55.660
prakash.nik: Any other guesses?

471
00:44:56.650 --> 00:45:04.410
prakash.nik: I think at the beginning you said that you expect some circuits to be modular and present in a lot of different networks, so you would expect

472
00:45:04.860 --> 00:45:06.380
prakash.nik: functionally, some…

473
00:45:06.790 --> 00:45:13.649
prakash.nik: heads or circuits to be present in the retrained network, maybe in another place, not in the same spot.

474
00:45:15.440 --> 00:45:19.089
prakash.nik: Yeah, so this is what I would expect, although I've not done the experiment.

475
00:45:19.390 --> 00:45:21.189
prakash.nik: I think the exact heads would change.

476
00:45:21.670 --> 00:45:29.799
prakash.nik: Because of random initialization. I think the initialization would be different, so even if the data is the same.

477
00:45:30.460 --> 00:45:34.240
prakash.nik: I am quite confident that the heads won't be the same.

478
00:45:34.610 --> 00:45:38.520
prakash.nik: You don't think so.

479
00:45:38.880 --> 00:45:44.109
prakash.nik: Okay, yeah, we have not done the experiment, but because of the random initialization, I feel the heads should be different.

480
00:45:44.270 --> 00:45:53.730
prakash.nik: But the functionality should remain the same, since we are training on the same data. So I feel that at least the mechanism of how the model will solve the task should remain the same.

481
00:45:56.310 --> 00:46:01.209
prakash.nik: Okay, now a few other questions. Aria, Ananya, and Jesseba.

482
00:46:01.720 --> 00:46:10.270
prakash.nik: Yeah, I think all three of you had, like, a similar kind of question about the generalizability of the result across different kinds of tasks.

483
00:46:10.620 --> 00:46:22.939
prakash.nik: You wanna state that? Yeah, I guess to me, it was like, how useful is this? Like, for example, in this paper, they looked at this particular example. If instead of having Mary, John, John, we had, like, three names.

484
00:46:23.310 --> 00:46:26.149
prakash.nik: Would, like…

485
00:46:26.380 --> 00:46:37.079
prakash.nik: how is this… yeah, basically, I understand the usefulness of the path-patching algorithm, I feel like that's a useful contribution, but what's the use of looking at this as a circuit?

486
00:46:37.270 --> 00:46:44.880
prakash.nik: Because you have, like, a million circuits for different tasks, and different data, and yeah, I guess others. Yeah, okay,

487
00:46:45.440 --> 00:46:58.840
prakash.nik: Okay. Yeah, so I see it similarly. As I said, I think this paper has two main contributions, the empirical circuit and the method. I think the method is more important.

488
00:46:59.320 --> 00:47:03.980
prakash.nik: Circuit is there, they found some set of heads and the functionality.

489
00:47:04.950 --> 00:47:10.860
prakash.nik: But yeah, I don't think it's super general. I mean… that in itself is gonna…

490
00:47:11.970 --> 00:47:13.039
prakash.nik: It is still useful, though.

491
00:47:13.490 --> 00:47:15.690
prakash.nik: In terms of how you think about it.

492
00:47:18.760 --> 00:47:21.360
prakash.nik: But again, you have to think that… you have to…

493
00:47:21.840 --> 00:47:32.180
prakash.nik: think about it in the context. This was 2022. This was probably the first paper which figured out the circuit in an end-to-end fashion. So from that perspective, I think that was…

494
00:47:32.340 --> 00:47:33.260
prakash.nik: Pretty phenomenal.

495
00:47:36.510 --> 00:47:38.059
prakash.nik: Let's go ahead.

496
00:47:38.340 --> 00:47:53.760
Jesseba Fernando: Wait, sorry, could I… my question was, like, a little bit different. I… I had just asked about, well, here it's, like, there's Mary and John, there's two… two, like, people, and so, I was wondering what would we expect to see if there were more than two?

497
00:47:54.920 --> 00:48:00.589
prakash.nik: Yeah, I think in that setting, what I would expect is something more like entity tracking paper.

498
00:48:01.280 --> 00:48:04.630
prakash.nik: that we have. It's like…

499
00:48:05.040 --> 00:48:15.639
prakash.nik: this is, again, my expectation, that the model would probably use the positional information or the ordering ID information of each person to basically figure out

500
00:48:16.260 --> 00:48:22.049
prakash.nik: Like, different attributes, about that particular person.

501
00:48:22.280 --> 00:48:24.280
prakash.nik: Depending on what you ask about that person.

502
00:48:24.640 --> 00:48:31.939
prakash.nik: I think primarily it will still use that positional or ordering-ID information to answer questions about that particular person.

503
00:48:36.760 --> 00:48:37.560
prakash.nik: Does that answer your question?

504
00:48:37.760 --> 00:48:39.509
Jesseba Fernando: Yeah, yeah, thank you.

505
00:48:42.140 --> 00:48:45.260
prakash.nik: Yeah, Avery, Christopher, and Rice.

506
00:48:45.860 --> 00:48:49.840
prakash.nik: You guys had a generalizability-across-models kind of question.

507
00:48:50.910 --> 00:49:03.569
prakash.nik: Yeah, it was kind of the same as, like, Tony's question about retraining. If we were to retrain GPT-2, or if we were to do this analysis on other models, like, do we see similar… similar-ish circuits? Like, are there…

508
00:49:04.050 --> 00:49:12.259
prakash.nik: As in, heads that feed into name movers, etc., or is it, like, totally different based on different models?

509
00:49:13.110 --> 00:49:17.440
prakash.nik: Yeah, so this is another paper, which was not in the reading list.

510
00:49:18.620 --> 00:49:19.829
prakash.nik: This is from Jack.

511
00:49:20.320 --> 00:49:27.319
prakash.nik: So essentially, in this work, what they did was they tried to study this IOI task in

512
00:49:27.830 --> 00:49:32.230
prakash.nik: like, larger GPT-2 models, like the medium and, I think, large.

513
00:49:32.600 --> 00:49:38.460
prakash.nik: And I think what they showed was, the mechanism remains more or less the same. The models are still using those name mover heads.

514
00:49:39.030 --> 00:49:43.870
prakash.nik: I mean, not the exact same heads, but the way the model is solving the task remains the same.

515
00:49:44.410 --> 00:49:49.049
prakash.nik: And, another main result of this paper is, if you look at

516
00:49:49.360 --> 00:49:53.369
prakash.nik: The significance of this… those particular heads.

517
00:49:53.720 --> 00:49:55.629
prakash.nik: In different tasks.

518
00:49:55.800 --> 00:50:03.900
prakash.nik: I think they studied another task called colored objects, something like: this object is of this color, this object is of this color.

519
00:50:04.220 --> 00:50:07.489
prakash.nik: What is the color of this object? If you have this kind of question.

520
00:50:08.000 --> 00:50:17.249
prakash.nik: and the heads that they've, like, the previous work found for the IUI task is actually being reused for this particular task as well.

521
00:50:18.400 --> 00:50:25.049
prakash.nik: So the same set of heads are being reused for different kinds of tasks.

522
00:50:28.680 --> 00:50:30.449
prakash.nik: Yeah, that's what I wanted to say.

523
00:50:33.580 --> 00:50:37.239
prakash.nik: Can you identify these circuits without knowing the architecture of the model?

524
00:50:39.090 --> 00:50:43.040
prakash.nik: Say we have, like, access to some black-box model, and we have our…

525
00:50:51.210 --> 00:50:56.470
prakash.nik: You'd need at least gradients. If not… if you can't do patching, you still need gradients.

526
00:50:57.450 --> 00:51:02.040
prakash.nik: Yeah, you would still need internal activations, or…

527
00:51:02.350 --> 00:51:08.190
prakash.nik: Like, I meant, like, you don't know what the architecture… like, whether we know the architecture at all, you know, how many layers are there.

528
00:51:08.880 --> 00:51:14.630
prakash.nik: Say we get access to the… I don't know how it's possible, but if you get access to the internal activations, can that help, like…

529
00:51:14.870 --> 00:51:33.339
prakash.nik: backtrack the architecture? So if you have internal activations, can you reconstruct the architecture? Yeah, I think once you have the activations, you basically have to… What about the GPT-OSS open-weight models? Like, we have the model weights, but we don't know the exact architecture.

530
00:51:34.820 --> 00:51:42.070
prakash.nik: You don't know the architecture, but you have the weights? Yes, we have the model weights, but the model architecture is not disclosed.

531
00:51:43.150 --> 00:51:56.239
prakash.nik: Yeah, I've not worked on that… how is that possible? If you know the weights, you basically know the architecture, right? I am not 100% sure of that. Maybe they flattened the matrices.

532
00:51:56.450 --> 00:52:02.759
prakash.nik: They flattened it? Then how do you work with that? I'll talk with you about it later. Yeah, okay, I'm not sure. Yeah.

533
00:52:04.020 --> 00:52:08.189
prakash.nik: Because you basically know the, like, the dimensions of the matrices.

534
00:52:08.740 --> 00:52:10.040
prakash.nik: Every matrix.

535
00:52:10.800 --> 00:52:12.209
prakash.nik: Yeah, that's true.

536
00:52:12.720 --> 00:52:18.040
prakash.nik: But, yeah, the dimensions would constrain the model. Okay, maybe… yeah, exactly.

537
00:52:19.750 --> 00:52:20.410
prakash.nik: Okay.

538
00:52:23.350 --> 00:52:27.100
prakash.nik: Yeah, one other thing is, yeah, so I said the…

539
00:52:27.390 --> 00:52:31.140
prakash.nik: last set of… or the final set of heads which affect the final logit

540
00:52:31.660 --> 00:52:38.640
prakash.nik: are called name overheads. An interesting thing that district found was, if you have laid out those name overheads.

541
00:52:38.900 --> 00:52:42.229
prakash.nik: The model's performance on this particular task does not really decrease.

542
00:52:42.470 --> 00:52:48.130
prakash.nik: What really happens is, some other set of heads comes into the picture, which are not really…

543
00:52:48.490 --> 00:52:54.339
prakash.nik: causally active in the normal state, but when you ablate out the name mover heads, this new set of

544
00:52:54.520 --> 00:52:58.929
prakash.nik: heads comes into the picture, which they are calling backup name mover heads.

545
00:52:59.850 --> 00:53:03.890
prakash.nik: I think Claire had a good, like, an insightful comment.

546
00:53:08.520 --> 00:53:11.580
prakash.nik: comment? I think it was insightful. I didn't know about it.

547
00:53:12.380 --> 00:53:13.870
prakash.nik: It was about neuroscience?

548
00:53:14.770 --> 00:53:17.569
prakash.nik: Oh, redundancy? Redundancy, yes.

549
00:53:18.300 --> 00:53:22.849
prakash.nik: There's a theory in neuroscience that the reason why it's so hard to…

550
00:53:23.270 --> 00:53:30.999
prakash.nik: treat neurodegenerative diseases is because the brain is really good at having redundant pathways for the same function.

551
00:53:31.590 --> 00:53:38.170
prakash.nik: And so, there could be, like, the same kind of redundancy at the circuit level, I thought.

552
00:53:40.050 --> 00:53:42.490
prakash.nik: Yeah, that was insightful, I didn't know about it.

553
00:53:42.610 --> 00:53:44.949
prakash.nik: How does the human brain work?

554
00:53:45.080 --> 00:53:49.819
prakash.nik: So I just wanted to bring out that point to show there might be some

555
00:53:50.570 --> 00:53:54.279
prakash.nik: analogies between what we're looking at and what we know from neuroscience.

556
00:53:56.250 --> 00:53:58.150
prakash.nik: Okay, so we have found the circuit.

557
00:53:58.830 --> 00:54:05.089
prakash.nik: But that's not the end of the game. We have to evaluate the circuit to actually show that we have found something genuine.

558
00:54:06.140 --> 00:54:15.739
prakash.nik: And the paper proposes three metrics, or criteria. The first one is called faithfulness. It's, I think, the easiest one to understand.

559
00:54:15.910 --> 00:54:21.759
prakash.nik: We basically ask this question: is the identified circuit faithful to the underlying model?

560
00:54:22.180 --> 00:54:25.970
prakash.nik: Can I get the performance of the model just from the circuit itself?

561
00:54:26.270 --> 00:54:30.639
prakash.nik: And that's basically the metric. You compute the…

562
00:54:31.250 --> 00:54:34.519
prakash.nik: Final output, like the final response of the circuit.

563
00:54:34.630 --> 00:54:37.380
prakash.nik: And just divide it by the performance of the model.

564
00:54:38.340 --> 00:54:43.620
prakash.nik: And when I say performance of the circuit, what I mean is, you basically ablate out

565
00:54:44.030 --> 00:54:56.810
prakash.nik: everything else which is not part of the circuit. You keep the circuit components intact, but you ablate out all the model components which are not part of the circuit, and then check the final output.

566
00:54:58.340 --> 00:55:03.380
prakash.nik: Yeah, so in this work, the circuit they have found has about 87% faithfulness.

567
00:55:04.830 --> 00:55:08.119
prakash.nik: Yeah, Yuki had a question about implementation.

568
00:55:11.110 --> 00:55:12.690
prakash.nik: Yuki, are you online?

569
00:55:18.770 --> 00:55:24.949
prakash.nik: Maybe not, but I think it's the simplest. You just ablate out all the model components which are not part of the circuit.

570
00:55:26.090 --> 00:55:27.800
prakash.nik: Get the circuit performance?

571
00:55:28.220 --> 00:55:30.350
prakash.nik: And just divide it by the model performance.

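As a hedged sketch of this metric: `run_ablated` below is a hypothetical helper that evaluates the task metric F (say, average logit difference) while ablating every model component outside the given set; only the ratio is the point.

```python
def faithfulness(run_ablated, all_components: set, circuit: set, dataset) -> float:
    # F(circuit): keep only circuit components intact, ablate everything else.
    f_circuit = run_ablated(keep=circuit, data=dataset)
    # F(model): keep everything, i.e. the ordinary full-model performance.
    f_model = run_ablated(keep=all_components, data=dataset)
    # Faithfulness is the fraction of model performance the circuit recovers.
    return f_circuit / f_model
```
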
572
00:55:33.090 --> 00:55:43.249
prakash.nik: The next one is a little bit tricky, it's called completeness. So, the question here is, okay, we found the circuit, but how do we know that we have found all the components of the circuit?

573
00:55:43.770 --> 00:55:46.309
prakash.nik: Is it complete, or are we still missing something?

574
00:55:46.870 --> 00:55:56.330
prakash.nik: Yeah, so the basic idea here is, assume this is the underlying circuit.

575
00:55:56.740 --> 00:56:04.200
prakash.nik: Where you have, like, two redundant components, like, sort of, like, redundant subcircuits, which are getting

576
00:56:05.420 --> 00:56:07.140
prakash.nik: An OR operation, which are…

577
00:56:07.410 --> 00:56:14.069
prakash.nik: combined with an OR operation here before the final output. So you get complete performance even if you find just one of them.

578
00:56:16.010 --> 00:56:19.760
prakash.nik: But that's not what we want. We want to find both C1 and C2.

579
00:56:20.180 --> 00:56:21.210
prakash.nik: No.

580
00:56:21.730 --> 00:56:27.009
prakash.nik: That's basically the idea here. And this is the equation to find the…

581
00:56:27.250 --> 00:56:30.030
prakash.nik: Or to… to get that incompleteness score.

582
00:56:30.410 --> 00:56:37.880
prakash.nik: What this equation means is, K is some subset of… the circuit components.

583
00:56:39.680 --> 00:56:41.990
prakash.nik: C is the circuit, M is the model.

584
00:56:42.960 --> 00:56:46.279
prakash.nik: So you compute the circuit performance,

585
00:56:47.580 --> 00:56:48.670
prakash.nik: ablating out

586
00:56:49.440 --> 00:56:51.110
prakash.nik: that subset K, alright?

587
00:56:51.570 --> 00:56:54.510
prakash.nik: Subtracted by the performance of the model.

588
00:56:54.900 --> 00:56:58.120
prakash.nik: When you ablate out that particular same subset.

589
00:56:59.490 --> 00:57:04.510
prakash.nik: The idea here is… The difference between these two should be minimum.

590
00:57:04.970 --> 00:57:06.909
prakash.nik: If the circuit is complete.

591
00:57:08.780 --> 00:57:15.479
prakash.nik: They should have the same impact. So, for instance, if… C is only C2.

592
00:57:16.110 --> 00:57:19.540
prakash.nik: And K is, let's say, C.

593
00:57:19.850 --> 00:57:21.909
prakash.nik: So, K is C2 as well.

594
00:57:23.210 --> 00:57:29.029
prakash.nik: So if you ablate out this thing completely, you check the performance of

595
00:57:29.230 --> 00:57:32.169
prakash.nik: That particular circuit, you still get complete performance.

596
00:57:33.650 --> 00:57:34.380
prakash.nik: Right?

597
00:57:35.560 --> 00:57:37.479
prakash.nik: Let me…

598
00:57:44.660 --> 00:57:51.290
Jasmine C.: Wait, Nikhil, I have a question. Like, how much of this is specific to, like, circuits formed

599
00:57:52.110 --> 00:57:55.709
Jasmine C.: I'm so sorry, using backpropagation?

600
00:57:55.820 --> 00:57:59.280
Jasmine C.: Like, is that, like, one of the assumptions we're making here?

601
00:58:01.450 --> 00:58:06.039
prakash.nik: Yeah, I think we're talking about, like, neural… like, normal neural networks which are trained from backpropagation.

602
00:58:06.470 --> 00:58:11.129
Jasmine C.: Okay. So, like, if we use a different, like, updating, like,

603
00:58:11.270 --> 00:58:15.070
Jasmine C.: Like, mechanism, like, it… this doesn't… isn't necessarily true.

604
00:58:16.250 --> 00:58:19.719
prakash.nik: I think you'd still be able to apply this… this method.

605
00:58:22.070 --> 00:58:24.080
prakash.nik: Is anyone not using backpropagation?

606
00:58:24.080 --> 00:58:26.900
David Bau: Yeah, Jasmine, did you have another update mechanism in mind?

607
00:58:27.210 --> 00:58:33.300
Jasmine C.: Like, I was thinking of, like, like, maybe, like, Hebbian learning or something, like, where you're using the co-activations, or,

608
00:58:33.680 --> 00:58:37.479
Jasmine C.: Yeah, I think that was one of the ones I was thinking of.

609
00:58:39.880 --> 00:58:46.019
David Bau: Yeah, I think that probably not a lot is known, because so much of the, like, so much of the field

610
00:58:46.280 --> 00:58:47.350
David Bau: is just…

611
00:58:47.490 --> 00:58:57.649
David Bau: for the last, you know, whole bunch of years, you know, decade and a half has all been back propagation. So, it might be that the question you have has a different answer, but I think that people might not know.

612
00:58:59.120 --> 00:59:00.580
Jasmine C.: Thanks, everyone.

613
00:59:02.050 --> 00:59:04.670
prakash.nik: Okay, so we were talking about this incompleteness score.

614
00:59:04.850 --> 00:59:06.920
prakash.nik: Can't play on the widgets.

615
00:59:44.940 --> 00:59:46.489
prakash.nik: David, do you remember this one?

616
00:59:49.170 --> 00:59:50.470
David Bau: What's the question?

617
00:59:52.830 --> 00:59:53.819
prakash.nik: Did you…

618
00:59:53.990 --> 00:59:58.729
prakash.nik: Basically, if you remember this particular criterion, the completeness one, do you want to answer this particular question?

619
00:59:59.630 --> 01:00:04.230
David Bau: Oh, the… the completeness and incompleteness score, like, how to… how to… how to justify

620
01:00:05.770 --> 01:00:08.590
David Bau: Why you would score it this way? Is that what you're asking?

621
01:00:08.770 --> 01:00:09.400
prakash.nik: Yeah.

622
01:00:09.980 --> 01:00:14.680
David Bau: If I'm being perfectly honest, I… you know.

623
01:00:15.400 --> 01:00:22.850
David Bau: no, I, I don't, I don't totally remember, why we, why, why it was proposed.

624
01:00:23.140 --> 01:00:25.489
David Bau: To do it this way. I think that…

625
01:00:26.310 --> 01:00:31.459
David Bau: You know, the intuition that you gave is the right one, which is basically you're asking the question.

626
01:00:32.450 --> 01:00:36.089
David Bau: you know, Once you have a circuit.

627
01:00:36.570 --> 01:00:44.500
David Bau: It's the same question that you would have if somebody published a paper saying, hey, you know what, I know what caused World War II, it's this coffee cup, and

628
01:00:45.200 --> 01:00:50.439
David Bau: You know, and the whole downstream chain of events that come from this. And then people would say, well.

629
01:00:51.100 --> 01:00:53.630
David Bau: I don't believe it. I think that there's some other…

630
01:00:53.760 --> 01:00:57.510
David Bau: path, a causal chain.

631
01:00:57.740 --> 01:01:00.320
David Bau: That also explains World War II.

632
01:01:01.000 --> 01:01:07.620
David Bau: And… and I think that your coffee cup theory is only a partial… a partial,

633
01:01:08.360 --> 01:01:16.439
David Bau: you know, explanation for the whole thing. You know, there's 11 other, you know, key variables that are involved. They all have to be involved.

634
01:01:16.700 --> 01:01:23.910
David Bau: And so I think that this idea is trying to capture the idea that if you…

635
01:01:24.110 --> 01:01:27.070
David Bau: If you try to form the circuit in any other way.

636
01:01:28.170 --> 01:01:32.419
David Bau: If you had a complete circuit, then no matter what you removed.

637
01:01:32.940 --> 01:01:35.820
David Bau: It would make the explanation worse.

638
01:01:36.680 --> 01:01:37.940
prakash.nik: I got it, I got it.

639
01:01:38.610 --> 01:01:50.430
prakash.nik: So, the rationale here is, this is C when K… is removed.

640
01:01:53.210 --> 01:01:54.650
prakash.nik: I think they're saying, like.

641
01:01:54.970 --> 01:02:04.399
prakash.nik: If you say that the circuit for this task is C1, so if C equals C1, and if you set K to C1 as well, it will have a low incompleteness score.

642
01:02:04.810 --> 01:02:17.549
prakash.nik: Or, like, a high incompleteness score, sorry. Because C minus C1 will be… you will have, like, 0, right, because it cannot complete the task. The second one, M minus C1,

643
01:02:18.420 --> 01:02:19.840
prakash.nik: It would still complete the task.

644
01:02:20.250 --> 01:02:27.479
prakash.nik: Because you have C2, so you have a high incompleteness score, so that means you can't set C to C1. It's an incomplete circuit.

645
01:02:28.320 --> 01:02:38.440
prakash.nik: Yeah, and the reason why C, when K is set to C1… basically, why this first term would be low, is because, when you are…

646
01:02:39.100 --> 01:02:40.360
prakash.nik: You have nothing else.

647
01:02:40.590 --> 01:02:46.389
prakash.nik: Yeah, exactly. So you're basically ablating out the entire model here.

648
01:02:46.550 --> 01:02:51.020
prakash.nik: Assuming this C1 and C2 are the only components in the model.

649
01:02:52.410 --> 01:02:58.639
prakash.nik: And when you're computing the performance of the circuit, what you do is, you only keep the components of the circuit.

650
01:02:58.780 --> 01:03:01.959
prakash.nik: intact, while ablating out everything else, okay?

651
01:03:02.200 --> 01:03:04.440
prakash.nik: So, assume that K was not here.

652
01:03:05.120 --> 01:03:14.569
prakash.nik: And C was C1, okay? So what that would mean is, we are only keeping C1 active, and C2 would be completely ablated out.

653
01:03:15.420 --> 01:03:19.229
prakash.nik: Okay? But now, let's say K is C1 as well.

654
01:03:19.410 --> 01:03:29.350
prakash.nik: K is C1. So now, in addition to ablating out C2, C1 will be ablated out as well. So there is nothing left in the model, so the model performance will be almost zero.

655
01:03:29.860 --> 01:03:30.600
prakash.nik: Okay?

656
01:03:31.070 --> 01:03:31.770
prakash.nik: Pardon?

657
01:03:33.720 --> 01:03:34.920
prakash.nik: What about M?

658
01:03:35.980 --> 01:03:41.059
prakash.nik: For the second term, we are only ablating out C1.

659
01:03:41.200 --> 01:03:44.529
prakash.nik: Which… or whatever circuit that you decide, whatever C…

660
01:03:44.870 --> 01:03:55.059
prakash.nik: That you decide to be the circuit, whether C1 or C2. Assume it is C1, so we will only ablate out C1, while C2 still remains intact. That's why the model performance will still be high.

661
01:03:56.100 --> 01:03:58.350
prakash.nik: That's why their absolute difference will still be high.

662
01:03:58.600 --> 01:04:03.930
prakash.nik: And therefore, we can say that whatever…

663
01:04:04.120 --> 01:04:08.129
prakash.nik: Or, that whatever circuit we have found is actually incomplete.

664
01:04:10.040 --> 01:04:12.439
prakash.nik: Does that make sense? Do you want me to repeat again?

665
01:04:13.740 --> 01:04:15.979
prakash.nik: We were computing F of C and F of M.

666
01:04:16.620 --> 01:04:18.850
prakash.nik: C is the circuit that you identified.

667
01:04:19.470 --> 01:04:21.549
prakash.nik: K, sorry, and M is the model.

668
01:04:22.880 --> 01:04:23.560
prakash.nik: Okay.

669
01:04:24.130 --> 01:04:31.999
prakash.nik: And you find F of C by keeping the components in C intact, while ablating out everything else.

670
01:04:33.660 --> 01:04:35.350
prakash.nik: And what this is saying is.

671
01:04:35.820 --> 01:04:43.049
prakash.nik: In addition to ablating out everything else, you ablate out that subset of C's

672
01:04:43.400 --> 01:04:45.360
prakash.nik: components, which is K, as well.

673
01:04:47.800 --> 01:04:48.520
prakash.nik: Yeah.

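A hedged sketch of the incompleteness score just walked through, written as |F(C \ K) - F(M \ K)| with the same hypothetical `run_ablated` helper (a `remove` argument ablates the given subset on top of everything outside `keep`).

```python
def incompleteness(run_ablated, all_components: set, circuit: set, K: set, dataset) -> float:
    # F(C \ K): circuit performance with the subset K additionally ablated.
    f_circuit_minus_k = run_ablated(keep=circuit, remove=K, data=dataset)
    # F(M \ K): full-model performance with the same subset K ablated.
    f_model_minus_k = run_ablated(keep=all_components, remove=K, data=dataset)
    # A complete circuit keeps this gap small for every subset K of the circuit.
    return abs(f_circuit_minus_k - f_model_minus_k)
```
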
674
01:04:51.470 --> 01:04:57.019
prakash.nik: Okay, so that was… Completeness. Now, this is the third one, which is minimality.

675
01:04:57.440 --> 01:05:00.309
prakash.nik: So you can have a very large circuit.

676
01:05:01.270 --> 01:05:03.170
prakash.nik: Assume your entire model is the circuit.

677
01:05:03.500 --> 01:05:14.009
prakash.nik: then you would have a very high faithfulness and a high completeness score for your circuit, but that's not what we want. We want the circuit to be…

678
01:05:14.970 --> 01:05:16.990
prakash.nik: Something of a smaller…

679
01:05:17.250 --> 01:05:24.459
prakash.nik: order. We want it to be as minimal as possible. So that's the third criterion that we use to evaluate our circuit.

680
01:05:24.760 --> 01:05:26.820
prakash.nik: This is, I think, even more tricky.

681
01:05:29.830 --> 01:05:31.150
prakash.nik: So what you do is…

682
01:05:31.410 --> 01:05:37.400
prakash.nik: In this particular… this is, again, the equation to find the minimality score for each particular component in your circuit.

683
01:05:38.970 --> 01:05:41.279
prakash.nik: and assume that particular component is V.

684
01:05:41.900 --> 01:05:46.370
prakash.nik: What you do is… you remove

685
01:05:47.510 --> 01:05:49.820
prakash.nik: some subset K?

686
01:05:50.000 --> 01:05:52.619
prakash.nik: along with that particular component, from the circuit.

687
01:05:54.220 --> 01:06:01.290
prakash.nik: And it's subtracted from… just removing that subset from the circuit.

688
01:06:02.730 --> 01:06:07.189
prakash.nik: Just to give you an example, let's say we are looking at one particular name mover head.

689
01:06:07.880 --> 01:06:11.400
prakash.nik: And assume that the K is basically all the, like, group of…

690
01:06:11.950 --> 01:06:14.700
prakash.nik: all the name mover heads, except that particular one

691
01:06:15.330 --> 01:06:19.939
prakash.nik: that we are looking into. So what the first expression would mean is,

692
01:06:20.620 --> 01:06:24.170
prakash.nik: we are getting rid of all the name mover heads.

693
01:06:26.120 --> 01:06:29.779
prakash.nik: Yeah, basically, we're getting rid of all the name mover heads, that's the first expression.

694
01:06:30.350 --> 01:06:31.610
prakash.nik: The sec… second?

695
01:06:31.780 --> 01:06:40.440
prakash.nik: The second expression is, we are getting rid of all the name mover heads but that particular one.

696
01:06:40.800 --> 01:06:41.530
prakash.nik: Nope.

697
01:06:43.200 --> 01:06:48.000
prakash.nik: And since we have found that that particular head is… let's say that particular head is…

698
01:06:48.290 --> 01:06:53.229
prakash.nik: Positively relevant. That particular head will make the model get to the correct answer.

699
01:06:54.690 --> 01:07:03.929
prakash.nik: And that's why the latter term will have a higher score, while the former will have almost no score, no performance, because we have basically ablated all of them.

700
01:07:04.030 --> 01:07:09.389
prakash.nik: And that's why, that particular component would be critical.

701
01:07:09.880 --> 01:07:14.920
prakash.nik: It's like… it is a critical component that you need to keep in the circuit.

702
01:07:15.050 --> 01:07:21.049
prakash.nik: But if that particular score is not that high, that means that that component is not really that critical.

703
01:07:21.200 --> 01:07:26.100
prakash.nik: and that's basically what this minimality criteria is about.

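And a matching sketch of the minimality score for a single component v, again with the hypothetical `run_ablated` helper; K is a subset chosen to expose v's contribution (in the example above, all the other name mover heads).

```python
def minimality(run_ablated, circuit: set, v, K: set, dataset) -> float:
    # F(C \ (K + {v})): ablate the subset K and the component v itself.
    f_without_v = run_ablated(keep=circuit, remove=K | {v}, data=dataset)
    # F(C \ K): ablate only the subset K, keeping v active.
    f_with_v = run_ablated(keep=circuit, remove=K, data=dataset)
    # A large gap means v does real work and has to stay in the circuit.
    return abs(f_without_v - f_with_v)
```
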
704
01:07:28.490 --> 01:07:29.320
prakash.nik: Okay.

705
01:07:30.120 --> 01:07:31.290
prakash.nik: I'm gonna go ahead.

706
01:07:31.860 --> 01:07:36.770
prakash.nik: But any questions on this previous paper? I'm now gonna go to the next paper.

707
01:07:40.090 --> 01:07:53.539
prakash.nik: So we learned about path patching, we found the circuit of GPT-2 which is responsible for doing the IOI task, and then we evaluated the circuit on three particular criteria: faithfulness,

708
01:07:54.580 --> 01:08:01.809
prakash.nik: completeness and minimality. Even if you didn't get completeness and minimality, it's okay, as long as you got faithfulness, I think. That's okay.

709
01:08:03.720 --> 01:08:08.280
prakash.nik: Okay, now coming to the second paper, yeah, I think…

710
01:08:08.920 --> 01:08:10.170
prakash.nik: How do I pronounce it?

711
01:08:10.340 --> 01:08:11.650
prakash.nik: Kyung?

712
01:08:12.020 --> 01:08:13.040
prakash.nik: That's fact.

713
01:08:13.250 --> 01:08:15.979
prakash.nik: Yes, I'm here. Are you online?

714
01:08:18.479 --> 01:08:20.580
prakash.nik: Yeah, basically the question was, should we…

715
01:08:21.600 --> 01:08:23.450
prakash.nik: What should be the focus of…

716
01:08:23.750 --> 01:08:31.719
prakash.nik: mechanistic interpretability? Should we still focus on human-driven, or should we focus more on automated interpretability?

717
01:08:33.370 --> 01:08:41.940
prakash.nik: So, I'm in favor of automated interpretability. You have seen in the previous paper how difficult that is. It was so difficult to explain.

718
01:08:42.370 --> 01:08:47.869
prakash.nik: Imagine how difficult that would be to do. So nobody wants to do it. Everyone just wants an agent.

719
01:08:48.700 --> 01:08:58.130
prakash.nik: you can just write a prompt to find a circuit, and the agent will give you the circuit. That's probably the ideal goal. So I'm very pro…

720
01:08:58.310 --> 01:09:00.660
prakash.nik: Automated interpretability, but yeah, again.

721
01:09:00.850 --> 01:09:06.699
prakash.nik: We are still not there yet, so we still need to find a few things manually, but hopefully.

722
01:09:07.689 --> 01:09:11.899
prakash.nik: A few years down the line, we have an agent which can do most…

723
01:09:14.080 --> 01:09:17.960
prakash.nik: So yeah, in this paper, they basically…

724
01:09:19.569 --> 01:09:31.230
prakash.nik: automate one particular step in the, broadly speaking, steps that we took in the circuit discovery process. The first step was… and this is something that we have

725
01:09:32.970 --> 01:09:39.660
prakash.nik: recommended for this class as well. Come up with a particular task, come up with a particular data set and the metrics and the model which can do it.

726
01:09:41.310 --> 01:09:51.399
prakash.nik: Then you need to decide the level of analysis that you want to perform, whether you want to work on layers, neurons, heads, or maybe subspace level.

727
01:09:52.580 --> 01:09:58.320
prakash.nik: So once you have done step 2, you basically have the computational graph.

728
01:09:58.460 --> 01:10:02.700
prakash.nik: Now you need to prune that computational graph; you have to figure out the subgraph.

729
01:10:03.020 --> 01:10:07.139
prakash.nik: That's the third step. So this particular paper basically automates that third step.

730
01:10:09.180 --> 01:10:13.309
prakash.nik: The algorithm is pretty straightforward, so that's why I just pasted in the algorithm itself.

731
01:10:13.800 --> 01:10:16.770
prakash.nik: What you do is, once you have that computational graph.

732
01:10:17.230 --> 01:10:22.469
prakash.nik: you topologically sort it in the reverse order. So you start… you still start from the end.

733
01:10:23.440 --> 01:10:25.579
prakash.nik: It's still starting from the logit?

734
01:10:25.690 --> 01:10:29.609
prakash.nik: And then you look at each of the edges, which are directly connecting.

735
01:10:30.210 --> 01:10:31.500
prakash.nik: with the final logit.

736
01:10:32.910 --> 01:10:39.060
prakash.nik: What you do is, for each of those edges, you basically remove it from the computational graph.

737
01:10:39.620 --> 01:10:43.889
prakash.nik: And see if the final output changes significantly or not.

738
01:10:44.130 --> 01:10:47.149
prakash.nik: If the final output changes more than a particular threshold.

739
01:10:47.350 --> 01:10:52.980
prakash.nik: That basically means that that particular edge is important for doing this particular task.

740
01:10:53.110 --> 01:10:54.980
prakash.nik: So you keep that edge in the circuit.

741
01:10:55.140 --> 01:10:58.250
prakash.nik: You basically do the same process again and again.

742
01:10:58.750 --> 01:11:01.649
prakash.nik: And each time you keep an edge,

743
01:11:01.880 --> 01:11:06.930
prakash.nik: You basically add that, but, like, it's…

744
01:11:07.530 --> 01:11:10.989
prakash.nik: Node, or its parent node, as part of your queue.

745
01:11:11.570 --> 01:11:16.179
prakash.nik: And then you do the same process at every step. It's basically like breadth-first search, but…

746
01:11:16.660 --> 01:11:20.699
prakash.nik: The… the criteria for keeping an edge.

747
01:11:20.880 --> 01:11:25.629
prakash.nik: In the circuit or not is basically defined by the threshold value.

748
01:11:26.890 --> 01:11:27.800
prakash.nik: Is that clear?

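A hedged sketch of this greedy, breadth-first pruning loop (the ACDC idea). The `graph` object with `output_node`, `incoming_edges`, `remove_edge`, and `restore_edge`, and the `run_metric` callable, are hypothetical stand-ins; the actual paper compares outputs with a KL-divergence criterion rather than this simplified absolute difference.

```python
from collections import deque

def acdc(graph, run_metric, threshold: float) -> set:
    baseline = run_metric()              # output metric of the intact graph
    circuit_edges, visited = set(), set()
    queue = deque([graph.output_node])   # start from the logits and work back
    while queue:
        node = queue.popleft()
        if node in visited:
            continue
        visited.add(node)
        for edge in graph.incoming_edges(node):
            graph.remove_edge(edge)      # try patching this edge out
            if abs(baseline - run_metric()) > threshold:
                # Removing it changes the output too much: keep the edge
                # and queue its source node for the next round of search.
                graph.restore_edge(edge)
                circuit_edges.add(edge)
                queue.append(edge.source)
            # Otherwise the edge stays pruned out of the circuit.
    return circuit_edges
```
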
749
01:11:28.410 --> 01:11:30.430
prakash.nik: And is it the same threshold we…

750
01:11:30.880 --> 01:11:37.139
prakash.nik: I think, at least in the paper, they have not used a dynamic threshold.

751
01:11:37.360 --> 01:11:38.630
prakash.nik: They have used a constant.

752
01:11:42.600 --> 01:11:43.960
prakash.nik: Yeah, it's pretty straightforward.

753
01:11:45.370 --> 01:11:46.879
prakash.nik: So this is the algorithm.

754
01:11:47.060 --> 01:11:55.430
prakash.nik: So, they evaluated, or they basically found the circuit for a bunch of tasks.

755
01:11:55.990 --> 01:11:59.079
prakash.nik: Including the IOI task, and a few other tasks.

756
01:11:59.470 --> 01:12:06.930
prakash.nik: They basically use the ROC curve to show… to basically compare

757
01:12:07.040 --> 01:12:14.400
prakash.nik: their proposed method, ACDC, with a few other methods, which are not super important.

758
01:12:14.560 --> 01:12:19.890
prakash.nik: But I just wanted to put it here, because I wanted to show you that they are evaluating it as an ROC curve.

759
01:12:20.020 --> 01:12:23.049
prakash.nik: Which basically means that they're assuming that there is a ground truth.

760
01:12:24.520 --> 01:12:28.220
prakash.nik: so, what they did was, they basically went out

761
01:12:28.760 --> 01:12:30.640
prakash.nik: and looked at the previous literature

762
01:12:31.110 --> 01:12:34.010
prakash.nik: To figure out which are the previous works.

763
01:12:34.270 --> 01:12:36.770
prakash.nik: Where circuits are already available.

764
01:12:37.370 --> 01:12:39.549
prakash.nik: They use their method to find out

765
01:12:39.690 --> 01:12:46.770
prakash.nik: the circuit for the same task, and then basically compare those two circuits to come up with this ROC curve.

766
01:12:48.980 --> 01:13:00.180
prakash.nik: Courtney had a question. How do you get the ground truth? Is that clear? Yeah, how do you know that something is the ground truth? It's basically coming from the literature. You assume that whatever previous published work is there is correct.

767
01:13:00.760 --> 01:13:06.429
prakash.nik: But, yeah, again, there is a caveat to that. Maybe even the published work is not exactly correct.

768
01:13:06.730 --> 01:13:11.259
prakash.nik: I think even in the paper, in this one or the next one, there was a point I remember.

769
01:13:11.950 --> 01:13:14.320
prakash.nik: There's a huge number of edges in the model.

770
01:13:14.490 --> 01:13:17.019
prakash.nik: Then a particular task would have, like…

771
01:13:18.940 --> 01:13:20.440
prakash.nik: Yeah, it will have a lot of edges.

772
01:13:20.900 --> 01:13:31.450
prakash.nik: So it is not unlikely that maybe even the researchers whose papers are published missed out on a few edges. So you need to take the ground truth with a pinch of salt.

773
01:13:36.110 --> 01:13:36.970
prakash.nik: Okay.

774
01:13:37.160 --> 01:13:38.750
prakash.nik: Now I'm gonna ask questions.

775
01:13:39.630 --> 01:13:40.960
prakash.nik: Problems with ACDC.

776
01:13:41.440 --> 01:13:45.140
prakash.nik: What kinds of problems do you see associated with this particular method?

777
01:13:46.160 --> 01:13:48.190
prakash.nik: Is it not going to scale?

778
01:13:48.290 --> 01:13:51.479
prakash.nik: Yeah, that's right. It's kinda… it is very computationally expensive.

779
01:13:52.640 --> 01:13:59.690
prakash.nik: It grows linearly with the number of edges in the model. So it is very slow.

780
01:14:02.350 --> 01:14:04.910
prakash.nik: other… Problems?

781
01:14:09.440 --> 01:14:20.550
prakash.nik: But… Why is that wrong? So in the previous course I had, we…

782
01:14:20.680 --> 01:14:26.729
prakash.nik: saw how minimal changes in the input of, like, a simple MLP

783
01:14:26.880 --> 01:14:28.819
prakash.nik: affect the output.

784
01:14:28.980 --> 01:14:33.240
prakash.nik: And then, at each layer, the,

785
01:14:34.340 --> 01:14:43.229
prakash.nik: the change would grow by a larger amount. But in that case, we were giving a…

786
01:14:43.750 --> 01:14:47.529
prakash.nik: The problem was making the bounds tighter.

787
01:14:47.920 --> 01:14:51.009
prakash.nik: But anyways, it would grow larger.

788
01:14:51.260 --> 01:15:01.220
prakash.nik: So, coming from that intuition, I think… the threshold should be different.

789
01:15:01.480 --> 01:15:03.049
prakash.nik: Thank you for negatives.

790
01:15:03.730 --> 01:15:06.610
prakash.nik: Yeah, I think, intuitively, I agree with that. Let me…

791
01:15:07.860 --> 01:15:11.450
prakash.nik: But the way I would, say that would be…

792
01:15:12.700 --> 01:15:19.339
prakash.nik: Even at the end of the computational graph, since they are closer to the final output, they might have a higher effect on the final output.

793
01:15:20.050 --> 01:15:22.840
prakash.nik: But that's why they're… the impact

794
01:15:23.430 --> 01:15:33.229
prakash.nik: of changing or perturbing those edges would be pretty high, in comparison to perturbing earlier-layer edges. That's why maybe the threshold should be different.

795
01:15:33.940 --> 01:15:36.309
prakash.nik: Yeah, I agree with that, although that's not on the

796
01:15:36.570 --> 01:15:39.559
prakash.nik: slide, but intuitively, I agree with that.

797
01:15:41.680 --> 01:15:43.110
prakash.nik: Any other issues?

798
01:15:45.270 --> 01:15:49.399
prakash.nik: Yeah, okay. I guess if you do so many comparisons…

799
01:15:49.830 --> 01:15:51.810
prakash.nik: You will find a lot of random.

800
01:15:52.490 --> 01:15:54.109
prakash.nik: significant results.

801
01:15:56.040 --> 01:15:58.099
prakash.nik: So you will have a lot of spurious

802
01:15:58.300 --> 01:16:02.910
prakash.nik: circuits that don't actually mean anything. You just found them because you did so many tests.

803
01:16:04.690 --> 01:16:15.859
prakash.nik: You've got to explain that a little bit. So, when you say so many tests, what do you mean? I mean, when you do statistical tests, if you do thousands of tests, for example.

804
01:16:16.070 --> 01:16:18.150
prakash.nik: Thousands of people, and surveys.

805
01:16:18.460 --> 01:16:19.970
prakash.nik: I don't know, something.

806
01:16:20.240 --> 01:16:28.350
prakash.nik: Like, in at least one of them, even if there is no effect, you will find a correlation, significant correlation, because of how

807
01:16:29.420 --> 01:16:30.410
prakash.nik: chance works.

808
01:16:31.540 --> 01:16:35.120
prakash.nik: So, I feel like this is gonna be the same problem since you…

809
01:16:36.570 --> 01:16:41.180
prakash.nik: sort of, you look for any possible significant

810
01:16:41.300 --> 01:16:45.170
prakash.nik: result, and a lot of them might be just… random chance.

811
01:16:45.970 --> 01:16:51.360
prakash.nik: I'm not sure where stochasticity would come into this… Oh, I see. This algorithm.

812
01:16:51.880 --> 01:16:53.170
prakash.nik: It's pretty deterministic.

813
01:16:56.590 --> 01:16:58.510
David Bau: I do think that this is a…

814
01:16:59.850 --> 01:17:02.970
David Bau: reasonable form of a concern, so I don't want to dismiss it.

815
01:17:03.520 --> 01:17:08.920
David Bau: I think this is one of the issues with…

816
01:17:09.760 --> 01:17:14.559
David Bau: These large-scale interpretability methods in general, the things that say.

817
01:17:14.850 --> 01:17:21.979
David Bau: Scan over everything, test everything, and whether everything is, like, thousands of choices, or millions of permutations, or something like this.

818
01:17:22.810 --> 01:17:26.400
David Bau: Then you're definitely… Even if it's not statistical.

819
01:17:26.650 --> 01:17:31.490
David Bau: in nature, right? You know, even if it's not about probabilities or whatever, you definitely have this situation where

820
01:17:32.430 --> 01:17:33.299
David Bau: you know.

821
01:17:33.550 --> 01:17:36.069
David Bau: Well, you… you're… if you… if you look enough.

822
01:17:36.460 --> 01:17:38.189
David Bau: You're gonna find what you look for.

823
01:17:39.020 --> 01:17:41.609
David Bau: And then you're left with this question, is it…

824
01:17:42.230 --> 01:17:44.099
David Bau: Is that meaningful? Is it surprising?

825
01:17:45.040 --> 01:17:48.739
David Bau: I think that the… You know, the typical…

826
01:17:49.540 --> 01:17:54.180
David Bau: solution for that. If you… if you use one of these methods in your paper.

827
01:17:55.220 --> 01:18:01.199
David Bau: Which I think is fine, to use one of these interpretability, like, automated interpretability methods.

828
01:18:01.730 --> 01:18:07.529
David Bau: But the thing that you want to do is not just present it as saying, hey, I used a method, and look, found this circuit, and here's the picture.

829
01:18:08.110 --> 01:18:15.600
David Bau: You wanna, you wanna triangulate it. You wanna say, okay, well, the circuit was proposed by this algorithm, it was found by this algorithm.

830
01:18:16.310 --> 01:18:20.330
David Bau: But we don't totally trust it, because it looked for so many millions of things, and…

831
01:18:20.800 --> 01:18:23.350
David Bau: Who knows, right? It just found something.

832
01:18:23.700 --> 01:18:26.190
David Bau: We, we brought some new data.

833
01:18:26.460 --> 01:18:36.419
David Bau: Some new tests, to the question, and we validated that the circuit has causal effects, or this circuit, you know, tells us something.

834
01:18:36.530 --> 01:18:38.829
David Bau: About this new data that wasn't used.

835
01:18:39.020 --> 01:18:44.840
David Bau: in the original scan. I think that that's the typical… Way that machine learning people

836
01:18:45.280 --> 01:18:48.130
David Bau: Try to avoid being fooled.

837
01:18:48.540 --> 01:18:54.100
David Bau: by the statistical… Traps is by increasing

838
01:18:54.750 --> 01:19:00.000
David Bau: the data. And you'll see that machine learning people are usually less…

839
01:19:01.400 --> 01:19:03.090
David Bau: They tend to be less careful.

840
01:19:03.690 --> 01:19:08.540
David Bau: About the, you know, the careful statistical analysis that you see in a lot of

841
01:19:09.150 --> 01:19:11.610
David Bau: Like, social science research.

842
01:19:12.070 --> 01:19:14.950
David Bau: You know, oddly enough, because there's supposed to be…

843
01:19:15.410 --> 01:19:17.769
David Bau: You know, the technical machine learning people.

844
01:19:17.980 --> 01:19:20.840
David Bau: But I think the reason why is because

845
01:19:20.980 --> 01:19:37.740
David Bau: they… they tend to, instead of solving the problem by making sure the error bars are just small enough or whatever, right? They just completely squash the error bars by… by bringing in 10 times more data or something like that. Right? Oh, you know, we found this using 1,000 sentences, and then we tested it on another 20,000 sentences or something.

846
01:19:37.840 --> 01:19:39.340
David Bau: And,

847
01:19:39.710 --> 01:19:57.669
David Bau: You know, if you, if you work through the error bars on these things, it usually solves the problem. And maybe unfortunate that a lot of the papers don't actually work through the error bars that carefully. But, but that's, that's usually the solution. Does that sort of answer the question a little bit?

848
01:19:57.800 --> 01:19:59.690
prakash.nik: At least from a practical point of view?

849
01:20:01.260 --> 01:20:09.069
prakash.nik: Another question? Oh, maybe I can move on, but just, I was thinking, would you expect this algorithm to find any circuits in a just initialized network?

850
01:20:11.350 --> 01:20:16.319
prakash.nik: A randomly initialized model, would it even be able to perform the task?

851
01:20:17.820 --> 01:20:23.179
prakash.nik: It shouldn't, right? But you're, like, checking for all possible circuits. What would be the threshold there?

852
01:20:23.410 --> 01:20:28.690
prakash.nik: I mean… here, in this particular, let's say, the IOI example, we use logit difference.

853
01:20:29.140 --> 01:20:34.889
prakash.nik: But the logit difference of a randomly initialized model wouldn't really be meaningful.

854
01:20:35.250 --> 01:20:38.280
prakash.nik: You will find something, but yeah, it could be completely wrong.

855
01:20:40.760 --> 01:20:43.400
prakash.nik: Yeah, I'm so… this is like a,

856
01:20:43.970 --> 01:20:49.939
prakash.nik: I have a set of thoughts, but just on the question of the notions, like, sometimes people talk about causes in terms of, like,

857
01:20:49.950 --> 01:21:05.310
prakash.nik: necessary and/or sufficient. It's like, what's a necessary cause? What's a sufficient cause? What's necessary and sufficient, or necessary but not sufficient, or sufficient but not necessary? I think, like, what we have here are necessary and sufficient causes.

858
01:21:05.330 --> 01:21:09.390
prakash.nik: The redundancy network could be sufficient, but not necessary.

859
01:21:09.710 --> 01:21:18.760
prakash.nik: And then… maybe there's some causes that are necessary but not sufficient. Like, they… like, two edges might be…

860
01:21:20.430 --> 01:21:27.150
prakash.nik: Causally contributing when they're both activated, but when you go one by one, both will show up as not.

861
01:21:27.460 --> 01:21:34.739
prakash.nik: necessary, or as not causally contributing to the outcome. I guess the question might be,

862
01:21:35.880 --> 01:21:42.230
prakash.nik: How does this method handle those sorts of situations, or does that tend to not arise?

863
01:21:42.690 --> 01:21:46.569
prakash.nik: I think it's a very valid point. I think it does not take in… like, it does not…

864
01:21:47.720 --> 01:21:49.020
prakash.nik: Solve the problem.

865
01:21:49.240 --> 01:21:51.470
prakash.nik: It looks at one edge at a time.

866
01:21:51.920 --> 01:21:55.389
prakash.nik: So if there are, like, 3 edges which are working in combination,

867
01:21:56.340 --> 01:22:08.770
prakash.nik: if you patch 3 of them combined, like, all of them together, you see a certain effect on the final output, but if you patch only one of them, it would be ignored, which means you're probably discarding all three of them, but…

868
01:22:09.100 --> 01:22:14.130
prakash.nik: In actuality, you should have kept all three of them. I think that is one of the limitations of this approach.

869
01:22:21.590 --> 01:22:33.730
prakash.nik: Yeah, scalability is an issue. Again, the other, like, problem with ACDC is it was unable to find negative components, which are, like, the components that the IOI people showed,

870
01:22:34.270 --> 01:22:38.780
prakash.nik: These are the components which negatively affect the performance, but these are…

871
01:22:38.990 --> 01:22:40.630
prakash.nik: the ones that actually lower the score.

872
01:22:40.910 --> 01:22:45.639
prakash.nik: And this particular algorithm cannot find those.

873
01:22:46.360 --> 01:22:48.689
prakash.nik: And there was an issue with the evaluation as well.

874
01:22:49.090 --> 01:22:50.829
prakash.nik: The ground truth could be inaccurate.

875
01:22:51.830 --> 01:22:53.530
prakash.nik: I think those are the four main…

876
01:22:53.800 --> 01:22:58.349
prakash.nik: And the few others that we talked about are the problems with this ACDC paper.

877
01:22:59.230 --> 01:23:12.779
prakash.nik: Yeah, I think I heard from somebody, this 10-minute figure… I'm not sure, I've not tried it myself, but I think I heard from probably the first author that to get the IOI circuit on GPT-2 using this method would take about 10 minutes.

878
01:23:13.400 --> 01:23:20.700
prakash.nik: So, I guess you guys are working on 7-billion and 70-billion-parameter models, so I think this is out of reach. I don't think you can use ACDC, but…

879
01:23:21.840 --> 01:23:24.549
prakash.nik: Okay, now coming to our final paper.

880
01:23:25.380 --> 01:23:29.229
prakash.nik: And I think this paper solves all the problems that the previous paper had.

881
01:23:30.420 --> 01:23:32.979
prakash.nik: So let's look at each of them.

882
01:23:33.460 --> 01:23:34.590
prakash.nik: One by one.

883
01:23:34.770 --> 01:23:36.860
prakash.nik: So let's start with the scalability issue.

884
01:23:36.980 --> 01:23:40.459
prakash.nik: I think Gabriel talked about attribution.

885
01:23:40.850 --> 01:23:47.939
prakash.nik: methods last week, and I think he did touch upon component attribution.

886
01:23:49.340 --> 01:23:54.470
prakash.nik: But I'm still gonna repeat it, just for completeness.

887
01:23:57.710 --> 01:24:01.140
prakash.nik: So, the main idea here is to use

888
01:24:01.320 --> 01:24:05.400
prakash.nik: a first-order Taylor approximation to find the effect of a patch.

889
01:24:06.010 --> 01:24:10.500
prakash.nik: So we basically use,

890
01:24:10.700 --> 01:24:14.389
prakash.nik: This particular equation to find what would be the…

891
01:24:14.570 --> 01:24:19.489
prakash.nik: significance or the causal effect of patching a particular edge inside the computational graph.

892
01:24:21.000 --> 01:24:22.649
prakash.nik: Instead of actually doing the patching.

893
01:24:22.950 --> 01:24:31.879
prakash.nik: What that means is we can find out the importance of a particular edge with just two forward passes and one backward pass, because

894
01:24:32.270 --> 01:24:34.980
prakash.nik: This particular equation just needs 3 terms.

895
01:24:35.160 --> 01:24:39.950
prakash.nik: this z prime is basically the…

896
01:24:40.610 --> 01:24:44.809
prakash.nik: Sort of, like, the activation of that particular edge on the corrupt example.

897
01:24:45.090 --> 01:24:48.410
prakash.nik: This one is the activation on the clean example.

898
01:24:48.640 --> 01:24:55.210
prakash.nik: And this is… the gradient of the loss that we are considering, with respect to that particular edge.

899
01:24:55.980 --> 01:25:00.330
prakash.nik: That means we just need two forward passes and one backward pass, irrespective of the model size.

900
01:25:00.590 --> 01:25:04.379
prakash.nik: So this is constant cost; it does not really grow with the

901
01:25:04.560 --> 01:25:07.950
prakash.nik: number of edges in the model, so it is very efficient.

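A minimal sketch of this first-order score, assuming we already have the edge activation on the clean run (z), on the corrupt run (z'), and the gradient of the loss with respect to that edge; all names here are illustrative, not from the paper's codebase:

```python
import torch

def eap_score(z_clean: torch.Tensor,
              z_corrupt: torch.Tensor,
              grad_loss_wrt_z: torch.Tensor) -> torch.Tensor:
    # First-order Taylor estimate of the effect of patching this edge:
    # score ~= (z' - z) . dL/dz, built from the three terms named above.
    return ((z_corrupt - z_clean) * grad_loss_wrt_z).sum()

# Toy usage: one clean forward pass, one corrupt forward pass, and one
# backward pass give z, z', and dL/dz for every edge at once, so the cost
# does not grow with the number of edges.
z = torch.randn(8, requires_grad=True)   # clean activation of one edge
loss = (z ** 2).sum()                    # stand-in for the task loss
loss.backward()                          # one backward pass -> z.grad
z_prime = torch.randn(8)                 # activation on the corrupt run
print(eap_score(z.detach(), z_prime, z.grad))
```
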
902
01:25:09.990 --> 01:25:15.659
prakash.nik: But it has its own problems, empirically.

903
01:25:15.980 --> 01:25:20.289
prakash.nik: It does not work that well. And theoretically as well,

904
01:25:21.550 --> 01:25:26.690
prakash.nik: In the previous… in this particular slide, if you remember this equation,

905
01:25:27.760 --> 01:25:32.550
prakash.nik: if our, let's say, Z is in a flat region.

906
01:25:36.070 --> 01:25:39.200
prakash.nik: If the gradient value at Z is near zero.

907
01:25:39.470 --> 01:25:45.410
prakash.nik: What that would mean is the edge importance, the importance score, would be almost equal to zero.

908
01:25:46.320 --> 01:25:52.239
prakash.nik: But it might, like, it might be the case that, okay, maybe the gradient at the Z value is zero, but the Z prime value,

909
01:25:52.460 --> 01:25:58.620
prakash.nik: which is the activation of the edge on the counterfactual, or the corrupt example, is significant.

910
01:25:59.020 --> 01:26:03.100
prakash.nik: So we don't want… so there's sort of a conflict between

911
01:26:03.420 --> 01:26:06.230
prakash.nik: these two observations, like, these two…

912
01:26:06.740 --> 01:26:12.589
prakash.nik: interpretations: on one example it is saying that the edge is important, while on the other example it is saying it's not important.

913
01:26:12.990 --> 01:26:17.359
prakash.nik: What we do is we basically interpolate between these two particular samples.

914
01:26:19.630 --> 01:26:27.210
prakash.nik: yeah. So instead of taking in the activations from just

915
01:26:27.500 --> 01:26:31.539
prakash.nik: the clean one and the corrupt one. We do an interpolation between them, just…

916
01:26:31.690 --> 01:26:49.589
prakash.nik: draw a straight line, and divide the straight line into M number of steps, and then use the activation at each of those steps to compute the gradients. So instead of one gradient number for each edge, you will have M, sorry, M number of gradients for each edge, and then you can combine those M gradients

917
01:26:49.750 --> 01:26:51.760
prakash.nik: to find the importance of this edge. And this is…

918
01:26:52.440 --> 01:26:56.769
prakash.nik: justified theoretically as well, and empirically, this has been shown to be working

919
01:26:57.000 --> 01:26:59.559
prakash.nik: Much better than just using a single gradient.

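A rough sketch of that interpolation idea (the integrated-gradients variant), with a toy stand-in for the loss; in the real method the gradient would come from the model's task loss:

```python
import torch

def eap_ig_score(z_clean, z_corrupt, loss_fn, m=10):
    # Average the gradient at m points on the straight line between the
    # clean and corrupt activations, then dot with (z' - z), instead of
    # relying on a single gradient that may sit in a flat region.
    total_grad = torch.zeros_like(z_clean)
    for k in range(1, m + 1):
        z_k = (z_clean + (k / m) * (z_corrupt - z_clean)).clone()
        z_k.requires_grad_(True)
        loss_fn(z_k).backward()
        total_grad += z_k.grad
    return ((z_corrupt - z_clean) * (total_grad / m)).sum()

# Toy usage with a stand-in quadratic loss:
z, z_prime = torch.randn(8), torch.randn(8)
print(eap_ig_score(z, z_prime, lambda a: (a ** 2).sum(), m=5))
```
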
920
01:27:01.420 --> 01:27:04.930
prakash.nik: So this solves the scalability issue of ACDC.

921
01:27:05.820 --> 01:27:10.109
prakash.nik: Actually, there was another problem with ACDC I want to mention.

922
01:27:10.350 --> 01:27:14.459
prakash.nik: The edges that you find

923
01:27:15.050 --> 01:27:17.030
prakash.nik: May not be connected with each other.

924
01:27:17.750 --> 01:27:24.810
prakash.nik: So you may find one edge which does not really have any edge connecting from it.

925
01:27:25.200 --> 01:27:27.889
prakash.nik: So, it might not be a connected graph itself.

926
01:27:28.120 --> 01:27:36.710
prakash.nik: So that was a problem with ACDC. This paper solves it by using a specific greedy approach to

927
01:27:37.300 --> 01:27:41.480
prakash.nik: get the graph. So the first step of this particular method is, okay, you apply

928
01:27:41.630 --> 01:27:47.489
prakash.nik: edge attribution patching, you get a score for each edge in the computational graph. Now, you need to use those

929
01:27:48.550 --> 01:27:57.170
prakash.nik: weights, or importance scores, to basically extract out a subgraph of this computational graph.

930
01:27:58.240 --> 01:28:02.630
prakash.nik: In ACDC, what they do is they basically do

931
01:28:03.230 --> 01:28:16.199
prakash.nik: a sorting of the weights, or sorting of the edges based on the weights, and they pick up the top-n edges, which is why it could be disconnected. But in this particular paper, they do something more sophisticated, like.

932
01:28:16.700 --> 01:28:18.459
prakash.nik: They do a little bit more.

933
01:28:18.710 --> 01:28:22.320
prakash.nik: More than that.

934
01:28:22.990 --> 01:28:26.770
prakash.nik: So they… basically what they use is a greedy approach.

935
01:28:26.930 --> 01:28:32.390
prakash.nik: Again, they start with the logits. Then, for each of the parents,

936
01:28:32.740 --> 01:28:37.259
prakash.nik: on each of the nodes, which is directly connect… on each of the edges which are directly connected to the

937
01:28:37.540 --> 01:28:39.100
prakash.nik: logits.

938
01:28:41.070 --> 01:28:45.500
prakash.nik: Compute the maximum score. Keep the maximum-score edge.

939
01:28:45.640 --> 01:28:47.180
prakash.nik: Add it to the queue.

940
01:28:47.700 --> 01:28:49.430
prakash.nik: And repeat the same process again.

941
01:28:49.730 --> 01:28:55.919
prakash.nik: Till the queue becomes empty. It's almost like Dijkstra's algorithm, but instead of minimizing, you're maximizing.

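A sketch of that greedy, Dijkstra-like extraction, under my own assumptions (an explicit edge budget, string node names); the real implementation may differ:

```python
import heapq
from collections import defaultdict

def greedy_subgraph(edges, scores, sink="logits", budget=50):
    # Start from the logits node; repeatedly pop the highest-|score| edge
    # reachable from nodes already included, so the subgraph stays connected.
    incoming = defaultdict(list)
    for parent, child in edges:
        incoming[child].append((parent, child))
    heap = [(-abs(scores[e]), e) for e in incoming[sink]]
    heapq.heapify(heap)                       # max-heap via negated scores
    chosen, visited = [], {sink}
    while heap and len(chosen) < budget:
        _, (parent, child) = heapq.heappop(heap)
        chosen.append((parent, child))
        if parent not in visited:             # expand toward the inputs
            visited.add(parent)
            for e in incoming[parent]:
                heapq.heappush(heap, (-abs(scores[e]), e))
    return chosen

# Toy usage: taking absolute values keeps strongly negative edges too.
edges = [("a", "b"), ("b", "logits"), ("c", "logits"), ("a", "c")]
scores = {("a", "b"): 0.2, ("b", "logits"): -0.9,
          ("c", "logits"): 0.4, ("a", "c"): 0.1}
print(greedy_subgraph(edges, scores, budget=3))
```
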
942
01:28:59.020 --> 01:29:02.010
prakash.nik: So this solves the disconnected components issue.

943
01:29:02.550 --> 01:29:09.189
prakash.nik: And since it uses the absolute score, not keeping the positive ones only,

944
01:29:09.520 --> 01:29:17.529
prakash.nik: we will also figure out the negative components as well, because the negative components' absolute value will still be high.

945
01:29:20.740 --> 01:29:22.270
prakash.nik: No, I don't… yeah, I think…

946
01:29:24.560 --> 01:29:28.640
prakash.nik: Yeah, I think the last issue with that ACDC paper was the evaluation itself.

947
01:29:29.080 --> 01:29:34.400
prakash.nik: It used ROC curve, which requires ground truth, which could be messy.

948
01:29:34.510 --> 01:29:39.370
prakash.nik: So instead of doing that, this paper basically used faithfulness,

949
01:29:39.510 --> 01:29:43.230
prakash.nik: the metric we talked about earlier. They use that

950
01:29:43.620 --> 01:29:48.749
prakash.nik: to evaluate the circuits that they found with the method.

951
01:29:50.420 --> 01:29:54.040
prakash.nik: The exact results are not super important here, I think.

952
01:29:54.520 --> 01:30:02.209
prakash.nik: The main takeaway is… yeah, the integrated gradient method is working slightly better than the non-integrated one.

953
01:30:02.760 --> 01:30:06.880
prakash.nik: Again, it is not very clean. It varies from task to task as well.

954
01:30:18.390 --> 01:30:19.789
prakash.nik: Okay, so the…

955
01:30:20.110 --> 01:30:30.590
prakash.nik: Yeah, the faithfulness of the circuits identified by the integrated gradient method seems to be, in general, working better, although there were some edge cases.

956
01:30:31.550 --> 01:30:33.350
prakash.nik: But, the overlap?

957
01:30:33.600 --> 01:30:40.730
prakash.nik: between the circuits that were identified using the integrated gradients and the actual ground truth,

958
01:30:40.840 --> 01:30:43.920
prakash.nik: the ground truth from the previous work, was actually low.

959
01:30:44.160 --> 01:30:46.240
prakash.nik: This is what this figure is showing.

960
01:30:47.470 --> 01:30:54.950
prakash.nik: I should have added that legend earlier. So the dotted ones are the overlap score.

961
01:30:55.270 --> 01:31:02.969
prakash.nik: Or the average overlap, for the non-integrated gradient one, and the solid lines are for the integrated gradient one.

962
01:31:03.150 --> 01:31:06.290
prakash.nik: As you can see, the dotted lines have higher overlap.

963
01:31:06.620 --> 01:31:10.170
prakash.nik: Especially for the larger number of components.

964
01:31:10.530 --> 01:31:13.010
prakash.nik: So these two are sort of, like, conflicting results.

965
01:31:13.390 --> 01:31:26.840
prakash.nik: If you think about it, integrated gradient is giving you more faithful circuits, but it's not overlapping with the previous ground-truth circuits. Can you remind me of the faithfulness of the non-integrated gradient circuits?

966
01:31:27.940 --> 01:31:34.720
prakash.nik: So… this… Blue one should be the… Right.

967
01:31:36.300 --> 01:31:37.710
prakash.nik: These are the six…

968
01:31:42.890 --> 01:31:44.529
prakash.nik: These are the six tasks.

969
01:31:45.460 --> 01:31:46.460
prakash.nik: You may come.

970
01:31:47.570 --> 01:31:50.650
prakash.nik: Yeah, you can see that I'll show them that.

971
01:31:51.110 --> 01:31:55.179
prakash.nik: Except in the IOI for some numbers.

972
01:31:55.410 --> 01:31:58.439
prakash.nik: In most of the other cases, the blue one stays below the origin.

973
01:32:04.820 --> 01:32:11.910
prakash.nik: Yeah, so these are, like, two conflicting results. While faithfulness seems to be better for integrated gradients, overlap is not.

974
01:32:12.360 --> 01:32:13.420
prakash.nik: So what should we do?

975
01:32:13.570 --> 01:32:17.820
prakash.nik: Should we use, or should we…

976
01:32:18.770 --> 01:32:21.990
prakash.nik: Use faithfulness, or should we use overlap?

977
01:32:22.200 --> 01:32:27.289
prakash.nik: to evaluate the circuits identified by a new method? Let's say you come up with some method.

978
01:32:27.450 --> 01:32:30.929
prakash.nik: Yeah, maybe you traverse from the start instead of the end,

979
01:32:31.310 --> 01:32:37.679
prakash.nik: then, probably, somehow, it works, and you want to evaluate that method. How would you evaluate it?

980
01:32:37.830 --> 01:32:41.469
prakash.nik: Would you use overlap, or would you use faithfulness?

981
01:32:43.690 --> 01:32:44.679
prakash.nik: Keep this talk.

982
01:32:44.890 --> 01:32:46.109
prakash.nik: I'm gonna have a table sometime.

983
01:32:47.330 --> 01:32:50.829
prakash.nik: What do you think? Overlap is better, or faithfulness?

984
01:32:52.040 --> 01:32:53.300
prakash.nik: I think that's an easy question.

985
01:32:55.170 --> 01:32:56.760
prakash.nik: We have a couple of things.

986
01:32:57.210 --> 01:32:58.190
prakash.nik: Yeah, bye.

987
01:33:05.270 --> 01:33:07.099
prakash.nik: Breaking its formal.

988
01:33:07.320 --> 01:33:09.470
prakash.nik: When the ground truth is wrong.

989
01:33:10.700 --> 01:33:15.299
prakash.nik: Yeah, that is what it's… that,

990
01:33:16.310 --> 01:33:20.660
prakash.nik: We can't be 100% sure about the ground truth, especially with the larger models.

991
01:33:21.000 --> 01:33:22.520
prakash.nik: Ground truths could be messy.

992
01:33:23.010 --> 01:33:30.899
prakash.nik: But the other reason why I put a little bit more faith in faithfulness

993
01:33:31.010 --> 01:33:33.989
prakash.nik: It's because of the causal nature of it.

994
01:33:34.140 --> 01:33:37.640
prakash.nik: Overlap is not really causal in nature, but faithfulness is…

995
01:33:38.320 --> 01:33:42.119
prakash.nik: And I generally believe results which are more causal in nature.

996
01:33:44.580 --> 01:33:52.320
prakash.nik: This is, like, one specific example here. So let's say you find these two particular circuits. Yeah, let's say this is a circuit from the new method.

997
01:33:53.300 --> 01:33:56.050
prakash.nik: This is the circuit from the old method.

998
01:33:56.600 --> 01:34:03.660
prakash.nik: With this configuration, the overlap will still be 50%, but the faithfulness will be zero.

999
01:34:04.630 --> 01:34:07.189
prakash.nik: And that's why this 50% might be deceiving.

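A tiny illustration of why that 50% can be deceiving, using made-up component names and one common set-based definition of overlap:

```python
def overlap(found, reference):
    # Fraction of the reference circuit recovered: purely set-based,
    # so it carries no causal information about model behavior.
    return len(set(found) & set(reference)) / len(set(reference))

new_circuit = {"head_1", "head_2"}   # hypothetical circuit from a new method
old_circuit = {"head_2", "head_3"}   # hypothetical ground-truth circuit
print(overlap(new_circuit, old_circuit))  # 0.5

# Faithfulness, by contrast, would rerun the model with only new_circuit
# active and compare task performance against the full model, a causal
# test that can come out zero even while the overlap above reads 50%.
```
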
1000
01:34:09.330 --> 01:34:12.910
prakash.nik: Yeah, that's why I would put more faith in faithfulness.

1001
01:34:13.640 --> 01:34:16.379
prakash.nik: I think my confusion is, like, even if you…

1002
01:34:16.940 --> 01:34:30.429
prakash.nik: I mean, like, you included wrong components or something in your circuit, so it's picking up totally wrong stuff, but maybe, like, say you had ground truth you could overlap with… like, I feel like you could still do…

1003
01:34:30.940 --> 01:34:50.919
prakash.nik: analysis on those components; you're just gonna have to be like, some of these might just be wrong, but we can generally characterize the kind of… is that, like… Yeah, I think… yeah. My answer would be that I would put more faith in faithfulness, but if you have ground truth, there is no harm in doing it. I mean, you can always compute

1004
01:34:50.930 --> 01:34:57.480
prakash.nik: the overlap; it's a pretty cheap metric to compute. So, yeah, if you have the ground truth,

1005
01:34:57.890 --> 01:34:59.630
prakash.nik: You can just record both the numbers in the paper.

1006
01:35:00.790 --> 01:35:03.139
prakash.nik: But yeah, when possible, you could report both metrics.

1007
01:35:07.380 --> 01:35:09.659
prakash.nik: Yeah, I think that was their main takeaway.

1008
01:35:10.280 --> 01:35:17.680
prakash.nik: to not think too much about the overlap metric, which is one of the easy traps people fall into, and to think more…

1009
01:35:18.270 --> 01:35:21.009
prakash.nik: about causality and the intervention-based metrics.

1010
01:35:22.590 --> 01:35:23.290
prakash.nik: Okay.

1011
01:35:24.270 --> 01:35:26.539
prakash.nik: Okay, so all those four issues are solved.

1012
01:35:26.970 --> 01:35:29.999
prakash.nik: And now… Are we good?

1013
01:35:30.250 --> 01:35:34.639
prakash.nik: Have you found… But have we achieved our end goal?

1014
01:35:36.500 --> 01:35:44.639
prakash.nik: So we started with this question: can we reverse engineer neurons and their connections to understand the underlying algorithms? And this is what we get

1015
01:35:44.960 --> 01:35:48.120
prakash.nik: From these kinds of circuit discovery methods.

1016
01:35:48.710 --> 01:35:50.080
prakash.nik: Is that what we want?

1017
01:35:52.080 --> 01:35:53.329
prakash.nik: Yeah, that's a question for you.

1018
01:35:53.940 --> 01:35:56.499
prakash.nik: It's your time to talk.

1019
01:35:59.330 --> 01:36:01.319
prakash.nik: I guess we still gotta figure out what the…

1020
01:36:01.530 --> 01:36:05.259
prakash.nik: each part of the circuit actually does; it could be a bunch of different things.

1021
01:36:06.550 --> 01:36:07.750
prakash.nik: Yeah, exactly.

1022
01:36:08.870 --> 01:36:12.320
prakash.nik: Yeah, we don't want that kind of… Network.

1023
01:36:13.170 --> 01:36:18.050
prakash.nik: Essentially, what we want is some more… underlying mechanism.

1024
01:36:18.280 --> 01:36:25.620
prakash.nik: Where you can actually explain, in the English language, what is actually going on inside the model, rather than just: it's this particular head and that particular head.

1025
01:36:26.120 --> 01:36:30.259
prakash.nik: Essentially, that's what we want. We want to find that underlying mechanism, not just

1026
01:36:30.530 --> 01:36:33.650
prakash.nik: the set of model components which are involved in this particular task.

1027
01:36:34.920 --> 01:36:47.280
prakash.nik: Question. Is that… is that a hard task? Like, trying to… given, for GPT-2, you have, for all possible tasks… or not all possible tasks, but for a substantial amount of tasks, you have

1028
01:36:47.540 --> 01:36:54.750
prakash.nik: the underlying circuits. Do you think it's a hard task to be able to identify patterns of what the model is doing?

1029
01:36:56.040 --> 01:36:59.529
prakash.nik: I'm like… Yeah, I think it's still difficult.

1030
01:36:59.840 --> 01:37:02.959
prakash.nik: The reason why it is difficult is because it is still not scalable.

1031
01:37:03.290 --> 01:37:08.889
prakash.nik: You still need to think about potential hypotheses for what these model components are actually doing.

1032
01:37:09.010 --> 01:37:10.169
prakash.nik: Come up with

1033
01:37:10.610 --> 01:37:12.490
prakash.nik: causal experiments to verify it.

1034
01:37:13.380 --> 01:37:16.659
prakash.nik: Even if it's not causal, let's say it's correlational analysis,

1035
01:37:16.780 --> 01:37:19.150
prakash.nik: You still need to come up with a hypothesis and test it.

1036
01:37:19.250 --> 01:37:20.910
prakash.nik: And that's for each task.

1037
01:37:22.040 --> 01:37:28.030
prakash.nik: Even for GPT-2, I think it would be a lot of work, but certainly for the 7B and 70B models, it would be very tedious.

1038
01:37:28.560 --> 01:37:32.410
prakash.nik: And there is no clear way to automate it.

1039
01:37:32.620 --> 01:37:33.920
prakash.nik: At least as of now.

1040
01:37:34.260 --> 01:37:37.959
prakash.nik: That's why… It is a difficult task.

1041
01:37:38.420 --> 01:37:45.609
prakash.nik: But just because it is difficult, I don't necessarily think it's an unachievable goal.

1042
01:37:45.860 --> 01:37:47.740
prakash.nik: So contributions here are also valuable.

1043
01:37:48.580 --> 01:37:54.259
prakash.nik: I think a lot of companies as well, like, I think any company now who is doing interpretability, or mech interp,

1044
01:37:54.490 --> 01:37:59.789
prakash.nik: is putting some of their resources into this, based on my knowledge.

1045
01:38:01.330 --> 01:38:03.249
prakash.nik: So I think it's one of the hot topics.

1046
01:38:06.000 --> 01:38:11.769
prakash.nik: Yeah, that was my last point. Finding the circuit is just the start.

1047
01:38:12.710 --> 01:38:15.469
prakash.nik: And essentially, you need to figure out the algorithm.

1048
01:38:15.780 --> 01:38:17.830
prakash.nik: They can stay alive, if you can pass.

1049
01:38:19.240 --> 01:38:24.700
prakash.nik: Yeah, I think the comment on the right side is a similar question, but I think we covered it, actually.

1050
01:38:26.150 --> 01:38:37.269
prakash.nik: Okay, these are, like, more recent papers on circuit discovery. So if I were to find circuits for some model and task today, I think I would not

1051
01:38:37.470 --> 01:38:41.900
prakash.nik: implement it from scratch in Python. That would be too much work.

1052
01:38:42.080 --> 01:38:43.830
prakash.nik: what I will do is I…

1053
01:38:44.500 --> 01:38:50.170
prakash.nik: I would probably go through the codebases of, you know, these papers,

1054
01:38:51.140 --> 01:38:57.050
prakash.nik: try to see if I can get some circuit on the task that I care about. I would start with the first paper.

1055
01:38:57.570 --> 01:38:59.930
prakash.nik: One advantage of that paper is

1056
01:39:00.400 --> 01:39:07.599
prakash.nik: you get circuits across token positions. So, for both the ACDC and the…

1057
01:39:08.490 --> 01:39:19.340
prakash.nik: the EAP-IG paper, they do not really find the circuit for different… they do not find the model components that are positionally relevant, for earlier tokens. They only look at the last token.

1058
01:39:19.820 --> 01:39:28.719
prakash.nik: But that's probably not the full picture. If you want to look into the earlier tokens, the first paper actually does that, lets you do that.

1059
01:39:30.780 --> 01:39:34.260
prakash.nik: And, these two papers are from Anthropic.

1060
01:39:34.730 --> 01:39:38.739
prakash.nik: I think this is… this was the first paper where they tried

1061
01:39:38.860 --> 01:39:42.150
prakash.nik: They basically came up with their attribution graph.

1062
01:39:42.570 --> 01:39:52.549
prakash.nik: on a replacement model, and their replacement model had something called transcoders, which are similar to SAEs, but for MLPs.

1063
01:39:53.000 --> 01:39:59.839
prakash.nik: So they had this replacement model, and then they found out the graph,

1064
01:40:00.130 --> 01:40:13.830
prakash.nik: the circuit, on that replacement model. This particular new paper is basically tooling built on top of that, that you can use for your project. It does require you to have transcoders.

1065
01:40:14.310 --> 01:40:18.240
prakash.nik: This is the last paper, from Transluce, that I think I wanted to mention.

1066
01:40:18.730 --> 01:40:26.119
prakash.nik: They do use a replacement model, but they are not working in SAE or transcoder space; they're still working in neuron space.

1067
01:40:26.630 --> 01:40:34.680
prakash.nik: So their replacement model actually gets rid of all the non-linearities in the model.

1068
01:40:35.090 --> 01:40:39.600
prakash.nik: So then the attribution graph that you get is more accurate.

1069
01:40:39.980 --> 01:40:46.140
prakash.nik: Yeah, I think that's the main difference between this paper and…

1070
01:40:46.480 --> 01:40:48.340
prakash.nik: The underlying ideas are pretty similar.

1071
01:40:50.070 --> 01:40:51.689
prakash.nik: I agree, that's okay.

1072
01:40:58.260 --> 01:40:58.990
prakash.nik: Nope.

1073
01:41:22.670 --> 01:41:46.629
prakash.nik: And it's not like, it's like, oh, they're…

1074
01:41:46.680 --> 01:41:51.780
prakash.nik: We're all different in the same way. They're not all different.

