WEBVTT

1
00:00:01.880 --> 00:00:02.820
Liu Ziyin: Okay?

2
00:00:03.020 --> 00:00:04.210
David Bau: And then we're bid?

3
00:00:04.500 --> 00:00:06.450
Liu Ziyin: Yeah.

4
00:00:07.980 --> 00:00:09.909
Liu Ziyin: And then we'll get to your audio.

5
00:00:11.090 --> 00:00:14.590
Liu Ziyin: And, yeah, that's it.

6
00:00:14.830 --> 00:00:17.960
Liu Ziyin: All right.

7
00:00:18.740 --> 00:00:20.549
Liu Ziyin: Do you guys want lights off or lights on?

8
00:00:22.460 --> 00:00:26.469
Liu Ziyin: The audio's good? Okay. You can kind of see?

9
00:00:26.550 --> 00:00:27.530
Liu Ziyin: Okay?

10
00:00:27.530 --> 00:00:47.229
Liu Ziyin: Do you have a chair, Chris? So, wait, wait, we need to get some more chairs, let's just do a split… Okay, we'll just take a minute and get a little set up here, so… Would you prefer to sit or stand? I'll stand. Your backpack? Okay. That's my backpack. You have a chair if you want it. Okay, I'll sit if I want. You can put your backpack down if you want.

11
00:00:49.390 --> 00:00:50.460
Liu Ziyin: Indigo.

12
00:00:51.340 --> 00:00:52.280
Liu Ziyin: Two, right?

13
00:00:56.200 --> 00:01:10.440
Liu Ziyin: That's what I'm thinking. Would you like a… maybe after. Okay. Yes.

14
00:01:11.020 --> 00:01:20.200
Liu Ziyin: Ziyin, what year… what year are you? What year? Yeah. Oh, so I'm a postdoc at MIT, yes. And where were you… and where were you before?

15
00:01:20.490 --> 00:01:30.980
Liu Ziyin: So I did my PhD in Japan, at the University of Tokyo, so I'm a physicist by training. I already learned that there are a few physicists here.

16
00:01:31.220 --> 00:01:41.520
Liu Ziyin: So, before that, I was at Carnegie Mellon. I did physics and math as an undergrad, yes. Oh, I see. Yes. That's great. Amazing. And so…

17
00:01:41.650 --> 00:01:48.949
Liu Ziyin: So, it's an interesting time. I feel like…

18
00:01:49.540 --> 00:01:58.200
Liu Ziyin: you know that your field is getting interesting when the physicist is show up.

19
00:01:58.560 --> 00:02:00.420
Liu Ziyin: Are you guys okay?

20
00:02:01.000 --> 00:02:02.060
Liu Ziyin: Okay.

21
00:02:02.720 --> 00:02:12.430
Liu Ziyin: All right, we'll figure this out. So, we'll put everybody here. And, okay.

22
00:02:13.360 --> 00:02:14.210
Liu Ziyin: 150.

23
00:02:15.300 --> 00:02:16.160
Liu Ziyin: Alright.

24
00:02:18.290 --> 00:02:34.199
Liu Ziyin: It showed up, right? For some reason. There's a remote here. Can I see the remote, you guys? Yeah, we'll play with it.

25
00:02:34.960 --> 00:02:36.560
Liu Ziyin: Oh, something's going on.

26
00:02:38.170 --> 00:02:44.060
Liu Ziyin: Maybe this is… Oh, okay. Maybe it's a hardware problem. Okay. Okay.

27
00:02:44.390 --> 00:02:48.879
Liu Ziyin: Okay, will you share your screen on Zoom? Oh, oh, sorry, sorry. It's okay.

28
00:02:49.170 --> 00:02:49.960
Liu Ziyin: Mmm.

29
00:03:04.880 --> 00:03:05.780
Liu Ziyin: Wonderful.

30
00:03:08.160 --> 00:03:11.140
Liu Ziyin: They minimize the camera.

31
00:03:11.270 --> 00:03:23.760
Liu Ziyin: Oh, yes, this screen is… That's your screen? That's my screen, but I'm sort of sitting here. So yeah, so you go ahead and click OK on the recording, and then minimize the thing so that people can see your slides.

32
00:03:23.770 --> 00:03:45.159
Liu Ziyin: Or I can project, if you want me to worry about that. Oh, there's, like, a little line on that, so you can go ahead and get rid of the little line. On the frame of the people? Yeah, go to the people. Go to the people. Yep, and then see the little line on the left? Oh, this one? Oh, I see. Yeah, that's good. That's a good thing. That's a good setting.

33
00:03:46.030 --> 00:03:47.100
Liu Ziyin: Okay.

34
00:03:49.610 --> 00:03:51.560
Liu Ziyin: Okay. Welcome, Ziyin.

35
00:03:52.120 --> 00:04:08.839
Liu Ziyin: Thank you. You know, I won't take… I won't take time to introduce everybody in the lab, but you have a mix here of Northeastern folks and folks from other… other labs and other institutions who like to join the meeting.

36
00:04:08.840 --> 00:04:17.069
Liu Ziyin: Ziyin is… so I, I don't know… Ziyin? What is that? So actually, Ziyin is… is… is Andy's, Andy's guest?

37
00:04:17.170 --> 00:04:21.860
Liu Ziyin: A, physics…

38
00:04:22.000 --> 00:04:29.579
Liu Ziyin: a trained physicist, who is a postdoc at MIT, right now, and,

39
00:04:29.750 --> 00:04:34.819
Liu Ziyin: And so he was explaining to me in the hallway that he's gonna be talking about

40
00:04:35.040 --> 00:04:38.519
Liu Ziyin: The way that he thinks about learning, and…

41
00:04:38.620 --> 00:04:43.440
Liu Ziyin: About, irreversibility and… and some of these interesting, sort of.

42
00:04:43.580 --> 00:04:45.499
Liu Ziyin: Ways of looking at learning dynamics.

43
00:04:45.620 --> 00:04:46.500
Liu Ziyin: So…

44
00:04:46.790 --> 00:04:58.280
Liu Ziyin: So, welcome, Ziyan. I'm looking forward to your talk. Okay. Thank you, David, for the introduction. So I'm Liu Zian. Okay, today I'm going to talk about,

45
00:04:58.350 --> 00:05:14.549
Liu Ziyin: my research, which is basically the physics behind optimization and generalization for most of you. So, as David has said, I'm a theoretical physicist by training, so what I do is sort of to take tools and concepts and ideas from theoretical physics.

46
00:05:14.620 --> 00:05:18.080
Liu Ziyin: To understand, how learning happens in neural networks.

47
00:05:18.100 --> 00:05:35.940
Liu Ziyin: To most of you, it will be a very, very strange talk, but both to the physicist theory and to the computer science theory, it'll be a very, very strange talk, but I hope by the end of the talk, you will… the computer scientists will learn a little bit more about concepts from physics that they can borrow to help analyze neural networks.

48
00:05:35.940 --> 00:05:38.450
Liu Ziyin: I may be physicists in the room will,

49
00:05:38.550 --> 00:05:46.490
Liu Ziyin: Learn a bit more about how physicists sort of are related very, very closely to neural networks work.

50
00:05:47.000 --> 00:06:00.020
Liu Ziyin: Okay, so very recently, AI has become a very, very empirical field. We propose a lot of algorithms, and we discover a lot of phenomena related to them, and we also propose various kinds of theories.

51
00:06:00.020 --> 00:06:18.969
Liu Ziyin: that explain these phenomena or understand these algorithms. But a key missing question, a key missing part are the organizing principles. So we don't know how… what are the fundamental laws, what are the principles… what are the principles behind the working of neural networks.

52
00:06:19.610 --> 00:06:29.530
Liu Ziyin: Well, and what I believe is that in order to really get to the organizing principles, we can look at physics, we can take inspiration from physics.

53
00:06:29.530 --> 00:06:38.419
Liu Ziyin: And, modern physics, especially 20th century modern physics, have two primary organizing principles. And one is symmetry.

54
00:06:38.420 --> 00:06:54.559
Liu Ziyin: The other is irreversibility. Sometimes that's a synonym of dissipation. Okay, so at the high symmetry, low irversibility end, you have particle physics standard model, and at the high dissipation, high irversibility, and low symmetry end, you have biophysics and chemistry.

55
00:06:54.820 --> 00:06:56.550
Liu Ziyin: Okay,

56
00:06:56.790 --> 00:07:17.069
Liu Ziyin: And it's no surprise that people have already sort of leveraged symmetry in the field of AI, and this is one very successful and common approach is to basically leverage the symmetries in the data, and to design matching architectures. And so you all heard of the invariant neural networks, equivariant neural networks.

57
00:07:17.070 --> 00:07:20.719
Liu Ziyin: And that's very well summarized in this, very nice textbook.

58
00:07:20.720 --> 00:07:23.759
Liu Ziyin: called Geometric Deep Learning by Michael Brownstein.

59
00:07:23.820 --> 00:07:32.450
Liu Ziyin: But in my opinion, especially to physicists, I think there's a key missing part, which is the symmetry in the model parameters. And…

60
00:07:32.590 --> 00:07:48.310
Liu Ziyin: And that's sort of where my research starts. As a physicist, we always ask the question of, whenever we see a system, you ask yourself the question, is there any symmetry? And if so, do they determine the dynamics of the system in any way? So that's how I started my research.

61
00:07:48.310 --> 00:07:58.269
Liu Ziyin: Okay, for the purpose of this presentation, I will focus on a very specific but universal and ubiquitous type of symmetry, which is permutation symmetry.

62
00:07:58.490 --> 00:08:10.999
Liu Ziyin: Okay, so let us look at a generic neural network, F, and F can often be decomposed into multiple layers. So here you have D many layers, so F sub i is a layer.

63
00:08:11.140 --> 00:08:20.189
Liu Ziyin: And within each layer, so each layer is permutation symmetric in its units, which we also call neurons, okay?

64
00:08:20.280 --> 00:08:30.780
Liu Ziyin: And this is because if you look at every layer, could be a fully connected layer, could be a self-attention layer. It's usually sort of defined as the summation.

65
00:08:30.810 --> 00:08:37.150
Liu Ziyin: over a generic nonlinear function sigma, with weights w sub i and input X, okay?

66
00:08:37.150 --> 00:08:52.169
Liu Ziyin: So, each of these W sub i is the weight, trainable weight, of the ith unit. Okay, note that you can swap Wi and WJ such that F sub i doesn't really change, so they really parameterize the same function, so that's the permutation symmetry in the layer.

67
00:08:52.320 --> 00:09:05.229
Liu Ziyin: Okay, so here is a, sort of a picture of a self-attention layer in a transformer. You can swap two attention tags such that the function you parameterize don't change, okay?

68
00:09:05.700 --> 00:09:21.649
Liu Ziyin: And here's another much simpler example, which is a fully connected layer. So you have one hidden layer here, and VI is the input weights to the first neuron. UI is the outgoing weights of the first neuron, and so are V2 and U2.

69
00:09:21.720 --> 00:09:35.200
Liu Ziyin: And here you can swap V1, U1 together with V2, U2. So, and you really parameterize the same function. So again, you have a lot of interesting permutation symmetry. So basically, in every layer you look.

70
00:09:35.290 --> 00:09:40.070
Liu Ziyin: As long as you can define any notion of width, you must have a permutation symmetry.

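The hidden-layer permutation symmetry described above is easy to check numerically. The sketch below is our own illustration (the network, the names `forward`, `v`, `u`, and the sizes are assumptions, not code from the talk): a one-hidden-layer network f(x) = sum_i u_i * tanh(v_i . x), where swapping whole neurons, each incoming vector v_i together with its outgoing weight u_i, leaves the function unchanged.

```python
import numpy as np

def forward(v, u, x):
    # v: (width, d_in) incoming weights, u: (width,) outgoing weights
    return u @ np.tanh(v @ x)

rng = np.random.default_rng(0)
d_in, width = 3, 5
v = rng.normal(size=(width, d_in))
u = rng.normal(size=width)
x = rng.normal(size=d_in)

# Swap whole neurons: each row v[i] moves together with its outgoing weight u[i].
perm = rng.permutation(width)
out_original = forward(v, u, x)
out_permuted = forward(v[perm], u[perm], x)
print(out_original, out_permuted)  # equal up to floating-point summation order
```

The key point is that the permutation must act on the incoming and outgoing weights jointly; permuting only one of the two would change the function.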
71
00:09:40.680 --> 00:09:59.099
Liu Ziyin: Okay, and one common phenomena people have found is a very interesting phenomenon, is that you can actually remove a lot of these units, and still your model gets to train to essentially the same performance. You can take some of your hats, you find that they are redundant, you can remove them, and you can still train your model, and they still work.

72
00:09:59.370 --> 00:10:10.789
Liu Ziyin: Okay, and this is summarized by the famous lottery ticket hypothesis, which essentially states that any neural network can be compressed to a smaller network, such that the training network doesn't change.

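A minimal sketch of why redundant units can be pruned exactly, assuming a toy mean-field layer that averages its neurons; this duplicate-pruning construction is ours for illustration, not the lottery-ticket procedure itself. A "wide" network whose neurons come in identical copies computes the same function as the small network of distinct neurons.

```python
import numpy as np

def forward(v, u, x):
    # toy mean-field layer: f(x) = mean_i u_i * tanh(v_i . x)
    return np.mean(u * np.tanh(v @ x))

rng = np.random.default_rng(1)
d_in = 4
# A "wide" network whose 12 neurons are really 3 distinct neurons, 4 copies each.
v_small = rng.normal(size=(3, d_in))
u_small = rng.normal(size=3)
v_big = np.repeat(v_small, 4, axis=0)
u_big = np.repeat(u_small, 4)

x = rng.normal(size=d_in)
print(forward(v_big, u_big, x), forward(v_small, u_small, x))  # equal: pruning is exact here
```

Real networks have approximately clustered, not exactly duplicated, neurons, which is where the moment-matching machinery discussed later comes in.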
73
00:10:10.970 --> 00:10:23.690
Liu Ziyin: And this is one thing that we found to be very, very related to the permutation symmetries of the layer. Okay, so here's the result. We proved that I gave you the sort of arrogant name, which is called equivariance Principle of Learning.

74
00:10:23.690 --> 00:10:38.570
Liu Ziyin: So it essentially states that if you take any permutation symmetric layer, okay, any layer that looks like this, okay, and with neuron weights WI, then any operator G that transforms your neurons, okay, so WI and W2WD are the weights of the neurons.

75
00:10:38.630 --> 00:10:47.409
Liu Ziyin: Such that if this operator doesn't change the statistical movements of neurons in every small neighborhood, we'll leave the learning dynamics unchanged.

76
00:10:47.620 --> 00:10:54.050
Liu Ziyin: Okay, so it's saying how you can change the weight distribution without really changing how the function is being learned.

77
00:10:54.330 --> 00:11:06.090
Liu Ziyin: Okay? You can really choose G to be those functions that strongly compress these neurons, that maybe set a lot of neurons to zero, or maybe set a lot of weights to zero, to…

78
00:11:06.200 --> 00:11:14.819
Liu Ziyin: And still… and still you can apply this theorem. So that sort of explains how you can have lottery ticket hypothesis.

79
00:11:15.190 --> 00:11:21.129
Liu Ziyin: The operator is used to modify the weights? It is used to modify the weights.

80
00:11:21.250 --> 00:11:22.110
Liu Ziyin: Yes.

81
00:11:22.760 --> 00:11:34.759
Liu Ziyin: So it's a set of transformations, you can apply to your weights, maybe you change the first neuron by a little bit, second neuron by a little bit, and as long as that doesn't change the statistical movements, you will basically

82
00:11:35.200 --> 00:11:36.900
Liu Ziyin: Still learn the same function.

83
00:11:37.290 --> 00:11:44.590
Liu Ziyin: Okay? And the idea is really simple. The reason why this works is… Just for contacting me.

84
00:11:44.590 --> 00:12:06.170
Liu Ziyin: Oh, was this… this was your paper, is that right? Yeah, this is my paper. And is this… is this a hypothesis, or is this something that you were able to prove? This is actually a theorem, so we actually proved this. Of course, there are a little bit, regularity conditions, you have to assume that the functions are smooth, you have to assume that the neurons live on a bonded space, but yes, this is something we proved, yes. And then the definition of learning dynamics are unchanged.

85
00:12:06.310 --> 00:12:25.810
Liu Ziyin: What does that mean? Okay. So here, it really means that if you look at f of x as a function, this function, the way it evolves through time, is the same for the two different distributions of neurons. So how do you… what do you mean by evolution? So, really, if you observe something like setting a weight to zero.

86
00:12:25.810 --> 00:12:36.400
Liu Ziyin: then it's going to be a little different. Yeah. So, so in what, in which, in, like, how do you say that it's essentially the same? Oh, okay. It's essentially the… okay, they are the same in the asymptotic sense.

87
00:12:36.400 --> 00:12:46.619
Liu Ziyin: So, if you, okay, you take one distribution of neurons, and you take another distribution of neurons, and you train the two networks, and you look at how F of X change through time.

88
00:12:46.750 --> 00:12:54.180
Liu Ziyin: And you will see that there are some small differences, but you can show that these small differences vanishes as you take the infinite width limit.

89
00:12:54.510 --> 00:12:58.409
Liu Ziyin: So when you… yes, so it's an asymptotic result, yes.

90
00:13:00.120 --> 00:13:02.679
Liu Ziyin: Okay. So the moments would be of effort.

91
00:13:03.070 --> 00:13:05.039
Liu Ziyin: Of the neurons. Yeah.

92
00:13:05.170 --> 00:13:06.759
Liu Ziyin: Of the North.

93
00:13:07.890 --> 00:13:18.030
Liu Ziyin: It'll be clear in the next slide, yes, yes. Is that kind of saying as, like, a result, that there's, like, a continuous space between, you know, size networks and down?

94
00:13:19.560 --> 00:13:26.690
Liu Ziyin: substantial difference as, all the… as a strength, like, it's kind of inducted constantly.

95
00:13:27.810 --> 00:13:38.689
Liu Ziyin: Yeah, so it certainly has to do with, for example, mode connectivity of neural networks, right? But we do believe that the solution space of neural networks is highly degenerated. There are a lot of

96
00:13:38.750 --> 00:13:53.890
Liu Ziyin: ways that parameterize the same function, and this is really, sort of, a different way of looking at the same phenomena, yes. But now, the theorem is actually a theorem about the distribution of functions, not just a function, right? The distribution of learned functions. Is that right?

97
00:13:54.880 --> 00:14:00.880
Liu Ziyin: It's… it's not about a distribution of function, it's about…

98
00:14:01.250 --> 00:14:17.629
Liu Ziyin: it's about one single function that's parameterized by, let's say, D-mining neurons. I'm saying that if you give me D minor neurons that parametrizes this layer, I can replace these D neurons with another set of D neurons, such that the function they parametrize are the same.

99
00:14:17.930 --> 00:14:20.859
Liu Ziyin: And even during training, they are always the same.

100
00:14:22.550 --> 00:14:32.989
Liu Ziyin: So I can replace my neuron weights by another set of weights, such that they don't really change. I see. So, and the thing that is being fixed is, like, the data, the training data… The training data are being fixed, yes.

101
00:14:34.020 --> 00:14:36.910
Liu Ziyin: So the transformation is at initialization.

102
00:14:37.100 --> 00:14:48.259
Liu Ziyin: the transformation could happen any point in time. You can't do it at initialization, and that essentially gives you the LTH, but you can also do the transformation, let's say, in the middle of the training, that still works, yes.

103
00:14:49.900 --> 00:15:01.690
Liu Ziyin: Okay, so here is sort of why it works. The reason, really, why this works is because permutation symmetry sort of gives you an emergent notion of distance between neurons.

104
00:15:01.690 --> 00:15:20.339
Liu Ziyin: So, every neuron is sort of parameterized by its weights. So, okay, so each WI is sort of the coordinate of the neuron, and you can imagine that these neurons live on a high-dimensional manifold, okay? And you can really look at… when you have a lot of them on the high… in space, a lot of them will be clustered, so you can sort of,

105
00:15:20.430 --> 00:15:22.959
Liu Ziyin: You can, how do you call it?

106
00:15:23.150 --> 00:15:38.229
Liu Ziyin: You can look at small neighborhoods in a space, so each dot here is a neuron, and if you can find a neighborhood that… where there is a lot of neurons, they will essentially be redundant, okay? Neurons that are close to each other are redundant, and therefore you can sort of compress them into fewer neurons.

107
00:15:38.230 --> 00:15:43.350
Liu Ziyin: And you can actually do this compression in a way that the learning dynamics is really the same.

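The "neurons as points" picture can be made concrete. In the sketch below (our illustration; the variable names and sizes are assumptions), each hidden unit's coordinate is its incoming weights concatenated with its outgoing weights, four-dimensional as in the talk's 2-in/2-out example, and redundant neurons show up as nearby points under the Euclidean distance this induces.

```python
import numpy as np

rng = np.random.default_rng(2)
d_in, d_out, width = 2, 2, 6
v = rng.normal(size=(width, d_in))     # incoming weights of each hidden unit
u = rng.normal(size=(width, d_out))    # outgoing weights of each hidden unit
v[3], u[3] = v[0] + 1e-3, u[0] + 1e-3  # make neuron 3 nearly duplicate neuron 0

# Each neuron's coordinate: incoming and outgoing weights concatenated (4-D here).
coords = np.concatenate([v, u], axis=1)
dists = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
np.fill_diagonal(dists, np.inf)
i, j = np.unravel_index(np.argmin(dists), dists.shape)
print(i, j, dists[i, j])  # the closest, i.e. most redundant, pair of neurons
```

This distance is only meaningful because of the permutation symmetry: the layer treats neurons as an unordered cloud of points, so geometric closeness in weight space is a well-defined notion of redundancy.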
108
00:15:43.970 --> 00:15:47.309
Liu Ziyin: So that's essentially what's happening here. So you can…

109
00:15:47.430 --> 00:15:52.739
Liu Ziyin: summarize a lot of neurons with very few neurons. Of course, you don't have to do the compression, you can just

110
00:15:52.950 --> 00:15:59.239
Liu Ziyin: as I said, it's an equivalence principle, you don't have to compress them. Of course, one of the uses will be to compress them.

111
00:15:59.490 --> 00:16:18.030
Liu Ziyin: Okay, and here is a sort of a experiment that does this, so we learn a very simple low-dimensional function that looks like this. So your theorem is… so your theorem is about one layer at a time, is that right? Yes, it only applies to a single layer. Okay, by a single layer, you sort of need the… basically anything that looks like this.

112
00:16:18.030 --> 00:16:22.209
Liu Ziyin: So, take any submodule network that looks like this.

113
00:16:22.210 --> 00:16:27.690
Liu Ziyin: And you basically have D neurons here, and you can apply the theorem. I have a question about this. Yes, please.

114
00:16:28.000 --> 00:16:37.919
Liu Ziyin: Where are the weights of the second layer? I mean, so here you set all of the weights of the second layer of the MLP to 1 or something.

115
00:16:38.700 --> 00:16:41.779
Liu Ziyin: So the layers don't really look like this, right?

116
00:16:42.030 --> 00:16:44.699
Liu Ziyin: What do you mean?

117
00:16:47.560 --> 00:16:49.559
Liu Ziyin: I mean, I wonder, like…

118
00:16:52.140 --> 00:17:09.449
Liu Ziyin: like, there's an extra weight that multiplies after the nonlinearity, is what he's saying. Like, the neural networks start to become non-trivial when you have the second layer of weight. Yeah, so basically, the way you define a neuron is to…

119
00:17:10.260 --> 00:17:21.560
Liu Ziyin: is the… the outgoing weight and incoming weight, concatenated together? So you have to… so it has to be a two-layer structure if you are looking at a fully connected layer. So these two.

120
00:17:21.800 --> 00:17:24.549
Liu Ziyin: Right. Weight vectors together forms an income.

121
00:17:27.630 --> 00:17:31.129
Liu Ziyin: And so, distinctly… So, a deficit?

122
00:17:32.020 --> 00:17:33.580
Liu Ziyin: Say it again?

123
00:17:33.690 --> 00:17:37.909
Liu Ziyin: The neurons are the activation for a given dataset.

124
00:17:39.730 --> 00:17:58.310
Liu Ziyin: the neurons, well, they… It's the output of sigma here, right? Each neuron is the output of… Yes, well, the… okay, it's tricky, so, okay, so yes, usually we think of the node as a neuron, but here I'm saying that it's the weights coming to it, which is VI, and the weight leaving it

125
00:17:58.310 --> 00:18:03.689
Liu Ziyin: UI forms the coordinate of the neuron. So these two coordinates together

126
00:18:03.690 --> 00:18:09.759
Liu Ziyin: parameterizes the neuron, so it's the weights that parameterize the neuron, okay? So it's really these four edges.

127
00:18:09.900 --> 00:18:13.550
Liu Ziyin: That I call this neuron, not the node itself that I call the neuron.

128
00:18:14.190 --> 00:18:19.520
Liu Ziyin: Is it the parameters, or the… It is the parameters. It is the parameters, yes.

129
00:18:22.820 --> 00:18:28.210
Liu Ziyin: So you're using some… this is… so is that why you took the comma, and you have, like, this…

130
00:18:28.380 --> 00:18:31.639
Liu Ziyin: Sigma and 10WI comma X.

131
00:18:32.590 --> 00:18:47.670
Liu Ziyin: Yes, yes, yes, this is a non-standard way, but if you write it this way, you can… it also applies to transformers, to self-attention, I guess. If you… yes, so here, WI here is really, sort of.

132
00:18:47.700 --> 00:18:57.489
Liu Ziyin: VI UI together is your W1, and V2U2 together is your W2, okay? So you see, you take a sum of two things.

133
00:18:57.540 --> 00:19:09.040
Liu Ziyin: And, okay, so this sigma is actually not this sigma, so your sigma actually have to include both things. Yes. It's actually a two-dimensional vector.

134
00:19:09.520 --> 00:19:25.450
Liu Ziyin: Yes, yes. And U1 is also a two-dimensional, right? Yes. And then, and then… so each of the neurons are four-dimensional. So there's a linear aspect to what's going on, but you've got this… you've got this tensor…

135
00:19:25.450 --> 00:19:34.119
Liu Ziyin: structure, this spark structure that… so that's why you don't have a dot product here anymore, right? So it's like… but there's… so do you… does your theorem depend on…

136
00:19:34.560 --> 00:19:41.400
Liu Ziyin: like, there's… there's a linear structure, and then you have this nonlinear operation that happens? No, I don't know.

137
00:19:42.220 --> 00:19:57.929
Liu Ziyin: Because there still is a linear structure when you write it like this, right? Or maybe not. You sort of need a linear structure outside each of the neurons. You sort of need it in the technical parts of the theorem, but for the purpose of this talk, you can really imagine this kind of neuron. So you have a generic

138
00:19:57.930 --> 00:20:03.090
Liu Ziyin: sigma, that's a generic function of W, right? You can have some linear parts outside.

139
00:20:03.120 --> 00:20:14.969
Liu Ziyin: the hidden nonlinearities of sigma, but not necessarily, yes. And just, like, as a wish for your future talk to some random machine learning superior,

140
00:20:15.560 --> 00:20:19.600
Liu Ziyin: We could connect it to, you know, our daily bread.

141
00:20:20.310 --> 00:20:25.839
Liu Ziyin: Like, where we rise down and earth, the layer's very different, right? It's, like, just to…

142
00:20:26.420 --> 00:20:36.840
Liu Ziyin: Oh, yeah, yeah, like, if you, if you, like, so the thing on the right looks a little bit more familiar, right? The thing on the right, oh yeah, that's Baile break. Yeah, you almost want to, like, say, oh.

143
00:20:36.940 --> 00:20:42.939
Liu Ziyin: You know, step, step, step, and then the thing on the right. Right, the thing on the left, right.

144
00:20:43.100 --> 00:20:56.620
Liu Ziyin: Yeah. Yes, yes, thank you. So basically, yes, so this is a neuron, so UI and W… VI are the neurons, are your first neuron? In this case, it's four-dimensional, okay?

145
00:20:57.050 --> 00:20:57.930
Liu Ziyin: Okay.

146
00:20:58.210 --> 00:21:06.640
Liu Ziyin: So what it really means is that it induces a notion of Euclidean distance between new neurons. So every neuron is parameterized by its weights, okay?

147
00:21:06.980 --> 00:21:14.739
Liu Ziyin: So… so this thought could be a neuron in your fully connected layer, or it could be a self-attention hat in your self-attention layer.

148
00:21:15.220 --> 00:21:21.679
Liu Ziyin: Okay, and so what we do here is basically we learn a very simple function,

149
00:21:21.860 --> 00:21:35.520
Liu Ziyin: Well, of course, it's a two-dimensional function, and we train a neural network. So, the blue line… sorry, the green line is a training of a standard small neural network, trained with Kamin in it.

150
00:21:35.640 --> 00:21:41.169
Liu Ziyin: And the blue dashed line is what you train with a… is where we train a much wider network.

151
00:21:41.580 --> 00:21:46.720
Liu Ziyin: And the orange line is… has the same width as the green line.

152
00:21:46.860 --> 00:22:06.499
Liu Ziyin: But I initialize it in a way that it matches, basically matches the bigger network according to the theorem, okay? And you see that once you match the moments as the theorem requires, you really have the same learning dynamics, okay? So you see the orange line is really the same as the blue line.

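The matched-initialization experiment described here can be sketched in a toy setting. This is our own construction, not the paper's: we assume a mean-field layer that averages its neurons, build the "wider" network by duplicating the small network's neurons (which trivially matches all moments), and scale the learning rate by the width ratio so the per-neuron updates coincide. Under those assumptions the two networks follow the same learning dynamics.

```python
import numpy as np

xs = np.linspace(-1.0, 1.0, 16)
ys = xs ** 2  # arbitrary 1-D regression target (an assumption)

def predict(v, u, xs):
    # mean-field layer: f(x) = mean_i u_i * tanh(v_i * x)
    return np.mean(u[:, None] * np.tanh(v[:, None] * xs[None, :]), axis=0)

def step(v, u, lr):
    # one gradient-descent step on the mean squared error
    pred = predict(v, u, xs)
    r = 2.0 * (pred - ys) / len(xs) / len(v)  # note the 1/width factor
    t = np.tanh(v[:, None] * xs[None, :])
    dv = np.sum(r[None, :] * u[:, None] * (1 - t ** 2) * xs[None, :], axis=1)
    du = np.sum(r[None, :] * t, axis=1)
    return v - lr * dv, u - lr * du

rng = np.random.default_rng(0)
v_small, u_small = rng.normal(size=2), rng.normal(size=2)
v_big, u_big = np.repeat(v_small, 4), np.repeat(u_small, 4)  # moment-matched wider net

for _ in range(100):
    v_small, u_small = step(v_small, u_small, lr=0.5)
    v_big, u_big = step(v_big, u_big, lr=2.0)  # lr scaled by the width ratio

print(np.max(np.abs(predict(v_small, u_small, xs) - predict(v_big, u_big, xs))))
```

The printed gap stays at floating-point noise: the two trajectories are the same curve, which is the toy analogue of the orange line tracking the blue line in the slide.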
153
00:22:06.720 --> 00:22:19.559
Liu Ziyin: Yes, so that tells you how the learning dynamics is really invariant to these reparameterizations, okay? And it also applies to other optimizers, such as RSProp and AdamW, for example.

154
00:22:19.590 --> 00:22:29.360
Liu Ziyin: what's the re-parameterization you're doing here? Okay, so, okay. It's… you have to look at the theorem. It's testing the moments of the bigger network.

155
00:22:30.050 --> 00:22:41.410
Liu Ziyin: So, you divide space into little cells, and within each little cell, you construct a smaller set of neurons, such that all of its statistical movements in this cell

156
00:22:41.410 --> 00:22:57.130
Liu Ziyin: It's the same as the static two moments here. I see. So the left one is a bigger network, because it has more dots, and then the right one is a smaller network with fewer dots. Yes. And so you're saying, like, a few… yeah, okay, you're… you're defining… So, yes, so, but you're… so you're…

157
00:22:57.220 --> 00:23:02.569
Liu Ziyin: you're… you're sort of doing the lottery ticket thing here, right? So you're saying, okay, we have our random initialization. Yes.

158
00:23:02.700 --> 00:23:09.500
Liu Ziyin: And… and, we're going to subsample it, but instead of just subsampling it, Randomly.

159
00:23:09.720 --> 00:23:19.249
Liu Ziyin: I'm gonna subsample it by… you know, by tiling this space… According to this metric.

160
00:23:19.440 --> 00:23:20.880
Liu Ziyin: more uniformly.

161
00:23:21.070 --> 00:23:24.349
Liu Ziyin: than it would be if I had just drawn everything from captions.

162
00:23:26.120 --> 00:23:45.600
Liu Ziyin: Yes, so you not only… yes, you not only have to subsample, you also have to transform your weights a little bit. You have to make them match the original network a little bit. Oh, so it's not just subsampling? It's not just subsampling. You have to have additional stuff, which is to really, sort of, make sure your smaller network match the statistics of the higher net… of the larger network.

163
00:23:46.050 --> 00:23:55.039
Liu Ziyin: So, yes, I would call it a proof of the… a variant of the lottery ticket hypothesis, but it requires one additional step.

164
00:23:55.450 --> 00:24:05.130
Liu Ziyin: Which is to transform the weights. Yes, can you give a little bit of intuition about the transformation… transformation of the weights? Is it… is it just an averaging procedure where you take the proper

165
00:24:05.130 --> 00:24:23.989
Liu Ziyin: and you find, sort of, a centroid of them or something, or is it more complicated than that? So, it is essentially the same as you take… you measure its first moment, which is… which are the averages, and second moment, and third moment, you try to match every one of the moments, and there's a theorem that guarantees that you can actually match them up to

166
00:24:23.990 --> 00:24:31.480
Liu Ziyin: every order and perfectly. It's called Takolov Theorem, and that also allows you to really match the moments with very few neurons.

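A one-dimensional toy version of the moment-matching step (our construction, not the paper's algorithm): a large cloud of scalar weights can be replaced by just two equally weighted points at the mean plus and minus the standard deviation, which reproduce the first and second moments exactly, in the spirit of the Tchakaloff-style guarantee mentioned here.

```python
import numpy as np

rng = np.random.default_rng(3)
w = rng.normal(loc=0.5, scale=2.0, size=1000)  # 1000 scalar "neuron" weights

m, s = w.mean(), w.std()
w_small = np.array([m - s, m + s])  # just 2 "neurons"

print(w.mean(), w_small.mean())          # first moments agree
print((w**2).mean(), (w_small**2).mean())  # second moments agree
```

Matching higher moments, and doing it in high dimension per cell, needs more points and a more careful construction, which is the tricky part the speaker alludes to.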
167
00:24:31.550 --> 00:24:47.819
Liu Ziyin: So there's a… there's an underlying and interesting mathematical theorem that allows you to match these movements. So it's really as simple as matching the moments, but it's actually a little bit tricky to actually… to how to match them. Yeah, they're all in my paper. This is so cool, it's like…

168
00:24:48.050 --> 00:25:03.290
Liu Ziyin: For the first time, the lottery ticket stuff becomes concrete, you know? Good. Yes, yes, I'm very happy to hear that. Yeah, it kind of sounds correct, but I kind of… I don't know what to make of it, right?

169
00:25:03.390 --> 00:25:18.699
Liu Ziyin: There are certainly limitations to the theorem. You see that, for example, you're really sort of trying to divide space into little cells, but that's really bad because of the curse of dimensionality, so there are a lot of caveats to the result, yes.

170
00:25:19.450 --> 00:25:22.460
Liu Ziyin: Okay, now I can move on to the…

171
00:25:22.600 --> 00:25:34.549
Liu Ziyin: Second part of the first part. Okay, so… so what… so we have talked about how permutation symmetry gives you an inherent emergent notion of Euclidean distance.

172
00:25:34.550 --> 00:25:46.809
Liu Ziyin: Within every layer. So, there's a way to really sort of say that in a very formal way, which is what we call this second theorem, which is topological phases of training. So what we showed that is,

173
00:25:46.880 --> 00:25:57.240
Liu Ziyin: If you take any permutation symmetric layer, again, fully connected or self-attention hats, and with neuron weights WI, then the learning dynamics can be divided into two different phases.

174
00:25:57.240 --> 00:26:15.630
Liu Ziyin: The first is a small learning rate phase, where the topology of neurons are preserved. So what that means is that if your neurons are close to each other at the beginning of training, they will stay close, and if they're far away, they will stay far away. And the second phase is the large learning rate phase, where the topology of neurons will simplify. So what that means is that

175
00:26:15.910 --> 00:26:19.200
Liu Ziyin: Neurons that were far away could become closer.

176
00:26:19.420 --> 00:26:31.559
Liu Ziyin: But again, those work clothes cannot be separated, so you, actually, you have… they get essentially closer together. Oh, that's irreversibility. It is, yes, it is irreversibility. So the second part is irreversible.

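The irreversibility claim has a simple mechanistic core: once two neurons of a permutation-symmetric layer carry identical weights, their gradients are identical by symmetry, so plain gradient descent can never separate them again. A toy sketch of this (our setup, with assumed targets and hyperparameters, not the talk's experiment):

```python
import numpy as np

xs = np.linspace(-2.0, 2.0, 32)
ys = np.sin(xs)  # arbitrary 1-D regression target (an assumption)

def grads(p):
    """Gradient of the MSE of f(x) = u1*tanh(v1*x) + u2*tanh(v2*x)."""
    v1, u1, v2, u2 = p
    pred = u1 * np.tanh(v1 * xs) + u2 * np.tanh(v2 * xs)
    r = 2.0 * (pred - ys) / len(xs)
    return np.array([
        np.sum(r * u1 * (1.0 - np.tanh(v1 * xs) ** 2) * xs),  # dL/dv1
        np.sum(r * np.tanh(v1 * xs)),                          # dL/du1
        np.sum(r * u2 * (1.0 - np.tanh(v2 * xs) ** 2) * xs),  # dL/dv2
        np.sum(r * np.tanh(v2 * xs)),                          # dL/du2
    ])

p = np.array([0.7, -0.3, 0.7, -0.3])  # the two neurons start merged
for _ in range(500):
    p = p - 0.1 * grads(p)            # plain gradient descent

print(p[:2], p[2:])  # the two neurons remain identical: the collapse is permanent
```

The merged configuration lies on a symmetric subspace that gradient flow cannot leave, which is exactly why topology can only simplify, never un-simplify, during training.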
177
00:26:31.830 --> 00:26:41.490
Liu Ziyin: So this is a pictorial illustration of that. So we initialize the neurons on a ring, and if you try and add a smaller neurator…

178
00:26:41.660 --> 00:26:43.360
Liu Ziyin: What'd you say?

179
00:26:44.120 --> 00:26:51.849
Liu Ziyin: That's the sticky neurons that… I think he'll talk about… he likes sticky neurons. I like sticking neurons together, maybe I'll talk about that, yes.

180
00:26:51.850 --> 00:27:04.190
Liu Ziyin: Yes, so at a small learning rate, you learn really by deforming the ring, okay? But you don't really change the topology. You have a hole there, and they stay connected, but you only change the shape of it.

181
00:27:04.190 --> 00:27:13.940
Liu Ziyin: But if you train at a larger learning rate, you actually start to twist things, and that's a change of the topological structure, and part of your neuron distribution actually becomes lower dimensional.

182
00:27:14.270 --> 00:27:32.309
Liu Ziyin: Okay, and here are the experiments. So, we really initialize the neuron… neurons on a two-dimensional… well, actually, on the ring, which is a one-dimensional manifold, and here is what happens when you have a small learning rate, okay, so this is how the distribution of neurons changes, and this is the training loss, okay?

183
00:27:32.620 --> 00:27:47.660
Liu Ziyin: And you see that if you increase the learning rate a little bit, you have larger change in the shape, but still don't really change the topology. But if you have a very strong, very high learning rate, you see that you actually do change the topology, a lot of the neurons actually collapse to this point.

184
00:27:47.660 --> 00:27:59.530
Liu Ziyin: And therefore, giving you a very strange topology at the end. And in this case, actually, changing the topology actually makes the learning happen much easier. So here, the student is bigger than the teacher.

185
00:27:59.760 --> 00:28:07.579
Liu Ziyin: It's much bigger than the teacher, so that is to say that the distribution of the neurons doesn't really have to match, for example, the teacher's distribution.

186
00:28:08.350 --> 00:28:09.600
Liu Ziyin: Good Thursdays.

187
00:28:10.030 --> 00:28:11.060
Liu Ziyin: Okay…

188
00:28:11.330 --> 00:28:20.799
Liu Ziyin: Okay, so there's one more thing that I want to tell you very quickly about. It relates to the, basically, the merging of neurons you talked about, but I'll be very brief here.

189
00:28:20.800 --> 00:28:36.110
Liu Ziyin: So what symmetries, and in particular, permutation symmetry, really lead to is a separation of two phases, okay? So whenever you have symmetries, you can actually sort of show that your loss landscape locally will take this shape, okay? It's like a Mexican hat.

190
00:28:39.720 --> 00:28:57.909
Liu Ziyin: And there's a symmetric state where two neurons are identical, okay? So neurons merge into each other, they are… you're really the same as a smaller network, and there's a second symmetry broken phase where two neurons are separate. So you effectively have a larger network.

191
00:28:58.350 --> 00:29:09.919
Liu Ziyin: Okay, so we had a series of results that essentially proved the following two things. Let's say established the following two things. So, first of all, the higher the symmetry… the higher the symmetry in your weights.

192
00:29:10.540 --> 00:29:21.689
Liu Ziyin: Of course, when you have these symmetries, the lower the expressivity of your neurons. So basically, it's the same thing here: if two neurons are identical, you essentially have a lower dimensional model.

193
00:29:21.850 --> 00:29:28.460
Liu Ziyin: And, if you have stronger regularization during training, you will converge to higher and higher symmetric states.

194
00:29:28.700 --> 00:29:46.349
Liu Ziyin: Okay? So if you apply that to permutation symmetry, that really means that, for example, if you train with weight decay, your neurons will tend to collapse onto each other. Okay, you will tend to get identical neurons, okay? So basically, this is the illustration of that, when you have basically two neurons here, and…

195
00:29:46.820 --> 00:29:55.050
Liu Ziyin: When the two neurons are identical, it's really the same as having one neuron, but with the output weights added together, okay?

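The collapse being described can be checked directly: two neurons with identical input weights compute exactly the same function as a single neuron whose output weights are summed. A minimal sketch (a hypothetical ReLU neuron, not the speaker's setup):

```python
import numpy as np

rng = np.random.default_rng(1)
w = rng.normal(size=5)          # shared (collapsed) input weights
a1, a2 = 0.7, -0.3              # output weights of the two identical neurons
X = rng.normal(size=(100, 5))   # a batch of inputs

relu = lambda z: np.maximum(z, 0.0)

# Two identical neurons...
two_neurons = a1 * relu(X @ w) + a2 * relu(X @ w)
# ...compute exactly the same function as one neuron with summed output weight.
one_neuron = (a1 + a2) * relu(X @ w)

print(np.allclose(two_neurons, one_neuron))
```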
196
00:29:56.110 --> 00:30:15.390
Liu Ziyin: And you see that when you have multiple layers, and when you have many, many neurons within the same layer, there are a lot of hierarchies you can have, right? For example, it's possible for these two neurons to be identical, it's possible for these three neurons to be identical, and so on. You have essentially exponentially many hierarchies that could exist there.

197
00:30:15.970 --> 00:30:35.630
Liu Ziyin: Okay, and here is the training of a four-layer MLP on the CIFAR-10 dataset, okay? Here, sorry, they are wiggling a little bit. I plot the neuron similarity matrix of every hidden layer, so basically how close are the weights of each neuron in every layer to each other. So you see that during training, the second layer and the third layer actually

198
00:30:35.630 --> 00:30:45.460
Liu Ziyin: has a lot of neurons that collapse onto a single huge neuron, which I don't know why, but this actually happens a lot when you train with Adam.

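The neuron similarity matrix mentioned here can be sketched as a plain cosine-similarity matrix over the rows of a layer's weight matrix (the talk's exact metric may differ):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical hidden-layer weight matrix: one row per neuron.
W = rng.normal(size=(6, 16))
W[3] = W[0]            # simulate two neurons that have collapsed onto each other

def neuron_similarity(W):
    """Cosine-similarity matrix between the rows (neurons) of W."""
    Wn = W / np.linalg.norm(W, axis=1, keepdims=True)
    return Wn @ Wn.T

S = neuron_similarity(W)
print(S.shape)             # one entry per pair of neurons
print(round(S[0, 3], 6))   # identical neurons show up as similarity 1
```

A block of identical neurons appears as a bright block in this matrix, which is what the plots in the talk are showing per layer.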
199
00:30:45.460 --> 00:30:46.579
Liu Ziyin: I'll see if I can.

200
00:30:46.580 --> 00:30:54.030
Liu Ziyin: So in the two intermediate layers, you have a bunch of identical neurons after training. Why does it get stuck there?

201
00:30:54.030 --> 00:31:04.009
Liu Ziyin: Why does it get stuck there? Probably you don't need that many neurons to learn CIFAR-10. Could there ever be a gradient that takes you away? Yeah.

202
00:31:04.010 --> 00:31:15.080
Liu Ziyin: Okay, so that's the… that's basically what the theorem established, which is that whenever you have regularization, such as weight decay, you're energetically preferred to go to these identical-neuron states.

203
00:31:15.300 --> 00:31:16.070
Liu Ziyin: Yes.

204
00:31:16.980 --> 00:31:35.160
Liu Ziyin: Okay, so here's a recap of the first part of the talk. Okay, actually, I'm… I have time. So, the first is basically parameter symmetries are ubiquitous, and they give rise to redundancies in the… in the weights, okay? You have… you have them in attention layers, in fully connected layers.

205
00:31:35.330 --> 00:31:38.899
Liu Ziyin: I mean, essentially everywhere. And every symmetry subgroup

206
00:31:39.180 --> 00:31:51.319
Liu Ziyin: leads to essentially two phases. Again, let's talk about the permutation symmetry. So, the symmetric phase is where two neurons are identical, and symmetric broken phase are where two neurons are different.

207
00:31:51.870 --> 00:32:07.800
Liu Ziyin: Okay? And the hierarchy of subgroups corresponds to a hierarchy of complexities. So, basically, those are what these two figures are talking about. If you had higher symmetric states, for example, 100 neurons are identical to each other, of course you have a low expressivity.

208
00:32:07.990 --> 00:32:18.100
Liu Ziyin: And if you have strong regularization, weight decay, actually noise also helps you get there. You go to a higher symmetric state, so neurons are more likely to merge into each other.

209
00:32:19.540 --> 00:32:22.879
Liu Ziyin: Okay, before I start the second part, is there any question?

210
00:32:23.920 --> 00:32:37.509
Liu Ziyin: Maybe if you can comment on, like, so, given that regularization increases symmetry, and symmetry decreases expressivity, is this something that we generally want when we train networks, or, like.

211
00:32:38.050 --> 00:32:57.799
Liu Ziyin: Yes, okay, that's a very good question. There's certainly a trade-off, right? You, on the one hand, you want a high expressivity network that allows you to fit the function, but people do believe that you want the simplest function that fits your data, right? So you sort of want a balance point between

212
00:32:57.810 --> 00:33:03.219
Liu Ziyin: There's, let's say, a sweet spot of symmetries and expressivity that you want to achieve.

213
00:33:03.850 --> 00:33:05.070
Liu Ziyin: Yes. So.

214
00:33:05.510 --> 00:33:18.070
Liu Ziyin: So you… so the machine learning word for it is generalization. Do you… do you try to relate symmetry to generalization, or expressivity versus generalization? I've asked that here, or is that something?

215
00:33:18.070 --> 00:33:28.680
Liu Ziyin: That's a different topic for you. Actually, that's one thing we are working on right now. It's actually a very… quite difficult thing, but, I'm collaborating with statisticians and mathematicians to actually make the link

216
00:33:28.680 --> 00:33:43.500
Liu Ziyin: strong, but intuitively, you can already see it right there, right? Because expressivity is directly linked to generalization gap, right? The more expressive you are, the bigger hypothesis class you have, and therefore, you tend to have a bigger… Memorization. Memorization, for example.

217
00:33:43.500 --> 00:33:56.230
Liu Ziyin: So, so you do expect it to be strongly related, but it's still unclear how you can go directly from symmetry to generalization. We're working on that, yes. Yes, please.

218
00:33:56.810 --> 00:34:01.340
Liu Ziyin: It's that you, you study the student that is figure, and…

219
00:34:01.460 --> 00:34:08.439
Liu Ziyin: I talked to another, I forgot, a professor at the conference, and he told me this recite that, like.

220
00:34:08.719 --> 00:34:10.639
Liu Ziyin: You basically can…

221
00:34:11.199 --> 00:34:29.729
Liu Ziyin: But let's say you have a neural network, right? It has some weights. Yes, yes. You can't… you kind of can't distill it into the same size, same architecture neural network. It's hard. There's some paper… but if you make it bigger, if you make the student bigger, you actually can.

222
00:34:30.040 --> 00:34:42.980
Liu Ziyin: Right, so something about the learning gets easier. You, you can't fit the teacher, or you are saying… Exactly, like, even… the teacher, exact, exact copy-paste, student…

223
00:34:43.139 --> 00:34:49.910
Liu Ziyin: random init. Now you're trying to match the teacher, right? But just based on input-output? That doesn't work.

224
00:34:50.060 --> 00:34:55.229
Liu Ziyin: Yes, that's right, yes. But if you take a big student, it can match the teacher.

225
00:34:55.800 --> 00:34:58.929
Liu Ziyin: Yeah! It connects to… to this…

226
00:34:59.420 --> 00:35:10.360
Liu Ziyin: People say that a lot, but I'm not too familiar with the literature on teacher-student setting. It's actually a very common theoretical setting, so I don't have a good answer, but…

227
00:35:10.360 --> 00:35:26.829
Liu Ziyin: Yeah, right? The wider you are, the more easily you get to train them, and therefore, the easier they fit the… I guess you're saying, like, all of your research is saying, like, I can initial… if I would initialize only my student correctly, using the moments of the teacher, I would be good.

228
00:35:27.170 --> 00:35:27.970
Liu Ziyin: Right.

229
00:35:28.130 --> 00:35:37.739
Liu Ziyin: But, like… Yes, so yes, you only need to… okay, yes, that's a very good point. Here, in here, it's collapsed in some way. Here, it will collapse in another way.

230
00:35:37.840 --> 00:35:39.749
Liu Ziyin: And I'm… I'm not gonna get boot.

231
00:35:39.860 --> 00:35:54.950
Liu Ziyin: Yes, actually, that's very good. So, yes, you only need to match the moments of the teacher, so there's no… really no need for you to really… it's sort of impossible for you to really recover… If I just use my standard toolkit, I'm not gonna match the moments. Yes.

232
00:35:55.310 --> 00:36:00.670
Liu Ziyin: But, like, if I make a huge thing, it's gonna match the moment. I think so, yes.

233
00:36:01.050 --> 00:36:03.390
Liu Ziyin: Actually, that's a very good point,

234
00:36:04.090 --> 00:36:08.510
Liu Ziyin: I think you're right, but I think it's an open problem, very interesting. Do you have the paper?

235
00:36:08.710 --> 00:36:20.690
Liu Ziyin: I can't dig it out. It'd be cool to just, like, replicate the result, and then just match the moments and show that actually you can train the student of the same size if you only do this, like, nice moment matching?

236
00:36:21.130 --> 00:36:22.590
Liu Ziyin: Procedure, yeah.

237
00:36:23.900 --> 00:36:33.950
Liu Ziyin: Yes, but should I… I think I should continue. Maybe I can leave the questions at the end, or do I have enough time? When am I supposed to end?

238
00:36:34.150 --> 00:36:39.539
Liu Ziyin: Yeah, we usually run over. Okay, okay, I'll answer questions then.

239
00:36:40.060 --> 00:36:53.709
Liu Ziyin: Can you go back to the graph that Chris pointed out? It was really cool, where you did the moment matching experiment and showed that the loss of the smaller model basically, like, roughly, yeah. Yes, can you explain the… so…

240
00:36:54.370 --> 00:37:03.960
Liu Ziyin: Is the moment matching thing where you take the… you match… so, also, I'm just, like, sketching out my understanding of this experiment.

241
00:37:04.140 --> 00:37:07.570
Liu Ziyin: In the hopes that you'll correct me if I misunderstand something, but…

242
00:37:07.730 --> 00:37:11.079
Liu Ziyin: what I think you described was,

243
00:37:12.150 --> 00:37:18.579
Liu Ziyin: The flatline is just, like, training just the smaller model without knowledge of anything bigger. Then…

244
00:37:18.830 --> 00:37:27.240
Liu Ziyin: You took a bigger model, you trained it, just in the normal way, and then at every step of, like, at every training step, you…

245
00:37:27.610 --> 00:37:30.790
Liu Ziyin: Did some kind of process where you…

246
00:37:31.050 --> 00:37:44.789
Liu Ziyin: did, like, a sort of weight transfer where you took the statistical properties, or you did, like, a distribution matching thing, right? Yeah, but we only match it at the initial… at the first step. So it's not every step we match it, we will only match it at the first step, and we do the standard training, yes.

247
00:37:44.790 --> 00:37:52.150
Liu Ziyin: Oh my god. That's very cool, actually, yes, that's a very cool result, yes.

248
00:37:52.380 --> 00:38:11.649
Liu Ziyin: Like, I didn't understand, like, so you're taking, like, a fully trained large model to initialize a small model? No, we take two networks that are in… So both the big and small network are random initialization. And so there's no awareness of the training data whatsoever? There's no awareness, yes, yes, there's no awareness.

249
00:38:12.480 --> 00:38:19.060
Liu Ziyin: If you match the moments of the neuron, the neuron actually refers to the vectors that you define.

250
00:38:19.330 --> 00:38:26.679
Liu Ziyin: Yeah, it's just, like, concatenation of some… The input and output weights is a little bit more… Yeah.

251
00:38:26.680 --> 00:38:47.400
Liu Ziyin: So, is… is this the correct intuition? So, in the… in the… in the higher dimensional, like, weight space that the larger model gives you, you, like, just by, like, combinatorics, you have way more, like, hypotheses you can test. Yes, yes. You also get way more redundancies as well. Yes, yes. If you're transferring over, like, the larger number of, like, hypotheses, or, like.

252
00:38:47.950 --> 00:38:58.860
Liu Ziyin: like, the more interesting, like, distributional properties from the higher dimensional space, while also getting rid of the redundancies, and that's how you can, like, lower it down. Yes, exactly. Yes, yes. While still, like.

253
00:38:59.010 --> 00:39:07.920
Liu Ziyin: preserving a lot of interesting structure that, like, a typical initialization of the smaller model wouldn't, get to. Yes, exactly.

254
00:39:08.030 --> 00:39:21.520
Liu Ziyin: Yes, so that's basically what… yes, that's the essence of the equivariance principle of learning, which basically states that a lot of weight distributions really parameterize the same learning dynamics. There's a huge redundancy, yes.

255
00:39:22.290 --> 00:39:37.090
Liu Ziyin: And so could, like, is this just really expensive to do the mode? It is very expensive, yes. But I hope there's a more… I hope there are cheaper ways, faster ways, maybe approximate ways to do them, but if you want to do the exact one.

256
00:39:37.100 --> 00:39:43.530
Liu Ziyin: It's actually quite slow. It's polynomial time, but it's not good. But if it doesn't depend at all.

257
00:39:43.680 --> 00:39:49.529
Liu Ziyin: On the training data, then it seems like a reasonable thing to do.

258
00:39:49.720 --> 00:40:01.530
Liu Ziyin: Is to understand the statistics of what would result from the moment matching process, and then you can just initialize your small networks in this way to have those statistics.

259
00:40:01.760 --> 00:40:08.849
Liu Ziyin: Yes. So you can imagine writing a dictionary of initializations, right? You pre-compute all your…

260
00:40:09.120 --> 00:40:21.740
Liu Ziyin: or your initializations, so that your smaller network really sort of role-plays as a bigger network when you train. So, it's imaginable that we can do that, yes. Turn it into a PyTorch function or something.

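The speaker's actual moment-matching procedure is in the paper; as a loose illustration of the "pre-computed dictionary of initializations" idea, here is a crude sketch that matches only the first two moments (mean and covariance) of a big network's neuron distribution when initializing a small one. All shapes and the Gaussian assumption are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(3)

# Big randomly initialized net: 512 hidden neurons, each neuron viewed as the
# concatenation of its input weights (8) and its output weight (1) -> 9-dim.
big_neurons = rng.normal(size=(512, 9))

# Crude "moment matching": fit the mean and covariance of the big net's
# neuron distribution, then sample the small net's 32 neurons from it.
mu = big_neurons.mean(axis=0)
cov = np.cov(big_neurons, rowvar=False)
small_neurons = rng.multivariate_normal(mu, cov, size=32)

# Unpack the small net's parameters from the sampled neuron vectors.
W_in, w_out = small_neurons[:, :8], small_neurons[:, 8]
print(W_in.shape, w_out.shape)
```

The real procedure presumably matches more than two moments; this only shows where a precomputed, data-independent initialization routine would slot in.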
261
00:40:21.800 --> 00:40:37.780
Liu Ziyin: But, like, you are also saying, at the same time, this is not what we are doing. And this is… so… But he's just taking this random initialization, and he's distilling it into some other… Right, but here's the connection that I'm gonna draw to our field.

262
00:40:38.150 --> 00:40:46.400
Liu Ziyin: there's these people that think about superposition. I see. And they are like, okay, but maybe our networks are fucked.

263
00:40:46.920 --> 00:40:57.130
Liu Ziyin: Because they're simulating this huge network, right? They're like, that would… that would really fuck our network, right? Right. And make it really hard to interpret.

264
00:40:57.440 --> 00:41:08.249
Liu Ziyin: Yeah, it's… Everything is, like, you know, all of these directions, and… We don't have enough dimensions, multiple things, and stuff.

265
00:41:08.820 --> 00:41:14.059
Liu Ziyin: So yeah, that's the niche, additional… to add more to the confusion.

266
00:41:14.170 --> 00:41:22.280
Liu Ziyin: Yes. I think you have a huge equivalence class.

267
00:41:22.390 --> 00:41:34.850
Liu Ziyin: That, of course… So that's actually… if you're right, then that's not what's happening. Oh, okay, interesting. No, but, like, the models tend… in this notion of…

268
00:41:35.100 --> 00:41:39.969
Liu Ziyin: neurons or something. The standards… in the standard setting of training.

269
00:41:40.070 --> 00:41:44.190
Liu Ziyin: You actually collapse to the smaller class.

270
00:41:44.490 --> 00:41:48.019
Liu Ziyin: Defense, so… and if you wanted to collapse to the bigger network.

271
00:41:48.280 --> 00:41:50.350
Liu Ziyin: You would tend to match the moments.

272
00:41:50.970 --> 00:42:06.059
Liu Ziyin: That's right. That's right. So, kind of a contradiction. It's a contradiction. So, it's like the existence of a… there is an existence of a small network that might not have these problems. We just don't know how to initialize that.

273
00:42:06.250 --> 00:42:07.000
Liu Ziyin: Burden.

274
00:42:07.210 --> 00:42:23.659
Liu Ziyin: testing them for free, right? So, initialization is definitely interesting and difficult. So, there's a recent paper by Brian Chan, I don't know if you know him. He's a MIT postdoc, yes, and he had a paper called Network of Theseus.

275
00:42:23.660 --> 00:42:27.369
Liu Ziyin: Right. Very recent. They showed that if you match a MLP

276
00:42:27.370 --> 00:42:52.229
Liu Ziyin: to a, for example, a CNN, a RASNAT at initialization. So if you match their, basically, representations at initialization, they're actually trying to reach the same accuracy on ImageNet. So it's actually very related to the result I have. It's related. Yes, it's very strong. I'm just putting the camera on you, because you're looking at the ceiling, if you want to… Yes, yes. Okay, great.

277
00:42:52.230 --> 00:43:02.510
Liu Ziyin: Close the door, if it's possible. Okay, so… Okay, let me start the second part. Okay. Thank you.

278
00:43:03.230 --> 00:43:21.239
Liu Ziyin: Okay, so there… so I have talked about how just simple permutation symmetries allows you to very strongly and interestingly characterize the learning process of neural networks, and there's actually a key missing part to that picture, which is the fact that the training process

279
00:43:21.480 --> 00:43:30.779
Liu Ziyin: are stochastic of neural networks. You have to assemble data points, and you train according to the stochastic, loss functions, or data sampling.

280
00:43:30.940 --> 00:43:38.089
Liu Ziyin: Okay, and that's really… so that's answered by the second part of this talk, which is irreversibility.

281
00:43:38.260 --> 00:43:50.290
Liu Ziyin: So here, I really only talk about the visibility in the training process. Of course, I think they will also be very important for the information process of a neural network, but we are starting to work on that, and we don't have a result yet, but…

282
00:43:50.290 --> 00:44:06.669
Liu Ziyin: So let's only focus on the training. Okay, let's look at a generic training algorithm, okay? So this is a discrete time generic training algorithm. Eta is your learning rate, and U is your generic learning algorithm. Could be atom, could be STD, but for most of the time, let's imagine STD, okay?

283
00:44:06.980 --> 00:44:08.919
Liu Ziyin: And what you can show is that

284
00:44:09.100 --> 00:44:28.460
Liu Ziyin: Training on this discrete time algorithm is the same as evolving this continuous time algorithm, which is… essentially, the first term is the same as your learning rule, but you also have a second order in the learning rate, emergent term, that we call the entropic force, of training.

285
00:44:28.640 --> 00:44:44.709
Liu Ziyin: And you can actually establish a bunch of physics results showing that this term is actually really the dissipation rate of this discrete time, a stochastic process. So you could be stochastic, it can depend on randomly… random samples of the data. Okay? And…

286
00:44:44.820 --> 00:44:48.000
Liu Ziyin: I will refer to this term as Nabla as, okay.

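A sketch of this entropic term for a concrete case. Assuming it takes the form S(theta) = (eta/4) * E ||grad l||^2 over per-sample losses (the paper's exact constants and form may differ), for squared per-sample losses both S and its gradient, the "entropic force" -grad S, can be written in closed form; the finite-difference check below confirms the analytic gradient:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(20, 3))       # 20 samples, 3 parameters
y = rng.normal(size=20)
theta = rng.normal(size=3)
eta = 0.1                          # learning rate

# Per-sample loss l_i = (x_i . theta - y_i)^2, so grad l_i = 2 * r_i * x_i
# and ||grad l_i||^2 = 4 * r_i^2 * ||x_i||^2.
def S(theta):
    """Assumed entropic potential: (eta/4) * mean_i ||grad l_i(theta)||^2."""
    r = X @ theta - y
    return eta / 4.0 * np.mean(4.0 * r**2 * (X**2).sum(axis=1))

# Analytic gradient of S: (eta/4) * mean_i 8 * r_i * ||x_i||^2 * x_i.
r = X @ theta - y
grad_S = (eta / 4.0 * (8.0 * r * (X**2).sum(axis=1))[:, None] * X).mean(axis=0)

# Check against central finite differences.
eps = 1e-6
fd = np.array([(S(theta + eps * e) - S(theta - eps * e)) / (2 * eps)
               for e in np.eye(3)])
print(np.allclose(grad_S, fd, atol=1e-5))
```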
287
00:44:48.670 --> 00:45:08.090
Liu Ziyin: And it turns out that very… two very interesting and universal phenomena in deep learning are actually related to this term. To this very simple term, you can actually explain two phenomena. The first is the recent platonic representation hypothesis. So people have found that if you have two large neural networks trained on related datasets.

288
00:45:08.440 --> 00:45:15.360
Liu Ziyin: And they tend to learn very similar… they tend to encode data points in very similar, if not identical, ways.

289
00:45:15.570 --> 00:45:36.570
Liu Ziyin: And the second is a more optimization phenomena called edge of stability. So it states that any neural network at training will slowly lose stability, eventually staying at the tipping point, which is called the edge of stability. It is more technical, but it is actually very universal. If you look at the original paper by Jeremy Kogan, it actually showed, like.

290
00:45:36.570 --> 00:45:39.459
Liu Ziyin: A million networks have the… have this phenomenon.

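The stability threshold behind edge of stability can be seen on a quadratic: for plain gradient descent on L(t) = 0.5 * a * t^2, iterates contract exactly when the sharpness a stays below 2/eta, and diverge above it. A minimal pure-Python check (a toy illustration of the threshold, not the paper's experiments):

```python
# Gradient descent on L(t) = 0.5 * a * t**2 multiplies t by (1 - eta * a)
# each step, so it is stable iff the sharpness a is below 2 / eta.
def run_gd(a, eta, steps=50, t0=1.0):
    t = t0
    for _ in range(steps):
        t -= eta * a * t        # GD step: t <- t - eta * L'(t)
    return abs(t)

eta = 0.1
print(run_gd(a=15.0, eta=eta) < 1e-3)   # a < 2/eta = 20: converges
print(run_gd(a=25.0, eta=eta) > 1e3)    # a > 2/eta: diverges
```

"Edge of stability" is the empirical observation that training drives the sharpness up until it sits right at this 2/eta tipping point.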
291
00:45:39.740 --> 00:45:57.680
Liu Ziyin: Okay, and to sort of… So can you give a little bit more intuition about delta S, or noblet S? Yes, I will give you more intuitions, okay, very, very, very soon in the next slide. Okay, so to foreshadow the result, what event… what essentially happens is that this term will actually break…

292
00:45:57.680 --> 00:46:16.319
Liu Ziyin: all continuous symmetries of your original training algorithm, which is U. So what that means is that, suppose your original loss function looks like this, and you have a huge degenerate manifold of solutions, the second term will sort of tilt your solution to favor one of the

293
00:46:16.390 --> 00:46:21.119
Liu Ziyin: Huge degenerate… one solutions in the huge degenerate manifold of solutions.

294
00:46:21.360 --> 00:46:26.240
Liu Ziyin: Okay, so this is what the result… what this result is all about, and…

295
00:46:26.380 --> 00:46:42.309
Liu Ziyin: Okay, okay, so if you apply it to the problem of the platonic representation hypothesis, what I will show is basically that among all the different ways of representing the data, the platonic way of representing them is actually favored because of this second term.

296
00:46:43.420 --> 00:46:58.549
Liu Ziyin: Okay, okay, okay, there's a… you shouldn't have seen the second part. Okay, okay, so I will, for the rest of the talk, I will… I will specialize to focus on SGD. So here, U is really the gradient of your loss function.

297
00:46:58.550 --> 00:47:10.109
Liu Ziyin: And the second term, which I call the entropic term, will take this form, okay? So, you are taking the gradient of this function, which is the norm of the

298
00:47:10.140 --> 00:47:22.930
Liu Ziyin: stochastic gradient itself, okay? So this term really has the straightforward meaning that your training process at a finite learning rate, eta, prefers those learning trajectories that

299
00:47:23.120 --> 00:47:25.280
Liu Ziyin: Minimized its fluctuation.

300
00:47:25.670 --> 00:47:37.979
Liu Ziyin: Okay, so you prefer minimal gradient fluctuation solutions, or training directories. So this is a straightforward, meaning of that. What's the expectation number? It's over the training set, yes.

301
00:47:38.420 --> 00:47:56.120
Liu Ziyin: And you can actually apply the… look at what this term really looks like for a single layer. So let's look at the single, fully connected layer. H is the post-activation of the previous layer, W is the trainable weight of this layer, and P is basically the pre-activation of the next layer.

302
00:47:56.320 --> 00:48:13.299
Liu Ziyin: Let's look at this layer. Well, and we want to look at what the entropy loss really look like for this specific layer, and you can just do the computation, and apply the chain rule. You see that it's a multiply, it's a product of the

303
00:48:13.460 --> 00:48:20.099
Liu Ziyin: neuron gradient, so it's the gradient with respect to P, multiplying the representation H itself, okay?

304
00:48:20.470 --> 00:48:22.069
Liu Ziyin: Okay, there's… okay.

305
00:48:22.110 --> 00:48:33.430
Liu Ziyin: And therefore, the norm of it is really the norm of the gradient multiplying the norm of the representation. Therefore, the second term really encourages you to learn in two ways.

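The factorization just described is a property of outer products: with the chain rule giving grad_W l = g h^T (g the gradient with respect to the pre-activation p, h the incoming representation), the Frobenius norm factorizes as ||g|| * ||h||. A quick numerical check:

```python
import numpy as np

rng = np.random.default_rng(5)
h = rng.normal(size=7)    # post-activation of the previous layer (representation)
g = rng.normal(size=4)    # gradient of the loss w.r.t. the pre-activation p

# For p = W @ h, the chain rule gives dL/dW = outer(g, h), and the
# Frobenius norm of a rank-one outer product factorizes exactly:
grad_W = np.outer(g, h)
lhs = np.linalg.norm(grad_W)                   # ||dL/dW||_F
rhs = np.linalg.norm(g) * np.linalg.norm(h)    # ||g|| * ||h||
print(np.allclose(lhs, rhs))
```

This is why penalizing the gradient norm of the layer simultaneously penalizes the representation norm ||h|| and the backward gradient norm ||g||.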
306
00:48:33.430 --> 00:48:44.170
Liu Ziyin: first of all, you are encouraged to learn the most compact representation, okay? You want to have a small… you want all your representations, H, to be as small as possible, okay?

307
00:48:44.300 --> 00:48:53.730
Liu Ziyin: And you also want your gradient to be as small as possible. That's why, actually, people see a lot of low-rank gradient phenomena. My hypothesis is that it's actually due to the first term.

308
00:48:54.110 --> 00:48:55.760
Liu Ziyin: So,

309
00:48:55.830 --> 00:49:18.570
Liu Ziyin: Let me… So, I'm still… I'm still… I'm still trying to get information here. So, does the second… is the second term talking about something… so, I'm not following. Is the second term, so… This is the instantiation of the effective dynamics for… For SGD, yes. So he spent a whole month or something to derive this equation.

310
00:49:18.940 --> 00:49:43.049
Liu Ziyin: Right, so is the second term coming from the S in SUD? Is it from the stochastic, or what is the second term from? Very good. It comes from a combination of two terms. The first is the discretization error. That's why you see a factor of eta here, right? Okay. And the second is the stochasticity. So this second term, you could actually decompose it again into a mean part and a fluctuation part.

311
00:49:43.180 --> 00:49:51.959
Liu Ziyin: The fluctuation part comes from stochasticity, and the mean part comes from discretization error. So if you were… if you were to…

312
00:49:52.200 --> 00:50:02.929
Liu Ziyin: do full batch gradient descent, instead of surpassive gradient descent, and if you were to do it in infinite precision, then you would have no second term. You will have no second term.

313
00:50:04.540 --> 00:50:20.770
Liu Ziyin: If you do full batch gradient flow, you will not have second term. With a… actually… Yes, because… As a differential… Yes. Yes. If you do discrete time, finite step size gradient descent, you still have the…

314
00:50:21.040 --> 00:50:31.220
Liu Ziyin: this term is still there, but you only have the mean part, not the fluctuation part. I see. So there is some effect, but it's much weaker. I see. Yes.

315
00:50:32.470 --> 00:50:44.170
Liu Ziyin: So, it's a combination of two effects, discretization error and stochasticity of training. And that leads to this emergent form that minimizes, essentially, the

316
00:50:44.330 --> 00:50:56.890
Liu Ziyin: the norm of the… of the training algorithm. So if you… if you take bigger steps and smaller batches, then you will have higher…

317
00:50:57.610 --> 00:51:03.050
Liu Ziyin: Decipitative effects. You will have higher deceptive effects, and you will see that

318
00:51:03.350 --> 00:51:17.340
Liu Ziyin: your average gradients will be much smaller. So if you increase your learning, it actually causes the gradient to become small. It's a very common effect. I should have… I don't have a figure here, but I have seen that a thousand times, so it's very… it's very robust.

319
00:51:17.830 --> 00:51:22.359
Liu Ziyin: Yes. Your effective learning rate goes down. Your…

320
00:51:22.600 --> 00:51:41.019
Liu Ziyin: It's difficult to say, because really, it's the com… it's the product of the two that gives the effective learning rate. I see, but you're… but what you're really doing when you have your physical learning rate high, is you're putting it all into the second term, you're just bouncing around all over the place. Yeah. And… but your effective learning rate might not…

321
00:51:41.020 --> 00:51:45.500
Liu Ziyin: Yeah, it could go down exactly. It could actually. Yes, yes, yes. Yes, yes.

322
00:51:45.920 --> 00:51:50.059
Liu Ziyin: Yes, so that's the right interpretation, yes. So what quantity gets preserved?

323
00:51:51.020 --> 00:51:53.150
Liu Ziyin: What quantity gets preserved?

324
00:51:53.680 --> 00:51:58.780
Liu Ziyin: Yeah, what do you mean? Like, if you increase the learning rate, you're saying your gradients decrease.

325
00:51:59.100 --> 00:52:03.889
Liu Ziyin: You can sort of say that their product are… Gets preserved?

326
00:52:04.090 --> 00:52:12.479
Liu Ziyin: not preserved, it increases still, but slow… at a slower rate than you would expect naively.

327
00:52:13.000 --> 00:52:13.840
Liu Ziyin: Yes.

328
00:52:15.180 --> 00:52:31.090
Liu Ziyin: Okay, so… okay, in the rest of the part two, I will talk about three minimal models. So they are very simple mathematical models that you can solve exactly, that will give rise to three interesting phenomena. And the first is the phenomena of phase transition.

329
00:52:31.290 --> 00:52:41.859
Liu Ziyin: And we look at a very simple problem, okay? So this is our loss function. You have two trainable parameters, W and V, and you have a stochastic sampling of data point X, okay? And X,

330
00:52:41.950 --> 00:52:57.919
Liu Ziyin: has expectation value and a variance, okay? And you can really compute the entropic loss, okay? So the first term, again, is the same as the original loss function in expectation, and there is an emergent term, okay? So you really minimize this thing if you're training with SGD.

331
00:52:58.120 --> 00:53:11.919
Liu Ziyin: So you see that the second term encourages W and V to be as small as possible, and that leads to… that predicts a phase transition where, at a critical learning rate, you will actually prefer both W and V going to zero.
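The collapse being described can be sketched in a few lines. This is a minimal toy of my own, not the speaker's code: I assume a per-sample loss l(w, v; x) = w·v·x, whose expectation is unbounded below when E[x] > 0, and run plain SGD at two learning rates; the distribution of x is my choice, made so the two phases are easy to see.

```python
import numpy as np

def sgd_product_model(lr, steps=5000, seed=0):
    """Plain SGD on the per-sample loss l(w, v; x) = w * v * x.

    With E[x] > 0 the expected loss is unbounded below (it wants
    w * v -> -infinity), yet a large learning rate drives both
    parameters to the saddle point at the origin instead."""
    rng = np.random.default_rng(seed)
    w, v = 1.0, 0.3                           # asymmetric init, off the saddle
    for _ in range(steps):
        x = rng.normal(0.1, 1.0)              # noisy samples, small positive mean
        w, v = w - lr * v * x, v - lr * w * x # simultaneous SGD update
    return w * v

print(sgd_product_model(lr=0.1))  # small lr: escapes, |w * v| grows large
print(sgd_product_model(lr=0.7))  # large lr: collapses to the saddle w = v = 0
```

Below some critical learning rate the parameters escape toward the unbounded minimum; above it the multiplicative gradient noise makes the symmetric state at the origin attractive, which is the phase transition on the slide.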

332
00:53:12.160 --> 00:53:19.730
Liu Ziyin: So, SGD is implicitly regularized. It is implicitly regularized by the noise and the finite step size.

333
00:53:20.070 --> 00:53:21.539
Liu Ziyin: Formalize it for you.

334
00:53:21.890 --> 00:53:23.199
Liu Ziyin: percolating with F.

335
00:53:23.490 --> 00:53:27.170
Liu Ziyin: Yes, yes. How do you get this? This term?

336
00:53:27.170 --> 00:53:44.700
Liu Ziyin: Where does this F come from? It's not difficult to compute. I will encourage you to read our paper. You basically do this analysis: you try to find an integral that approximates your discrete-time dynamics, and you will see, to leading order, it's really this term.

337
00:53:44.740 --> 00:53:48.300
Liu Ziyin: It's easy to do. We can talk about it, yes.

338
00:53:48.950 --> 00:53:51.709
Liu Ziyin: Yes, yes, it's easy to do. Okay. No problem.

339
00:53:52.340 --> 00:54:05.450
Liu Ziyin: Yes. Yes, but it's actually a second-order expansion, and you get it very soon.

340
00:54:05.660 --> 00:54:07.560
Liu Ziyin: I think I could work it out.

341
00:54:07.590 --> 00:54:15.519
Liu Ziyin: Yes, yes, that's the… that's the essential message, yes. Okay, so this is really the simulation of… of run… of…

342
00:54:15.520 --> 00:54:23.059
Liu Ziyin: training with SGD on this loss function. You see that it's unbounded, right? The global minimum is where W times V is basically infinity.

343
00:54:23.060 --> 00:54:39.010
Liu Ziyin: And here is what happens. So you see that at a high learning rate, they actually converge to the saddle point at zero, because of this entropic term. You see? Initially, they are sort of spread out, but as you train, they actually get closer and closer to the saddle point.

344
00:54:39.010 --> 00:54:44.440
Liu Ziyin: That's an example of a… Without any weight decay or anything? There's no weight decay in this example.

345
00:54:44.580 --> 00:54:57.089
Liu Ziyin: Okay, so… this actually has to do with symmetry. So this point is actually the symmetric state of the Z2 symmetry. Here, wait, sorry, what's your… what are your labels?

346
00:54:57.420 --> 00:55:05.670
Liu Ziyin: Oh, okay, there's no label, you see. The loss function is very simple, you… Okay, you want your thing to output 0. To basically correlate with X.

347
00:55:05.880 --> 00:55:11.330
Liu Ziyin: No, no, okay, to minimize this loss function, you want your W and V to correlate with X.

348
00:55:13.740 --> 00:55:23.660
Liu Ziyin: Okay, I'm confused. Correlate — so W and V could take any value? Yes, yes, yes. As long as X is one, you want the product to be minus one.

349
00:55:24.120 --> 00:55:29.440
Liu Ziyin: Yes, yes. So you want the product to be minus infinity? Minus infinity, yes. So that's why it's unbounded.

350
00:55:29.650 --> 00:55:39.199
Liu Ziyin: Yes, but yet you really… the training process really prefers not learning anything at a high learning rate. That's the message here. So even if, to minimize the loss function, you need a…

351
00:55:39.390 --> 00:55:48.920
Liu Ziyin: you need to go to negative infinity, but you actually don't go there at higher learning rates. For, like, small learning rates, you might…

352
00:55:49.070 --> 00:55:56.180
Liu Ziyin: go to negative infinity-ish? Yes, for smaller learning rates, yes, you will escape, always, yes. So there's a…

353
00:55:56.450 --> 00:56:04.520
Liu Ziyin: what we call a critical point. That's where the phase transition happens. Where the second term kind of takes… Yes, where the second term dominates the first term, yes. Good.

354
00:56:05.550 --> 00:56:12.430
Liu Ziyin: Okay, and you can also solve a phase diagram, I'll just skip it. And now let's look at the…

355
00:56:12.710 --> 00:56:28.110
Liu Ziyin: The second phenomenon, which is edge of stability, and here, let's look at a slightly more complicated problem, okay? So we have an MSE loss function, and our label is Y hat, which is generated by a linear teacher, V,

356
00:56:28.110 --> 00:56:38.680
Liu Ziyin: and has some inherent zero-mean noise epsilon. And again, we train with SGD with basically a batch size of 1, and our model is a deep linear network.

357
00:56:38.860 --> 00:56:54.449
Liu Ziyin: Okay, it's a simple model, but it actually is strongly nonlinear in terms of the learning dynamics and in terms of the loss landscape, and it also has the notion of representations, so it's actually a very good minimal model. The notion of what?

358
00:56:54.450 --> 00:57:13.379
Liu Ziyin: Of representations. Representations. Right? Because of the layers. Yeah, because of the hidden layers, right? So W1 times X is the first hidden-layer representation, and so on. I think, like, just one comment, I think it's super interesting to see this, like, because when I learned about, like, machine learning, deep learning stuff, like,

359
00:57:13.630 --> 00:57:28.829
Liu Ziyin: this case was always dismissed, right, because, like, you could write it as one matrix. This is the boring case. Right, but it actually has different learning dynamics, that's what you are saying. Yes, yes, when you are deeper, yes. If you spend some effort, you see it's actually different.

360
00:57:28.830 --> 00:57:36.130
Liu Ziyin: Yeah, so it's actually a very good model of learning dynamics. Maybe it's not a good model of expressivity, but I think it's a…

361
00:57:36.180 --> 00:57:42.570
Liu Ziyin: Thanks for giving me. Thank you for… thank you for thanking me.

362
00:57:44.400 --> 00:57:47.820
Liu Ziyin: Okay, so here's the prediction.

363
00:57:48.050 --> 00:57:59.870
Liu Ziyin: I'll jump through all the math, but here's the prediction. If the training does minimize the gradient fluctuation, which is basically the entropic term, then higher label noise imbalance, so basically.

364
00:57:59.870 --> 00:58:13.529
Liu Ziyin: the imbalance on epsilon. So epsilon is a vector, and you can look at the spectrum of its covariance. If some label has low variance and some label has high variance, that's an imbalance, and it will lead to higher sharpness.

365
00:58:13.740 --> 00:58:14.670
Liu Ziyin: Okay.

366
00:58:15.030 --> 00:58:34.450
Liu Ziyin: So here is what happens during the training of one such deep linear network, and we either have isotropic label noise on the labels, or non-isotropic label noise on the labels. And you can see that the global minimum of your loss function actually doesn't change, so it absolutely doesn't change your landscape at all, whereas it actually changes the solution you converge to.

367
00:58:34.530 --> 00:58:48.680
Liu Ziyin: At high anisotropy, you go to sharper solutions, okay? So these are the trace of the Hessian, so that means a higher… a sharper place, whereas if you have isotropic label noise, you prefer a flat solution.

368
00:58:48.990 --> 00:59:02.120
Liu Ziyin: Can you explain why… why we would predict that? I didn't… You… that's very technical, you will have to prove 10 theorems about this loss function. Is there any, like… is there an intuition for why you would, like…

369
00:59:02.340 --> 00:59:17.350
Liu Ziyin: Why are we measuring sharpness, also? Okay. Because, okay, so this is trying to explain a phenomenon called edge of stability. Edge of stability consists of two different but universal phenomena. The first part is called

370
00:59:17.350 --> 00:59:28.630
Liu Ziyin: progressive sharpening. So it's saying that if you train any of your model, you see that the, basically, the sharpness increases as you train, so basically that's actually this part. It increases as you train, right?

371
00:59:28.630 --> 00:59:32.399
Liu Ziyin: And that's a universal phenomenon; it appears across many models.

372
00:59:32.400 --> 00:59:51.049
Liu Ziyin: And the second part is that you stay at a critical value. And this, like, place where you stay is determined by the learning rate. Determined by the learning rate. And what I hear you saying is it can actually be determined by the label noise. It is dependent, yes. So it is saying that you don't always have

373
00:59:51.050 --> 00:59:56.980
Liu Ziyin: edge of stability. You only get it when your label noise is strongly imbalanced.

374
00:59:57.270 --> 01:00:06.039
Liu Ziyin: Yes, so this theory explains the first part of edge of stability, which is called progressive sharpening. It explains why you get to sharper and sharper places.

375
01:00:06.700 --> 01:00:09.339
Liu Ziyin: And that's because of the label noise imbalance.

376
01:00:09.820 --> 01:00:22.219
Liu Ziyin: And that's actually very common, for example, for language tasks, right? Imagine the uncertainty of two different words, right? The word jack versus the word, I don't know, chamber it, they are very different.

377
01:00:22.970 --> 01:00:37.899
Liu Ziyin: So, that's why you actually see it universally in real data. So, but now, so… so your… the result comes from all the… through your technical proof? Yeah, it is, yes. But… but, like, alright, but then you must have some intuition.

378
01:00:39.010 --> 01:00:47.059
Liu Ziyin: That's… much harder… okay, there's some intuition.

379
01:00:48.340 --> 01:01:03.920
Liu Ziyin: So, the… okay, the key mechanism, I think, is very clear. It's that… so people in the past used to say that SGD training, for example, prefers flat minima, right? That used to be a very common and popular thing to say. But the key…

380
01:01:04.570 --> 01:01:14.880
Liu Ziyin: prediction of this theory is that SGD has no preference, a priori, for flatness or sharpness at all. It only has a preference for low fluctuation.

381
01:01:15.270 --> 01:01:33.570
Liu Ziyin: And it's sort of a… it's partially coincidental that for some data distributions, the minimal fluctuation solutions tend to be a… tend to be a flat one. For other data distributions, the minimal fluctuation solution tends to be a very sharp one.
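A concrete way to see "preference for low fluctuation, which may or may not coincide with flatness" is a scalar toy of my own making (not from the talk): f(x) = a·b·x fit to y = v·x + ε. Every pair with a·b = v interpolates equally well in expectation, but both the SGD gradient noise and the Hessian trace grow with a² + b², so for this isotropic noise the minimal-fluctuation solution happens to be the flattest one.

```python
import numpy as np

def noise_and_sharpness(a, b, v=1.0, sigma=0.5, n=50_000, seed=0):
    """Gradient fluctuation and Hessian trace of l = (y - a*b*x)**2
    at an interpolating point a * b = v, estimated from samples."""
    rng = np.random.default_rng(seed)
    x = rng.normal(0.0, 1.0, n)
    eps = rng.normal(0.0, sigma, n)              # isotropic label noise
    r = v * x + eps - a * b * x                  # residual = eps when a*b = v
    ga, gb = -2 * r * b * x, -2 * r * a * x      # per-sample gradients
    fluct = ga.var() + gb.var()                  # trace of gradient covariance
    tr_hess = 2 * (a**2 + b**2) * (x**2).mean()  # d2l/da2 + d2l/db2, averaged
    return fluct, tr_hess

bal = noise_and_sharpness(1.0, 1.0)    # balanced layers: a = b = sqrt(v)
unb = noise_and_sharpness(10.0, 0.1)   # same function, unbalanced layers
print(bal, unb)  # balanced has lower fluctuation AND lower sharpness
```

Both quantities scale like 4σ²·(a² + b²)·E[x²] and 2·(a² + b²)·E[x²] respectively, so minimizing fluctuation picks the balanced, flat interpolator here; with anisotropic noise and vector outputs, per the talk, the two criteria can come apart.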

382
01:01:34.180 --> 01:01:35.270
Liu Ziyin: And there…

383
01:01:35.400 --> 01:01:42.070
Liu Ziyin: And it's not completely a coincidence. You have to leverage a lot of symmetries in the model to

384
01:01:42.310 --> 01:01:48.100
Liu Ziyin: See… why, in this case, isotropic noise leads to,

385
01:01:48.280 --> 01:01:54.669
Liu Ziyin: flatness, but it's not so straightforward. So…

386
01:01:54.850 --> 01:02:03.619
Liu Ziyin: Actually, that's one thing I'm still working on, I'm still thinking about deeply. It would be nice, in my opinion, if I can really name a, let's say, a…

387
01:02:03.960 --> 01:02:11.290
Liu Ziyin: simple mechanism where it's actually… where I can just tell you. I hope I can get there, but we're not getting there as of today.

388
01:02:11.610 --> 01:02:12.480
Liu Ziyin: Yes.

389
01:02:12.750 --> 01:02:18.759
Liu Ziyin: Okay, and you can test the same thing on a variety of neural networks. So what we plot here is

390
01:02:18.880 --> 01:02:26.969
Liu Ziyin: X-axis is the condition number of the label noise spectrum, so that tells you how imbalanced they are, and the sharpness.

391
01:02:27.240 --> 01:02:41.730
Liu Ziyin: And you see that, basically, having a condition number of 1, meaning that your noise is completely isotropic, always corresponds to the flattest solution you get to, whereas if you increase the imbalance, you always get to sharper solutions.

392
01:02:42.760 --> 01:02:53.460
Liu Ziyin: Okay, so you can link it to the, basically, the Mexican hat picture I just said, which is basically… you basically, you really have a lot of different ways

393
01:02:53.650 --> 01:03:12.989
Liu Ziyin: to… to learn the function mapping, but there are also a lot of symmetries in the model, right? Because you're deep, you have at least two layers, and you can scale up one layer, and you can scale down the other layer. And it turns out that only some of those scalings are really preferred by your learning algorithm.

394
01:03:12.990 --> 01:03:16.759
Liu Ziyin: And those are the ones that correspond to edge of stability.

395
01:03:16.760 --> 01:03:19.620
Liu Ziyin: But more precisely, I mean, progressive sharpening.

396
01:03:19.950 --> 01:03:22.690
Liu Ziyin: And the other ones are not favored by the learning algorithm.

397
01:03:23.920 --> 01:03:24.750
Liu Ziyin: Okay?

398
01:03:24.950 --> 01:03:31.390
Liu Ziyin: Now let's talk about the last phenomena I will talk about, which is the platonic representation hypothesis.

399
01:03:32.150 --> 01:03:46.190
Liu Ziyin: It's a slightly more complicated model than what I have just shown. So now, instead of training on Z, sorry, training on X, we train on a transformed X, okay? So instead of seeing X, the model sees X hat.

400
01:03:46.190 --> 01:04:03.019
Liu Ziyin: which is a linear transformation of X, and Z is a full rank matrix, so it doesn't really lose information. So you still have essentially the same data, but transformed. They have different, for example, distance relationships between different data points. And we have the same loss function, we have the same label noise.

401
01:04:03.020 --> 01:04:11.579
Liu Ziyin: And we have the same model, and notice that here you really have a notion of representation, and therefore we can really talk about whether representations are aligned.

402
01:04:12.050 --> 01:04:13.540
Liu Ziyin: And here's the prediction.

403
01:04:13.690 --> 01:04:21.950
Liu Ziyin: If the training minimizes gradient fluctuation, then the learned representations, basically all the learned representations, are independent of Z.

404
01:04:22.560 --> 01:04:31.800
Liu Ziyin: So no matter how you transform your data point, you will always learn the same representation. Okay? Yes, please.

405
01:04:32.320 --> 01:04:36.739
Liu Ziyin: What does minimizing gradient fluctuation mean, exactly?

406
01:04:37.010 --> 01:04:39.090
Liu Ziyin: Or are we skipping the…

407
01:04:39.300 --> 01:04:46.900
Liu Ziyin: it's sort of related to what I said in the first slides of the second part, where I said, when you really…

408
01:04:48.090 --> 01:04:55.589
Liu Ziyin: When you have the entropic term for a fully connected layer, it encourages you to have the sort of the most compact representation.

409
01:04:55.770 --> 01:05:03.110
Liu Ziyin: So they… that sort of gives you… that sort of gives you a preference to, let's say,

410
01:05:03.420 --> 01:05:08.030
Liu Ziyin: non-redundant compact representation, so that's a very indirect…

411
01:05:08.140 --> 01:05:15.830
Liu Ziyin: argument I can give you. There's an implicit additional step where, like, the most compact representations are unique or something.

412
01:05:16.240 --> 01:05:32.949
Liu Ziyin: the most compact ones that are equivariant to the input transformations are platonic. So there's the equivariance part that I sort of skipped here in order to not to confuse you. Yes. I want to go from…

413
01:05:33.660 --> 01:05:35.460
Liu Ziyin: HD-1…

414
01:05:36.450 --> 01:05:45.489
Liu Ziyin: Okay, no, never mind. So that… yes, and so to complete the sentence: under what assumptions is this true? So, you're saying, oh, it's the same…

415
01:05:45.820 --> 01:05:50.169
Liu Ziyin: representation. If we initialize our WIs the same way.

416
01:05:50.340 --> 01:06:01.139
Liu Ziyin: Or, like, under… Yes. So here, what you can prove is that if this model reaches the global minimum of what we call the entropic loss,

417
01:06:01.440 --> 01:06:05.019
Liu Ziyin: Then… There's only one way to represent the data.

418
01:06:05.280 --> 01:06:16.680
Liu Ziyin: So as long as you reach the global minimum, there's only one way, and that's the platonic way. That's the way that's independent of Z. That's the way where you actually will essentially cancel out the Z in the first layer. Okay.
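The "cancel out the Z in the first layer" statement is mechanical to check. In this sketch of mine the weights are arbitrary rather than learned; the point is only that if the network seeing x̂ = Z·x ends up with first layer W1·Z⁻¹ while the network seeing x has W1, every downstream representation is literally identical.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
Z = rng.normal(size=(d, d))      # full-rank input transformation (a.s. invertible)
W1 = rng.normal(size=(d, d))     # first layer of network A, which sees x
W2 = rng.normal(size=(d, d))     # a shared downstream layer
x = rng.normal(size=d)

h_A = W1 @ x                                 # hidden representation of A
h_B = (W1 @ np.linalg.inv(Z)) @ (Z @ x)      # B sees Z @ x but its first layer cancels Z

assert np.allclose(h_A, h_B)                 # identical hidden representations
assert np.allclose(W2 @ h_A, W2 @ h_B)       # so everything downstream agrees too
```

The non-trivial claim in the talk is that training actually selects this Z-cancelling member out of the infinitely many first layers that fit the data; the snippet only verifies that the selected member yields perfectly aligned representations.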

419
01:06:16.990 --> 01:06:26.819
Liu Ziyin: Yes, okay, and here's a simulation of that. Again, this is really a deep linear network, and we train two different networks on two different… well,

420
01:06:26.820 --> 01:06:29.120
Liu Ziyin: Related but transformed data, okay?

421
01:06:29.120 --> 01:06:47.130
Liu Ziyin: And you see that the representation… so here I plot the representation similarity in one of the hidden layers, for two different networks. And you see that at the beginning of training, they are really not similar, but as you train, they actually essentially encode data in exactly the same way. Okay, so this is a case where you can actually reach perfect

422
01:06:47.130 --> 01:07:00.949
Liu Ziyin: representation alignment across the two networks. Does the last layer invert Z, or the first layer? It's the first… the first layer will always invert Z. The closest to the input, or… Closest to the input.

423
01:07:01.050 --> 01:07:17.330
Liu Ziyin: So what you can… so what's also the case that if you only look at the same network, all different layers will be aligned to each other. So you don't need two networks. So you have one network, and you have two hidden layers, and those two hidden layers really have the same representation, up to a rotation.

424
01:07:17.550 --> 01:07:22.290
Liu Ziyin: So, yes, there's a side effect. Does anything change?

425
01:07:22.640 --> 01:07:38.690
Liu Ziyin: Of… okay. So here, the theory is only done on linear… deep linear models. It's very unclear what will happen for nonlinear models. It's, in my opinion, a very huge, interesting, open problem.

426
01:07:38.710 --> 01:07:46.949
Liu Ziyin: actually, even the PRH itself, whether it has to do with linear structures, right? What if we did, like, linear residual networks?

427
01:07:47.450 --> 01:07:52.809
Liu Ziyin: like, identity plus W1, or… identity plus W2.

428
01:07:53.770 --> 01:07:55.869
Liu Ziyin: Would that change the answer?

429
01:07:56.900 --> 01:08:16.770
Liu Ziyin: I think you will still see very strong alignment, but I'm not sure if you will actually get perfect alignment. So here, what you can prove is that you get 100% alignment. They only differ up to a rotation, but if you have residual connections, I'm not sure. I'm not sure. I hope it's still the case, but it's very difficult to analyze.

430
01:08:17.010 --> 01:08:36.539
Liu Ziyin: But maybe doable, I'll think about it. Yes. You hope it's the same case, but that dramatically changes your learning dynamics. Yes, you really parameterize the same function, right? They really have the same expressivity, but…

431
01:08:36.740 --> 01:08:38.769
Liu Ziyin: Deep learning is…

432
01:08:39.460 --> 01:08:46.380
Liu Ziyin: Yes, but deep learning is not invariant to reparameterizations because of the nonlinearity in the learning dynamics.

433
01:08:47.029 --> 01:08:54.300
Liu Ziyin: So… I can't say off the top of my head. I hope it is true, and I can hypothesize it is true, but

434
01:08:54.729 --> 01:08:56.279
Liu Ziyin: I don't know, yes.

435
01:08:57.359 --> 01:09:00.129
Liu Ziyin: What about before you train?

436
01:09:01.020 --> 01:09:08.780
Liu Ziyin: Say it again? Like, before you've even done… like, it's just some kind of initialization, before you have done any kind of training.

437
01:09:09.420 --> 01:09:16.280
Liu Ziyin: Because there is no nonlinearity involved in this deep linear network… But you see different data?

438
01:09:16.649 --> 01:09:32.839
Liu Ziyin: Do you think you're saying that you… I mean, with different initializations, you have… No, not really. The reason is that you actually have infinitely many different ways of representing the data for the deep linear network.

439
01:09:32.870 --> 01:09:49.080
Liu Ziyin: Because of the symmetry, and here is why, here is why. So, if you look at a deep linear network, you can transform the previous layer by invertible transformation, linearly invertible transformation T, and you can transform the next layer by T inverse.

440
01:09:49.380 --> 01:09:53.390
Liu Ziyin: Such that… sorry, this T inverse should appear on the other side.

441
01:09:53.569 --> 01:09:58.630
Liu Ziyin: And… By doing that, you still get a global minimum, you still learn the same function.

442
01:09:58.960 --> 01:10:10.190
Liu Ziyin: But you represent data in completely different ways. So, out of all the infinitely many different ways of representing the data, you actually prefer the ones that are platonic, that actually cancel your

443
01:10:10.190 --> 01:10:21.419
Liu Ziyin: your Zs, okay? It's a very non-trivial phenomenon that you actually cancel it, and it's due to, basically, the entropic term. It's the minimization of the second term that takes you to these solutions.
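That T / T-inverse symmetry is easy to verify numerically. A sketch with arbitrary, untrained weights of my own choosing: transforming one layer by any invertible T and the next by T⁻¹ leaves the network's function unchanged while moving it along the degenerate manifold of representations.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
W1, W2 = rng.normal(size=(d, d)), rng.normal(size=(d, d))  # two-layer linear net
T = rng.normal(size=(d, d))        # a generic (a.s. invertible) transformation
x = rng.normal(size=d)

f_orig = W2 @ (W1 @ x)                               # original function value
f_symm = (W2 @ np.linalg.inv(T)) @ ((T @ W1) @ x)    # symmetry-transformed net

assert np.allclose(f_orig, f_symm)            # same function, same loss...
assert not np.allclose(W1 @ x, T @ W1 @ x)    # ...but a different hidden representation
```

Every invertible T gives another global minimum with a different internal representation, which is exactly why something extra, here the entropic term, is needed to pick one point on this manifold.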

444
01:10:23.860 --> 01:10:34.000
Liu Ziyin: Okay, that's… And you're hoping that this also explains generalization. You're hoping that, oh, there's something special about

445
01:10:34.370 --> 01:10:36.710
Liu Ziyin: You know, this… this… this optimal…

446
01:10:37.230 --> 01:10:47.460
Liu Ziyin: it could rela… I hope it could be related to generalization, and again, that's one thing we are really working on. And does it… and there's the intuition that…

447
01:10:49.420 --> 01:10:54.230
Liu Ziyin: So you keep on bringing up symmetry. So is the intuition that this…

448
01:10:54.420 --> 01:10:59.350
Liu Ziyin: Point is the maximally symmetric point, or something like that, or…

449
01:10:59.810 --> 01:11:04.520
Liu Ziyin: So you brought this up in a couple of the… Examples, huh?

450
01:11:06.220 --> 01:11:08.469
Liu Ziyin: You, I think a…

451
01:11:09.730 --> 01:11:17.510
Liu Ziyin: That's a very tempting thing to say, that maybe when you have symmetry, you prefer the maximally symmetric state, but it's…

452
01:11:18.410 --> 01:11:21.470
Liu Ziyin: It's actually unclear so far whether that's the case or not.

453
01:11:21.690 --> 01:11:36.340
Liu Ziyin: Here, it's a combination of two effects. The first, you do have symmetry, so basically you have many ways of representing the same data, and that also exists in nonlinear models. So I would assume similar effects will happen there.

454
01:11:36.580 --> 01:11:51.430
Liu Ziyin: And the second effect is the irreversibility of the training process. You prefer… so you sort of have to throw away your initial… your initial information before you can sort of get to a universal sort of shared representation, right? So that's what the irreversibility is sort of about.

455
01:11:51.860 --> 01:11:55.879
Liu Ziyin: So we need two effects, at least for this specific, exactly solvable model.

456
01:11:56.100 --> 01:12:02.989
Liu Ziyin: I can't say if that's the maximally symmetric state or not. It doesn't seem to be, actually. Yes.

457
01:12:03.900 --> 01:12:11.080
Liu Ziyin: Maybe in some strange hidden space, it is, but we are actually not sure about that yet.

458
01:12:13.010 --> 01:12:29.359
Liu Ziyin: Okay, so that essentially finishes the theoretical framework that I want to present to you. The idea is simple. When you have symmetries, that really creates different phases, a hierarchy of phases, where you have low expressivity at symmetric states, and higher expressivity at symmetry-broken states.

459
01:12:29.630 --> 01:12:33.330
Liu Ziyin: And regularization will take you between these two states.

460
01:12:33.840 --> 01:12:52.289
Liu Ziyin: And also, when you have a lot of neurons, that naturally leads to a huge manifold of solutions, and irreversibility sort of prefers specific solutions along the degenerate manifold, and that gives rise to interesting phenomena, such as platonic representations and edge of stability.

461
01:12:54.240 --> 01:13:11.790
Liu Ziyin: And there's the last question. If I have time, I can talk about it. If I don't have time, I'll just end here, which is whether you can make use of it. So, should I continue, or should I… Continue. Okay. Yeah, people, feel free to leave if you've got a conflict, but yeah. Okay, but, okay, before I start, are there any questions?

462
01:13:14.170 --> 01:13:26.640
Liu Ziyin: Okay, so, now that I have talked so much about solving exactly solvable, technical models, there's an interesting remaining question, which is whether we can make use of it.

463
01:13:26.910 --> 01:13:30.989
Liu Ziyin: Well, I personally do think we can make a lot of use of it, and

464
01:13:31.400 --> 01:13:48.369
Liu Ziyin: as the simplest example, let's consider what we call the law of parsimony. Well, it's not only a principle for machine learning, I think; it's also a principle for science and for our ordinary life, which states that among all the competing hypotheses, we always choose the simplest one, right?

465
01:13:49.370 --> 01:13:54.390
Liu Ziyin: And in the case of machine learning, here's…

466
01:13:54.760 --> 01:14:07.500
Liu Ziyin: how you frame it: so basically you want to learn a mapping from X to Y, from input to the label, and you have a model which takes X as input and is parameterized by your weights, which is theta.

467
01:14:07.810 --> 01:14:15.260
Liu Ziyin: Okay, and one instance of that is, basically, you want to find a model that learns the mapping

468
01:14:15.520 --> 01:14:22.339
Liu Ziyin: such that as many theta i are zero as possible, so you want the most

469
01:14:22.340 --> 01:14:39.529
Liu Ziyin: sparse parameterization of your model, and sometimes we believe that it's the simplest. Okay, and the classical way of doing it is lasso, right? You train your model, and you add an L1 penalty, and that naturally gives you a sparse parameterization.

470
01:14:40.360 --> 01:14:52.970
Liu Ziyin: But let's ask a different question, which is, how would a physicist solve this problem? How would Feynman solve it? How would Einstein solve it? How would Landau solve it?

471
01:14:53.090 --> 01:14:55.000
Liu Ziyin: Well, here's what I think.

472
01:14:55.130 --> 01:14:56.000
Liu Ziyin: Okay.

473
01:14:56.740 --> 01:15:05.140
Liu Ziyin: The physicists would try to leverage what is known to every physicist, which is continuous symmetry breaking.

474
01:15:05.640 --> 01:15:21.999
Liu Ziyin: Okay, so what you will do is that you take every parameter theta i, and you couple it to a new dynamical trainable variable VI. Okay, note that it sort of creates artificial new symmetries in the model. Okay, this is what you do. So you take your original model, which is F of theta.

475
01:15:22.000 --> 01:15:33.189
Liu Ziyin: to f of theta element-wise product V, so V has the same dimension as theta, so this effectively doubles your parameter count, but it also creates additional symmetries.

476
01:15:33.470 --> 01:15:40.539
Liu Ziyin: Okay, so the symmetric state is where theta i equals v i equals zero. Okay, I have a figure, actually.

477
01:15:40.830 --> 01:15:43.419
Liu Ziyin: Okay, I'll show that at the end. Okay.

478
01:15:44.020 --> 01:15:54.200
Liu Ziyin: And what you do is that you try to minimize the energy, which is your loss function. Could be anything, but here I wrote it as a MSE. And you try… minimize it at a finite temperature, okay?

479
01:15:54.850 --> 01:16:00.269
Liu Ziyin: And then, immediately, the fundamental laws of physics tell you two things.

480
01:16:00.480 --> 01:16:07.259
Liu Ziyin: If breaking the theta i symmetry reduces the energy L, then theta i will be non-zero. That's the symmetry-broken state.

481
01:16:08.060 --> 01:16:19.290
Liu Ziyin: And if that doesn't help you reduce the energy L, theta i will be 0. So that's the symmetric state. This is called the Landau theory of phase transitions, okay? So here is a figure of that. The symmetric state

482
01:16:19.530 --> 01:16:24.790
Liu Ziyin: of this additional, redundant, artificial parameterization

483
01:16:24.910 --> 01:16:27.410
Liu Ziyin: is where theta i equals v i equals 0.

484
01:16:27.870 --> 01:16:32.339
Liu Ziyin: And the symmetry-broken state is where theta i and v i are non-zero.

485
01:16:32.480 --> 01:16:33.400
Liu Ziyin: Okay.

486
01:16:34.470 --> 01:16:40.719
Liu Ziyin: And… So, and here is our simple algorithm, okay? We redundantly parametrize our model.

487
01:16:41.160 --> 01:16:56.060
Liu Ziyin: by doubling its parameter count, and we have a regularization term. And if you leverage the theory we proved previously, you know that you should add a regularization, and the higher the regularization, basically, the more you prefer the symmetric state.

488
01:16:56.770 --> 01:17:07.939
Liu Ziyin: And this term, in physics words, is really the temperature, okay? So when we add a temperature, it actually looks like this. It's like an L2 weight decay, okay?
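The mechanism is simple enough to reproduce in a few lines. This is my own minimal sketch, not the paper's code: for a linear model, write each weight as wᵢ = aᵢ·vᵢ and apply plain L2 decay to a and v. Since the minimum of (aᵢ² + vᵢ²)/2 over all aᵢ·vᵢ = wᵢ equals |wᵢ|, the redundant parameterization plus weight decay behaves like an L1 penalty, and vanilla full-batch gradient descent alone lands on a sparse solution.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 10
X = rng.normal(size=(n, d))
w_true = np.zeros(d); w_true[0] = 2.0      # only one relevant feature
y = X @ w_true                             # noiseless labels

a, v = 0.5 * np.ones(d), 0.5 * np.ones(d)  # redundant parameterization: w = a * v
lr, wd = 0.05, 0.05                        # wd plays the role of the temperature
for _ in range(3000):
    r = X @ (a * v) - y                    # residuals
    g = X.T @ r / n                        # gradient w.r.t. the effective weights
    a, v = a - lr * (g * v + wd * a), v - lr * (g * a + wd * v)

w = a * v
print(np.round(w, 3))   # ~[2, 0, ..., 0]: irrelevant weights sit at the symmetric state
```

The irrelevant coordinates decay exponentially toward theta = v = 0 (the symmetric state), while the one coordinate whose symmetry breaking lowers the loss survives, slightly shrunk by the temperature, exactly the two phases on the slide.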

489
01:17:08.520 --> 01:17:29.869
Liu Ziyin: And this is what happens if you really just train it with Adam or SGD. This is an MNIST problem. So, you see that very naturally, you learn… so this is a two-hidden-layer neural network, fully connected, and you see that you very naturally learn a very sparse neural network that is, in principle, more interpretable than a dense network.

490
01:17:30.260 --> 01:17:41.180
Liu Ziyin: And this is what happens if you train without… train a vanilla model that doesn't have this strange parameterization. You basically get a very dense solution.

491
01:17:41.360 --> 01:17:53.459
Liu Ziyin: Okay, so very naturally, you leveraged symmetry, to build interpretable and sparse models that are maybe also more efficient.

492
01:17:54.030 --> 01:18:04.189
Liu Ziyin: Okay, and we also applied it to compress larger models, ResNet18, they actually work very, very well. You can compress ResNet 18 by a thousand times without really affecting its performance.

493
01:18:04.680 --> 01:18:13.039
Liu Ziyin: You can also compress pre-trained models, so here we compressed a pre-trained ResNet50 on ImageNet. It's also working very well.

494
01:18:13.480 --> 01:18:26.670
Liu Ziyin: And now let's do an exercise, okay? We have learned all the great things about symmetry, okay? Suppose now I want my model to really divide into sub-modules, okay? So theta is your parameter, and it's

495
01:18:26.840 --> 01:18:29.439
Liu Ziyin: And it has D sub-modules.

496
01:18:29.560 --> 01:18:44.389
Liu Ziyin: parameterized each by W sub i, okay? So W1 is the weight of the first module, and WD is the weight of the last module. And suppose you want to make as many modules to be zero as possible. What would you do as a physicist?

497
01:18:44.800 --> 01:18:45.750
Liu Ziyin: Anyone?

498
01:18:46.470 --> 01:18:48.720
Liu Ziyin: Learn some coefficients of each one.

499
01:18:49.180 --> 01:19:05.210
Liu Ziyin: Yeah, something like that, right? Basically, it's like a mask, right? Right? So basically, you couple each W i to a scalar v i and train it at a finite temperature, okay? That naturally creates two phases, where they are either zero, or they are non-zero, which is the symmetry-broken state.

500
01:19:05.210 --> 01:19:12.609
Liu Ziyin: And finite temperature encourages you to find those solutions that basically use as few modules as possible.
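A minimal sketch of this module version (again my own toy, with made-up data): couple each module's weights to one scalar gate and train everything with weight decay as the "temperature"; the gate of the irrelevant module settles at the symmetric state v = 0.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 3
X1, X2 = rng.normal(size=(n, d)), rng.normal(size=(n, d))  # inputs of two modules
y = X1 @ np.array([1.0, -1.0, 0.5])        # only module 1 is relevant

u1, u2 = 0.3 * np.ones(d), 0.3 * np.ones(d)  # per-module weights
v1, v2 = 0.3, 0.3                             # one scalar gate per module
lr, wd = 0.05, 0.05                           # wd = the finite temperature
for _ in range(4000):
    r = v1 * (X1 @ u1) + v2 * (X2 @ u2) - y   # residuals
    gu1, gu2 = v1 * (X1.T @ r) / n, v2 * (X2.T @ r) / n
    gv1, gv2 = (X1 @ u1) @ r / n, (X2 @ u2) @ r / n
    u1, u2 = u1 - lr * (gu1 + wd * u1), u2 - lr * (gu2 + wd * u2)
    v1, v2 = v1 - lr * (gv1 + wd * v1), v2 - lr * (gv2 + wd * v2)

print(round(v1, 3), round(v2, 3))   # gate 1 stays on; gate 2 switches off
```

Because the gate multiplies the whole module, the induced penalty acts like a group-level sparsity term: breaking the symmetry of module 1 lowers the loss, so its gate survives, while module 2's gate and weights decay together to zero.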

501
01:19:12.990 --> 01:19:21.179
Liu Ziyin: Okay, and we apply this to a gene selection task, which is one of the old problems in AI for science, and we want to basically,

502
01:19:22.320 --> 01:19:27.500
Liu Ziyin: Find the genes that are really relevant to a certain disease, to a certain cancer.

503
01:19:27.810 --> 01:19:33.800
Liu Ziyin: And we want to not only learn the mapping, we also want to identify those genes that are,

504
01:19:34.110 --> 01:19:58.669
Liu Ziyin: that are really relevant. So basically, you apply this thing to a… to every encoder of the… of every gene. So we encode every gene with a different encoder, and we apply the additional coupling to every encoder, and the method actually works very well. It actually identifies, let's say, 50-something relevant genes, and it works very well, leveraging the deep learning technology. So you can actually do very interesting nonlinear feature selection.

505
01:19:58.720 --> 01:20:09.509
Liu Ziyin: By leveraging these techniques. So all you do is you just insert a little… an extra multiplicative parameter. And you have a finite temperature, which is basically the weight decay.

506
01:20:10.560 --> 01:20:11.430
Liu Ziyin: Yes.

507
01:20:12.500 --> 01:20:31.930
Liu Ziyin: Okay, so wait. Now I have really talked about everything about symmetry and irreversibility, and it really seems like deep learning can… maybe be a part of physics. Is that really true? Well, let's recall this figure, which I stated at the very beginning, where symmetry and irreversibility are the two organizing principles of

508
01:20:31.940 --> 01:20:37.320
Liu Ziyin: physics, and could there be a place for AI on this map?

509
01:20:37.560 --> 01:20:51.499
Liu Ziyin: Well, I personally believe it's actually right here. It's where you have a very high degree of symmetry because of how redundant your models are, and there's a very strong notion of irreversibility due to the learning process.

510
01:20:51.910 --> 01:20:57.450
Liu Ziyin: Okay, thank you very much. That's really the end of the talk.

511
01:20:59.290 --> 01:21:08.700
Liu Ziyin: Are you close to hitting the job market, or… I, yes, I am on the job market. That's a really good part. Oh, thank you.

512
01:21:09.590 --> 01:21:10.840
Liu Ziyin: Should be shorter.

513
01:21:11.490 --> 01:21:23.940
Liu Ziyin: Or longer. I have a question about the platonic… Oh, yes, I didn't understand exactly what you were talking about.

514
01:21:24.930 --> 01:21:26.369
Liu Ziyin: Looking at the heat maps.

515
01:21:28.850 --> 01:21:29.610
Liu Ziyin: Huh.

516
01:21:30.370 --> 01:21:38.820
Liu Ziyin: Yeah, so what do you plot, like, what are you plotting or sampling? Okay, so this is the representation similarity, okay? So, have you seen any of these?

517
01:21:39.020 --> 01:21:51.909
Liu Ziyin: This, this is the, this is the centered kernel alignment stuff that they use, is that right? Yes, yes, this is the CKA, right? But, so, are you comparing two networks?

518
01:21:52.130 --> 01:22:04.830
Liu Ziyin: So yeah, so this is the… this is the representation similarity for network A, okay? So this is how network A encodes the distance relationship between two data points, okay? So a…
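For reference, the similarity measure under discussion can be computed as below. This is the standard linear-CKA formula, an assumption on my part; it is not necessarily the exact variant used in the figure.

```python
import numpy as np

def linear_cka(A, B):
    """Linear Centered Kernel Alignment between two representation
    matrices A, B of shape (n_samples, n_features).  It compares how
    two networks encode distances between the same data points, and is
    invariant to rotations and isotropic scaling of the feature space."""
    A = A - A.mean(axis=0)
    B = B - B.mean(axis=0)
    num = np.linalg.norm(A.T @ B, "fro") ** 2
    den = np.linalg.norm(A.T @ A, "fro") * np.linalg.norm(B.T @ B, "fro")
    return num / den

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 10))
R = np.linalg.qr(rng.normal(size=(10, 10)))[0]   # random rotation

print(linear_cka(X, X @ R))                       # ~1.0: same geometry
print(linear_cka(X, rng.normal(size=(100, 10))))  # near 0: unrelated
```

The rotated copy scores ≈ 1 even though its coordinates are completely different, which is the sense in which two independently trained networks can end up "encoding the distance relationship" the same way.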

519
01:22:05.300 --> 01:22:12.380
Liu Ziyin: Yellow here means, okay, means that they are encoded in a very similar way. Why does it start, like…

520
01:22:12.430 --> 01:22:29.730
Liu Ziyin: already structured at, like, step zero. Maybe it has to do with, the form of the data being very isotropic there. But the main point of this figure is that, like, by the end of their training, these two things look similar.

521
01:22:29.790 --> 01:22:39.440
Liu Ziyin: Yeah, so, okay, maybe that's one decision mistake I made, which is to actually give network B a slightly bigger learning rate, so actually, you have to wait for it.

522
01:22:39.440 --> 01:22:57.250
Liu Ziyin: Okay, once it goes back to step zero, you will see that it actually learned very fast, so it's actually very… okay, you see that? It's actually very different at the beginning, but it gets to the… But when you really care about, like, the snapshot at step 100 or something? Yes, the point is that eventually, let's say.

523
01:22:57.250 --> 01:23:03.089
Liu Ziyin: After training, they get to similar representations. Right. Okay, and then the other thing, so… so…

524
01:23:03.880 --> 01:23:19.090
Liu Ziyin: Are you… would you hypothesize that deep learning just would not work if we… if we could do gradient flow? Like, and that the, like, good stuff of gradient… of, deep learning is a coincidence of, like, the fact that we need to discretize stuff? And that's where I'm going.

525
01:23:19.370 --> 01:23:20.690
Liu Ziyin: Magic happens.

526
01:23:21.000 --> 01:23:37.549
Liu Ziyin: That's fun. That's a good… Because, like, the… out of memory? Out of time. Yeah, that's where the second term… that's where the second term came from. Like, the interesting term was the difference between doing gradient flow and discrete gradient descent, and you're like,

527
01:23:37.550 --> 01:23:46.870
Liu Ziyin: oh, like, it leads to all this, like, cool, interesting stuff, and maybe it's actually the important part of learning. So would we not get…

528
01:23:47.210 --> 01:23:51.829
Liu Ziyin: Would deep learning not work if we… trained things with gradient flow?

529
01:23:53.380 --> 01:24:00.220
Liu Ziyin: Maybe there's a different way of asking the question. So there's something that happens when you train this way. Is it better?

530
01:24:01.310 --> 01:24:08.270
Liu Ziyin: That's a very good question. Hmm… So I…

531
01:24:08.390 --> 01:24:18.570
Liu Ziyin: So what tends to happen when you really train with gradient flow is that it will keep a lot of information of its initialization. So basically, if you train it that way, people usually find it works.

532
01:24:18.570 --> 01:24:30.950
Liu Ziyin: But you also see that they learn things that are not quite interpretable, not quite comparable to other neural networks. So irreversibility definitely gives you some sense of universality.

533
01:24:31.140 --> 01:24:37.519
Liu Ziyin: So, there's something that's useful about it, but whether you get better performance.

534
01:24:37.650 --> 01:24:50.979
Liu Ziyin: Actually, I don't know. There is evidence… some people claim that when you have the second-order term, that actually gives you a little bit better performance, because that gives you, let's say, more compact representations,

535
01:24:51.860 --> 01:25:01.250
Liu Ziyin: So, there are people who say that, but it's so far unclear how important that is. Is there a way to study this empirically? Like, can you…

536
01:25:01.490 --> 01:25:10.580
Liu Ziyin: But besides just doing, like, really, really small learning rates, like, is there a way to study gradient flow empirically? Yes, there's a paper actually called,

537
01:25:11.250 --> 01:25:15.290
Liu Ziyin: It's… You just solve for the mean?

538
01:25:15.650 --> 01:25:33.459
Liu Ziyin: Wait, wait, no, no, that's not gonna… wait, yeah, I don't know. There's a paper, I think, from a few years back, it's called Stochasticity Is Not Necessary for Generalization. Oh, yeah. So what they did was that they trained with gradient descent, but they also explicitly added in this term

539
01:25:33.850 --> 01:25:38.559
Liu Ziyin: too great and decent. Like, offset it. Too… too, sort of…

540
01:25:39.060 --> 01:25:45.190
Liu Ziyin: To have the effect of the second-order term, but not really having it as an emergent term.

541
01:25:45.640 --> 01:25:57.910
Liu Ziyin: So they have a… they put this explicitly into the training. That actually gives you slightly better performance. Roughly, let's say, a 5% improvement on CIFAR-10, something like that. So it's…
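The construction being described, plain gradient descent plus an explicit copy of the implicit second-order (gradient-penalty) term, can be sketched on a toy problem. This is an illustrative sketch under my own assumptions, on a linear model rather than a deep network; the finite-difference Hessian-vector trick is one common way to get the penalty's gradient.

```python
import numpy as np

def loss_grad(w, X, y):
    """Gradient of the half-mean-squared-error loss of a linear model."""
    err = X @ w - y
    return X.T @ err / len(y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5])   # noiseless toy targets

w = np.zeros(3)
lr, lam, eps = 0.1, 0.05, 1e-4
for _ in range(500):
    g = loss_grad(w, X, y)
    # Finite-difference Hessian-vector product: (g(w + eps*g) - g(w))/eps
    # approximates H g = 0.5 * grad(||grad L||^2), the explicit version of
    # the term that discrete-step GD implicitly adds over gradient flow.
    g_reg = (loss_grad(w + eps * g, X, y) - g) / eps
    w -= lr * (g + lam * g_reg)

print(np.round(w, 2))  # → close to [ 1.  -2.   0.5]
```

On this quadratic toy both the loss and the penalty share a minimum, so training simply recovers the true weights; the cited paper's claim is about what the extra term does for deep networks.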

542
01:25:58.120 --> 01:26:01.090
Liu Ziyin: Seems to be a little bit helpful.

543
01:26:01.380 --> 01:26:05.999
Liu Ziyin: having this term for some problems, but it's… I think it's unclear yet.

544
01:26:06.810 --> 01:26:20.139
Liu Ziyin: So it is a lot. So it's a huge gap for deep learning, but it's not a huge gap if you are comparing deep learning to, right, to non-deep-learning methods, yes.

545
01:26:23.000 --> 01:26:26.459
Liu Ziyin: So, there is evidence where this term actually helps, yeah.

546
01:26:30.980 --> 01:26:32.520
Liu Ziyin: What is this jump from 1?

547
01:26:33.110 --> 01:26:34.109
Liu Ziyin: Say it again?

548
01:26:34.240 --> 01:26:35.930
Liu Ziyin: What is this term for more?

549
01:26:36.270 --> 01:26:38.890
Liu Ziyin: What does it jump from?

550
01:26:39.630 --> 01:26:40.440
Liu Ziyin: Sorry?

551
01:26:40.620 --> 01:26:56.960
Liu Ziyin: Do you know muon? Oh, oh, oh, oh. I thought that was German, German, but I don't see… Muon, do you know muon? Muon, muon. Yeah, physicists would say muon, but… I don't know, I… that's an interesting open problem, yeah.

552
01:26:57.650 --> 01:27:04.099
Liu Ziyin: Another good thing. Another exercise.

553
01:27:04.980 --> 01:27:07.880
Liu Ziyin: So, so,

554
01:27:09.440 --> 01:27:16.879
Liu Ziyin: So do you… so you had this very interesting, thing of, like, oh, you can do these…

555
01:27:17.020 --> 01:27:29.089
Liu Ziyin: simple things in the end, based on the theory. So would you… I mean, so you're… here… here you are, talking to an interpretability lab. Would you advocate

556
01:27:29.220 --> 01:27:32.390
Liu Ziyin: That my students should go and

557
01:27:32.540 --> 01:27:35.840
Liu Ziyin: Try training all sorts of masks.

558
01:27:36.390 --> 01:27:42.210
Liu Ziyin: You know, these extra parameters into their models that… that this… this might lead to…

559
01:27:42.420 --> 01:27:55.679
Liu Ziyin: a clearer view of what's going on inside the models. Okay, I think there are really, sort of, two messages here for the purpose of interpretability, right? The first is the explicit engineering part, right?

560
01:27:55.790 --> 01:28:00.359
Liu Ziyin: By incorporating these strange masks.

561
01:28:00.670 --> 01:28:08.159
Liu Ziyin: You get to incorporate the desired type of sparsity you want, any structure…

562
01:28:08.450 --> 01:28:16.509
Liu Ziyin: sparsity you can achieve with… by building in these strange things. So that's an engineering sort of message.

563
01:28:16.890 --> 01:28:17.740
Liu Ziyin: But…

564
01:28:17.970 --> 01:28:34.239
Liu Ziyin: these strange masks already exist implicitly in… some of them, implicitly in many of the existing neural architectures. So, the other way to use this is to actually analyze the existing ones. For example, in Transformer, you have those

565
01:28:34.240 --> 01:28:44.290
Liu Ziyin: strange symmetries. You can figure out their symmetric states, and you know that you are adaptively preferring to get to those symmetric states.

566
01:28:44.550 --> 01:29:00.589
Liu Ziyin: One example is, for example, the key and query matrices in the self-attention layer. That's a yes, and that's actually a symmetry. It has a lot of Z2 symmetries, and you can identify the symmetric states, which are basically the low-rank states.

567
01:29:00.590 --> 01:29:10.590
Liu Ziyin: for those layers. And the theory directly predicts for you that, for example, if you train with weight decay, those, query key matrices will be low rank.

568
01:29:11.130 --> 01:29:25.889
Liu Ziyin: For example, so there are two ways to use them. Either you analyze existing symmetries or irreversibility, actually, they can combine, as I have shown, in existing models, or you can build them… build artificial symmetries into your model.

569
01:29:25.890 --> 01:29:38.569
Liu Ziyin: And, and build more interpretable… I see, so this little rank, which is maybe why, as an engineering matter, we only give a really small number of ranks, you know, to them, because they would…

570
01:29:39.800 --> 01:29:41.289
Liu Ziyin: They would waste them anyway.

571
01:29:41.790 --> 01:29:45.740
Liu Ziyin: Yeah. Right? Right.

572
01:29:45.870 --> 01:29:47.970
Liu Ziyin: yeah.

573
01:29:48.600 --> 01:29:52.819
Liu Ziyin: There's… there's an old… architecture called StyleGAN.

574
01:29:53.270 --> 01:30:02.850
Liu Ziyin: Oh, oh! Which has an interesting multiplication in the middle of the architecture, and it always… Came out…

575
01:30:03.090 --> 01:30:04.369
Liu Ziyin: more interpretable.

576
01:30:04.510 --> 01:30:10.469
Liu Ziyin: Oh! And all the other… networks, and so I, you know, there's this… there's this weird echo.

577
01:30:10.750 --> 01:30:14.040
Liu Ziyin: Of adding this extra multiplication.

578
01:30:14.370 --> 01:30:21.959
Liu Ziyin: In the middle, yeah. I didn't know that. Yeah. But… It could be the case…

579
01:30:22.140 --> 01:30:31.250
Liu Ziyin: that by introducing these redundant degrees of freedom, you actually make your model more interpretable. Actually, yeah, that could be a good thought. Yes.

580
01:30:32.550 --> 01:30:33.460
Liu Ziyin: It's interesting.

581
01:30:33.720 --> 01:30:34.460
Liu Ziyin: Hmm.

582
01:30:41.330 --> 01:30:42.220
Liu Ziyin: So…

583
01:30:42.780 --> 01:30:53.839
Liu Ziyin: Faculty question, which would be… oh, this is… this is really, really great work for your first trick as a… as a, as a student postdoc, so…

584
01:30:54.220 --> 01:30:58.200
Liu Ziyin: What's next? So what… if you were to set up a lab, what are you going to do next?

585
01:30:59.360 --> 01:31:09.499
Liu Ziyin: Okay, I think, personally, I think there are a lot of missing steps, even to the SIF, to this framework.

586
01:31:09.830 --> 01:31:20.829
Liu Ziyin: One thing is basically, I think, what you have asked about, how does the… how does it really relate to generalization? So I think having a first principled

587
01:31:21.090 --> 01:31:29.909
Liu Ziyin: connection between the SIF and the statistical learning theory will help us understand a lot more about,

588
01:31:30.100 --> 01:31:38.200
Liu Ziyin: Both about deep learning and also about, let's say, symmetry and irisibility, so that also, in principle, could contribute back to science.

589
01:31:38.660 --> 01:31:42.970
Liu Ziyin: And SRF, I think, is a really great framework for understanding phenomena.

590
01:31:43.240 --> 01:31:45.860
Liu Ziyin: And so that also gives, for example, the…

591
01:31:45.960 --> 01:31:58.960
Liu Ziyin: conventional statistical learning theory, sort of the lens of phenomenology, if we really build that link. So that's definitely one thing. The next thing, for example, I'm also interested in applications,

592
01:31:59.320 --> 01:32:15.050
Liu Ziyin: for example, I talked about how you can build in artificial symmetries to make your model more efficient, more sparse, more interpretable, so I think that route, we can go very far, and there are a lot more to be done there. It's actually an old idea in physics, you…

593
01:32:15.260 --> 01:32:28.830
Liu Ziyin: it's called… sort of called symmetry engineering, so we want a material that has, for example, high conductance. And the way you really solve it is not to look for specific materials. You look for symmetries, because we know we can build theories

594
01:32:28.970 --> 01:32:43.949
Liu Ziyin: Based on symmetries alone. And we know that, for example, this symmetry class will correspond to better resistance, better conductance, so if we have a theory for symmetry, we can only look for those symmetries, and then we identify materials that have these symmetries.

595
01:32:44.360 --> 01:32:52.130
Liu Ziyin: So that's also a very interesting new design principle kind of thinking that I think people in machine learning could also have.

596
01:32:52.440 --> 01:32:53.910
Liu Ziyin: What's the third?

597
01:32:55.010 --> 01:33:08.050
Liu Ziyin: I think it's… I personally believe it's sort of a new physics, in a way. If we really believe in symmetry and irisibility, then I think it's very physical.

598
01:33:08.170 --> 01:33:11.100
Liu Ziyin: I think that's the short-term plan I have.

599
01:33:11.330 --> 01:33:25.450
Liu Ziyin: Longer plan, I really… I'm really sort of interested in neuroscience as well, and I also sort of believe that asymmetry visibility will also be two fundamental principles for biological learning, right? Because symmetry is really sort of,

600
01:33:25.550 --> 01:33:40.729
Liu Ziyin: you can… in the most general form, it's sort of some sort of redundancy, right? And we do know that human brains are highly redundant. When you are born, you have much more synapses, much more connections in your brain, and we actually prune them

601
01:33:40.730 --> 01:33:48.709
Liu Ziyin: very quickly at the first stage of development. So we are born with a high degree of redundancy, and that's really sort of what symmetry really means.

602
01:33:48.770 --> 01:33:58.860
Liu Ziyin: And we also know that, for example, our brain is really irreversible, right? There are a lot of critical periods in our brain, and I personally think SF could also eventually be a part of neuroscience.

603
01:34:03.430 --> 01:34:07.400
Liu Ziyin: Will you, I have one question for that,

604
01:34:08.030 --> 01:34:24.710
Liu Ziyin: I was, I used to be very motivated about the theory of deep learning until I kind of felt like I cannot contribute to this. And also, then I started my PhD, and I felt, like, really disappointed by the theory of deep learning, because none of the results make predictions, and then I was like, oh, okay, I'm…

605
01:34:25.020 --> 01:34:26.210
Liu Ziyin: You know, I'm gonna wait.

606
01:34:26.410 --> 01:34:38.399
Liu Ziyin: And now it feels like you guys are suddenly doing stuff, but, like, why I… and why I got attracted to Macint was, like, I was, like, hoping that, like, us

607
01:34:38.840 --> 01:34:44.850
Liu Ziyin: Poking around in the actual models will deliver, like, some stuff.

608
01:34:45.490 --> 01:35:04.899
Liu Ziyin: the theory people could get inspired by. Were you inspired by any mechanism? I certainly am. I mean, now we have a large collection of, like, methods that we find in the models that are trained and that are sitting around.

609
01:35:05.230 --> 01:35:17.059
Liu Ziyin: Yeah. Yes, I really… so first of all, I believe deep learning and AI is becoming an empirical field, and sort of should be an empirical field, and by that, I really mean that it's actually

610
01:35:17.060 --> 01:35:26.629
Liu Ziyin: experiment-driven, it's phenomenology-driven, and I do think the real foundation there is experiments, is the phenomenology. And then you start to build theories based on it, right?

611
01:35:26.680 --> 01:35:36.370
Liu Ziyin: So… for example, part of my theories are driven by these phenomena, edge of stability, platonic representation hypothesis.

612
01:35:36.610 --> 01:35:41.600
Liu Ziyin: So, yeah, they really formed the foundation of my, sort of, of my work, yes.

613
01:35:42.590 --> 01:35:44.200
Liu Ziyin: Thank you. Thank you.

614
01:35:47.990 --> 01:35:50.400
Liu Ziyin: Thank you very much. Thank you very much.

