WEBVTT

1
00:00:04.580 --> 00:00:10.700
Rohit Gandikota: So you'd have to turn your volume up so that if… if people… you're gonna… we're gonna use your audio. Yes. And

2
00:00:11.660 --> 00:00:12.480
Rohit Gandikota: Great.

3
00:00:12.720 --> 00:00:18.769
Rohit Gandikota: And let me turn this off here. You can see my screen here. I could use this to get the audience, yeah, the rest of the…

4
00:00:19.040 --> 00:00:28.590
Rohit Gandikota: people here. How do you… how does your camera do that? Where it's, like, zooming into people? I don't know. Zoom added some features, some AI thing. Oh, okay. Right.

5
00:00:29.010 --> 00:00:31.690
Rohit Gandikota: It's terrible, right? It's always, like, looking over my shoulder.

6
00:00:32.040 --> 00:00:36.439
Rohit Gandikota: I'm, like, trying to talk to people over Zoom, and he's like, what's going on right now?

7
00:00:36.660 --> 00:00:40.660
Rohit Gandikota: And so… But, okay.

8
00:00:40.980 --> 00:00:44.290
David: So today… so… Rohit…

9
00:00:44.420 --> 00:00:46.520
David: …has been working on a bunch of stuff.

10
00:00:46.900 --> 00:00:57.970
David: And, as you know, over the years, that's the boundary between, you know, vision…

11
00:00:58.120 --> 00:01:03.239
David: and generative modeling… and human understandability and control.

12
00:01:04.019 --> 00:01:05.420
David: And,

13
00:01:06.780 --> 00:01:14.380
David: And… You've not seen this talk. And, well, I don't know what this talk is going to be about, right? But he's at an interesting juncture, I think.

14
00:01:14.550 --> 00:01:20.819
David: As we all are in the field, right? And it's been an interesting experience, advising him.

15
00:01:21.170 --> 00:01:27.100
David: Because I think that the field has advanced faster than we anticipated

16
00:01:27.230 --> 00:01:29.069
David: when he started his PhD.

17
00:01:29.230 --> 00:01:34.149
David: I think Rohit has been very good at being at the forefront

18
00:01:34.260 --> 00:01:39.800
David: of a number of interesting things, but despite… this velocity?

19
00:01:39.970 --> 00:01:43.699
David: The field keeps on… moving the forefront

20
00:01:44.160 --> 00:01:45.080
David: ahead?

21
00:01:45.250 --> 00:01:49.810
David: Right? And so, you know, I've been having this conversation with Rohit over

22
00:01:49.990 --> 00:02:06.679
David: the last few months of, like, alright, he still has some time left in the PhD, like, what's… in light of this, what's the strategy? What's… what's to do? And so I don't know if this talk is gonna touch on any of this at all, but I think it's… I think it's a very interesting case. Most of you don't…

23
00:02:07.010 --> 00:02:12.649
David: most of you are in the language space, and not doing computer vision, but I think that computer vision

24
00:02:12.820 --> 00:02:14.940
David: you know.

25
00:02:17.130 --> 00:02:27.979
David: I have to be careful, because I'm being recorded, right? But, like, it sometimes gives the feeling of being a field that's getting very close to being solved.

26
00:02:28.230 --> 00:02:35.720
David: Right? There's a lot of, like, really hard problems that have been standing out there for many years, and… and…

27
00:02:36.480 --> 00:02:37.820
David: And,

28
00:02:38.170 --> 00:02:45.559
David: And they're, like, you know, they're just solved now. They're, like, they're just products you can just go and buy for $20 a month.

29
00:02:45.780 --> 00:02:48.309
David: And so, the question is, like, what's… what's…

30
00:02:48.470 --> 00:02:50.850
David: What's… what's vision? What's the vision?

31
00:02:51.020 --> 00:02:52.690
David: down, right?

32
00:02:52.840 --> 00:02:59.149
David: Okay, so what is it, Rohit? So, so welcome, Rohit. Thank you very much. Okay.

33
00:03:02.420 --> 00:03:13.570
Rohit Gandikota: Welcome, everyone. Thank you for being here. I know a lot of you traveled all the way from across the river to be here. So, that was a very good introduction, and

34
00:03:13.810 --> 00:03:17.530
Rohit Gandikota: I originally had a totally different set of 48 slides.

35
00:03:17.910 --> 00:03:20.139
Rohit Gandikota: Last week, I showed the talk to David.

36
00:03:20.490 --> 00:03:27.579
Rohit Gandikota: And the pearls of wisdom I got from him, get this, he says: I see very good value in all your projects.

37
00:03:28.060 --> 00:03:31.830
Rohit Gandikota: But I don't see the purpose being reflected in your talk.

38
00:03:32.160 --> 00:03:47.520
Rohit Gandikota: And then he… he also sort of nudged me to think about… and we've talked about this a lot of times, like, I think I spoke to Tamar, and a lot of you here, about the same thing, which is: vision seems to have been solved, right? There are a lot of new models that came out,

39
00:03:47.710 --> 00:03:48.660
Rohit Gandikota: Which…

40
00:03:49.030 --> 00:03:57.310
Rohit Gandikota: I think tackle most of it. I think maybe we should wrap it up and, like, go home and start thinking about language models, or maybe world models, right?

41
00:03:57.700 --> 00:04:03.969
Rohit Gandikota: But the more I thought about this, the more I thought about the advice David gave me, I came to this conclusion

42
00:04:04.560 --> 00:04:09.549
Rohit Gandikota: that… I don't think vision is solved at all. We're not even close.

43
00:04:09.670 --> 00:04:13.650
Rohit Gandikota: And I'm going to, like, take you through 35 or 40, hopefully,

44
00:04:13.790 --> 00:04:17.649
Rohit Gandikota: minutes of my talk, trying to convince you that vision just started.

45
00:04:18.100 --> 00:04:27.359
Rohit Gandikota: It just started, as in, how we can change the way vision works, and how, if we don't do it, it would come to a dead end, thinking that it is solved.

46
00:04:29.750 --> 00:04:32.229
David: Can I add something to it? Yes.

47
00:04:32.450 --> 00:04:36.349
David: I gave a talk a few months ago at a place I won't mention.

48
00:04:36.630 --> 00:04:43.169
David: But I was making similar statements over lunch of, well, vision problems are solved.

49
00:04:43.350 --> 00:04:51.610
David: And some of the vision people got very… I think, as an NLP person, you cannot say that. No, exactly, so, so…

50
00:04:51.940 --> 00:04:56.319
David: I was thinking, well, it used to be that at NeurIPS, everyone worked on vision.

51
00:04:56.830 --> 00:05:06.440
David: And then now everyone works on NLP and on large language models, and you can think, well, maybe that's because vision is solved, and that was my naive perception.

52
00:05:06.480 --> 00:05:22.369
David: But what they told me is that the hard problems are completely not solved, and there are, like, old, very old benchmarks where models are very far from humans, even in things like object localization and detection, like, the basic stuff.

53
00:05:22.650 --> 00:05:29.340
David: And that the reason… that's the most interesting thing. The reason that everyone is working on LLMs is because it's easier.

54
00:05:31.370 --> 00:05:32.949
David: That's just an anecdote.

55
00:05:33.520 --> 00:05:35.750
David: I don't know if you feel this way.

56
00:05:35.950 --> 00:05:46.669
Rohit Gandikota: No comments. But I do have a lot of similar thoughts, but, yes, I'll talk more about them. And I think I'll draw some parallels to text, too.

57
00:05:47.500 --> 00:05:51.110
Rohit Gandikota: So, okay, this talk is,

58
00:05:51.440 --> 00:06:01.049
Rohit Gandikota: I know I usually like to do a lot of interactive talks, where I let you stop me in between, but for the first time, hand on my heart, I'm going to tell you, I want to practice, like, a

59
00:06:01.180 --> 00:06:02.830
Rohit Gandikota: Like, a one-shot talk.

60
00:06:03.000 --> 00:06:07.239
Rohit Gandikota: So please, if you have any, like, if there are urgent questions which

61
00:06:07.380 --> 00:06:13.099
Rohit Gandikota: like, went right over your head, please stop me, and say, hey, this is not clear, so that I can be more clear about it.

62
00:06:13.310 --> 00:06:16.950
Rohit Gandikota: If not, we can do, like, a Q&A in the end.

63
00:06:17.850 --> 00:06:21.440
Rohit Gandikota: So, also, I'd like to keep it slightly…

64
00:06:21.900 --> 00:06:28.930
Rohit Gandikota: formal, while also being informal and, like, trying to make you go through this journey together, right? So…

65
00:06:30.150 --> 00:06:40.130
Rohit Gandikota: So yeah, I started by telling you that I had a totally different set of slides, and to get to this purpose slide, I had to sit for 3 days in front of an empty, blank

66
00:06:40.330 --> 00:06:44.290
Rohit Gandikota: slide, and I could not get to anything. And I thought, is David tricking me?

67
00:06:44.570 --> 00:06:51.650
Rohit Gandikota: You know, what does it mean to have a purpose? Like, is it altruism? What is it? So…

68
00:06:51.780 --> 00:06:52.600
Rohit Gandikota: And…

69
00:06:53.460 --> 00:07:04.909
Rohit Gandikota: At the end of the third day, I realized that I usually like to give my talks with a big load of motivation for my audience in the beginning, to talk about why it is that I care about the project

70
00:07:05.140 --> 00:07:07.079
Rohit Gandikota: And the problem in the first place?

71
00:07:07.320 --> 00:07:11.170
Rohit Gandikota: And then I go into, like, 2-3 slides of my results, and then I end the talk.

72
00:07:11.420 --> 00:07:16.610
Rohit Gandikota: So… Turns out, all it takes is for me to bring all of them together.

73
00:07:16.810 --> 00:07:23.500
Rohit Gandikota: And then bring some of the heated conversations I had with most of you here about AI, about why I'm doing my research.

74
00:07:23.650 --> 00:07:28.000
Rohit Gandikota: And sort of sit on them for a good 8 hours, and then it comes out.

75
00:07:28.600 --> 00:07:32.900
Rohit Gandikota: So… I started by asking this question.

76
00:07:33.740 --> 00:07:35.149
Rohit Gandikota: Wait, why is it not?

77
00:07:36.140 --> 00:07:42.879
Rohit Gandikota: Yeah, so… why now? Why is it now, especially, that vision models need to be more readable?

78
00:07:43.190 --> 00:07:48.549
Rohit Gandikota: And I'm going to go into more depth about what readable means, and why today,

79
00:07:48.800 --> 00:07:51.920
Rohit Gandikota: And why am I even caring about this problem in the first place?

80
00:07:52.280 --> 00:07:55.019
Rohit Gandikota: But, I think, for this audience,

81
00:07:55.230 --> 00:07:57.860
Rohit Gandikota: I'm going to give a really controversial title,

82
00:07:58.260 --> 00:08:01.930
Rohit Gandikota: which is: why interpretability is far more important in vision

83
00:08:02.310 --> 00:08:03.260
Rohit Gandikota: than text.

84
00:08:06.020 --> 00:08:13.910
Rohit Gandikota: And, again, today. Today is the word which is very important. Why is it more important today than yesterday? And by yesterday, I mean, like, a few years ago.

85
00:08:14.720 --> 00:08:15.590
Rohit Gandikota: So…

86
00:08:16.790 --> 00:08:24.470
Rohit Gandikota: I keep coming back to this question whenever I talk about AI with any of you here, is… it's a tool, right? Like, built by humans, for humans.

87
00:08:24.700 --> 00:08:27.059
Rohit Gandikota: To sort of do something for us.

88
00:08:27.450 --> 00:08:28.310
Rohit Gandikota: And…

89
00:08:28.720 --> 00:08:38.270
Rohit Gandikota: Over the millions of years of human history, we have built so many widely used tools, right? Like, starting from a million years ago, we had stone axes,

90
00:08:38.460 --> 00:08:41.990
Rohit Gandikota: which really, I think, are the reason why we survived as a human race.

91
00:08:42.409 --> 00:08:46.919
Rohit Gandikota: And then we had printing presses that helped us communicate in mass

92
00:08:47.140 --> 00:08:55.290
Rohit Gandikota: amounts, I think, also a good milestone for us as humanity to improve our intelligence. Then came calculators.

93
00:08:55.490 --> 00:09:00.199
Rohit Gandikota: If you recognize that, I don't think you would, that's a mechanical calculator. Even I didn't.

94
00:09:00.740 --> 00:09:06.860
Rohit Gandikota: And then, of course, smartphones, right? All of them have, in some way, been

95
00:09:07.060 --> 00:09:11.739
Rohit Gandikota: mostly widely used tools in our human history.

96
00:09:13.460 --> 00:09:15.949
Rohit Gandikota: I… should I keep it there?

97
00:09:16.760 --> 00:09:18.149
Rohit Gandikota: I don't know, maybe?

98
00:09:18.290 --> 00:09:20.899
Rohit Gandikota: I think it deserves a place, at least.

99
00:09:21.470 --> 00:09:32.269
Rohit Gandikota: But funnily, these models, AI models, have been there for decades now, right? Like, if you know that, it's called ELIZA. It's the first ChatGPT to ever be

100
00:09:32.630 --> 00:09:38.849
Rohit Gandikota: on Earth, it's from 1967, and you can talk to your computer, it talks back.

101
00:09:39.020 --> 00:09:42.920
Rohit Gandikota: If it's good or not, I can't vouch for it, but yeah, it used to talk back.

102
00:09:43.550 --> 00:09:50.420
Rohit Gandikota: And then, sure, over the years, we have built so many tools that sort of increase this…

103
00:09:50.960 --> 00:09:54.480
Rohit Gandikota: our ability to talk to machines, right?

104
00:09:55.040 --> 00:10:00.380
Rohit Gandikota: And up until 2019, we had GPT-2, which I think was a part of discussion here.

105
00:10:00.720 --> 00:10:04.420
Rohit Gandikota: that did the same thing that we are doing today: it can talk to you.

106
00:10:04.640 --> 00:10:07.009
Rohit Gandikota: Like, but why is it that

107
00:10:07.470 --> 00:10:11.860
Rohit Gandikota: over the 20 or 40 years of this history, these

108
00:10:11.960 --> 00:10:14.840
Rohit Gandikota: models did not deserve a place there, and why today?

109
00:10:15.410 --> 00:10:28.170
Rohit Gandikota: Right? And I thought, well, because tools are supposed to be to automate our human labor, make us more efficient, make our life easier, and that's the reason, right, that tools have become so important in our lives.

110
00:10:29.070 --> 00:10:33.130
Rohit Gandikota: Sure, I think that could be a good definition of tool in a Wikipedia page.

111
00:10:33.260 --> 00:10:37.989
Rohit Gandikota: But the reason that something becomes really widely used is because

112
00:10:38.220 --> 00:10:41.680
Rohit Gandikota: They are the tools that help us extend human intention.

113
00:10:42.120 --> 00:10:44.330
Rohit Gandikota: Humans, in general, have a need

114
00:10:44.520 --> 00:10:50.589
Rohit Gandikota: of showing their intention, or, like, to make something happen in reality. And the tools that really make that

115
00:10:50.900 --> 00:10:54.769
Rohit Gandikota: Happen are the ones that happen to be used widely in the history.

116
00:10:56.810 --> 00:11:01.769
Rohit Gandikota: So, with language models, all this intention came for free.

117
00:11:02.510 --> 00:11:07.889
Rohit Gandikota: Like, for free, right? What that means is, when you ask a model to do something for you,

118
00:11:08.340 --> 00:11:25.220
Rohit Gandikota: it's going to give a very nice description of, oh, sure, I'm gonna, like, write a warm email, which is super funny, and you can see its intention live, what it's trying to write, and you can see its output and sort of understand every word that it wrote.

119
00:11:25.720 --> 00:11:26.540
Rohit Gandikota: Right?

120
00:11:26.780 --> 00:11:37.380
Rohit Gandikota: And this intention of the model can be easily manipulated, either by token forcing, which a lot of people here know about already. You can go and change its reasoning.

121
00:11:37.600 --> 00:11:44.820
Rohit Gandikota: Or you can literally ask it to say, no, no, no, I want to, like, do something else, make… make… make it more funny.

122
00:11:46.470 --> 00:12:03.099
Rohit Gandikota: And another thing is you can literally go copy, paste it, and change every word there is, and still be confident that it's still text, legible text. All of this comes for free, because text in itself, as a modality, is very easy to understand and reason about…

123
00:12:04.510 --> 00:12:07.879
Rohit Gandikota: And also easy to edit the intentions behind text.

124
00:12:09.050 --> 00:12:12.520
Rohit Gandikota: So, yeah, you can do all of this, and then the model can do the job for you.

125
00:12:13.300 --> 00:12:16.669
Rohit Gandikota: Vision, on the other hand, does not have this.

126
00:12:16.860 --> 00:12:24.629
Rohit Gandikota: So what I mean by that is: in the recent history of vision, the problems that we all mostly cared about had a very

127
00:12:25.090 --> 00:12:27.619
Rohit Gandikota: Universal ground truth, right?

128
00:12:28.290 --> 00:12:41.460
Rohit Gandikota: So they all have these accuracy metrics that, if the model can get it right, then you don't really worry about what the intention of you as a user is. As long as its accuracy is 100% or 99, you're good.

129
00:12:41.980 --> 00:12:49.809
Rohit Gandikota: And what does it even mean to have an intention in an accuracy model, right? As a user, you don't have a need to show intention. It's a universal truth.

130
00:12:51.740 --> 00:13:02.040
Rohit Gandikota: But for the first time, again, in a loose way, I'm talking about this. For the first time, we have a problem in vision which doesn't have a ground truth.

131
00:13:02.790 --> 00:13:11.900
Rohit Gandikota: Right? We don't know what the true image should look like. It can be anything, and this is the first time we have bandwidth as users to show our intention.

132
00:13:13.500 --> 00:13:15.409
Rohit Gandikota: So, what's the problem now?

133
00:13:15.580 --> 00:13:18.289
Rohit Gandikota: What is it that we are trying to optimize here?

134
00:13:18.680 --> 00:13:27.039
Rohit Gandikota: And for a long time, it's been about, let's try to improve the reality or realism of these models. And sure, that's been the problem.

135
00:13:27.230 --> 00:13:30.490
Rohit Gandikota: So, over the years, we have definitely improved.

136
00:13:30.920 --> 00:13:36.550
Rohit Gandikota: the way the realism works, and today, I don't know if… can you tell if that's a generated image? I couldn't.

137
00:13:36.730 --> 00:13:40.390
Rohit Gandikota: And that's… I just generated it last night with Imagen 4.

138
00:13:41.120 --> 00:13:44.760
Rohit Gandikota: So, I think we did. We did solve the realism problem.

139
00:13:45.280 --> 00:13:53.210
Rohit Gandikota: And… I think that's where most of the conversations end, in terms of: yeah, we've solved it.

140
00:13:53.690 --> 00:13:56.169
Rohit Gandikota: We've trained models that can sort of

141
00:13:56.320 --> 00:14:03.410
Rohit Gandikota: take an image, give any sort of output very realistically; it can take any text, generate any text, realistically.

142
00:14:03.770 --> 00:14:06.599
Rohit Gandikota: And… Most of the problems are solved.

143
00:14:06.900 --> 00:14:07.610
Rohit Gandikota: Right?

144
00:14:08.120 --> 00:14:13.570
Rohit Gandikota: So, if we come back to the actual usability of it, as a user.

145
00:14:14.010 --> 00:14:18.890
Rohit Gandikota: When you type in a prompt, there's a bunch of neural activations happening, and you generate this image.

146
00:14:19.350 --> 00:14:22.279
Rohit Gandikota: What if you want to change one small thing?

147
00:14:22.490 --> 00:14:24.859
Rohit Gandikota: Like in your text. Change a word.

148
00:14:25.630 --> 00:14:30.400
Rohit Gandikota: Right? I just… I don't like bass, I want a guitar or an electric guitar there.

149
00:14:30.530 --> 00:14:31.630
Rohit Gandikota: How do I do it?

150
00:14:33.040 --> 00:14:40.900
Rohit Gandikota: I have no idea. Nothing is in legible text. I cannot read whatever is happening inside the model.

151
00:14:41.050 --> 00:14:50.410
Rohit Gandikota: The output is a continuous space. You can't go in and change a pixel and let it… it's not an image anymore. So, it's almost as opaque as

152
00:14:50.650 --> 00:14:52.320
Rohit Gandikota: Starting from scratch.

153
00:14:53.800 --> 00:14:59.190
Rohit Gandikota: So… I think now the real problem becomes that there are

154
00:14:59.340 --> 00:15:05.120
Rohit Gandikota: millions, if not billions, of invisible decisions that the model takes internally

155
00:15:05.220 --> 00:15:09.040
Rohit Gandikota: For the one short prompt that you gave, And it outputs

156
00:15:09.150 --> 00:15:16.260
Rohit Gandikota: an image that's literally uneditable by human hand. Like, not just… you can't just go and type a pixel number in there.

157
00:15:18.180 --> 00:15:28.150
Rohit Gandikota: So, what I really hope is happening inside is, given a particular prompt, the model is making multiple plans and decisions internally.

158
00:15:28.270 --> 00:15:42.600
Rohit Gandikota: Right? First, it plans the layout as to what it should draw where, and then for each of the small concepts it has decided to draw, it has to decide how that particular concept should look, and then eventually it generates the image.

159
00:15:42.730 --> 00:15:49.580
Rohit Gandikota: I really hope this is what it does, right? If it's doing this, then the question of intention becomes a much easier problem.

160
00:15:50.000 --> 00:16:01.849
Rohit Gandikota: Right? You want to replace a bass guitar, you know exactly where the model took a decision, you change that guitar, right? If you want to move the moon to the right, you know where it is that it took the decision, just go change that.

161
00:16:03.220 --> 00:16:14.209
Rohit Gandikota: It would have been nice if this were possible, but it's not, because I don't know if that's what the model is doing. It's supposed to be doing all of this internally; maybe it is, maybe it's not.

162
00:16:15.070 --> 00:16:22.920
Rohit Gandikota: Right? So… I think… That's the current problem, which sort of opens up vision now.

163
00:16:23.250 --> 00:16:26.169
Rohit Gandikota: Back to being at square… square one.

164
00:16:26.270 --> 00:16:32.069
Rohit Gandikota: We don't know what the model is doing inside. There's no way for you to touch the model and edit it.

165
00:16:32.280 --> 00:16:44.800
Rohit Gandikota: Especially in vision, because language gives you an illusion of knowing what the intention is, and if my mother is using it and I'm using it at the same time, then I know for sure that it's a widely used tool, right?

166
00:16:44.900 --> 00:16:55.289
Rohit Gandikota: And I don't think any of you in this room have ever touched how, like, Luma AI works, or how Nano Banana can be edited using different text prompts.

167
00:16:55.590 --> 00:16:56.400
Rohit Gandikota: And…

168
00:16:56.740 --> 00:17:02.270
Rohit Gandikota: For that to happen, we need to open up the model and try to understand how it works in the first place.

169
00:17:04.359 --> 00:17:08.160
Rohit Gandikota: So… So what does it take for us to…

170
00:17:08.430 --> 00:17:12.200
Rohit Gandikota: make this possible is first, I think, foremost.

171
00:17:12.319 --> 00:17:17.019
Rohit Gandikota: Does it even have a structure? Does it have these decisions that the model is taking internally?

172
00:17:17.260 --> 00:17:18.999
Rohit Gandikota: First, we should expose this.

173
00:17:19.460 --> 00:17:20.480
Rohit Gandikota: And then…

174
00:17:20.710 --> 00:17:27.089
Rohit Gandikota: Hopefully, that can be controllable, so that users can control the structure and decisions that the model is taking.

175
00:17:27.450 --> 00:17:36.640
Rohit Gandikota: And then, unlike language, where you can just say, what else can I learn about this? You can't ask a visual model, because it's very hard to

176
00:17:36.810 --> 00:17:44.589
Rohit Gandikota: talk to a vision model in that way. So, hopefully, discover the concepts that the model has learned and the underlying structure underneath.

177
00:17:44.860 --> 00:17:52.799
Rohit Gandikota: And finally, after you get all of this together, find a means of communication between the model and you as a user, so that you make this happen.

178
00:17:53.090 --> 00:17:54.239
Rohit Gandikota: In the first place.

179
00:17:55.240 --> 00:18:00.999
Rohit Gandikota: So… Let's talk about exposing the structure first.

180
00:18:01.320 --> 00:18:05.909
Rohit Gandikota: Earlier, in the very beginning days of diffusion models,

181
00:18:06.290 --> 00:18:13.719
Rohit Gandikota: when realism started to become a reality, we were like: oh yeah, over a few years of research, we know we'll get to that point.

182
00:18:14.490 --> 00:18:25.499
Rohit Gandikota: People were really concerned about some of the concepts that the models were generating, which is not ideal for anyone: not the lawmakers, nor the people who are hosting the models, nor the users who are using them.

183
00:18:25.740 --> 00:18:28.040
Rohit Gandikota: So we needed a way to sort of fix this.

184
00:18:28.400 --> 00:18:35.080
Rohit Gandikota: And the best available solution we had back in the day was: let's just censor the outputs that we don't like.

185
00:18:35.790 --> 00:18:36.570
Rohit Gandikota: Right?

186
00:18:36.720 --> 00:18:42.389
Rohit Gandikota: The way that this happens, and I think even today most of the models do it, is you generate an image.

187
00:18:42.790 --> 00:18:49.119
Rohit Gandikota: Post-generation, there is a checker that checks, oh, does it have nudity? If yes, I'll not show it to the user.

188
00:18:49.340 --> 00:18:59.489
Rohit Gandikota: Right? And there's a lot of research, including, like, white-hat researchers, who try to overcome the censors

189
00:18:59.640 --> 00:19:02.600
Rohit Gandikota: In the, in the, in the… in the backend pipeline.

190
00:19:03.050 --> 00:19:20.199
Rohit Gandikota: And another solution is to completely retrain from scratch, remove all the training data that you have, which you don't want in the first place, but it's super costly, right? Like, if you want to train a new model, it'll take millions of dollars and lots of GPUs to train this.

191
00:19:21.300 --> 00:19:27.689
Rohit Gandikota: So, a neat way to put this is: imagine that you are on a beach, and you hide a gold coin somewhere.

192
00:19:27.960 --> 00:19:33.109
Rohit Gandikota: And now that becomes a whole beach where there's a possibility that a coin exists.

193
00:19:33.380 --> 00:19:39.179
Rohit Gandikota: And what we are trying to do is let's rebuild the entire beach from scratch where we don't have a coin. I think that's…

194
00:19:39.330 --> 00:19:50.609
Rohit Gandikota: almost silly to sort of think about it in the first place. But another thing you can do is just take a coin detector or a metal detector, check around, and keep digging till that detector stops beeping.

195
00:19:50.760 --> 00:19:54.140
Rohit Gandikota: Right? I think this detector is what is called

196
00:19:54.490 --> 00:20:00.720
Rohit Gandikota: the structural… the structure finding method, right? If there is a structure like that underneath.

197
00:20:00.840 --> 00:20:05.060
Rohit Gandikota: We should be able to detect it and remove it precisely from the model's knowledge.

198
00:20:05.900 --> 00:20:11.360
Rohit Gandikota: So that's exactly what we tried to propose, which is: if that was the training data and the pre-trained model,

199
00:20:11.560 --> 00:20:16.650
Rohit Gandikota: We want a model that unlearns this knowledge without having to retrain.

200
00:20:16.950 --> 00:20:24.450
Rohit Gandikota: So the better solution is first expose that there is an underlying knowledge, and then we remove it precisely from the model's internals.

201
00:20:26.520 --> 00:20:30.379
Rohit Gandikota: So, coming back to our original

202
00:20:31.380 --> 00:20:37.229
Rohit Gandikota: generation here, what's happening is this model is taking a text prompt C as an input.

203
00:20:37.370 --> 00:20:41.849
Rohit Gandikota: and it's giving an image x as output. This is what a generator looks like.

204
00:20:42.230 --> 00:20:49.569
Rohit Gandikota: And the classifier which we are really looking for, the structure-finding tool, is actually…

205
00:20:49.960 --> 00:20:54.940
Rohit Gandikota: Something that should take an image as an import and predict if it has the class in it or not.

206
00:20:55.150 --> 00:21:03.669
Rohit Gandikota: Right? And specifically, it needs to detect the underlying data structure, which you don't know, so you don't have this classifier in the first place.

207
00:21:04.140 --> 00:21:10.379
Rohit Gandikota: Right? You're trying to find a classification model from a model you don't understand in the first place.

208
00:21:11.510 --> 00:21:18.120
Rohit Gandikota: Turns out, a lot of years ago, there was someone called Bayes who came up with this nice equation,

209
00:21:18.430 --> 00:21:34.309
Rohit Gandikota: Which essentially turns this classification problem back to your generation paradigm, where you can just, condition your model with the prompt that you don't want, and just an unconditional empty prompt, and you can bring the classifier proxy out of your own network.

210
00:21:35.270 --> 00:21:40.299
Rohit Gandikota: And to really get that unlearned model, all you need to do is keep your original

211
00:21:40.660 --> 00:21:46.419
Rohit Gandikota: Original structure intact, except just ablate or remove the probability.

212
00:21:47.250 --> 00:21:55.259
Rohit Gandikota: that your model generates the image that belongs to the class you don't want in the first place. So, let's say you want to remove the concept of moon.

213
00:21:55.820 --> 00:21:59.639
Rohit Gandikota: So, all you have to do… oh yeah, you can replace the classifier.

214
00:21:59.860 --> 00:22:06.979
Rohit Gandikota: And then all you have to do is run two prompts through your model. One is a prompt that you want to be removed, the concept, which is moon.

215
00:22:07.430 --> 00:22:15.350
Rohit Gandikota: And two is just an empty prompt. And you can… you can do this neat equation here, which we can go into in detail later.

216
00:22:16.450 --> 00:22:24.710
Rohit Gandikota: And then, if your original model was generating Moon, even though it doesn't have it in the first place, you can train your network using this equation.

217
00:22:25.020 --> 00:22:28.989
Rohit Gandikota: And then… yeah, it just stops generating moon.
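
NOTE
A sketch of the erasing objective described here, under the ESD formulation mentioned later in the talk; hedged: `edited_unet` and `frozen_unet` are placeholder callables for whatever diffusion backbone is used, not a public API. The frozen original model supplies a target that moves the conditional score away from the concept, i.e. negatively guided noise.
```python
import torch
import torch.nn.functional as F

def erase_concept_loss(edited_unet, frozen_unet, x_t, t,
                       concept_emb, empty_emb, eta=1.0):
    """One training step of concept erasure, as described in the talk."""
    with torch.no_grad():
        eps_c = frozen_unet(x_t, t, concept_emb)  # eps(x_t, c): "moon" prompt
        eps_0 = frozen_unet(x_t, t, empty_emb)    # eps(x_t): empty prompt
        # Push the score *away* from the concept (negative guidance):
        target = eps_0 - eta * (eps_c - eps_0)
    eps_edited = edited_unet(x_t, t, concept_emb) # trainable copy of the model
    return F.mse_loss(eps_edited, target)
```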

218
00:22:29.420 --> 00:22:33.860
Rohit Gandikota: So, what this shows is that there is an underlying structure,

219
00:22:34.300 --> 00:22:43.070
Rohit Gandikota: Which you can easily detect using the model itself as a beacon, look for where the knowledge is, and precisely edit it out.

220
00:22:43.830 --> 00:22:49.979
Rohit Gandikota: So… this is, like, a rough diagram taken from David's slides,

221
00:22:50.440 --> 00:22:52.460
Rohit Gandikota: where the model

222
00:22:53.170 --> 00:23:00.780
Rohit Gandikota: has multiple different, modalities that it takes as an input. First, you have a text input through which you are prompting your model.

223
00:23:00.930 --> 00:23:03.739
Rohit Gandikota: So it goes through something called a cross-attention.

224
00:23:04.050 --> 00:23:09.559
Rohit Gandikota: And then you have your original U-Net, which is what is actually generating the image in the first place.

225
00:23:10.110 --> 00:23:17.349
Rohit Gandikota: So, you can remove or detect this knowledge from any of these locations, and each of them has its own impact.

226
00:23:17.880 --> 00:23:21.989
Rohit Gandikota: For instance, if you want to remove the knowledge

227
00:23:22.740 --> 00:23:30.360
Rohit Gandikota: completely the visual knowledge of the model. You have to look for this, for this structure inside the vision aspect of the model.

228
00:23:31.080 --> 00:23:36.699
Rohit Gandikota: which is what we call ESDU, which is everything except the text side of things.

229
00:23:37.160 --> 00:23:45.079
Rohit Gandikota: And if you precisely want to remove the knowledge only when it is being asked for, queried for, for example, some artist

230
00:23:45.590 --> 00:23:49.710
Rohit Gandikota: style painting. It's usually queried through their own name.

231
00:23:49.850 --> 00:23:56.629
Rohit Gandikota: So, if you want to be more precise, you can go and just ablate that structure from the text side of things.

232
00:23:56.860 --> 00:24:08.719
Rohit Gandikota: And it turns out it also matters where in the model you are looking for the structure, and that has its own impact as to how deep a knowledge you are removing or tapping into.

233
00:24:10.200 --> 00:24:20.250
Rohit Gandikota: So, yeah, so that method definitely erases the things that we don't want, but I don't think that's the whole picture when it comes to editing a model.

234
00:24:20.560 --> 00:24:26.700
Rohit Gandikota: I can randomize the entire network and still say, yeah, it removed everything, it removed the entire knowledge of vision.

235
00:24:27.060 --> 00:24:33.119
Rohit Gandikota: But then, the thing that is really important is that the model is still intact for the rest of the concepts around it.

236
00:24:33.300 --> 00:24:40.179
Rohit Gandikota: Right? You need to be more precise about what it is that you are removing. Only then will I be able to say that, yeah, I found structure.

237
00:24:40.850 --> 00:24:41.720
Rohit Gandikota: So…

238
00:24:42.440 --> 00:24:52.519
Rohit Gandikota: Turns out, if you ablate your knowledge from the cross-attentions, you have much more precise representations of your structure compared to removing it from the visual side of things.

239
00:24:52.690 --> 00:24:55.329
Rohit Gandikota: And this was a very nice observation that we had.

240
00:24:55.500 --> 00:24:59.849
Rohit Gandikota: And which led to a follow-up with Hadas and Yonatan,

241
00:25:00.060 --> 00:25:04.420
Rohit Gandikota: Where we directly just attack the cross-attention layers.

242
00:25:04.530 --> 00:25:07.910
Rohit Gandikota: and then find the structure within using closed-form solutions.

243
00:25:09.300 --> 00:25:15.070
Rohit Gandikota: So, once we found that, sure, there is a structure in it,

244
00:25:15.320 --> 00:25:24.250
Rohit Gandikota: Can we really control this structure, is the next question, right? We surely removed it. I would call that some form of control.

245
00:25:24.510 --> 00:25:28.499
Rohit Gandikota: But can I, more precisely, have continuous control over this knowledge?

246
00:25:29.210 --> 00:25:37.360
Rohit Gandikota: Turns out, yes, you can very simply take the exact structure-finding method from the previous work.

247
00:25:37.500 --> 00:25:42.659
Rohit Gandikota: And linearly control it in a way that it has a continuous effect on your model.

248
00:25:42.930 --> 00:25:52.989
Rohit Gandikota: For example, if I were to bring all of the fine-tuning that I did for unlearning the model into a small set of parameters as a module.

249
00:25:54.120 --> 00:25:57.609
Rohit Gandikota: It does remove the beard, like, it unlearns beard.

250
00:25:57.860 --> 00:26:04.739
Rohit Gandikota: But you can also control the intensity of it, and it controls how much beard is removed from the model's knowledge.

251
00:26:05.650 --> 00:26:16.109
Rohit Gandikota: Which is great. So, it sort of tells you that, inherently, the parameters that capture structure also have, like, a smooth control definition to them.

252
00:26:16.640 --> 00:26:30.650
Rohit Gandikota: And the nice part about it is when you can erase something, you can do the total opposite and add something. So you can add a smile, you can add glasses, and sort of get a very nice dimension of control over your outputs.

253
00:26:30.970 --> 00:26:33.739
Rohit Gandikota: Just by finding structure inside there.

254
00:26:35.030 --> 00:26:39.230
Rohit Gandikota: And this is not doing anything completely new.

255
00:26:39.530 --> 00:26:41.269
Rohit Gandikota: What we did is we took

256
00:26:41.620 --> 00:26:47.509
Rohit Gandikota: The erasing formulation, and also added an enhancing formulation to it.

257
00:26:47.790 --> 00:26:54.320
Rohit Gandikota: And found ways to control this knowledge linearly in the parameter space, or in the weight space.

258
00:26:54.710 --> 00:27:02.839
Rohit Gandikota: So, what I do is I take the person vector and try to move towards the old direction and away from the young direction.

259
00:27:03.050 --> 00:27:07.599
Rohit Gandikota: And that's exactly what this erasing parameter module is doing.
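
NOTE
A minimal sketch of the slider mechanics just described (illustrative names, not the released implementation): each slider is a low-rank weight direction found with the erase/enhance objective, and because the edit is linear in weight space, a user-facing scale gives continuous control and several sliders compose additively.
```python
import torch

def apply_sliders(weight, sliders, scales):
    """Compose low-rank slider directions on one weight matrix.

    Each slider is a (down, up) low-rank pair; a positive scale enhances
    the concept, a negative scale erases it.
    """
    w = weight.clone()
    for (down, up), alpha in zip(sliders, scales):
        w = w + alpha * (up @ down)  # rank-r update; linear, so sliders stack
    return w

# e.g. old +2.0, smile +0.5, beard -1.0 on the same layer:
# w_edited = apply_sliders(w, [old_dir, smile_dir, beard_dir], [2.0, 0.5, -1.0])
```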

260
00:27:08.840 --> 00:27:12.980
Rohit Gandikota: So, now what happens is that you can, just using text

261
00:27:13.410 --> 00:27:16.450
Rohit Gandikota: As methods to find your structure.

262
00:27:16.750 --> 00:27:19.969
Rohit Gandikota: You can simply control the model's internal knowledge.

263
00:27:20.200 --> 00:27:22.769
Rohit Gandikota: Right? So, all I used was curly hair

264
00:27:22.990 --> 00:27:27.689
Rohit Gandikota: As a prompt to find the structure, and we can control it inside the model.

265
00:27:28.410 --> 00:27:43.030
Rohit Gandikota: So, yeah, there are multiple dimensions you can control, and these are also composable on top of each other, so I can keep composing more and more structure on top, so I can make them have glasses, plus old, plus smiling, and you can keep adding.

266
00:27:43.160 --> 00:27:45.940
Rohit Gandikota: Up to 200 sliders is what we've tested.

267
00:27:46.340 --> 00:27:48.179
Rohit Gandikota: And now we can test more.

268
00:27:49.450 --> 00:27:58.999
Rohit Gandikota: But more specifically, this enables you to have multidimensional control over your images. This is, like, imagine being in the…

269
00:27:59.370 --> 00:28:09.220
Rohit Gandikota: like, 1990s, maybe, when you have a game engine where you can create your own character. You had multiple sliders you can work with, sort of create a character which you really like.

270
00:28:09.840 --> 00:28:15.759
Rohit Gandikota: This sort of composable slider technique allows you to do this now.

271
00:28:17.890 --> 00:28:22.540
Rohit Gandikota: So… And funnily, we also found

272
00:28:22.680 --> 00:28:28.659
Rohit Gandikota: Structure inside the models that will self-repair its own outputs.

273
00:28:29.170 --> 00:28:33.630
Rohit Gandikota: All we did was ask the model to be more realistic.

274
00:28:33.990 --> 00:28:43.959
Rohit Gandikota: And just made the model's structure move towards being more realistic. And it automatically started fixing its own outputs whenever it was breaking the physics.

275
00:28:44.130 --> 00:28:51.509
Rohit Gandikota: or just drawing very bad bicycles. It just corrects itself and creates much better-looking images.

276
00:28:51.830 --> 00:28:57.730
Rohit Gandikota: And originally, we thought that this was a problem needing additional training or more guidance.

277
00:28:58.040 --> 00:29:03.130
Rohit Gandikota: Turns out this was just a structural issue, you can just fix the structure if you can find it.

278
00:29:06.260 --> 00:29:13.990
Rohit Gandikota: And this is another nice throwback to, I think, 2023, when models were not able to generate nice hands.

279
00:29:14.240 --> 00:29:18.250
Rohit Gandikota: Yeah, you can fix them. And it was… it was,

280
00:29:18.550 --> 00:29:22.790
Rohit Gandikota: It was mind-blowing for everyone who looked at this at that time. Now it's…

281
00:29:23.450 --> 00:29:25.489
Rohit Gandikota: Every model can do this now.

282
00:29:26.600 --> 00:29:40.999
Rohit Gandikota: So, okay. So, coming back to this, if you can find a structure and control it, I think this in itself will open up a lot of possibility for you to do explorations of this model and show your intent.

283
00:29:41.270 --> 00:29:45.829
Rohit Gandikota: And turns out, this did catch a lot of attention in the community.

284
00:29:46.380 --> 00:30:00.300
Rohit Gandikota: So when we open-sourced this method, we immediately saw a burst of open-source artists, like, training their own sliders with their own creative imaginations, and releasing them for the public to use.

285
00:30:00.800 --> 00:30:09.900
Rohit Gandikota: And this enabled, I think, a good amount of research within the stable diffusion community to work more on these controllable methods.

286
00:30:10.310 --> 00:30:14.010
Rohit Gandikota: For… for enhancing the user's intention.

287
00:30:14.310 --> 00:30:20.070
Rohit Gandikota: And I think this was… I don't want to talk too much about it, but this was kind of the ChatGPT moment, I guess, I hope.

288
00:30:20.260 --> 00:30:24.980
Rohit Gandikota: Where, like, users were finally able to see that, sure, I can use

289
00:30:25.520 --> 00:30:27.860
Rohit Gandikota: these models to do something which I want.

290
00:30:29.490 --> 00:30:30.370
Rohit Gandikota: So…

291
00:30:30.520 --> 00:30:39.759
Rohit Gandikota: We've talked about this, where as long as your models can't talk to you, it becomes really hard for you to use the models beyond what your own imagination is.

292
00:30:40.000 --> 00:30:56.570
Rohit Gandikota: Right? For instance, let's say I've trained a model which is super realistic and does a lot of things, and my knowledge is that small circle over there. That's all I can explore using the concept sliders that I just showed you, because that's the extent at which I can prompt the model with.

293
00:30:56.670 --> 00:31:02.450
Rohit Gandikota: Right? And hoping that I can bring a lot of humans into this crowdsourcing work.

294
00:31:02.630 --> 00:31:05.699
Rohit Gandikota: And allowing their creativity to be open source.

295
00:31:05.830 --> 00:31:11.710
Rohit Gandikota: maybe I can extend the concept slider limitations to that area of its knowledge.

296
00:31:12.070 --> 00:31:15.010
Rohit Gandikota: But what about this area, where

297
00:31:15.150 --> 00:31:24.930
Rohit Gandikota: the model learns a lot of things during pre-training, but it can't say what it has learned, because that's not the way it was trained in the first place. Unlike language… I think language can…

298
00:31:26.060 --> 00:31:29.689
Rohit Gandikota: It gives an illusion of saying, yeah, yeah, sure, you should learn more about this.

299
00:31:29.880 --> 00:31:31.730
Rohit Gandikota: There's more to learn beyond this.

300
00:31:32.180 --> 00:31:37.459
Rohit Gandikota: So, how do we explore that? And why not just ask the model to talk about its

301
00:31:37.740 --> 00:31:42.090
Rohit Gandikota: Own internals, so that you don't have to be limited by your own potential.

302
00:31:43.130 --> 00:31:53.890
Rohit Gandikota: So, what we do as a first step is we try to ask the model to generate a lot of images using the prompt that I want to explore.

303
00:31:54.160 --> 00:31:59.870
Rohit Gandikota: and then simply just decompose its outputs and find a slider direction

304
00:32:00.160 --> 00:32:03.870
Rohit Gandikota: inside its weights that would sort of map onto one of these

305
00:32:04.100 --> 00:32:09.100
Rohit Gandikota: decompose directions. What it says is, basically, I'm trying to look For the models.

306
00:32:09.830 --> 00:32:15.649
Rohit Gandikota: internal, principal components about this particular concept that I'm very interested in.

307
00:32:16.750 --> 00:32:28.919
Rohit Gandikota: And this is completely unsupervised, so when I say Van Gogh art, I get however many PCA directions I ask for, like, let's say 300… 300 directions that might be relevant to Van Gogh,

308
00:32:29.050 --> 00:32:34.769
Rohit Gandikota: Which I personally would not be able to bring up in the first place. So let me show you some examples.

309
00:32:35.850 --> 00:32:45.610
Rohit Gandikota: So this is the best I could prompt for monsters. It looks pretty generic, it's the same, probably, monster coming up again and again.

310
00:32:45.940 --> 00:33:02.020
Rohit Gandikota: But once you open up these models' internals and look for directions that the model can draw, you start getting all these different weird-looking monsters, which I… I don't have an answer for. I don't know what… this… this,

311
00:33:02.350 --> 00:33:04.489
Rohit Gandikota: The captions that you see here are…

312
00:33:05.640 --> 00:33:08.910
Rohit Gandikota: are generated, like, post-discovery using Claude.

313
00:33:09.130 --> 00:33:14.149
Rohit Gandikota: So, most of them don't even match. I don't think this is "blue bastion"; that's the best Claude could do

314
00:33:14.290 --> 00:33:16.830
Rohit Gandikota: To make sense out of that direction.

315
00:33:16.990 --> 00:33:23.850
Rohit Gandikota: But essentially, my point is that you can start exploring features that you originally would not have

316
00:33:24.170 --> 00:33:26.640
Rohit Gandikota: Thought of in the first place to prompt.

317
00:33:27.870 --> 00:33:37.290
Rohit Gandikota: And again, since we got them out in the form of sliders, you can smoothly control them, and this, again, opens up multiple dimensions of control, except

318
00:33:37.580 --> 00:33:45.090
Rohit Gandikota: This is creative, territory where you don't know that they existed in the first place, and you're just exploring the module.

319
00:33:46.370 --> 00:33:55.679
Rohit Gandikota: The example that I'm really interested in and excited about is us exploring art styles inside the model.

320
00:33:56.020 --> 00:34:01.139
Rohit Gandikota: So, art is a very studied concept.

321
00:34:01.340 --> 00:34:04.850
Rohit Gandikota: In the world, of course, and especially in generative AI.

322
00:34:05.310 --> 00:34:07.759
Rohit Gandikota: So when we…

323
00:34:08.239 --> 00:34:19.300
Rohit Gandikota: decompose the artistic styles inside… like, inside the diffusion models. We found directions which are not human… which are not one-to-one human corresponding, but these are sort of principal directions.

324
00:34:19.510 --> 00:34:24.090
Rohit Gandikota: That the model has learned to encode every human artist that it knows of.

325
00:34:25.760 --> 00:34:35.530
Rohit Gandikota: And, essentially, like, originally, human studies were done in the past to understand what artistic styles the model could

326
00:34:35.909 --> 00:34:38.409
Rohit Gandikota: replicate.

327
00:34:38.659 --> 00:34:41.349
Rohit Gandikota: And then it took almost…

328
00:34:41.510 --> 00:34:49.070
Rohit Gandikota: 250 artists, 7 months of their time, to come up with a list of 8,000 artists that the model can mimic.

329
00:34:49.670 --> 00:34:50.930
Rohit Gandikota: Right? So…

330
00:34:52.320 --> 00:34:59.119
Rohit Gandikota: Now, you can literally explore this space using a single prompt, and maybe 2 hours on a GPU.

331
00:34:59.310 --> 00:35:05.080
Rohit Gandikota: and get most of the relevant slider directions, like, artistic directions, out of these models.

332
00:35:05.710 --> 00:35:11.009
Rohit Gandikota: So… Those are all the realistic artists, like, the real artists and their art.

333
00:35:11.170 --> 00:35:22.789
Rohit Gandikota: And you can always argue: why do I need this exploration in the first place, right? You can simply use generic prompts, like, create an artistic-style image.

334
00:35:22.930 --> 00:35:27.920
Rohit Gandikota: And then the model generates these bunch of images, which all look pretty generic.

335
00:35:28.210 --> 00:35:34.710
Rohit Gandikota: And the distance between the original distribution and the model-generated distribution is quite high.

336
00:35:35.180 --> 00:35:43.350
Rohit Gandikota: Right? And you can also say, hey, I'm in the era of language models, I'll ask the language model to prompt diverse artistic styles.

337
00:35:43.700 --> 00:35:54.700
Rohit Gandikota: And that should do the trick. Why should I explore… why should I ask the model itself to talk about it? You can, and it definitely covers a lot of distance. It's almost, like, half the distance now. You're much closer.

338
00:35:55.110 --> 00:36:00.419
Rohit Gandikota: And using these LLM prompts, we also do concept sliders.

339
00:36:00.840 --> 00:36:17.570
Rohit Gandikota: And then, sure, they are very close to each other, but when you really ask the model to explore and provide its own knowledge, then you see that it covers a lot of knowledge gap between the true distribution and what we have in our hands in the first place.

340
00:36:19.290 --> 00:36:23.060
Rohit Gandikota: So… -Oh.

341
00:36:23.320 --> 00:36:24.180
Rohit Gandikota: Yes.

342
00:36:24.740 --> 00:36:32.409
Rohit Gandikota: And this is a response to… so, SliderSpace essentially is restricting you again,

343
00:36:32.620 --> 00:36:37.169
Rohit Gandikota: Where you have to be the first person to seed what it is that you're looking for.

344
00:36:37.370 --> 00:36:41.469
Rohit Gandikota: What we did with this visualization is we took a concept of art.

345
00:36:41.790 --> 00:36:45.930
Rohit Gandikota: and then started exploring within SliderSpace again and again, through iteration.

346
00:36:46.980 --> 00:36:57.060
Rohit Gandikota: Right? So what that means is you can click on any art, and it'll show you all the nearby sliders around it, and you can control this space and explore it in a much more interactive way.

347
00:36:57.410 --> 00:37:03.430
Rohit Gandikota: This takes a lot of time, so I… I pre-generated a lot of styles, and I'm showing you as a…

348
00:37:03.570 --> 00:37:05.410
Rohit Gandikota: As a demo here.

349
00:37:05.700 --> 00:37:13.769
Rohit Gandikota: But yeah, it got to a place which I would never have imagined in the first place, at least with my creative brain.

350
00:37:15.520 --> 00:37:22.280
Rohit Gandikota: So, okay. We have found a way to discover it. Finally.

351
00:37:22.870 --> 00:37:25.519
Rohit Gandikota: Coming back to this communication part.

352
00:37:26.320 --> 00:37:31.609
Rohit Gandikota: Where, when you ask something, the model is doing a lot of planning and generating an image.

353
00:37:32.170 --> 00:37:34.049
Rohit Gandikota: So, as a…

354
00:37:34.810 --> 00:37:41.350
Rohit Gandikota: As a user, I would want to understand what is the decision that the model made, and I want to communicate with that decision.

355
00:37:41.440 --> 00:37:53.990
Rohit Gandikota: For instance, I want to look at that particular, this particular person that it drew there, and I think that is the decision that is more relevant for this particular generation.

356
00:37:54.070 --> 00:38:01.990
Rohit Gandikota: So how can I tap into this decision-making boundary, and how can I communicate with it, so that I can draw the same person chilling on a beach?

357
00:38:02.640 --> 00:38:10.530
Rohit Gandikota: Right? It becomes a really hard problem when… when you want to now go into the model's decisions and internal decision space.

358
00:38:11.580 --> 00:38:22.649
Rohit Gandikota: So, a quick intro to what the model that I'm working on looks like. It's called Flux. It has two streams, image and text.

359
00:38:22.770 --> 00:38:27.679
Rohit Gandikota: So both of them are processed together. You start from noise, and then you do multiple layers

360
00:38:27.980 --> 00:38:36.810
Rohit Gandikota: Of the same block. You also have, like, some fancy, conditioning called time step and guidance. They all go into this flux block.

361
00:38:37.270 --> 00:38:39.670
Rohit Gandikota: Finally, it generates an image.

362
00:38:40.030 --> 00:38:47.380
Rohit Gandikota: From the image stream, but then you discard the text stream, because you don't need text when you're talking about image generation.

363
00:38:47.920 --> 00:38:54.409
Rohit Gandikota: Right? This is how the model is trained: it takes noise, does dual-stream

364
00:38:54.880 --> 00:39:03.659
Rohit Gandikota: processing, and then you only take the image stream of things, decode it, and just discard that final. There's no optimization done on this text stream.

365
00:39:04.380 --> 00:39:10.709
Rohit Gandikota: So, what we asked is: what if we can use this rich information that the model has learned,

366
00:39:11.430 --> 00:39:13.849
Rohit Gandikota: And discard the image side of things.

367
00:39:14.150 --> 00:39:18.050
Rohit Gandikota: and look at what the text stream has learned, and what we can do with that text.

368
00:39:19.500 --> 00:39:20.470
Rohit Gandikota: So…

369
00:39:21.320 --> 00:39:27.500
Rohit Gandikota: That's exactly what we do. We give the reference image that the model has generated to the same model again.

370
00:39:27.820 --> 00:39:28.790
Rohit Gandikota: And then…

371
00:39:29.340 --> 00:39:38.059
Rohit Gandikota: Instead of tagging the concept that you want with anything special, you can literally just say: a leaf person on the left playing bass.

372
00:39:38.640 --> 00:39:41.949
Rohit Gandikota: So now, we know that the leaf person belongs to that

373
00:39:42.220 --> 00:39:47.840
Rohit Gandikota: person on the left, through captioning. I'm sort of capturing what it is that I'm really interested in.

374
00:39:48.620 --> 00:39:56.109
Rohit Gandikota: And when I use the same prompt, a leaf person chilling on a beach, the regular diffusion model draws this.

375
00:39:56.370 --> 00:40:00.210
Rohit Gandikota: And it's not that person, of course. Why would it?

376
00:40:00.490 --> 00:40:06.650
Rohit Gandikota: Except, I can take these… the vectors that come out of this magic network, which I'll talk about

377
00:40:06.960 --> 00:40:13.240
Rohit Gandikota: In a few seconds, and take the information that it has learned and plug it into my generative model.

378
00:40:13.480 --> 00:40:15.000
Rohit Gandikota: And then it generates that instead.

379
00:40:16.300 --> 00:40:20.289
Rohit Gandikota: Right? You can… you can… You can bring…

380
00:40:20.810 --> 00:40:24.359
Rohit Gandikota: You can tap into the internal decisions that the model makes.

381
00:40:24.510 --> 00:40:27.979
Rohit Gandikota: And bring it into your own generations, and play around with it.

382
00:40:29.380 --> 00:40:30.230
Rohit Gandikota: So…

383
00:40:30.760 --> 00:40:37.100
Rohit Gandikota: The way it was trained is that, for the feed-forward network, we show the image and the caption that it takes.

384
00:40:38.010 --> 00:40:51.860
Rohit Gandikota: And it generates these, what I call, like, decision vectors, or customization vectors, for each token of the text prompt, so that if you're interested in the dog, you can take the vector of the dog token.

385
00:40:52.000 --> 00:40:58.210
Rohit Gandikota: If you're interested in the background, you can take that token. At least it gives you a range of communication bandwidth.

386
00:40:58.390 --> 00:41:03.070
Rohit Gandikota: Of how… what it is that you want to customize, or bring, or what decision it is you want to capture.

387
00:41:03.590 --> 00:41:07.720
Rohit Gandikota: I use the same reference prompt as my generation.

388
00:41:07.910 --> 00:41:16.809
Rohit Gandikota: So if the original diffusion model is generating some other image, I just ask it to generate the same image when all the tokens are employed. So essentially, I'm training this model.

389
00:41:17.260 --> 00:41:27.499
Rohit Gandikota: to say, whatever it is that you're generating from your vectors, word to word, they have to represent some delta towards the concept that I'm really interested in.

390
00:41:29.380 --> 00:41:38.220
Rohit Gandikota: Right? There'll be questions, I'm sure. I'm happy to answer them. If this… if you're absolutely lost, please let me know.

391
00:41:39.120 --> 00:41:46.609
Rohit Gandikota: So yeah, it's a simple MSE that we do, and then there are no other additional

392
00:41:46.840 --> 00:41:51.539
Rohit Gandikota: captions or annotations that we are doing; it's simple text-image caption pairs.
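
As a rough illustration of that objective, here is a toy, self-contained sketch; every module here is a hypothetical stand-in (the real setup uses a frozen diffusion model and a learned encoder), but the shape of the loss is the simple MSE being described:

```python
import torch
import torch.nn.functional as F

dim = 64
embed = torch.nn.Embedding(1000, dim)     # toy prompt-token embedder
encoder = torch.nn.Linear(2 * dim, dim)   # (image feature, token) -> delta vector
denoiser = torch.nn.Linear(dim, dim)      # stand-in for the frozen diffusion model

def training_step(img_feat, caption_ids, t):
    tok = embed(caption_ids)                                # [n_tokens, dim]
    # One "decision vector" per prompt token, from image + token.
    deltas = encoder(torch.cat([img_feat.expand(tok.shape[0], -1), tok], dim=-1))
    noise = torch.randn_like(img_feat)
    noisy = img_feat + t * noise                            # toy forward noising
    cond = (tok + deltas).mean(0, keepdim=True)             # plug deltas into conditioning
    pred = denoiser(noisy + cond)                           # "denoise" under the shifted prompt
    return F.mse_loss(pred, noise)                          # plain MSE, no extra labels

loss = training_step(torch.randn(1, dim), torch.tensor([5, 17, 42]), t=0.5)
loss.backward()  # in the real setup only the encoder is trained; the diffusion model stays frozen
```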

393
00:41:52.020 --> 00:41:53.600
Rohit Gandikota: And using this.

394
00:41:53.750 --> 00:42:01.219
Rohit Gandikota: a single paradigm. We basically learn the true language of whatever is happening inside and the decisions that the models are making.

395
00:42:01.860 --> 00:42:10.629
Rohit Gandikota: So, yeah, we collected 2 million images and trained this general-purpose network that does this.

396
00:42:11.180 --> 00:42:18.000
Rohit Gandikota: So now, what you can do is you can have a bunch of reference images that you want your concept to come from.

397
00:42:18.300 --> 00:42:26.860
Rohit Gandikota: and have a reference prompt to show what it is that you want to pick the concept from, right? So, I'll explain

398
00:42:27.050 --> 00:42:28.070
Rohit Gandikota: these things.

399
00:42:28.950 --> 00:42:33.329
Rohit Gandikota: So these… this is the inference prompt. I said, a dog sitting on a sofa.

400
00:42:33.580 --> 00:42:38.140
Rohit Gandikota: And the original model generates diverse… not so diverse, but…

401
00:42:38.360 --> 00:42:43.540
Rohit Gandikota: very different looking dogs, right? Except, now I can…

402
00:42:43.650 --> 00:42:45.780
Rohit Gandikota: Give this to my feed-forward model.

403
00:42:46.190 --> 00:42:52.129
Rohit Gandikota: And bring in the delta vectors from the dog token, and

404
00:42:52.520 --> 00:42:54.969
Rohit Gandikota: And plug it into these original seeds.

405
00:42:55.080 --> 00:42:58.260
Rohit Gandikota: And then, immediately, the model starts drawing the dog.

406
00:42:58.380 --> 00:43:03.169
Rohit Gandikota: That's in the reference image. So it's essentially, I'm picking the concept and dropping it into my

407
00:43:03.440 --> 00:43:04.969
Rohit Gandikota: Diffusion model.

408
00:43:05.650 --> 00:43:10.730
Rohit Gandikota: And, yeah, it… you can see that for different dogs, it brings all the different dogs.

409
00:43:11.310 --> 00:43:14.559
Rohit Gandikota: From the reference image and puts it into your generations.
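
A toy sketch of this pick-and-drop inference, with invented helpers standing in for the trained encoder and the diffusion sampler:

```python
import torch

torch.manual_seed(0)
dim = 64
prompt = "a dog sitting on a sofa"
embed = {w: torch.randn(dim) for w in prompt.split()}   # toy token embeddings

def encode_reference(ref_feat, ref_prompt):
    # Stand-in for the trained feed-forward encoder: one delta per token.
    return {w: 0.1 * ref_feat for w in ref_prompt.split()}

def generate(prompt, deltas=None):
    deltas = deltas or {}
    cond = torch.stack([embed[w] + deltas.get(w, torch.zeros(dim))
                        for w in prompt.split()])
    return cond.mean(0)   # toy "render": stands in for the diffusion sampler

ref_feat = torch.randn(dim)                              # reference image features
deltas = encode_reference(ref_feat, "a dog on grass")
plain  = generate(prompt)                                # a generic dog
custom = generate(prompt, {"dog": deltas["dog"]})        # the reference dog
# Only the "dog" token is shifted, so the reference dog is dropped into an
# otherwise unchanged generation.
```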

410
00:43:17.310 --> 00:43:18.840
Rohit Gandikota: Oh, I had animation.

411
00:43:19.320 --> 00:43:20.030
Rohit Gandikota: Yes.

412
00:43:21.510 --> 00:43:22.330
Rohit Gandikota: Okay.

413
00:43:23.320 --> 00:43:27.990
Rohit Gandikota: So, till here, this is mostly about what I've done so far.

414
00:43:28.230 --> 00:43:30.179
Rohit Gandikota: And I think I could… oh.

415
00:43:30.530 --> 00:43:32.020
Rohit Gandikota: 30 minutes, that's nice.

416
00:43:32.180 --> 00:43:37.450
Rohit Gandikota: So… In summary, what I've taken you through is that

417
00:43:37.560 --> 00:43:41.350
Rohit Gandikota: We never had a way to… verify their structure.

418
00:43:41.690 --> 00:43:57.280
Rohit Gandikota: Now we do. We find that there is structure inside these models. You can control the structure. It's also discoverable. It's just the first step, and there are many follow-ups that sort of worked on top of this to make it more discoverable.

419
00:43:57.410 --> 00:44:00.820
Rohit Gandikota: And you can now communicate with the internal decisions that the model makes.

420
00:44:01.040 --> 00:44:05.289
Rohit Gandikota: With this in hand, I'll show you what you can do now in the current

421
00:44:05.660 --> 00:44:08.930
Rohit Gandikota: today, with vision models, right? So…

422
00:44:09.320 --> 00:44:11.860
Rohit Gandikota: You ask for a photo of a woman.

423
00:44:12.420 --> 00:44:15.309
Rohit Gandikota: Seated on a couch at a cafe, and it draws this image.

424
00:44:15.690 --> 00:44:20.090
Rohit Gandikota: Think she's too serious, so I can add a smile.

425
00:44:20.420 --> 00:44:23.170
Rohit Gandikota: And then that makes the woman smile now.

426
00:44:23.410 --> 00:44:28.380
Rohit Gandikota: Right? You can say, no, no, no, I want that couch from my favorite TV show.

427
00:44:28.890 --> 00:44:33.739
Rohit Gandikota: and say, I want the woman to be sitting on that, and it brings the couch.

428
00:44:33.850 --> 00:44:43.400
Rohit Gandikota: Right? You can also say, yeah, I think I want, like, some funky new hairstyle, but I don't know what it is. Let me ask the model what all hairstyles it has in its…

429
00:44:43.600 --> 00:44:47.819
Rohit Gandikota: directory, and then I can choose style number 23 that the model gave me.

430
00:44:48.230 --> 00:44:51.329
Rohit Gandikota: And then it draws this. So, now you…

431
00:44:51.520 --> 00:45:03.110
Rohit Gandikota: It opens up a new form of exploring the models and working with them, and finally, it, like, gives you some room to show your intention as a user and what you can do in the first place.

432
00:45:04.120 --> 00:45:08.850
Rohit Gandikota: Okay, this slide took me… 3 weeks to do it.

433
00:45:09.980 --> 00:45:10.820
Rohit Gandikota: Fair.

434
00:45:11.010 --> 00:45:12.999
Rohit Gandikota: I can show intent in 3 weeks.

435
00:45:13.720 --> 00:45:20.810
Rohit Gandikota: But is this the real fix? I think now comes… the main,

436
00:45:21.770 --> 00:45:24.100
Rohit Gandikota: good, solid 10-minute portion of the talk.

437
00:45:24.330 --> 00:45:29.169
Rohit Gandikota: I don't think this is a real fix, especially because these are all post hoc methods.

438
00:45:29.440 --> 00:45:39.869
Rohit Gandikota: We are training the models. It's very similar to what you in the lab are doing with language models. There's one stream of researchers who are really pushing to go towards

439
00:45:41.250 --> 00:45:55.400
Rohit Gandikota: the top 1% AI model that can solve ARC-AGI 3, 3.5, 5, whatever. And then there are us who are trying to say, oh, let me understand this, I don't know what's happening here. By the time I understood GPT-2,

440
00:45:55.640 --> 00:46:01.140
Rohit Gandikota: there's now GPT-10, right? So we… we are playing this cat and mouse game.

441
00:46:01.620 --> 00:46:09.889
Rohit Gandikota: And… I… I don't want to be, way too optimistic with language, but with vision at least.

442
00:46:10.530 --> 00:46:15.690
Rohit Gandikota: It's absolutely necessary that we understand the models for them to be even workable in the first place.

443
00:46:15.950 --> 00:46:26.270
Rohit Gandikota: Right? I showed you a lot of slides earlier, which took 4 years of research to get it to a phase where I can show you one slide.

444
00:46:26.550 --> 00:46:30.109
Rohit Gandikota: Where I had to work for 3 weeks, myself, the person who built these methods.

445
00:46:30.500 --> 00:46:34.949
Rohit Gandikota: Right? And that's nowhere close to being a tool that'll help you show your intention.

446
00:46:36.130 --> 00:46:41.999
Rohit Gandikota: So, I think… The real fix has to be in, like, multiple ways.

447
00:46:42.260 --> 00:46:55.969
Rohit Gandikota: Which always comes back to the same question as to what will make vision the tool that will be widely used in the future. And it's the same thing, which is, I think we'll have to focus on the intention of humans, and…

448
00:46:56.110 --> 00:47:03.670
Rohit Gandikota: How do we enable that to happen in the first place? And it's not an input-output game. I think we should, like, start moving away from it.

449
00:47:03.850 --> 00:47:10.229
Rohit Gandikota: Where we are not trying to optimize towards what's the image arena number one model.

450
00:47:10.770 --> 00:47:16.089
Rohit Gandikota: so, okay, so the first thing is, I think.

451
00:47:16.560 --> 00:47:20.310
Rohit Gandikota: Visual generation and interpretability needs to go hand in hand.

452
00:47:20.960 --> 00:47:23.749
Rohit Gandikota: Right? What I mean by that is…

453
00:47:24.070 --> 00:47:29.119
Rohit Gandikota: when I started thinking about this, it's an upcoming work-in-progress project.

454
00:47:29.490 --> 00:47:39.210
Rohit Gandikota: me and David were talking about, oh, how do we bring this into reality? What is it? And then the first question I wanted to ask was, what format should I choose?

455
00:47:39.730 --> 00:47:41.369
Rohit Gandikota: To start this problem.

456
00:47:41.530 --> 00:47:44.960
Rohit Gandikota: Right? I think, obviously, the answer is video, right?

457
00:47:45.090 --> 00:47:51.699
Rohit Gandikota: Because it has image, it has time, but then 3D is also nice, because you can

458
00:47:51.910 --> 00:47:54.130
Rohit Gandikota: Look at multiple views.

459
00:47:54.480 --> 00:47:58.620
Rohit Gandikota: I think world modeling is also nice. You can, like, explore the space.

460
00:47:59.050 --> 00:48:05.830
Rohit Gandikota: And the more I thought about this, not from a project perspective, but from a,

461
00:48:07.070 --> 00:48:13.750
Rohit Gandikota: From a visual… vision perspective, is that I think every format enables a new kind of

462
00:48:14.190 --> 00:48:19.269
Rohit Gandikota: intention in users. And I think that's why all of them are, in their own ways, very popular.

463
00:48:19.430 --> 00:48:32.010
Rohit Gandikota: Right? World modeling is really popular because you get to explore it, you can play with it, you can say, go forward, go back, turn left, turn right. 3D modeling can show you different perspectives of the same image.

464
00:48:33.610 --> 00:48:39.809
Rohit Gandikota: SVGs, on the other hand, are very human-readable. You can read the code that SVGs are built on top of.

465
00:48:40.280 --> 00:48:46.619
Rohit Gandikota: Images are really realistic, and there's signal processing; you can control sliders in Adobe Premiere or Photoshop.

466
00:48:46.770 --> 00:48:55.489
Rohit Gandikota: And all of them, in their own ways, bring some sort of signal into your intention. And it… that… I think that's why they're…

467
00:48:55.910 --> 00:48:57.620
Rohit Gandikota: Popular in their own way.

468
00:48:58.050 --> 00:49:04.979
Rohit Gandikota: So… Ideally, this is the work in progress, where what we want to do is

469
00:49:05.560 --> 00:49:12.949
Rohit Gandikota: Every time a user types a prompt to generate an image, it's always heavily underspecified. Nobody is going to sit there and write a book

470
00:49:13.180 --> 00:49:14.360
Rohit Gandikota: Of a prompt.

471
00:49:14.570 --> 00:49:17.970
Rohit Gandikota: to generate exactly what they want in their image, right? That's…

472
00:49:18.110 --> 00:49:20.400
Rohit Gandikota: I think, unreasonable to ask from a user.

473
00:49:20.640 --> 00:49:30.380
Rohit Gandikota: Which is why the model internally is expanding so many different things, having to do all this planning in the first place. So, what if we bring that outside the generation process?

474
00:49:30.560 --> 00:49:35.970
Rohit Gandikota: Right? And let the generative model just be a renderer, which has always been how vision was working.

475
00:49:36.120 --> 00:49:42.740
Rohit Gandikota: From a long time ago, right? So, if I were to have this planner that would plan everything out.

476
00:49:42.970 --> 00:49:47.890
Rohit Gandikota: Ideally in human-legible text, or code, or anything, any other format.

477
00:49:48.200 --> 00:49:53.499
Rohit Gandikota: That would let us, as users, understand what the model is deciding to draw.

478
00:49:53.710 --> 00:49:58.889
Rohit Gandikota: And then have this renderer, whose job is literally just to render the image.

479
00:49:59.460 --> 00:50:07.490
Rohit Gandikota: Right? That way, you can edit your planning that the model has planned, and see that edited in real time, or

480
00:50:07.660 --> 00:50:15.489
Rohit Gandikota: Or at least in the right way, where you can have 3D control over your objects, and, like, have your objects rotated in any way you want.

481
00:50:15.680 --> 00:50:16.500
Rohit Gandikota: And…

482
00:50:17.140 --> 00:50:28.279
Rohit Gandikota: To actually make this more useful, you would also want something in a cyclic way, where it'll look at your image and be able to give back the plan that's required to render that image.

483
00:50:28.530 --> 00:50:29.400
Rohit Gandikota: Right?
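
One way to sketch this proposed split in code; every name here is hypothetical, and the bodies are deliberately left open, since this is a work in progress:

```python
from dataclasses import dataclass, field

@dataclass
class Plan:
    objects: list = field(default_factory=list)   # e.g. [{"name": "dog", "pose": {"yaw": 0}}]
    camera: dict = field(default_factory=dict)    # 3D / world-model parameters
    style: dict = field(default_factory=dict)     # signal-processing-style knobs

def plan(prompt: str) -> Plan:
    """Planner: expands an underspecified prompt into a human-legible plan."""
    ...

def render(p: Plan):
    """Renderer: its only job is to draw exactly what the plan says."""
    ...

def parse(image) -> Plan:
    """Parser: recovers a plan from an existing image, closing the loop."""
    ...

# Editing then means editing the plan instead of re-prompting, e.g.:
#   p = parse(image); p.objects[0]["pose"]["yaw"] += 30; image2 = render(p)
```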

484
00:50:30.450 --> 00:50:43.970
Rohit Gandikota: So, every aspect of this brings some sort of intention from every modality that we have seen earlier, right? Your planner needs to know about 3D and world modeling so that it can exactly look

485
00:50:44.020 --> 00:50:52.710
Rohit Gandikota: at different angles, plan for different angles. Your renderer is almost like an SVG renderer that takes code and generates a realistic output, rather than, like,

486
00:50:52.870 --> 00:50:55.060
Rohit Gandikota: something that looks like an SVG.

487
00:50:55.240 --> 00:51:05.900
Rohit Gandikota: And your parser needs to know a lot of signal processing to understand what image it's looking at, what contrast and exposure it has to apply to generate the image back.

488
00:51:06.290 --> 00:51:07.160
Rohit Gandikota: And…

489
00:51:08.190 --> 00:51:17.170
Rohit Gandikota: I think the good news is that it enables vibe coding too, right? This is a big chunk of new language, and you can literally set a lot of

490
00:51:17.320 --> 00:51:22.059
Rohit Gandikota: Agents on top of it, and talk to your image, and enable the editing.

491
00:51:22.170 --> 00:51:27.619
Rohit Gandikota: Through vibe coding, or maybe vibe image, right? I'm the first to claim it.

492
00:51:28.280 --> 00:51:34.900
Rohit Gandikota: And current modern-day vision models are doing all of this packed into one single model.

493
00:51:35.210 --> 00:51:38.779
Rohit Gandikota: And all that we have been trying to do as,

494
00:51:38.930 --> 00:51:53.819
Rohit Gandikota: interp researchers, or controllability researchers, have been trying to guess that this is what is happening, and going and trying to control them internally. And I don't know if that's what I'm capturing. I just gave you a story that that is what is happening. I still can't…

495
00:51:53.940 --> 00:51:55.980
Rohit Gandikota: Guarantee that that's what is happening.

496
00:51:56.560 --> 00:51:58.820
Rohit Gandikota: But I think it's time to break that.

497
00:51:59.060 --> 00:52:00.470
Rohit Gandikota: and have…

498
00:52:00.910 --> 00:52:09.889
Rohit Gandikota: interp come out of the model that there is, and make this more usable… and also, like, give the user an illusion

499
00:52:10.000 --> 00:52:25.199
Rohit Gandikota: that they understand exactly what's happening inside the model, right? I… there… this opens up a lot of other avenues. What is this planner internally doing? We don't know. What is the renderer doing? We don't know. But at least this sort of opens up a way for users to show their intention.

500
00:52:25.390 --> 00:52:29.590
Rohit Gandikota: Right? Pushing these models to be more widely used tools.

501
00:52:30.250 --> 00:52:36.629
Rohit Gandikota: And the more… I started making that slide, the second thing came out.

502
00:52:36.920 --> 00:52:45.450
Rohit Gandikota: every time I show how reasoning works inside a language model… sorry, inside a vision model, it's always language.

503
00:52:45.560 --> 00:52:50.090
Rohit Gandikota: But what would a visual… reasoning medium look like?

504
00:52:50.200 --> 00:52:56.089
Rohit Gandikota: This is slightly half-baked. I'm still not sure what it should look like. I think this can be a…

505
00:52:56.350 --> 00:53:04.850
Rohit Gandikota: big, research area on its own, but I feel like every vision person I can talk to will definitely vouch for this.

506
00:53:05.040 --> 00:53:10.210
Rohit Gandikota: That the image medium on its own brings a lot more information that text can't bring.

507
00:53:10.730 --> 00:53:11.490
Rohit Gandikota: Right?

508
00:53:11.740 --> 00:53:14.260
Rohit Gandikota: Or can bring, but it would take

509
00:53:14.460 --> 00:53:17.479
Rohit Gandikota: Like, an unreasonably large amount of text to bring it.

510
00:53:18.220 --> 00:53:31.320
Rohit Gandikota: So, I feel like having some sort of visual reasoning… visual medium reasoning inside the model would obviously be, much more efficient for us to have this sort of… and also, like.

511
00:53:31.480 --> 00:53:32.800
Rohit Gandikota: human usability.

512
00:53:33.190 --> 00:53:46.030
Rohit Gandikota: When it comes to generating these images. Like, for instance, if I want… like, the model might be planning inside that, oh, should I place the dog behind layer 1 of the tree, or layer 2 of the tree? All of these things are hard to teach to the model

513
00:53:46.210 --> 00:53:54.379
Rohit Gandikota: Via text, and I think there has to be some form of visual reasoning happening inside.

514
00:53:55.150 --> 00:54:01.159
Rohit Gandikota: And finally, okay, so I have brought you into this room by promising that

515
00:54:01.440 --> 00:54:05.819
Rohit Gandikota: Oh, vision is going towards a dead end, and I have the solution to fix it.

516
00:54:06.000 --> 00:54:16.080
Rohit Gandikota: But I have silently been holding back some information. I do hope that generation on its own, visual generation on its own, comes to some natural endpoint.

517
00:54:16.250 --> 00:54:22.690
Rohit Gandikota: Right? Where visual generation is not a standing research area where people are working on it for a long time.

518
00:54:23.040 --> 00:54:26.470
Rohit Gandikota: What I mean by that is, I think vision on its own

519
00:54:26.780 --> 00:54:33.560
Rohit Gandikota: Has to be changed into something which is readable, or which is easily explorable by us humans.

520
00:54:33.890 --> 00:54:49.769
Rohit Gandikota: every time we look at an image, it's… it becomes, like, such a novelty thing that I look at this nebula, and I'm like, wow, this is so nice. But that's pretty much it. I can't explore what's happening inside that image, I don't know what the image is made of, what's the reasoning happening behind it. For that to…

521
00:54:49.950 --> 00:54:54.990
Rohit Gandikota: For me to understand that, I need to go to the internet, search, and read the text behind it.

522
00:54:55.120 --> 00:54:57.589
Rohit Gandikota: Maybe make sense out of what that nebula is.

523
00:54:57.820 --> 00:54:59.290
Rohit Gandikota: Right? And…

524
00:54:59.800 --> 00:55:08.319
Rohit Gandikota: what I'm really hoping for is vision generation becomes a very small part of us understanding how the visual world works in the first place.

525
00:55:08.710 --> 00:55:19.240
Rohit Gandikota: Right? Where you can finally look at this nebula and say, hey, I want to look at it from a different angle. And then your agent goes to internet, brings multiple images, and forms a 3D rendering on its own internally.

526
00:55:19.360 --> 00:55:22.899
Rohit Gandikota: And you can go explore that space. You can click into that image.

527
00:55:23.180 --> 00:55:31.969
Rohit Gandikota: go beyond what you see in the first place. And I think for this to happen… if this all sounds like sci-fi, well, of course, language modeling sounded like sci-fi.

528
00:55:32.360 --> 00:55:35.290
Rohit Gandikota: 5, 10 years ago. But I think this…

529
00:55:35.770 --> 00:55:38.390
Rohit Gandikota: like, this is what I really care about. I think…

530
00:55:38.510 --> 00:55:48.240
Rohit Gandikota: I love images, I love visual world, but it always holds you back that you can't explore them, and you can't understand them the way we understand and explore text.

531
00:55:48.520 --> 00:55:51.860
Rohit Gandikota: So I think, eventually, I think that that has to be…

532
00:55:52.080 --> 00:55:54.010
Rohit Gandikota: That is where I would retire, I think.

533
00:55:54.790 --> 00:55:56.950
Rohit Gandikota: So, okay, thank you very much.

534
00:56:11.360 --> 00:56:25.659
Rohit Gandikota: Did you have any questions in the beginning? Yes. Oh, yes. Oh, actually, I have a question for the end. Yes, so I guess, when you talk about, like, I like your last slide, very fascinating, like, for example, like, vision, like, or yeah, I wouldn't say vision, but I would say image generation is not all.

535
00:56:25.970 --> 00:56:29.970
Rohit Gandikota: Because, for example, you can ask questions, like, I want to see a 3D rendering of stuff.

536
00:56:30.600 --> 00:56:36.219
Rohit Gandikota: I actually feel like for vision… visual exploration or, like, interaction, like.

537
00:56:36.360 --> 00:56:48.099
Rohit Gandikota: what do you think will be a good interface? I feel like that's actually the biggest question, or one of the biggest questions. For language, you know, interface is text. But for images, you can sketch.

538
00:56:48.300 --> 00:57:02.599
Rohit Gandikota: can compose things in layers, we can have sliders, we can have, like, text, you know, there's so many different things you can play with images. Yeah. So I feel like that's actually one of the things that's really hard for the image generation space to converge to. Yep.

539
00:57:02.710 --> 00:57:03.650
Rohit Gandikota: And…

540
00:57:04.040 --> 00:57:10.139
Rohit Gandikota: I'm thinking about, do you think whether we can just have just one model that tackles everything, or it can have, like, different…

541
00:57:10.510 --> 00:57:17.990
Rohit Gandikota: models that explore things from two different angles? What are your thoughts on that? Yeah, I… So…

542
00:57:18.330 --> 00:57:20.890
Rohit Gandikota: Following the talk that I gave.

543
00:57:21.280 --> 00:57:35.509
Rohit Gandikota: I think definitely it has to be multiple different models coming up and doing the same task as an orchestra, so that at least you have good resolution to look into what it is each model is doing.

544
00:57:35.720 --> 00:57:39.839
Rohit Gandikota: But I… I very strongly agree that,

545
00:57:40.790 --> 00:57:47.529
Rohit Gandikota: looking at the technology side of things is one aspect, but the real aspect would eventually become what is the UI?

546
00:57:47.680 --> 00:57:50.340
Rohit Gandikota: That a user is interacting with, right?

547
00:57:50.490 --> 00:57:57.569
Rohit Gandikota: And, with the very narrow view that I have with the slides that I made, I think it

548
00:57:57.680 --> 00:58:00.469
Rohit Gandikota: it becomes something of a hybrid UI.

549
00:58:00.650 --> 00:58:09.619
Rohit Gandikota: Where, given an image, you, as a user, you talk or you type, and then the UI should eventually give you the controls

550
00:58:09.870 --> 00:58:15.560
Rohit Gandikota: dynamically, saying, hey, I want to explore this. Then it gives you, like, arrow keys for you to explore.

551
00:58:15.780 --> 00:58:20.240
Rohit Gandikota: And you say, no, no, no, I actually want to see a 3D view of it, then it gives you a 3D slider to…

552
00:58:20.600 --> 00:58:22.539
Rohit Gandikota: do a 3D rendering of it.
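
A toy sketch of that dynamic-control dispatch, with invented widget names and rules; a real version would presumably use a model rather than keyword matching:

```python
def controls_for(intent: str):
    """Map a typed intent to the controls the UI should surface (toy rules)."""
    intent = intent.lower()
    if "explore" in intent or "walk" in intent:
        return ["arrow_keys"]          # walk around, world-model style
    if "3d" in intent or "angle" in intent or "view" in intent:
        return ["rotation_slider"]     # rotate / multi-view control
    if "smile" in intent:
        return ["smile_slider"]        # a concept slider
    return ["text_box"]                # fall back to plain prompting

print(controls_for("I want to see a 3D view of it"))   # ['rotation_slider']
```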

553
00:58:23.040 --> 00:58:27.709
Rohit Gandikota: Essentially where… You feel,

554
00:58:28.010 --> 00:58:36.850
Rohit Gandikota: less friction when it comes to wanting to choose the right tool to do the right thing. I think it should be somehow an intelligent way to

555
00:58:37.090 --> 00:58:39.299
Rohit Gandikota: feed these controls.

556
00:58:39.760 --> 00:58:42.809
Rohit Gandikota: Based on your intention, if that makes sense. I see.

557
00:58:43.390 --> 00:58:46.700
Rohit Gandikota: But, yeah, it definitely requires a lot of,

558
00:58:46.930 --> 00:58:51.070
Rohit Gandikota: research, I guess, almost as if, like, for example, one…

559
00:58:52.140 --> 00:59:03.440
Rohit Gandikota: like, recent way I can think about it is just, like, an LM can call, like, a tool use, for example, make its own tool, and then maybe create its own, like, widget. Oh, that'll be nice. Yeah, yeah, why not? Yeah.

560
00:59:05.690 --> 00:59:11.549
Rohit Gandikota: But doesn't it go back to the fact that your intention is more expressible?

561
00:59:12.750 --> 00:59:13.930
Rohit Gandikota: What does that mean?

562
00:59:14.400 --> 00:59:19.660
Rohit Gandikota: So you said, like, that the tool that you're gonna use, or the interface, is based on your intention.

563
00:59:19.980 --> 00:59:22.709
Rohit Gandikota: And then you can prompt.

564
00:59:23.230 --> 00:59:27.160
Rohit Gandikota: something, like, I want to get, like, a 3D view of these, like.

565
00:59:27.420 --> 00:59:38.460
Rohit Gandikota: But you will use text to represent your intention, and then an agent or someone will convert that to, like, the right thing. Yeah, yeah, that's a… that's a good point. So I, I,

566
00:59:40.200 --> 00:59:46.840
Rohit Gandikota: I'm not saying text is the only medium to show intent. I'm saying it's a good medium to express your intent.

567
00:59:47.680 --> 00:59:56.910
Rohit Gandikota: But then, when it actually comes to editing or exploring it, it becomes, like, all these multiple dimensions of control, where you're touching the image, like, you're…

568
00:59:57.300 --> 01:00:01.369
Rohit Gandikota: Like, doing some sort of a model walk, or medium walk.

569
01:00:01.510 --> 01:00:20.730
Rohit Gandikota: There are, like, sliders which will give you control as to how much the person is smiling. But for, like, there should be some medium where you say, oh, I think this person needs to smile more, right? But then, it's not just by text, you're saying, no, no, smile more, no, no, no, even more. That's not what you're doing, right? When you say, I want to control the smile.

570
01:00:20.900 --> 01:00:24.449
Rohit Gandikota: And then the model should give you some sort of a new medium to

571
01:00:25.430 --> 01:00:27.790
Rohit Gandikota: Control the intent that you have.

572
01:00:28.450 --> 01:00:31.569
Rohit Gandikota: But it's a good question. I think…

573
01:00:31.990 --> 01:00:41.670
Rohit Gandikota: to start an intention, you need a way to communicate, and right now, I see text as a way to start it. Yeah, I think text is very similar, like, very…

574
01:00:42.450 --> 01:00:52.360
Rohit Gandikota: compressed, and very semantic. Yes, yes. It's like you've got, like, a semantic segmentation map of your image, almost. Yeah. Or something like that. Right.

575
01:00:54.010 --> 01:00:59.410
Rohit Gandikota: But the eventual… Intention needs to go through multiple mediums, not just text.

576
01:01:02.030 --> 01:01:08.269
Rohit Gandikota: And also, like, say, if… The user interface is text, then, like.

577
01:01:08.640 --> 01:01:16.490
Rohit Gandikota: Well, I guess you're proposing something else. Let's say if the user interface is X, then I can easily argue it a little bit. Let's try to be the bad guy.

578
01:01:16.660 --> 01:01:20.420
Rohit Gandikota: I think nowadays people do, like, prompt expansions, all these things, but… Yeah.

579
01:01:21.060 --> 01:01:40.400
Rohit Gandikota: shout out to my friend Red. Yes, yeah, yeah. Yeah, they have all these… after you specify, like, a text prompt, which is, like, not really specific, they'll have the LLM, like, generate various, like, very detailed text prompts. Yes. They have, like, different types of, like, text, where the users can then edit the text prompt they choose. Absolutely, yeah. But I feel like…

580
01:01:41.550 --> 01:01:56.190
Rohit Gandikota: That's not satisfying enough. Of course, like, text can describe the image so well, but I feel like it's not intuitive enough for humans to say, hey, like, I really want to look into these text prompts and select one, through this very convoluted process. Yes. I feel like the interface is still, like, a big issue.

581
01:01:56.190 --> 01:02:02.039
Rohit Gandikota: Yeah. I agree. And I also want to clarify, I don't want to remove text from…

582
01:02:02.690 --> 01:02:04.330
Rohit Gandikota: Whatever is happening right now.

583
01:02:04.440 --> 01:02:11.179
Rohit Gandikota: I think it has to be there, like, it can do multiple good things, but it restricts us from doing a lot more beautiful things with images.

584
01:02:11.610 --> 01:02:18.799
Rohit Gandikota: And I'm just saying it might require new ways to interact with them, new mediums of reasoning.

585
01:02:19.690 --> 01:02:22.419
Rohit Gandikota: But yeah, eventually.

586
01:02:23.160 --> 01:02:29.660
Rohit Gandikota: think text will be a way to… hey, like, hey Siri, and then… And we'll get…

587
01:02:32.750 --> 01:02:36.970
Rohit Gandikota: Do you think that understanding model internals is…

588
01:02:37.560 --> 01:02:48.120
Rohit Gandikota: necessary for providing the user with the control and ability to express and exercise their intent.

589
01:02:48.270 --> 01:02:49.360
Rohit Gandikota: or…

590
01:02:50.080 --> 01:02:59.489
Rohit Gandikota: Or maybe scale will solve everything, once… once the big labs collect enough examples of… it doesn't have to be text, but whatever

591
01:03:00.220 --> 01:03:16.129
Rohit Gandikota: whatever representation you use for expressing user intent, they collect enough of these examples, they match them with images, they train one model to do this, and do they still… do you think you're… sort of… your techniques are kind of…

592
01:03:16.370 --> 01:03:29.990
Rohit Gandikota: not all the way interp, but they are, you know, looking internally and cleverly selecting where to do interventions. Do you think this is needed, or maybe scale will solve everything? So, that's a very good question.

593
01:03:30.470 --> 01:03:34.540
Rohit Gandikota: And I was asked this question when I gave a similar talk, and…

594
01:03:35.230 --> 01:03:45.520
Rohit Gandikota: So, my response is, we already have a lot of models, like Stable Diffusion XL, that spent a lot of money to collect different forms of input-output

595
01:03:45.680 --> 01:03:48.790
Rohit Gandikota: data and train those models, like, including Nano Banana now.

596
01:03:48.930 --> 01:03:51.219
Rohit Gandikota: Where you can talk to your images in text.

597
01:03:51.800 --> 01:03:55.840
Rohit Gandikota: And with these models, by understanding

598
01:03:56.550 --> 01:04:00.519
Rohit Gandikota: the ways to interact with them more, it unlocks

599
01:04:00.760 --> 01:04:04.430
Rohit Gandikota: a slight bit more delta than when you use them without that understanding.

600
01:04:05.330 --> 01:04:11.289
Rohit Gandikota: Right? I'm saying you can use a billion dollars to create this one single model that would do most of the things.

601
01:04:11.500 --> 01:04:16.449
Rohit Gandikota: But I think there's always a way to unlock more out of it, if you understand how vision works.

602
01:04:18.040 --> 01:04:21.969
Rohit Gandikota: Like, you can get… you can get more use for your user.

603
01:04:22.290 --> 01:04:28.430
Rohit Gandikota: When you understand what it does, and enable new controls from it. But… I, I, I think…

604
01:04:29.020 --> 01:04:32.940
Rohit Gandikota: Coming to the… the… The next steps.

605
01:04:33.060 --> 01:04:34.440
Rohit Gandikota: that I was suggesting.

606
01:04:34.580 --> 01:04:42.100
Rohit Gandikota: I do want big companies to spend a lot of money and build those kind of methods.

607
01:04:42.310 --> 01:04:47.849
Rohit Gandikota: which would… Which would eventually allow users to see what the model is thinking through.

608
01:04:47.980 --> 01:04:53.999
Rohit Gandikota: And then… have a sense of intention, and that I can change something that the model is thinking of.

609
01:04:54.670 --> 01:05:01.679
Rohit Gandikota: If it is single model, multiple models, I don't think I have a strong opinion yet.

610
01:05:01.860 --> 01:05:04.740
Rohit Gandikota: Because I haven't tested it myself.

611
01:05:05.080 --> 01:05:11.899
Rohit Gandikota: But… It is eventually really, really important that humans feel that they're

612
01:05:12.320 --> 01:05:15.939
Rohit Gandikota: That their intentions are heard, to be cheesy.

613
01:05:16.650 --> 01:05:24.430
Rohit Gandikota: if that… I don't think that happens with vision models, and there has to be some way for you to show what the model is doing, so that

614
01:05:24.600 --> 01:05:33.969
Rohit Gandikota: they have this sense of control over the generations. But do I need to have a way to show what the model is doing internally, or is it enough that…

615
01:05:34.250 --> 01:05:36.529
Rohit Gandikota: I have a way for the user to…

616
01:05:36.830 --> 01:05:47.779
Rohit Gandikota: specify what they want to get. Maybe it's underspecified, and then it's being expanded, whatever. But I have this plan, structured plan, or whatever it is. The model is generating an image.

617
01:05:47.950 --> 01:05:59.340
Rohit Gandikota: It's working, what do I care how it's doing that internally? Yeah, there are models already that do most of the things that, I described. Yeah.

618
01:06:00.120 --> 01:06:02.940
Rohit Gandikota: There's always, like… I think…

619
01:06:04.690 --> 01:06:10.650
Rohit Gandikota: So, from stable diff… I can give you, like, a quick timeline. Stable Diffusion can just do text-to-image.

620
01:06:10.770 --> 01:06:13.129
Rohit Gandikota: And for you to change the image, you have to re-prompt.

621
01:06:13.890 --> 01:06:28.839
Rohit Gandikota: And then Flux came out saying, oh, you can also give an image as an input, so now you can give an image and a text and sort of change it. Nano Banana now said, no, no, you can keep on talking to your image again and again, and based on this context, I'll generate better images.

622
01:06:29.150 --> 01:06:32.430
Rohit Gandikota: With each one of them, I think a few problems are solved.

623
01:06:32.740 --> 01:06:36.540
Rohit Gandikota: But it's always this, like, there's always researchers that

624
01:06:37.130 --> 01:06:41.229
Rohit Gandikota: that find that, oh, even in Nano Banana, if you work with them.

625
01:06:41.340 --> 01:06:46.239
Rohit Gandikota: I think it's a good novelty. You can definitely get something out of it.

626
01:06:46.380 --> 01:06:51.020
Rohit Gandikota: But if you really want to use it in, in your homeworks.

627
01:06:51.440 --> 01:06:55.830
Rohit Gandikota: It becomes really hard for you to specify in text what it is you want to do.

628
01:06:56.120 --> 01:07:12.170
Rohit Gandikota: So if there is… yeah. Let's forget about the restriction to text. Maybe the specification is a mix of text and some structured formats, such as the one you showed, this plan… Yes. …that you've generated. Yeah, that's fine. What I'm specifically curious to hear your thoughts on…

629
01:07:12.590 --> 01:07:21.580
Rohit Gandikota: Do you need to understand internally how the model is working to provide a user with the ability to express direct intent effectively?

630
01:07:21.720 --> 01:07:36.370
Rohit Gandikota: Or maybe if we collect enough examples of what users want, specified in whatever structured format you want, or some semi-structured format, enough examples of that with matched images or videos, and train a big model on that, would that be enough?

631
01:07:36.690 --> 01:07:39.499
Rohit Gandikota: So, everything that humans want into…

632
01:07:39.960 --> 01:07:46.109
Rohit Gandikota: Well, like… You don't need to cover everything, right? I mean, in… No, I mean, like, most of the things that…

633
01:07:46.570 --> 01:08:04.569
Rohit Gandikota: like I touched upon, like, 3D controls and all the other things. Yeah, I think that's the vision that I'm going towards, too. So I would love if people would build it. I mean, in other words, you don't need to necessarily understand how the neurons work, as long as people can control it, it's fine. So that's what you're… that's your answer, is that right? Yeah.

634
01:08:04.720 --> 01:08:05.920
Rohit Gandikota: Yeah, that's my question.

635
01:08:06.130 --> 01:08:20.620
Rohit Gandikota: Oh, yeah, I think that's my answer, too. So I didn't… so in the last… in the end, when I was showing things, I didn't really allude to, I need to understand what the renderer and, like, the planners are doing. I don't really care about it.

636
01:08:20.620 --> 01:08:36.140
Rohit Gandikota: in that depth. And then I think it still becomes a research interest as to what these models are doing. But to solve the problem that I think we both are talking about, I think you don't need to understand what the model is doing internally. At least there needs to be a way for

637
01:08:36.310 --> 01:08:40.070
Rohit Gandikota: Users to specify what they want, and get the desired output in the right way.

638
01:08:40.680 --> 01:08:41.370
Rohit Gandikota: Yeah.

639
01:08:42.189 --> 01:08:44.710
Rohit Gandikota: Maybe something to consider is that

640
01:08:45.130 --> 01:08:50.790
Rohit Gandikota: Different people might have different mediums to express their intention, so some people, like.

641
01:08:50.930 --> 01:08:57.890
Rohit Gandikota: For them, text is, like, the easiest way. For other people, maybe drawing is the easiest way. And as long as you can

642
01:08:58.270 --> 01:09:06.550
Rohit Gandikota: train your, like, pre-train your model with all these modalities as, like, the way to express intention, then great, I think you can just do pre-training.

643
01:09:06.720 --> 01:09:13.550
Rohit Gandikota: But if you don't, you might want to understand internals to enable new

644
01:09:14.000 --> 01:09:22.490
Rohit Gandikota: mediums of communications, or something like that. Like, maybe understanding internals will help you communicate with the model through

645
01:09:22.680 --> 01:09:25.719
Rohit Gandikota: a new medium, rather than pre-training

646
01:09:26.140 --> 01:09:27.860
Rohit Gandikota: with that anymore. Cynthia?

647
01:09:28.359 --> 01:09:33.069
Rohit Gandikota: I can see another reason why you might want it, and that has to do with your discovery.

648
01:09:33.260 --> 01:09:38.890
Rohit Gandikota: Because… The discovery allows you to perhaps find new things.

649
01:09:39.029 --> 01:09:47.740
Rohit Gandikota: that people have not conceived at all, right? And I'm also curious about your take, if you think that we can, like, in the…

650
01:09:48.189 --> 01:10:00.249
Rohit Gandikota: In the space of concepts, sliders or whatever, can you find completely… extrapolate to completely new regions, or are you mostly interpolating between known things?

651
01:10:00.700 --> 01:10:11.219
Rohit Gandikota: Can we discover completely novel knowledge or skills from examining models internally, or are we restricted to what we already do?

652
01:10:14.240 --> 01:10:21.670
Rohit Gandikota: Good question. Yeah, this reminds me of Lisa Schut's work on AlphaZero.

653
01:10:27.680 --> 01:10:35.870
Rohit Gandikota: Yeah, I… I'm not sure how to go about looking for novel visual structure.

654
01:10:36.440 --> 01:10:41.939
Rohit Gandikota: I understand that in chess, maybe a little bit, on how you can think about it; it's a much more

655
01:10:42.590 --> 01:10:49.110
Rohit Gandikota: structured problem, but with vision, I don't know how to do this,

656
01:10:49.300 --> 01:10:51.379
Rohit Gandikota: Is it novel or not yet?

657
01:10:51.670 --> 01:10:53.109
Rohit Gandikota: I'll think more about it.

658
01:10:53.270 --> 01:10:57.259
Rohit Gandikota: I thought that's what you were talking about when you were talking about the styles.

659
01:10:57.920 --> 01:11:04.879
Rohit Gandikota: Like, there are… people have found out this many sliders, but maybe you can go beyond that, and…

660
01:11:05.210 --> 01:11:10.769
Rohit Gandikota: If you do that slider, if you… Look at the slider space.

661
01:11:13.030 --> 01:11:18.700
Rohit Gandikota: Multiple times, maybe you can get a style which is not really a style that people have

662
01:11:18.890 --> 01:11:33.950
Rohit Gandikota: explored in the real world. Yeah, no, absolutely. No, the point is the same. I'm not saying the discovery methods will give you new… will not give you new. I'm sure they will. I don't know the way to test it.

663
01:11:34.630 --> 01:11:38.400
Rohit Gandikota: Yeah, I think art is, like, a case where it's, like,

664
01:11:38.550 --> 01:11:48.760
Rohit Gandikota: better defined, right? Like, you can define… or not really, it's not really, but it's more defined than, like, general vision. Yeah, like, yeah.

665
01:11:51.480 --> 01:11:56.970
Rohit Gandikota: You're saying the art is more subjective, that's why maybe we don't get novelty.

666
01:11:58.320 --> 01:12:02.649
Rohit Gandikota: No, I'm actually saying the opposite. I'm saying that, like, with art.

667
01:12:03.900 --> 01:12:06.330
Rohit Gandikota: Maybe there is a way to measure

668
01:12:06.630 --> 01:12:13.589
Rohit Gandikota: Like, new, this, like, new concepts, because you can measure the similarity to known concepts, or known textbooks.

669
01:12:14.370 --> 01:12:18.240
Rohit Gandikota: But with the general vision, Yeah…

670
01:12:18.670 --> 01:12:23.339
Rohit Gandikota: But maybe if you go to, like, a scientific regime.

671
01:12:23.570 --> 01:12:29.569
Rohit Gandikota: maybe then it's easier to define some… like, I don't… maybe there is a way to do that in a specific domain.

672
01:12:29.800 --> 01:12:30.460
Rohit Gandikota: Yep.

673
01:12:31.980 --> 01:12:37.269
Rohit Gandikota: A dumb idea, but I want to do sliders for GAN warping, like, weird creatures, weird-looking cats.

674
01:12:37.620 --> 01:12:42.440
Rohit Gandikota: Yeah, I think that's… that's what the UI was trying to do.

675
01:12:42.700 --> 01:12:48.199
Rohit Gandikota: Actually, I thought of something else. Yes, it just reminded me of the… it just reminded me of that,

676
01:12:48.790 --> 01:12:55.370
Rohit Gandikota: discovery, I think, 2020 or 2019, somebody at MIT got a picture of a black hole.

677
01:12:55.880 --> 01:12:59.369
Rohit Gandikota: But maybe you can create images of…

678
01:12:59.560 --> 01:13:06.149
Rohit Gandikota: some celestial bodies that people have not taken pictures of. Very nice, yes. Why not? Yes.

679
01:13:07.410 --> 01:13:12.590
Rohit Gandikota: But yeah, I think it would be nice if that's possible, yeah. Maybe people can check it.

680
01:13:13.800 --> 01:13:16.519
Rohit Gandikota: Something different. Yeah. Yeah, yeah.

681
01:13:17.110 --> 01:13:22.870
Rohit Gandikota: So, any question about, So your plan… your vision is to,

682
01:13:24.410 --> 01:13:27.450
Rohit Gandikota: Decompose the image into some kind of, like.

683
01:13:27.670 --> 01:13:31.059
Rohit Gandikota: some text form, maybe an SVG form, that you can…

684
01:13:31.520 --> 01:13:34.280
Rohit Gandikota: We can really specify and create the image.

685
01:13:34.720 --> 01:13:37.600
Rohit Gandikota: Which is sort of like a new programming language, apparently.

686
01:13:38.290 --> 01:13:42.040
Rohit Gandikota: In my near… the near future vision, yes, yes.

687
01:13:42.360 --> 01:13:45.310
Rohit Gandikota: So, in that programming language.

688
01:13:45.950 --> 01:13:52.810
Rohit Gandikota: Do you imagine that every concept you have discovered, or every concept in human knowledge.

689
01:13:53.300 --> 01:13:56.480
Rohit Gandikota: It becomes some kind of, like, an operation in the programming language?

690
01:13:59.040 --> 01:14:01.969
Rohit Gandikota: Like, how do you, let's say, smile?

691
01:14:02.130 --> 01:14:08.130
Rohit Gandikota: Does it, like, become some kind of, like, a keyword in that programming language?

692
01:14:08.270 --> 01:14:12.120
Rohit Gandikota: And also, like… Okay, so for…

693
01:14:12.530 --> 01:14:15.999
Rohit Gandikota: Like, it's a currently underway project.

694
01:14:16.320 --> 01:14:21.790
Rohit Gandikota: And the… what the current version is doing is, it has…

695
01:14:23.360 --> 01:14:36.360
Rohit Gandikota: something we call, like, concept vectors, where on top of the language that it's generating, it also has identity as a JSON key, and the value is just a vector.

696
01:14:36.830 --> 01:14:41.690
Rohit Gandikota: And then for that vector, the model has to always generate the same identity of the person again and again.

697
01:14:42.720 --> 01:14:57.620
Rohit Gandikota: So, like, similarly, smile can also be, like, a slider kind of an input, where you can keep sliding it in the language, and then the output should just change the smile and keep all the other things intact.
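
A hedged sketch of what such a plan format might look like; the keys here are purely illustrative, not the project's actual schema:

```python
plan = {
    "scene": "a woman seated on a couch at a cafe",
    "person": {
        "identity": [0.12, -0.53, ...],   # opaque vector: same values -> same person
        "smile": 0.7,                     # a slider living in the language itself
        "apparel": {"shirt": "blue"},     # nested, layered tree of attributes
    },
}
# Re-rendering with only "smile" changed should alter the smile and keep
# everything else (identity, couch, cafe) intact.
```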

698
01:14:58.140 --> 01:15:01.209
Rohit Gandikota: But yeah, we are currently doing it, where…

699
01:15:03.850 --> 01:15:12.480
Rohit Gandikota: Yeah, I should give a talk about this, when it's in a much more baked form, but yeah, like, that's the goal, where we try to change

700
01:15:12.780 --> 01:15:15.660
Rohit Gandikota: Expand everything the model is planning internally.

701
01:15:15.760 --> 01:15:17.739
Rohit Gandikota: into this language, as much as possible.

702
01:15:18.420 --> 01:15:27.070
Rohit Gandikota: And then generate an image from the composed plan. So they become their… keywords, so all operational?

703
01:15:27.700 --> 01:15:31.410
Rohit Gandikota: Video… Yeah.

704
01:15:31.570 --> 01:15:35.669
Rohit Gandikota: like, a layered tree structure. Like, a person can be…

705
01:15:35.910 --> 01:15:38.769
Rohit Gandikota: so many things. Eyes, eye distances.

706
01:15:39.910 --> 01:15:44.030
Rohit Gandikota: What apparel, like, shirt they're wearing, and everything should have

707
01:15:44.310 --> 01:15:48.549
Rohit Gandikota: something of its own. So that's the thing. This language doesn't have to be…

708
01:15:49.250 --> 01:15:51.640
Rohit Gandikota: Compact enough for humans to read.

709
01:15:51.990 --> 01:15:52.780
Rohit Gandikota: But…

710
01:15:52.920 --> 01:16:01.990
Rohit Gandikota: structured enough that if they want to read, that they can go in, or an agent can go in and do that. That's the goal for now. It might evolve as we go.

711
01:16:05.050 --> 01:16:12.110
Rohit Gandikota: These models, these image models are so good, because they can compose between different concepts.

712
01:16:12.990 --> 01:16:23.549
Rohit Gandikota: And there, they have some flexibility. You want them to have that flexibility. Yeah, the renderer still has that flexibility. I don't impose anything in terms of decomposition on it.

713
01:16:24.190 --> 01:16:26.630
Rohit Gandikota: It's pretty, pretty individual.

714
01:16:28.140 --> 01:16:38.869
Rohit Gandikota: So how a Python program gets compiled to assembly language. Yeah, I agree, yeah. So you kind of want that underspecification in the language as well.

715
01:16:42.300 --> 01:16:46.289
Rohit Gandikota: Because if you can exactly specify everything.

716
01:16:47.390 --> 01:16:53.830
Rohit Gandikota: I don't know how you'll do that, but it would be like a… Too rigid. Yeah, a 1-million-line

717
01:16:54.590 --> 01:16:56.619
Rohit Gandikota: Program to generate something.

718
01:16:56.740 --> 01:16:59.949
Rohit Gandikota: Simple.

719
01:17:00.490 --> 01:17:01.950
Rohit Gandikota: We want to have that.

720
01:17:04.300 --> 01:17:06.780
Rohit Gandikota: Capability, which makes it a little bit…

721
01:17:07.460 --> 01:17:13.669
Rohit Gandikota: uncertain, not… not kind of, like… it's like navigating a probabilistic language rather than a deterministic language.

722
01:17:14.620 --> 01:17:22.769
Rohit Gandikota: I'll… I'll push back a little on that. I don't… I… I don't think it's the…

723
01:17:25.680 --> 01:17:29.070
Rohit Gandikota: I don't think I'm going towards making the language as…

724
01:17:29.580 --> 01:17:36.850
Rohit Gandikota: descriptive as possible, that's not what I'm going towards. I'm going in… I'm going towards… I want the language to be,

725
01:17:38.510 --> 01:17:42.650
Rohit Gandikota: fine-grained enough that users can control the things that they care about.

726
01:17:43.230 --> 01:17:50.800
Rohit Gandikota: And, like, if they want to describe the identity, they don't have to take millions of lines of code to describe the identity. It can be a single vector.

727
01:17:51.470 --> 01:17:52.230
Rohit Gandikota: Right?

728
01:17:52.370 --> 01:17:58.600
Rohit Gandikota: And then there can be another model that looks at this identity vector and, like, can edit it, can change it.

729
01:17:59.150 --> 01:18:06.340
Rohit Gandikota: But the… The goal of this code is not to be exactly like Python.

730
01:18:06.540 --> 01:18:16.590
Rohit Gandikota: And also not exactly as prior models that can do everything in one single model. Like, it's somewhere in between, where I want the renderer to be something which

731
01:18:16.870 --> 01:18:24.969
Rohit Gandikota: is not, like, codable. I want that to do all these weird composition things and create realistic images.

732
01:18:25.080 --> 01:18:27.889
Rohit Gandikota: At the same time, I want to have a format for

733
01:18:28.080 --> 01:18:35.039
Rohit Gandikota: what the model is planning, and have some ways for me to change what it's planning without affecting a lot of things internally.

734
01:18:35.930 --> 01:18:44.369
Rohit Gandikota: But it's a good question. I… I don't know how to draw that line between verbose versus…

735
01:18:44.530 --> 01:18:46.640
Rohit Gandikota: controllable.

736
01:18:47.410 --> 01:18:51.329
Rohit Gandikota: I'm happy to talk more about it, I have multiple versions.

737
01:18:51.950 --> 01:18:56.539
Rohit Gandikota: Like, what do you think about, like, the target audience stuff of, like, these tools?

738
01:18:56.960 --> 01:18:58.529
Rohit Gandikota: Because, like, this reminds me of, like.

739
01:18:58.740 --> 01:19:01.790
Rohit Gandikota: when I was a kid trying to learn to use Photoshop.

740
01:19:01.920 --> 01:19:05.379
Rohit Gandikota: It's like, Photoshop can do so much with it, but it's so hard

741
01:19:05.510 --> 01:19:10.200
Rohit Gandikota: to learn, because it's, it's like, it's like the really verbose thing. You could literally do anything. Yep.

742
01:19:10.540 --> 01:19:22.270
Rohit Gandikota: But, like, you need to have a lot of expertise. Yeah. But also, like, you… so, like, it depends on who's using it, right? Like, if it's, like, you know… It's a good… that's a good thing. I think that's why the "now" is very important in this case.

743
01:19:22.580 --> 01:19:28.849
Rohit Gandikota: We finally have tools that can automate these things. You can vibe… vibe code things, you can,

744
01:19:29.060 --> 01:19:31.970
Rohit Gandikota: Solve most of these problems, if it's code.

745
01:19:32.430 --> 01:19:37.100
Rohit Gandikota: You can solve most of it by just putting an agent to work and talking to the code.

746
01:19:37.220 --> 01:19:39.289
Rohit Gandikota: Instead of actually changing something.

747
01:19:40.810 --> 01:19:55.260
Rohit Gandikota: I think, eventually, if this were to work, there'll definitely be two sets of users: ones that are, like, power users that would want to look at the code, change something, and work with it, and then ones that would just want to vibe image.

748
01:19:56.250 --> 01:20:04.519
Rohit Gandikota: talk to it and say, oh, no, move it to the left, and then the code agent would go inside and change just the XY coordinates.

749
01:20:04.820 --> 01:20:06.529
Rohit Gandikota: And then re-render the image.

750
01:20:06.950 --> 01:20:12.629
Rohit Gandikota: It is possible now. It wasn't… like, it would have been another Adobe Premiere or

751
01:20:12.820 --> 01:20:15.449
Rohit Gandikota: Photoshop, if it was 5 years ago.

752
01:20:15.900 --> 01:20:17.530
Rohit Gandikota: Maybe now it's not that.

753
01:20:24.700 --> 01:20:32.080
Rohit Gandikota: Do you have any… Suggestions for, like, people that don't feel creative, And, like…

754
01:20:32.280 --> 01:20:34.830
Rohit Gandikota: How would you build a system that, like.

755
01:20:34.950 --> 01:20:39.560
Rohit Gandikota: Helps them be more creative for, like, development.

756
01:20:39.920 --> 01:20:45.909
Rohit Gandikota: Yes, so… That… I think that was the intention behind my SliderSpace

757
01:20:46.770 --> 01:20:49.960
Rohit Gandikota: slide, where I was showing that I have such small intent.

758
01:20:50.480 --> 01:20:55.499
Rohit Gandikota: So, for example, when we came up with concept sliders, we…

759
01:20:55.630 --> 01:21:00.540
Rohit Gandikota: The maximum number of creative sliders we could come up with was 10 or 20 or so.

760
01:21:00.860 --> 01:21:07.059
Rohit Gandikota: As soon as we open sourced it, we had, like, 10K, 20K sliders online that are all super creative.

761
01:21:07.540 --> 01:21:14.209
Rohit Gandikota: So one is through crowdsourcing and open sourcing, which I always have a very strong stance on.

762
01:21:14.630 --> 01:21:20.389
Rohit Gandikota: And another is the discovery aspect of things, where you can allow users

763
01:21:20.580 --> 01:21:22.830
Rohit Gandikota: Oh, this was another thing which…

764
01:21:23.070 --> 01:21:32.600
Rohit Gandikota: was about to be on the slides, but was not, which is, having a paradigm which will allow users to enhance their intent, like social media, except instead of

765
01:21:33.170 --> 01:21:37.559
Rohit Gandikota: Optimizing on the engagement, you optimize on intent.

766
01:21:37.740 --> 01:21:42.439
Rohit Gandikota: and sort of show them, like, when they're working with the system, the UI,

767
01:21:42.760 --> 01:21:49.929
Rohit Gandikota: You show them, oh, things you might like to explore but haven't ever, or similar users have explored before, or something where

768
01:21:50.530 --> 01:21:56.110
Rohit Gandikota: the signal is not engagement, right? It's… It's, like,

769
01:21:56.360 --> 01:21:59.260
Rohit Gandikota: Not explored before, sort of. A small nudge.

770
01:21:59.960 --> 01:22:02.880
Rohit Gandikota: But again, it's a dangerous territory, I don't…

771
01:22:03.860 --> 01:22:06.850
Rohit Gandikota: I haven't thought more about it beyond that thought.

772
01:22:07.200 --> 01:22:13.100
Rohit Gandikota: But yes, I think discovery and some sort of a… Sensible nudge.

773
01:22:13.430 --> 01:22:16.680
Rohit Gandikota: Might help users to explore more inside the models.

774
01:22:18.030 --> 01:22:26.750
Rohit Gandikota: I guess, like, another thought is, like, I actually had this thought for 2 years, but I haven't really done it. I'm always confused why people never

775
01:22:27.000 --> 01:22:31.570
Rohit Gandikota: use a gaming engine or a graphics engine to boost controllability. Yeah.

776
01:22:31.840 --> 01:22:49.329
Rohit Gandikota: So we… Oh yeah, that's what we're doing now. We have computer graphics, like, code, where you can have actions, move cameras; we can even have LLMs sort of, like, interface with the computer graphics code to do editing in that code. After you render an image, then it's probably a much easier problem to bridge between the realism gap

777
01:22:49.420 --> 01:23:02.939
Rohit Gandikota: of, like, what your text image model or text-to-video model can do, versus, like, the graphics-generated stuff. Yes. I mean, NVIDIA already has some stuff that's, like, using AI to turn, like, a graphics thing into something that's more realistic. Yes.
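
A toy sketch of that loop; all components here are stand-ins (a real version would use an actual engine, an LLM doing the code edit, and an img2img-style diffusion pass):

```python
scene = {"camera": {"x": 0.0}, "objects": ["car", "tree"]}

def llm_edit(scene, instruction):
    # Stand-in for an LLM editing the graphics-side scene code.
    if "left" in instruction:
        scene["camera"]["x"] -= 1.0
    return scene

def engine_render(scene):
    return f"raster({scene})"          # stand-in for the game/graphics engine

def diffusion_refine(image):
    return f"realistic({image})"       # stand-in for the realism-gap pass

print(diffusion_refine(engine_render(llm_edit(scene, "move the camera left"))))
```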

778
01:23:03.280 --> 01:23:04.200
Rohit Gandikota: Then…

779
01:23:04.570 --> 01:23:10.019
Rohit Gandikota: I mean, there's free data over there, I just never really got to play around with it; it just feels like such a pity.

780
01:23:10.190 --> 01:23:18.120
Rohit Gandikota: I know, I think that's exactly what we're doing with this project now. Oh, I see. Yeah, it's a good catch, yes. Very cool. Yes,

781
01:23:18.680 --> 01:23:27.960
Rohit Gandikota: Yeah, I… another motivation actually does come from the game engines that you were talking about, the ability to control your characters and…

782
01:23:28.110 --> 01:23:31.670
Rohit Gandikota: The camera angles, motions… And…

783
01:23:32.110 --> 01:23:34.079
Rohit Gandikota: Yeah, we want to do that, except…

784
01:23:34.490 --> 01:23:37.490
Rohit Gandikota: have an engine where you don't have to learn it the way you'd learn Photoshop.

785
01:23:37.850 --> 01:23:46.959
Rohit Gandikota: Where the model would do it for you. And the real… realism can also be at the same level of power that we have with image generation in GPT right now.

786
01:23:47.510 --> 01:23:51.659
Rohit Gandikota: To get to that level with game engines is really hard.

787
01:23:52.300 --> 01:23:55.110
Rohit Gandikota: It requires a lot of…

788
01:23:55.210 --> 01:23:59.200
Rohit Gandikota: Like, dedicated years of experience to get to that point.

789
01:23:59.450 --> 01:24:07.440
Rohit Gandikota: But yes, it's rich in all the things that we want to do with vision models. Games are, like, the perfect example of

790
01:24:07.780 --> 01:24:11.890
Rohit Gandikota: having… an explorable medium of vision.

791
01:24:12.830 --> 01:24:15.520
Rohit Gandikota: So yeah, I think it can be a good data source.

792
01:24:18.220 --> 01:24:18.930
Rohit Gandikota: Nope.

793
01:24:19.450 --> 01:24:25.079
Rohit Gandikota: I am not sure that this architecture you're proposing does actually, like,

794
01:24:25.820 --> 01:24:31.460
Rohit Gandikota: serve as a medium to, like, translate the intent of the user, who's, like, the person who understands the goal.

795
01:24:31.790 --> 01:24:44.370
Rohit Gandikota: And the example you showed earlier, which was very good, about somebody prompting a model to write, like, an email or, like, a thank-you letter or something: the model did understand this should be really sweet, and so you would have, like, real positive consequences from it.

796
01:24:44.760 --> 01:24:54.520
Rohit Gandikota: Now, imagine this example where I'm trying to develop a, like, drawing for a campaign against, like, speeding on the road, near a school zone, say.

797
01:24:54.970 --> 01:24:56.979
Rohit Gandikota: And my prompt to the model is:

798
01:24:57.280 --> 01:25:03.550
Rohit Gandikota: Draw a school, and, like, a no-speeding sign, and kids crossing the road, and a car coming.

799
01:25:03.980 --> 01:25:12.939
Rohit Gandikota: And the model will probably just, like, try to do it exactly as it is, without maybe trying to add elements that would communicate my message.

800
01:25:13.040 --> 01:25:20.650
Rohit Gandikota: Or the purpose behind this. I think language models do, because maybe language is… I don't know.

801
01:25:20.750 --> 01:25:23.739
Rohit Gandikota: There's something human about it, that, like, emotions…

802
01:25:23.910 --> 01:25:32.009
Rohit Gandikota: come through more easily in it, and it's more easily, you know, interpretable than, like, images. Of course, you can also do that with images and other things.

803
01:25:32.270 --> 01:25:48.299
Rohit Gandikota: And so I'm not sure that this architecture does actually, like, capture the human intent, or whether it's just, like, translating the instruction better, and planning; it's controllable, and it seems also explainable. Yeah. No, no, I think,

804
01:25:48.600 --> 01:25:53.090
Rohit Gandikota: With the amount of things that I've shown here, I think that's a very good intuition.

805
01:25:53.440 --> 01:26:01.450
Rohit Gandikota: I have two points. I think we can very easily solve it with the techniques we have today.

806
01:26:02.020 --> 01:26:09.229
Rohit Gandikota: One is… You can use a pre-trained language model to start your

807
01:26:09.340 --> 01:26:17.549
Rohit Gandikota: DSL generation from, where you get exactly all the intentions that you are hoping to get, for free.
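
A minimal sketch of that first point, seeding DSL generation with a pretrained language model so the inferred purpose, not just the literally named objects, shapes the scene program. `call_llm` is a stand-in for whatever chat-completion client is available; the system prompt and DSL format are hypothetical.

```python
# Hedged sketch: a pretrained LM turns a literal request into a scene DSL
# that also encodes the inferred intent. `call_llm` is a placeholder.
SYSTEM_PROMPT = (
    "You translate a user's request into a scene DSL. First infer the "
    "purpose behind the request, then add or adjust elements that serve "
    "that purpose, not just the objects literally named."
)

def intent_to_dsl(call_llm, user_prompt: str) -> str:
    """Return a scene-DSL program whose composition reflects the intent."""
    return call_llm(system=SYSTEM_PROMPT,
                    user=f"Request: {user_prompt}\nScene DSL:")

# E.g. "a school, a no-speeding sign, kids crossing, a car coming" could
# yield a DSL that foregrounds the children and slows the car, because the
# LM recognizes the road-safety-campaign intent behind the request.
```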

808
01:26:18.180 --> 01:26:25.919
Rohit Gandikota: Another thing is the… the point that I made for my second roadmap, which is the visual reasoning medium.

809
01:26:26.330 --> 01:26:31.839
Rohit Gandikota: I feel like that will capture a lot of what you're talking about in a visual sense.

810
01:26:32.260 --> 01:26:38.660
Rohit Gandikota: But… The… I stopped sharing, sorry.

811
01:26:47.450 --> 01:26:49.579
Rohit Gandikota: Yeah, you'll have to go through this, sorry.

812
01:26:54.010 --> 01:27:01.240
Rohit Gandikota: Yeah, I think this parser can also be a good way to keep interacting with your images as you draw.

813
01:27:01.430 --> 01:27:16.090
Rohit Gandikota: Right? You can… if the model ends up generating something which you didn't intend, and you want to say, hey, I feel like it's not showing the safety message that I want to teach my kids, you can talk to the parser to do it, and it would

814
01:27:16.650 --> 01:27:26.940
Rohit Gandikota: ideally, like, plan better: oh, given this image, the user is intending something else, let me sort of put that into play here, right? You can talk your intention through.
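
One hedged way to picture that back-and-forth: each round, the parser re-plans the scene program from the user's reaction to the last render. `call_llm`, `render`, and the DSL itself are hypothetical stand-ins, not anything from the talk's actual system.

```python
# Hedged sketch of "talking your intention through": one repair round where
# the parser rewrites the scene DSL given feedback on the rendered image.
def repair_round(call_llm, render, dsl: str, feedback: str) -> str:
    prompt = (
        f"Current scene DSL:\n{dsl}\n\n"
        f"User feedback on the rendered image: {feedback}\n\n"
        "Infer what the user actually intends and rewrite the DSL so the "
        "next render matches that intent."
    )
    new_dsl = call_llm(prompt)
    render(new_dsl)  # show the updated image for the next round of feedback
    return new_dsl
```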

815
01:27:27.280 --> 01:27:33.489
Rohit Gandikota: Or, I don't know how the visual medium works yet. I wish I had a…

816
01:27:34.010 --> 01:27:41.859
Rohit Gandikota: better understanding of how visual-medium reasoning works. But I feel like there has to be some sense of, oh, no, no, no, I'm looking for

817
01:27:42.790 --> 01:27:49.669
Rohit Gandikota: Like, some, you know, Some frightening emotion that looks similar to this image.

818
01:27:49.840 --> 01:27:53.470
Rohit Gandikota: And then, somehow the model needs to make sense out of it in the visual sense.

819
01:27:53.630 --> 01:27:57.099
Rohit Gandikota: But… I think it's definitely possible.

820
01:27:57.560 --> 01:28:01.509
Rohit Gandikota: It's just a… Needs to be…

821
01:28:01.620 --> 01:28:04.919
Rohit Gandikota: captured in the right place, and using the right medium, I think.

822
01:28:05.550 --> 01:28:15.760
Rohit Gandikota: If you can already say that, hey, my language model does it, I think this can do it too. You just replace one of these parts, or fine-tune one of these parts on top of the model you're talking about.

823
01:28:19.090 --> 01:28:20.880
Rohit Gandikota: Maybe the reason

824
01:28:20.990 --> 01:28:27.830
Rohit Gandikota: language models, like, manage it. Yep. Like, the…

825
01:28:27.930 --> 01:28:37.509
Rohit Gandikota: medium that the training data is presented in contains very rich contextual information, whereas images may be, like, isolated from, like, their role in the world.

826
01:28:37.820 --> 01:28:39.420
Rohit Gandikota: That's… that's why it's not able to

827
01:28:39.720 --> 01:28:47.939
Rohit Gandikota: capture this intent that was left over. Good point, yeah. I think pre-training might still be necessary eventually as we go down this path.

828
01:28:48.460 --> 01:29:00.729
Rohit Gandikota: You might still want individual vision models and language models, like DINO, and you might still want all these things that can capture the modality information from everything that you want.

829
01:29:01.190 --> 01:29:07.269
Rohit Gandikota: But, like, if you want to get to a place where you can do all these

830
01:29:08.440 --> 01:29:14.819
Rohit Gandikota: amazing things, I think… we'll have to move to a better fine-tuning regime, I guess.

831
01:29:21.300 --> 01:29:22.640
Rohit Gandikota: Oh, there's a comment.

832
01:29:25.530 --> 01:29:26.640
Rohit Gandikota: From Chris.

833
01:29:27.150 --> 01:29:28.400
Rohit Gandikota: Thank you, Rohit.

834
01:29:32.370 --> 01:29:39.340
Rohit Gandikota: Can't stop you anymore with all these amazing… Oh, you're ready to defend. Thank you, Chris.

835
01:29:40.120 --> 01:29:50.019
Rohit Gandikota: Oh, no, Chris… Too early. Not yet. He's ready, he's ready to propose.

836
01:29:51.230 --> 01:29:54.230
Rohit Gandikota: One more stage. One more stage.

837
01:29:54.330 --> 01:29:56.720
Rohit Gandikota: Nance. Thank you, but on that note.

838
01:29:57.150 --> 01:30:02.959
Rohit Gandikota: Really nice compendium of all your work and your vision for your next stage. Thank you, Rohit.

839
01:30:07.810 --> 01:30:08.830
Rohit Gandikota: Thank you, hon.

