WEBVTT

1
00:00:00.030 --> 00:00:01.319
Benno Krojer: Sometimes, like, difficult.

2
00:00:02.810 --> 00:00:22.569
Benno Krojer: That's cool. Is anybody online? No, I think for now not, so… Okay. Everybody decided to come in person. You know, often we have a mix of this crowd online, but they're all out here, in the room.

3
00:00:22.660 --> 00:00:40.200
Benno Krojer: So, I know Hadass is still here, so should we… Yeah, should we maybe wait for a couple minutes for Hadass? Did anybody know where Hadass went? She was in the… Okay. Oh, Andy's gonna go look for her. Oh, is she in a meeting? Oh, maybe she's in a meeting. I mean, we can also…

4
00:00:40.200 --> 00:00:47.549
Benno Krojer: All right, all right. And do we have any visitors here who want to introduce themselves? Anybody who's here that hasn't been in the group before yet? So, go ahead.

5
00:00:49.990 --> 00:00:50.660
Benno Krojer: Excuse me.

6
00:00:51.310 --> 00:00:54.889
Benno Krojer: Nice, nice. So, you're a first-year PhD student?

7
00:00:55.390 --> 00:01:20.370
Benno Krojer: Or first-year master student? Or first-year undergrad. There's so many first-year people. Are you first-year elementary school? First year faculty? Like, I can't tell, right? So, it's very ambiguous. So, so welcome, welcome, you're always welcome. Anybody else? At Northeastern? Okay, not, not at Boston University. See, it's very ambiguous.

8
00:01:20.370 --> 00:01:22.770
Benno Krojer: Like, people understand.

9
00:01:22.830 --> 00:01:36.690
Benno Krojer: Okay, okay, that's great. Welcome, welcome. Anybody else, want to introduce yourself? Yeah, I can… I introduced myself a bit last time, but yeah, so I'm a CBI fellow, Cambridge Boston Alignment Initiative.

10
00:01:36.940 --> 00:01:49.480
Benno Krojer: Bain's, like, research fellowship thing, so I'm there for a bit during the spring, and welcome! Remind me what your home institution is, where you've been from? Oh, Texas A&M. Oh, Texas? Yeah, I'm your PhD student. Okay, nice.

11
00:01:49.920 --> 00:01:50.750
Benno Krojer: Great.

12
00:01:51.210 --> 00:01:57.480
Benno Krojer: Yeah, I see what's interesting group, right? I spoke I couldn't see, yes.

13
00:01:58.070 --> 00:02:01.330
Benno Krojer: So far, I currently aren't, at the university.

14
00:02:01.430 --> 00:02:03.199
Benno Krojer: I also didn't own the drafted

15
00:02:03.560 --> 00:02:12.089
Benno Krojer: Wonderful. Welcome. So, in between undergrad and other things? Yeah. That's great.

16
00:02:12.210 --> 00:02:13.120
Benno Krojer: Okay.

17
00:02:13.670 --> 00:02:23.240
Benno Krojer: So… Can I plug the hackathon? Oh, there's a hackathon to plug! Yes, more announcements. Okay. We have a hackathon today and tomorrow, starting at 2.30 today.

18
00:02:23.710 --> 00:02:27.159
Benno Krojer: For all the projects that you've always wanted to do, but…

19
00:02:27.160 --> 00:02:43.759
Benno Krojer: It just, I don't know, I don't had the time to do. Now that you have a research assistant, like, who can write all your code for you, you can finally do those projects. Research assistant? Who's the research assistant? Claude. Andy?

20
00:02:44.010 --> 00:02:57.369
Benno Krojer: But yeah, we're just gonna have fun. Oh, I see an Opus that has Sonnet working for it, that has Haiku working for it. Yeah, that's right. I see. It's very hierarchical. I use, oh, I see, I see. Even better.

21
00:02:58.400 --> 00:03:00.110
Benno Krojer: Do you have a lab account for?

22
00:03:00.230 --> 00:03:04.820
Benno Krojer: How do you do that? I… I… I expense… I… I ask people to expense.

23
00:03:04.860 --> 00:03:11.659
Benno Krojer: Yeah, so we're just trying to process that. But we have a lab account right now at Miele, for example, like, with Siva. Oh, nice. But it's kind of…

24
00:03:11.660 --> 00:03:32.139
Benno Krojer: It's tricky, because it… I don't know, it's like, it gets in situations where, like, you ask your lab mate to do less, so the time limit doesn't run out, you're, like, almost competing for the… Oh, because there's a quota for the whole lab. Yeah, so now we upgraded, I think now it's fine, but sometimes I sit next to my roommate, I'm like, hey, you've been doing a lot, I think in an hour we run out of tokens. Yeah, hey, hold on, I got something important to do, you're just scooping around. It's like the GPU's not the same.

25
00:03:32.620 --> 00:03:36.939
Benno Krojer: David, can you get on this… this…

26
00:03:37.180 --> 00:03:46.199
Benno Krojer: Okay, we'll figure this out. No, no, I'm expensing it, so, like, people are doing it. Yeah, you can expense it.

27
00:03:46.330 --> 00:04:04.010
Benno Krojer: If you use it. So I'm expecting people to be appropriate in what they're doing. So what I told the… what I told the NDF people is that, you know, for engineers who are…

28
00:04:04.140 --> 00:04:08.520
Benno Krojer: Trying to write hundreds of unit tests all the time, or whatever.

29
00:04:08.720 --> 00:04:28.090
Benno Krojer: then yeah, absolutely. You know, a couple hundred dollars a month, totally worth it. If you're just kind of goofing around with it to see if there's, like, an interesting research thing to do, but you're not using it all the time, then usually you don't exhaust the $20 a month one. But if you find that you're exhausting it, then yeah, absolutely, upgrade yourself. Expense it.

30
00:04:28.580 --> 00:04:31.069
Benno Krojer: Make sense? For those new people adding new space.

31
00:04:32.310 --> 00:04:35.859
Benno Krojer: I can… send me a message on it. Okay. Yeah, yeah.

32
00:04:36.340 --> 00:04:37.240
Benno Krojer: All right.

33
00:04:39.420 --> 00:04:41.540
Benno Krojer: Welcome, everybody.

34
00:04:43.620 --> 00:04:45.850
Benno Krojer: 2026 is such an interesting year, isn't it?

35
00:04:47.290 --> 00:04:57.049
Benno Krojer: I hope it's not going to be interesting. It already is. Raw new papers. It's enough interesting? Isn't it? Not too interesting. Okay.

36
00:04:57.320 --> 00:05:02.969
Benno Krojer: So, okay, so I want to introduce our speakers. So, we're very lucky to have here with us today.

37
00:05:03.130 --> 00:05:08.179
Benno Krojer: Beno… Courier? Correa. Croyer? Croyer? Beno Croyer.

38
00:05:08.570 --> 00:05:13.029
Benno Krojer: Who is wrapping up his PhD at Miele.

39
00:05:13.560 --> 00:05:20.969
Benno Krojer: And, and, he's, he's, sort of on the faculty and postdoc.

40
00:05:21.160 --> 00:05:27.819
Benno Krojer: circuit. And so he's… he's been working on interpretability at the boundary, Between,

41
00:05:28.080 --> 00:05:30.960
Benno Krojer: Language and perception, language and vision.

42
00:05:31.280 --> 00:05:35.609
Benno Krojer: And, and so we're really lucky to have him.

43
00:05:35.780 --> 00:05:45.169
Benno Krojer: Share with us his research and his perspective, get a little view of what a job talk looks like for people who might be putting together a job talk in the near future.

44
00:05:45.410 --> 00:05:47.960
Benno Krojer: And so, welcome.

45
00:05:48.130 --> 00:06:05.890
Benno Krojer: Welcome, Beno. Thank you. Yeah, thanks for having me. I'm glad there's such a large audience with two rows. So yeah, today I'll be presenting, kind of summarizing my PhD, and then focusing on the most recent project that is most, like, very interpretability-focused, which is great for this lab.

46
00:06:06.050 --> 00:06:10.900
Benno Krojer: As David mentioned, I'm a fifth year right now at Miele and McGill with SIVA Ready.

47
00:06:11.190 --> 00:06:15.809
Benno Krojer: Yeah, and feel free to interrupt with questions, like, I'm happy to keep it interactive.

48
00:06:15.970 --> 00:06:21.040
Benno Krojer: So yeah, I'll be talking about unifying and interpreting vision and language representations.

49
00:06:21.340 --> 00:06:35.869
Benno Krojer: And the talk structure will roughly be, like, just motivating why I think multimodality is important, interesting, and not solved yet. Then I'll present, sort of, three themes of my research, like, stress testing models around this theme of, unification, multimodality, unifying generation and understanding.

50
00:06:35.870 --> 00:06:48.510
Benno Krojer: And then, at the end, the biggest chunk of the talk on multimodal interpretability. And then, I'll kind of close off with a roadmap where, like, I want to work on and what I hope the field will go at, the vision, language, or multimodal field.

51
00:06:48.920 --> 00:06:58.850
Benno Krojer: So why, why study visual language, or, like, why care about it? Or, like, I personally also? Fundamentally, perception and language are complementary and very important for understanding the world.

52
00:06:58.850 --> 00:07:13.870
Benno Krojer: So, whether you're an AI system or a human, from perception, like, for example, as a baby, you learn about objects in the beginning, and then after one or two years, you learn, like, the temporal dimension, more dynamics and tracking, and at some point, something more abstract, also maybe intuitive physics, cause and effect, affordances, and so on.

53
00:07:13.880 --> 00:07:25.879
Benno Krojer: And this abstract part also brings us then to the other side. Language abstracts away most of those details. Now we have more, like, symbolic reputations and, like, just dog plays with ball. Your knowledge is maybe more represented as a graph.

54
00:07:25.970 --> 00:07:35.569
Benno Krojer: And the field of pragmatics and linguistics would also remind us that language is rarely used in isolation. Its meaning usually comes from the perception that it talks about in the physical context and the shared goals.

55
00:07:35.700 --> 00:07:38.460
Benno Krojer: So only together they kind of make sense.

56
00:07:38.940 --> 00:07:51.059
Benno Krojer: And there's a lot of commonalities between the two, because they both come from the same world, both have some notion of compositionality. Some people argue they have a shared conceptual or representation structure, maybe under this umbrella term of platonic representation hypothesis.

57
00:07:51.150 --> 00:08:06.260
Benno Krojer: and so on. So they synergize, hopefully, but at the same time, it's also non-trivial to combine the two, right? Perception is a continuous modality, at least most of the time, so there's no, like, sort of well-defined atomic units, like a pixel's not a great unit. On the language side.

58
00:08:06.260 --> 00:08:12.609
Benno Krojer: You have more discrete units, that are usually well-defined, like, either words or sentences and so on.

59
00:08:12.610 --> 00:08:23.430
Benno Krojer: On the perception side, some people say there are maybe spatial reasoners, or visualizing, or we have dreaming, but more canonically, we think of language as sort of the medium of thought and reasoning, but this is often a debate, so also this contrast.

60
00:08:24.330 --> 00:08:40.699
Benno Krojer: And sort of, as a result, the real-world tasks that we care about in AI, or in the world, also reflect this inherent multimodality. So if you have an embodied agent, it not only needs to perceive the world and act, but also integrate instructions from humans and all the knowledge from the internet it can access to solve the task more efficiently.

61
00:08:40.710 --> 00:08:56.070
Benno Krojer: And likewise, a digital agent not only interacts with symbols and web pages symbolically, but there's also spatial arrangement, images, and even videos, maybe a tutorial you want to learn about how to, like, navigate something. So even there, you have to integrate these modalities very, like, deeply.

62
00:08:56.800 --> 00:09:16.479
Benno Krojer: But beyond the demos we see these days, because you might think, like, now modalities like multimodalities solve, we see, like, the image editing is getting really strong, and video generation, and so on. I would say there's still a lot of, like, failure modes, and maybe one that is familiar to people in cloud code, that I briefly want to mention, that I… where I felt like, again, multimodality is still fundamentally unsolved.

63
00:09:16.560 --> 00:09:29.259
Benno Krojer: Like, I want to visualize the results in a plot for the paper, plot goes reasoning for 5 minutes, and then comes back, and it's just a very bad plot. And then I ask it to verify and go again, and it gives me a very different bad plot, kind of very confident.

64
00:09:29.450 --> 00:09:47.730
Benno Krojer: And this was just a plot where it just put the error here, like, I wanted these things not to be, like, super, like, spaced out and so on. And you just feel like it can't reason as deeply as it can with text about, like, going back and forth. Like, sure, it's good at one shot, but when you want the same kind of chain of thought reasoning that you get from text, it doesn't work as well.

65
00:09:47.990 --> 00:10:04.579
Benno Krojer: And I think that's kind of, like, there's, like, an underlying reason for that. So, to address some of these challenges and get to multimodal models that can reason as good, as Cloud Code and Text. My North does kind of unified multimodal models, that's what I want to work towards. And what I mean by that is that

66
00:10:04.910 --> 00:10:14.299
Benno Krojer: On the one hand, we should train more, like, maybe from scratch, that's an open question, but at least train more from the beginning, multimodality, and not sort of post-hoc, stitch them together.

67
00:10:14.490 --> 00:10:20.799
Benno Krojer: At least most… Like, I feel like Claude is definitely…

68
00:10:22.260 --> 00:10:29.709
Benno Krojer: So, yeah, I think Gemini and, like, OpenAI and Google have said that these models are more native.

69
00:10:29.980 --> 00:10:43.279
Benno Krojer: But at least the open source ones, Chameleon, like, when you train fully from scratch, like, it's pretty bad. I'll mention it later. So, what works best is that you at least start, for example, from LLM weights, and then do that, but yeah.

70
00:10:44.250 --> 00:10:53.859
Benno Krojer: You want to share maybe most of the weights, and also, like, as a result of that, you hopefully have a shared space to reason, and not sort of, like, a pixel space, and then a separate sort of text space that you have to, kind of.

71
00:10:53.860 --> 00:11:11.150
Benno Krojer: go back and forth. And arguably, like, as humans, we have a lot of these elements, maybe not exactly like this, also as well, like, a very tight integration. So now I'll kind of mention, like, sort of three pillars or steps towards this notion that I've worked on. On the left side, stress testing, are these models actually integrating modalities robustly, or are they taking shortcuts?

72
00:11:11.160 --> 00:11:16.980
Benno Krojer: Then unifying paradigms, specifically what's been a big challenge is unifying, like, visual understanding and generation in one model.

73
00:11:17.170 --> 00:11:30.550
Benno Krojer: And then, recently, also understanding what is happening internally. Are these representations actually unified? How do they interact? Because you can define your architecture and objective to be unified, but the model might still learn to separate things and do things differently.

74
00:11:31.500 --> 00:11:48.369
Benno Krojer: So for each one, I'm going to mention one work, and then kind of skip the other ones. So on the left side, I'm gonna focus on image code briefly, where we stressed VLMs with complex multi-image reasoning, and I'm going to skip over a work where, more recently, we worked on intuitive physics understanding and curating minimal pairs of videos.

75
00:11:49.570 --> 00:12:00.150
Benno Krojer: So, in ImageCode, this was a big… it's a bit of a older work, it's from 2021. The motivation at the time was Clipk had come out, Vision Birds, and people were wondering, are these models actually that great?

76
00:12:00.150 --> 00:12:15.099
Benno Krojer: Or are they just taking shortcuts? And the problem was, a lot of the benchmarks were too easy at the time. You could, have simple baselines, like bag of words and so on, that could get you far. So our goal was to create a hard, open-ended task with minimal visual pairs, which is much harder than minimal textual pairs to curate.

77
00:12:15.100 --> 00:12:24.039
Benno Krojer: And the deformity we thought of was retrieval image text matching, such that it's easy to evaluate. So the question was really, how do you, at large scale, create very interesting minimal visual pairs?

78
00:12:24.060 --> 00:12:28.640
Benno Krojer: to stress test these models. So what we did was we crowdsourced this,

79
00:12:28.730 --> 00:12:35.199
Benno Krojer: And the step one was getting very highly minimal visual pairs with some heuristics. It often came from video frames.

80
00:12:35.370 --> 00:12:41.800
Benno Krojer: And then we kind of took a pragmatics approach, so one crowd worker would describe one of the frames out of 10 frames.

81
00:12:41.960 --> 00:12:45.359
Benno Krojer: Pragmatically, meaning they would consider a discriminative listener on the other end.

82
00:12:45.580 --> 00:12:50.259
Benno Krojer: And they, like, I'll actually, like, keep this for a second, so in this case, frame 3.

83
00:12:50.390 --> 00:12:58.670
Benno Krojer: And you can already think it's quite hard, like, I'm already giving some clues with the red bounding box, but, like, distinguishing this from 9 other distractors, you can maybe think what you would…

84
00:12:58.870 --> 00:13:00.720
Benno Krojer: say?

85
00:13:00.880 --> 00:13:18.320
Benno Krojer: It's quite hard, so what the person said, it was very efficient and to the point. No bridesmaid visible at all, so it's a very contextual description. And you have to do a lot of reasoning, because this shoulder is only, like, a shoulder of a bridesmaid if you look there. Maybe you have to check the blue dress is, like, means they are bridesmaids. In isolation, you wouldn't even realize a blue dress is a bridesmaid.

86
00:13:18.570 --> 00:13:25.100
Benno Krojer: So a lot of reasoning going on in the background. And then we only keep an example if several other cloud workers received this successfully.

87
00:13:25.440 --> 00:13:40.169
Benno Krojer: So what we get at the end is this, like, sort of very hard task. With 10 image retrieval, sort of, we frame it as that. 21,000 captions in the end we collected, on around 100,000 images, and on average, these captions are actually much longer.

88
00:13:40.320 --> 00:13:46.520
Benno Krojer: Than the one I showed here, because you often have to, like, mention many details to distinguish it not only from one of the distractors, but all nine.

89
00:13:47.060 --> 00:13:48.419
Benno Krojer: How do you come up with this data?

90
00:13:48.910 --> 00:13:56.479
Benno Krojer: how do I… the underlying images? So we took… we were… initially, we're thinking of just, like, taking…

91
00:13:56.620 --> 00:14:10.980
Benno Krojer: like, normal static images and finding, like, clip nearest neighbors, but they were still not that similar. So then, instead, we just took, like, 3 different video data sets and did, like, sort of very, like, short sort of clip distance between the frames. Like, first, we would cut it into scenes.

92
00:14:11.180 --> 00:14:13.719
Benno Krojer: And then, like, distance, and then…

93
00:14:14.040 --> 00:14:22.319
Benno Krojer: the idea was that, like, sometimes they would be too similar, but then our procedure with several other cloud workers would just discard it, so sometimes they would be indistinguishable, almost.

94
00:14:22.520 --> 00:14:39.930
Benno Krojer: I mean, like, this particular example where the bridesmaid is not in the scene. Yeah. You've had to come up with it right now. Let's come up with a task where the bridesmaid is not in the seat. That's crowd worker number one's job, right? Yeah, yeah, so crowd… like, this is emergent, sort of, out of the task, so this is a two, sort of two-player guessing game.

95
00:14:40.040 --> 00:14:42.980
Benno Krojer: Well, like, they could have said anything. They could have said, like.

96
00:14:43.130 --> 00:14:46.740
Benno Krojer: In frame 3, like, the background is slightly, like, brighter.

97
00:14:47.480 --> 00:15:03.009
Benno Krojer: It's basically, like, whatever, like, it's a very, like, goal-directed description. But they also, like, say this, like, set of images is not interesting, let's move to the next one. We didn't… interestingness, no, it's not part of it, no. I see. It could have been…

98
00:15:03.440 --> 00:15:05.780
Benno Krojer: The thing in the filtering was actually in the second.

99
00:15:06.330 --> 00:15:22.420
Benno Krojer: Yeah, yeah, by the retrieval. We did tell them not to focus on text, because often, that was a bit too boring. Like, with videos, you often have, like, just at the bottom, like, captioning from some YouTube video, and not to mention other images explicitly, like, we wanted to keep, like, not say, like, it's more than image one or less than image one.

100
00:15:22.670 --> 00:15:23.390
Benno Krojer: Ew.

101
00:15:24.350 --> 00:15:32.219
Benno Krojer: And it was very interesting, because the players got better and better over time, so they were both describer and Retriever, and, like, at some point, they got very efficient at the task, and very, sort of.

102
00:15:32.520 --> 00:15:36.560
Benno Krojer: But because of the nature of your original source, you probably have a lot of things.

103
00:15:37.360 --> 00:15:47.760
Benno Krojer: it's kind of a… it's about temporal things that have changed, because you've got it all on video, a lot of things that happened before and after. Exactly, so there's a lot of implicit, like, temporal reasoning, spatial reasoning going on now.

104
00:15:48.270 --> 00:15:55.300
Benno Krojer: So yeah, the two key ideas are, like, we wanted to avoid shortcuts, specifically text ones, like, by definition, you can take… you can't take shortcuts here.

105
00:15:55.430 --> 00:16:13.149
Benno Krojer: And we focus on visual minimal pairs, and I would argue, like, from what I've seen, this is probably the most minimal differences you will find in data sets, as least I've seen with these minimal pairs. Often they're much bigger, that you, like, put something on the left instead of the right. Because we have this definition of minimal pair, that anything that you can barely still put into language at the boundaries, sort of, like.

106
00:16:13.240 --> 00:16:25.189
Benno Krojer: By this sort of procedure. The other key idea was, like, just a proxy for real-world challenges, because a lot of emergent phenomena from this guessing game, visual search, multi-image reasoning, temporal pragmatics, and so on.

107
00:16:25.320 --> 00:16:45.159
Benno Krojer: So I would argue, basically, that with all these sort of phenomena that you can solve here, you need maybe, like, a multimodal, like, unified chain of thought to solve this, going back and forth, maybe flexible across-modal representations, not just sort of one statically… static vector for each image, like, it should be adapted to the task, so that's how it ties back to this goal.

108
00:16:45.890 --> 00:16:55.979
Benno Krojer: But at the time, the modality connection was very, very shallow. CLIP was just one dot product between them, so, like, it got slightly above random, like, maybe 15%, 20%, depending on which subset.

109
00:16:56.030 --> 00:17:15.780
Benno Krojer: With fine-tuning, we got it to 25, 30%, and even to this day, like, when I tested ChatGPT, like, a few months ago, it gets 56% on, like, a smaller set, because I didn't want to feed 10 images. This is on the retrieval task? Yeah, this is the retrieval task, so we framed it as retrieval, because we wanted to automate it, but that brings… like, I will briefly mention also the captioning side of it.

110
00:17:17.369 --> 00:17:30.590
Benno Krojer: So the takeaway, I would say, is here that image code still remains a challenge, which is not taken for granted in an age where usually benchmarks after one or two years, get solved. That's a great example of a task that maybe only or, like, mostly unified models would solve.

111
00:17:30.680 --> 00:17:48.659
Benno Krojer: better, or efficiently, at least, yeah. When you do the evaluation, you give it all 10 frames at once? So, like, either the simplest thing is, like, at the time with Clip, you just give them sort of separately, and that makes it very hard, because the descriptions are very contextual, like, it's kind of obvious that Clip wouldn't solve it well.

112
00:17:48.660 --> 00:17:57.869
Benno Krojer: But we trained also models where, like, we take the clip vector and then feed it into, like, a shallow transformer that kind of can attend to all 10 images,

113
00:17:58.270 --> 00:18:12.059
Benno Krojer: Yeah. Nowadays, models can even take 10 images in one go. Just to clarify, the curve is the text problem, and you're trying to find, like, the correct image. Exactly, it's like, you can also frame it as, like, frame search, or, like, something like that. Yeah.

114
00:18:14.250 --> 00:18:19.860
Benno Krojer: So after release, the train split was used in some popular models, like Lava One Vision, or Mantis, and so on.

115
00:18:19.940 --> 00:18:35.059
Benno Krojer: And other people started working more on the captioning side. So we had one paper here with Daniel Freed and Jifu Wu, and I think 2025, someone else, like, kind of set the captioning severity out. But there, it's a bit harder to… it's a bit more open-ended, or harder to evaluate what's a good caption.

116
00:18:36.900 --> 00:18:39.730
Benno Krojer: I'll still practice.

117
00:18:40.300 --> 00:18:46.759
Benno Krojer: I would guess so, my guess would be maybe now that 60, 65%

118
00:18:46.870 --> 00:18:51.900
Benno Krojer: But, like, maybe 56… it's hard to say, but, like, a few months ago, Yep.

119
00:18:52.480 --> 00:18:53.619
Benno Krojer: At least the…

120
00:18:59.910 --> 00:19:01.719
Benno Krojer: For… for always bad.

121
00:19:02.520 --> 00:19:16.300
Benno Krojer: Yeah, half a year ago, I mean, this was the model that was… seemed like the best. Better today's standards. Yeah. What would you recommend? Well, I don't know, like, I mean, I'm very impressed by Opus 4.6 now. I think Opus is really good at visuals.

122
00:19:16.790 --> 00:19:19.339
Benno Krojer: I mean, it has to be sure it's fun.

123
00:19:19.950 --> 00:19:22.569
Benno Krojer: So, I actually didn't test fisher at all.

124
00:19:22.750 --> 00:19:36.150
Benno Krojer: Okay, no, yeah, it's definitely better than GPT stuff. You can also do a lot of tool calling here, like zooming in, cropping and stuff, yeah. I would have guessed Gemini, because they have so much video, right? Right.

125
00:19:36.270 --> 00:19:37.080
Benno Krojer: Nope.

126
00:19:37.320 --> 00:19:49.650
Benno Krojer: Maybe I'll run a normal cloud code, it's very easy to evaluate this again. Like, revisit old benchmarks, see how good they… Oh, sorry, it surged. Let's go home. Maybe.

127
00:19:49.810 --> 00:20:02.100
Benno Krojer: But I don't think it would be solved, like… I would say, like, it's not above 70% right now. There's also some open Chinese models with BL modality that are…

128
00:20:02.320 --> 00:20:04.510
Benno Krojer: Quickest. Yeah. No.

129
00:20:04.950 --> 00:20:14.619
Benno Krojer: Always good to have benchmarks that are not saturated, yeah. I have a question. So, like, is the… like, so what is the idea here, like, that the model can only do this task if…

130
00:20:14.740 --> 00:20:19.210
Benno Krojer: Like, it can… it can understand, like, the language task, and then translate it to image.

131
00:20:19.280 --> 00:20:38.809
Benno Krojer: Like, is that the idea? So the… do you mean my vision, like, my vision for what kind of model could solve this well? Yeah, sure. Like, it's… I guess it's kind of, like, I'll get you sort of my vision, but, like, it would be a model that, like, inherently very flexible, can reason between text and image tokens, it can just be like, let me look at that visual token again, let me, like, copy it here, look at it.

132
00:20:38.810 --> 00:20:43.359
Benno Krojer: And sort of in one sort of long chain of thought, instead of having to either crop it, or…

133
00:20:43.360 --> 00:20:55.159
Benno Krojer: And, like, also maybe, like, as a human, if you tell me something, I can also very flexibly adapt my representation. Like, I'm not encoding, like, right now, my whole frame, like, if you ask me, like, what's going on outside, I will…

134
00:20:55.620 --> 00:21:04.780
Benno Krojer: like, ignore what's… what else is here and just focus on that, so, like, also going back into the vision encoder and, like, adapting that more flexibly, yeah. What, like, margin of error is, like.

135
00:21:04.830 --> 00:21:23.300
Benno Krojer: Okay, so, like, so, like, I bake cookies, right? And, like, I'll write down, like, a recipe for my friends, and then they'll send me photos of the cookies, and, like, they're all very different, but, like, they're pretty smart people, so, like, they understand, like, visual reasoning, they understand, like, text. I guess I'm trying to understand, because, like, these are, like, humans who are pretty good at both.

136
00:21:23.300 --> 00:21:40.349
Benno Krojer: But, like, you can give them, like, a text artifact, like, a recipe, and they'll produce, like, 10 very different things, so I guess I'm still trying to understand. Understand the difficulty of your task, or, like, what kind of model would, like, understand what part? Yeah, I guess so, the second one. What kind of model

137
00:21:40.710 --> 00:21:44.250
Benno Krojer: Like, you have some sort of uncertainty, right? If you only look at the text.

138
00:21:44.800 --> 00:21:59.360
Benno Krojer: You can have multiple set of images. Oh, you mean, like, because the… this description could be correct for, like, this sort of ambiguity? Is that… Yeah, I guess maybe cases like that. Yeah, so there are… Or you're saying… the way you're saying it's like, well, a bridesmaid…

139
00:22:00.130 --> 00:22:13.789
Benno Krojer: could have a very different appearance in different contexts. Is that what you're talking about? Yeah, it does. So, we have situations where, like, semantically speaking, it would be true for two frames, even if, like, I don't know, the man is the tallest, or, like, the man is high up.

140
00:22:13.790 --> 00:22:26.679
Benno Krojer: And, like, it's true for a lot of the frames that they hire, but the high… like, if you think about it sort of from a speaker-listener perspective, you know, if they wanted me to distinguish this, they would have said something else, so they must, like, this is this pragmatics aspect, where we have semantically

141
00:22:26.700 --> 00:22:30.189
Benno Krojer: like… For three, it would be true, but, like, pragmatically.

142
00:22:30.500 --> 00:22:44.179
Benno Krojer: Like, yeah, I don't know if that makes sense. No, thanks for clarifying results of, like, because you said that you base your filtering on, like, the amount of people, other annotators that get it right, basically, but have you thought of, like, using it as some sort of…

143
00:22:44.210 --> 00:22:54.980
Benno Krojer: source of uncertainty to kind of wait, you know? Like, in the… I can imagine that the case where the shoulder is the only thing visible of the bridesmaid, many people could chose that, like…

144
00:22:54.980 --> 00:23:07.030
Benno Krojer: And then that could be a good alternative, kind of, you know, like, higher uncertainty. Right, like a, like, almost correct solution, yeah. I haven't modeled that yet. They're so hard! Yeah, yeah. Is this a dog? I'm like…

145
00:23:07.030 --> 00:23:23.000
Benno Krojer: Like, 40, maybe? But yeah, sort of in pragmatics, people model it this way, so if the probability of the listener is, like, what's the probability that the speaker would have said this if they had a listener in mind? So, like, there's this modeling. I never looked into this in detail, but that's this rational speech act. Yeah, that's true, yeah.

146
00:23:26.650 --> 00:23:41.850
Benno Krojer: Yeah, so that was that. Yeah. Now that we know, like, models are maybe not there fully yet, like, how can we, from a modeling side, get to more unified models that can maybe do this better? Specifically, what I mentioned, like, visual understanding and generation in one… one model, that was at least a challenge at the time, so here I'm going to present…

147
00:23:41.850 --> 00:23:47.100
Benno Krojer: One work at the top, DiffusionITM from 2023, where we show that Stable Diffusion can also do image-text matching.

148
00:23:47.120 --> 00:24:00.860
Benno Krojer: And I'm gonna mostly ignore the ones at the bottom that are both about image editing, which I think is also a great testbed for those two, because you first need to understand the image more on an abstract level, and then generate the low-level perceptual change.

149
00:24:00.990 --> 00:24:01.780
Benno Krojer: Yep.

150
00:24:01.900 --> 00:24:21.499
Benno Krojer: So here, the motivation at the time was this famous avocado chair example that maybe some of you remember, where people were like, wow, like, have models solved compositionality? They can compose these concepts. Maybe generative objectives are better for compositional representations than the shortcuts you might learn from discriminative objectives. So that was the question, but then what we wanted was: how can you objectively test

151
00:24:21.500 --> 00:24:32.139
Benno Krojer: if these image generators are good on compositionality benchmarks, because these are usually framed as a discriminative task. Like, if you want to know if CLIP is good at that, you would say, like, is this image matching the text or not?

152
00:24:32.170 --> 00:24:49.659
Benno Krojer: So that was the research question. So what we show is that Stable Diffusion is more than, sort of, a generator in this paper. And as I mentioned, when I said we want to measure understanding on compositionality, a proxy for that is usually scoring image-text pairs. So then our research question specifically becomes: how can we repurpose Stable Diffusion

153
00:24:49.870 --> 00:24:52.440
Benno Krojer: To assign a score to an image instead of generating one.

154
00:24:52.680 --> 00:25:06.970
Benno Krojer: So if you have, you know, this image like a fire truck, or, sorry, a truck fire, you want to assign some similarity score to it. And basically our, like, key insight is that you can interpret the denoising error as a semantic similarity score.

155
00:25:07.030 --> 00:25:20.930
Benno Krojer: So in this case, if, like… the model is usually trained to denoise this image, or, like, to remove the noise, and we would say it's a low denoising error if it fits the text, because the model doesn't have to do much work. It's, like, it fits, I don't have to maybe edit it and so on.

156
00:25:21.000 --> 00:25:27.919
Benno Krojer: But, so this example is from Winoground, which is, like, a famous compositionality benchmark. If you then flip the text, and it now says a fire truck.

157
00:25:27.920 --> 00:25:46.169
Benno Krojer: If the model is compositional, it should understand, like, that's a different meaning of these words together. I should give a higher denoising error. Maybe I have to edit the image now, and sort of change, not just denoise. There's, like, a lot of technical details I'm leaving out here, like how to exactly deal with the noise, and we also did some fine-tuning, just to get the key idea across.
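The denoising-error-as-similarity idea described here can be sketched roughly as follows. This is a toy sketch, not the paper's actual code: `denoise_error` is a stand-in for a real text-conditioned diffusion model's noise-prediction step (which would call Stable Diffusion's UNet), and the averaging over several noise levels and noise draws follows the procedure mentioned later in the talk.

```python
import numpy as np

def denoise_error(image, caption_vec, noise_level, rng):
    """Toy stand-in for one denoising step of a text-conditioned
    diffusion model: L2 error between predicted and true noise.
    A real implementation would call the diffusion UNet here."""
    noise = rng.normal(scale=noise_level, size=image.shape)
    # Toy behaviour: the better the caption "fits" the image, the
    # closer the model's noise prediction is to the true noise.
    fit = 1.0 / (1.0 + np.exp(-caption_vec @ image.flatten()[: caption_vec.size]))
    predicted = fit * noise
    return float(np.mean((predicted - noise) ** 2))

def itm_score(image, caption_vec, noise_levels, draws=4, seed=0):
    """Average the denoising error over several noise levels and
    noise draws, then negate: higher score = better image-text match."""
    rng = np.random.default_rng(seed)
    errs = [denoise_error(image, caption_vec, t, rng)
            for t in noise_levels for _ in range(draws)]
    return -float(np.mean(errs))
```

On a Winoground-style pair, one would compute this score once per caption and pick the caption with the lower denoising error.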

158
00:25:46.290 --> 00:25:56.790
Benno Krojer: And maybe other people also remember here Diffusion Classifier, which was a very similar work that focused more on the, sort of, the vision and classification side and less on the compositionality side. Came out at the same time.

159
00:25:56.960 --> 00:25:59.579
Benno Krojer: And now that the nice thing is we can now…

160
00:25:59.930 --> 00:26:13.890
Benno Krojer: By the way, would you feed in, like, the image at such a high noise level? That was just an example, like, I just put some noise over it. So, like, what we… in practice, what we do: you have to sample, like… you get the best performance if you sample at many noise levels, and then

161
00:26:14.090 --> 00:26:16.699
Benno Krojer: Take the average, sort of score across that.

162
00:26:16.850 --> 00:26:23.929
Benno Krojer: So that's the details I'm not showing here, yeah. Okay, thanks. And does it matter early or late? I'm just curious.

163
00:26:24.420 --> 00:26:33.779
Benno Krojer: I think in our case, it was best if you more or less do it uniformly. I think the Diffusion Classifier did some more ablations on this, yeah. Yeah, so just to clarify, what's the difference?

164
00:26:34.240 --> 00:26:36.609
Benno Krojer: The Diffusion Classifier is just using a single class?

165
00:26:37.330 --> 00:26:49.780
Benno Krojer: It's the same idea, like, whether it's a class, you could represent the class as a text, and you'd have to, like, iterate over all your classes and, and…

166
00:26:49.900 --> 00:26:57.239
Benno Krojer: score it. In our case, we would just iterate over two… Winoground is just two texts. With ImageNet, it would be, like, a few thousand… a thousand or twenty thousand.

167
00:26:57.830 --> 00:27:02.209
Benno Krojer: But do you actually input the full prompt, or do you just take, like, the class name or something?

168
00:27:02.430 --> 00:27:08.359
Benno Krojer: For ImageNet classification. I think the best thing that worked was this is a photo of the class.

169
00:27:09.350 --> 00:27:15.389
Benno Krojer: Yeah. I'm not sure what the average across noising levels tells you in this case, like…

170
00:27:15.490 --> 00:27:31.540
Benno Krojer: If you can expand on that. Compared to just a single… Yeah, yeah, because I could imagine if you had a score, and you could, I don't know, take the integral of the curve over noising level, maybe that would tell you how efficiently you get to the, you know, to the…

171
00:27:31.710 --> 00:27:49.930
Benno Krojer: error that you want, but the average, I'm just… yeah, I'm not… Pretty similar to the integral. Yeah, I mean, I think it boils down to the same thing. So it's kind of like how efficiently the model gets the information about the caption at a very low noise… like, at a very high noising level, kind of.

172
00:27:50.630 --> 00:27:53.310
Benno Krojer: Right, or, like, any noising level, yeah. Yeah, yeah, yeah.

173
00:27:53.500 --> 00:28:02.129
Benno Krojer: Yeah, okay. It's just that there's a lot of, like, in general, like, variance here, so you just reduce the variance by sampling across many noise levels.

174
00:28:02.410 --> 00:28:06.939
Benno Krojer: Yeah, and also across many, like, actually, you can also, like, what we also do is, like, sampling, like, different…

175
00:28:07.400 --> 00:28:22.309
Benno Krojer: you add the noise several times, and then you average across that. So it's a very expensive procedure. It's definitely not the most efficient way to classify an image or assign it, but it was basically just to test, do these models… have they learned something compositionally beyond just looking at their generated outputs?

176
00:28:22.710 --> 00:28:37.780
Benno Krojer: So now we… now we can do a very, like, nice apples-to-apples comparison with CLIP, and we can show, sort of, as a first proof of concept… this was the first work that showed it's on par with CLIP on some benchmarks. Like, on ImageCoDe, it was a bit worse than CLIP. On Winoground, it was better.

177
00:28:38.330 --> 00:28:54.570
Benno Krojer: Looking back, I don't think this has, like, fully panned out, to, like, turn diffusion models into actually really strong understanding models, but it was nice to see that with some simple tricks, it can get done. So coming back to our research question: how can we repurpose Stable Diffusion to assign such a score to an image?

178
00:28:54.640 --> 00:29:00.780
Benno Krojer: Just by leveraging, sort of, the training objective, the denoising objectives. And the takeaway here is that

179
00:29:00.850 --> 00:29:18.639
Benno Krojer: we now have a bit of a more unified model, or, like, at least we framed it in a way that now can do generation and understanding in one. And it also ties back to maybe this broader question of what is the relation between generation and understanding. Like, can a model… like, is generation sort of always a prerequisite for understanding, or can you… can you generate without understanding?

180
00:29:19.930 --> 00:29:27.090
Benno Krojer: Okay, that brings me now to the last part, multimodal interpretability, which I'm going to focus the most on today. It's also the most recent work.

181
00:29:27.170 --> 00:29:42.650
Benno Krojer: Like, I… for other jobs, for my… for other talks, or other variants of this talk, I also made a pitch for why improved methods and multimodal interpretability should go hand in hand, but I think in this lab, I don't even know if I have to make that pitch. People maybe already, like, have… are convinced.

182
00:29:42.650 --> 00:29:55.309
Benno Krojer: But basically here, I'm just saying: if you want to get to unified architectures, representation, and reasoning, this has to go hand-in-hand with multimodal interpretability, and that means, like, building rigorous theories and inspection tools that are tailored for multimodality, and so on.

183
00:29:55.310 --> 00:30:15.160
Benno Krojer: So that we can understand: are these architectures actually learning some unified representation? And I see this as more than just an extension of LLM interp. I think it's a great testbed for the generality of the tools and paradigms we're building, whether they generalize beyond that. And actually, with Tamar, we are, like, right now preparing a workshop where we make that pitch.

184
00:30:15.300 --> 00:30:18.669
Benno Krojer: And I think the cool thing, scientifically, is also that many research…

185
00:30:18.890 --> 00:30:33.529
Benno Krojer: Yeah, David is lost on that. But, you know, but you let that faculty off the hook, right? It's like, I just, I just have to make sure nothing's, nothing, nothing illegal is going on. You looked, you looked over it, yeah.

186
00:30:34.100 --> 00:30:46.890
Benno Krojer: And I think from a scientific perspective, it's great, because there's many research questions you can only ask in a multimodal setting, like, not in a, like, unimodal setting, like, at what layers are, like, representation becoming more modality agnostic.

187
00:30:46.890 --> 00:30:56.840
Benno Krojer: Are we… are there circuits that are, like, cross-modal? Or if you train from scratch, it's really interesting to know, like, at what stages of training is it easier to learn language? When do you… is it easier to learn vision? Are they synergizing?

188
00:30:57.430 --> 00:31:00.930
Benno Krojer: And I think this also goes hand-in-hand with impact. In the future.

189
00:31:01.080 --> 00:31:18.419
Benno Krojer: I would argue all foundation models will be multimodal, or will interact multimodally, so they will be deployed all over, not just on the internet, but even in the real world, need to be safe. So we need scientific understanding, and I think it… I would be excited to build, like, interpretability tools that assist other scientists and practitioners here.

190
00:31:18.430 --> 00:31:22.550
Benno Krojer: To extend this toolkit that we now have for LLMs, also to multimodality.

191
00:31:23.200 --> 00:31:36.110
Benno Krojer: So that brings me to my recent work, Latent Lens, where we at least make a first step at that, where we reveal highly interpretable visual tokens in LLMs, and show that visual tokens can be interpreted via contextual LLM embeddings, basically.

192
00:31:36.320 --> 00:31:43.329
Benno Krojer: So this is recent work that we put on arXiv a few weeks ago with my nice collaborators, who helped a lot.

193
00:31:43.590 --> 00:31:57.159
Benno Krojer: And I think it's also a bit the kind of situation where we found the methods from, like, just unimodal LLM interp not quite answering the research question, so we came up with a new tool or method to answer it that's more tailored to multimodality: Latent Lens.

194
00:31:57.830 --> 00:32:08.169
Benno Krojer: And the motivation for this was just, like, wondering about the success… like, sure, I'm pitching here unified models, but what has worked in practice the most the last 4 years is

195
00:32:08.540 --> 00:32:14.169
Benno Krojer: with very little change, taking LLM and making it multimodal, like, with a couple changes. And…

196
00:32:14.400 --> 00:32:31.919
Benno Krojer: how that works so well. So, look, at this point, we're all used to this. You can just ask an LLM, describe this image. Here, I took an image as an example of Northeastern University, and it can give you a great description, and beyond that, of course, you can do all kinds of things at this point. But how does a language model process images? Like, it's trained at the end of the day on language.

197
00:32:32.050 --> 00:32:35.170
Benno Krojer: And the cool thing is, even if you keep it frozen, it can kind of do this.

198
00:32:36.190 --> 00:32:45.980
Benno Krojer: So, just for those who maybe are not working as much on multimodality, the proven recipe that has worked really well is, like: you can keep your vision transformer, so your vision encoder, frozen, even.

199
00:32:46.130 --> 00:32:55.760
Benno Krojer: You, you just patchify your image, put the patches through the transformer, you get, like, a sequence of tokens. Then you… all you have to do is, like, take an MLP, or, like, even a linear layer can work,

200
00:32:55.810 --> 00:33:11.800
Benno Krojer: and map it into the LLM prefix space as one sort of sequence, kind of like you would put language there, or soft prompts. And then you ask your even-frozen LLM to describe it, and then the same objective, next-token prediction, backprops into the MLP, and that's, like, a very simple recipe.
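The frozen-encoder recipe just described, patchify, encode, project every token with one shared MLP, prepend as an LLM prefix, can be sketched with placeholder weights. Shapes and names here are illustrative, not taken from any specific VLM.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W, C) image into a row-major sequence of flattened
    patches: top-left patch first, bottom-right last, matching the
    grid ordering discussed in the Q&A."""
    h, w, c = image.shape
    return np.stack([image[i:i + patch, j:j + patch].reshape(-1)
                     for i in range(0, h, patch)
                     for j in range(0, w, patch)])

def project_to_prefix(vision_tokens, W1, W2):
    """Shared 2-layer MLP projector, applied to every token in
    isolation: maps vision features into the LLM embedding space so
    they can be prepended to the text tokens like soft prompts."""
    return np.maximum(vision_tokens @ W1, 0.0) @ W2
```

In training, only the projector weights (`W1`, `W2` here) need gradients from next-token prediction; the vision encoder and the LLM can stay frozen, as in the recipe above.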

201
00:33:12.560 --> 00:33:13.450
Benno Krojer: Yep.

202
00:33:13.690 --> 00:33:20.170
Benno Krojer: Right, then you can't… Sorry?

203
00:33:20.690 --> 00:33:38.299
Benno Krojer: Oh, that's, that's the language. So, like, that's just, like, your prompt. Describe the image, please. Describe the image. It's two tokens. It's, it's text tokens, yeah. It's just, like, image end, and then your… The slide is too smart. Yeah, yeah.

204
00:33:38.320 --> 00:33:44.639
Benno Krojer: So yeah, the… what we want to do here is shed light on this puzzling question: how can a frozen LLM integrate non-linguistic inputs, like visual tokens?

205
00:33:44.770 --> 00:34:00.429
Benno Krojer: And what do these visual tokens actually represent as they are processed by the LLM? For example, are they converted into word-like tokens? Like, at the input, you would wonder: are they already interpretable, like language, at the input, or does it happen, like, through LLM processing?

206
00:34:01.440 --> 00:34:17.279
Benno Krojer: And this is also, like… you could broadly answer this question: why is vision-language so easy, and why does this work? You could give answers like the platonic representation hypothesis, or maybe there are, like, visual priors you learn from pre-training, and this is why it works, or transformers are universal computation engines,

207
00:34:17.449 --> 00:34:32.440
Benno Krojer: or they've learned, like, some implicit world model, kind of what they show here in this paper from Roma Patel and Ellie Pavlick, that they have these internal world models. But these are more, like, broad answers to why this works; it's not, like, a fine-grained answer on a token level, like, what's going on.

208
00:34:32.570 --> 00:34:35.430
Benno Krojer: So that's why we have to now go deeper.

209
00:34:35.570 --> 00:34:53.660
Benno Krojer: and look at the tokens. So what tools exist to study this? Like, on a just intuitive level, there's two things you can do. Like, the most obvious one, what we call in the paper embedding lens: you just look at the nearest neighbors from the embedding matrix, basically. That's what is very intuitive: are they at the input already interpretable?

210
00:34:53.790 --> 00:34:55.760
Benno Krojer: And what people also did was soft prompts.

211
00:34:55.909 --> 00:34:57.150
Benno Krojer: And I would say, like.

212
00:34:57.180 --> 00:35:17.009
Benno Krojer: the literature kind of says, like, not really interpretable. With soft prompts, there was also not much. So what people also started doing more recently on VLMs is using logit lens. I've seen, like, now many papers the last one or two years doing this. So same idea, in a way: you just look at, in a way, like, nearest neighbors, or, like, the prediction now with logit lens, and this would start to work somewhat at late layers.
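Both baselines mentioned here boil down to comparing a token's hidden state against a static vocabulary matrix. A minimal sketch of the embedding-lens lookup, with a made-up three-word vocabulary; the logit-lens variant is the analogous lookup against the unembedding matrix, reading off the top predicted tokens instead.

```python
import numpy as np

def embedding_lens(hidden, E, vocab, k=3):
    """Embedding lens: cosine nearest neighbours of a (visual) token's
    hidden state in the static input-embedding matrix E, one row per
    vocabulary item."""
    h = hidden / np.linalg.norm(hidden)
    En = E / np.linalg.norm(E, axis=1, keepdims=True)
    top = np.argsort(-(En @ h))[:k]
    return [vocab[i] for i in top]
```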

213
00:35:17.010 --> 00:35:26.220
Benno Krojer: it's been used for hallucination detection and so on. There's a question here? Yeah, just a technical question, like, when you chunk up the… when you map the…

214
00:35:26.230 --> 00:35:27.120
Benno Krojer: image.

215
00:35:27.160 --> 00:35:33.649
Benno Krojer: to these tokens that you're gonna then, like, will serve as a prefix for your LLM.

216
00:35:35.150 --> 00:35:42.509
Benno Krojer: How do you do this? Like, what does each token correspond to? Does the first token correspond to the top left? Yeah, yeah, it's a grid, like…

217
00:35:42.610 --> 00:36:07.160
Benno Krojer: It's just, like, local. It's like a… it's like a sub-image. And what's the typical number of…? 500, let's say, like, 576. You chop up your image into, like, 500 grid patches. Yeah. And then, you have a token… You have a token by… Yeah. So it's very stupid, in a way, okay. And today, are those… are those continuous token embeddings, or are they discrete?

218
00:36:07.160 --> 00:36:19.770
Benno Krojer: They are continuous, usually. I mean, there's experiments with discreteness, especially, like, on the image generation side, people have tried discrete, but it's continuous. So that's also the question, like, how can the LLM deal with these non-discrete, non-language-like tokens that are continuous?

219
00:36:21.520 --> 00:36:25.610
Benno Krojer: But the… so even though you're… You're… you have a bridge.

220
00:36:25.610 --> 00:36:49.930
Benno Krojer: like, the vision transformer is moving information, it's integrating information from all the context into each of those patches. Exactly, so they arrive already contextualized, like, in a way. Final representations of each of those patches, and passing them through some MLP that you're learning, and then… Yeah. And instead of the MLP, people have tried other things, but the… again, this sort of simple recipe is just MLP works fine.

221
00:36:49.980 --> 00:37:12.660
Benno Krojer: And you use the same MLP on all those tokens. Yeah, it's the same MLP, yeah, that's good to mention. Like, it's the same MLP, it's just applied in isolation, one by one, it's not, like, some elaborate situation, yeah. Do you still have causal masking happening on the image tokens? People tried it, but at this point, the recipe's also so simple, it's just applied… they apply the same thing as with the LM, it's autoregressive. The newer ones, with Gemma, they take it off, I think. In Gemma.

222
00:37:12.660 --> 00:37:37.280
Benno Krojer: Yeah, it seems like that would make sense. But, I don't think it improves that. Yeah, recent models, like, it's like, if it doesn't change much, just keep the same code base you have for LLMs and run it on vision tokens. Yeah, so that means, like, yeah, your top left token, for example, can't attend anything in the image, but your bottom right can attend everything in the image.

223
00:37:37.280 --> 00:37:44.380
Benno Krojer: let it integrate. Right, so that context is already there, but in the LLM, you couldn't do the… yeah. Makes sense.

224
00:37:45.020 --> 00:37:48.420
Benno Krojer: Yeah, so that's the puzzling question, like, how does this work so well?

225
00:37:49.360 --> 00:38:01.639
Benno Krojer: So yeah, these are the tools that exist, like, if you just want to, like, intrinsically, like, learn this. Of course, you can do more elaborate things. You could try learning, like, for example, a probe. You don't have, like, a static embedding matrix, now you have your classes as a matrix.

226
00:38:01.640 --> 00:38:11.429
Benno Krojer: But here the issue is, like, A, it doesn't give you insights about the relation to the LLM embeddings intrinsically, it's an external probe, and you actually need to put in effort, whereas it's these other tools, they just come out of the box.

227
00:38:12.000 --> 00:38:31.360
Benno Krojer: So that's why we mostly considered these as our baselines, but we also tried some… like, there's other… of course, you can try SAEs, you can… we tried Patchscopes a little bit, but we were mostly interested in the intrinsic, sort of, interpretability. And then, we basically argued that you can do better than this, than what, sort of, past work has done, or what we've seen so far, with Latent Lens, and the key insight is:

228
00:38:31.550 --> 00:38:39.579
Benno Krojer: Instead of, like, using a static embedding matrix, you want to retrieve nearest neighbors from LM contextual embeddings. So what we mean by contextual is

229
00:38:39.710 --> 00:38:46.609
Benno Krojer: just that intermediate representation of the LM, the embedding of dog would be different depending on if the context of

230
00:38:46.780 --> 00:38:50.320
Benno Krojer: here's my black dog, or, like, I'm eating a hot dog.

231
00:38:50.900 --> 00:38:57.660
Benno Krojer: And now you're not restricted anymore by, like, your static embedding matrix. Now these can be, in principle, an infinite amount of contextual embeddings.

232
00:38:57.790 --> 00:39:10.620
Benno Krojer: So, is it language context or image context you're talking about? Language context. So, like, we kind of want to have the equivalent of these static embedding matrices, but just, like, not just the input level and the output level, but intermediate. Got it.

233
00:39:12.650 --> 00:39:27.020
Benno Krojer: Can you build a bank of these contextual things? Exactly, yeah. So, like, in practice, we can't do infinite, so we just think: what is a relevant corpus for, like, let's say, vision, where, like, visual concepts are mentioned. So we just take a…

234
00:39:27.100 --> 00:39:43.250
Benno Krojer: few million sentences that mention visual concepts, run them through our LLM, store the activations at different layers… not at all layers, because that's a bit expensive, like, every fourth layer, let's say. And then you have, like, sort of a dictionary at the end that's like: dog, like, here's dog mentioned in this context, and this context, and so on.

235
00:39:43.300 --> 00:40:00.879
Benno Krojer: So this… this corpus is, unimodal or multimodal? Unimodal. So, like, you just take your, sort of… Just text. Just text. Especially since the LLM is frozen, like, you can just take the same LLM, run your text through it, and it's kind of then the equivalent of your embedding matrix or unembedding matrix, but now just, like, embeddings in between.
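The bank-building and lookup steps just described can be sketched as follows. This is a hedged sketch, not the paper's code: `embed_fn(word, sentence)` is a placeholder for reading the LLM's intermediate-layer activation at that word's position, and the cap of 20 stored contexts per word follows the number mentioned later in the discussion.

```python
import numpy as np
from collections import defaultdict

def build_bank(sentences, embed_fn, max_per_word=20):
    """Store contextual embeddings for words from a text-only corpus,
    keeping at most `max_per_word` contexts per word."""
    bank, counts = [], defaultdict(int)
    for sent in sentences:
        for word in sent.split():
            if counts[word] < max_per_word:
                bank.append((word, sent, embed_fn(word, sent)))
                counts[word] += 1
    return bank

def latent_lens(visual_hidden, bank, k=3):
    """Cosine nearest neighbours of a visual token's hidden state among
    the stored contextual embeddings; returns (similarity, word, context)."""
    v = visual_hidden / np.linalg.norm(visual_hidden)
    scored = [(float(e @ v) / float(np.linalg.norm(e)), w, s)
              for w, s, e in bank]
    return sorted(scored, key=lambda t: -t[0])[:k]
```

The interpretation step then runs the image through the frozen VLM and, per visual token and layer, looks up its nearest stored word-in-context.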

236
00:40:00.980 --> 00:40:03.700
Benno Krojer: I don't think I understand the experiment.

237
00:40:03.890 --> 00:40:06.099
Benno Krojer: Why is there an image in the bottom?

238
00:40:06.310 --> 00:40:11.020
Benno Krojer: I guess, okay, we're not feeding the image at this point, we want to interpret it, so, like.

239
00:40:11.180 --> 00:40:22.979
Benno Krojer: the image is just there, like, then you would, as a second step, run your, like, then you would run your image through this LLM again, like, in a frozen setup, and then just look at the nearest neighbor. So I'll get to the setup a bit more in detail, but…

240
00:40:23.010 --> 00:40:36.410
Benno Krojer: Yeah, but that's sort of a separate first step, where you just store them as, like, a few gigabytes of, like, sort of… and then you can look up nearest neighbors for that vision, and then you run your vision

241
00:40:36.450 --> 00:40:38.559
Benno Krojer: Your image, through the model.

242
00:40:39.750 --> 00:40:55.950
Benno Krojer: Can you maybe comment about why you think this is useful for multimodality? Like, I can imagine doing the same type of interpretation for language with language. Yeah, so we've… we've thought about this; we've gotten this question from several people, like, when I give the talk somewhere else.

243
00:40:56.710 --> 00:41:01.610
Benno Krojer: I think it's… it can be useful for language, but I would be 50-50 how well, because

244
00:41:02.150 --> 00:41:04.009
Benno Krojer: If you want to know what a…

245
00:41:04.270 --> 00:41:12.049
Benno Krojer: text token… what's the nearest neighbors for text token at some layer? By… trivially, the nearest neighbors will be that text token in that same context itself.

246
00:41:12.560 --> 00:41:28.809
Benno Krojer: So you'd have to be careful, like… like, maybe, like, a very similar sentence… if I want to know, like, what does my 'dog' represent at a layer, something informative would maybe be a 'dog' that's now in a dissimilar context. Like, I don't know how much they would tell you, so you'd have to be more careful of, like, selecting your contextual embeddings to not be…

247
00:41:29.600 --> 00:41:31.300
Benno Krojer: Infinite?

248
00:41:32.220 --> 00:41:37.069
Benno Krojer: So… But you can say it's the same for the image, right?

249
00:41:38.980 --> 00:41:51.920
Benno Krojer: No, because for the image, you don't know what, like, there's no… there's no trivial solution, sort of, what your nearest neighbor would be. Like, it could represent… like, with visual tokens, people often, like, I came into this project with the assumption that they don't have interpretable nearest neighbors often.

250
00:41:52.190 --> 00:42:00.180
Benno Krojer: But you were wrong. But I was wrong, like, spoiler, yes. Because that's what the MLP learned to…

251
00:42:00.550 --> 00:42:07.969
Benno Krojer: Yeah, to map into the contextual space instead of the input or output space, yeah. Which is maybe not that surprising, but…

252
00:42:08.090 --> 00:42:25.739
Benno Krojer: nobody has… or we… nobody assumed this was the… we'll get to the results, but yeah. But I do think it's useful for, like, I'll get to this for future work, but for any type, like, whether you have latent thinking, soft prompts, speech tokens, VLAs, there's often situations where your LLM is not processing just text, and you want to know what it represents. So for that, I think

253
00:42:25.910 --> 00:42:28.770
Benno Krojer: it's useful for text only, like, again, I…

254
00:42:30.450 --> 00:42:32.319
Benno Krojer: I think it would be really interesting to run this.

255
00:42:32.640 --> 00:42:36.540
Benno Krojer: Yeah, for multilinguality, people have talked about, like…

256
00:42:36.740 --> 00:42:38.439
Benno Krojer: So, I also thought of this.

257
00:42:38.830 --> 00:42:54.930
Benno Krojer: When I was doing my logit lens thing. Nice. But you never tried it? I mean, that's, I think, it shows you're a computer scientist, right? It's kind of annoying to implement. Yeah. I mean, with Claude, it was easier, but yeah.

258
00:42:55.460 --> 00:42:58.179
Benno Krojer: But it's actually… It's a great idea, yeah. Yeah, yeah.

259
00:42:59.120 --> 00:43:04.059
Benno Krojer: The Schmidhuber moment. We did it in 1991.

260
00:43:04.550 --> 00:43:06.860
Benno Krojer: No, no, no.

261
00:43:07.450 --> 00:43:09.609
Benno Krojer: No, I'm just joking.

262
00:43:09.790 --> 00:43:25.339
Benno Krojer: So we'll… we'll get to now, like… so the spoiler, basically: with Latent Lens, you get really nice interpretations now. So here, if you have the Northeastern building, and here you have these colors, now you get… your nearest neighbors for that vision token in the LLM processing would be "white building with pillars".

263
00:43:25.620 --> 00:43:36.250
Benno Krojer: The second one, like, cosine similarity 0.47, with "brownstone columns", and so on. And we, like… I will get to more examples, but this is just to show what, what I mean here.

264
00:43:38.360 --> 00:43:47.190
Benno Krojer: What was the logit lens interpretation of that tile? Okay, I can jump ahead to the examples. That's okay. So…

265
00:43:47.300 --> 00:44:01.070
Benno Krojer: Logit lens, for example, would… like, at the input, our Latent Lens gives you already nice stuff, like, here we have "gray tower with multiple clocks", and here "a clock tower". These are your nearest neighbors of this visual token in the LLM. Logit lens, of course, at the input is random.

266
00:44:01.070 --> 00:44:09.479
Benno Krojer: Then towards the output, it gets a bit better. Like, this is Qwen, so you get some Chinese tokens, you get "clock", but with logit lens, it's always, like, you get subwords, you get commas.

267
00:44:09.490 --> 00:44:11.609
Benno Krojer: Here you get leaves, because there's a tree.

268
00:44:11.710 --> 00:44:17.179
Benno Krojer: And random stuff here, but with ours, you get things like, again, "visible behind green foliage", and so on.

269
00:44:17.520 --> 00:44:20.320
Benno Krojer: What's the corpus you used for your…

270
00:44:20.480 --> 00:44:39.079
Benno Krojer: I just took Visual Genome because they had, sort of, very concise sentences. Like, initially we tried Conceptual Captions, but the sentences were so long and random that sometimes we would get, I don't know, random stuff. But this corpus had very, like, distinct… like, "here's a tree next to a house", so, like, it was, yeah, very atomic, yeah.

271
00:44:39.580 --> 00:44:50.660
Benno Krojer: We tried Patchscopes, it was a bit finicky. Like, we even reached out to Mor Geva about it, like… it would only work on some layers; same experience.

272
00:44:50.830 --> 00:44:52.010
Benno Krojer: And…

273
00:44:52.810 --> 00:45:01.770
Benno Krojer: Yeah, this is also a bit more open-ended, like, it's again, like, not as directly intrinsic to the model, but you can… I think Patchscopes would be nice to make work for vision tokens.

274
00:45:02.120 --> 00:45:03.140
Benno Krojer: Yep.

275
00:45:04.270 --> 00:45:07.989
Benno Krojer: That's who starts logging.

276
00:45:08.530 --> 00:45:10.249
Benno Krojer: Yeah, so that's, like…

277
00:45:10.370 --> 00:45:21.410
Benno Krojer: We tried, basically, like, we tried it for a week, we were like, okay, this is… this might need more time, so we didn't… we thought about including it as a baseline in the paper, but it felt like a proper project almost on its own to study.

278
00:45:22.630 --> 00:45:41.680
Benno Krojer: So the texts that come out of your method are the captions that match contextual embeddings retrieved from the corpus? Yeah, so these are among the 3 million, yeah. Yeah, so these are… and I would emphasize that it's not the caption, but, like, that word, so, like, that specific word, but in the context of that caption.

279
00:45:41.810 --> 00:45:44.229
Benno Krojer: Oh, oh, so you have a lot more…

280
00:45:44.590 --> 00:45:54.649
Benno Krojer: So it could have… it could have also matched with, but with would have not been the nearest neighbors, because, like, it's not, like, it's not as semantically… Oh, I see, so you have 3 million sentences, but you might have…

281
00:45:55.080 --> 00:46:09.840
Benno Krojer: Like, 50 million. Yeah, but we kept it a bit more, like we said, per kind of token, we only keep 20 contexts. We kept it a bit more efficient, and also with the sentences, you will get a lot of repeated or very similar ones that are sort of filtered out as duplicates.

282
00:46:09.910 --> 00:46:25.040
Benno Krojer: So in the end, I think we had even less, like, we had much less than 3 million, because I kept it at 20… I kept it at 20 per… like, for a dog token, just keep 20 contexts of dog. Random? I see. Random. But I'm currently, like, updating this corpus, because we want to expand this as a toolkit in general.

283
00:46:25.040 --> 00:46:31.339
Benno Krojer: So we thought, what are all concepts humans care about? We used WordNet to generate sentences for each WordNet concept, 5 for each one.
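
The corpus construction just described, keeping roughly 20 contexts per token type and filtering duplicate sentences, might be sketched like this. The function name, the cap, and the toy captions are illustrative assumptions, not the actual implementation:

```python
from collections import defaultdict

def build_context_bank(captions, per_word_cap=20):
    # Keep at most `per_word_cap` context sentences per word type,
    # after filtering out exact-duplicate sentences.
    seen = set()
    bank = defaultdict(list)
    for caption in captions:
        key = caption.lower().strip()
        if key in seen:
            continue  # duplicate sentence, skip
        seen.add(key)
        for word in key.rstrip(".").split():
            if len(bank[word]) < per_word_cap:
                bank[word].append(caption)
    return bank

bank = build_context_bank([
    "A dog next to a house.",
    "A dog next to a house.",   # exact duplicate, filtered out
    "A brown dog on the grass.",
])
```

In practice the bank would also store the contextual embedding of each word occurrence, not just the sentence.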

284
00:46:31.340 --> 00:46:37.699
Benno Krojer: And then it's more sort of a general, any concept you care about situation. Is this cosine similarity? Yeah, so this is…

285
00:46:37.700 --> 00:46:52.259
Benno Krojer: cosine 0.47, yeah. Everything is on… So it's… it's… this is a similar ballpark to what you would also find with language, because with… the other thing is with embedding lens and logit lens, or embedding lens at least, the cosine similarities are also much lower. They're usually 0.2,

286
00:46:52.470 --> 00:46:53.960
Benno Krojer: Yep.

287
00:46:54.470 --> 00:46:58.699
Benno Krojer: There's another question. Yeah, the examples you showed are, like,

288
00:46:59.340 --> 00:47:02.080
Benno Krojer: individual patches, if you look at the…

289
00:47:02.490 --> 00:47:08.069
Benno Krojer: This lens on the, like, end image token, does it tell you about the, like, whole image as a whole?

290
00:47:08.490 --> 00:47:24.439
Benno Krojer: We didn't see anything like this, so the… so… But there are interesting phenomena similar to this, so the LLM, like, doesn't… it might not put together that this image is an image of, like, Northeastern… There's no… I mean, Northeastern will be mentioned, but all over the image.

291
00:47:24.530 --> 00:47:33.129
Benno Krojer: I can also show you the demo for the Northeastern… That would be cool if you could show the demo. Oh, you have a demo? We have a demo, yeah. But I mean, like, since demo takes…

292
00:47:33.790 --> 00:47:36.269
Benno Krojer: Northeastern demo here, viewer.

293
00:47:36.580 --> 00:47:43.500
Benno Krojer: Wow, you're all prepared. So we can go here, so, like, the name… this would be…

294
00:47:44.200 --> 00:47:51.539
Benno Krojer: I don't know if it would say northeastern, maybe here? North? Northern? Northern. So it's like, it's on the text, it works really well, like, usually, like.

295
00:47:51.720 --> 00:47:56.310
Benno Krojer: If I go here, like, it would say University, because it's university.

296
00:47:56.510 --> 00:48:05.370
Benno Krojer: And logit lens would say, like, soft, late, like, random stuff, because we're also at layer 8, so, like, if I go to a late layer, maybe logit lens would now work. It also has -versity.

297
00:48:07.370 --> 00:48:17.939
Benno Krojer: On the logo, inscription, like, heological end is kind of worse. No, 1917, 19… it was his year there. Yeah. Wow. And blame, and blame…

298
00:48:18.180 --> 00:48:20.650
Benno Krojer: So… It's not working. What?

299
00:48:20.870 --> 00:48:33.689
Benno Krojer: What? Vegetation on the… Chris is like… but again, this is… it's all sort of… it's all sort of local, local information. Right. I'm wondering, like…

300
00:48:33.690 --> 00:48:43.389
Benno Krojer: how the LLM eventually will know that it's a unified image of, you know, the front of Northeastern University.

301
00:48:43.390 --> 00:48:52.909
Benno Krojer: When you have a sentence for… It just attends everything, right? So then… but then the LLM has to learn how to… like, the LLM is frozen.

302
00:48:53.350 --> 00:49:10.489
Benno Krojer: So I would be… I'm surprised that the attention… The LLM is going through some fine-tuning. But in my settings, it's actually frozen. Well, there are, so, like, these are frozen. I was curious about how… that was the question, like, how can a frozen LLM do this? But then there are ablations where, like, we unfreeze the LLM.

303
00:49:10.670 --> 00:49:26.229
Benno Krojer: Yeah. Okay. So the only thing being trained here is the MLP, is that right? Yeah, the MLP. And even, even if you do a linear connector, so just, like, a linear connection, not just MLP, it still works, so, like, we have a linear connection ablation here.

304
00:49:26.410 --> 00:49:28.140
Benno Krojer: like, the VIT with Dino?

305
00:49:28.560 --> 00:49:30.699
Benno Krojer: Teachers, does it still, like, recognize?

306
00:49:30.880 --> 00:49:34.410
Benno Krojer: There's a great question, there's a slide on this.

307
00:49:34.960 --> 00:49:40.839
Benno Krojer: See? This is what a job talk looks like, you guys. So…

308
00:49:42.090 --> 00:49:59.510
Benno Krojer: Visual text is the most reliably interpretable. When you have… your tokens come from CLIP, they're usually, like, just, like, at any layer, they're just the things that are written, the couch, tomato, I kept it very on purpose to show how hard this task is for the model. When it's DINO, it's still interpretable, but it's very generic. So letter, describing messages, letter, screenshot, letter.

309
00:50:02.950 --> 00:50:14.259
Benno Krojer: Okay, so now I showed you, going back to where I was, I showed you, like, roughly how this works, but we want to study this in a controlled setup, not just take one or two off-the-shelf models, because they can often give, like, very different results.

310
00:50:14.410 --> 00:50:27.450
Benno Krojer: So we train, in total, 9 models, like, 3 by 3, we take different vision encoders, we take different LLMs, and we do actually find that this matters, like, the results are very different across different combinations. Like, can I just interrupt? Yeah.

311
00:50:27.580 --> 00:50:39.830
Benno Krojer: the reason why DINO is interesting is because DINO doesn't use text supervision. Exactly. I was gonna mention it. So, exactly, so DINO is just trained with, sort of, like, self-supervised objectives.

312
00:50:39.830 --> 00:50:48.950
Benno Krojer: Which is why we wanted to include it and see, like, do we find less interpretability there? And we find… I'll spoil it again, we'll find the same level, but it is more generic often, and less sort of detailed.

313
00:50:49.060 --> 00:50:49.750
Benno Krojer: Yep.

314
00:50:50.430 --> 00:50:55.750
Benno Krojer: So yeah, we trained this control setup, and now we can actually look… quantify this. We look inside the LM now.

315
00:50:55.850 --> 00:51:08.750
Benno Krojer: On the x-axis, we have the layers as the vision token progresses through the LLM. On the y-axis, we have the percentage of interpretable tokens. Now, you might be wondering, how do we quantify this? In the past, people just used, like.

316
00:51:08.810 --> 00:51:21.799
Benno Krojer: class, like, maybe if it's the class of MS COCO, or, like, whatever. Instead, we built an LLM judge that does what a human would do here intuitively as well. You look at the sentences, you look at the patch, and you just say, is it semantically related?

317
00:51:21.800 --> 00:51:29.410
Benno Krojer: And we have different categories for semantically related, directly, globally related. So if it encodes, for example, a global concept, it would say this is globally interpretable.

318
00:51:29.530 --> 00:51:30.749
Benno Krojer: And so on.
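
The judge just described might be sketched as a prompt-plus-parser pair. The category names, the prompt wording, and the helper functions below are illustrative guesses, not the actual implementation (the real judge also sees the image and the patch):

```python
CATEGORIES = ["locally related", "globally related", "abstractly related", "unrelated"]

def build_judge_prompt(retrieved_words):
    # Sketch of the text side of the judge prompt only.
    return (
        "You are shown an image and one highlighted patch.\n"
        f"Retrieved nearest-neighbor words: {', '.join(retrieved_words)}.\n"
        f"Answer with exactly one of: {', '.join(CATEGORIES)}."
    )

def parse_judgment(answer):
    # Take the earliest category mentioned in the judge's reply.
    low = answer.lower()
    hits = [(low.find(c), c) for c in CATEGORIES if c in low]
    return min(hits)[1] if hits else None

def is_interpretable(category):
    # Any of the three "related" categories counts as interpretable.
    return category in CATEGORIES[:3]
```

A token's interpretability score over a dataset would then just be the fraction of patches whose judgment passes `is_interpretable`.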

319
00:51:30.920 --> 00:51:34.389
Benno Krojer: And we make sure this correlates with humans, like, quite well.

320
00:51:34.660 --> 00:51:36.820
Benno Krojer: In your job talk, people might say.

321
00:51:37.240 --> 00:51:41.920
Benno Krojer: That this is a strange way to implement the judge, given that you just convinced him.

322
00:51:43.010 --> 00:51:45.229
Benno Krojer: GPT is not good at visual reasoning.

323
00:51:45.700 --> 00:51:47.659
Benno Krojer: Right, but… We need to think about that.

324
00:51:48.580 --> 00:51:59.459
Benno Krojer: But I think this task is easier than… than the hard task that I'm talking about. Like, it's still… like, you just have to know that there's a building with pillars, for example, here to say that,

325
00:52:00.140 --> 00:52:01.130
Benno Krojer: Yeah, yeah.

326
00:52:02.230 --> 00:52:06.769
Benno Krojer: Is this something that tells you about the LLM, or is it not only?

327
00:52:07.540 --> 00:52:18.089
Benno Krojer: I was… I would say it's a bit more LLM-centric, but it… because we try different vision encoders and find different results, like, it also tells you something about the vision encoders.

328
00:52:18.230 --> 00:52:24.649
Benno Krojer: But for me, the big puzzle was really how can a frozen LLM, like, what is… what are the general things in the LLM that can make this possible?

329
00:52:24.760 --> 00:52:37.299
Benno Krojer: And especially coming in, like, thinking these are not interpretable, I was like, if they're not looking like language, like, how is the LLM doing it? Like, what processes are they… is it going through, yeah? Right, so I think, there was,

330
00:52:38.160 --> 00:52:41.220
Benno Krojer: Everything's, so pregnant together.

331
00:52:41.490 --> 00:52:45.030
Benno Krojer: But they did it with logit lens.

332
00:52:45.190 --> 00:52:48.440
Benno Krojer: Yeah. We really have there.

333
00:52:48.560 --> 00:52:54.440
Benno Krojer: So, like, this linear layer reminds me of probes, right? Like, where we are trying to say that

334
00:52:56.270 --> 00:53:03.060
Benno Krojer: You mean the linear connector? Like, MLP? What it reminds me is that, oh, maybe the visual encoder already has

335
00:53:03.510 --> 00:53:20.980
Benno Krojer: such rich language information. Yep. Then you need just an MLP layer to connect it to the LLM. Exactly, yeah, so that's the… the cool thing, that these representations are maybe implicitly already very aligned, just need to be… I think that's where you're sort of seeing this…

336
00:53:21.060 --> 00:53:30.009
Benno Krojer: Yeah, because the LM still has to… the LLM still has to do a lot of work, I think, to make sense of this at the end of the day.

337
00:53:30.250 --> 00:53:31.050
Benno Krojer: Yup.

338
00:53:33.050 --> 00:53:34.190
Benno Krojer: But yeah, it's…

339
00:53:34.300 --> 00:53:39.409
Benno Krojer: yeah, feel free to use it, whether you're a vision person or LLM person, it hopefully gives you insights in both ways.

340
00:53:39.820 --> 00:53:59.610
Benno Krojer: So yeah, now the findings, so embedding lens, just using the embedding matrix, already shows medium to high interpretability, even at the input, so that's kind of the first thing when the project started, we tried embedding lens, and we were surprised that for some models, these are the OLMo variants, so, like, when you use OLMo as the LLM backbone, the interpretability is, like, half of the tokens are now interpretable.

341
00:53:59.980 --> 00:54:08.219
Benno Krojer: How do you measure this? With the LLM judge that I talked about. So, like, it's a bit expensive, so we don't run it on, like, millions of examples, just, I think…

342
00:54:08.390 --> 00:54:12.610
Benno Krojer: Per model, like, a few hundred examples, but since we have 9 models and stuff, like, yeah.

343
00:54:13.200 --> 00:54:15.900
Benno Krojer: The bars around it, that's the confidence interval.

344
00:54:16.080 --> 00:54:32.069
Benno Krojer: No, this is sort of the… like, we have 9 different models, and I didn't want to show, like, 9 different models, just the lowest model, we can see, like, it's what people often expect, that it's not that interpretable at the input, but with some other models, these are the OLMo variants, where the LLM backbone was OLMo, like, it's 50%. Yeah.

345
00:54:32.990 --> 00:54:47.749
Benno Krojer: Then we apply logit lens. Again, we see the same trend, like, some models are super low, even at the late layers. Like, I think this is also, in general, a situation with logit lens, it doesn't work for all models as well, for some reason. And for other models, as we… as the literature has already established, it goes up.

346
00:54:47.750 --> 00:54:53.350
Benno Krojer: To even, like, 80%, so that's kind of nice. But it's, again, only late layers, only on some models, so also not great.
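
As an aside, the logit-lens operation being compared here, final layer norm followed by the unembedding matrix, can be sketched with toy weights. The RMSNorm-style norm and the tiny vocabulary below are stand-ins, not any specific model's:

```python
import numpy as np

def logit_lens(hidden, ln_weight, unembed, vocab):
    # Apply an RMSNorm-style final layer norm, then project with the
    # unembedding matrix and read off the most likely token.
    normed = hidden / np.sqrt(np.mean(hidden ** 2) + 1e-6) * ln_weight
    logits = normed @ unembed.T
    return vocab[int(np.argmax(logits))]

vocab = ["tree", "house", "dog"]
unembed = np.eye(3, 8)            # toy unembedding: one axis per word
hidden = 5.0 * unembed[1]         # hidden state aligned with "house"
word = logit_lens(hidden, np.ones(8), unembed, vocab)
```

Embedding lens is the same idea but matches against the input embedding matrix instead of projecting to logits.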

347
00:54:53.350 --> 00:55:02.129
Benno Krojer: And now comes the cool finding with our lens, so latent lens, like, visual tokens are very interpretable across all layers and all models, so even your lowest model is still at 60%,

348
00:55:02.130 --> 00:55:16.550
Benno Krojer: your highest is around 80. Does latent lens use a different dictionary for every layer? Great question. This brings us to the next slide, in a way. I didn't explicit… I explicitly didn't mention the detail of, like, which layer you match with.

349
00:55:16.670 --> 00:55:24.010
Benno Krojer: And also here, again, we tried Patchscope, but it didn't quite work, as we discussed. I'll come to this,

350
00:55:24.190 --> 00:55:26.679
Benno Krojer: Maybe I'll jump to this first, in response to his question.

351
00:55:26.990 --> 00:55:31.469
Benno Krojer: This is one last puzzle, actually. How can latent lens outperform embedding lens even at the input layer?

352
00:55:31.760 --> 00:55:46.280
Benno Krojer: Because, as you can see, a big difference, and at the input, there's no contextual embeddings at layer 0. There's… like, they don't exist. So maybe you're already noticing where I'm getting with this, or, like, as with David's question, like, do you have an… is this a puzzle for you? Or, like, I'm… yeah.

353
00:55:46.930 --> 00:55:50.420
Benno Krojer: Or do you see the solution that… I'm getting it.

354
00:55:51.110 --> 00:55:56.750
Benno Krojer: Just gonna give a second, if people have… wanna… Layer 0, right? It was literally after 10 minutes.

355
00:55:57.240 --> 00:56:10.859
Benno Krojer: So this is when the vision token just arrives at the LLM, and it hasn't even been processed at all. And then we apply either embedding lens, or let's say… the embedding lens would make the most sense at that point, right? But it doesn't work as well. And then we also apply latent lens, and

356
00:56:11.530 --> 00:56:16.159
Benno Krojer: I would guess that you're matching against, maybe the union of the layers or something?

357
00:56:16.310 --> 00:56:22.940
Benno Krojer: Almost? Not even the… at the union, yeah, sorry, at the union, yeah, yeah. Because it also doesn't matter, right? Because you're…

358
00:56:23.030 --> 00:56:39.439
Benno Krojer: I mean, you do a retriever over your… Exactly, so we just… our pool of contextual banks comes from, like, can be from layer 8, layer 16, you can get the nearest neighbor from anywhere. So you're… you're actually… you can tell something about what the MLP is doing, right? By looking at that. So you're saying that the MLP

359
00:56:39.650 --> 00:57:03.220
Benno Krojer: It learns to map to later layers. So this is doing some heavy lifting over there. It says I have contextual concepts, I'm just going to insert them at layer 0, because that's what I'm matched to. Yeah, but then you pass the info… then the residual stream just passes that on without much change. We'll show that here, but there's also a question about the…

360
00:57:03.860 --> 00:57:07.520
Benno Krojer: Because you're doing a variable on the language.

361
00:57:07.670 --> 00:57:25.959
Benno Krojer: Yeah. But it's good. We'll figure it out anyway. It's only… sorry, like I'm saying, again, the sentence "I'm eating hot dogs." You map this sentence to the…

362
00:57:25.980 --> 00:57:33.719
Benno Krojer: Latent of the last token. It could be, like, could be a middle token, could be hot. Hot, if you want to match with hot and hot dog, like, yeah. I mean…

363
00:57:34.580 --> 00:57:45.829
Benno Krojer: You are mapping your image latent, and you are extracting from, let's say, your corpus of documents. And each of the documents in your corpus, they map to a single latent.

364
00:57:46.400 --> 00:58:06.190
Benno Krojer: How you're deciding that? I'm assuming… I just assume that is the last token. It's one of the latents of the last tokens of that. But we're storing all of them, so, like, for, like, you pass, you send it through the LM, you store for a subset of your layers and all the token positions, like your… Yeah, yeah. So it could be the… it could be the first, first one. And multiple layers for each token.

365
00:58:06.190 --> 00:58:14.909
Benno Krojer: So, like, when we go back here, like, for example, this nearest neighbor came from layer 16, and it's pillars, but it could have been, like.

366
00:58:14.950 --> 00:58:21.300
Benno Krojer: This is not the end of the sentence, right? There's more coming here, like, it's just some precision. Okay, okay, okay. No, it's embedding.

367
00:58:21.470 --> 00:58:22.640
Benno Krojer: at Pillars.

368
00:58:22.750 --> 00:58:38.479
Benno Krojer: Pillars does not have any information about the white pillars, right? Or layer 0, yeah, at layer zero, no. We're not looking at the embedding vectors, we're looking at whatever comes out of

369
00:58:38.650 --> 00:58:42.159
Benno Krojer: putting the vision transformer embedding through the MLP network.

370
00:58:42.550 --> 00:58:42.900
Benno Krojer: That's

371
00:58:43.260 --> 00:58:55.739
Benno Krojer: Wait, wait, that's not as well. No, no, no, no, that is not my question, actually, because I was… because, the mapping, the text mapping, it has to come from the embedding itself, and the embedding only has information about the token.

372
00:58:55.740 --> 00:59:20.010
Benno Krojer: That's only true for layer 0 activation. But for every token in their corpus, they're taking a layer 0 activation, they're storing layer 4 activation, layer… So for layer 0, we don't store anything, we just have embedding lens, but yeah, like, we have layer 1, and to tell you exactly, like, we take layer 1, 2, 4, 8, 16, 24, 31, 32, so that we have a… So you search over…

373
00:59:20.010 --> 00:59:24.749
Benno Krojer: all of that space. Yeah, like, we have a union over that space, yeah.

374
00:59:25.480 --> 00:59:31.660
Benno Krojer: And then we just take the highest top 5 nearest neighbors, yeah. What's the interpretability score?
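
The retrieval just described, a top-k cosine nearest-neighbor search over contextual activations pooled from several layers (the "union"), might look roughly like this. The vectors and the (word, layer) metadata format are toy assumptions:

```python
import numpy as np

def cosine_top_k(query, bank_vecs, bank_meta, k=5):
    # Nearest neighbors of one vision-token activation over the union of
    # corpus activations stored from multiple layers.
    q = query / np.linalg.norm(query)
    b = bank_vecs / np.linalg.norm(bank_vecs, axis=1, keepdims=True)
    sims = b @ q
    top = np.argsort(-sims)[:k]
    return [(bank_meta[i], float(sims[i])) for i in top]

# Toy bank: two words, each stored at two layers.
bank_vecs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
bank_meta = [("dog", 8), ("dog", 16), ("tree", 8), ("tree", 16)]
neighbors = cosine_top_k(np.array([0.95, 0.05]), bank_vecs, bank_meta, k=2)
```

Because each bank entry carries its source layer, a neighbor can come from any stored layer, which is what makes the layer-0 result possible.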

375
00:59:32.180 --> 00:59:44.070
Benno Krojer: the interpretability here? Oh, it's the LM judge, it looks at, okay, in detail, it looks at the top 5 nearest neighbors that our thing retrieves, and just says, like, is one of them, like, semantically related to your…

376
00:59:44.210 --> 00:59:57.829
Benno Krojer: To your image. Like, to the patch of the image. It sees both the image and the little patch, and then it can either say it's locally related, it's globally related, or it's abstractly related. Do you look at the residual stream always, or… Yeah.

377
00:59:58.640 --> 01:00:05.770
Benno Krojer: I'm sure you… I'm sure you can find maybe interesting, as a side project, interesting things somewhere else.

378
01:00:06.360 --> 01:00:07.590
Benno Krojer: Mmm, no.

379
01:00:07.840 --> 01:00:15.089
Benno Krojer: But it's just precise, so… Yeah, but to, you know, if you do logit lens, you do layer norm and,

380
01:00:15.930 --> 01:00:19.160
Benno Krojer: And then apply the… unembedding, or…

381
01:00:19.580 --> 01:00:35.750
Benno Krojer: And it makes a big difference, it matters. Okay, I mean, for us, it started working, so we didn't honestly think about this too much, but… Embedding lens is super weird, because you don't know what you are supposed to do, right? And so, when you build this method, then…

382
01:00:36.230 --> 01:00:40.409
Benno Krojer: I'm curious to hear more about this afterwards.

383
01:00:40.690 --> 01:00:43.809
Benno Krojer: No, no, I was, I was very surprised.

384
01:00:44.000 --> 01:00:49.949
Benno Krojer: did not implement it like that, you wouldn't have found this out. It's really… So, actually, how did you find the result of the…

385
01:00:49.950 --> 01:01:10.319
Benno Krojer: Mixing up the layers? Yeah, this was a bug that I had. What do you mean? I had a bug where, like, I was by accident… I was by accident, like, comparing… like, initially I thought I would just compare it with the same layer. Yeah, yeah, yeah, that's the way to do it. And then I had a… Yeah, I thought I had implemented, like, same layer, but I was comparing to the always layer 8 by accident or something.

386
01:01:10.320 --> 01:01:15.709
Benno Krojer: And then I was like, wait, am I getting… why am I getting this high interpretability even at layer 0 with layer 8?

387
01:01:15.710 --> 01:01:20.209
Benno Krojer: Claude codes too well. You would never have discovered it.

388
01:01:20.240 --> 01:01:24.439
Benno Krojer: It happened, it happened, I think it even happened with Claude Code, it happened with Claude Code, I think.

389
01:01:24.580 --> 01:01:41.479
Benno Krojer: And then someone was like, wait, Claude, why do we have this here? Like, and I was like, oh, no. Do you have an analysis? So, let's get back to the… So, basically… Oh, sorry, okay.

390
01:01:44.120 --> 01:01:44.930
Benno Krojer: Yep.

391
01:01:45.570 --> 01:01:57.419
Benno Krojer: So, for latent lens, it makes sense because you have some context there. Yeah. And I'm imagining even if some of the early tokens get picked up, you give the entire context to the LLM judge.

392
01:01:59.210 --> 01:02:09.759
Benno Krojer: I have to be honest here, like, we tried making the LLM judge work with the whole context, but it was either over or under-interpreting often, like, it would either be too strict or not strict enough.

393
01:02:09.870 --> 01:02:12.570
Benno Krojer: So we actually only gave it, sort of, the full word.

394
01:02:12.690 --> 01:02:14.629
Benno Krojer: And not the sentence.

395
01:02:14.780 --> 01:02:22.050
Benno Krojer: So that means, like, I'm basically, like, it's possible that the interpretability is even a bit higher, because sometimes it could be a word that's not interpretable, but the context is.

396
01:02:22.170 --> 01:02:33.130
Benno Krojer: But also, the LLM judge was just a bit noisy, so… and this way, it's also at least a fair comparison, because then with logit lens and embedding lens, it's also, like, word-based.

397
01:02:33.940 --> 01:02:36.039
Benno Krojer: What if you're giving only the word there?

398
01:02:36.190 --> 01:02:37.010
Benno Krojer: Thanks for that.

399
01:02:37.400 --> 01:02:47.900
Benno Krojer: But, but we sort of give ourselves a little advantage by merging the word with nearby tokens until we hit whitespace, so that… because our method, the nice thing, is you do get the context sort of for free.

400
01:02:48.020 --> 01:02:56.070
Benno Krojer: So even in this case, I think it might have been pillar, and then the S, or, like, you know, like, "col," and then "umns," and we merged it and give the LLM judge the full…

401
01:02:56.750 --> 01:02:57.649
Benno Krojer: Yeah. Yeah.

402
01:02:58.000 --> 01:03:01.749
Benno Krojer: But we tried full sentence LLM judge, and it was a bit finicky.

403
01:03:02.470 --> 01:03:07.850
Benno Krojer: And then we, post hoc, argued that it would also be a nice apples-to-apples comparison.

404
01:03:10.240 --> 01:03:30.159
Benno Krojer: Yeah, so going to this, I'm… okay, I'm gonna… for the sake of time, since that's… which I'm happy about so many questions, I'm gonna skip over some things. We… we showed the insights hold also outside the controlled setup, so, like, with Qwen2-VL, where the… for example, the LLM would also be fine-tuned on a lot of things, and not just frozen, you see a similar trend, that it just works.

405
01:03:30.210 --> 01:03:38.209
Benno Krojer: We also, like, did lots of ablations. I think the two most interesting ones to mention are when you update the LLM during training, it's even more interpretable by, like, a little bit.

406
01:03:38.450 --> 01:03:49.080
Benno Krojer: And if you replaced MLP with a linear mapping, so that gets back to how… maybe how similar these representations are, you get the same interpretability, roughly. So it's really easy to map between these frozen models.

407
01:03:49.240 --> 01:03:52.130
Benno Krojer: Back to this puzzle that we now touched on.

408
01:03:52.140 --> 01:04:04.009
Benno Krojer: So the explanation for why, even at layer 0, this works is, like, if you have a layer of the visual tokens in the LLM, let's say layer 0, and you want to know where are most of your nearest neighbors coming from, like, like.

409
01:04:04.010 --> 01:04:16.870
Benno Krojer: you can basically see that most of them for this model come from Layer 8, actually. So it's already most aligned to Layer 8 when it comes out of the vision encoder, and the MLP just learns to map to that space, and then it just maybe gets passed along with the… from the residual stream, and then…
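
The diagnostic behind this figure, counting which corpus layer each retrieved neighbor was stored from, reduces to a small histogram. The function name and the data below are made up for illustration:

```python
from collections import Counter

def neighbor_layer_histogram(neighbor_lists):
    # Count the source layer of every retrieved neighbor; a peak at
    # layer 8 for layer-0 vision tokens would suggest the connector maps
    # straight into the LLM's layer-8 space.
    counts = Counter()
    for neighbors in neighbor_lists:
        for _word, layer in neighbors:
            counts[layer] += 1
    return counts

hist = neighbor_layer_histogram([
    [("pillars", 8), ("column", 8)],
    [("tree", 8), ("grass", 16)],
])
```

Repeating this per LLM layer gives the layer-by-layer matrix where the diagonal pattern only appears from the middle layers on.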

410
01:04:17.030 --> 01:04:28.890
Benno Krojer: at layer 8, then you see this more intuitive diagonal pattern where, like, you have the corresponding layer situation. It's a beautiful figure. Like, how do you put, like, how do you avoid being, like, messed with?

411
01:04:29.870 --> 01:04:36.789
Benno Krojer: Because, like, this vector is still processed by 0, 1, 2, but… It's a residual stream.

412
01:04:36.860 --> 01:04:52.560
Benno Krojer: Yeah, sure. So it has to figure out, like… How do you… How to avoid… Ask visual… Oh, you mean how… there's not, sort of, like, the attention doesn't mess up, the attention module doesn't mess up your representation?

413
01:04:52.960 --> 01:04:59.160
Benno Krojer: How does the LLM know not to touch it until… Yeah, I also don't know, to be honest. It's a high-dimensional space.

414
01:04:59.290 --> 01:05:03.279
Benno Krojer: Yeah. Most things are orthogonal. A lot of large numbers.

415
01:05:03.530 --> 01:05:11.560
Benno Krojer: So this is also the part where I'm curious to hear from you guys.

416
01:05:12.350 --> 01:05:20.779
Benno Krojer: There is no… So, like, we have… I don't know.

417
01:05:20.990 --> 01:05:22.420
Benno Krojer: What? Make a shared message.

418
01:05:22.700 --> 01:05:34.709
Benno Krojer: I mean, it obviously works. It works, and we have, we have also, we have… so these are… I have some backup slides also on this. You know, you've got this adversarial thing.

419
01:05:34.750 --> 01:05:48.009
Benno Krojer: Well, sure, it'll match up some plans. So, like, it's interesting, for example, here… I'm gonna walk it off. We measure, like, for the same token, like, the similarity, cosine similarity across layers. So, like, the value here would mean if I compare the token at layer…

420
01:05:48.010 --> 01:05:54.899
Benno Krojer: 13 to its own… the same token at layer 0, what's the cosine similarity? For language, it goes down quickly, so tokens change a lot.

421
01:05:54.990 --> 01:06:04.009
Benno Krojer: For the vision tokens, they sometimes don't change at all. For the SigLIP and DINO variants, or they go down only at the end. For the OLMo variants, they go down a bit more.
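
The measurement here, cosine similarity of the same token's residual-stream state at each layer against its layer-0 state, is nearly a one-liner. The arrays below are toy stand-ins, not real activations:

```python
import numpy as np

def same_token_cos_sim(states):
    # states: (num_layers, hidden_dim), one token's residual-stream
    # state at each layer. Returns cosine similarity of every layer
    # against layer 0; values near 1.0 mean the token barely changes.
    normed = states / np.linalg.norm(states, axis=1, keepdims=True)
    return normed @ normed[0]

# A "vision-like" token that never changes: similarity stays at 1.0.
frozen = np.tile(np.array([3.0, 4.0]), (4, 1))
sims = same_token_cos_sim(frozen)
```

For a text token, the same curve would decay toward lower values as the state is rewritten layer by layer.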

422
01:06:04.240 --> 01:06:04.910
Benno Krojer: Yep.

423
01:06:04.980 --> 01:06:28.100
Benno Krojer: They're very high sometimes. They're very high. They're crazy high. Sorry? Sorry, what did you say? That's right, that's how you do it, because it's pre-layer norm. You measure pre-layer norm. So, sometimes you will have norms that are, like… I mean, it's not the highest here, actually, but if you go here, you have norms that are, like.

424
01:06:28.100 --> 01:06:28.610
Benno Krojer: Oops.

425
01:06:29.330 --> 01:06:40.100
Benno Krojer: So the vision tokens' norm is sometimes, like, for the, sort of, on the text side, you have norms, like, like, not layer norms, L2 norms, like, maybe, like, 10 or, like, 20.

426
01:06:40.100 --> 01:06:44.369
Benno Krojer: For the vision tokens, it can go up to 10,000. Like, your highest one would be 10,000,

427
01:06:44.370 --> 01:07:06.249
Benno Krojer: your 1% would stop here. I see. But that's how it's blasting through. It's how it's blasting through layer 8. Have you considered just fixing the… whatever the linear layer gives, and feeding that to… Just freezing it, and not… like, just skipping… basically skipping 8 layers, and then… Just not doing anything with the visual.
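
The norm comparison being made (text tokens with L2 norms around 10 to 20, vision tokens reaching the thousands) is just a per-row vector norm. The numbers below are illustrative constructions, not measured values:

```python
import numpy as np

def l2_norms(hidden_states):
    # Per-token L2 norm of residual-stream states,
    # for an array of shape (num_tokens, hidden_dim).
    return np.linalg.norm(hidden_states, axis=1)

text = np.full((4, 16), 3.0)        # each row has norm sqrt(16 * 9) = 12
vision = np.full((4, 16), 2500.0)   # each row has norm 10000
```

With norms this unbalanced, the bounded per-layer residual updates barely move the vision token, which is consistent with the flat similarity curves.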

428
01:07:06.440 --> 01:07:17.440
Benno Krojer: So, yeah, yeah, so for the image, rather than letting the model compute on it, just literally freezing it to the output of the MLP.

429
01:07:17.600 --> 01:07:32.549
Benno Krojer: Right. For all those token positions, for the visual vision. What do you mean by free… like… like, you take… for one image patch, you have the output of the MLP, right? Yeah. And then what you're doing is you're letting the LLM compute on it. Yeah.

430
01:07:32.550 --> 01:07:41.020
Benno Krojer: instead of letting the LM compute on it, just, like, copy and paste that everywhere at every layer. Right. Do people do that?

431
01:07:41.280 --> 01:08:01.010
Benno Krojer: Because it never has to do a prediction task on this token. And also, like, why would the LLM… Why would the LLM know how the hell to process that thing, or something? It looks like the LLM is bypassing that. The LLM still processes it, it writes stuff, but it's too small to change it.

432
01:08:01.080 --> 01:08:14.870
Benno Krojer: So then this should work, this should work as well. How strict are we on time, by the way? Okay, so we are quick. We're gonna kick you out. I just want to make sure to be somewhat slower.

433
01:08:15.420 --> 01:08:33.320
Benno Krojer: Great results. Yeah. So I'm actually, like, this is also something where I can probably learn from you guys. I've only been working on interpretability for, like, a year now. I think some here have worked on it for 5 or 6 years, so you have probably good intuitions on… well, more than 5 or 6 years. You have probably better intuitions than me about some of this… this residual stream, stuff that's going on here.

434
01:08:33.450 --> 01:08:40.059
Benno Krojer: Is the LM is based on the…

435
01:08:40.229 --> 01:08:48.580
Benno Krojer: So, the vision tokens are not doing anything for the prediction task, right? Right. There's no… there's no loss, sort of, on them at all.

436
01:08:50.080 --> 01:08:52.610
Benno Krojer: Don't understand why you're talking about.

437
01:08:52.800 --> 01:08:55.669
Benno Krojer: Yeah, why'd be so nice to these neural networks? Just bless them.

438
01:08:56.010 --> 01:09:10.010
Benno Krojer: Yeah, and to be fair, like, I mean, we are scientifically interested in, like, the frozen setup. If you do fine-tune the LLM, there will be… they will change a bit more. Like, we have the same plots also for… for the Qwen2 7B, and there's a similar trend, but it's a bit less, so, like, instead of

439
01:09:10.029 --> 01:09:16.799
Benno Krojer: being like this, it would maybe be, like, more similar to language. They do change a bit more. Yeah, so…

440
01:09:16.800 --> 01:09:33.349
Benno Krojer: So this whole skipping or, like, freezing thing would be nice scientifically, but I think in practice, people fine-tune, and they want things to change a little throughout the layers. I'm just curious about if there's any performance dip. So isn't… isn't the blue straight line meaning that it doesn't change from layer 0? Yeah. What's blue versus pink here, I'm sorry?

441
01:09:33.350 --> 01:09:42.809
Benno Krojer: Blue is visual tokens, red is text tokens, and how much the, sort of, same-token similarity, cosine similarity? Oh, I see, how much they change. So, and text changes a lot.

442
01:09:43.210 --> 01:09:58.950
Benno Krojer: So, for Qwen2 7 billion, it means that it already outputs whatever it needs at layer 32, right? Yeah, there's no change to the vision tokens at all, basically. I think it's really surprising. Yeah, but you can achieve that by having a very high norm.

443
01:09:59.100 --> 01:10:06.770
Benno Krojer: Right? Because all your, all your, residuals that get added from your MLP and your attention will go through layer norm.

444
01:10:06.850 --> 01:10:26.720
Benno Krojer: So all the deltas are all bounded to this small norm. So if you go in with something with norm a million, that it already knows what it needs to have at layer 32, you know, it's supposed to be, like, a heavy computation, but… But I think the assumption here is just… it just needs the value of the visual token.

445
01:10:26.900 --> 01:10:39.530
Benno Krojer: There's no, like, more information other than this sketch contains a red something. Red. Red, red, red, red, red. Like, that's all it needs. It doesn't, like, because the language model background background.

446
01:10:39.710 --> 01:10:43.400
Benno Krojer: was not fine-tuned. There's no compu-a- like.

447
01:10:43.680 --> 01:11:02.499
Benno Krojer: Why would we assume that there is computation happening? It seems like in some cases… I think, if it's just red, then even just an embedding that contains red would change, right? So why wouldn't a vision… wait, if you all… all you need is just, like, the value red that later layers will fit?

448
01:11:03.140 --> 01:11:22.990
Benno Krojer: But it's still, it's… I mean, it's still a hard task, because this is a 1D sequence, so knowing, like, that red is on, like, sort of this object right now, and, like, also the spatial, like, the sort of the positional embedding reasoning, I think, can be… it's interesting that it works so well, that the model knows it's the red ball here, not the red cube, or, like… Yeah, a lot of stuff has been…

449
01:11:22.990 --> 01:11:34.150
Benno Krojer: The vision encoder already processes it, so it's like starting from the 8th layer in the LLM, right? In a way, yeah, yeah, but it's cool that they're so implicitly aligned that it, yeah, that it works.

450
01:11:34.150 --> 01:11:43.660
Benno Krojer: It's already processed. It doesn't have to have all the questions. No, but it is weird, right? I mean, so… Like, so when you… when Andy asks, you know, why, why don't…

451
01:11:43.740 --> 01:11:48.349
Benno Krojer: we… we try to be a little bit more gentle with the LLM. Like, it's… it's funny.

452
01:11:48.490 --> 01:11:54.779
Benno Krojer: It's funny that it doesn't mess up the LLM attention and things like this, having these high norm things sitting around.

453
01:11:54.980 --> 01:11:57.340
Benno Krojer: Right? It's really not prepared to…

454
01:11:57.550 --> 01:12:05.159
Benno Krojer: have something that's way out of domain. It's just funny that it works. So this is what people call this universal computation engine. You just throw anything at it, and

455
01:12:05.400 --> 01:12:06.800
Benno Krojer: Yeah, sweet, strange.

456
01:12:07.740 --> 01:12:18.739
Benno Krojer: Did you try this on, like, a, like, time series? Like, like a video, like, frames of a video? Like, can it understand, like, if I… if I have, like, a sequence of images of, like, my dog running, like…

457
01:12:18.740 --> 01:12:28.960
Benno Krojer: So I haven't tried it, but it's one of my follow-ups that I would want to study, maybe, like, multi, like, actions across video frames, how are they encode it, do they change? Are they encoding? Are they… Causal knowledge, too, from, like…

458
01:12:29.390 --> 01:12:30.240
Benno Krojer: Yep.

459
01:12:30.530 --> 01:12:48.229
Benno Krojer: Oh, cool. So, like, I mean, this is a future question I'm curious about, like, right now, we find mostly nouns and adjectives as nearest neighbors. If you have videos, would we find more verbs and action phrases? Or would they encode actions also… would they encode actions also more as, like, a sequence of nouns and adjectives, or are actions implicitly encoded in languages… verbs?

460
01:12:48.350 --> 01:12:51.040
Benno Krojer: Just one, one idea, yeah.

461
01:12:51.400 --> 01:12:52.640
Benno Krojer: Do you… so you…

462
01:12:52.980 --> 01:13:07.899
Benno Krojer: In the experience where you fine-tune the LLM, do you only fine-tune the projector and the LLM, or do you also fine-tune the vision encoder? I think in that experience, only the LLM and projector, but you could also… you can do the vision encoder as well. It feels like…

463
01:13:09.790 --> 01:13:13.780
Benno Krojer: doing this kind of, like, grafting. Like, you're almost, like, splitting up the…

464
01:13:14.260 --> 01:13:33.319
Benno Krojer: what the representation has to do. Like, the vision encoder just has to write the information, and… Has to be available for, yeah. Yeah, and the LLM is just reading it, which is why it doesn't change throughout. And, yeah, it feels like there could be, like, a similar kind of analysis, but, like, in the vision encoders. Right, that they know, like.

465
01:13:33.420 --> 01:13:39.030
Benno Krojer: They can get even better at, like, presenting the information in a language-friendly way. Yeah, yeah.

466
01:13:39.290 --> 01:13:43.949
Benno Krojer: Is this result with, the LLM fine-tune, or is it just the MLP connection?

467
01:13:44.090 --> 01:13:56.650
Benno Krojer: This is just MLP connector. So when you, like, we looked at the same plot, I don't have it here on… for the off-the-shelf GUEN2 VL, there you see this trend, but much less pronounced. So, like, then at layer 0, it would connect to layer 4,

468
01:13:57.050 --> 01:14:08.720
Benno Krojer: But then at layer 1 onward, it would be very diagonal. So, like, it only… like, only at layer zero, you have to map to a bit of more contextual space, and then it's, again… But it would be the same thing, but starting… basically, only happening here, at the first one or two layers.

469
01:14:08.970 --> 01:14:16.080
Benno Krojer: image, and just as we are then describing.

470
01:14:16.950 --> 01:14:35.299
Benno Krojer: I think you can do that, because you can have a certain amount of visual tokens. Like, these architecture, kind of, like, as soon as you have an email, you know, it's like… But you can, you can do, like… I mean, that's the idea of, like, when you do patch scopes with a vision token, that's what you do try. You do in context learning, and you patch a single token in.

471
01:14:35.300 --> 01:14:39.450
Benno Krojer: And see if it works. It worked for us, but more on late layers.

472
01:14:40.140 --> 01:14:47.739
Benno Krojer: So you could, you could do it also, like, play with individual tokens, like, maybe take 4 tokens, because most concepts are more than one token, like, you know.

473
01:14:48.430 --> 01:14:58.870
Benno Krojer: So yeah, we already talked about the examples, so I can skip those maybe for time, but just the, sort of, on a qualitative level, now we get nice phrases, some more frequent interpretability that actually describe the thing usually quite well.

474
01:14:59.260 --> 01:15:17.030
Benno Krojer: And as I mentioned, visual text is sort of the most, sort of, reliably interpretable. There's intriguing quirks, also, of LogicLens, like, that it actually does next token prediction, versus, like, latent lens kind of gives you what is actually in the image, or, like, in this token. LogicLens will also give you, like, plausible next token prediction, like, couch potato.

475
01:15:17.030 --> 01:15:22.039
Benno Krojer: And it will even do next token production on the… the actual one, like Tomato here.

476
01:15:22.130 --> 01:15:35.530
Benno Krojer: Where you can argue that either it was already in the vision encoder and coded that… because this is, like, auto-aggressive, so it can't attend tomato, since the tokens only can attend here. So either it knew from the vision encoder this, or it saw it maybe up here in the Google Maps thing.

477
01:15:35.580 --> 01:15:44.669
Benno Krojer: Interesting. I don't know if that's a feature or bug of LogiteLens. I feel like it's sometimes more of a bug, but I don't know. I'm happy that Latent Lens gives us, like, directly couch here.

478
01:15:45.510 --> 01:16:03.960
Benno Krojer: We didn't study downstream effects too much, but this is something people often ask about. One thing we looked at, but didn't find good results, basically, was there's a lot of non-interpoled tokens, or not a lot, but, like, 20-30%. They usually have lower codes and similarity, they're in non-salient regions, so it's a bit like, maybe, could remind you of registered tokens.

479
01:16:04.050 --> 01:16:14.039
Benno Krojer: So we were wondering, are they useful downstream, or can they be thrown away? And we got very strange mixed results. Depending on the model, they would either be more useful than the interpretable ones, or much less useful.

480
01:16:14.310 --> 01:16:20.740
Benno Krojer: So… and we were… what we were doing, we were replacing them with the average image value across our dataset.

481
01:16:20.740 --> 01:16:35.609
Benno Krojer: And you didn't try to triangulate with registered tokens, right? So you don't know if the ones that you're throwing away are registered tokens or not? So I'm not an expert on them, but, like, my def… my intuition was that… or, like, what I understood from the paper is that they find them by just saying the ones with the high norm.

482
01:16:35.610 --> 01:16:47.790
Benno Krojer: So we checked that, and we didn't find a big correlation with whether these are the high norm or not high norm tokens. Or that was, like, one of the other collaborators where I told him, like, hey, can you check if these are registered tokens? And he came and said, like, yeah, with the high norm, I didn't find anything.

483
01:16:47.800 --> 01:16:52.149
Benno Krojer: Which is… because we thought maybe it's just the register tokens, yeah.

484
01:16:52.170 --> 01:17:09.320
Benno Krojer: So this is open still a questionnaire. I played with this kind of VLMs some time ago with saliency, and actually, these were coming up to, like, this kind of high-norm tokens, and in our case, we tried to just mask them, and the task was going to zero, so they were kind of like, you know…

485
01:17:09.320 --> 01:17:12.580
Benno Krojer: Playing this kind of very important role in the model.

486
01:17:12.610 --> 01:17:21.979
Benno Krojer: I wonder if it's the same thing. Yeah, they might contain things like abstract information. Yeah, yeah. So either… either they maintain global information, or since they're also the same across images.

487
01:17:22.170 --> 01:17:29.400
Benno Krojer: maybe it's more of a task, like, task vector that's like, now you're a captioning model. I don't know, that's just hypotheses, yeah.

488
01:17:31.250 --> 01:17:32.639
Benno Krojer: So yeah, I'm…

489
01:17:32.780 --> 01:17:47.120
Benno Krojer: the… coming back to the research question, kind of, the answers now to are these… does the LLM convert visual tokens into words? I would say, kind of, yes, the majority of visual tokens are close to meaningful words, and we had, kind of, we had to develop a new interp method to answer this. We… the existing ones didn't quite work for us.

490
01:17:47.120 --> 01:17:56.960
Benno Krojer: The takeaways are that, yes, this was surprising also for us, that we actually didn't, like, didn't think even this was, like, we… it contradicted our assumptions and intuitions and the literature, in a way.

491
01:17:57.060 --> 01:18:04.010
Benno Krojer: And I would recommend people at least try latent lens next to your usual logic lens if you're curious, like… of course, for vision tokens.

492
01:18:04.520 --> 01:18:12.429
Benno Krojer: Yes. Yeah, I'm, I'm, like, I'm releasing this as a library right now, there's, like, where it's very easy to, like, pip install latent lens.

493
01:18:12.660 --> 01:18:19.050
Benno Krojer: Well, we don't need that. After all, it's already there. How many activations do you need to search program?

494
01:18:21.020 --> 01:18:27.440
Benno Krojer: We get good results. We didn't do a great ablationist, but we did one ablation that was kind of interesting, and which was, like.

495
01:18:27.580 --> 01:18:29.120
Benno Krojer: It's a bit weird.

496
01:18:29.240 --> 01:18:32.470
Benno Krojer: This is some fish situation, I don't know.

497
01:18:32.720 --> 01:18:40.140
Benno Krojer: Basically, what I imagine latent lens v2 to look like is, like, you don't want to have a massive corpus, you want to dynamically generate sentences that maximize the cosine similarity.

498
01:18:40.480 --> 01:18:53.259
Benno Krojer: So we did a small experiment on a couple images, where we have a sort of evolutionary search, where the LM generates… an LM generates variants of the initial sentence. This was what our fixed corpus gave us, men wearing white stripes, which is already a good interpretation.

499
01:18:53.260 --> 01:19:01.260
Benno Krojer: And then, if you ask an LM to modify individual words, and then we take the top whatever, like evolutionary search, we get colorful fish with white stripes.

500
01:19:01.260 --> 01:19:02.809
Benno Krojer: Without seeing the images.

501
01:19:02.810 --> 01:19:12.289
Benno Krojer: No, without seeing… I just asked the LLM, like, hey, like, remove… like, add, remove, or replace tokens, like, 10 at a time, and then we have 6 rounds of it.

502
01:19:13.610 --> 01:19:26.330
Benno Krojer: maybe some sort of gradient search, or it's just, like, kind of, the codes and similarities, you're sort of what you want to… you throw away the other ones and keep the top three codes and similarity… Yeah, but, like, like, to… to…

503
01:19:26.820 --> 01:19:34.670
Benno Krojer: inform which changes you make to the sentence. It was just random. Just random sentence. Because it was just a one-day experiment that I quickly ran, so it was, like, just…

504
01:19:34.780 --> 01:19:47.200
Benno Krojer: But, like, what can… how can I make this better quickly? Is there some, like, jailbreaking literature where they use gradients to form which tokens it might change? Although, it's questionable whether that really helps that button anyway.

505
01:19:49.160 --> 01:19:58.390
Benno Krojer: Yeah, there's some fundamental insight, hopefully, here about also just the structure of visual language representations, and we will, like, I'm right now, like, it's already on GitHub, and I will… I'm…

506
01:19:58.590 --> 01:20:03.540
Benno Krojer: Hopefully in a couple days, it will be as a pip install latent lens you can just run, and it will be sort of pre-computed.

507
01:20:03.720 --> 01:20:12.519
Benno Krojer: databases of these contextual embeddings, because otherwise it is a bit more effort than LogicLens. It takes you half a day to run this, and to store, like, 20 gigabytes of contextual embeddings.

508
01:20:12.700 --> 01:20:18.689
Benno Krojer: But also we now use a smaller corpus that is more concepts from WordNet, so it will be a bit more efficient, yeah.

509
01:20:19.210 --> 01:20:38.809
Benno Krojer: You can send a tensor, you can send an activation, and then it could return to you the top. This must be great. Like, if this was hosted, like… We were thinking about hosting this, but then we were like, if people could host this, this would be amazing, I would be happy to help set this up.

510
01:20:39.310 --> 01:20:57.759
Benno Krojer: And I think it will be useful beyond vision, like, whether you do speech, or latent thinking, or even… even text, or multilinguality, maybe. V6 supports VLNs and diffusion, right? Yeah, or just this KNN databases of vectors in general, I think that's interesting, yeah. Yeah, yeah.

511
01:20:57.760 --> 01:21:06.940
Benno Krojer: Some lessons learned from this project, also for… I included this, maybe, for people especially starting in the field, or what I wish I had known at the beginning of my PhD: test your assumptions early.

512
01:21:07.000 --> 01:21:17.230
Benno Krojer: I didn't even think they were interpretable for a couple months and didn't even test it properly. I just believed the papers I was reading, kind of, so the field has not settled, models are different.

513
01:21:17.230 --> 01:21:37.409
Benno Krojer: interactive demos from day one can surface these assumptions early, along with cloud code. There's no excuse, like, just to build a demo, like, on whatever you work on. Yeah, and I'm a bit, like, I would recommend, maybe not to people here, because they're already doing it, but at least one interpretability project in your PhDs, I think, is a good exercise.

514
01:21:37.410 --> 01:21:41.370
Benno Krojer: to, like, compared to, like, just modeling or dataset work that I've done before more.

515
01:21:41.440 --> 01:21:51.300
Benno Krojer: And also when I start this, like, start with an interesting research question, not a particular interim method, I would say. Like, I… I'm plugging this, you know, like, I'm… every…

516
01:21:53.090 --> 01:22:12.739
Benno Krojer: I love it. It's actually very good advice. It's all good advice. And I usually try to capture these lessons, and like, in every paper, I include a behind-the-scenes section, where I say the goal of this section is not to make science more transparent and engaging, showing not just the polished paper at the end, but all the details and lessons learned, and then I just say, like, you know, like, here.

517
01:22:12.760 --> 01:22:22.659
Benno Krojer: So in, the… so the first author, like, originally vision work in visual language was the term to pivot towards understanding, so interpretability was the natural direction.

518
01:22:22.750 --> 01:22:41.699
Benno Krojer: But doing interrupts just for the sake of it did not feel like the right approach, so what was an actual fundamental question that people cared about it, and so on. So then I, like, explained, like, how we got there. Good idea. Yeah. You should highlight it, so we can all see the good advice even easier. Yeah, true, I added this, like, 10 minutes before the presentation, like…

519
01:22:41.780 --> 01:22:50.220
Benno Krojer: Yeah. There's some exciting follow-ups, as I mentioned, like, latent length 2 would look like this. Can we now interpret any non-linguistic token LMs?

520
01:22:50.390 --> 01:22:57.749
Benno Krojer: like, as I said, like, there's all these other things that people have done in the past, and I, like, I'm, like, currently exploring this a little with an undergrad.

521
01:22:58.340 --> 01:23:04.239
Benno Krojer: And how do we study multi-token concepts? Because especially, like, in language, it's already tricky, but especially in vision, like, it's even more…

522
01:23:04.480 --> 01:23:10.310
Benno Krojer: Like, tokens do not represent your object, they're just random parts, so we should get to methods that can do this as well.

523
01:23:10.510 --> 01:23:30.360
Benno Krojer: I have a roadmap now what I actually want to work on in the future, but we're also, like, quite over time, so I'm happy to, like, skip… Okay. Okay, okay, we have until 2. Okay, but if people want to… need to leave, that's also fine, like, I think now is maybe a good time. So yeah, as I said, my North Star is Unified Multimodal Models,

524
01:23:30.380 --> 01:23:33.009
Benno Krojer: And, I mean, the details are,

525
01:23:33.090 --> 01:23:47.500
Benno Krojer: open, but what I broadly imagine is just a flexible modality agnostic reasoning model you can say, to reason deeply, and it just generates any, flexibly, any tokens it needs to do this, whether that's text tokens, vision, speech, or even latent, like, thinking. That's kind of the broad vision.

526
01:23:47.880 --> 01:23:56.170
Benno Krojer: And I would say, like, in the future, maybe in a couple years, we will not even think about anymore, like, whether, like, the type of tokens, and we just assume the model generates some tokens in its thinking.

527
01:23:56.240 --> 01:24:04.879
Benno Krojer: So how do we get there? What's the roadmap towards this? And, or, like, what would I be excited about? And how can interpretability guide us, or me and my work here?

528
01:24:04.880 --> 01:24:17.319
Benno Krojer: I would say the basis has to be, like, just working on better unified architectures and training, and then there's some downstream things that get enabled from it. I'm just showing two here as an example, modality agnostic reasoning and multimodal embedding extraction.

529
01:24:17.570 --> 01:24:21.700
Benno Krojer: And along the way, I think Interp can guide us as a compass.

530
01:24:22.530 --> 01:24:29.120
Benno Krojer: So, for the first one, it's actually, like, when you want to think about future architecture, it's good sometimes to look back to history, how things have evolved.

531
01:24:29.340 --> 01:24:48.310
Benno Krojer: In visual language, and it's funny to see that it has actually… there's no clear trend towards more unification. When Vision… vision birds came up in 2018, they were more unified than past, I would say past models. But then, when multimodal LLMs came around, they were, I would argue, less unified. They were, like, you take two pretense models to stitch them together a little. Then.

532
01:24:49.220 --> 01:24:51.919
Benno Krojer: Everything else is expensive. Yeah, yeah.

533
01:24:51.920 --> 01:25:11.389
Benno Krojer: But then people said, like, in 2024, especially Luke Settlemore's group at Meta, like, we need native unified training. They developed Chameleon and so on, and it didn't work that well. Luke is quite honest about this in his talks. So people did partially, I would say what people are doing now is partially unified. So we can see the field has not really settled at all on, like, what's the best way to unify architectures.

534
01:25:11.410 --> 01:25:18.589
Benno Krojer: Should we train from spread? Should we do shared weights? What are the objectives? Should it be the same one, or diffusion with next token?

535
01:25:20.110 --> 01:25:34.170
Benno Krojer: So we need really fair ablation studies here, which is, to be fair, hard on an academic budget, but maybe we can still get there. And on the way also to interpretability, like, are these, like, whatever changes we make in the ablations, like, are they leading to more internal unification or synergy between the…

536
01:25:34.250 --> 01:25:41.950
Benno Krojer: the modalities. And there's some recent evidence, for example, in the opposite direction, in a way, that there was a paper, like, from, I think, NeurIPS 2025,

537
01:25:42.070 --> 01:25:54.949
Benno Krojer: that when you do fully unified training from scratch, sort of on both modalities, no, like, pre-trained LLMs, you get lower internal unifications, like, the modalities are separate, and actually just a single end of image token is mostly attended by the text, not the rest of the image.

538
01:25:55.190 --> 01:26:03.749
Benno Krojer: So that's… but then they show, like, if you loosen some of the assumptions of fully unification, and you start from a pre-turn LLM, it's much better, sort of, from an intro perspective.

539
01:26:03.890 --> 01:26:05.749
Benno Krojer: So this is, you know…

540
01:26:06.810 --> 01:26:25.040
Benno Krojer: Then multimodal embedding extraction is just one sort of good testbed from, are your representations now, unified? The goal would be here, let's say you have a very flexible, promptable embedding that summarizes your multimodal context, so for search and so on. And on the text-only side, this has worked quite well, like, my lab at, like, with SIVA has released LLM2Vec, like, one or two years ago.

541
01:26:25.040 --> 01:26:33.199
Benno Krojer: And my hope would be, if a model is very unified, it should be much easier, like, out of the box to make it… give you, like, a multimodal embedding that summarizes everything.

542
01:26:33.200 --> 01:26:39.319
Benno Krojer: If it's not unified to representation, maybe it would be very hard that you can ask your LLM to do that, and you'd need a lot of, sort of, additional fine-tuning.

543
01:26:39.760 --> 01:26:46.749
Benno Krojer: So, here would be interesting from an inter-perspective, how does the final end of secrets token actually aggregate all this information into one summarized vector?

544
01:26:47.590 --> 01:26:54.599
Benno Krojer: And then, that's the one I think is the most exciting one, is modality agnostic reasoning. Again, a good testbed, like, you need good reputations to reason over.

545
01:26:54.710 --> 01:27:09.809
Benno Krojer: And here, I think the goal would be to have a model that actually flexibly can choose itself to switch between it. Right now, like, we already see a bit of, like, visual and latent reasoning research, but it's, like, very often we say, like, here's a visual talk, so you have to use visual reasoning. But yeah, the model should just decide this itself.

546
01:27:09.900 --> 01:27:28.759
Benno Krojer: And here, I think here are the most fascinating interop questions, like, how similar are these different thinking tokens? Like, does latent thinking or visual thinking still… are these, like, tokens still very similar to language, like what we found now in… with latent lens? Or are they sort of fundamentally different? And similarly, are there reasoning patterns that we know from language, like the idiosyncrasies of reasoning.

547
01:27:28.760 --> 01:27:32.730
Benno Krojer: like, backtracking and whatever. Is it the same in visual or latent reasoning, or is it…

548
01:27:32.950 --> 01:27:38.239
Benno Krojer: Like, a different form of reasoning, like, that's maybe more like what we would call visual… visualizing spatial reasoning.

549
01:27:39.130 --> 01:27:53.410
Benno Krojer: But there's also many other directions I'm happy to collaborate on, like, as I said, to interp, inserts and tools generalize to multimodality, like the ones we know from LLMs, revisiting the Platonic reputation hypothesis, as I mentioned, understanding not just visual reasoning, but latent reasoning.

550
01:27:53.620 --> 01:28:02.979
Benno Krojer: interpretability of world models, I think maybe you're working on more. And what I added also, like, yesterday is implications of cloud code moment on interp research, or on research in general.

551
01:28:04.860 --> 01:28:22.409
Benno Krojer: some food for thought at the end, so these are questions that motivated me to go into research and still kind of motivate me when I philosophize. What cannot be expressed in language? Are there things we can't express in language, or can everything be expressed? Like, Wittgenstein has thought about this. The best quote I could find from him was, if a lion could talk, we could not understand him.

552
01:28:23.020 --> 01:28:40.319
Benno Krojer: And on vice versa, can the physical world be understood from text alone? Is the abstract thought without language even… so is language, like, the one thing you need? And this relates to the grounding question that Harnard formalized in the 1990s, and other people as well. So how can the meaning of meaningless symbols be grounded in anything but other meaningless symbols?

553
01:28:40.480 --> 01:28:50.219
Benno Krojer: So I'll leave you with that food for thought, and yeah, thank you for listening. These are, like, the people that collaborate with me, and here's the demo, and also, hopefully soon, pip install latent lens.

554
01:28:56.760 --> 01:29:01.689
Benno Krojer: Cool, yeah, if there's… I mean, we've already had a great discussion, but if there's even more questions, let me… let me know.

555
01:29:03.620 --> 01:29:05.419
Benno Krojer: Yeah, whoever was first.

556
01:29:06.530 --> 01:29:18.889
Benno Krojer: Like, can you go back to the Wickenstein quote? Yep. Well, my question spawns before you said the Wittgenstein, but I think that maybe it was a way of testing the Wittgenstein quote.

557
01:29:19.220 --> 01:29:24.039
Benno Krojer: So, I was thinking, like, These vision models,

558
01:29:24.360 --> 01:29:29.399
Benno Krojer: You know, okay, they've processed the image, they… what you're showing.

559
01:29:29.770 --> 01:29:41.519
Benno Krojer: is that, like, for each patch, you know, the thing it produces is sort of, you know, like a… you could think of it maybe as a text summary of that patch. Yep. And then this is sort of how the language model

560
01:29:41.860 --> 01:29:45.700
Benno Krojer: it's gonna work with it. If you ask it questions about the image, like.

561
01:29:45.820 --> 01:29:52.360
Benno Krojer: it's gonna look back to those patches, which are approximately, like, you could think of as text summaries. So, like…

562
01:29:53.030 --> 01:30:02.029
Benno Krojer: Are there things that… this multimodal LLM can do that it can't do if you just, let's say.

563
01:30:03.780 --> 01:30:06.429
Benno Krojer: experiment, like,

564
01:30:06.840 --> 01:30:20.470
Benno Krojer: you literally replace the image token with the text either embedding. That was one ablation we wanted to run and didn't have time to do properly. Or just, like, at the beginning, give the…

565
01:30:20.490 --> 01:30:36.530
Benno Krojer: give, like, a summary of the image, a caption of the image, and then ask it questions about the image. Are there, like… I mean, I think giving it the caption is, like, that's definitely gonna work, that's, like, almost bleeding, because someone has to generate this caption. Well, but, like, aren't… shouldn't there be things that, you know.

566
01:30:36.530 --> 01:30:43.129
Benno Krojer: aren't capturable by a caption or something like this? That… like, are there… are there things…

567
01:30:43.180 --> 01:30:45.940
Benno Krojer: That this model, that this multi-model

568
01:30:46.630 --> 01:30:57.520
Benno Krojer: multimodal model can do that it can't do by just, like, bottlenecking image to caption, and then from the caption, answering a question. I think, in theory.

569
01:30:57.650 --> 01:31:09.809
Benno Krojer: there should be things that are just much easier, like, I mean, they look similar to language, but I think there is still some component in that vector that is a bit… encoding something more continuous, or, like, visual. But I think in practice, it's not… What is that stuff?

570
01:31:10.200 --> 01:31:19.669
Benno Krojer: I mean, it could just be, like, the… for example, the exact… I mean, I don't think that's actually happening in practice, but it could be the exact angle at which things are at. I don't think that's happening at this level, you need to train it, like.

571
01:31:19.740 --> 01:31:39.350
Benno Krojer: Much more than just in a frozen way. But… and you can… in principle, of course, you could explain in language, like, exactly how the angles of things and the depth of things are. It's just a very inefficient way, often. So yeah, I feel like it'd be interesting to try to find a task… Rohan should talk about this. Like, to try to find a task where, like.

572
01:31:39.760 --> 01:31:41.540
Benno Krojer: The multimodal model

573
01:31:41.880 --> 01:31:58.359
Benno Krojer: can do it really well, but if you convert… if you, like, introduce a text bottleneck, like, converting the image to a text, and then you just give the text to an LLM, it can't do the task. And this would get at something that's sort of…

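The comparison being proposed could be sketched with toy stand-ins. Nothing here is a real model API — the dictionary-as-image and all three function names are illustrative assumptions — but it shows the shape of the probe: the multimodal path reads the full image representation, while the bottlenecked path only sees what the captioner kept.

```python
# Toy sketch of the "text bottleneck" probe discussed above.
# `caption_image`, `answer_from_caption`, and `answer_multimodal` are
# hypothetical stand-ins; a dict of attributes stands in for an image.

def caption_image(image):
    # A captioner keeps high-level content but tends to drop
    # continuous detail, like the exact angle of objects.
    return {k: v for k, v in image.items() if k in {"objects", "scene"}}

def answer_from_caption(caption, question):
    # The LLM only sees what survived the caption bottleneck.
    return caption.get(question, "unknown")

def answer_multimodal(image, question):
    # The multimodal model reads the image representation directly.
    return image.get(question, "unknown")

image = {"objects": ["clock", "building"], "scene": "city",
         "angle_deg": 37}  # continuous detail a caption rarely states

print(answer_multimodal(image, "angle_deg"))                   # 37
print(answer_from_caption(caption_image(image), "angle_deg"))  # unknown
```

A task where the first answer is right and the second is not would be evidence that something in the image representation escapes the caption bottleneck.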
574
01:31:58.380 --> 01:32:02.969
Benno Krojer: undescribable, or like… Or at least very inefficiently describable. Yeah, right.

575
01:32:03.360 --> 01:32:05.300
Benno Krojer: It's coming to a question of the, like,

576
01:32:05.390 --> 01:32:13.349
Benno Krojer: compression, right? Because also, in theory, if we go to multimodal RAG and those kinds of fields. So the question is how you compress the image. So if you write everything that

577
01:32:13.350 --> 01:32:30.649
Benno Krojer: anyone can ever ask about the image in a caption, then it will be able to answer it. But if you want to represent it in, I don't know, like, a thousand-dimensional embedding, then you need to choose what information to keep. So then the question is whether you know the question beforehand, and then you can extract the specific information from the image.

578
01:32:30.650 --> 01:32:34.070
Benno Krojer: Or not. I think this is, like… In some sense, that's what it's failing to do.

579
01:32:34.470 --> 01:32:38.060
Benno Krojer: Like, when it looks at the question, it extracts the red button.

580
01:32:38.210 --> 01:32:39.000
Benno Krojer: Exactly.

581
01:32:39.340 --> 01:32:44.050
Benno Krojer: Well, you could have… what if you had, like, an LLM… give it two images.

582
01:32:44.910 --> 01:32:52.339
Benno Krojer: Like, then, it feels like, oh, you might have, like, a minute difference. Like, you could say, is it the same person?

583
01:32:53.110 --> 01:32:59.580
Benno Krojer: In these two images, and it seems like it might be really hard. That's a good… that's a good example, yeah, yeah.

584
01:32:59.890 --> 01:33:04.059
Benno Krojer: But definitely, if you have, like, an image captioner.

585
01:33:04.540 --> 01:33:08.080
Benno Krojer: I think you can always find, like, the adversarial question.

586
01:33:08.190 --> 01:33:15.790
Benno Krojer: That you would be able to answer about the image that you can't answer from the caption. Like, there's always something… Something that the caption missed, yeah.

587
01:33:16.660 --> 01:33:31.029
Benno Krojer: But I think this is a good example, like, we humans also are very good at telling faces apart, so we could probably tell, even with twins, that one twin looks very, like, slightly different, but putting that into words is very hard. And it gets a bit, actually, at this ImageCoDe task here as well, like, what is the smallest difference you can still put into words?

588
01:33:31.190 --> 01:33:46.509
Benno Krojer: Like you said, if you have a lot of objects in the image, if you want any, like, any angle between two objects, try to describe it may be very hard, right? Or, like, you have, like, a lot of, like, emotions that are coming from lightings, or…

589
01:33:47.240 --> 01:33:52.909
Benno Krojer: Even without a pair of images, like, there is a lot of stuff that you…

590
01:33:53.050 --> 01:33:55.819
Benno Krojer: I think feel when we see an image.

591
01:33:56.500 --> 01:34:00.079
Benno Krojer: But you cannot actually describe it. Yeah, but can the MLP get it out?

592
01:34:00.560 --> 01:34:14.129
Benno Krojer: Yeah, I was gonna… I was gonna say, like, I think this just works because it's also the task of captioning that we are, like, training and evaluating it on. Like, if it was other tasks, like, even just simple spatial reasoning, it might be… it's already breaking, especially in the frozen setup.

593
01:34:14.770 --> 01:34:15.600
Benno Krojer: You could maybe…

594
01:34:15.930 --> 01:34:23.240
Benno Krojer: have multiple embeddings, because the standard MLP would also, like, an early exit kind of skip connection to, like, capture both, like, a low-level

595
01:34:23.440 --> 01:34:24.110
Benno Krojer: Mhm.

596
01:34:24.210 --> 01:34:28.820
Benno Krojer: you know, the angle question, compositionality, and also the eye level sentences.

597
01:34:29.520 --> 01:34:31.060
Benno Krojer: Yeah, it is…

598
01:34:31.370 --> 01:34:44.499
Benno Krojer: It's, like, very… apparently, like, way embedded. Yeah, with this kid. Oh, so you're… Oh, you mean, like, using early embeddings from the vision encoder? Yeah, yeah, yeah, like, from multiple different, like, depths to get, like, different levels of science.

599
01:34:44.500 --> 01:34:54.800
Benno Krojer: We ran this, but never looked at it much. Earlier ViT layers, yeah. That's a very good question. But we didn't study it much, so we just ran it.

600
01:34:54.840 --> 01:35:03.029
Benno Krojer: We ran it briefly with earlier ViT layers, but we didn't… I think I ran it, and then I didn't see major differences in, like, the kind of nearest neighbors. Maybe because you have the wrong kind of captions.

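That multi-depth probe could be sketched roughly like this, with random arrays standing in for real activations (the dimensions and the cosine-similarity lookup are assumptions, not the actual setup): take the image-token state at several encoder depths and, for each, look up its nearest neighbors in the LLM's text-embedding table.

```python
import numpy as np

# Random stand-ins for real activations: one projected image-token
# vector per encoder layer, plus an LLM text-embedding table.
rng = np.random.default_rng(0)
n_layers, dim, vocab = 12, 64, 1000
layer_states = rng.normal(size=(n_layers, dim))
text_embeds = rng.normal(size=(vocab, dim))

def nearest_text_tokens(v, k=5):
    # Cosine similarity of one vector against every text embedding.
    sims = text_embeds @ v / (np.linalg.norm(text_embeds, axis=1)
                              * np.linalg.norm(v) + 1e-9)
    return np.argsort(-sims)[:k]

# Compare what an early versus a late layer is closest to in text space.
early_ids = nearest_text_tokens(layer_states[1])
late_ids = nearest_text_tokens(layer_states[-1])
print(early_ids, late_ids)
```

The hope would be that early depths land near low-level vocabulary (colors, edges, angles) and late depths near object- or scene-level words — though, as noted, that likely depends on what captions the projector was trained on.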
601
01:35:03.160 --> 01:35:14.940
Benno Krojer: Because we're training on captions, yeah. If the caption said, oh, we have all, you know, upper right diagonal lines, you know, high contrast colors, stuff like that, then you might, you might be like, oh, we have a good layer 1,

602
01:35:15.160 --> 01:35:30.269
Benno Krojer: Yeah. That would be really nice. Light source, this and that, or this picture, like, this patch gives me the emotion of, I don't know. Yeah, where's the limit of this? But are all features of images…

603
01:35:30.330 --> 01:35:38.010
Benno Krojer: describable in language? Like, or does there exist some stuff that, like, actually, no matter what caption you use.

604
01:35:38.890 --> 01:35:42.800
Benno Krojer: I don't know, you could eventually write a whole research paper about the thing.

605
01:35:42.980 --> 01:35:54.000
Benno Krojer: How do you… how would you express the thing that's not expressible in language? Do you write the paper about that? It's… I feel like it's almost… because right now we're trying to express something, too. I guess it's almost hard to, like, talk about it, right? Like…

606
01:35:54.240 --> 01:36:12.689
Benno Krojer: I would say, like, for me, honestly, it might be… but it might be… but, like, I was gonna say, like, for me, the… personally, the easiest answer is something emotional, where, like, someone else might never be able to relate to some experience I had, emotionally. Like, I think visual stuff I can probably put into words on some level, if I take a lot of words.

607
01:36:13.040 --> 01:36:27.110
Benno Krojer: But… How to ride bicycle? Yeah, like, if it's more, like, real-world experience, like, like that… Yeah, tell you about my hackathon project. Yeah. I think… so my hackathon project is gonna be,

608
01:36:27.430 --> 01:36:31.200
Benno Krojer: Instead of using a vision model, I'm gonna take a chess model.

609
01:36:31.510 --> 01:36:51.120
Benno Krojer: A chess model. Yeah. And I want… I want to train a, like, hybrid, thing to… I want the chess model to explain what it's… So, connecting the chess model to the LLM, like, almost in a similar way to here. Yeah. But here, it's, like, I think it's sort of interesting, because if you look at a given patch of the, like, the patch.

610
01:36:51.190 --> 01:37:02.969
Benno Krojer: Instead of image patches, it's like a board. So actually meaningful patches now? Yeah, but it's like a board, like, A1. It has a rook on it, or something, right? And so, like, if I…

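The projector part of that hackathon idea could be sketched as below, under assumed shapes — nothing here is a real chess model or LLM, and the two-layer MLP is just the standard projector recipe: each of the 64 squares plays the role of an image patch, and an MLP maps the chess model's per-square states into the LLM's embedding dimension, one pseudo-token per square.

```python
import numpy as np

# Random stand-ins: per-square chess-model states and an MLP projector.
rng = np.random.default_rng(1)
chess_dim, hidden, llm_dim = 128, 256, 512
square_states = rng.normal(size=(64, chess_dim))  # one row per square (A1..H8)

W1 = rng.normal(scale=0.02, size=(chess_dim, hidden))
W2 = rng.normal(scale=0.02, size=(hidden, llm_dim))

def project(states):
    # Same recipe as an image-token projector: two linear maps with a
    # nonlinearity in between (ReLU here for simplicity).
    return np.maximum(states @ W1, 0.0) @ W2

chess_tokens = project(square_states)
print(chess_tokens.shape)  # (64, 512): one soft token per board square
```

Training the projector (e.g. against commentary text, analogous to captions) is the open part; the point of the sketch is just that the board-as-patch-grid analogy drops straight into the same architecture discussed in the talk.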
611
01:37:03.160 --> 01:37:09.840
Benno Krojer: if you just write the description of the board, like, okay, you know what pieces are on the board. Yeah. But, like…

612
01:37:09.950 --> 01:37:15.220
Benno Krojer: you, like, an LLM might not know what, like, higher level concepts

613
01:37:15.270 --> 01:37:24.250
Benno Krojer: are there, or something. Like, the higher level concepts are not all captionable or something. Like, you could caption them, but…

614
01:37:24.250 --> 01:37:40.230
Benno Krojer: Well, like, like, white is attacking… is threatening the queen, or, like, there's a queen side attack, or, like, even white is winning, is it? Wait, can you convince me that, like, a specific setting on your board, is that fully?

615
01:37:40.290 --> 01:37:42.400
Benno Krojer: describable in language?

616
01:37:43.180 --> 01:37:43.920
Benno Krojer: No.

617
01:37:44.200 --> 01:37:58.930
Benno Krojer: But there are, like, so many concepts in chess, or, like, possible things that you could point out. It's about secrets of upcoming moves. Right, but why do you… you just need to understand the logic, and not necessarily, like, the right language for me?

618
01:37:59.250 --> 01:38:11.170
Benno Krojer: I think it's… I'll think of it more like, is it efficiently describable in language? Like, I think you can describe it, but is it the most efficient way? Like, probably not. Yeah, I guess what I'm getting at is, like, with an image…

619
01:38:11.290 --> 01:38:14.719
Benno Krojer: Like, you can take each patch and describe each patch.

620
01:38:15.220 --> 01:38:20.939
Benno Krojer: And then from looking at all of those, you can… Have a description of… of the thing.

621
01:38:21.140 --> 01:38:24.619
Benno Krojer: With chess, like, if you take a square and you know that there's a rook there.

622
01:38:24.740 --> 01:38:35.229
Benno Krojer: And then you take the… like, it doesn't really tell you what's going on at the proper level of, I can argue the same for images. That's right, because… is that a column?

623
01:38:35.230 --> 01:38:56.510
Benno Krojer: Right? You can't tell unless you actually see it's a building, right? Otherwise, if you really just looked at the patch, it could be a piece of a refrigerator, right? But I also… I think this would be more of a fact that just your chess model has been fine-tuned on this to learn these concepts, whereas your LLM might not have been fine-tuned. Like, if you fine-tune your LLM to learn these concepts with, like, thousands of chess games, like, maybe it would also learn them. Just from looking at the board. Yeah, like the textual description of the board.

624
01:38:57.340 --> 01:39:03.790
Benno Krojer: It's a good project. Don't dissuade him, it's a good project. No, no, yeah, I'm willing to take bets, I think it's gonna work.

625
01:39:03.950 --> 01:39:15.559
Benno Krojer: That you, like, the mapping it into the elements. Right, and then… and then you… then you'll have to train a latent lens for it. Yeah, this would be interesting, if it maps to, like, sentences about chess, so… Yeah, I'll send you the blog post tomorrow.

626
01:39:15.820 --> 01:39:23.160
Benno Krojer: It's a one-day project. Two-day. Two-day project. So wait, the questions you're asking are…

627
01:39:23.680 --> 01:39:26.539
Benno Krojer: really different from what the BLM is asking for.

628
01:39:26.730 --> 01:39:28.690
Benno Krojer: It's like asking if this image…

629
01:39:28.860 --> 01:39:30.610
Benno Krojer: If it's going to rain tomorrow.

630
01:39:31.270 --> 01:39:35.219
Benno Krojer: Or it's the builder that built the… The cloth done.

631
01:39:35.740 --> 01:39:48.120
Benno Krojer: Yeah, it's not the naturalistic setting of using it, that's what you're saying? Yeah, the visual encoder is trained to say that's a clock, and that's a building, there's a shadow. So this is… you're characterizing the MLP.

632
01:39:48.120 --> 01:40:11.670
Benno Krojer: Right, and that's how… that's what the vision… that's what this… this MLP is trained on. Now, you could… you could have trained the vision encoder on something else, right? Yes. Are you talking about chess still, or… But I'm saying the vision encoder. Yes, yes. Oh, the underlying vision encoder. So there's two parts. There's… there's, like, three things that have learned something. So, task, it's just captioning the image, but the database chess model listens to playing chess.

633
01:40:11.800 --> 01:40:20.530
Benno Krojer: Knowing if it's winning or losing. That's right. It's a sequential model. That's right. And he wants to ask the question, given a vote state, does it know which…

634
01:40:20.760 --> 01:40:25.469
Benno Krojer: what side of attack is happening, who's going to win? Right, right. So, I think these are, like.

635
01:40:25.470 --> 01:40:40.880
Benno Krojer: Slightly different question. I don't know how you recapture them, I don't know where the captions come from. I don't think it would be… captioning would be objective to train the LM. It would just be, like, again, predicting… I don't know, I am, but… I can… Okay, okay. Okay, okay. Fair enough. That'll be the next… that'll be next week's time.

636
01:40:40.880 --> 01:40:53.189
Benno Krojer: All right. Oh, one more question before? One closing visual question. So, I like the… the push for having unified architectures, more unified, more unified architecture. Yep.

637
01:40:54.190 --> 01:40:57.900
Benno Krojer: But also, you have this, vision to become interpreter.

638
01:40:58.030 --> 01:40:59.739
Benno Krojer: Like, don't you think they…

639
01:40:59.880 --> 01:41:05.520
Benno Krojer: Or, like, end of our covenant, like, maybe if you become more unified, they'll become less of interpretable.

640
01:41:06.580 --> 01:41:07.980
Benno Krojer: Mmm…

641
01:41:09.130 --> 01:41:20.380
Benno Krojer: That's a good question, yeah. There was a paper where you talked about, like, pre-training, where it, like, decreased the ability, right? Yeah, so, like, the Chameleon and EMO are two models, they're, like, that are fully natively multimodian, they're…

642
01:41:20.520 --> 01:41:23.820
Benno Krojer: But they don't perform very well. They don't perform well?

643
01:41:23.960 --> 01:41:34.850
Benno Krojer: Yeah. Like, if you got a really, truly unified body… I think it boils down to, like, whether we think of language as the modality of, like, an interpretable explanation or not. Like, you're thinking, like.

644
01:41:35.160 --> 01:41:39.870
Benno Krojer: Like, they might be less interpret in the sense that we humans want mostly textual explanations.

645
01:41:40.250 --> 01:41:45.220
Benno Krojer: So in that sense, maybe there will be less, yeah. So do you think, like, there's a, like, a…

646
01:41:45.960 --> 01:41:50.010
Benno Krojer: A side of research that's ready to be born, like a vision intervention.

647
01:41:50.120 --> 01:41:53.879
Benno Krojer: Something where you're just integrating with images, not text.

648
01:41:54.190 --> 01:42:02.369
Benno Krojer: Yeah, I mean, this exists as far as I know, but, like, well, like, not exists, but I mean, this is a field, like, yes, I feel like this would be… would be nice, and maybe there will be…

649
01:42:02.430 --> 01:42:17.780
Benno Krojer: tokens where, like, you can only interpret them via that, like… Yeah, yeah, finding nearest images, like, for that, or, like… I think, profits, scaling monosyticity at this, like, they…

650
01:42:18.040 --> 01:42:31.340
Benno Krojer: like, for a given essay featuring a language model, they… or in a multimodal language model, they not only searched over maximum text activations, but also max activating image. Also, like, for Golden Gate feature, like.

651
01:42:31.440 --> 01:42:40.849
Benno Krojer: You see pictures of the Golden Gate. Pretty interesting. Nice. That's cool, yeah. So yeah, you can do the opposite here as well, like, latent lens on finding nearest neighbors, image neighbors, two text tokens.

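That reverse-direction lens could be sketched like this, with random arrays as stand-ins for real patch and token embeddings (shapes are assumptions): instead of asking which text tokens an image patch is near, take one text token's embedding and rank image patches by similarity to it.

```python
import numpy as np

# Random stand-ins for image-patch embeddings and one text-token embedding.
rng = np.random.default_rng(2)
dim = 32
patch_embeds = rng.normal(size=(576, dim))  # e.g. a 24x24 patch grid, flattened
text_vec = rng.normal(size=dim)

def max_activating_patches(v, k=3):
    # Cosine similarity of the token against every patch embedding.
    sims = patch_embeds @ v / (np.linalg.norm(patch_embeds, axis=1)
                               * np.linalg.norm(v) + 1e-9)
    return np.argsort(-sims)[:k]

top_patches = max_activating_patches(text_vec)
print(top_patches)  # indices of the patches most aligned with the token
```

Mapping the returned indices back onto the patch grid would give a crude heatmap of where in the image a given word "lives", in the spirit of the max-activating-image search mentioned above.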
652
01:42:41.280 --> 01:42:44.230
Benno Krojer: like, Yep.

653
01:42:45.400 --> 01:42:46.690
Benno Krojer: or chess things.

654
01:42:46.820 --> 01:42:50.819
Benno Krojer: Thank you very much, Janelle. Thank you.

655
01:42:52.260 --> 01:43:02.649
Benno Krojer: Yeah, I'm glad we had so much time, because usually, like, I try to, like, nail it, like, in practice, like, to exactly 50 minutes. No, it's very, very well… we're very good at messing up a very well-tuned talk, so it's.

