WEBVTT

00:00:01.000 --> 00:00:04.000
So…

00:00:04.000 --> 00:00:06.000
Today, we have…

00:00:06.000 --> 00:00:08.000
Uh, not a visiting speaker.

00:00:08.000 --> 00:00:10.000
Nikhil Prakash?

00:00:10.000 --> 00:00:12.000
Um, and uh, so…

00:00:12.000 --> 00:00:14.000
Nikhil is actually one of…

00:00:14.000 --> 00:00:17.000
is a very unusual case.

00:00:17.000 --> 00:00:19.000
For a PhD student.

00:00:19.000 --> 00:00:25.000
Go ahead. Because…

00:00:25.000 --> 00:00:30.000
Because, you know, when he arrived, uh, sort of his first month here,

00:00:30.000 --> 00:00:35.000
Uh, we sat down, we chatted about what kinds of things he was interested in studying,

00:00:35.000 --> 00:00:37.000
And we looked at a whole variety of different things.

00:00:37.000 --> 00:00:43.000
And, um, and he pretty quickly says, you know, that right there, this topic is interesting to me.

00:00:43.000 --> 00:00:46.000
What is that? Theory of mind.

00:00:46.000 --> 00:00:48.000
Right? Do… do… do, uh…

00:00:48.000 --> 00:00:52.000
Do these AI systems that we're building?

00:00:52.000 --> 00:01:00.000
Do they think like humans? Do they look at other agents and think about what they're thinking? Do they have a theory of mind? Can they… can they… can they…

00:01:00.000 --> 00:01:02.000
put themselves…

00:01:02.000 --> 00:01:04.000
In the… in the shoes of others.

00:01:04.000 --> 00:01:06.000
That's what I'd like to study.

00:01:06.000 --> 00:01:11.000
And… and uh… and so he… he started working on it right away.

00:01:11.000 --> 00:01:16.000
Um, but from the beginning, it was, like, challenges. Like, the models that were available,

00:01:16.000 --> 00:01:20.000
his first year. They had no theory of mind.

00:01:20.000 --> 00:01:23.000
We've had some papers about this.

00:01:23.000 --> 00:01:29.000
Right? Like, they just, they couldn't do it at all. And so it's taken a few years to get to the point where models maybe credibly have some…

00:01:29.000 --> 00:01:31.000
some evidence of even having this.

00:01:31.000 --> 00:01:36.000
But… but the interesting thing, and the thing that makes him unusual, is that

00:01:36.000 --> 00:01:40.000
If I… yeah, if I look at him now, you know, a few years later,

00:01:40.000 --> 00:01:45.000
Oh, he's, like, working on the thing that he said that he was gonna work on.

00:01:45.000 --> 00:01:48.000
And so, um…

00:01:48.000 --> 00:01:51.000
So today, he's not going to talk about a theory of mind

00:01:51.000 --> 00:01:54.000
paper, per se. But it's very closely related.

00:01:54.000 --> 00:02:02.000
It's very closely related. It's binding. Right? Which is, which is a… which is a large aspect of the theory of mind task.

00:02:02.000 --> 00:02:05.000
Um, and he's looking at an interesting setting.

00:02:05.000 --> 00:02:08.000
Um, it's not language models. And so, uh, so…

00:02:08.000 --> 00:02:11.000
Um, now, I think that he also is…

00:02:11.000 --> 00:02:13.000
gonna do the talking in an unusual way, because…

00:02:13.000 --> 00:02:19.000
Are you? No. No, you're not. No. So, I've been asking him to…

00:02:19.000 --> 00:02:22.000
Um, practice his, his, his, uh…

00:02:22.000 --> 00:02:28.000
A sort of talk technique and trim down his talks to really short, really polished,

00:02:28.000 --> 00:02:32.000
you know, to, like, you know, the kind of, like, more formal presentations.

00:02:32.000 --> 00:02:37.000
Um, I thought, well, maybe he could practice that today, but it sounds like he's not going to practice that today, so we'll see.

00:02:37.000 --> 00:02:42.000
But welcome, welcome, Nikhil.

00:02:42.000 --> 00:02:45.000
Yeah, thank you for that great introduction. Um…

00:02:45.000 --> 00:02:51.000
Yeah, unfortunately, I… so, I had prepared the slides by yesterday, but then I…

00:02:51.000 --> 00:02:53.000
But yeah, and I had, uh…

00:02:53.000 --> 00:02:56.000
chat with David about the slides.

00:02:56.000 --> 00:02:59.000
And he gave me a lot of feedback.

00:02:59.000 --> 00:03:01.000
And I just couldn't incorporate

00:03:01.000 --> 00:03:06.000
all of the feedback in, like, 3, 4, 5 hours.

00:03:06.000 --> 00:03:19.000
That's the… if… effectively, that's the time that I had. So, I started making the changes in the slide, but in the middle, I figured out I wouldn't be able to complete it. I mean, the kind of quality that I think you were…

00:03:19.000 --> 00:03:26.000
trying to push me towards that was too much for me to get in 3 or 4 to 5 hours. Okay.

00:03:26.000 --> 00:03:28.000
So, I did a few changes.

00:03:28.000 --> 00:03:35.000
In fact, you might see that there might be a significant difference in the quality of the slides.

00:03:35.000 --> 00:03:39.000
like, the earlier slides versus the later slides. Um…

00:03:39.000 --> 00:03:45.000
But hopefully, I think I should be able to explain the ideas and the main research from the paper still clearly.

00:03:45.000 --> 00:03:51.000
I think that's the main goal here. So, I think, yeah, so this is… this is gonna… still gonna be our normal conversation, kind of.

00:03:51.000 --> 00:03:54.000
presentation, not like a formal, polished.

00:03:54.000 --> 00:04:00.000
presentation. So feel free, feel free to interrupt me and ask as many questions as you want.

00:04:00.000 --> 00:04:09.000
Okay. Um, yeah, so this is basically the title of the paper, the dual mechanisms of Spatial Reasoning in VLMs. This is under…

00:04:09.000 --> 00:04:11.000
review right now at ICML.

00:04:11.000 --> 00:04:14.000
Let's see what happens.

00:04:14.000 --> 00:04:16.000
Okay, so…

00:04:16.000 --> 00:04:23.000
Okay, I'll start the presentation by actually describing this idea or concept of variable binding.

00:04:23.000 --> 00:04:33.000
Um, yeah, I think David briefly mentioned about it, uh, that this paper is about binding as well. It's, I think, a very old idea. I think it's an idea which is older than most of us.

00:04:33.000 --> 00:04:39.000
Um, people have been talking about it in neuroscience and cognitive science for the last… at least 3 decades.

00:04:39.000 --> 00:04:44.000
That I know of. And the idea is actually pretty simple and fundamental.

00:04:44.000 --> 00:04:51.000
It's just the ability to associate features of an object. So now, right now, you're seeing a lot of people

00:04:51.000 --> 00:04:53.000
And most of us are wearing different color…

00:04:53.000 --> 00:04:56.000
t-shirts or sweaters.

00:04:56.000 --> 00:05:03.000
And you can ascribe… you can figure out that, okay, the color of the t-shirt that this person is wearing is this color.

00:05:03.000 --> 00:05:09.000
And you're not confusing between the colors of different people. And the reason that you are able to do that, that's because…

00:05:09.000 --> 00:05:15.000
Because your mind is able to do, or is able to solve the binding problem.

00:05:15.000 --> 00:05:19.000
Um, yeah, so essentially the binding problem is basically to…

00:05:19.000 --> 00:05:22.000
Associate features of an object.

00:05:22.000 --> 00:05:26.000
And keep it separate between different objects.

00:05:26.000 --> 00:05:29.000
So, for instance, in this particular slide, uh…

00:05:29.000 --> 00:05:32.000
This is a nice photo of, uh…

00:05:32.000 --> 00:05:36.000
horse wearing a fedora and a cat wearing a…

00:05:36.000 --> 00:05:38.000
cap. Now…

00:05:38.000 --> 00:05:50.000
If any intelligent system, be it a human brain or a neural model, if we say that that system has variable binding capability, what I mean is, it can sort of

00:05:50.000 --> 00:05:54.000
Take that image, and then create this list of…

00:05:54.000 --> 00:06:01.000
a set of tuples, where each tuple is actually representing an object and its corresponding feature.

00:06:01.000 --> 00:06:07.000
Yeah, if a system can do that, then we say that the system has a variable binding capability.
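To make the target concrete, here is a minimal, purely illustrative Python sketch (the names and data structure are placeholders, not from the paper) of the object-feature tuples a system with variable binding should be able to produce and query:

```python
# Purely illustrative sketch of the binding target described above:
# map a scene to (object, feature) tuples, keeping features separate per object.
bindings = [
    ("horse", "fedora"),
    ("cat", "cap"),
]

def feature_of(obj, bindings):
    """Fetch the feature bound to a given object."""
    return dict(bindings)[obj]

assert feature_of("cat", bindings) == "cap"      # not confused with the horse's hat
assert feature_of("horse", bindings) == "fedora"
```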

00:06:07.000 --> 00:06:14.000
And why do we care about variable binding in the context of spatial reasoning? Because…

00:06:14.000 --> 00:06:19.000
I would argue that this is one of the fundamental properties that any system needs to have.

00:06:19.000 --> 00:06:21.000
to be able to do spatial reasoning.

00:06:21.000 --> 00:06:29.000
So, for instance, if you give this particular image to a neural model to generate a caption, and let's say it creates this caption, which describes the image.

00:06:29.000 --> 00:06:37.000
I would argue that to be able to represent or generate that caption coherently, it needs to do variable binding in its…

00:06:37.000 --> 00:06:45.000
internal thinking process. If it cannot do variable binding properly, then the caption that it's gonna generate

00:06:45.000 --> 00:06:49.000
It's more likely to be incorrect, and even if it's correct, it's…

00:06:49.000 --> 00:06:53.000
It's something that is not really trustworthy.

00:06:53.000 --> 00:06:59.000
So that's why variable binding is an essential skill, or essential task for…

00:06:59.000 --> 00:07:03.000
any system to do spatial reasoning.

00:07:03.000 --> 00:07:06.000
In fact, a number of previous works have shown that

00:07:06.000 --> 00:07:18.000
the… the limited spatial capability of VLMs can be attributed to their restricted variable binding capabilities.

00:07:18.000 --> 00:07:23.000
Okay.

00:07:23.000 --> 00:07:30.000
Okay, so variable binding seems very important for spatial reasoning. Then the question…

00:07:30.000 --> 00:07:34.000
comes up, the main question that comes up, at least for us,

00:07:34.000 --> 00:07:45.000
the mech interp people, is: can we better understand it in the visual space, in the hope that maybe we can probably improve the model performance as well with the better insights of how

00:07:45.000 --> 00:07:47.000
The models are actually…

00:07:47.000 --> 00:07:52.000
Forming those tuples in its internal activations.

00:07:52.000 --> 00:07:54.000
Um…

00:07:54.000 --> 00:07:57.000
Before going into…

00:07:57.000 --> 00:08:00.000
the VLM space.

00:08:00.000 --> 00:08:04.000
Um, actually, people have looked into this problem in the language model space.

00:08:04.000 --> 00:08:08.000
And I'm first going to describe what we all… You're so humble.

00:08:08.000 --> 00:08:14.000
People have looked into it. I wonder who people.

00:08:14.000 --> 00:08:19.000
Okay, yeah, so some of those contributions have been from our lab as well.

00:08:19.000 --> 00:08:26.000
Okay, okay, okay, from me.

00:08:26.000 --> 00:08:29.000
Um… yeah, okay, so…

00:08:29.000 --> 00:08:31.000
Yeah. Um…

00:08:31.000 --> 00:08:37.000
Yeah, so let me first describe what we already know about this variable binding in the language model space, and

00:08:37.000 --> 00:08:40.000
then maybe we can start thinking about, and talking about…

00:08:40.000 --> 00:08:43.000
The same problem with the vision space.

00:08:43.000 --> 00:08:48.000
So, assume this is, uh, the task, the task that you see on the slide.

00:08:48.000 --> 00:08:54.000
You have a context something like, Apple is in box A, banana is in box B, cheese in box C, and then you ask,

00:08:54.000 --> 00:08:59.000
Or you just have a query sentence. Box A contains the, and the answer should be apple.

00:08:59.000 --> 00:09:02.000
Pretty simple, right? Um…

00:09:02.000 --> 00:09:06.000
And what previous work, some of it my own, has shown

00:09:06.000 --> 00:09:15.000
is the way models, or especially language models, solve this task is by first creating some kind of, like, abstract

00:09:15.000 --> 00:09:23.000
um, ordering representation for each kind of important token that you have in the prompt.

00:09:23.000 --> 00:09:25.000
Okay. Abstract…

00:09:25.000 --> 00:09:31.000
ordering representation, and each kind of important token in the prompt. Those are the two main key

00:09:31.000 --> 00:09:37.000
key terms in what I said. So essentially what that means is, here in this particular prompt,

00:09:37.000 --> 00:09:41.000
I would say that there are two main types of important tokens.

00:09:41.000 --> 00:09:43.000
Uh, first is the object.

00:09:43.000 --> 00:09:46.000
And the second one is the box label.

00:09:46.000 --> 00:09:49.000
Now, what the model does, it basically creates a label,

00:09:49.000 --> 00:09:54.000
for each of these two type of, uh, information.

00:09:54.000 --> 00:09:59.000
So, for Apple, since it's the first one in the prompt, it says that, okay, this is the first

00:09:59.000 --> 00:10:03.000
object in the prompt. And A is the first label in the prompt.

00:10:03.000 --> 00:10:09.000
Similarly, for banana, it says that, okay, this is the second object, and B is the second label.

00:10:09.000 --> 00:10:11.000
So, it creates this abstract

00:10:11.000 --> 00:10:18.000
representations, uh, to encode important information which is present in the prompt.

00:10:18.000 --> 00:10:21.000
And then, finally, when you ask a question, or…

00:10:21.000 --> 00:10:29.000
some kind of features about a particular, uh, label or a particular box, what it does is basically uses that

00:10:29.000 --> 00:10:32.000
ordering information.

00:10:32.000 --> 00:10:38.000
to fetch its corresponding features that it will predict as the next token, where

00:10:38.000 --> 00:10:43.000
In this case, the feature would be actual object, which is present in that box.

00:10:43.000 --> 00:10:48.000
Nikhil, I have a question. I'm online, sorry, sorry. Can I ask now, or should I wait till the end?

00:10:48.000 --> 00:10:50.000
Yeah, yeah, this sorting.

00:10:50.000 --> 00:10:57.000
Okay, um, so my question was, uh, so in this case, like, the structure of the sentence is such that the…

00:10:57.000 --> 00:11:04.000
Uh, it has, like, this very constrained structure, right? Where you have, you know, word followed by is in box.

00:11:04.000 --> 00:11:11.000
Uh, and "is in box" is identical, you know, in all the three segments, and then you have, uh, kind of these labels that come up.

00:11:11.000 --> 00:11:13.000
So, um…

00:11:13.000 --> 00:11:30.000
Do you think that even if we didn't have this constraint structure and, you know, if the sentence was pretty free-flowing, that semantically similar elements would still share this ordering ID, like, the model would… like, is the model doing it… doing this based on…

00:11:30.000 --> 00:11:35.000
Like, some kind of semantic similarity between these components, or is there, like, some other…

00:11:35.000 --> 00:11:41.000
reason for… for why these, you know, for why these IDs exist.

00:11:41.000 --> 00:11:43.000
Um, so I think for the first question, if…

00:11:43.000 --> 00:11:50.000
Uh, so the question was, for a given, like, free-form text, would we expect to see similar kind of…

00:11:50.000 --> 00:11:52.000
ordering IDs…

00:11:52.000 --> 00:12:00.000
Um, or not. I think my intuition is, I think we would still see these kinds of ordering ID. But the problem there is

00:12:00.000 --> 00:12:06.000
Since those are, like, general texts, it will be more difficult to actually prove

00:12:06.000 --> 00:12:09.000
that they are indeed there.

00:12:09.000 --> 00:12:13.000
Um, I mean, we can still do, like, causal experiments to

00:12:13.000 --> 00:12:18.000
say that, okay, they are exp… they are present there, but in that case, the…

00:12:18.000 --> 00:12:26.000
the process of coming up with the causal experiments would be way more difficult, because there would be way more confounds.

00:12:26.000 --> 00:12:28.000
Um…

00:12:28.000 --> 00:12:30.000
Yeah, I think… but I… but my general…

00:12:30.000 --> 00:12:37.000
intuition is that even in the freeform text, the model does create this kind of, uh, ordering ID representation.

00:12:37.000 --> 00:12:41.000
Uh, because I think we have seen this kind of representation across

00:12:41.000 --> 00:12:44.000
bunch of settings now.

00:12:44.000 --> 00:12:56.000
Even though most of those settings are still, in a sense, synthetic, I think we have seen this… this kind of representation now across a bunch of tasks, a bunch of settings.

00:12:56.000 --> 00:13:00.000
Which kind of give me the confidence that, okay, this is a generic thing, it's not just…

00:13:00.000 --> 00:13:04.000
constrained to this kind of synthetic setting.

00:13:04.000 --> 00:13:13.000
Yeah, okay, awesome. Thank you so much, yeah. I think URV's work also is probably, like, relevant here, maybe. Like, he also had these similar findings across many, many tasks.

00:13:13.000 --> 00:13:16.000
But they were all, again, synthetic, so…

00:13:16.000 --> 00:13:21.000
Yeah, even, yeah, the task that he used, I think he… those were…

00:13:21.000 --> 00:13:29.000
more or less synthetic as well. I don't think they had general… general text.

00:13:29.000 --> 00:13:35.000
we didn't see generalization to what the minimum deficit, which is much less.

00:13:35.000 --> 00:13:40.000
So, like, even if you can't find them in one setting, you can show generalization to another.

00:13:40.000 --> 00:13:49.000
Yeah, that's right. I mean, yeah, it depends what you call synthetic, though. The belief tracking was also not super synthetic.

00:13:49.000 --> 00:13:54.000
Yeah, but this mechanism is clear.

00:13:54.000 --> 00:14:00.000
So, okay, so that was the language model space. Now, let's talk about the…

00:14:00.000 --> 00:14:12.000
What if you have more than one object?

00:14:12.000 --> 00:14:16.000
Yeah, good question. I'm not super sure what will that have… what will that do.

00:14:16.000 --> 00:14:21.000
What do you think? Yeah, I'm not 100% sure.

00:14:21.000 --> 00:14:25.000
Yeah.

00:14:25.000 --> 00:14:27.000
But then how…

00:14:27.000 --> 00:14:30.000
So it's not an order, something else?

00:14:30.000 --> 00:14:38.000
So it's gonna still be, like, the first object, first object? Yeah, I think both are gonna… it's gonna be counted as one object, that's what I…

00:14:38.000 --> 00:14:47.000
The way to test this would be… what is the object inside box A that starts with A, or something like that? Like, if you target a specific one.

00:14:47.000 --> 00:14:51.000
Yes.

00:14:51.000 --> 00:14:58.000
It should be… yeah, it shouldn't be exactly the same. There should be some difference, but…

00:14:58.000 --> 00:15:00.000
For box A contains B.

00:15:00.000 --> 00:15:03.000
We'll give the same.

00:15:03.000 --> 00:15:05.000
more or less the same marker for apple.

00:15:05.000 --> 00:15:09.000
No, it decides about the markers before it sees the query.

00:15:09.000 --> 00:15:14.000
decides about the markers before it sees the query.

00:15:14.000 --> 00:15:18.000
Yeah, but I would guess it creates different markers for each.

00:15:18.000 --> 00:15:21.000
Yeah. I think, I think, uh…

00:15:21.000 --> 00:15:23.000
Andrew from…

00:15:23.000 --> 00:15:25.000
Mm-hmm.

00:15:25.000 --> 00:15:29.000
Uh, has a work… a follow-up work to

00:15:29.000 --> 00:15:34.000
yours that shows that there are different markers that the model… Oh, he has a paper on this? I've not seen that.

00:15:34.000 --> 00:15:36.000
Is it out now? Yeah.

00:15:36.000 --> 00:15:39.000
I think so. Oh, I've not seen that paper.

00:15:39.000 --> 00:15:46.000
Okay, Adish, yeah, if you can send that, it would be really helpful.

00:15:46.000 --> 00:15:52.000
Do you have any idea what he's getting?

00:15:52.000 --> 00:16:03.000
What is getting bound to what?

00:16:03.000 --> 00:16:06.000
works because…

00:16:06.000 --> 00:16:13.000
Too early.

00:16:13.000 --> 00:16:16.000
I go, oh, you asked about the…

00:16:16.000 --> 00:16:18.000
that there is no happiness.

00:16:18.000 --> 00:16:21.000
No, no, at least in the class.

00:16:21.000 --> 00:16:29.000
You ask a bullshit about everyone else.

00:16:29.000 --> 00:16:42.000
Uh-huh.

00:16:42.000 --> 00:16:48.000
Uh, publication.

00:16:48.000 --> 00:16:51.000
And it would be…

00:16:51.000 --> 00:16:56.000
something seeming to happen in the morning stations as well.

00:16:56.000 --> 00:17:00.000
Nikhil, wouldn't you say that your diagram…

00:17:00.000 --> 00:17:04.000
Uh-huh. Suggest, because you call it IDs. Mm-hmm.

00:17:04.000 --> 00:17:08.000
that what you're showing here is that, no, neither one.

00:17:08.000 --> 00:17:11.000
they're both being bound.

00:17:11.000 --> 00:17:13.000
to number one.

00:17:13.000 --> 00:17:17.000
They're not… it's not apples being bound to A, or A is being bound to apple.

00:17:17.000 --> 00:17:19.000
They're both being bound to

00:17:19.000 --> 00:17:22.000
this ID, right? Mm-hmm.

00:17:22.000 --> 00:17:24.000
There's an indirect.

00:17:24.000 --> 00:17:25.000
binding happening through the ID.

00:17:25.000 --> 00:17:35.000
Yeah, yeah, yeah, yeah, yeah, cool, thank you.

00:17:35.000 --> 00:17:38.000
the binding that happens.

00:17:38.000 --> 00:17:43.000
Mm-hmm. Mm-hmm.

00:17:43.000 --> 00:17:46.000
So is it more like a pointer? Yes, exactly.

00:17:46.000 --> 00:17:48.000
Exactly, yeah.

00:17:48.000 --> 00:17:57.000
Yeah, but the… yeah, it's more like a pointer, but still the question of many-to-one binding, where, let's say, we have multiple objects, like, multiple…

00:17:57.000 --> 00:18:00.000
You have multiple objects in a single box.

00:18:00.000 --> 00:18:03.000
how that would be bound

00:18:03.000 --> 00:18:08.000
is something that I'm still not very sure.

00:18:08.000 --> 00:18:11.000
Okay, we can move forward. Um…

00:18:11.000 --> 00:18:17.000
So, those were the existing results in the language model space. Um, now let's

00:18:17.000 --> 00:18:20.000
start talk… talking about the VLM…

00:18:20.000 --> 00:18:24.000
results. So, before we get into the results, I just wanted to…

00:18:24.000 --> 00:18:28.000
brush up on the architecture of a VLM. It's actually simple.

00:18:28.000 --> 00:18:30.000
Um…

00:18:30.000 --> 00:18:33.000
Yes.

00:18:33.000 --> 00:18:39.000
Um, so, yes, so there are primarily two components, or actually three components. There is a vision encoder,

00:18:39.000 --> 00:18:42.000
And a projector on top of…

00:18:42.000 --> 00:18:46.000
the vision encoder, and then finally, we have a language model backbone.

00:18:46.000 --> 00:18:48.000
So the image is first…

00:18:48.000 --> 00:18:54.000
uh, broken down into smaller patches. Each of the patches becomes a specific token.

00:18:54.000 --> 00:18:57.000
Which is fed into the vision encoder.

00:18:57.000 --> 00:19:01.000
Which is just a ViT. It processes those…

00:19:01.000 --> 00:19:09.000
token information, and uh… yeah, then the output of each of the tokens is basically given to the projector, which is supposed to transform

00:19:09.000 --> 00:19:12.000
Transform the…

00:19:12.000 --> 00:19:16.000
those vision encoder outputs into language… language model space.

00:19:16.000 --> 00:19:18.000
And then those…

00:19:18.000 --> 00:19:21.000
transformed vision encoder…

00:19:21.000 --> 00:19:29.000
vision embeddings are actually passed on to the language model, just as normal token embeddings.

00:19:29.000 --> 00:19:31.000
Along with the…

00:19:31.000 --> 00:19:33.000
The token embeddings from the text.

00:19:33.000 --> 00:19:41.000
That's it. And then, finally, the language model predicts the next token in the text domain.
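A schematic, runnable sketch of the generic pipeline just described; the dimensions and the random linear maps are placeholders standing in for the ViT, the projector, and the text embeddings, not any specific model:

```python
import numpy as np

rng = np.random.default_rng(0)

def split_into_patches(image, patch=14):
    """Cut an (H, W, 3) image into flattened patch vectors, one per token."""
    H, W, _ = image.shape
    return np.array([
        image[i:i + patch, j:j + patch].reshape(-1)
        for i in range(0, H, patch)
        for j in range(0, W, patch)
    ])

d_vit, d_lm = 64, 128
W_vit = rng.normal(size=(14 * 14 * 3, d_vit))    # stand-in for the vision encoder (ViT)
W_proj = rng.normal(size=(d_vit, d_lm))          # stand-in for the projector

image = rng.random((224, 224, 3))
patch_tokens = split_into_patches(image) @ W_vit   # one embedding per patch
visual_embeds = patch_tokens @ W_proj              # projected into the LM's embedding space

text_embeds = rng.normal(size=(5, d_lm))           # stand-in text token embeddings
lm_input = np.concatenate([visual_embeds, text_embeds], axis=0)
print(lm_input.shape)   # (256 visual tokens + 5 text tokens, d_lm); the LM then predicts the next token
```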

00:19:41.000 --> 00:19:45.000
Yeah, so there was this paper last year,

00:19:45.000 --> 00:19:51.000
Um, which showed that the kind of mechanism that I showed for the language model in the last-to-last slide

00:19:51.000 --> 00:19:55.000
actually generalizes to VLMs as well.

00:19:55.000 --> 00:19:58.000
Though they only studied the language model backbone,

00:19:58.000 --> 00:20:03.000
But they show that this is the same kind of mechanism actually generalizes, uh…

00:20:03.000 --> 00:20:07.000
In that setting as well. So, just to…

00:20:07.000 --> 00:20:14.000
We don't need to… I'm sorry, but they only show, like, what happens in the text tokens. They didn't say anything about the visual tokens.

00:20:14.000 --> 00:20:19.000
So they… they have just one… one experiment. They say that the information comes from the image.

00:20:19.000 --> 00:20:26.000
But they do one experiment on the key vectors of visual tokens.

00:20:26.000 --> 00:20:31.000
Yeah, it's… so they don't say it as clearly as I'm gonna explain it to you.

00:20:31.000 --> 00:20:38.000
But I think, based on what I've explained to you in the language setting, and based on their result, I think we can infer that.

00:20:38.000 --> 00:20:41.000
Okay, so essentially, this is what happens, so…

00:20:41.000 --> 00:20:44.000
Yeah, the image gets passed on

00:20:44.000 --> 00:20:46.000
from Vision Encoder and the projector.

00:20:46.000 --> 00:20:50.000
And then, in the language model,

00:20:50.000 --> 00:20:56.000
In the language model backbone, we… or they have shown that the model creates this ordering ID,

00:20:56.000 --> 00:21:00.000
for, let's say, both horse…

00:21:00.000 --> 00:21:03.000
and cat, as well as fedora and…

00:21:03.000 --> 00:21:05.000
cap.

00:21:05.000 --> 00:21:08.000
in the visual token residual stream.

00:21:08.000 --> 00:21:14.000
And then when we ask "the cat is wearing a", and the answer should be "cap", the model basically

00:21:14.000 --> 00:21:18.000
The language model basically fetches the ordering ID of…

00:21:18.000 --> 00:21:24.000
the cat, which is 2, from the vision token, to the cat token residual stream.

00:21:24.000 --> 00:21:30.000
And then it passes on that piece of information to the last token, the A token.

00:21:30.000 --> 00:21:34.000
Which is used to, again, fetch in the corresponding, uh…

00:21:34.000 --> 00:21:37.000
object ordering ID.

00:21:37.000 --> 00:21:41.000
And then finally, the model uses this object ordering ID to

00:21:41.000 --> 00:21:44.000
actually fetch its value, the cap.

00:21:44.000 --> 00:21:49.000
Which is actually the final prediction as the answer.

00:21:49.000 --> 00:21:57.000
It doesn't expect folks to…

00:21:57.000 --> 00:22:04.000
How do we know that this rectangle represents a cat?

00:22:04.000 --> 00:22:09.000
How… I'm not saying that if it… it could be representing anything.

00:22:09.000 --> 00:22:11.000
Uh, but…

00:22:11.000 --> 00:22:14.000
Let's say… so when we…

00:22:14.000 --> 00:22:17.000
break down the image into smaller patches.

00:22:17.000 --> 00:22:20.000
Let's say the last patch.

00:22:20.000 --> 00:22:24.000
Which the last patch, which has some…

00:22:24.000 --> 00:22:28.000
region of the cat.

00:22:28.000 --> 00:22:31.000
will have this ordering ID.

00:22:31.000 --> 00:22:39.000
Okay. Even that's not… I don't think they explained that in that paper. I think… I think in that paper… in their paper, they just had…

00:22:39.000 --> 00:22:42.000
like, shapes, which…

00:22:42.000 --> 00:22:45.000
can be covered with a single token.

00:22:45.000 --> 00:22:48.000
By a single patch.

00:22:48.000 --> 00:22:51.000
So if you have an object which

00:22:51.000 --> 00:22:56.000
uh, spread across multiple patches.

00:22:56.000 --> 00:22:58.000
then I think…

00:22:58.000 --> 00:23:02.000
things become a little bit more complicated, but I think we can say that at least the last token of that

00:23:02.000 --> 00:23:06.000
the last patch of that particular object should encode

00:23:06.000 --> 00:23:11.000
this ordering ID in the language model space.

00:23:11.000 --> 00:23:16.000
It's the same as ROME, basically. It's kind of like the last token of the entity…

00:23:16.000 --> 00:23:19.000
Like, the last token of the entity and call the same for the…

00:23:19.000 --> 00:23:25.000
Kind of like that. Interesting, yeah. Is it the same if the patches are not, um…

00:23:25.000 --> 00:23:30.000
Um, sequential. Like, the cat can be…

00:23:30.000 --> 00:23:34.000
Yeah, yeah, yeah. That's right.

00:23:34.000 --> 00:23:39.000
That's right. There could be a lot of space in between, yeah.

00:23:39.000 --> 00:23:45.000
Though they don't show it in the paper, I… yeah, this is my understanding, and uh…

00:23:45.000 --> 00:23:48.000
Little bit of our experiment.

00:23:48.000 --> 00:23:49.000
Okay, but so this is what…

00:23:49.000 --> 00:24:00.000
Uh, I have another… I have another question. So, you're saying that each patch, then, is associated directly with an object, and the object is not spread across multiple patches?

00:24:00.000 --> 00:24:05.000
in the datasets that you're using in this experiment?

00:24:05.000 --> 00:24:08.000
in the paper that I'm talking about.

00:24:08.000 --> 00:24:09.000
Oh, in the paper that you're talking about. Okay, okay, got it, yeah.

00:24:09.000 --> 00:24:15.000
Yeah, so in this particular paper.

00:24:15.000 --> 00:24:18.000
Okay, so, yeah, basically the same story in the…

00:24:18.000 --> 00:24:22.000
VLM setting.

00:24:22.000 --> 00:24:24.000
But, I think there's still a lot

00:24:24.000 --> 00:24:27.000
There are still many questions that remain unanswered.

00:24:27.000 --> 00:24:31.000
Um, so I think this is one of the first questions that we study in this paper.

00:24:31.000 --> 00:24:34.000
Which is where…

00:24:34.000 --> 00:24:37.000
Where does… where does the ordering ID get formed?

00:24:37.000 --> 00:24:45.000
Is it in the language model backbone itself, where we have already some evidence that it is present there?

00:24:45.000 --> 00:24:51.000
Or maybe it is not getting generated in the language model, it is actually getting generated in the vision encoder.

00:24:51.000 --> 00:24:53.000
And it's being passed on to the…

00:24:53.000 --> 00:24:57.000
language model backbone.

00:24:57.000 --> 00:25:02.000
Okay, so that's, like, one of the questions that we study.

00:25:02.000 --> 00:25:05.000
Another question that we study is how are they represented?

00:25:05.000 --> 00:25:08.000
Is it actually localized?

00:25:08.000 --> 00:25:11.000
Or, uh…

00:25:11.000 --> 00:25:14.000
Is it diffused across tokens?

00:25:14.000 --> 00:25:16.000
When I say it, uh, I mean…

00:25:16.000 --> 00:25:20.000
Yeah, okay, I mean, the ordering ID.

00:25:20.000 --> 00:25:22.000
Uh, and again,

00:25:22.000 --> 00:25:27.000
So we know that the ordering IDs are present in the language model backbone.

00:25:27.000 --> 00:25:30.000
Uh, so can we better characterize them?

00:25:30.000 --> 00:25:34.000
I mean, what kinds of…

00:25:34.000 --> 00:25:38.000
What kinds of representation are they? Are they encoding, like, a relative?

00:25:38.000 --> 00:25:44.000
sort of like a relative position, or are they encoding more like an abstract position? Like, okay, this is the…

00:25:44.000 --> 00:25:51.000
x, y coordinates in the image, or is it, like, the first object in the image?

00:25:51.000 --> 00:25:58.000
So those are the three main questions that we study, and I think it will become more clear, I think, as I move… as I show the result and the experimental…

00:25:58.000 --> 00:26:01.000
setups.

00:26:01.000 --> 00:26:06.000
Okay. Um, yeah, so we studied these four settings.

00:26:06.000 --> 00:26:13.000
Um, 3 of them are synthetic, and one is actually coming from a real benchmark.

00:26:13.000 --> 00:26:21.000
Um, the first one is actually the simplest one, where we generate these squares of equal size, but different colors.

00:26:21.000 --> 00:26:29.000
They are either spread horizontally or vertically. Here, I'm only showing the horizontal one, but we have settings where…

00:26:29.000 --> 00:26:33.000
the squares and the shapes and the objects are spread vertically.

00:26:33.000 --> 00:26:36.000
And the question that we ask is,

00:26:36.000 --> 00:26:41.000
So, let's say… let's take this particular square image example.

00:26:41.000 --> 00:26:44.000
So, for this particular square, we could ask,

00:26:44.000 --> 00:26:49.000
What is the color of the square to the left of the green square?

00:26:49.000 --> 00:26:52.000
or the color of the square to the

00:26:52.000 --> 00:26:56.000
left of the green square is…

00:26:56.000 --> 00:26:59.000
And the answer should be… red.

00:26:59.000 --> 00:27:02.000
That's basically the task.
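A hedged sketch of how such a stimulus could be generated; the image size, square size, and positions used in the paper are not stated here, so these values are guesses for illustration only:

```python
from PIL import Image, ImageDraw

def make_image(colors, size=224, square=56, gap=14):
    """Draw equal-size squares of the given colors, spread horizontally."""
    img = Image.new("RGB", (size, size), "white")
    draw = ImageDraw.Draw(img)
    y = (size - square) // 2
    for i, color in enumerate(colors):
        x = gap + i * (square + gap)
        draw.rectangle([x, y, x + square, y + square], fill=color)
    return img

img = make_image(["red", "green", "blue"])   # for the vertical setting, swap the x and y roles
img.save("squares.png")
question = "The color of the square to the left of the green square is"
answer = "red"
```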

00:27:02.000 --> 00:27:04.000
Um,

00:27:04.000 --> 00:27:09.000
And we study two models, Qwen and Gemma models.

00:27:09.000 --> 00:27:13.000
And both of the models can do this task perfectly.

00:27:13.000 --> 00:27:17.000
Okay, so that was the setup. Now…

00:27:17.000 --> 00:27:22.000
We are starting to get into the experimental setups, or the experiments that we did in the paper.

00:27:22.000 --> 00:27:27.000
Now, the first thing that we did was to actually cross-check if

00:27:27.000 --> 00:27:30.000
The VLM is…

00:27:30.000 --> 00:27:38.000
Again, using this ordering ID information to solve this task or not. Even though the previous work has already shown that for… but that was for a different task.

00:27:38.000 --> 00:27:45.000
So we wanted to confirm that this particular result actually generalizes to our setting or not.

00:27:45.000 --> 00:27:50.000
So, for that, we did, uh, a patching experiment.

00:27:50.000 --> 00:27:53.000
Which is pretty simple, actually. So we have this…

00:27:53.000 --> 00:27:55.000
two samples. The first one is the

00:27:55.000 --> 00:27:58.000
clean sample, where…

00:27:58.000 --> 00:28:01.000
The answer to the question is red.

00:28:01.000 --> 00:28:07.000
And for the sample on the right, the answer to the question is black.

00:28:07.000 --> 00:28:09.000
Okay?

00:28:09.000 --> 00:28:13.000
The other major difference between these two samples are…

00:28:13.000 --> 00:28:17.000
In the first sample, the first square is the correct answer.

00:28:17.000 --> 00:28:22.000
And in the second sample, the third square is the correct answer.

00:28:22.000 --> 00:28:28.000
Okay? And we are doing the intervention at the last text token, which is "is"

00:28:28.000 --> 00:28:36.000
in this particular setting. So, we are taking the residual vector at the "is" token from the counterfactual run, and pasting it

00:28:36.000 --> 00:28:39.000
onto its corresponding location, and layer.

00:28:39.000 --> 00:28:44.000
In the clean run. And then checking how does the final output change.

00:28:44.000 --> 00:28:48.000
And we are expecting that if a particular

00:28:48.000 --> 00:28:54.000
layer is encoding the ordering ID information.

00:28:54.000 --> 00:29:01.000
When that layer is patched from the counterfactual run to the clean run, the final outputs of the clean run should change from

00:29:01.000 --> 00:29:06.000
red to blue.

00:29:06.000 --> 00:29:10.000
Is that… does that make sense? I have just spoken a lot of words.
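A hedged sketch of the residual-stream patching just described, written for a HuggingFace-style decoder whose blocks are assumed to live at `model.model.layers[i]`; that layout, and applying it to a VLM's language backbone, are assumptions to adapt, not the paper's actual code:

```python
import torch

def grab_residual(model, inputs, layer, pos):
    """Run once on the counterfactual prompt and cache the residual vector at (layer, pos)."""
    cache = {}
    def hook(module, args, output):
        hidden = output[0] if isinstance(output, tuple) else output
        cache["v"] = hidden[:, pos, :].detach().clone()
    handle = model.model.layers[layer].register_forward_hook(hook)
    with torch.no_grad():
        model(**inputs)
    handle.remove()
    return cache["v"]

def run_with_patch(model, inputs, layer, pos, vec):
    """Re-run the clean prompt with the counterfactual vector pasted into the same place."""
    def hook(module, args, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden[:, pos, :] = vec       # overwrite the residual stream in place
        return output
    handle = model.model.layers[layer].register_forward_hook(hook)
    with torch.no_grad():
        logits = model(**inputs).logits
    handle.remove()
    return logits[:, -1, :]           # next-token logits after the intervention

# Usage sketch (positions/layers are illustrative):
# vec = grab_residual(model, counterfactual_inputs, layer=20, pos=is_token_pos)
# patched_logits = run_with_patch(model, clean_inputs, layer=20, pos=is_token_pos, vec=vec)
```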

00:29:10.000 --> 00:29:15.000
Okay. So, we did that experiment, and this is the result.

00:29:15.000 --> 00:29:17.000
uh… we see that

00:29:17.000 --> 00:29:22.000
in the later-ish layer, something after layer 20.

00:29:22.000 --> 00:29:26.000
We do see that… that the last

00:29:26.000 --> 00:29:31.000
token is encoding this ordering ID representation,

00:29:31.000 --> 00:29:33.000
And if we patch that…

00:29:33.000 --> 00:29:37.000
from the counterfactual to the clean run, we ex… we get blue color as our…

00:29:37.000 --> 00:29:39.000
It's the final output. Uh-huh.

00:29:39.000 --> 00:29:50.000
Uh, Nikhil, can you explain this again? Like, why would you not expect the color to be black? Like, why would you expect it to be blue? Like, when you're doing this patching?

00:29:50.000 --> 00:30:01.000
Okay, so idea here is, we are… By the way, would you expect it to be black?

00:30:01.000 --> 00:30:05.000
Yeah, right? That's reasonable. It should be black.

00:30:05.000 --> 00:30:10.000
Didn't… didn't your figure in the next slide show that at the later layers it does start getting black?

00:30:10.000 --> 00:30:22.000
I think so. Yeah, so you guys… yeah, you already have a partial answer. I mean, you… yeah, almost have the answer. So, essentially, what the model does is, at the last token, it first forms this ordering ID information

00:30:22.000 --> 00:30:27.000
of the correct square, which it needs to predict as the next token.

00:30:27.000 --> 00:30:33.000
And once it has formed that ordering ID information, then it uses that piece of information to actually fetch the

00:30:33.000 --> 00:30:36.000
feature associated with that.

00:30:36.000 --> 00:30:43.000
square that it needs to answer. So, in this case, or in this task, the feature is the color of that square.

00:30:43.000 --> 00:30:48.000
So, in the later-ish layer, after it has formed the ordering ID, it actually uses it

00:30:48.000 --> 00:30:49.000
To fetch the color of that square.

00:30:49.000 --> 00:30:51.000
Uh, I don't know, like, I feel like it should be black, yeah, I do expect it to be black.

00:30:51.000 --> 00:30:57.000
And hence, in further later layers, we see black color as the final output.

00:30:57.000 --> 00:31:08.000
Oh, so we… sorry, so if I… if I understand correctly, both the feature information as well as the ordering information is encoded in that representation you're using to patch.

00:31:08.000 --> 00:31:15.000
But the ordering information from the representation is recovered earlier, like, before the feature information.

00:31:15.000 --> 00:31:21.000
Yes, you can say that. So, model is first forming the ordering information, and then using it.

00:31:21.000 --> 00:31:25.000
to fetch the feature information, add further layers.

00:31:25.000 --> 00:31:28.000
Or later layers.

00:31:28.000 --> 00:31:41.000
Could it… would this be, like, a correct, very high-level understanding of, like, what this represents? Like, around Layer 20, when that blue peak is happening, it's forming, like, the index to look up, but it's not actually doing the lookup yet, so when you patch right at those layers,

00:31:41.000 --> 00:31:43.000
You intervene on, like,

00:31:43.000 --> 00:31:49.000
the location, um, that it's looking up before it actually retrieves the value there.

00:31:49.000 --> 00:31:53.000
Okay. That's exactly good. That's 100% correct.

00:31:53.000 --> 00:32:03.000
I even had a question? Yeah, so what exactly did you test? This is just a residual vector.

00:32:03.000 --> 00:32:07.000
Um… okay, so from this experiment, we can say that…

00:32:07.000 --> 00:32:10.000
Even for this particular task, the VLM

00:32:10.000 --> 00:32:15.000
is using this ordering ID information to solve the task.

00:32:15.000 --> 00:32:17.000
Okay, so now, coming to our…

00:32:17.000 --> 00:32:20.000
First, like,

00:32:20.000 --> 00:32:24.000
like, one of the main questions that we study in this paper.

00:32:24.000 --> 00:32:27.000
Which is to understand…

00:32:27.000 --> 00:32:32.000
Where exactly this ordering ID information is getting generated in the VLM.

00:32:32.000 --> 00:32:37.000
Yeah. Did you look at, um…

00:32:37.000 --> 00:32:40.000
varying the colors themselves?

00:32:40.000 --> 00:32:45.000
Right. Just seeing, you know, this should be an invariant.

00:32:45.000 --> 00:32:47.000
So, if we changed to, like, you know,

00:32:47.000 --> 00:32:51.000
orange, blue, and brown, but…

00:32:51.000 --> 00:32:57.000
Yes, I think I should have mentioned it. So this… this graph that you see is…

00:32:57.000 --> 00:33:00.000
is actually averaged over 50 samples.

00:33:00.000 --> 00:33:05.000
Yeah, I think we should mention that. It's… yeah, and those 50 samples have

00:33:05.000 --> 00:33:08.000
various different combinations of colors.

00:33:08.000 --> 00:33:15.000
So, it's not really specific to this particular sample. And yeah, so results are actually…

00:33:15.000 --> 00:33:20.000
Uh, averaged over a bunch of samples with different colors.

00:33:20.000 --> 00:33:25.000
Okay, uh-huh. Um, so you did residual, I think? Uh-huh.

00:33:25.000 --> 00:33:27.000
Do you also look at…

00:33:27.000 --> 00:33:29.000
any specific mechanism, like…

00:33:29.000 --> 00:33:32.000
Does this, uh…

00:33:32.000 --> 00:33:34.000
ordering ID, is there a…

00:33:34.000 --> 00:33:38.000
specific mechanism that does it… Well, if you patch, do you get 100% blue?

00:33:38.000 --> 00:33:41.000
Because, like, I see that it's…

00:33:41.000 --> 00:33:43.000
like, 80%?

00:33:43.000 --> 00:33:49.000
But even, even, even…

00:33:49.000 --> 00:33:52.000
that, I would say, maybe…

00:33:52.000 --> 00:33:55.000
Maybe there is some information in a…

00:33:55.000 --> 00:34:04.000
in a previous token, one or two previous tokens, like the square token, might have some information about the ordering ID, which we are not really touching upon.

00:34:04.000 --> 00:34:08.000
Maybe that still holds the ordering ID of…

00:34:08.000 --> 00:34:19.000
of the left square, and that's why we're not seeing super high intervention. Well, you know more than this. I mean, when you get to the end of the paper, you're gonna say that…

00:34:19.000 --> 00:34:22.000
Instead of saying maybe, you're going to say, oh, there is more than one.

00:34:22.000 --> 00:34:24.000
mechanism, right?

00:34:24.000 --> 00:34:34.000
Uh, so I think his question was why we are not seeing this… 100%. Yeah, 100% yeah. For this one mechanism. Yeah. It's because there's more than one mechanism. No, but this is… this is the end of…

00:34:34.000 --> 00:34:41.000
both the mechanism. Both the mechanisms have already combined. They both come together. Yeah, yeah, yeah. This is almost the end of the process.

00:34:41.000 --> 00:34:45.000
So the end is… the end is merged. The end is merged. Okay, I didn't know that. Yeah.

00:34:45.000 --> 00:34:48.000
The end is almost merged. This is almost the end of the computation.

00:34:48.000 --> 00:34:52.000
Yeah, so it's like, if the thing that you're patching in is indeed…

00:34:52.000 --> 00:34:54.000
this, like, the right side index.

00:34:54.000 --> 00:34:58.000
Mm-hmm. Then, if you cleanly patch it, then you should probably get it.

00:34:58.000 --> 00:35:02.000
All the time. Yeah, so my guess is maybe we are…

00:35:02.000 --> 00:35:06.000
there are… there is some information maybe diffused over some other…

00:35:06.000 --> 00:35:14.000
tokens that we are not patching upon, and that's why we are not seeing a completely high…

00:35:14.000 --> 00:35:16.000
intervention effect.

00:35:16.000 --> 00:35:21.000
So maybe because the gap between saying red and saying that.

00:35:21.000 --> 00:35:24.000
is so small. So weak.

00:35:24.000 --> 00:35:26.000
That surprised me.

00:35:26.000 --> 00:35:31.000
But there is, like, there is overlap between where the ordering representation is affected and the value representation.

00:35:31.000 --> 00:35:36.000
So basically, what we see, when we see the blue peaks…

00:35:36.000 --> 00:35:38.000
It's already overlapping.

00:35:38.000 --> 00:35:43.000
Yeah, that's also a point. And also, like, since it's…

00:35:43.000 --> 00:35:47.000
It's… it captured the ID, and it has to go look for it.

00:35:47.000 --> 00:35:50.000
Can we say that most of this happens in attention?

00:35:50.000 --> 00:35:53.000
Yeah, yeah, we can say that, yeah.

00:35:53.000 --> 00:35:57.000
Yeah, in this work, we don't look into attention heads, but…

00:35:57.000 --> 00:36:05.000
Yeah, I think previous works. I'm not sure if you would get a better performance, like, better effect just by…

00:36:05.000 --> 00:36:07.000
Patching in the heads. It… it…

00:36:07.000 --> 00:36:10.000
One point is what Tamar said.

00:36:10.000 --> 00:36:14.000
Maybe because of that, you might see…

00:36:14.000 --> 00:36:17.000
But I'm not 100% sure about that.

00:36:17.000 --> 00:36:23.000
Yeah, if it's different heads doing the value fetching. Yeah, yeah, then you might… you might get slightly better, yeah.

00:36:23.000 --> 00:36:27.000
Yeah. But we do hypothesize that it's very similar.

00:36:27.000 --> 00:36:33.000
Overall. Right, yeah. It's basically a look back. Yeah.

00:36:33.000 --> 00:36:38.000
Okay, so where is this ordering ID information getting formed?

00:36:38.000 --> 00:36:42.000
Um…

00:36:42.000 --> 00:36:46.000
So, the first experiment that we did was actually probing experiment.

00:36:46.000 --> 00:36:51.000
We tried to… we… we asked this question of whether the…

00:36:51.000 --> 00:36:58.000
The vision token embeddings, which are fed into the language model backbone, whether they already contain

00:36:58.000 --> 00:37:00.000
This ordering ID information or not.

00:37:00.000 --> 00:37:04.000
And what we do is, we…

00:37:04.000 --> 00:37:07.000
We take the representation of…

00:37:07.000 --> 00:37:12.000
the embedding of each of the square tokens.

00:37:12.000 --> 00:37:14.000
And, um…

00:37:14.000 --> 00:37:17.000
basically just train a linear three-class classifier.

00:37:17.000 --> 00:37:19.000
On top of, uh…

00:37:19.000 --> 00:37:21.000
those representations.

00:37:21.000 --> 00:37:27.000
Um, we use, like, 90 samples for training and 30 samples for testing.

00:37:27.000 --> 00:37:30.000
And the probe accuracy was almost perfect.

00:37:30.000 --> 00:37:33.000
Uh, so this, in a sense, shows that

00:37:33.000 --> 00:37:41.000
the ordering information is already present in the vision token embedding, which is fed into the language model.
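A hedged sketch of the probing setup just described; using scikit-learn as the probe trainer, the embedding size, and the random stand-in vectors are all my assumptions, and with the real projected square-token embeddings the probe is the near-perfect one reported here:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

d_model = 128
rng = np.random.default_rng(0)

# X: one projected embedding per square token; y: its ordering ID (0 = first square, ...)
X_train = rng.normal(size=(90 * 3, d_model))   # stand-in for 90 training images x 3 squares
y_train = np.tile([0, 1, 2], 90)
X_test = rng.normal(size=(30 * 3, d_model))    # stand-in for 30 held-out images x 3 squares
y_test = np.tile([0, 1, 2], 30)

probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)   # linear 3-class probe
print("held-out accuracy:", probe.score(X_test, y_test))

# The same trained probe can then be applied to every visual token embedding
# (not just the square tokens) to map where the ordering information sits.
```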

00:37:41.000 --> 00:37:44.000
However, this was the most interesting result.

00:37:44.000 --> 00:37:52.000
So what we did was… so we trained the probe only on the square token, like, the embeddings corresponding to the square token.

00:37:52.000 --> 00:37:56.000
But now, once we have the probe, we can basically apply the probe to every

00:37:56.000 --> 00:37:59.000
uh, visual, uh, visual…

00:37:59.000 --> 00:38:01.000
token embedding. So that's what we did.

00:38:01.000 --> 00:38:07.000
We took the probe, we applied it on each

00:38:07.000 --> 00:38:09.000
Each of the…

00:38:09.000 --> 00:38:16.000
each of the patch, or its corresponding embedding. And this is the result that we get. What we found was…

00:38:16.000 --> 00:38:24.000
For the probe that was trained to find the ordering information of the first square, it has a

00:38:24.000 --> 00:38:26.000
good effect, or it has a good…

00:38:26.000 --> 00:38:28.000
uh… probing accuracy.

00:38:28.000 --> 00:38:35.000
On the test set, on the entire strip which is actually covering that square.

00:38:35.000 --> 00:38:43.000
So, this is where the square is supposed to be, somewhere here, but we found that this ordering ID information is actually spread across

00:38:43.000 --> 00:38:45.000
this entire strip.

00:38:45.000 --> 00:38:48.000
Um, yeah, question.

00:38:48.000 --> 00:38:53.000
Okay. So, yeah, so this… so this is the image.

00:38:53.000 --> 00:38:58.000
Okay. We break down the image into individual patches.

00:38:58.000 --> 00:39:01.000
those individual patches become the tokens.

00:39:01.000 --> 00:39:05.000
Which is fed into the VIT.

00:39:05.000 --> 00:39:09.000
And then the projector, and then we have this embeddings.

00:39:09.000 --> 00:39:12.000
Which are fed into the language model backbone.

00:39:12.000 --> 00:39:14.000
We talk… we took…

00:39:14.000 --> 00:39:19.000
those embeddings, which are fed into the language model backbone.

00:39:19.000 --> 00:39:22.000
Corresponding to the square… square…

00:39:22.000 --> 00:39:27.000
tokens only. So, we took the…

00:39:27.000 --> 00:39:29.000
So, we know that which token

00:39:29.000 --> 00:39:34.000
Which, like, the 100th token corresponds to the first square.

00:39:34.000 --> 00:39:37.000
200 token corresponds to the second square.

00:39:37.000 --> 00:39:41.000
So, we only picked up 100, 200, and 300.

00:39:41.000 --> 00:39:44.000
uh… vision tokens.

00:39:44.000 --> 00:39:47.000
Okay.

00:39:47.000 --> 00:39:51.000
Because those corresponding to… those are the ones

00:39:51.000 --> 00:39:56.000
for the square tokens. Those are the ones which are encoding the squares.

00:39:56.000 --> 00:40:00.000
You can physically point to it.

00:40:00.000 --> 00:40:04.000
You took it from the image? That's how we created the…

00:40:04.000 --> 00:40:08.000
image. So, we created the image, we have full control of where we can

00:40:08.000 --> 00:40:10.000
put the square in the image.

00:40:10.000 --> 00:40:17.000
So we put it in a specific position in that image, so that when we break it down, it actually corresponds to the 100th

00:40:17.000 --> 00:40:21.000
token which gets generated.

00:40:21.000 --> 00:40:23.000
Okay. And then what's the probe?

00:40:23.000 --> 00:40:29.000
Yeah, so the… once you have these 100th, 200th, and 300th, like, token embeddings,

00:40:29.000 --> 00:40:38.000
Then the 100th one corresponds to, like, the first square, the 200th corresponds to the second square, and the 300th corresponds to the third square.

00:40:38.000 --> 00:40:40.000
That's what the…

00:40:40.000 --> 00:40:45.000
probe is tasked to classify.

00:40:45.000 --> 00:40:48.000
increasing the size of the pitch.

00:40:48.000 --> 00:40:50.000
Square, now we are talking about vectors.

00:40:50.000 --> 00:40:52.000
This is an embedding.

00:40:52.000 --> 00:41:01.000
When you say square, I'm not sure what you're… Yeah, there are so many squares along with this background, so you have to… I think that it might be helpful to actually… So this…

00:41:01.000 --> 00:41:06.000
different shapes. So, so this is a vector here now.

00:41:06.000 --> 00:41:09.000
That's not a square.

00:41:09.000 --> 00:41:14.000
That's a vector. That's a vector. This is the way. Yeah, this is a vector.

00:41:14.000 --> 00:41:17.000
Okay, well, so each square on the left…

00:41:17.000 --> 00:41:22.000
Which one is the square? That's a square. This is… when I say square, I mean this… this square. This…

00:41:22.000 --> 00:41:27.000
No. Is this only one? Well, it just happens to be. It could have been, I mean, so, like…

00:41:27.000 --> 00:41:38.000
Oh, sorry, we have four vectors… sorry. But for… for understanding, let's say there is only one vector. No, actually, for understanding, it might be clearer. It's 4 patches.

00:41:38.000 --> 00:41:45.000
They cover that square. Okay, let's say, yeah, in technical… in actual terms, there are, like, 4 vectors corresponding to each of…

00:41:45.000 --> 00:41:47.000
the red, green, and blue square.

00:41:47.000 --> 00:41:50.000
Okay.

00:41:50.000 --> 00:41:52.000
So, okay, so now you have…

00:41:52.000 --> 00:41:56.000
4, 4, 4. 12 vectors.

00:41:56.000 --> 00:42:02.000
You got it? So the first four vectors correspond to the first square.

00:42:02.000 --> 00:42:04.000
That's how we label them.

00:42:04.000 --> 00:42:09.000
The next 4 vectors correspond to the second one, and the last 4 correspond to the third one.

00:42:09.000 --> 00:42:13.000
Essentially, that's what the probe is trying to

00:42:13.000 --> 00:42:21.000
classify. We feed each of these 12 vectors into the probe, and the probe is supposed to give me 1, 2, 3.

00:42:21.000 --> 00:42:23.000
And then you use the same probe.

00:42:23.000 --> 00:42:24.000
to see other… other vectors.

00:42:24.000 --> 00:42:27.000
Okay.

00:42:27.000 --> 00:42:31.000
Yeah, exactly. So is it one probe for all, horizontal and vertical?

00:42:31.000 --> 00:42:36.000
No, uh… uh… horizontal is 1.

00:42:36.000 --> 00:42:41.000
I mean, so this is a 3-class classifier. So we basically have, like, 3… 3 vectors.

00:42:41.000 --> 00:42:49.000
That makes sense. So, you… the same image, if you flip it, you can still use the same probe, right? Because the token dimensions, everything is the same.

00:42:49.000 --> 00:42:53.000
No, I think we trained a different one.

00:42:53.000 --> 00:43:00.000
It might work… it might be that the horizontal probe would work in the vertical setting, but I don't think we tried it.

00:43:00.000 --> 00:43:05.000
Yeah, I think it might be more convincing that it's a positional…

00:43:05.000 --> 00:43:08.000
like, relative positioning.

00:43:08.000 --> 00:43:15.000
Yeah, that is so… It looks like, from the probing here, it's just cued into the, like…

00:43:15.000 --> 00:43:18.000
positional encoding of the token, not necessarily, like, the…

00:43:18.000 --> 00:43:23.000
semantic relationship between the squares. Could be literally the X and Ys, is what you're saying. Yeah. Yeah, it should be like that.

00:43:23.000 --> 00:43:27.000
But you have some other results, too, that show it's more than that.

00:43:27.000 --> 00:43:32.000
Yeah, but I'm sorry, why do you think it's X and Y? Because, uh, from the heat map?

00:43:32.000 --> 00:43:43.000
It just looks like the signal the probe has cued into is just the positional encoding of the token from the ViT encoder. When you say position, you mean X and Y coordinates? Yeah, just X and Y coordinates, yeah. Okay.

00:43:43.000 --> 00:43:53.000
It's not necessarily, like, the… because it seems like what you're after is, like, the semantic relationship, or, like, the orderings of, like, objects… So, the counter for that argument is, you think the…

00:43:53.000 --> 00:43:57.000
X and Y of this particular square is…

00:43:57.000 --> 00:44:04.000
Isn't the X and Y coordinate of this particular square, or this particular patch, at the same distance as some square here?

00:44:04.000 --> 00:44:07.000
No, no, the X is different.

00:44:07.000 --> 00:44:10.000
But the distance, if you look at

00:44:10.000 --> 00:44:13.000
Just the X. So, it has picked up X…

00:44:13.000 --> 00:44:17.000
Yeah, so… But we do know now that we can change the position.

00:44:17.000 --> 00:44:21.000
If you change the position of this object inside the image,

00:44:21.000 --> 00:44:24.000
It does generalize. It does generalize… what do you think?

00:44:24.000 --> 00:44:31.000
So, if you take… Do you have that experiment somewhere? Yeah, but yeah, I do have that experiment.

00:44:31.000 --> 00:44:37.000
I'm gonna go through that, but I'm still not sure about it. Now, to answer the question.

00:44:37.000 --> 00:44:48.000
Yeah. Because if you do what he said, which is, like, flip the horizontal and vertical, and it still generalizes, I would be, like, more convinced with it. Yeah, that I agree with. That makes sense as well.

00:44:48.000 --> 00:44:52.000
But, wait, wait, wait…

00:44:52.000 --> 00:44:55.000
We have a probing one?

00:44:55.000 --> 00:45:00.000
I just sent it. Oh, no, I did not include that.

00:45:00.000 --> 00:45:12.000
Should I open Discord now? Yeah. You said, you said this is interactive, not the formal talk, so here we are. I feel like we're seeing all of our chat.

00:45:12.000 --> 00:45:16.000
Oh, man. Okay, right, it's okay, nobody saw it.

00:45:16.000 --> 00:45:21.000
Okay, then you, you, you've got to explain it, you've got to explain.

00:45:21.000 --> 00:45:24.000
Thomas needs to explain the experiments. I'm not… we can't see.

00:45:24.000 --> 00:45:28.000
Oh, that's better, you can't say it.

00:45:28.000 --> 00:45:33.000
I don't know, it's not coming. That's right.

00:45:33.000 --> 00:45:38.000
Early about you. Earlier about me, that's right. That's right, that's right, especially about… Especially overhead.

00:45:38.000 --> 00:45:42.000
Rohit's in trouble now.

00:45:42.000 --> 00:45:52.000
Okay, never mind, it's not getting caught in.

00:45:52.000 --> 00:45:54.000
Can I say that again?

00:45:54.000 --> 00:46:04.000
Okay. And then what should I press?

00:46:04.000 --> 00:46:10.000
Watch out. He feels behind the kill.

00:46:10.000 --> 00:46:13.000
Okay, I'm sharing my entire screen now.

00:46:13.000 --> 00:46:16.000
It's okay, it's okay, it's okay.

00:46:16.000 --> 00:46:19.000
You want to explain what… what is this?

00:46:19.000 --> 00:46:22.000
Okay, so we train… it's the same probe, the same as…

00:46:22.000 --> 00:46:28.000
Um, but, uh, as Anatoly asked, so we trained the probe on the squares on the top,

00:46:28.000 --> 00:46:30.000
Squares on the top. Um…

00:46:30.000 --> 00:46:33.000
And something funny that we saw is that

00:46:33.000 --> 00:46:37.000
If you, um, try to train the probe,

00:46:37.000 --> 00:46:41.000
on each layer of the vision encoder, it actually…

00:46:41.000 --> 00:46:49.000
you get, like, 100% accuracy from layer zero, because it can overfit to the position embedding, and that's it.

00:46:49.000 --> 00:46:52.000
Yeah. Um…

00:46:52.000 --> 00:46:58.000
But what if you try to train the probe on one image and then test the generalization on another image?

00:46:58.000 --> 00:47:02.000
Where the position of the squares is different.

00:47:02.000 --> 00:47:08.000
And what we see if we do that is that we still get a nice accuracy on top of…

00:47:08.000 --> 00:47:11.000
the square itself.

00:47:11.000 --> 00:47:15.000
But we still see these, like, strips.

00:47:15.000 --> 00:47:19.000
Um, in the original position. So the bottom one is generally… Yeah.

00:47:19.000 --> 00:47:23.000
Yeah. Um, do you have the… do you have the plot?

00:47:23.000 --> 00:47:30.000
So, here, just look at the, um, blue, green, and orange. So, the blue and green,

00:47:30.000 --> 00:47:36.000
Um, the y-axis is just the accuracy, and on the bottom is the vision encoder layers.

00:47:36.000 --> 00:47:45.000
Um, so here you can see that just from layer 0, you can get nice accuracy if you just train them on the squares, or shifted squares.

00:47:45.000 --> 00:47:47.000
It can get it right?

00:47:47.000 --> 00:47:52.000
Um, in the orange one, it's actually the other image, so you…

00:47:52.000 --> 00:47:57.000
train on one setting, and then test on different locations of the squares.

00:47:57.000 --> 00:48:04.000
And you can see that it peaks on layer 15, and it doesn't get 100% accuracy, meaning that there is

00:48:04.000 --> 00:48:11.000
kind of like a mix of information, but some of the information is not the absolute location, but something relative about the order.

00:48:11.000 --> 00:48:14.000
So, let's say, for your training set.

00:48:14.000 --> 00:48:16.000
The first square is always in token 100.

00:48:16.000 --> 00:48:19.000
Yes, exactly. Now, in your test set, it's, like…

00:48:19.000 --> 00:48:25.000
Yeah, it might even be at the location of the third one, but it's still the first one.

00:48:25.000 --> 00:48:29.000
You're not shifting the squares, it's a zoomed-out or zoomed-in version.

00:48:29.000 --> 00:48:35.000
And then I think Nikhil has a nice, uh, causal experiment that also validates that, but…

00:48:35.000 --> 00:48:40.000
Can you maybe say that again? I was confused. What's the input and what's the output of the probe?

00:48:40.000 --> 00:48:45.000
The input of the probe is the embedding of the visual tokens.

00:48:45.000 --> 00:48:50.000
And the output is the order, whether it's the first one in the image, the second one in the image, or the third one.

00:48:50.000 --> 00:48:52.000
So, 3 labels? Yes.

00:48:52.000 --> 00:48:56.000
Okay. Oh, you know what you should do? Like, uh, check… Get another one.

00:48:56.000 --> 00:49:01.000
No, no, the same probe for horizontal, vertical, and, like, differences.

00:49:01.000 --> 00:49:05.000
Yeah. Yeah, yeah, yeah, yeah. For training, I mean.

00:49:05.000 --> 00:49:14.000
Yeah, yeah, I think it could, yeah, yeah, yeah. I think that's the… I thought that's what you meant, uh, initially. No, I meant that one. I forgot about the…

00:49:14.000 --> 00:49:21.000
The joint one? Yeah, yeah. Yeah, okay, okay. Yeah, no, I think that's a good one. Yeah, we can try that. So what is the training data that were…

00:49:21.000 --> 00:49:24.000
memorize the position again?

00:49:24.000 --> 00:49:26.000
It's this… yeah, so it's…

00:49:26.000 --> 00:49:28.000
I think Nikhil will… will…

00:49:28.000 --> 00:49:37.000
We'll get to it, right? Whether that shows exact generalization, I'm not sure. The generalization is probably a mix; some of it

00:49:37.000 --> 00:49:41.000
is actually memorizing the absolute position of the…

00:49:41.000 --> 00:49:44.000
object in the image.

00:49:44.000 --> 00:49:48.000
But some of it is actually the relative position between the objects.

00:49:48.000 --> 00:49:50.000
So whether it's the first one, the second one.

00:49:50.000 --> 00:49:52.000
So, if you want to…

00:49:52.000 --> 00:49:58.000
If you want to encode the order of things in your image, it can either be by just looking at the XY coordinate,

00:49:58.000 --> 00:50:06.000
Right, yeah, so in your data, your training data, the red one is always in position 100? Yes. What's the difference between the samples in the training data?

00:50:06.000 --> 00:50:13.000
colors, different colors. Oh, just colors. Or in the other setting. So the first one is always in the same location.

00:50:13.000 --> 00:50:15.000
But it does generalize this to…

00:50:15.000 --> 00:50:17.000
different location, to some extent.

00:50:17.000 --> 00:50:22.000
different locations coming in different orders. Different pic… different pixels. Different X and Ys. You shift them all.

00:50:22.000 --> 00:50:24.000
Yeah. Oh, okay.

00:50:24.000 --> 00:50:28.000
Smaller squares, tighter together.

00:50:28.000 --> 00:50:31.000
Stasher is, like, super related, but that…

00:50:31.000 --> 00:50:33.000
I would say, like…

00:50:33.000 --> 00:50:41.000
Uh, so right now, you're training the probe in a very specific way, where the squares are always at the same location, and you're stress testing, and you…

00:50:41.000 --> 00:50:47.000
and seeing where it generalizes. I wonder whether things can go the other way around. You can actually train a stronger probe

00:50:47.000 --> 00:50:54.000
by training on, um, more diversity of, uh, square positions. Yeah. And then testing on another distribution.

00:50:54.000 --> 00:50:59.000
That's sort of for his second suggestion, right? Is that right? Yeah. And it's okay.

00:50:59.000 --> 00:51:01.000
Right, I think that's right.

00:51:01.000 --> 00:51:09.000
Yeah, so that'd be… your data is a regularizer then, and you're… you're trying to get the generalization out of it.

00:51:09.000 --> 00:51:14.000
And so, another question would be, I'm really curious about, like, whether, say, like…

00:51:14.000 --> 00:51:16.000
It works only on the VIT.

00:51:16.000 --> 00:51:20.000
or Qwen, which is trained end-to-end for visual question answering tasks, or…

00:51:20.000 --> 00:51:23.000
It works for, like, DINO or CLIP, or…

00:51:23.000 --> 00:51:26.000
I don't know, maybe not CLIP, but, like, DINO would be, it's like…

00:51:26.000 --> 00:51:32.000
So, we didn't study DINO, but uh… this… the result that I've shown here,

00:51:32.000 --> 00:51:34.000
They generalize, at least,

00:51:34.000 --> 00:51:39.000
Uh… at least to Gemma models. I don't know if Gemma is CLIP.

00:51:39.000 --> 00:51:46.000
Oh, I think LLaVA is… LLaVA is CLIP. I think…

00:51:46.000 --> 00:51:53.000
Like, using a pre-trained vision encoder. Oh, I see, then I don't know. Uh, but…

00:51:53.000 --> 00:51:57.000
Okay. But, yeah, so at least the results generalized to…

00:51:57.000 --> 00:52:00.000
Gemma models, and…

00:52:00.000 --> 00:52:04.000
Most of the results, uh, also generalize to Pixtral.

00:52:04.000 --> 00:52:09.000
But some of the experiments, uh, I still need to do on Pixtral.

00:52:09.000 --> 00:52:13.000
But so far, the experiments that I've done on Pixtral,

00:52:13.000 --> 00:52:18.000
Seems to be okay, or consistent with the results that we have.

00:52:18.000 --> 00:52:22.000
Yeah. Someone asked about, like…

00:52:22.000 --> 00:52:25.000
augmentation, regularizing these approaches.

00:52:25.000 --> 00:52:28.000
I just want to be sure that…

00:52:28.000 --> 00:52:31.000
It's, like, one…

00:52:31.000 --> 00:52:33.000
We're capturing the same mechanism.

00:52:33.000 --> 00:52:36.000
What if there's just, you know, there's a mechanism for…

00:52:36.000 --> 00:52:39.000
Let's take the relative ordering.

00:52:39.000 --> 00:52:45.000
horizontal patches that are toward the middle, a different one for… I guess, like, you know, the top of the image.

00:52:45.000 --> 00:52:47.000
We think there's, like,

00:52:47.000 --> 00:52:52.000
Yeah, you think there's, like, a risk of…

00:52:52.000 --> 00:52:55.000
when you're training probes, uh…

00:52:55.000 --> 00:52:59.000
like, the kinds of probes that we…

00:52:59.000 --> 00:53:05.000
train, or is it more like training the probes with all different kinds of configuration?

00:53:05.000 --> 00:53:08.000
Um, I mean, I guess in general…

00:53:08.000 --> 00:53:14.000
Uh-huh. Because there's, like, a range of weighting there, how specific versus how generalized…

00:53:14.000 --> 00:53:18.000
the kind of settings that you're training these probes in.

00:53:18.000 --> 00:53:21.000
you know, there's something to be considered there with respect to, like,

00:53:21.000 --> 00:53:29.000
the underlying mechanisms and how specialized or generalized they may be.

00:53:29.000 --> 00:53:34.000
Okay, so my…

00:53:34.000 --> 00:53:38.000
My answer to that would be if…

00:53:38.000 --> 00:53:43.000
the mechanism between these two types of, like,

00:53:43.000 --> 00:53:46.000
configuration are significantly different.

00:53:46.000 --> 00:53:48.000
then maybe the…

00:53:48.000 --> 00:53:53.000
The training or the test accuracy of those probes will tell you that.

00:53:53.000 --> 00:53:55.000
that maybe you're not able to train

00:53:55.000 --> 00:54:00.000
a very good classifier, like, a very good probe.

00:54:00.000 --> 00:54:06.000
either in terms of your training loss… training accuracy, or in terms of your test accuracy.

00:54:06.000 --> 00:54:10.000
I think that's what I would assume.

00:54:10.000 --> 00:54:20.000
Makes sense, yeah, it feels like it's very hard, like… Yeah. It's not grounded.

00:54:20.000 --> 00:54:21.000
Okay, so…

00:54:21.000 --> 00:54:33.000
I had another question. Uh, if… how similar are these ordering IDs to, like, the position embeddings of these tokens? I guess in the… in the vision language model, it's, um…

00:54:33.000 --> 00:54:41.000
it's probably less useful to look at this, but say in the case of just, like, the LLM, uh, just like in terms of the words,

00:54:41.000 --> 00:54:51.000
How similar are the vectors, like, from the… from just the position embeddings and these ordering, uh, IDs?

00:54:51.000 --> 00:54:54.000
The image you just showed us… can you go back to the image view?

00:54:54.000 --> 00:55:03.000
Showed from the Discord?

00:55:03.000 --> 00:55:05.000
This one?

00:55:05.000 --> 00:55:08.000
Yeah, this one.

00:55:08.000 --> 00:55:11.000
No, no, no, no. That's the one you were just on.

00:55:11.000 --> 00:55:16.000
Like, my understanding of what this is saying is that, like,

00:55:16.000 --> 00:55:18.000
if we, like…

00:55:18.000 --> 00:55:23.000
If we try to deduce the algorithm the linear probe is learning, it's learning, like, a mix of, like,

00:55:23.000 --> 00:55:30.000
This is evidence that it's just reading off the positional embedding, right? And then this faint little patch here is evidence

00:55:30.000 --> 00:55:37.000
that there's… there's actual generalization, or, like, relative ordering. Is that a correct interpretation of the results? Okay.

00:55:37.000 --> 00:55:38.000
And the same pro…

00:55:38.000 --> 00:55:40.000
Got it, thank you, yeah.

00:55:40.000 --> 00:55:45.000
Yeah, I think it's clearest in, like, this third image, right? You see this?

00:55:45.000 --> 00:55:47.000
This little guy, which definitely corresponds.

00:55:47.000 --> 00:55:49.000
Or maybe… To the relative.

00:55:49.000 --> 00:55:52.000
And then maybe if we train to…

00:55:52.000 --> 00:55:54.000
Rohit Shenyu probe.

00:55:54.000 --> 00:55:56.000
And you could just get that. Yeah.

00:55:56.000 --> 00:55:59.000
By itself, right?

00:55:59.000 --> 00:56:04.000
Okay, you should name that. Yeah.

00:56:04.000 --> 00:56:11.000
I'm willing to share the business.

00:56:11.000 --> 00:56:16.000
Yeah, I hope that, yeah, I think that answers the question. Um…

00:56:16.000 --> 00:56:22.000
So I guess we can… No, I mean, I… I want you to explain to me the figure.

00:56:22.000 --> 00:56:24.000
This figure? Yeah.

00:56:24.000 --> 00:56:30.000
Okay, so you understood the probe right? Yes. So we just take that probe. So, the probe was trained only on the square tokens.

00:56:30.000 --> 00:56:35.000
But now we apply the same probe across all the tokens, including the background tokens.

00:56:35.000 --> 00:56:37.000
And this is the result that we get.

00:56:37.000 --> 00:56:41.000
This is the test accuracy. Test accuracy.

00:56:41.000 --> 00:56:46.000
Yeah, test accuracy of the probe.

00:56:46.000 --> 00:56:49.000
So, so, so assume that you, uh…

00:56:49.000 --> 00:56:52.000
You're looking at the first Vision token.

00:56:52.000 --> 00:56:57.000
The first vision token. In our image, that would be something like this.

00:56:57.000 --> 00:57:00.000
Just a blank patch.

00:57:00.000 --> 00:57:03.000
White patch?

00:57:03.000 --> 00:57:06.000
No, that's not in the training data.

00:57:06.000 --> 00:57:08.000
So how does the test data work?

00:57:08.000 --> 00:57:13.000
The test data just take…

00:57:13.000 --> 00:57:16.000
the input to the probe is the token embedding.

00:57:16.000 --> 00:57:18.000
During training, we only show

00:57:18.000 --> 00:57:31.000
tokens from the squares exclusively. With different, different colors. Different colors. And it's always the exact same shape. Always the same. No shape variations. We also have other datasets where we…

00:57:31.000 --> 00:57:42.000
This is the simple dataset of squares. It can be objects, it can be… for this setting that we're talking about right now, the training is these three squares, just different colors. Yeah, okay, and then how does the test data look?

00:57:42.000 --> 00:57:45.000
The test data also includes the embeddings

00:57:45.000 --> 00:57:51.000
of all the tokens in the image. Oh, but it's still the same shape? Yes, but it's… a held-out set of…

00:57:51.000 --> 00:57:53.000
Let me just see. Okay.

00:57:53.000 --> 00:58:03.000
But still, there's 3. For example, in the object setting, when you have, like, tiny images with an object, it will be completely different objects.

00:58:03.000 --> 00:58:06.000
But it has the same configuration, just different colors.

00:58:06.000 --> 00:58:09.000
Okay, and then we apply the probe.

00:58:09.000 --> 00:58:11.000
on all the patches, all the…

00:58:11.000 --> 00:58:14.000
tokens corresponding to each of the patches.

00:58:14.000 --> 00:58:19.000
And then we see that the test accuracy is not only high on the square tokens, which is…

00:58:19.000 --> 00:58:26.000
what it was supposed to… what it was trained on. We see high test accuracy also on the background tokens.

00:58:26.000 --> 00:58:31.000
background tokens, which are sort of, like, creating a strip around the square, which

00:58:31.000 --> 00:58:33.000
it wasn't trained on. So it needs to say…

00:58:33.000 --> 00:58:37.000
No, there's… this is not the first position. It knows that

00:58:37.000 --> 00:58:40.000
This is the first position.

00:58:40.000 --> 00:58:43.000
So, so if you take that first patch that I was talking about,

00:58:43.000 --> 00:58:48.000
You take its corresponding embedding, and then use your probe.

00:58:48.000 --> 00:58:51.000
then it will say that, okay, this is the first…

00:58:51.000 --> 00:58:56.000
position. Like, this is… this is the first one. Now…

00:58:56.000 --> 00:58:59.000
So how many labels can it predict?

00:58:59.000 --> 00:59:06.000
It's still 3 labels. Yeah, labels is still 3. It's doing something silly, like, instead of doing… So, here are the labels, like this? Yeah.

00:59:06.000 --> 00:59:13.000
Okay. Yeah, so this is… first, labels are just first square, second square, third square, that's it. There's these three class… classifiers.

00:59:13.000 --> 00:59:21.000
Yeah, I understand, but for the white one, what is the label? That was my question. What is the gold label for the white patch? It doesn't have, like… It doesn't have it, yeah.

00:59:21.000 --> 00:59:25.000
We didn't label it, and now we say, oh, interesting, what…

00:59:25.000 --> 00:59:30.000
So, like, something that will be very not surprising will be if the white

00:59:30.000 --> 00:59:35.000
background would have zero accuracy for all the patches observed.

00:59:35.000 --> 00:59:41.000
When you say accuracy, what is the gold label of the white background? Accuracy is whether the probe… we have three probes.

00:59:41.000 --> 00:59:47.000
which probe is accurate, if it predicts the right order. So the first probe predicts first order.

00:59:47.000 --> 00:59:49.000
Right? I'm the first object.

00:59:49.000 --> 00:59:58.000
The second probe predicts, I'm the second. So then, what's the order of the background? It doesn't have an order, but it seems like the vision encoder assigned an order to the background.

00:59:58.000 --> 01:00:04.000
Although it isn't… it isn't supposed to… So it's not accuracy. It is accuracy. Oh, I understand what your question is.

01:00:04.000 --> 01:00:07.000
So when you give the white patch…

01:00:07.000 --> 01:00:13.000
When you give the white patch embedding as an input, the output is a softmax over 3 labels.

01:00:13.000 --> 01:00:16.000
whatever the accuracy is for the first token.

01:00:16.000 --> 01:00:18.000
like, the first class that…

01:00:18.000 --> 01:00:24.000
plotting that. Yeah, that's the prediction. Assume that we have… I think I got the…

01:00:24.000 --> 01:00:31.000
accurate. 90% confident. That's confidence, actually. 90% confident that it's coming from class one.

01:00:31.000 --> 01:00:36.000
Yeah. So that's not accuracy. That's not accuracy, that's confidence. Okay. Yeah, it's first-class confidence.

01:00:36.000 --> 01:00:42.000
It's very confident about the white patch above the red square. But maybe what they're showing here is…

01:00:42.000 --> 01:00:46.000
Now, this is ac- this is accuracy. So we… let's assume that…

01:00:46.000 --> 01:00:50.000
the ground truth is the first one, is the first square for the first patch.

01:00:50.000 --> 01:00:56.000
No, what you're showing here is you're saying… you're saying accuracy, if every patch… Uh-huh.

01:00:56.000 --> 01:01:06.000
The ground truth was zero. Like, for the whole image. Yeah, exactly. Then you're showing accuracy there. And that's not accuracy really, because everything has the same label.

01:01:06.000 --> 01:01:12.000
Right, but it's just… it's just the number of predictions. That's not accuracy? The accuracy of the probe.

01:01:12.000 --> 01:01:15.000
just to show that the background…

01:01:15.000 --> 01:01:19.000
tokens actually contain information relevant to the probe.

01:01:19.000 --> 01:01:29.000
As if the probe was trained to predict the background label. Right, but I shouldn't treat it as accuracy, but look at it as the probe prediction. Yes, okay, yes.

01:01:29.000 --> 01:01:39.000
Yeah, the word accuracy was confusing, Jose, because there are labels. Wait, I'm not sure if I follow completely. This… you're saying this… you

01:01:39.000 --> 01:01:41.000
Accuracy? That's great also.

01:01:41.000 --> 01:01:49.000
I don't know. We… I think we can find a way to define it, but what we actually care about is what is the prediction of the probe on these tokens.

01:01:49.000 --> 01:01:51.000
Yeah, and we check it if it's…

01:01:51.000 --> 01:01:53.000
We check it by saying…

01:01:53.000 --> 01:01:55.000
Let's assume all the labels were 1.

01:01:55.000 --> 01:02:01.000
Yes, exactly. So, we are checking the accuracy here. Except that… except that there's this objection.

01:02:01.000 --> 01:02:03.000
To use the word accuracy,

01:02:03.000 --> 01:02:18.000
when what you're measuring is not whether it's right or not. Yeah, it's not like… Because normally you use the word accuracy to talk about, you get one if it's right, and zero if it's wrong. And you're using accuracy just as units. Yeah, accuracy for

01:02:18.000 --> 01:02:22.000
particular label. Just, just, uh…

01:02:22.000 --> 01:02:25.000
I find it very well. What's the top…

01:02:25.000 --> 01:02:27.000
If you do an argmax on the softmax, what would be the prediction?

01:02:27.000 --> 01:02:31.000
Okay. 90 times out of 100.

01:02:31.000 --> 01:02:34.000
The white patch has been predicted as class 1.

01:02:34.000 --> 01:02:37.000
So you might call that prediction rate. 90 out of…

01:02:37.000 --> 01:02:40.000
Like, if that's 90, like, let's say, the accuracy. Prediction rate.

01:02:40.000 --> 01:02:43.000
I don't know. That sounds like accuracy to me.

01:02:43.000 --> 01:02:50.000
Well, but not to other readers, right? Okay, okay, okay, maybe we can change the wording there, okay.

01:02:50.000 --> 01:02:52.000
Accuracy implies that…

01:02:52.000 --> 01:02:58.000
Right, it's like, if you have a high accuracy, that's good, but if you have a high prediction rate of something that

01:02:58.000 --> 01:03:04.000
a white token is predicted to be, like, the red… Okay, that makes sense, that makes sense. Okay.

01:03:04.000 --> 01:03:07.000
Problem is, uh, we're calling this prediction is…

01:03:07.000 --> 01:03:11.000
we didn't expect this prediction. It's not like we tested

01:03:11.000 --> 01:03:27.000
the hypothesis. It's okay, but it's all pretty… Yeah. Yeah, that's okay, that's okay. It's not that you predicted the model would do this. Yeah, yeah, that's okay. Yeah, we should look at how we word this. Okay, but yeah, okay, so the main, main result, main takeaway is…

01:03:27.000 --> 01:03:40.000
that the ordering information of the square is not concentrated or localized only on the square tokens. It is diffused across background tokens as well. Can I say it again?

01:03:40.000 --> 01:03:42.000
Yes. That I said in the start.

01:03:42.000 --> 01:03:46.000
I was like, what? It's a surprise, like…

01:03:46.000 --> 01:03:48.000
You have objects?

01:03:48.000 --> 01:03:52.000
you… they have a certain order inside the image.

01:03:52.000 --> 01:03:59.000
But now it seems like that order is not associated with the object itself, but kind of like a strip around.

01:03:59.000 --> 01:04:01.000
like, uh…

01:04:01.000 --> 01:04:12.000
So it means that the probe thinks it's there, right?

01:04:12.000 --> 01:04:15.000
The yellow… the yellow points show that

01:04:15.000 --> 01:04:21.000
let's say on the top, the probe predicts this to be in position 0. Yes. That's right, yeah.

01:04:21.000 --> 01:04:24.000
This is what it means. The prediction rate. Yes.

01:04:24.000 --> 01:04:33.000
Do you have also English? Yes, that's, like, next year. What do you think about this, like…

01:04:33.000 --> 01:04:38.000
So, like, here's, like, a formal construction that I'm thinking about in reference to this.

01:04:38.000 --> 01:04:44.000
if I trained a linear probe on the same classification task, but the only inputs I gave it

01:04:44.000 --> 01:04:47.000
were the positional embeddings with, like,

01:04:47.000 --> 01:04:59.000
maybe, like, the Y values of, like, with the information related to Y ablated out. Like, no square, no nothing. Yeah. Like, I… for one, for this, for the specific task you described, which is not the Rohit version, it would do a perfect job.

01:04:59.000 --> 01:05:07.000
And then if you actually produce… reproduce this figure, it would look, like, perfect columns, right? So what's, like… I don't… what's the evidence that it's not that…

01:05:07.000 --> 01:05:10.000
Your linear probe is not just doing that. Are you ready for review?

01:05:10.000 --> 01:05:13.000
I think we should move to the causal experiments.

01:05:13.000 --> 01:05:16.000
Okay. Just last question, what layer is this on?

01:05:16.000 --> 01:05:24.000
Just the embedding of the… This is the input embedding to the language model. The last layer of the vision model, after the projector.

01:05:24.000 --> 01:05:31.000
No, after the MLP. After the projector. So, input to the language model backbone.

01:05:31.000 --> 01:05:34.000
Even past the last layer of the vision encoder. Yes.

01:05:34.000 --> 01:05:41.000
just… if you're gonna go back, sorry, to delay this one a bit more, but to maybe address, like, that objection,

01:05:41.000 --> 01:05:45.000
If you trained on something like this same dataset,

01:05:45.000 --> 01:05:47.000
But the three squares are kind of on the top.

01:05:47.000 --> 01:05:53.000
then you would have examples where you have the exact same square, but in this case, it's the second one.

01:05:53.000 --> 01:05:58.000
And in this case, it's the first one. So it's exactly the same positional embeddings, but it's kind of…

01:05:58.000 --> 01:06:04.000
you're kind of showing that it's figuring out… Yeah, you also have the vertical one. I think just continue.

01:06:04.000 --> 01:06:12.000
Okay. So, this is the probing result. This shows that, okay, maybe the ordering information is diffused, or at least

01:06:12.000 --> 01:06:24.000
at least present across a bunch of different tokens, but we all know that probing is only a correlational result. It does not say anything about causality; the information might be present there, and…

01:06:24.000 --> 01:06:33.000
It might not be used by the model. So we do this intervention experiment to actually show that the information present in the strip is causal in nature.

01:06:33.000 --> 01:06:35.000
Um, and this is the…

01:06:35.000 --> 01:06:37.000
two experiments that we did.

01:06:37.000 --> 01:06:39.000
Um…

01:06:39.000 --> 01:06:43.000
So the difference between clean and counterfactual is that the order of…

01:06:43.000 --> 01:06:46.000
Left and right.

01:06:46.000 --> 01:06:50.000
squares are reversed. In the clean one, red is the left one, and…

01:06:50.000 --> 01:06:53.000
On the right, uh, on the counterfactual, the right…

01:06:53.000 --> 01:06:57.000
the red one is the right one. Okay, so if we do…

01:06:57.000 --> 01:07:02.000
If the model is creating this ordering ID information, then when we do…

01:07:02.000 --> 01:07:09.000
this patching, the color of the square should remain the same,

01:07:09.000 --> 01:07:12.000
While the ordering information should get reversed.

01:07:12.000 --> 01:07:19.000
And because of that reversal of the ordering ID, the final answer of the clean run should change from

01:07:19.000 --> 01:07:22.000
um, red to blue.

01:07:22.000 --> 01:07:24.000
Here we are asking what is to the left of the green square.

01:07:24.000 --> 01:07:29.000
Okay, so that's pretty much the experiment. And we do intervention.

01:07:29.000 --> 01:07:35.000
on either only the square tokens, or on the entire strip of tokens.

01:07:35.000 --> 01:07:38.000
Including this one. Including the square.

01:07:38.000 --> 01:07:43.000
So, this is the result for patching in the entire strip.

01:07:43.000 --> 01:07:46.000
X-axis is the embedding.

01:07:46.000 --> 01:07:50.000
Embedding layer, and the later layers in the language model.

01:07:50.000 --> 01:07:53.000
Backbone, and Y-axis is…

01:07:53.000 --> 01:07:55.000
let's say it is probability… yeah.

01:07:55.000 --> 01:08:02.000
Um, so we see that, okay, even when you patch in the embedding layer,

01:08:02.000 --> 01:08:05.000
We have very high causal effect.

01:08:05.000 --> 01:08:12.000
that the model starts to think that the square to the left of the green square is actually blue.

01:08:12.000 --> 01:08:18.000
And not red. Right from layer zero. Right from layer zero. Right from layer zero, it says the…

01:08:18.000 --> 01:08:20.000
incorrect answer.

01:08:20.000 --> 01:08:23.000
Yes. One second, one second.

01:08:23.000 --> 01:08:29.000
But if we do the patching on the square tokens only, not the entire strip, we don't see that effect.

01:08:29.000 --> 01:08:33.000
The model still thinks that the red is the correct answer and not the blue.

01:08:33.000 --> 01:08:36.000
If we do patching only on the square token.

01:08:36.000 --> 01:08:38.000
So, well, shit, I…

01:08:38.000 --> 01:08:41.000
Okay, so if you come… think of…

01:08:41.000 --> 01:08:45.000
These two results combined. What that means is,

01:08:45.000 --> 01:08:47.000
There is a causal effect

01:08:47.000 --> 01:08:51.000
Associated with the background tokens. If you don't touch the background tokens,

01:08:51.000 --> 01:08:53.000
The model…

01:08:53.000 --> 01:08:57.000
model does not really have enough causal effect

01:08:57.000 --> 01:09:00.000
to be able to change the final output.

01:09:00.000 --> 01:09:03.000
So, essentially, what that means is that entire strip

01:09:03.000 --> 01:09:07.000
has ordering information, which is causal in nature. It is not only

01:09:07.000 --> 01:09:14.000
Just there. Which is causally relevant to its accuracy, right? Exactly. Okay. Yeah.

01:09:14.000 --> 01:09:15.000
Question? Yes, it's…

01:09:15.000 --> 01:09:17.000
I wanted to ask what you patch.

01:09:17.000 --> 01:09:20.000
What do I patch?

01:09:20.000 --> 01:09:25.000
And as I understand it, you patch the whole strip.

01:09:25.000 --> 01:09:28.000
we do two kinds. We either patch the square,

01:09:28.000 --> 01:09:30.000
or we patch the entire strip.

01:09:30.000 --> 01:09:35.000
And when you patch the square, uh, you patch all four? Yes.

01:09:35.000 --> 01:09:40.000
It's only those four, and when we patch the strip, it's those four, plus…

01:09:40.000 --> 01:09:46.000
all the background tokens in that strip.

01:09:46.000 --> 01:09:51.000
I wasn't convinced until this one.

01:09:51.000 --> 01:09:54.000
This is more convincing.

01:09:54.000 --> 01:09:56.000
Maybe we should do only the white ones.

01:09:56.000 --> 01:10:00.000
This feels like it's just, uh, this woman in…

01:10:00.000 --> 01:10:01.000
patchings that are happening, and maybe it's just a person.

01:10:01.000 --> 01:10:05.000
Uh, question?

01:10:05.000 --> 01:10:09.000
Your internet connection is unstable.

01:10:09.000 --> 01:10:10.000
I don't know.

01:10:10.000 --> 01:10:15.000
No, uh, I… I had a question about, uh, instead of patching the whole script, uh, whole strip,

01:10:15.000 --> 01:10:25.000
If you just patch that red square, but onto a different position on that first column,

01:10:25.000 --> 01:10:26.000
Yeah, I think we have…

01:10:26.000 --> 01:10:27.000
Uh, you know, so you're essentially changing the position of the square. Uh, I don't know, this is an annoying question, like, yeah, what, what…

01:10:27.000 --> 01:10:33.000
No, I think there's some dynamic, that's a good question. We have that experiment in one of our later slides. I'm gonna show that.

01:10:33.000 --> 01:10:40.000
So, can we look at the graphs in the meantime, or if we ask some questions?

01:10:40.000 --> 01:10:43.000
When you do the patches…

01:10:43.000 --> 01:10:45.000
From one side to the other.

01:10:45.000 --> 01:10:48.000
is the positional encoding

01:10:48.000 --> 01:10:51.000
embedded in those tokens when you're swapping them?

01:10:51.000 --> 01:11:00.000
like, is it, like, it appears to have the same positional encoding as having been on the right side, and it's swapped over, or is it…

01:11:00.000 --> 01:11:04.000
the positional encoding. It's given the same positional encoding as if it was on the left side.

01:11:04.000 --> 01:11:09.000
Does that… is that the question? Um…

01:11:09.000 --> 01:11:11.000
No, I guess…

01:11:11.000 --> 01:11:17.000
Maybe what you mean by position? Because I think the previous discussion about the probing was like, oh, like, maybe, like,

01:11:17.000 --> 01:11:23.000
the ordering just comes from, like, the positional encoding that you give the tokens before they enter all of the stuff, and…

01:11:23.000 --> 01:11:26.000
I guess what I'm worried about is, like, is…

01:11:26.000 --> 01:11:31.000
is, like, is the tokens, do they contain the…

01:11:31.000 --> 01:11:34.000
relative positional encoding

01:11:34.000 --> 01:11:38.000
when you do the patching. Uh-huh.

01:11:38.000 --> 01:11:44.000
So it's almost an architectural question you're asking, because for some transformer implementations,

01:11:44.000 --> 01:11:46.000
the positional encoding is…

01:11:46.000 --> 01:11:49.000
directly added into the…

01:11:49.000 --> 01:11:52.000
you know, the vectors, and in other transformer…

01:11:52.000 --> 01:11:56.000
implementations, the positional encoding is never materialized.

01:11:56.000 --> 01:12:00.000
It's just implicitly added as part of attention.

01:12:00.000 --> 01:12:04.000
Uh, later on. And so, I think you're asking, oh, when you patch over…

01:12:04.000 --> 01:12:07.000
Are you patching over from…

01:12:07.000 --> 01:12:12.000
are you patching vectors where the positional encoding has already been added?

01:12:12.000 --> 01:12:15.000
Are you patching over vectors where the position…

01:12:15.000 --> 01:12:17.000
encoding has not been added yet.

01:12:17.000 --> 01:12:22.000
And it'll be added after you've done the patch. So that's the question? Yeah, yeah. So the answer for that is,

01:12:22.000 --> 01:12:24.000
So even in the VIT, the vision encoder,

01:12:24.000 --> 01:12:26.000
There is positional information.

01:12:26.000 --> 01:12:31.000
And… You mean where? Before you patch or after you patch? At each…

01:12:31.000 --> 01:12:39.000
Before, before… Yeah. Yeah, yeah, yeah. Right, before you patch. Yeah, before I patch. There is RoPE there in the vision encoder as well. Okay.

01:12:39.000 --> 01:12:42.000
So, technically, it has… it has…

01:12:42.000 --> 01:12:51.000
access to that positional information. Okay. And the other thing is, here, I'm showing not the… not only the result at the embedding layer, but results across

01:12:51.000 --> 01:12:56.000
a bunch of layers in the language model. So even in the language model, there is RoPE.

01:12:56.000 --> 01:13:02.000
Okay. So, it has access to both of those sources of positional information.

01:13:02.000 --> 01:13:09.000
Okay, well, like, I want to clarify the question a little bit. So, I feel like, because Transformers are permutation invariant,

01:13:09.000 --> 01:13:12.000
There's no way, like, the model…

01:13:12.000 --> 01:13:20.000
is able to figure out this task without positional encoding, yes. So the question is more like how, like, these models leverage this positional encoding, whether it's just, like,

01:13:20.000 --> 01:13:23.000
Hard coding, like, the absolute position, or actually…

01:13:23.000 --> 01:13:25.000
learning a generalized notion of order.

01:13:25.000 --> 01:13:28.000
finding, like, object-relative associations with, like…

01:13:28.000 --> 01:13:30.000
They're relative. Uh… Mm-hmm.

01:13:30.000 --> 01:13:35.000
Yeah, yeah. It's more like how they're using this. Yeah, I think that's a good question. I think before I was…

01:13:35.000 --> 01:13:44.000
Before we ran some results a few weeks back, my understanding was it's the second thing. The models are creating some form of, like,

01:13:44.000 --> 01:13:47.000
abstract relative positional information.

01:13:47.000 --> 01:13:56.000
to actually do the binding. And that's what we have seen in most of the results in the language model space. But I think I have a few results in some of the later

01:13:56.000 --> 01:13:58.000
slides where this…

01:13:58.000 --> 01:14:06.000
this ordering ID business in the vision encoder seems a bit more complex, a bit more complicated. It's not only

01:14:06.000 --> 01:14:10.000
those relative abstract position information, but…

01:14:10.000 --> 01:14:13.000
The model seems to be encoding some form of absolute

01:14:13.000 --> 01:14:16.000
positional information as well.

01:14:16.000 --> 01:14:21.000
Well, even though they're using RoPE, 2D RoPE. Yeah. Okay.

01:14:21.000 --> 01:14:23.000
It's also…

01:14:23.000 --> 01:14:25.000
at least.

01:14:25.000 --> 01:14:29.000
easier for language models, in the paper we submitted last year.

01:14:29.000 --> 01:14:32.000
Yeah, this, like, idea of delusional meaning.

01:14:32.000 --> 01:14:36.000
Specifically because of the causal mask.

01:14:36.000 --> 01:14:40.000
You don't even need any, like, rope, but you can just do language modeling.

01:14:40.000 --> 01:14:44.000
Without any positional encoding, and it works because of the causal mask.

01:14:44.000 --> 01:14:47.000
It's another, like, extra signal that we have.

01:14:47.000 --> 01:14:50.000
Yeah, so that's, like, that's, like, one of the evidences.

01:14:50.000 --> 01:14:53.000
That you can use to say that the models can create

01:14:53.000 --> 01:14:56.000
Or at least language models can create more abstract

01:14:56.000 --> 01:15:01.000
uh, sort of, like, positional information in its internal representation.

01:15:01.000 --> 01:15:06.000
In defense of the vision encoder, you can also say that

01:15:06.000 --> 01:15:09.000
Because it doesn't have this causal masking,

01:15:09.000 --> 01:15:12.000
it can generate these, um, strip-like patterns.

01:15:12.000 --> 01:15:19.000
Yeah. Otherwise…

01:15:19.000 --> 01:15:22.000
Otherwise, each token can only see.

01:15:22.000 --> 01:15:24.000
The one before it.

01:15:24.000 --> 01:15:32.000
I think RoPE still allows it to do it, but I just mean that for the language models, it's, like, more of an inductive bias. Yeah.

01:15:32.000 --> 01:15:35.000
Yeah, I agree with that.

01:15:35.000 --> 01:15:41.000
about the causal experiment. Mm-hmm.

01:15:41.000 --> 01:15:44.000
What if the…

01:15:44.000 --> 01:15:47.000
And the information, it just…

01:15:47.000 --> 01:15:49.000
is spread. Not only in this…

01:15:49.000 --> 01:15:51.000
strip, but…

01:15:51.000 --> 01:15:55.000
It's just all over the image.

01:15:55.000 --> 01:15:59.000
And you need to change enough.

01:15:59.000 --> 01:16:04.000
content in order to affect the language model.

01:16:04.000 --> 01:16:09.000
Because the information is all around, not only within the strip. And you may…

01:16:09.000 --> 01:16:14.000
You made the experiment only with this strip, so this is what we see.

01:16:14.000 --> 01:16:16.000
We need a baseline.

01:16:16.000 --> 01:16:19.000
But maybe if you just, um…

01:16:19.000 --> 01:16:21.000
Yeah, it's both been…

01:16:21.000 --> 01:16:25.000
Enough. Um…

01:16:25.000 --> 01:16:29.000
Yeah, so I think I would say that

01:16:29.000 --> 01:16:35.000
I mean, yeah, you can just take a few tokens around the Square token, and maybe if you patch that, it might work.

01:16:35.000 --> 01:16:44.000
That's okay, but I think the probing results show that the signal is diffused across the strip, so I think it is more principled to patch the entire strip.

01:16:44.000 --> 01:16:50.000
Because there is no…

01:16:50.000 --> 01:16:55.000
variance in the dataset, like, all… all the…

01:16:55.000 --> 01:16:58.000
All the squares are in the same position.

01:16:58.000 --> 01:17:01.000
But I think…

01:17:01.000 --> 01:17:16.000
If there was… I think variability in the y-axis… Yeah, I know, training the probe with, uh… But that becomes an easier thing. So if I would have trained it across a bunch of different squares in, like, different positions along

01:17:16.000 --> 01:17:22.000
the Y axis, then it becomes much more expected that, okay, you will see the strip.

01:17:22.000 --> 01:17:24.000
But now that we are only

01:17:24.000 --> 01:17:27.000
training it on just one position,

01:17:27.000 --> 01:17:28.000
And even then, we are seeing the effect on the entire strip. I think that's more surprising.

01:17:28.000 --> 01:17:34.000
Okay, okay.

01:17:34.000 --> 01:17:36.000
Well, playing devil's advocate.

01:17:36.000 --> 01:17:40.000
It's much easier for the probe to take into account only the X.

01:17:40.000 --> 01:17:43.000
Um, X.

01:17:43.000 --> 01:17:47.000
Because it's always the same place.

01:17:47.000 --> 01:17:51.000
In the Y, it's always in the same place, so the probe…

01:17:51.000 --> 01:17:57.000
doesn't need the information of the Y axis.

01:17:57.000 --> 01:18:01.000
Okay. I think a nice baseline would be…

01:18:01.000 --> 01:18:03.000
to do the patching experiment.

01:18:03.000 --> 01:18:05.000
Please, um…

01:18:05.000 --> 01:18:07.000
Last week, or whatever?

01:18:07.000 --> 01:18:11.000
like, that patches the exact same number of tokens

01:18:11.000 --> 01:18:15.000
as in the original strip experiment?

01:18:15.000 --> 01:18:18.000
But, like, shaped otherwise. Like, it can be, or… I don't know.

01:18:18.000 --> 01:18:21.000
Instead of, for example, vertical, it can be…

01:18:21.000 --> 01:18:29.000
Yeah, diagonal, it can be a square around the square, but the same number of tokens. So a random shape.

01:18:29.000 --> 01:18:32.000
And what will that tell us? That it's not…

01:18:32.000 --> 01:18:37.000
just the portion of image information that you took.

01:18:37.000 --> 01:18:39.000
13!

01:18:39.000 --> 01:18:42.000
But it's exactly that, like…

01:18:42.000 --> 01:18:46.000
structure inside the…

01:18:46.000 --> 01:18:49.000
Okay, as a baseline, maybe, you know, you can use it, okay.

01:18:49.000 --> 01:18:53.000
Okay. Again, I have, like, a technical question.

01:18:53.000 --> 01:18:55.000
And then a suggestion.

01:18:55.000 --> 01:18:59.000
So, technical question is, like, these are four tokens.

01:18:59.000 --> 01:19:01.000
in your experiment. Mm-hmm.

01:19:01.000 --> 01:19:03.000
And you are giving…

01:19:03.000 --> 01:19:06.000
the top two and the bottom two.

01:19:06.000 --> 01:19:09.000
Right? Like, in the red square of the first.

01:19:09.000 --> 01:19:13.000
So, two tokens from the top and two tokens from the bottom.

01:19:13.000 --> 01:19:16.000
So, like, a good neural network should learn.

01:19:16.000 --> 01:19:18.000
that, okay, anything that is…

01:19:18.000 --> 01:19:22.000
Like, stacked on top of, like, these tokens should be in, like, class 1.

01:19:22.000 --> 01:19:25.000
So…

01:19:25.000 --> 01:19:27.000
like, let's say the Rohit…

01:19:27.000 --> 01:19:29.000
I'm seeing you through.

01:19:29.000 --> 01:19:31.000
was just paying dog.

01:19:31.000 --> 01:19:37.000
this and that, like, only these two cases, but different colors.

01:19:37.000 --> 01:19:46.000
Okay. Same probe. Okay. Same positions, though… Yeah, yeah, yeah, yeah. No changes in position. So, if my hypothesis is correct…

01:19:46.000 --> 01:19:52.000
Uh-huh. Then you should see only the…

01:19:52.000 --> 01:19:55.000
like this, uh, the position that you're seeing.

01:19:55.000 --> 01:19:58.000
only, like, in a L shape.

01:19:58.000 --> 01:20:05.000
in a L shape, yes? Yeah, because the first… Yeah, the first, yeah.

01:20:05.000 --> 01:20:11.000
So that tells us it's probably not… sorry, probably not an ordering

01:20:11.000 --> 01:20:13.000
representation, it's more positional in nature.

01:20:13.000 --> 01:20:18.000
If you don't see it, if you still see this happening, like, where you are seeing, like, all these…

01:20:18.000 --> 01:20:22.000
Entire columns coming out to be…

01:20:22.000 --> 01:20:29.000
Uh, in the test, if you're seeing the same structure, then maybe I'll be more convinced that, like, it's an ordering result.

01:20:29.000 --> 01:20:31.000
So if it is L,

01:20:31.000 --> 01:20:34.000
If it's an L, then I think it's…

01:20:34.000 --> 01:20:40.000
Then it is just, like, absolute… it is encoding learning X and Y coordinates. Yeah.

01:20:40.000 --> 01:20:44.000
It is not in action y coordinates.

01:20:44.000 --> 01:20:47.000
Okay.

01:20:47.000 --> 01:20:51.000
Okay.

01:20:51.000 --> 01:20:53.000
Like, the other probe, when you still see…

01:20:53.000 --> 01:21:00.000
You have a different arrangement of the image, specifically, there's three, you know, in the same location as the original ones.

01:21:00.000 --> 01:21:08.000
Yeah. So that's the overfitting to the X and Y there, for me. Yeah. But you do see some generalization, plus with the object datasets.

01:21:08.000 --> 01:21:10.000
So, that…

01:21:10.000 --> 01:21:18.000
like, I have another counterpoint for you. Like, it's probably, like, uh, this is a square.

01:21:18.000 --> 01:21:20.000
like a square representation.

01:21:20.000 --> 01:21:23.000
Right, but it does… I don't know. But it does know…

01:21:23.000 --> 01:21:26.000
To extract the order.

01:21:26.000 --> 01:21:29.000
So it knows, okay, oh,

01:21:29.000 --> 01:21:31.000
This is a background, and I actually need to…

01:21:31.000 --> 01:21:38.000
in the background, I need to overfit to the X and Y, but if I have a square, then suddenly I have more information.

01:21:38.000 --> 01:21:44.000
And that's the information it encodes. So there is some information of the square that is relevant to the prediction.

01:21:44.000 --> 01:21:46.000
Okay, so…

01:21:46.000 --> 01:21:49.000
Um… that makes sense.

01:21:49.000 --> 01:21:52.000
And, like, it'd be nice, like…

01:21:52.000 --> 01:21:55.000
It's, like, a slightly stronger result, I think. Yeah.

01:21:55.000 --> 01:21:59.000
Also, a second thing is when you do patching from your…

01:21:59.000 --> 01:22:01.000
entire white strip. Mm-hmm.

01:22:01.000 --> 01:22:05.000
Um, so this is, again, autoregressive in language space.

01:22:05.000 --> 01:22:12.000
Correct. All these tokens. So the bottom token of the strip probably already knows it comes from the column of the red token.

01:22:12.000 --> 01:22:19.000
But the top white patch might not, you know… no wait, it still knows, because it comes from the vision encoder.

01:22:19.000 --> 01:22:21.000
That's the claim.

01:22:21.000 --> 01:22:23.000
No, that… that is the architecture.

01:22:23.000 --> 01:22:25.000
Yeah, that's it, yeah.

01:22:25.000 --> 01:22:28.000
Okay, anyways, I think it's, uh…

01:22:28.000 --> 01:22:31.000
Practical question.

01:22:31.000 --> 01:22:35.000
How many more slides do you have?

01:22:35.000 --> 01:22:40.000
So I think that's one of the main results. Oh, that's good.

01:22:40.000 --> 01:22:45.000
No, actually, there's one more main thing. Should we go into the other main result?

01:22:45.000 --> 01:22:48.000
So, uh, should I take 5 more minutes? Sure.

01:22:48.000 --> 01:22:54.000
Okay, so… okay, so those are the results for showing that the vision encoder

01:22:54.000 --> 01:22:58.000
already encodes ordering information

01:22:58.000 --> 01:23:01.000
of the squares.

01:23:01.000 --> 01:23:06.000
Even before the language model gets into the picture. So then, what's the… then the question is, what's this language model doing?

01:23:06.000 --> 01:23:12.000
Is it just using that piece of information, or does it create its own set of…

01:23:12.000 --> 01:23:15.000
ordering information, or what is happening?

01:23:15.000 --> 01:23:17.000
So, to answer that question, we did, uh…

01:23:17.000 --> 01:23:20.000
Another patching experiment.

01:23:20.000 --> 01:23:22.000
Here, the idea is to basically

01:23:22.000 --> 01:23:27.000
Ablate out or remove all the ordering information coming from the vision encoder.

01:23:27.000 --> 01:23:29.000
So the way we do that…

01:23:29.000 --> 01:23:35.000
is like we create a synthetic image, like a representation of a synthetic image,

01:23:35.000 --> 01:23:40.000
Where the representation of each square

01:23:40.000 --> 01:23:44.000
is… is, like, taken from

01:23:44.000 --> 01:23:47.000
the… like, a different image.

01:23:47.000 --> 01:23:52.000
With the same square placed in the middle. And that's the only square.

01:23:52.000 --> 01:23:54.000
So the idea here is,

01:23:54.000 --> 01:24:01.000
If the model is encoding that this is the first square, second square, third square, then if we just take their representation from a single…

01:24:01.000 --> 01:24:04.000
square image. So, uh…

01:24:04.000 --> 01:24:06.000
then…

01:24:06.000 --> 01:24:12.000
The representation of each square should just be saying that, okay, I'm the first one, I'm the first one, I'm the first one.

01:24:12.000 --> 01:24:18.000
And we also patch with the activations of the backward… sorry, not backward… background tokens.

01:24:18.000 --> 01:24:22.000
to destroy the information in those background tokens.

01:24:22.000 --> 01:24:25.000
Okay, so if we do that patching…

01:24:25.000 --> 01:24:32.000
that set of patches, the… the ordering information from the vision encoder should get destroyed.

01:24:32.000 --> 01:24:34.000
And if you do that, the behavioral performance decreases.

01:24:34.000 --> 01:24:37.000
It decreases significantly.

01:24:37.000 --> 01:24:45.000
Which is kind of expected. But the point here is it's still… is above the chance. Chance is 33%. It's not…

01:24:45.000 --> 01:24:51.000
Near 33%. So that means language model does seem to do something.

01:24:51.000 --> 01:24:58.000
Um, so then we check if the models or the language model does create its own, like, ordering information or not.

01:24:58.000 --> 01:25:01.000
By just doing the same patching experiment,

01:25:01.000 --> 01:25:03.000
But now, the…

01:25:03.000 --> 01:25:07.000
The ordering information from the vision encoder is destroyed.

01:25:07.000 --> 01:25:14.000
And then, in this experiment, we only need to do the patching on the squares, we don't need to do the patching on the entire strip.

01:25:14.000 --> 01:25:18.000
And it's the same experiment, and this is the result.

01:25:18.000 --> 01:25:25.000
What this shows is, even the language model does seem to create its own ordering information in some of its middle…

01:25:25.000 --> 01:25:28.000
Middle layers.

01:25:28.000 --> 01:25:30.000
Um…

01:25:30.000 --> 01:25:33.000
Yeah, so that's… So the x-axis in the previous experiment…

01:25:33.000 --> 01:25:43.000
With layers of the vision encoder, and then the x-axis… No, this is both language, both the previous experiment and this experiment, they're both language model layers. Yes, the language model backbone.

01:25:43.000 --> 01:25:47.000
There, we saw the effect right from the start.

01:25:47.000 --> 01:25:49.000
Okay. Because the information was…

01:25:49.000 --> 01:25:56.000
provided by the vision encoder itself. So the input of the language model backbone already has the information. I see. But here,

01:25:56.000 --> 01:25:58.000
That information is not present.

01:25:58.000 --> 01:26:04.000
So, we only see the effect in the middle-ish layers of the language model.

01:26:04.000 --> 01:26:06.000
This is much more similar to…

01:26:06.000 --> 01:26:10.000
the first example with the apple is in Box A.

01:26:10.000 --> 01:26:12.000
This is the mechanism that drives that.

01:26:12.000 --> 01:26:18.000
Right. Yeah. Yeah.

01:26:18.000 --> 01:26:33.000
But is this a language model dealing with the tokens, the language tokens, or is this a language model dealing with the spatial token information? Like, you're saying that there's some, uh, transformation of that representation which also exists in the LM backbone?

01:26:33.000 --> 01:26:39.000
It's the second one, and I'm not sure about the transformation. Transformation occurs before the LLM back…

01:26:39.000 --> 01:26:42.000
backbone, which is done by the projector.

01:26:42.000 --> 01:26:49.000
So, it's just like some… it's retaining some information, but from the vision model, but in the LM space as well.

01:26:49.000 --> 01:26:56.000
the spatial information from the pictures, from the images are retained in the LM space as well.

01:26:56.000 --> 01:27:02.000
So that's the general claim, that the vision encoder is encoding the ordering information.

01:27:02.000 --> 01:27:09.000
But let's say if you get rid of that information coming from the vision encoder, the language model also

01:27:09.000 --> 01:27:15.000
Uh, generates its own ordering information.

01:27:15.000 --> 01:27:23.000
Got it, but the information generated by the language model is pertaining to the spatial information, right? This is not from the language tokens, this is, again, like, related to the…

01:27:23.000 --> 01:27:25.000
Yeah, yeah, exactly. We are operating on the visual token.

01:27:25.000 --> 01:27:32.000
spatial… spatial… okay, okay.

01:27:32.000 --> 01:27:38.000
So that's… that's the second main result, that even the language model does seem to create its own

01:27:38.000 --> 01:27:42.000
ordering information. And it's a little bit better left to right than it is up and down.

01:27:42.000 --> 01:27:44.000
Yes, that is right.

01:27:44.000 --> 01:27:52.000
What is that? So, is the assumption that the LLM does this?

01:27:52.000 --> 01:27:56.000
Because you are basically preventing the vision encoder

01:27:56.000 --> 01:28:00.000
from doing that? Or do you think that they both do it twice?

01:28:00.000 --> 01:28:05.000
Because I think you could validate this by doing the patching of the original LLM states.

01:28:05.000 --> 01:28:09.000
while you're also patching the vision encoder.

01:28:09.000 --> 01:28:17.000
The LLM states, while you're patching the vision encoder, to test whether the LLM is having a compensatory behavior, right?

01:28:17.000 --> 01:28:26.000
Or not. Say again? Yeah, I did not follow the experiment. Yeah, so kind of like, right now, you're patching the vision, right? So you know that in the vision, there's nothing going on in terms of positions.

01:28:26.000 --> 01:28:28.000
But now you're wondering, is…

01:28:28.000 --> 01:28:42.000
are these results on the LM end emerging because the LLM is compensating for the lack of positional information? Or were they also there before? Yes. So then you could try to get the original activations from the previous LLM run,

01:28:42.000 --> 01:28:45.000
and patch them over the current ones.

01:28:45.000 --> 01:28:48.000
while also patching the vision part, right?

01:28:48.000 --> 01:28:52.000
So using the previous LLM run where the vision was normal.

01:28:52.000 --> 01:28:58.000
Maybe it was not doing this because it's kind of like, oh, the vision part is already handling this, so I don't have to do it.

01:28:58.000 --> 01:29:07.000
Uh, now you would see that basically patching both would cause, like, the performance to go to zero, something like that, right?

01:29:07.000 --> 01:29:12.000
I'm still not 100% sure. If you patch in… yeah, I think we can talk about it. It's not very…

01:29:12.000 --> 01:29:19.000
Uh, simple thing to worry about. But I think that's a good question. Uh, here, what we are arguing is, it is…

01:29:19.000 --> 01:29:24.000
The second case, I think what you said, that both of them are happening simultaneously.

01:29:24.000 --> 01:29:31.000
Right. It's not the case, like, it's not just the case that the language model mechanism comes into the picture only when

01:29:31.000 --> 01:29:37.000
the… the information from the vision encoder is ablated. I think it is there

01:29:37.000 --> 01:29:45.000
in normal cases as well. But we have not done very thorough experiment for that. I think the argument that we are using is,

01:29:45.000 --> 01:29:49.000
You see, the result for the below and above here?

01:29:49.000 --> 01:29:51.000
Slightly growing. Yeah, that's true.

01:29:51.000 --> 01:29:58.000
And we say that that growing part is coming from the language model ordering ID. Makes sense.

01:29:58.000 --> 01:30:01.000
But maybe we can talk about that experiment later.

01:30:01.000 --> 01:30:03.000
Um, so yes.

01:30:03.000 --> 01:30:06.000
What? I didn't understand. Why did you have 60… not here, sorry, here.

01:30:06.000 --> 01:30:11.000
The next one? This one? Next one, next one. Next one. No, you had, like…

01:30:11.000 --> 01:30:17.000
It's actually not the intervention accuracy, so it probably would be…

01:30:17.000 --> 01:30:20.000
the probability to say red, or the probability to say blue.

01:30:20.000 --> 01:30:22.000
So it adds up to one?

01:30:22.000 --> 01:30:28.000
I don't see it adding… It may not add to one, I mean, it's a distribution, yeah.

01:30:28.000 --> 01:30:32.000
We did not do a softmax over just the color tokens. We just…

01:30:32.000 --> 01:30:35.000
take the probability of this particular color.

01:30:35.000 --> 01:30:37.000
It may not add to 1.

01:30:37.000 --> 01:30:42.000
But, so this is just the most simple thing.

01:30:42.000 --> 01:30:48.000
you run the model, you have the vocabulary distribution, all the logits; pick the logit of red, pick the logit of blue.

01:30:48.000 --> 01:30:50.000
But there are also another 50,000 tokens.

01:30:50.000 --> 01:30:53.000
they would have some logits, right?

01:30:53.000 --> 01:31:00.000
With some value. Okay. So it might not add to 1. The probability of blue and the probability of red might… it should be close to 1,

01:31:00.000 --> 01:31:06.000
But it may not be exactly one. Why is it 0 to 1?

01:31:06.000 --> 01:31:12.000
like, the scale, why is it 0 to 1 if it's just logits? What's the exact operation?

01:31:12.000 --> 01:31:18.000
You do the softmax first, then you read off the probabilities corresponding to the literal token red and the literal token blue, right? Yeah. Okay.

01:31:18.000 --> 01:31:20.000
Yeah, that's it.
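
NOTE
A minimal sketch of the metric as just described: softmax over the full vocabulary, then read off the probabilities of the literal color tokens. The tokenizer call is standard Hugging Face usage, but treating "red"/"blue" as single tokens is an assumption made for illustration.
    import torch
    def color_probabilities(answer_logits, tokenizer, colors=("red", "blue")):
        # answer_logits: [vocab_size] logits at the answer position
        probs = torch.softmax(answer_logits, dim=-1)          # softmax over the whole vocabulary
        ids = {c: tokenizer.convert_tokens_to_ids(c) for c in colors}
        # The two values need not sum to 1: the rest of the mass sits on the other ~50,000 tokens.
        return {c: probs[i].item() for c, i in ids.items()}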

01:31:20.000 --> 01:31:22.000
Yeah, the y-axis label was incorrect.

01:31:22.000 --> 01:31:25.000
But I think the result will still hold.

01:31:25.000 --> 01:31:28.000
Yeah. Same twin?

01:31:28.000 --> 01:31:34.000
We want to quickly show that it's… Okay, so the main… yeah, so I think that's the… one of the main arguments, that

01:31:34.000 --> 01:31:45.000
Both the vision encoder and the language model are creating or forming this ordering information, which is being used to do spatial reasoning tasks in VLMs.

01:31:45.000 --> 01:31:49.000
And then we use this insight to improve the performance of

01:31:49.000 --> 01:31:57.000
one of the models on a benchmark called What's Up, which is… okay, I think. Yeah, we were able to improve the performance

01:31:57.000 --> 01:32:03.000
significantly, better than just using a random baseline, which is just picking random directions.

01:32:03.000 --> 01:32:05.000
I mean…

01:32:05.000 --> 01:32:08.000
Yeah, it depends how you call your significance. 15 to 50?

01:32:08.000 --> 01:32:10.000
Is that the jump you did?

01:32:10.000 --> 01:32:19.000
So, I'd have to read this number, I think. The original model performance was 90%, and then we got it to 95.

01:32:19.000 --> 01:32:20.000
Yeah, but this is unsupervised. We don't need any supervision, you can do it on any model, in any image.

01:32:20.000 --> 01:32:25.000
Okay, okay.

01:32:25.000 --> 01:32:27.000
All you need to do is just…

01:32:27.000 --> 01:32:29.000
train these probes.

01:32:29.000 --> 01:32:38.000
Which is sort of… should figure out whether it's the first object, second object, or third object in the image, and that's it. And you can use those

01:32:38.000 --> 01:32:40.000
probes as a steering vector.
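
NOTE
A rough sketch of the probe-as-steering-vector idea described here, under the assumption that the probe is linear over visual-token activations. The training loop, dimensions, and the steering rule (subtract one ordering direction, add another) are illustrative, not the paper's exact recipe.
    import torch
    def train_order_probe(acts, order_labels, n_orders=3, lr=1e-2, steps=500):
        # acts: [N, d] visual-token activations; order_labels: [N] with 0/1/2 = first/second/third object
        W = torch.zeros(n_orders, acts.shape[1], requires_grad=True)
        opt = torch.optim.Adam([W], lr=lr)
        for _ in range(steps):
            loss = torch.nn.functional.cross_entropy(acts @ W.T, order_labels)
            opt.zero_grad(); loss.backward(); opt.step()
        return W.detach()   # row k is a linear direction for "k-th object in the image"
    def steer(act, W, remove_order, add_order, alpha=1.0):
        # Nudge a token's activation away from one ordering ID and toward another.
        return act - alpha * W[remove_order] + alpha * W[add_order]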

01:32:40.000 --> 01:32:41.000
Oh, bless it.

01:32:41.000 --> 01:32:50.000
What's the difference… what's the difference between the accuracy and the percent of corrected failures? Because the percent of corrected failures seems, like, really huge, right? Improvements in both cases. I mean, much higher.

01:32:50.000 --> 01:32:55.000
Yeah, so… yeah, so this is 90%, so 10% are incorrect.

01:32:55.000 --> 01:33:00.000
Among those 10%, we are able to fix 50%.
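
NOTE
For concreteness: a 90% baseline accuracy means 10% of examples fail; correcting 50% of those failures gives 90% + 0.5 × 10% = 95%, which is the reported jump.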

01:33:00.000 --> 01:33:04.000
How many of the incorrect answers?

01:33:04.000 --> 01:33:12.000
We were able to… Flip, yeah, to correct.

01:33:12.000 --> 01:33:17.000
So the… these results are on the non-synthetic images, right? Yeah, on What's Up.

01:33:17.000 --> 01:33:20.000
And the probe was also trained on those? Yeah.

01:33:20.000 --> 01:33:22.000
Yeah. Yeah.

01:33:22.000 --> 01:33:26.000
Do you know how much, uh, Qwen 2.5 would do on that

01:33:26.000 --> 01:33:28.000
data set.

01:33:28.000 --> 01:33:31.000
Like, you did 8 billion, right? Or 7 billion.

01:33:31.000 --> 01:33:33.000
I think…

01:33:33.000 --> 01:33:36.000
Okay. Okay, so…

01:33:36.000 --> 01:33:39.000
One more piece of experiment which we have not put into the paper, but…

01:33:39.000 --> 01:33:45.000
I've been doing it in the past few weeks, is something that we… that has come into the discussion as well.

01:33:45.000 --> 01:33:52.000
Which is… okay, so in the language model space, previous work has shown that

01:33:52.000 --> 01:33:56.000
This ordering information is encoding the relative positional information.

01:33:56.000 --> 01:34:03.000
Something like this is the first object, this is the first box. So, we asked the same question,

01:34:03.000 --> 01:34:07.000
Uh, for the ordering information generated by the vision encoder.

01:34:07.000 --> 01:34:11.000
Now that we know that Vision Encoder also generates these kinds of information.

01:34:11.000 --> 01:34:17.000
So for that, we… we also did causal experiment with slightly different task.

01:34:17.000 --> 01:34:19.000
So, this is the task.

01:34:19.000 --> 01:34:24.000
two differences that you should notice. First, in the image,

01:34:24.000 --> 01:34:31.000
The squares are shifted a little bit. It's not equally spread.

01:34:31.000 --> 01:34:34.000
The squares are actually shifted towards…

01:34:34.000 --> 01:34:39.000
left-hand side in this particular case, and the question that we ask is, the leftmost

01:34:39.000 --> 01:34:43.000
The color of the leftmost square is, and the answer should give us…

01:34:43.000 --> 01:34:47.000
green. And we do patching experiments again.

01:34:47.000 --> 01:34:49.000
And uh… this is the…

01:34:49.000 --> 01:34:56.000
counterfactual sample, which you can think of it as, like, just a mirror image of your clean image.

01:34:56.000 --> 01:34:59.000
Um, one important point is,

01:34:59.000 --> 01:35:03.000
The exact XY coordinate of this blue square

01:35:03.000 --> 01:35:06.000
is the same between these two sample.

01:35:06.000 --> 01:35:08.000
Okay?

01:35:08.000 --> 01:35:13.000
Um… okay, and then we basically take the… we patch the strip from…

01:35:13.000 --> 01:35:21.000
this particular area, region, to this particular region, and this particular strip to this particular strip.
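
NOTE
A minimal sketch of the strip-patching step described here, assuming access to the vision-token activations as an [n_tokens, d] tensor and precomputed index lists for the strips; which strips map to which is illustrative.
    import torch
    def patch_strips(clean_acts, counterfactual_acts, strip_pairs):
        # strip_pairs: list of (dst_indices, src_indices) token-index lists of equal length;
        # activations from the counterfactual (mirrored) image are copied into the clean run.
        patched = clean_acts.clone()
        for dst, src in strip_pairs:
            patched[dst] = counterfactual_acts[src]
        return patched   # feed this through the projector / LLM and read off the answer color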

01:35:21.000 --> 01:35:24.000
Okay. Now, there are two hypotheses.

01:35:24.000 --> 01:35:26.000
Whether the model is…

01:35:26.000 --> 01:35:28.000
forming relative, or whether the model is

01:35:28.000 --> 01:35:31.000
forming absolute positional information.

01:35:31.000 --> 01:35:37.000
If it is forming relative position information, this is what would happen.

01:35:37.000 --> 01:35:41.000
Here, it will say that this is the first square.

01:35:41.000 --> 01:35:45.000
So, after patching, this blue square will become the first square.

01:35:45.000 --> 01:35:48.000
And this green square will become the third square.

01:35:48.000 --> 01:35:53.000
Okay. So, the answer to this question will change from green.

01:35:53.000 --> 01:35:56.000
to blue. If it is relative.

01:35:56.000 --> 01:36:00.000
Now, the rationale for absolute is something like this. Let's say…

01:36:00.000 --> 01:36:06.000
It is not encoding first, second, or third, it is encoding

01:36:06.000 --> 01:36:11.000
(3,1), (3,2), (3,3).

01:36:11.000 --> 01:36:15.000
And this would be (3,3), (3,4), (3,5).

01:36:15.000 --> 01:36:17.000
Okay, so after we do the patching…

01:36:17.000 --> 01:36:20.000
This green will become 3,5.

01:36:20.000 --> 01:36:22.000
The brown will remain 3,2.

01:36:22.000 --> 01:36:26.000
And uh… blue will remain 3,3.

01:36:26.000 --> 01:36:30.000
So we have (3,5), (3,2), (3,3).

01:36:30.000 --> 01:36:33.000
After the patching, if we are assuming absolute position.

01:36:33.000 --> 01:36:38.000
So then, the leftmost square should be (3,2), the brown one.

01:36:38.000 --> 01:36:40.000
That's why the expected

01:36:40.000 --> 01:36:48.000
color after the intervention. If the positional information is encoding the absolute one, it's the brown one.

01:36:48.000 --> 01:36:52.000
That's a little bit diff… does that make sense?

01:36:52.000 --> 01:36:54.000
Okay. So…

01:36:54.000 --> 01:36:58.000
This is the result. If you do the patching of…

01:36:58.000 --> 01:37:01.000
Just the vision embeddings.

01:37:01.000 --> 01:37:03.000
Vision token embeddings.

01:37:03.000 --> 01:37:05.000
Um, we see that.

01:37:05.000 --> 01:37:08.000
And this is probability.

01:37:08.000 --> 01:37:11.000
It is sort of like… yeah.

01:37:11.000 --> 01:37:13.000
like, distributed…

01:37:13.000 --> 01:37:17.000
equally between brown and blue.

01:37:17.000 --> 01:37:22.000
And that's why I say… and this… this is only for this…

01:37:22.000 --> 01:37:34.000
Clean images shifted towards left, but the same, like, similar results also holds when the squares in the clean images are actually shifted towards right, or top, or bottom.

01:37:34.000 --> 01:37:36.000
And that's why I think…

01:37:36.000 --> 01:37:39.000
what I think in the vision encoder, it's not that…

01:37:39.000 --> 01:37:48.000
clear. As clear as what we have seen in the language model space, where we have seen that the model clearly uses this relative position information.

01:37:48.000 --> 01:37:51.000
But in the vision space, I think it seems to be encoding

01:37:51.000 --> 01:37:55.000
both this kind of, like, absolute and relative…

01:37:55.000 --> 01:38:02.000
position. I was gonna ask: absolute, is it the column, or something else? When I say absolute, I mean coordinate, like, three, like…

01:38:02.000 --> 01:38:05.000
X comma Y. And when I say relative, I mean

01:38:05.000 --> 01:38:12.000
a bit more abstract. It's like, first object… sorry, first square, or the second square, or the third square.

01:38:12.000 --> 01:38:14.000
Can you repeat what you mean by absolute?

01:38:14.000 --> 01:38:19.000
Just the coordinates, X comma Y.

01:38:19.000 --> 01:38:20.000
But you're also patching the white space.

01:38:20.000 --> 01:38:21.000
Okay.

01:38:21.000 --> 01:38:24.000
from much earlier, right?

01:38:24.000 --> 01:38:27.000
We are patching the white space.

01:38:27.000 --> 01:38:31.000
On the right side. We are also patching the first column tokens.

01:38:31.000 --> 01:38:36.000
Yeah, they might have a different…

01:38:36.000 --> 01:38:38.000
they might have a…

01:38:38.000 --> 01:38:44.000
So I think we went with the assumption that the strip has… The strip will always encode everything about that square.

01:38:44.000 --> 01:38:46.000
Yeah.

01:38:46.000 --> 01:38:53.000
It made sense when you're doing boxes, the exact… because they're all equidistant.

01:38:53.000 --> 01:39:00.000
Maybe you should also do probes here and play with it. We saw that it's always distributing them.

01:39:00.000 --> 01:39:05.000
According to the position. So even if you shift everything.

01:39:05.000 --> 01:39:06.000
There's peopleshooting.

01:39:06.000 --> 01:39:08.000
Or if you have, like, now, uh…

01:39:08.000 --> 01:39:14.000
3 and 1, you would have, like, uh, kind of little ones, uh, a rectangular grid, right?

01:39:14.000 --> 01:39:19.000
Well, I'm very… I think that… so I don't know if it'll work, but…

01:39:19.000 --> 01:39:25.000
I think that there are, like, these naive things that you could do to scramble it even further, to make it totally definitive.

01:39:25.000 --> 01:39:27.000
You could say things like,

01:39:27.000 --> 01:39:30.000
We're gonna just do a…

01:39:30.000 --> 01:39:34.000
Fisher-Yates shuffle of all the patches.

01:39:34.000 --> 01:39:36.000
Right, so that, you know, they're white anyway.

01:39:36.000 --> 01:39:40.000
You say, oh, I'll patch my… I don't know, that white has a lot of positional stuff in it.

01:39:40.000 --> 01:39:43.000
What about? We'll just shuffle all the patches.

01:39:43.000 --> 01:39:45.000
They don't have any information anyway.

01:39:45.000 --> 01:39:48.000
If they have any, we're just gonna scramble it. It's gone.

01:39:48.000 --> 01:39:52.000
Right? And then… and then just put in exactly that.

01:39:52.000 --> 01:39:55.000
this featureless red thing, this featureless blue thing.

01:39:55.000 --> 01:39:59.000
on a scrambled, you know, field of…

01:39:59.000 --> 01:40:01.000
shuffled white.

01:40:01.000 --> 01:40:07.000
And then if your model can still tell what's what, then I'm, like, then that, to me, that's more convincing.
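
NOTE
A quick sketch of the suggested control: Fisher-Yates-shuffle only the background (white) patches so any positional signal they carry is destroyed, then re-run the task on the scrambled field. Treating patches as a flat list and knowing which indices are background are assumptions for illustration.
    import random
    def shuffle_background_patches(patches, background_idx, seed=0):
        # patches: list of image patches (or patch embeddings); background_idx: indices of white patches
        rng = random.Random(seed)
        perm = list(background_idx)
        for i in range(len(perm) - 1, 0, -1):   # Fisher-Yates shuffle of the background positions
            j = rng.randint(0, i)
            perm[i], perm[j] = perm[j], perm[i]
        shuffled = list(patches)
        for dst, src in zip(background_idx, perm):
            shuffled[dst] = patches[src]
        return shuffled   # colored squares stay put; background patches are scrambled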

01:40:07.000 --> 01:40:12.000
And I think if you do that, maybe your brown accuracy would go up.

01:40:12.000 --> 01:40:14.000
wait, so you're saying that…

01:40:14.000 --> 01:40:22.000
We… what we see here is a lot of the effect of, actually, the white background encoding. It could be, it could be things like that. Right, yeah, it could be.

01:40:22.000 --> 01:40:26.000
And so you could… but you could… you could get rid of that, too. I believe… I believe.

01:40:26.000 --> 01:40:28.000
You really want to ask me something?

01:40:28.000 --> 01:40:33.000
Wait… So you think that… I'm okay.

01:40:33.000 --> 01:40:36.000
that, if it were more absolute than relative,

01:40:36.000 --> 01:40:41.000
then I think you wouldn't get, like, this confusion between brown and blue. I think one of them would stand out.

01:40:41.000 --> 01:40:46.000
I think it was a… I think, actually, the hypothesis is it would be

01:40:46.000 --> 01:40:49.000
more blue, more relative, because…

01:40:49.000 --> 01:40:55.000
with the generalization of the probe, we saw that the white background is basically encoding it too, so I believe.

01:40:55.000 --> 01:41:00.000
Like the flowers and buildings, I thought.

01:41:00.000 --> 01:41:03.000
That's very interesting.

01:41:03.000 --> 01:41:06.000
It's very interesting work, and it says, you know, I think that

01:41:06.000 --> 01:41:08.000
you're getting a little preview.

01:41:08.000 --> 01:41:11.000
of your review process. Is it going through review right now?

01:41:11.000 --> 01:41:14.000
Yes, we got really nice review for the workshop.

01:41:14.000 --> 01:41:22.000
I mean, that was a workshop, yeah. But it's good, I think it's… I think it's really interesting work. I think that a lot of your…

01:41:22.000 --> 01:41:25.000
sort of what you, like, sort of your appendix?

01:41:25.000 --> 01:41:28.000
Experiments are very strong. I think that

01:41:28.000 --> 01:41:31.000
you know, your stuff is very defensible.

01:41:31.000 --> 01:41:40.000
Uh, but you know, you'll have to gird yourself for review; depending on your reviewers, you might have… you might have to present a lot of extra data to them.

01:41:40.000 --> 01:41:42.000
And also, we need to give a credit to Kelly.

01:41:42.000 --> 01:41:46.000
really the master's student, right? Yeah, yeah, yeah, yeah.

01:41:46.000 --> 01:41:49.000
So, yeah.

01:41:49.000 --> 01:41:55.000
Yeah. By the way, you make a claim, but then you nail the point later.

01:41:55.000 --> 01:41:57.000
Okay. You should keep that for…

01:41:57.000 --> 01:42:02.000
like, is it this? And then you nail it with… You nail it, but yeah, yeah, give me that feedback.

01:42:02.000 --> 01:42:04.000
You tend to, like, take… you sort of…

01:42:04.000 --> 01:42:06.000
show things in the front.

01:42:06.000 --> 01:42:09.000
Instead of impressing people at the end.

01:42:09.000 --> 01:42:15.000
You can… you can impress them later. Like, for instance, you showed the probe and you said it's diffused across all tokens. Uh-huh.

01:42:15.000 --> 01:42:18.000
like, no mech interp researcher would trust

01:42:18.000 --> 01:42:27.000
a probe, but you know the answer, like, you did causal experiments. You should show the causal experiment and say, wow, this is so surprising. Then I'll be like, yeah, it is so surprising.

01:42:27.000 --> 01:42:29.000
So you should say, yeah, when you show the probe.

01:42:29.000 --> 01:42:34.000
Say, well, I don't believe this. Yeah, you should do that.

01:42:34.000 --> 01:42:44.000
Yeah, so… This… yeah, okay, but… I think I wouldn't even, like, if you need to present, I wouldn't even show the probing results. Maybe just later for the steering.

01:42:44.000 --> 01:42:49.000
But then how do you… how do you motivate the hypothesis, then?

01:42:49.000 --> 01:42:55.000
Actually, for me, for me, the thing I learned from this project is

01:42:55.000 --> 01:42:57.000
Probes are actually not that bad.

01:42:57.000 --> 01:43:03.000
They're not causal evidence, but they can do…

01:43:03.000 --> 01:43:05.000
They can help you hypothesize.

01:43:05.000 --> 01:43:11.000
I agree, but when the kid suggested doing probes, I was like, no, we are not doing probes.

01:43:11.000 --> 01:43:25.000
And then he kept saying that for a couple of weeks, and then I was like, okay, do probes, like, whatever, okay. And then we were able to actually construct really nice causal experiments because of… Because of the probe, I see. But why did you even probe the tokens that are not at that position?

01:43:25.000 --> 01:43:30.000
Because when… Fuck.

01:43:30.000 --> 01:43:32.000
No, it wouldn't… it wouldn't actually work, because…

01:43:32.000 --> 01:43:43.000
Things were not working. When we… when we did not include those additional tokens, which is what generally you would do. You would only work with the square tokens. You could go through the… you can tell the same story,

01:43:43.000 --> 01:44:00.000
with causality. I'm not saying change your story, just that I think it would be, for me, simpler if you start… So I think that depends on the format as well. If I was giving, like, a formal talk, then I would have definitely said that. But then, as soon as I started showing results, people started asking questions. So…

01:44:00.000 --> 01:44:06.000
claimed it. You claimed it, and you were like, oh, I don't know, this could be that, this could be this, and then you showed the causal result, and I like conviction.

01:44:06.000 --> 01:44:11.000
The, like, when you're starting out, when you run…

01:44:11.000 --> 01:44:16.000
presented your project, Simon, you had a couple results showing, like, a preliminary, like,

01:44:16.000 --> 01:44:29.000
Is this… is there even something there for me to search for? Which is kind of what probes give you. It, like, gives you, like, okay, there's something there that I can actually look at. Like, I'm not totally opposed to, like, showing those results of, like, okay, there's actually something going on, and then…

01:44:29.000 --> 01:44:38.000
You can also say, like, when you show the probes, you can say, don't worry, I have causal experiments later on, this is just, like… Okay, okay. Or even if it's just to make the point,

01:44:38.000 --> 01:44:41.000
Probes can help us think

01:44:41.000 --> 01:44:46.000
About what kind of causal experiments you want to make. If this is a point you want to promote, then…

01:44:46.000 --> 01:44:54.000
Yeah. I think we could… I think that that is one of the things that I've learned in the project, but I don't think we are doing that right now, but yeah, I think we could do that.

01:44:54.000 --> 01:44:57.000
By the way, I like your first slides, Brandon.

01:44:57.000 --> 01:45:04.000
You're fantastic. I know. I wanted some advice to me. Thank you very much. Okay, cool. Thank you.

01:45:04.000 --> 01:45:19.000
It's a very sweet

