WEBVTT

00:01:46.000 --> 00:01:52.000
and then some of the most recent, um, let's say, interesting stuff that is going on in this

00:01:52.000 --> 00:01:54.000
In this area of interpretability.

00:01:54.000 --> 00:01:59.000
Um, yeah. So, the first… the first thing is that…

00:01:59.000 --> 00:02:04.000
Uh, the way that we can conceptualize, um, attribution is…

00:02:04.000 --> 00:02:10.000
Um, by looking at what the models are using to do predictions, right?

00:02:10.000 --> 00:02:17.000
So, in… here I argue that there are basically two pathways to prediction. The first one is

00:02:17.000 --> 00:02:22.000
the inputs, so the models receive, as you know, in-context information,

00:02:22.000 --> 00:02:24.000
And, um…

00:02:24.000 --> 00:02:30.000
these methods that we use to trace importance back to the inputs are what we normally refer to

00:02:30.000 --> 00:02:33.000
input attribution, or feature… feature attribution.

00:02:33.000 --> 00:02:39.000
Um, so this is what we also call in-context learning, right, in language models.

00:02:39.000 --> 00:02:43.000
Uh, and the other dimension that is complementary is

00:02:43.000 --> 00:02:48.000
whatever the models learned from training, right? So in this case, we have the learned weights,

00:02:48.000 --> 00:02:58.000
Uh, and, uh, we can still do some attribution. It's a bit related to causal mediation that you saw from the previous lecture, so you're gonna see

00:02:58.000 --> 00:03:07.000
Uh, here, too, we have a way to do, uh, basically component attribution, so understanding which components are responsible for a specific prediction.

00:03:07.000 --> 00:03:12.000
And the plus one is, uh, there is a second-order effect, of course, so…

00:03:12.000 --> 00:03:19.000
whatever learned weights the model has are, uh, are derived from training data, right? So, uh…

00:03:19.000 --> 00:03:27.000
a big mission in, uh, in attribution would also be to ideally trace back whatever importance

00:03:27.000 --> 00:03:35.000
Uh, from, uh, like, from the prediction back to the training data, right? So there are some methods that are responsible for that.

00:03:35.000 --> 00:03:45.000
I'm not really covering that part today, uh, but just for your knowledge, these methods are kind of very similar to the others that we're going to discuss, and these are called

00:03:45.000 --> 00:03:49.000
training data attribution, or simply data attribution.

00:03:49.000 --> 00:03:58.000
So that, overall, attribution in interpretability is asking which elements motivate model predictions, right? Uh, that's the… that's a big question.

00:03:58.000 --> 00:04:08.000
So, this is kind of a formalized way to think about input attribution, so if you have a trained model and an input that the model receives,

00:04:08.000 --> 00:04:13.000
Uh, the input attribution method is just a map that, given an input,

00:04:13.000 --> 00:04:19.000
In this case of dimension D, um, it will produce a set of scores alpha_1 to alpha_D,

00:04:19.000 --> 00:04:22.000
that are telling you how much, how relevant is

00:04:22.000 --> 00:04:28.000
the i-th dimension of this input, uh, for the prediction.

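NOTE
A compact way to write down the map just described (the notation is added here, not from the slides): an input attribution method for a model $f$ is a function
$a_f : \mathbb{R}^D \to \mathbb{R}^D, \quad x \mapsto (\alpha_1, \ldots, \alpha_D)$,
where each score $\alpha_i$ quantifies the relevance of the $i$-th input dimension to the prediction $f(x)$.
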
00:04:28.000 --> 00:04:33.000
So, what counts as relevance here is quite vague.

00:04:33.000 --> 00:04:36.000
And it's left vague on purpose, in a sense, because…

00:04:36.000 --> 00:04:40.000
Uh, it's very hard to formalize what it means for something to be

00:04:40.000 --> 00:04:47.000
salient or important towards prediction, right? So we're gonna maybe discuss this a bit more in the next few slides.

00:04:47.000 --> 00:04:51.000
But just for you to know, like, saliency is a bit of a fuzzy concept.


00:04:53.000 --> 00:04:58.000
Um, so to quantify importance, um, if we had

00:04:58.000 --> 00:05:03.000
simple linear models, right? Like, linear regression models. This would be a very

00:05:03.000 --> 00:05:08.000
trivial thing to do. Like, we could just look at the coefficients that are learned,

00:05:08.000 --> 00:05:17.000
Um, so whatever weights here would be, uh, multiplied with the matrix of the inputs would be our importance scores, right?

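NOTE
A minimal sketch of the linear case just described, with illustrative data and names (not from the lecture): the learned coefficients act directly as importance scores, and coefficient times feature value gives a per-example contribution.
import numpy as np
from sklearn.linear_model import LinearRegression
X = np.random.randn(100, 4)                         # 100 samples, 4 features
y = X @ np.array([2.0, 0.0, -1.0, 0.5]) + 0.1 * np.random.randn(100)
reg = LinearRegression().fit(X, y)
print("global importance:", reg.coef_)              # the learned weights themselves
print("local contributions:", reg.coef_ * X[0])     # w_i * x_i for a single example
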
00:05:17.000 --> 00:05:22.000
However, for deeper models, this is not straightforward, and the reason for that is that

00:05:22.000 --> 00:05:26.000
the presence of nonlinearities, right? So nonlinearities mess up

00:05:26.000 --> 00:05:36.000
these kinds of contributions of different inputs, and especially when they are chained, like, in a deep neural network, the whole influence

00:05:36.000 --> 00:05:38.000
becomes very messy.

00:05:38.000 --> 00:05:45.000
So, generally, the way that we go about estimating this importance is through some approximations.

00:05:45.000 --> 00:05:50.000
For example, approximating some specific operations linearly.

00:05:50.000 --> 00:05:54.000
Uh, or perturbations, so trying to kind of, like,

00:05:54.000 --> 00:06:02.000
estimate how important something is by just ablating it or modifying it slightly.

00:06:02.000 --> 00:06:08.000
So, let's have the most… let's have a look at the most basic setup for

00:06:08.000 --> 00:06:10.000
Um, for attribution,

00:06:10.000 --> 00:06:17.000
Which is occlusion. Uh, so, you know, in the case of occlusion, let's say that we have our language model here,

00:06:17.000 --> 00:06:20.000
let's say this is, like, our GPT or our Llama,

00:06:20.000 --> 00:06:26.000
And we have, um, an input that is a simple string, right?

00:06:26.000 --> 00:06:28.000
For example, "Welcome back ladies and".

00:06:28.000 --> 00:06:34.000
Uh, that, as you know, gets tokenized and embedded before being fed through the model.

00:06:34.000 --> 00:06:42.000
So, you see this is our input, uh, of dimension D, uh, which is the dimension of the embedding, times S,

00:06:42.000 --> 00:06:45.000
the dimension of the sequence.

00:06:45.000 --> 00:06:50.000
And the output of the model is this distribution of probabilities over the vocabulary.

00:06:50.000 --> 00:06:55.000
So in this case, we find that the model is predicting gentlemen as the most likely next token.

00:06:55.000 --> 00:06:59.000
So the occlusion case is very simple:

00:06:59.000 --> 00:07:02.000
Um, let's ablate a single token.

00:07:02.000 --> 00:07:05.000
Uh, for example, ladies.

00:07:05.000 --> 00:07:09.000
Uh, so here we're gonna have a different embedding for the ablated token.

00:07:09.000 --> 00:07:15.000
And let's get an output for the perturbed input, right?

00:07:15.000 --> 00:07:17.000
So in this case, the probabilities will change.

00:07:17.000 --> 00:07:24.000
Um, and an idea would be, okay, uh, the way that we associate an importance to the token ladies

00:07:24.000 --> 00:07:28.000
is by looking at the top prediction,

00:07:28.000 --> 00:07:38.000
And looking at how big of the… of a drop this is, right? In this case, the drop is huge, so it means that ladies is probably very important towards that.

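NOTE
A minimal occlusion sketch of the example above, assuming a GPT-2 model from Hugging Face; what to put in place of the occluded token is exactly the problem raised in the discussion that follows (here the token is simply dropped).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
def next_token_prob(text, target):
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]
    return torch.softmax(logits, dim=-1)[tok.encode(target)[0]].item()  # first sub-token of target
p_full = next_token_prob("Welcome back ladies and", " gentlemen")
p_occl = next_token_prob("Welcome back and", " gentlemen")  # "ladies" dropped; a replacement embedding is another option
print("importance of 'ladies' ~ probability drop:", p_full - p_occl)
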
00:07:38.000 --> 00:07:42.000
Um, probably you can see already that this is

00:07:42.000 --> 00:07:47.000
overall a bit problematic as a procedure. I'm curious if someone already has a…

00:07:47.000 --> 00:07:58.000
a hint of, like, what could be the problems in doing that, or…

00:07:58.000 --> 00:07:59.000
Yep.

00:07:59.000 --> 00:08:03.000
So, could I ask, because I'm not the expert here, is that right? So when you… when you occlude it, what do you… what do you put there?

00:08:03.000 --> 00:08:06.000
Yeah, that's one of the problems, exactly.

00:08:06.000 --> 00:08:13.000
So, there is no good answer to that, actually. So, this I mentioned in the next slide.

00:08:13.000 --> 00:08:18.000
But one of the issues with perturbation, and in general with this kind of, like,

00:08:18.000 --> 00:08:27.000
occlusion, um, which is related to one of the questions that we received, also for integrated gradient, what would be a good baseline for language, right?

00:08:27.000 --> 00:08:30.000
Um, the problem is that, um,

00:08:30.000 --> 00:08:36.000
perturbations, uh, let me go to… yeah, can produce OOD behaviors.

00:08:36.000 --> 00:08:43.000
And this would result in unfaithful explanations in the case in which you are replacing that with something that is not

00:08:43.000 --> 00:08:47.000
uh, realistic, right? So, for example, if we have

00:08:47.000 --> 00:08:52.000
um… let's say for encoder language models like BERT,

00:08:52.000 --> 00:08:55.000
we have this kind of mask tokens, right?

00:08:55.000 --> 00:09:01.000
Which are kinda interesting in this setting, because we could just replace things by masks.

00:09:01.000 --> 00:09:05.000
And the model is trained with masks, right? So it's able to handle that.

00:09:05.000 --> 00:09:11.000
But for GPTs, we don't have such things, right? The model is just predicting left to right, so it doesn't need masking.

00:09:11.000 --> 00:09:15.000
Um, yeah, so decoder LLMs don't, right?

00:09:15.000 --> 00:09:20.000
And one common approach is to sample

00:09:20.000 --> 00:09:24.000
replacements randomly and aggregate over multiple replacements, right?

00:09:24.000 --> 00:09:31.000
But you can imagine that this scales very poorly, right? So if you… if I have to do 100 replacements with random tokens,

00:09:31.000 --> 00:09:40.000
And just measure how big the impact is on average, this becomes very expensive. And I just attributed a single token in this case, right?

00:09:40.000 --> 00:09:44.000
Uh, so imagine if I was to attribute the full sentence then.

00:09:44.000 --> 00:09:46.000
Right? So, this is very expensive.

00:09:46.000 --> 00:09:51.000
And it scales poorly to long inputs.

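NOTE
A sketch of the random-replacement averaging just described (the sample count is illustrative); note the cost spelled out above: n_samples forward passes for a single occluded token.
import random
import torch
def avg_occlusion_drop(model, ids, pos, target_id, vocab_size, n_samples=100):
    with torch.no_grad():
        p_orig = torch.softmax(model(ids).logits[0, -1], -1)[target_id].item()
        drops = []
        for _ in range(n_samples):
            perturbed = ids.clone()
            perturbed[0, pos] = random.randrange(vocab_size)  # random replacement token
            p = torch.softmax(model(perturbed).logits[0, -1], -1)[target_id].item()
            drops.append(p_orig - p)
    return sum(drops) / n_samples  # average impact of occluding this one position
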
00:09:51.000 --> 00:09:54.000
Um… yeah.

00:09:54.000 --> 00:09:59.000
So, an alternative, a natural alternative in, in, um…

00:09:59.000 --> 00:10:04.000
in neural networks is to use gradients as some sort of attribution.

00:10:04.000 --> 00:10:14.000
Uh, so the models, as you all know, are trained with gradient descent, so this gradient information is naturally employed during training.

00:10:14.000 --> 00:10:20.000
Um, and we can repurpose that to try to get a sense of the importance of components

00:10:20.000 --> 00:10:22.000
Uh, at inference time.

00:10:22.000 --> 00:10:27.000
So let's say an example here, again, we have exactly our same setup as before.

00:10:27.000 --> 00:10:29.000
Uh, we get the prediction gentlemen,

00:10:29.000 --> 00:10:34.000
But then, the way that we would go about that is we pick a specific

00:10:34.000 --> 00:10:39.000
target, uh, a target function, let's say. In this case, the target could be

00:10:39.000 --> 00:10:44.000
the probability of the topmost likely token, like gentlemen.

00:10:44.000 --> 00:10:51.000
And we can take, uh, the gradient with respect to that, um, back to the input embeddings.

00:10:51.000 --> 00:10:53.000
of the model. So…

00:10:53.000 --> 00:11:03.000
Note that here, these gradients normally are taken with respect to a loss, right? So the gradient with respect to a loss tells you how do I minimize this loss.

00:11:03.000 --> 00:11:06.000
Uh, so which weights should I change?

00:11:06.000 --> 00:11:13.000
to minimize this loss. Uh, in this case, instead, the gradient with respect to probabilities is telling us

00:11:13.000 --> 00:11:20.000
in a sense, how sensitive is this final output of the model, so the final probability for gentlemen,

00:11:20.000 --> 00:11:29.000
to little perturbations of, uh, in this case, if we focus on input embeddings, of these embeddings, right?

00:11:29.000 --> 00:11:38.000
So the final outcome of this procedure is gradient vectors that have the same exact dimension as the input embeddings,

00:11:38.000 --> 00:11:44.000
And that basically express, for each dimension of the input embedding, how important that dimension is

00:11:44.000 --> 00:11:47.000
Towards the, uh, prediction of the…

00:11:47.000 --> 00:11:50.000
of the final, uh, output.

00:11:50.000 --> 00:11:54.000
So, normally, this is not very useful, right?

00:11:54.000 --> 00:12:00.000
one score per dimension is not very useful, so what we do is normally to aggregate those

00:12:00.000 --> 00:12:04.000
at the token level to get a single score per word.

00:12:04.000 --> 00:12:10.000
So in this case, we could find, for example, that the gradients for ladies and are very

00:12:10.000 --> 00:12:17.000
high, and if we aggregate these, for example, by taking the vector norm of each one of these gradient vectors,

00:12:17.000 --> 00:12:21.000
we would get a high attribution score for "ladies and".

00:12:21.000 --> 00:12:26.000
Which is intuitive, right? If we perturb "ladies and", the model probably wouldn't predict gentlemen.

00:12:26.000 --> 00:12:31.000
Um, so this relies on this kind of approximation.

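NOTE
A sketch of the gradient attribution loop just described, again assuming GPT-2: take the gradient of the target probability with respect to the input embeddings, then aggregate per token, here with the L2 norm (the sum, mentioned in the Q&A below, would keep the sign instead).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
ids = tok("Welcome back ladies and", return_tensors="pt").input_ids
embeds = model.get_input_embeddings()(ids).detach().requires_grad_(True)
logits = model(inputs_embeds=embeds).logits[0, -1]
torch.softmax(logits, -1)[tok.encode(" gentlemen")[0]].backward()  # target: p(" gentlemen")
grads = embeds.grad[0]                      # same shape as the input embeddings: (seq_len, dim)
scores = grads.norm(dim=-1)                 # aggregate per token with the L2 norm (unsigned)
for t, s in zip(tok.convert_ids_to_tokens(ids[0]), scores):
    print(f"{t!r}: {s.item():.4f}")
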
00:12:31.000 --> 00:12:39.000
Yeah. Is everything clear for the gradient attribution part?

00:12:39.000 --> 00:12:40.000
I think so.

00:12:40.000 --> 00:12:42.000
You guys are so… you guys are all camera off. Oh, I'm camera off, too. Look, I'm camera off, and so it's, like, impossible…

00:12:42.000 --> 00:12:45.000
Yeah, exactly. I'm kind of like, uh…

00:12:45.000 --> 00:12:48.000
Impossible for Gabriel to tell what's going on.

00:12:48.000 --> 00:12:49.000
I don't know.

00:12:49.000 --> 00:12:50.000
Yeah. So, um,

00:12:50.000 --> 00:12:51.000
Um,

00:12:51.000 --> 00:12:54.000
Yeah, so, uh, um…

00:12:54.000 --> 00:12:56.000
So you just take a… you just take the vector's

00:12:56.000 --> 00:13:03.000
uh… size. Is that, is that typically really what people do, or do they do other things other than…

00:13:03.000 --> 00:13:08.000
This is another thing where there is no consensus, uh, so there's plenty of

00:13:08.000 --> 00:13:13.000
of ways that people try to do this kind of aggregation. So, for example, the L2,

00:13:13.000 --> 00:13:15.000
norm is probably…

00:13:15.000 --> 00:13:20.000
the most natural way to do this, but other people took the sum of the gradients.

00:13:20.000 --> 00:13:22.000
or the L1 norm is…

00:13:22.000 --> 00:13:29.000
Um, like, these are all commonly used, and there's no, like, one-size-fits-all, kind of.

00:13:29.000 --> 00:13:30.000
Okay.

00:13:30.000 --> 00:13:32.000
Yeah.

00:13:32.000 --> 00:13:33.000
Wait, I had a…

00:13:33.000 --> 00:13:34.000
So, yeah. Sorry. Yep.

00:13:34.000 --> 00:13:45.000
So, what's the general sample size people use? I'm assuming they don't compute the gradient on one example and show this graph, right? So, what is the general sample size that people use?

00:13:45.000 --> 00:13:48.000
I mean, the idea here is that…

00:13:48.000 --> 00:13:51.000
the gradient attribution is really local, right? So, meaning,

00:13:51.000 --> 00:13:54.000
for a given example, you're gonna get your scores out.

00:13:54.000 --> 00:13:57.000
In this case, right? If you want to draw some

00:13:57.000 --> 00:14:04.000
Um, you know, hypotheses from the kind of information that you get out of this, definitely you want to have some

00:14:04.000 --> 00:14:16.000
data set and to… to be able to tag, you know, the kind of expected behaviors that you would like to see, and see whether the gradient attribution retrieves something that matches your intuition, right?

00:14:16.000 --> 00:14:20.000
So, I agree with you that on a single example, this doesn't tell you much.

00:14:20.000 --> 00:14:26.000
But that's actually one of the nice things about it. Isn't… or do you think that's true, right? That is one of the nice things about…

00:14:26.000 --> 00:14:28.000
Input attribution is that

00:14:28.000 --> 00:14:35.000
It is about, you know, you can analyze single examples. It's trying to get you an answer about single examples, right?

00:14:35.000 --> 00:14:45.000
Yeah, yeah, yeah, exactly. So, it's really a local thing, right? It doesn't, like, it doesn't give you any intuition about the behavior of the model globally.

00:14:45.000 --> 00:14:54.000
but rather on that specific example. If you want to get this global intuition about model behavior, you probably have to repeat the thing on a dataset, right, and see

00:14:54.000 --> 00:14:59.000
whether the trend that you observe on that example holds in general.

00:14:59.000 --> 00:15:00.000
Okay.

00:15:00.000 --> 00:15:01.000
Yep.

00:15:01.000 --> 00:15:02.000
Yeah, okay, thanks.

00:15:02.000 --> 00:15:10.000
So yeah, so this all relates to, uh, the fact that the devil is in the details for attribution methods,

00:15:10.000 --> 00:15:12.000
And, uh, we got some questions here, so what happens in this…

00:15:12.000 --> 00:15:15.000
Oh yeah, we'll have them ask the questions.

00:15:15.000 --> 00:15:17.000
Courtney had a question.

00:15:17.000 --> 00:15:24.000
Armita had a question. What were your questions?

00:15:24.000 --> 00:15:41.000
Um, so, um, we had this problem in our, uh, like, my research project, that we wanted to build adversarial input examples, and many of them are gradient-based.

00:15:41.000 --> 00:16:00.000
Um, and our input… um, our data was, like, discrete. So, um, changing it in a way that, like, increases the gradients sometimes wasn't, um, so meaningful, because we had

00:16:00.000 --> 00:16:12.000
few perturbation options. But in this context, I was wondering if…

00:16:12.000 --> 00:16:29.000
Yeah, so in a sense, as you saw in the example before, the way that we move from, like, the continuous space that the model operates on to the discrete space of, for example, the vocabulary of the model is by just aggregating, right, whatever we get in the continuous space at the token level.

00:16:29.000 --> 00:16:32.000
Um, I agree that in your case of, like,

00:16:32.000 --> 00:16:39.000
trying to optimize these embeddings, maybe, like, in a way that they still map onto words, probably you would have to…

00:16:39.000 --> 00:16:45.000
do, like, some sort of projection-to-the-nearest-neighbor procedure, right, and that can be quite noisy.

00:16:45.000 --> 00:16:46.000
Uh, in the case of this.

00:16:46.000 --> 00:16:56.000
Is that what you… is that what you did in your experiment, Armita, or did you do a nearest neighbor thing?

00:16:56.000 --> 00:16:57.000
Oh, yeah.

00:16:57.000 --> 00:16:58.000
Oh, you, you moved to do something different, okay, cool.

00:16:58.000 --> 00:16:59.000
No, no, we didn't do gradient-based.

00:16:59.000 --> 00:17:00.000
Yeah.

00:17:00.000 --> 00:17:05.000
Uh-huh, uh-huh. Is Courtney here? Does Courtney have a question?

00:17:05.000 --> 00:17:22.000
Yeah, hi, my question is about, um, whether or not you can tell when, like, a specific token, or in the MIRAGE example, a specific document, is actually negatively contributing to a prediction, or, like, taking away from… moving away from the correct, um, answer, or…

00:17:22.000 --> 00:17:31.000
Yeah, so in the gradient case, um, definitely, you can do that. Also, in the occlusion case, so overall, let's say in the occlusion case,

00:17:31.000 --> 00:17:43.000
you would have the simple, uh, the simple setup where, um, this estimated importance could either increase or decrease the probability, so you could see this as a positive or negative contribution.

00:17:43.000 --> 00:17:46.000
Um, while in the gradient case,

00:17:46.000 --> 00:17:52.000
gradients will be signed in these vectors, so depending on the kind of attribution that you do,

00:17:52.000 --> 00:17:58.000
And now I'm kind of getting a bit ahead on what I'm saying here. Depending on the aggregation,

00:17:58.000 --> 00:18:03.000
this information might get lost. For example, if you take just the L2 norm of the vector,

00:18:03.000 --> 00:18:14.000
then you're effectively just getting a positive value that is overall how much it contributes, abstracting away positive and negative contribution.

00:18:14.000 --> 00:18:25.000
But if you sum, for example, you would get something that is… that can also be negative, right? If all the dimensions are kind of negative.

00:18:25.000 --> 00:18:29.000
So, are there other ways of aggregating that you wouldn't lose it?

00:18:29.000 --> 00:18:34.000
Um… I would say probably the sum is the most common here, um…

00:18:34.000 --> 00:18:35.000
Uh-huh.

00:18:35.000 --> 00:18:44.000
you could also get, like, something like, you know, just this kind of heuristic, I guess, like, the maximal dimension within the embedding, if it's a negative dimension, then you would say that it's…

00:18:44.000 --> 00:18:47.000
a negative, uh, contribution.

00:18:47.000 --> 00:18:48.000
Um, there has been…

00:18:48.000 --> 00:18:49.000
Although there was something… there was something we saw in the, um…

00:18:49.000 --> 00:18:52.000
Yeah, so…

00:18:52.000 --> 00:18:54.000
In Been Kim's TCAV paper,

00:18:54.000 --> 00:18:57.000
Where, once you had a gradient, she would dot product it,

00:18:57.000 --> 00:19:03.000
with, uh, you know, some particular vector of interest, right? You might dot product it with…

00:19:03.000 --> 00:19:04.000
Yeah.

00:19:04.000 --> 00:19:07.000
a class or a dot product it with, like, a token or something like that.

00:19:07.000 --> 00:19:08.000
That's true, that's true.

00:19:08.000 --> 00:19:10.000
Do people ever do that, or…?

00:19:10.000 --> 00:19:20.000
Um, maybe… so I don't have a slide for that, but I think it's interesting to relate it to the dot product, right? It's interesting to think of when we talk about gradient attribution,

00:19:20.000 --> 00:19:33.000
to think of the gradient vectors per se versus the gradient times whatever input they would be applied to, right? Uh, so these are two common attribution methods, like, just the raw gradient, and gradient times input.

00:19:33.000 --> 00:19:42.000
And, like, the gradient per se tells you the sensitivity, kind of, of the inputs to the… like, of the prediction to the inputs.

00:19:42.000 --> 00:19:51.000
But the moment you multiply them, you actually get some sort of scaling by actually considering what these gradients will be applied to, right?

00:19:51.000 --> 00:19:56.000
You might have very high gradients just because the dimension that they would be applied to is very small, right?

00:19:56.000 --> 00:20:08.000
Um, so yeah. So, definitely, this relates to what you're saying, David. I think, uh, depending on what you're looking for, it might make sense to consider gradients in relation to their inputs.

00:20:08.000 --> 00:20:14.000
Uh, and not just by themselves, right?

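NOTE
A sketch of the raw-gradient vs. gradient-times-input distinction just discussed, reusing the `grads` and `embeds` tensors from the earlier gradient sketch (so this assumes that setup).
raw_sensitivity = grads.norm(dim=-1)                     # sensitivity alone, ignores the input values
grad_x_input = (grads * embeds[0].detach()).sum(dim=-1)  # scaled by what the gradient applies to; keeps a sign
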
00:20:14.000 --> 00:20:15.000
Yeah.

00:20:15.000 --> 00:20:17.000
Yeah, I think so, and I think that might give you a sign, also, so you can have a positive one or a negative one in that case.

00:20:17.000 --> 00:20:18.000
Yeah, exactly, yeah.

00:20:18.000 --> 00:20:20.000
I don't know, I think that's pretty interesting.

00:20:20.000 --> 00:20:21.000
Good question from Courtney. Okay, let's keep on going. I don't want to slow you down too much.

00:20:21.000 --> 00:20:24.000
Yeah.

00:20:24.000 --> 00:20:37.000
No worries, no worries. Yeah, so I just want to highlight here that, like, plenty of people work on this, and there are no established norms, basically. So every new paper that does attribution does it in a different way.

00:20:37.000 --> 00:20:47.000
Um, and there's no consensus of, like, oh, this always works the best, uh, you know? So I think there's a lot to be… to be explored still in this area.

00:20:47.000 --> 00:20:58.000
So, one of the readings that you had was integrated gradients, and I took this image, which was quite nice, which was from this blog post that says,

00:20:58.000 --> 00:21:03.000
integrated gradients is a decent attribution method, which I found funny.

00:21:03.000 --> 00:21:05.000
Um, and uh…

00:21:05.000 --> 00:21:10.000
Yeah, the intuition is what you see here in the image is you have a starting point here,

00:21:10.000 --> 00:21:14.000
Um, you have an endpoint, which is gonna be your baseline,

00:21:14.000 --> 00:21:22.000
And effectively, you're taking steps, right? Like, ideally, this would be an integral, but in practice, you're probably approximating this by taking steps along this

00:21:22.000 --> 00:21:25.000
this straight line, um…

00:21:25.000 --> 00:21:27.000
So this is the activation space, and then…

00:21:27.000 --> 00:21:38.000
Uh, here you would have two input features, uh, in reality, we have many more, and what you're doing is just adding up the contributions of each one of the two features, right?

00:21:38.000 --> 00:21:43.000
So this is a nice way of visualizing what's happening here.

00:21:43.000 --> 00:21:45.000
So,

00:21:45.000 --> 00:21:53.000
Integrated gradients is quite robust, uh, because of this property of, like, um, you know, considering contributions along this path.

00:21:53.000 --> 00:22:00.000
Um, in practice, though, um, to get a good approximation, probably, it's quite expensive.

00:22:00.000 --> 00:22:05.000
Uh, to run, meaning that you will need many approximation steps along this line.

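NOTE
A minimal sketch of the path integral approximated with discrete steps along the straight line, as in the figure; the zero baseline and the step count are assumptions, and the baseline choice is exactly the open problem discussed next.
import torch
def integrated_gradients(forward_fn, x, baseline=None, n_steps=50):
    # forward_fn maps an embedding tensor to a scalar target, e.g. a token probability
    baseline = torch.zeros_like(x) if baseline is None else baseline
    total = torch.zeros_like(x)
    for k in range(1, n_steps + 1):
        point = (baseline + (k / n_steps) * (x - baseline)).detach().requires_grad_(True)
        forward_fn(point).backward()  # gradient at this step along the straight path
        total += point.grad
    return (x - baseline) * total / n_steps  # average gradient, scaled by the displacement
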
00:22:05.000 --> 00:22:12.000
And as many of you noted, the lack of a good baseline is often a problem, actually.

00:22:12.000 --> 00:22:20.000
So, especially in NLP, people have tried the zero vector, which, again, doesn't mean anything in the NLP world.

00:22:20.000 --> 00:22:26.000
Uh, and they tried to use the kind of, like, out-of-vocabulary token, or the mask token.

00:22:26.000 --> 00:22:35.000
Um, I think overall, the best idea is really to do random sampling and just averaging out, but this, again, is extremely expensive, so…

00:22:35.000 --> 00:22:39.000
Um, no consensus there.

00:22:39.000 --> 00:22:45.000
Um, so there were some interesting variants that were proposed specifically for NLP. I have an image here.

00:22:45.000 --> 00:22:50.000
Um, so the idea here was, instead of going for a straight path, let's instead

00:22:50.000 --> 00:23:01.000
do this kind of, uh, snapping to the nearest neighbor, kind of like we were saying before, and let's find the path that passes through existing tokens in the model vocabulary.

00:23:01.000 --> 00:23:04.000
So let's take the integral with respect to this path.

00:23:04.000 --> 00:23:10.000
Um, it kind of works, but it's not that much better, so I… again, there's no, like…

00:23:10.000 --> 00:23:13.000
Um, one-size-fits-all for this kind of method.

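NOTE
A sketch of the "path through real tokens" idea, in the spirit of discretized variants of integrated gradients; this naive nearest-neighbor projection is an assumption for illustration, not the exact published algorithm.
import torch
def snap_to_vocab(points, embedding_matrix):
    # points: (seq_len, dim) interpolated embeddings; embedding_matrix: (vocab_size, dim)
    dists = torch.cdist(points, embedding_matrix)  # distance of each point to every real token embedding
    return embedding_matrix[dists.argmin(dim=-1)]  # snap each path point to its nearest real token
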
00:23:13.000 --> 00:23:25.000
And I also wanted to highlight SmoothGrad, which is a different approach, which just adds noise, for example some Gaussian noise, to the inputs and averages the resulting gradient estimates.

00:23:25.000 --> 00:23:34.000
Uh, this also improves robustness, so this points to the fact that what actually matters is robustness in these kinds of evaluations, right?

00:23:34.000 --> 00:23:37.000
Because gradients are noisy overall.

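NOTE
A SmoothGrad sketch: average gradients over several noisy copies of the input embeddings (the noise level and sample count are illustrative).
import torch
def smoothgrad(forward_fn, x, n_samples=25, sigma=0.1):
    total = torch.zeros_like(x)
    for _ in range(n_samples):
        noisy = (x + sigma * torch.randn_like(x)).detach().requires_grad_(True)
        forward_fn(noisy).backward()  # scalar target, e.g. the target-token probability
        total += noisy.grad
    return total / n_samples  # the noise averages out, the stable part of the gradient remains
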
00:23:37.000 --> 00:23:45.000
Um, yeah, so there were several questions regarding the implementation invariance properties, uh, here.

00:23:45.000 --> 00:23:52.000
Whether they always hold, whether they're good, and whether the axioms matter in practice.

00:23:52.000 --> 00:24:00.000
um… I think overall, my perspective is that they don't really matter, uh, in the sense that

00:24:00.000 --> 00:24:02.000
I feel like this is really…

00:24:02.000 --> 00:24:09.000
uh, you know, best-case scenario, um, although integrated gradients is not really the best, nor the most efficient method out there.

00:24:09.000 --> 00:24:15.000
Um, so probably, especially this implementation invariance is kind of like, uh,

00:24:15.000 --> 00:24:26.000
you know, a moonshot kind of goal of, like, having two models that are exactly identical, uh, except for, you know, the implementation, but they behave exactly identically.

00:24:26.000 --> 00:24:30.000
Uh, I think this is not very realistic.

00:24:30.000 --> 00:24:38.000
Yeah, so there was someone that was mentioning, I don't remember if it was Claire, maybe, that was mentioning their example about the

00:24:38.000 --> 00:24:45.000
Like, what if we have this, uh, classification that is based on the background versus looking at

00:24:45.000 --> 00:24:49.000
Uh, at the subject in the photo, and then the two behaviors are

00:24:49.000 --> 00:24:51.000
the same, but for different reasons, right?

00:24:51.000 --> 00:24:56.000
Um, but then the… maybe you can correct me if I… if I said it wrong.

00:24:56.000 --> 00:24:58.000
No, that's right.

00:24:58.000 --> 00:25:04.000
Right, right. Um, yeah, so my take on that is that if indeed this was the case, uh,

00:25:04.000 --> 00:25:12.000
like, implementation invariance would apply to all sorts of examples, right? So you should be able to craft an example, if the heuristics are different,

00:25:12.000 --> 00:25:18.000
Such that the two networks would have different behaviors. And if the two networks have different behavior, then

00:25:18.000 --> 00:25:23.000
implementation invariance doesn't apply, right? So then, uh…

00:25:23.000 --> 00:25:30.000
Um, yeah, I feel like, again, what you're pointing out is the fact that for a given set of examples,

00:25:30.000 --> 00:25:36.000
Um, this might be the case, that the two methods, like, the two networks are behaving the same way.

00:25:36.000 --> 00:25:42.000
But in practice, this is rarely the case. If they have different heuristics, you should be able to find examples that…

00:25:42.000 --> 00:25:47.000
lead them to behave differently, right?

00:25:47.000 --> 00:25:53.000
I don't know if you agree with that.

00:25:53.000 --> 00:25:56.000
Alright, um…

00:25:56.000 --> 00:26:02.000
So, one thing that I wanted to mention is that my…

00:26:02.000 --> 00:26:07.000
My opinion on the most promising approaches that are currently used in this space are

00:26:07.000 --> 00:26:10.000
methods that are basically trying to tweak

00:26:10.000 --> 00:26:13.000
Um, uh, this kind of gradient propagation

00:26:13.000 --> 00:26:25.000
without making it too inefficient, so they don't take this kind of approach of, like, taking steps or, uh, you know, multiple prediction steps, but they are just crafting some

00:26:25.000 --> 00:26:29.000
either custom propagation rule, like in the case of

00:26:29.000 --> 00:26:31.000
layer-wise relevance propagation.

00:26:31.000 --> 00:26:37.000
Or, uh, accounting for transformer quirks, so the GIM method, for example,

00:26:37.000 --> 00:26:39.000
is something that was recently proposed.

00:26:39.000 --> 00:26:45.000
And one of the things that they were proposing is, uh, can we compensate for, um,

00:26:45.000 --> 00:26:50.000
for this kind of behavior, where if you remove something in the… from the input,

00:26:50.000 --> 00:26:55.000
Then the softmax would reallocate the importance, and then this will lead to different behavior, right?

00:26:55.000 --> 00:27:07.000
Um, so I think these are very interesting, especially because their cost, in the end, is the same as regular gradient-based attribution, so much more efficient than integrated gradients, so probably

00:27:07.000 --> 00:27:14.000
can scale to this kind of applications where we use language models, right?

00:27:14.000 --> 00:27:16.000
All right.

00:27:16.000 --> 00:27:26.000
Um, so, one area that has been explored to some degree, but, uh… yeah, still with mixed success, is instead the idea of, like,

00:27:26.000 --> 00:27:32.000
Um, just looking at model internals. So, for now, we always relied on prediction, right?

00:27:32.000 --> 00:27:40.000
Either we take the gradient with respect to the prediction, or we take, uh, we look at the difference in prediction,

00:27:40.000 --> 00:27:45.000
Uh, when, uh, I don't know, ablating, occluding components.

00:27:45.000 --> 00:27:49.000
But here, maybe we can just look at some properties within the network.

00:27:49.000 --> 00:27:55.000
to try to understand how the model is allocating importance to different components in the input. For example,

00:27:55.000 --> 00:28:04.000
Initially, people were very keen on doing that with attention weights, uh, right? And this has been kind of, like, controversial.

00:28:04.000 --> 00:28:08.000
Uh, so you're in this mixed success links, you find two papers that are titled,

00:28:08.000 --> 00:28:14.000
Uh, attention is not explanation, and attention is not not explanation.

00:28:14.000 --> 00:28:18.000
Uh, so people debated that, uh, quite a lot.

00:28:18.000 --> 00:28:22.000
I think there are some promising works here, um…

00:28:22.000 --> 00:28:28.000
So, as I said, initial work was looking at attention weights in a vacuum.

00:28:28.000 --> 00:28:30.000
Uh, which was quite misleading.

00:28:30.000 --> 00:28:34.000
Then there has been some work in that direction that thought,

00:28:34.000 --> 00:28:38.000
The reason why this is misleading is that we're not considering

00:28:38.000 --> 00:28:43.000
the actual vector, so the value vectors that these weights are multiplied by,

00:28:43.000 --> 00:28:51.000
So, why don't we instead look at the vectors, so the magnitude of the resulting vectors, rather than looking at the attention weight?

00:28:51.000 --> 00:28:56.000
So, in this example here, you can see the second attention weight is quite large.

00:28:56.000 --> 00:28:59.000
So you would say, oh, this word is very important.

00:28:59.000 --> 00:29:07.000
But actually, the vector is very small, and maybe this is just a compensatory behavior, right? Uh, to… to get a final vector that is not too…

00:29:07.000 --> 00:29:09.000
too small. And…

00:29:09.000 --> 00:29:12.000
The final perspective here was

00:29:12.000 --> 00:29:21.000
Can we relate these three vectors here to the final outcome of the attention operation, and see which one of these

00:29:21.000 --> 00:29:23.000
uh, is closer to that.

00:29:23.000 --> 00:29:28.000
If this green vector, for example, is closer to the final outcome of the attention operation,

00:29:28.000 --> 00:29:33.000
then it might be that it's the most, uh, influential towards that computation, right?

00:29:33.000 --> 00:29:39.000
Meaning, it's the most aligned with whatever the attention is doing to the, uh, to the vectors in the input.

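NOTE
A sketch of the "look at the vectors, not just the weights" idea for one attention head (shapes and names are illustrative): a large weight on a tiny value vector moves the output very little, so weight times value norm is a better proxy than the weight alone.
import torch
def value_weighted_attention(attn_weights, values):
    # attn_weights: (seq, seq) weights for one head; values: (seq, head_dim) value vectors
    return attn_weights * values.norm(dim=-1).unsqueeze(0)  # rescale each weight by ||value_j||
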
00:29:39.000 --> 00:29:45.000
Um, so these are some pointers for you if you're interested in digging deeper.

00:29:45.000 --> 00:29:49.000
Um, the cool part about these approaches is that they're entirely

00:29:49.000 --> 00:29:54.000
forward-based, so there's no backpropagation, super efficient at inference time,

00:29:54.000 --> 00:29:57.000
Uh, can scale very well with, like,

00:29:57.000 --> 00:30:00.000
long context, you know, big models, so…

00:30:00.000 --> 00:30:06.000
Definitely, that's the big appeal of these kind of methods.

00:30:06.000 --> 00:30:14.000
Alright, um, so I wanted to give you an overview also of the InSeq toolkit that is… yeah, sorry.

00:30:14.000 --> 00:30:15.000
Sure.

00:30:15.000 --> 00:30:18.000
So, actually, can I pause you for a moment, Gabriel? So you sort of described two…

00:30:18.000 --> 00:30:20.000
or maybe 3 different classes of methods now. So you've described occlusion,

00:30:20.000 --> 00:30:22.000
Right.

00:30:22.000 --> 00:30:25.000
Uh, you've described these kind of interesting gradient.

00:30:25.000 --> 00:30:26.000
based methods.

00:30:26.000 --> 00:30:27.000
Yeah.

00:30:27.000 --> 00:30:36.000
And then these… I guess it's… maybe this kind of attention-based methods, or are there things other than… you say internals-based, but maybe the… mainly attention-based methods.

00:30:36.000 --> 00:30:37.000
Yeah, yeah, yeah.

00:30:37.000 --> 00:30:39.000
And so… so, like, okay.

00:30:39.000 --> 00:30:43.000
So we got… we got all these students here trying to do their projects, and they're…

00:30:43.000 --> 00:30:51.000
They're probably gonna try some input attribution at some point. What's your opinion? Do you, like,

00:30:51.000 --> 00:30:56.000
Would you… would you advise people to do one of these or the others? Is one of them…

00:30:56.000 --> 00:31:01.000
Uh, you know, things that people believe more, or… yeah, what… I just want to get a sense for your…

00:31:01.000 --> 00:31:02.000
your own opinion here.

00:31:02.000 --> 00:31:07.000
Yeah, so I think my perspective is, again, this gradient-based ones,

00:31:07.000 --> 00:31:09.000
these specific variants

00:31:09.000 --> 00:31:19.000
seem to be the most faithful at the moment, and they are kinda actionable if you're working with models that are not extremely large.

00:31:19.000 --> 00:31:23.000
So probably, if I had to go for something, I would go for…

00:31:23.000 --> 00:31:26.000
one of these two methods that I'm linking down here.

00:31:26.000 --> 00:31:32.000
They both have quite nice implementations available, so that's also a big plus, right? Kind of like, you can just plug and play with

00:31:32.000 --> 00:31:35.000
with existing models, um…

00:31:35.000 --> 00:31:47.000
So, yeah, so that would be my guess. If I really found out that this wasn't scalable enough for the use case that I'm working on, for whatever reason, because it's too long of a context, too big of a model,

00:31:47.000 --> 00:31:54.000
I guess my second choice would be some of these, uh, these information flow routes, for example.

00:31:54.000 --> 00:31:58.000
Which is a forward-only method, probably would be my go-to.

00:31:58.000 --> 00:32:00.000
Um, yeah.

00:32:00.000 --> 00:32:01.000
Nice, thanks.

00:32:01.000 --> 00:32:03.000
I guess that's my… my opinion, yeah.

00:32:03.000 --> 00:32:05.000
Cool.

00:32:05.000 --> 00:32:06.000
Um… yeah.

00:32:06.000 --> 00:32:07.000
One question.

00:32:07.000 --> 00:32:09.000
Yep, sorry.

00:32:09.000 --> 00:32:13.000
Sorry, yeah. So, so the gradient-based approaches, we…

00:32:13.000 --> 00:32:15.000
Yeah.

00:32:15.000 --> 00:32:18.000
based on your experience, how…

00:32:18.000 --> 00:32:28.000
How much does it scale for long context? Let's say if I want to analyze a chain of thought and want to, let's say, understand which are the important input tokens,

00:32:28.000 --> 00:32:29.000
Yep.

00:32:29.000 --> 00:32:31.000
Or import… important input sentences.

00:32:31.000 --> 00:32:36.000
Would the integrated gradients methods work?

00:32:36.000 --> 00:32:41.000
Yeah, uh, the only downside with gradients is that sometimes we find…

00:32:41.000 --> 00:32:50.000
some sort of spreading out of importance, so it's unlikely that any of the tokens will receive exactly zero importance, right?

00:32:50.000 --> 00:32:59.000
Uh, even if we were to ablate them, probably the effect on the output would be irrelevant for many of them. Like, I don't know, function words, you know, this kind of…

00:32:59.000 --> 00:33:07.000
unrelated stuff, so that's more of a property of the gradient, and if the context is very large, the importance will tend to be very spread out.

00:33:07.000 --> 00:33:12.000
I think this is something that can maybe be mitigated by using different

00:33:12.000 --> 00:33:14.000
um… um…

00:33:14.000 --> 00:33:21.000
In the future, meaning, like, when designing model architectures, I think in general, you know, going towards more sparse,

00:33:21.000 --> 00:33:29.000
um… activation functions, for example, sparsemax instead of softmax, that would promote this kind of sparsity at the output level, and

00:33:29.000 --> 00:33:35.000
Potentially, this could also reflect into a sparsity in the input when taking gradients with respect to that.

00:33:35.000 --> 00:33:38.000
Um, so…

00:33:38.000 --> 00:33:51.000
Yeah, so I agree, with gradients that's a potential failure case. I think even if spread out, the magnitudes would still be informative, though. I have an example later on

00:33:51.000 --> 00:33:54.000
on retrieval augmented generation, you can see that, uh…

00:33:54.000 --> 00:33:56.000
Even for longer contexts, this is…

00:33:56.000 --> 00:33:58.000
somewhat informative, yeah.

00:33:58.000 --> 00:34:02.000
Okay, okay, cool, thanks.

00:34:02.000 --> 00:34:10.000
All right. Um, yeah, so I just wanted to show you this toolkit that we built, um, that is exactly…

00:34:10.000 --> 00:34:15.000
Uh, for using attribution methods, mostly gradient-based, on language models.

00:34:15.000 --> 00:34:23.000
So the idea here is that you have your hugging Face model, uh, let's say a GPT-like model that receives a prompt,

00:34:23.000 --> 00:34:30.000
And the model will do autoregressive generation, predicting one word at a time, for example, to innovate, one should

00:34:30.000 --> 00:34:36.000
think outside the box. And what the toolkit allows you to do is simply, at every generation step,

00:34:36.000 --> 00:34:41.000
We extract the attribution scores for the given prefix,

00:34:41.000 --> 00:34:46.000
Um, and we can extract also, um, quantities of interest, for example.

00:34:46.000 --> 00:34:49.000
Uh, the probability of the output, the entropy of the output,

00:34:49.000 --> 00:34:54.000
Which are also what we're taking the gradient with respect to, right?

00:34:54.000 --> 00:34:59.000
So, the final outcome of all this would be something that resembles this table.

00:34:59.000 --> 00:35:05.000
Um, so the way that you read this is that the columns are the tokens that were generated,

00:35:05.000 --> 00:35:17.000
And on the rows is the prompt at every generation step. So you see this triangular pattern here, because every new token gets added as an element of the prompt at every generation step.

00:35:17.000 --> 00:35:21.000
Uh, so it plays… it has an influence on the next steps of prediction.

00:35:21.000 --> 00:35:29.000
Um, and you also have, for example, some information of interest, like here I'm also extracting the probability,

00:35:29.000 --> 00:35:36.000
of the predicted token at every step, kind of like, yeah, the final prediction probability.

00:35:36.000 --> 00:35:42.000
So you can see in this example, um, that the moment that the model starts predicting things outside the box,

00:35:42.000 --> 00:35:49.000
The model is mostly relying on Innovate, which is the key word to predict the multi-word expression,

00:35:49.000 --> 00:35:54.000
But the moment it starts producing the sequence, the

00:35:54.000 --> 00:36:02.000
saliency kind of shifts towards the previous tokens in the sequence, which kind of reflects that the model knows where it's going and is just looking at the

00:36:02.000 --> 00:36:08.000
prefix to… to finish the expression, right? This is also reflected by the probability that

00:36:08.000 --> 00:36:14.000
starts at, kind of, 50%, but then it becomes increasingly closer to 100%.

00:36:14.000 --> 00:36:24.000
So, yeah, here you have easy access to, like, a dozen attribution methods, including attention weights, gradient-based, and internals-based ones,

00:36:24.000 --> 00:36:32.000
specifically for generative LMs, so, like, both encoder-decoder, or decoder-only language models.

00:36:32.000 --> 00:36:37.000
And, um, in the paper that we had related to this toolkit, we did a couple case studies.

00:36:37.000 --> 00:36:41.000
Uh, the first was to study gender bias in machine translation.

00:36:41.000 --> 00:36:47.000
So we're highlighting that pronouns have a big role when the model decides to use

00:36:47.000 --> 00:36:53.000
Um, sorry, that professions, uh, stereotypical professions have a big role when the model decides

00:36:53.000 --> 00:37:00.000
to translate, um, uh, as he or she into a language, from a language that doesn't have the distinction.

00:37:00.000 --> 00:37:07.000
Um, and then we also try to approximate patching, so I have a slide on that, uh, later.

00:37:07.000 --> 00:37:12.000
So, this is a simple example of

00:37:12.000 --> 00:37:18.000
of using the library, so here we load a model with integrated gradients, which is the method that you saw.

00:37:18.000 --> 00:37:28.000
Um, and we do this model.attribute, which is kind of like the generate function in Hugging Face; just, on top of that, we also extract the attributions for whatever we're doing, right?

00:37:28.000 --> 00:37:33.000
Uh, so here, uh, we prompt the model with "Does 3 plus 2 equals 6?"

00:37:33.000 --> 00:37:38.000
Uh, and the output that gets generated can then be visualized with show.

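NOTE
The call pattern just described looks roughly like this with Inseq (the model identifier here is an assumption; the lecture's example used a model that answers yes/no).
import inseq
model = inseq.load_model("gpt2", "integrated_gradients")      # any Hugging Face LM + a method name
out = model.attribute(input_texts="Does 3 plus 2 equals 6?")  # generates and attributes each step
out.show()                                                    # renders the attribution table
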
00:37:38.000 --> 00:37:48.000
And it looks like this. Um, so the model predicts, yes, end of sequence, and here we have attribution scores for the prompt, plus

00:37:48.000 --> 00:37:51.000
the yes in the case of the end of sequence, right?

00:37:51.000 --> 00:38:01.000
Um, so do you notice something kind of weird here?

00:38:01.000 --> 00:38:09.000
like, is this kind of attribution pattern what you would expect for this kind of task?

00:38:09.000 --> 00:38:13.000
So I'm gonna… I'm gonna just talk out loud here, because I'm…

00:38:13.000 --> 00:38:15.000
really being slow here, but…

00:38:15.000 --> 00:38:18.000
Let's see, so it's: does 3 plus 2…

00:38:18.000 --> 00:38:20.000
equals 6.

00:38:20.000 --> 00:38:23.000
And I would expect…

00:38:23.000 --> 00:38:28.000
Oh, I see, so then it's gotta decide, so it's deciding whether this is true or not.

00:38:28.000 --> 00:38:31.000
And I would expect that…

00:38:31.000 --> 00:38:37.000
the things that it would have to look at to decide whether it's true is…

00:38:37.000 --> 00:38:40.000
It needs to look at the answer.

00:38:40.000 --> 00:38:41.000
Right, it needs to look at the equation also, right? Yeah.

00:38:41.000 --> 00:38:45.000
And it has to look at the question. It has to look at the question and answer, right? So, like, if you…

00:38:45.000 --> 00:38:46.000
Yeah.

00:38:46.000 --> 00:38:56.000
Right? So, like, so, like, if it would say, like, 3 plus 4 equals 6, right, that would be a different answer. So, like, the 3 and the 2 and the 6 seem like they'd probably be the most important things. Maybe the plus.

00:38:56.000 --> 00:38:58.000
That's right. That's right.

00:38:58.000 --> 00:39:01.000
But it's not. It's like they're the least important parts of the sentence.

00:39:01.000 --> 00:39:06.000
Indeed, indeed. So, like, my hypothesis, by looking at this example, was

00:39:06.000 --> 00:39:18.000
Can it be that this model is just so strongly trying to figure out the desired output format? Like, the fact that this is a yes-no question, rather than a mathematical

00:39:18.000 --> 00:39:20.000
"come up with your answer" question,

00:39:20.000 --> 00:39:24.000
that the fact of these function words receiving high importance is just because

00:39:24.000 --> 00:39:27.000
these are what are driving the final format, right?

00:39:27.000 --> 00:39:32.000
Uh, these are what produces the yes, rather than the equation itself, right?

00:39:32.000 --> 00:39:33.000
But then…

00:39:33.000 --> 00:39:38.000
Oh, right, because you're saying there's 50,000 other words that it could say here.

00:39:38.000 --> 00:39:39.000
Yeah.

00:39:39.000 --> 00:39:41.000
It could say… it could say pumpkin, or whatever, right? And…

00:39:41.000 --> 00:39:44.000
Uh-huh.

00:39:44.000 --> 00:39:45.000
Uh-huh, yes.

00:39:45.000 --> 00:39:53.000
Well, it could be that this model is trained to do maths, right? And then it's usually prompted to produce an answer given a mathematical expression, but in this case, the answer is not a number, right?

00:39:53.000 --> 00:40:00.000
So then it has to rely pretty heavily on… on function words that define the kind of expected format, which is yes, no, right?

00:40:00.000 --> 00:40:01.000
I see, I see, I see.

00:40:01.000 --> 00:40:02.000
Um, so…

00:40:02.000 --> 00:40:11.000
So, this can lead us to formulate some hypotheses, right? Like, the equation is not getting much importance, so can it be that the model is actually getting it

00:40:11.000 --> 00:40:17.000
right for the wrong reason, right? Maybe… maybe it's not actually caring much about the equation, it is just betting

00:40:17.000 --> 00:40:19.000
50-50, yes or no, right?

00:40:19.000 --> 00:40:21.000
And this can be tested.

00:40:21.000 --> 00:40:30.000
And indeed, we find that this… this is a pretty old model, but it's saying yes also for "does 3 plus 2 equals 7", right?

00:40:30.000 --> 00:40:33.000
So in this case, I just gave you an example of, like,

00:40:33.000 --> 00:40:39.000
Uh, we started from some attribution to formulate hypotheses that then we test behaviorally, right?

00:40:39.000 --> 00:40:43.000
So, the thing that I want to emphasize here is

00:40:43.000 --> 00:40:46.000
The attribution itself didn't give us any causal confirmation,

00:40:46.000 --> 00:40:49.000
that what we were trying to do, um…

00:40:49.000 --> 00:40:53.000
was, like, that the model wasn't actually computing the expression,

00:40:53.000 --> 00:40:58.000
But kind of highlighted that maybe the importance was a bit off for this kind of problem, right?

00:40:58.000 --> 00:41:06.000
Um, so this could be valuable for this kind of hypothesis generation, uh…

00:41:06.000 --> 00:41:07.000
Oh yeah, there's…

00:41:07.000 --> 00:41:12.000
Sorry, I'm seeing that… oh, Jasmine brought some messages.

00:41:12.000 --> 00:41:17.000
Uh, yeah. Uh, I can… yeah, we have some examples later, um…

00:41:17.000 --> 00:41:19.000
Yeah, I can… I can try to…

00:41:19.000 --> 00:41:25.000
discuss more about those, uh, in the next few slides.

00:41:25.000 --> 00:41:27.000
Um, so…

00:41:27.000 --> 00:41:31.000
there were some questions about faithfulness, um…

00:41:31.000 --> 00:41:37.000
So, yeah, I don't know if people are here, uh, maybe they want to ask them themselves.

00:41:37.000 --> 00:41:40.000
Luz, Luz, yes, go ahead.

00:41:40.000 --> 00:41:43.000
Or Jasmine, either one.

00:41:43.000 --> 00:41:49.000
Oh, I guess, like, for me, I was wondering, like, when… like, an explanation is, like,

00:41:49.000 --> 00:42:06.000
informative enough, like, um, for instance, like, if someone asked me, like, why did you eat an egg this morning? Like, I could say, like, because I was hungry, but I could also say something like, okay, like, when I was born, like, I… my mom fed me, you know what I mean? Like, it could start from 20-something years ago.

00:42:06.000 --> 00:42:11.000
Like, how do you know, like, when you have enough information, you're like, that's like a reasonable explanation.

00:42:11.000 --> 00:42:18.000
Right. Yeah, I think my answer to that is whenever it's sufficient

00:42:18.000 --> 00:42:20.000
to predict behavior in the…

00:42:20.000 --> 00:42:25.000
current use case, right? Like, at a satisfactory level, I think that's probably…

00:42:25.000 --> 00:42:30.000
Uh, the way that we tend to operationalize faithfulness overall, so…

00:42:30.000 --> 00:42:35.000
Um, yeah, so I would say that's the best way to understand, you know, like, if our explanation

00:42:35.000 --> 00:42:39.000
allows us to, to some degree, to understand, you know,

00:42:39.000 --> 00:42:48.000
what the model is doing there, and if we can act upon it, then probably our explanation is faithful with respect to how the model is doing things internally, right?

00:42:48.000 --> 00:42:54.000
Um, and maybe, you know, if it's a high-risk domain, probably I wouldn't trust

00:42:54.000 --> 00:42:59.000
even a very faithful explanation in the medical domain, because the stakes are very high, right?

00:42:59.000 --> 00:43:05.000
So then, yeah, you really need, uh, like, to weigh your expectations based on

00:43:05.000 --> 00:43:08.000
on the kind of application, I guess.

00:43:08.000 --> 00:43:10.000
Um…

00:43:10.000 --> 00:43:18.000
So yeah, so, um, I like two dimensions in faithfulness. So this is a paper, actually, from Northeastern, from Byron Wallace's

00:43:18.000 --> 00:43:27.000
group. Uh, so they… they were working on faithfulness, uh, very early, uh, in, in interpretability, and…

00:43:27.000 --> 00:43:33.000
One way that they, uh, that they were defining faithfulness was two complementary perspectives.

00:43:33.000 --> 00:43:39.000
So, if we want actionable results: if these tokens are found to be important,

00:43:39.000 --> 00:43:45.000
Uh, one idea is if we drop these tokens, then we expect a big impact on the results, right?

00:43:45.000 --> 00:43:53.000
So this is what they call comprehensiveness. It's kind of like ablation, right? Kind of like occlusion. We occlude, we…

00:43:53.000 --> 00:43:55.000
Uh, we cause a big impact.

00:43:55.000 --> 00:44:01.000
The other perspective is sufficiency. So here we're saying, if we only have these tokens, and we remove all the rest,

00:44:01.000 --> 00:44:06.000
do the results remain kind of consistent, right?

00:44:06.000 --> 00:44:11.000
So, yeah, this is just a visualization. If we find, say, "most amount of leaves" as the important part,

00:44:11.000 --> 00:44:15.000
uh, comprehensiveness would be, like, um, yeah:

00:44:15.000 --> 00:44:18.000
if we, um…

00:44:18.000 --> 00:44:20.000
If we drop that, the probability drops,

00:44:20.000 --> 00:44:27.000
Uh, if we only keep that, the probability kind of stays the same. This is sufficiency.

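NOTE
A sketch of the two metrics as just defined (the `predict_prob` callable and the token-removal convention are assumptions for illustration).
def comprehensiveness(predict_prob, tokens, rationale, label):
    # drop the rationale tokens: a big probability drop means the rationale was comprehensive
    kept = [t for i, t in enumerate(tokens) if i not in rationale]
    return predict_prob(tokens, label) - predict_prob(kept, label)
def sufficiency(predict_prob, tokens, rationale, label):
    # keep only the rationale tokens: a small drop means the rationale was sufficient
    only = [t for i, t in enumerate(tokens) if i in rationale]
    return predict_prob(tokens, label) - predict_prob(only, label)
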
00:44:27.000 --> 00:44:32.000
Um, so, Luz's question about… I think…

00:44:32.000 --> 00:44:38.000
I don't know if Luz is here.

00:44:38.000 --> 00:44:42.000
Maybe not.

00:44:42.000 --> 00:44:49.000
Right? I can… I can summarize it. So the question was…

00:44:49.000 --> 00:44:58.000
Um, if the models are able to explain themselves… like, for example, for MIRAGE, it was: if the models can cite

00:44:58.000 --> 00:45:04.000
Uh, alongside giving an answer, why do we even need a faithful, uh, method that looks at the internals?

00:45:04.000 --> 00:45:06.000
Right? If the model can already do that.

00:45:06.000 --> 00:45:12.000
Um, and our point with that paper was actually that

00:45:12.000 --> 00:45:20.000
the fact that models are capable enough to be kind of precise about this is kind of a recent thing, and before, we were just relying on superficial matching.

00:45:20.000 --> 00:45:26.000
So the citation was done mostly by saying, oh, here is the answer, here is the…

00:45:26.000 --> 00:45:34.000
the three documents that the model received, let's just embed those and look at the similarity between those, and find whatever is the most similar, right?

00:45:34.000 --> 00:45:39.000
Um, so I think that the answer to this question is, I think,

00:45:39.000 --> 00:45:45.000
There is an interesting perspective where we try to make models more aware of their inner workings,

00:45:45.000 --> 00:45:47.000
And in that direction,

00:45:47.000 --> 00:45:51.000
Uh, potentially, we wouldn't need so much of, like,

00:45:51.000 --> 00:45:57.000
digging deep into the model, if the model indeed can kind of self-predict well enough, right?

00:45:57.000 --> 00:46:01.000
Uh, I think we're still kind of far from this perspective, though, so I think…

00:46:01.000 --> 00:46:03.000
probably we do need, um…

00:46:03.000 --> 00:46:09.000
We do need this kind of method still, uh, for the foreseeable future.

00:46:09.000 --> 00:46:20.000
Um, and the complementary perspective to faithfulness is plausibility. So the plausibility dimension is user-centric instead of being model-centric.

00:46:20.000 --> 00:46:23.000
And it's asking, uh,

00:46:23.000 --> 00:46:29.000
Can these explanations be understood by whatever users of the system are looking at the thing?

00:46:29.000 --> 00:46:35.000
Uh, so one big problem is that you don't have guarantees that faithfulness and

00:46:35.000 --> 00:46:42.000
uh… plausibility go hand in hand, right? So ideally, the more faithful you are to the model,

00:46:42.000 --> 00:46:49.000
the more understandable you are for humans, but potentially there could be a disconnect there, right?

00:46:49.000 --> 00:46:52.000
Uh, so sometimes this has also been highlighted as a trade-off.

00:46:52.000 --> 00:46:59.000
So, the more you make explanations plausible, the more you're abstracting away the inner complexity of the model,

00:46:59.000 --> 00:47:03.000
Uh, and this produces this kind of mismatch, uh, that maybe you're kind of like,

00:47:03.000 --> 00:47:10.000
Um, diluting the complexity too much for users.

00:47:10.000 --> 00:47:18.000
So, again, this is application-dependent, and one interesting idea that I like in this domain is counterfactual simulation.

00:47:18.000 --> 00:47:24.000
So, the kind of setting would be, if you have an example, then you can get attribution out of this.

00:47:24.000 --> 00:47:27.000
Uh, the idea would be to…

00:47:27.000 --> 00:47:31.000
Um, given the attribution, can I simulate

00:47:31.000 --> 00:47:36.000
a counterfactual that would produce

00:47:36.000 --> 00:47:42.000
a different behavior, right? Uh, like in the other case that we saw just before,

00:47:42.000 --> 00:47:49.000
Uh, can I simulate the fact that if I change the 6 into a 7, uh, the model still predicts yes, right?

00:47:49.000 --> 00:47:56.000
And then, uh, what you would do is to compare, you know, the expected result with the actual behavior in this case.

00:47:56.000 --> 00:48:05.000
Um, so this is a good way to understand whether what you're looking at is kind of plausible or not, right?

00:48:05.000 --> 00:48:06.000
All right.

00:48:06.000 --> 00:48:10.000
Um, so…

00:48:10.000 --> 00:48:11.000
Um, now talking about

00:48:11.000 --> 00:48:15.000
Sorry… what was the paper you cited for that one? Connecting Attributions?

00:48:15.000 --> 00:48:20.000
Oh, yeah, it's a paper from some years ago from Greg Durrett, uh…

00:48:20.000 --> 00:48:27.000
Uh, from the group of Greg Durrett about, like, doing this kind of counterfactual for evaluating explanations, um,

00:48:27.000 --> 00:48:32.000
And so was it… was that paper about, like, sort of human evaluations? What did Durrett do?

00:48:32.000 --> 00:48:37.000
Yeah, yeah, they were trying to do similar stuff, like, this figure is taken from there, so…

00:48:37.000 --> 00:48:38.000
Cool.

00:48:38.000 --> 00:48:40.000
So this is their setup, actually, yeah.

00:48:40.000 --> 00:48:48.000
They were doing it mostly for, um, classifiers, though, NLP classifiers. Uh, so I… I… I do think that it's a bit…

00:48:48.000 --> 00:48:51.000
more challenging to do that in the generation setting.

00:48:51.000 --> 00:48:54.000
Um, that's a bit related to what we did for…

00:48:54.000 --> 00:48:57.000
for the PECoRe, um, method, actually.

00:48:57.000 --> 00:48:58.000
Okay, cool.

00:48:58.000 --> 00:49:01.000
Yeah.

00:49:01.000 --> 00:49:04.000
Yeah, so, um…

00:49:04.000 --> 00:49:09.000
like, the contrastive attribution setup, I think it's very compelling for language.

00:49:09.000 --> 00:49:12.000
And, um…

00:49:12.000 --> 00:49:16.000
Yeah, I just want you to focus for now on the example that I have on the left.

00:49:16.000 --> 00:49:20.000
Uh, so if you have this input, right, can you stop the dog from…

00:49:20.000 --> 00:49:23.000
And the model predicts barking.

00:49:23.000 --> 00:49:28.000
Um, if we do gradient-based attributions, so this is just simple gradients,

00:49:28.000 --> 00:49:30.000
taken with respect to the inputs.

00:49:30.000 --> 00:49:33.000
And we do the aggregation, as I showed before.

00:49:33.000 --> 00:49:39.000
you would get some scores that look like this. So, red is positive, and blue is negative.

00:49:39.000 --> 00:49:42.000
in this setting, um…

00:49:42.000 --> 00:50:01.000
So, do you think this is intuitive, what you're seeing here? Like, these attribution scores, do they make sense to you, given this prompt?

00:50:01.000 --> 00:50:07.000
Maybe not.

00:50:07.000 --> 00:50:13.000
So there's things that are very positive and things that are very negative, and does white mean, like, close to zero?

00:50:13.000 --> 00:50:15.000
Yeah.

00:50:15.000 --> 00:50:16.000
And so…

00:50:16.000 --> 00:50:24.000
Yeah, so the highest here is from, right? From is very, very, uh, positively influencing barking.

00:50:24.000 --> 00:50:28.000
Right. And V is very negatively influencing barking.

00:50:28.000 --> 00:50:30.000
Yeah.

00:50:30.000 --> 00:50:33.000
But the word that has the least effect is dog.

00:50:33.000 --> 00:50:37.000
Yeah.

00:50:37.000 --> 00:50:41.000
Which makes you smile. It seems like it's very counterintuitive, seems like…

00:50:41.000 --> 00:50:43.000
Look at, look at Nikhil, he's laughing at this.

00:50:43.000 --> 00:50:45.000
Yeah, exactly.

00:50:45.000 --> 00:50:51.000
Yeah, I mean, that's… that's weird, right? That's exactly the opposite of what we would expect, right?

00:50:51.000 --> 00:50:55.000
Um, and I think one of the…

00:50:55.000 --> 00:50:58.000
One of the reasons for that is that

00:50:58.000 --> 00:51:01.000
As humans, we tend to reason counterfactually, right?

00:51:01.000 --> 00:51:04.000
So when I'm asking you, um,

00:51:04.000 --> 00:51:09.000
what would come after that? Like, can you stop the dog from barking, right?

00:51:09.000 --> 00:51:12.000
If you had to explain barking,

00:51:12.000 --> 00:51:21.000
You have the tendency to reason semantically about barking, right? So, like, barking and dog are related words, so dog should receive a big importance, right?

00:51:21.000 --> 00:51:24.000
But exactly, exactly as Jasmine is saying in the chat.

00:51:24.000 --> 00:51:30.000
eating could be an alternative word, right? So, naturally, in a sense, we're contrasting in our head

00:51:30.000 --> 00:51:36.000
barking with some other plausible alternative that doesn't involve dogs, right?

00:51:36.000 --> 00:51:40.000
Um, well, in practice, what attribution here is doing

00:51:40.000 --> 00:51:43.000
is just detecting relevance, right?

00:51:43.000 --> 00:51:45.000
And you… I could argue with you that…

00:51:45.000 --> 00:51:49.000
Uh, the from here is super important to predicting barking,

00:51:49.000 --> 00:51:57.000
Because it's… it's the immediately preceding word, uh, and it's exactly what's dictating that the verb should be in that form, right?

00:51:57.000 --> 00:52:00.000
Without from, we wouldn't have a present continuous verb there, right?

00:52:00.000 --> 00:52:03.000
So then it's… it's essential, right?

00:52:03.000 --> 00:52:08.000
Um, so then, how do we actually bring this closer to human intuition?

00:52:08.000 --> 00:52:14.000
Um, so the idea that they had in this paper that I'm citing here from, uh…

00:52:14.000 --> 00:52:16.000
Kayo Yin and Graham Neubig.

00:52:16.000 --> 00:52:19.000
is to have a contrastive attribution.

00:52:19.000 --> 00:52:23.000
And the way that you would do this is by contrasting two words.

00:52:23.000 --> 00:52:26.000
Um, so the idea…

00:52:26.000 --> 00:52:31.000
is very simple. Instead of taking the gradient with respect to a single probability,

00:52:31.000 --> 00:52:38.000
We can take it with respect to a difference in probabilities. Here, probability of barking versus probability of crying, for example.

00:52:38.000 --> 00:52:44.000
And then the gradient that we get with respect to the input looks a lot more reasonable, if you ask me.

00:52:44.000 --> 00:52:48.000
So, dog now finally has a meaning, and from doesn't matter much.

00:52:48.000 --> 00:52:53.000
Because it would be a good choice for both verbs, right?
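
As a rough sketch of what this contrastive gradient looks like in code, assuming a Hugging Face causal LM (gpt2 as a stand-in) and that each alternative continuation starts with a single subword:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

ids = tok("Can you stop the dog from", return_tensors="pt").input_ids
# Attribute w.r.t. the input embeddings so we can take per-token gradients.
embeds = model.get_input_embeddings()(ids).detach().requires_grad_(True)
probs = model(inputs_embeds=embeds).logits[0, -1].softmax(-1)

# First subword of each alternative (assumed to be single tokens here).
target = tok(" barking").input_ids[0]
contrast = tok(" crying").input_ids[0]

# Contrastive attributed function: p(barking) - p(crying).
(probs[target] - probs[contrast]).backward()

# One saliency score per input token: L2 norm of the embedding gradient.
for token, g in zip(tok.convert_ids_to_tokens(ids[0]), embeds.grad[0]):
    print(f"{token:>8s}  {g.norm():.4f}")
```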

00:52:53.000 --> 00:53:00.000
So, in their work, what they were showing is that by disentangling this kind of semantic factors from syntactic,

00:53:00.000 --> 00:53:08.000
factors, you can improve simulatability, which is a bit like what we were seeing before in plausibility.

00:53:08.000 --> 00:53:12.000
Uh, so the ability of a human to actually, um,

00:53:12.000 --> 00:53:17.000
predict whether the prediction would change by changing a specific word, right?

00:53:17.000 --> 00:53:29.000
Um, so yeah, one thing that I want to stress here, here on the right, is that the attribution function, like, sorry, the attributed function, in this case the difference in probability,

00:53:29.000 --> 00:53:35.000
is fundamental to the way that you interpret what you're getting out, right? So in this case, we saw

00:53:35.000 --> 00:53:39.000
by… by taking this difference, you can interpret it as a…

00:53:39.000 --> 00:53:48.000
why this rather than something else, right? But here I make an even more abstract example. I could attribute the entropy of the final distribution

00:53:48.000 --> 00:53:50.000
over the vocabulary.

00:53:50.000 --> 00:53:56.000
And this could maybe tell me what in the input is driving the uncertainty in the model, right?

00:53:56.000 --> 00:54:05.000
or the certainty in the model. So, in principle, like, the possibilities are endless, you know? You could attribute any kind of function

00:54:05.000 --> 00:54:13.000
of your prediction, and the attribution scores would tell you different things depending on what you're looking at, right?
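
The same loop attributes any differentiable quantity; for instance, the entropy example just mentioned, reusing the names from the sketch above:

```python
# Re-run the forward pass, then attribute the entropy of the next-token
# distribution instead of a probability difference.
embeds = model.get_input_embeddings()(ids).detach().requires_grad_(True)
probs = model(inputs_embeds=embeds).logits[0, -1].softmax(-1)
entropy = -(probs * probs.clamp_min(1e-12).log()).sum()
entropy.backward()
# embeds.grad[0].norm(dim=-1) now scores which tokens drive uncertainty.
```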

00:54:13.000 --> 00:54:16.000
So, I think here, again, there's a lot of…

00:54:16.000 --> 00:54:26.000
untested ground in this area, so people are just kind of starting out in digging into potential variants of that.

00:54:26.000 --> 00:54:28.000
Um…

00:54:28.000 --> 00:54:31.000
Yeah, and I mentioned before the…

00:54:31.000 --> 00:54:33.000
Sorry.

00:54:33.000 --> 00:54:35.000
Oh, sorry.

00:54:35.000 --> 00:54:41.000
Uh, maybe it's that attribution methods are just not good enough for modeling second-order effects.

00:54:41.000 --> 00:54:47.000
Um, yeah, that's a… that's a good point. I have a slide later about, actually,

00:54:47.000 --> 00:54:50.000
modeling interactions of features.

00:54:50.000 --> 00:54:51.000
So there are some methods that are specifically meant for that, since many people asked in the class.

00:54:51.000 --> 00:54:56.000
So, it looks like you have a question from Arnav online from the chat.

00:54:56.000 --> 00:55:03.000
Um, but yeah, they're very expensive, so definitely that's… that's also another problem why they haven't seen much usage.

00:55:03.000 --> 00:55:11.000
Yeah.

00:55:11.000 --> 00:55:12.000
Yeah.

00:55:12.000 --> 00:55:13.000
So, the Graham Neubig method here, this cool contrastive method… this is pretty simple. This is pretty lightweight. It's, like, a pretty cheap gradient to compute. That's neat.

00:55:13.000 --> 00:55:16.000
It also seems like… like, you…

00:55:16.000 --> 00:55:19.000
Here, they're applying it to a gradient method.

00:55:19.000 --> 00:55:25.000
Well, it seems like you might be able to do this for some of the other methods you talked about, like the occlusion…

00:55:25.000 --> 00:55:26.000
Yeah, sure.

00:55:26.000 --> 00:55:29.000
you know, looking at attention, looking at LRP, all this stuff, like, you might be able to drop in.

00:55:29.000 --> 00:55:35.000
I think they were trying, um, they were trying in their original paper also occlusion, like, this kind of setup for occlusion.

00:55:35.000 --> 00:55:37.000
Uh-huh. Yep.

00:55:37.000 --> 00:55:45.000
Um, I have personally already implemented the LRP with the contrastive attribution, so this works. It's tested.

00:55:45.000 --> 00:55:46.000
Yeah, yeah, yeah.

00:55:46.000 --> 00:55:52.000
Oh, you did it. Oh, you used… oh, so you like it. I see a big smile here. So you think this is a pretty good way to go?

00:55:52.000 --> 00:55:56.000
I see.

00:55:56.000 --> 00:55:57.000
Right.

00:55:57.000 --> 00:55:59.000
Yeah, yeah, yeah, I'm a big fan of this idea. For NLP, I think it's very valuable, because the output space is so large, right? You have so many tokens that it just makes sense to…

00:55:59.000 --> 00:56:00.000
Nice.

00:56:00.000 --> 00:56:02.000
pin down exactly what you want to compare, yeah.

00:56:02.000 --> 00:56:06.000
Nice, that's great, thanks. This is really helpful.

00:56:06.000 --> 00:56:12.000
Great. Um, yeah, I just wanted to mention, so for component attribution, uh, I think

00:56:12.000 --> 00:56:17.000
last lecture, two lectures ago, you saw with David the causal mediation, right?

00:56:17.000 --> 00:56:20.000
So, this might be familiar to you, this kind of setup.

00:56:20.000 --> 00:56:23.000
Eiffel Tower is located in Paris, right?

00:56:23.000 --> 00:56:27.000
Um, so when we introduced the Inseq library,

00:56:27.000 --> 00:56:30.000
we asked, but can we use contrastive attribution

00:56:30.000 --> 00:56:33.000
for approximating causal mediation, right?

00:56:33.000 --> 00:56:39.000
Uh, can we use these contrastive, uh, gradient, uh, attributions

00:56:39.000 --> 00:56:49.000
to get saliency, not for the input tokens like we saw just now, but for all intermediate steps, right? And kind of see how much does it agree with causal mediation.

00:56:49.000 --> 00:56:55.000
And our result was that, of course, this is much coarser and not, you know,

00:56:55.000 --> 00:57:00.000
Uh, not as sharp as causal mediation, but we did also see this kind of

00:57:00.000 --> 00:57:06.000
early site that they were highlighting on the last subject token. So, to some degree, let's say our method was

00:57:06.000 --> 00:57:10.000
associating much more importance to the last token,

00:57:10.000 --> 00:57:17.000
But it still found some structure that… that, using causal mediation, would have required a lot of ablations, right?

00:57:17.000 --> 00:57:25.000
Um, and… and the cool part here is that this is very efficient, right? We do a single forward pass,

00:57:25.000 --> 00:57:30.000
And we do one backward pass, in which we get saliency values for all the nodes.

00:57:30.000 --> 00:57:32.000
Here, in the graph.

00:57:32.000 --> 00:57:35.000
And that's it. Uh, instead of doing…

00:57:35.000 --> 00:57:42.000
sequence length times number of layers, uh, forward passes to do… to estimate causal mediation.

00:57:42.000 --> 00:57:49.000
So, actually, there is one very popular attribution method that came after that, called attribution patching.

00:57:49.000 --> 00:57:54.000
Uh, which was introduced more or less at the same time as ours, uh, but got a lot more

00:57:54.000 --> 00:58:04.000
traction. Uh, and uh… and the idea is quite similar, it's just instead of having the contrastive outputs, they contrast inputs, so they change

00:58:04.000 --> 00:58:09.000
They have two settings, they get gradients for the two settings, kind of like in causal mediation.

00:58:09.000 --> 00:58:14.000
Uh, and then they just take the difference between the gradients. But the idea is

00:58:14.000 --> 00:58:18.000
pretty similar, kind of complementary, let's say.
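
A rough sketch of that gradient-times-activation-difference idea; this illustrates the general recipe, not the exact implementation from the attribution patching write-up, and metric and modules are placeholders:

```python
import torch

def run_with_cache(model, inputs, modules, metric):
    """One forward + one backward pass: cache each module's output and the
    gradient of the scalar metric with respect to it.
    (Assumes each hooked module returns a single tensor.)"""
    acts, handles = {}, []
    for name, mod in modules.items():
        handles.append(mod.register_forward_hook(
            lambda m, inp, out, name=name: acts.__setitem__(name, out)))
    scalar = metric(model(**inputs).logits)  # e.g. contrastive logit diff
    grads = dict(zip(acts, torch.autograd.grad(scalar, list(acts.values()))))
    for h in handles:
        h.remove()
    return acts, grads

# Two contrasted runs (e.g. inputs with and without the relevant fact):
# acts_clean, _         = run_with_cache(model, clean_inputs, modules, metric)
# acts_corr, grads_corr = run_with_cache(model, corr_inputs, modules, metric)
# Linear estimate of the patching effect for each component, no ablations:
# effect = {n: ((acts_clean[n] - acts_corr[n]) * grads_corr[n]).sum().item()
#           for n in modules}
```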

00:58:18.000 --> 00:58:21.000
Um…

00:58:21.000 --> 00:58:24.000
So, now I want to move to, um…

00:58:24.000 --> 00:58:26.000
The other… the other work that you've seen,

00:58:26.000 --> 00:58:32.000
Uh, in the readings that was related to attributing context, uh…

00:58:32.000 --> 00:58:37.000
with language models. So, our original contribution in this area was

00:58:37.000 --> 00:58:41.000
um… was this PECoRe framework.

00:58:41.000 --> 00:58:49.000
So the main driver for that was that we found out attribution methods are expensive if you apply them to the whole generation, right?

00:58:49.000 --> 00:58:55.000
Uh, we've seen already, if you had to build the table that I've shown before,

00:58:55.000 --> 00:58:57.000
for a very long generation,

00:58:57.000 --> 00:59:04.000
That would be super expensive, so you kind of want to narrow down exactly on which steps you are interested in doing attribution.

00:59:04.000 --> 00:59:15.000
And secondly, the fact of ambiguity with large vocabularies, so the second driver is, uh, this, you know, need for contrastive explanations, right?

00:59:15.000 --> 00:59:21.000
Um, so what we propose is this Plausibility Evaluation of Context Reliance (PECoRe)

00:59:21.000 --> 00:59:26.000
framework. Uh, and the way that this works is simply in two steps.

00:59:26.000 --> 00:59:31.000
Um, the first step is to identify in the generation which steps are more

00:59:31.000 --> 00:59:34.000
um, influenced by context.

00:59:34.000 --> 00:59:36.000
And then to focus on those,

00:59:36.000 --> 00:59:43.000
to do this contrastive attribution back to the context. So the final outcome of all of this is simply a pair,

00:59:43.000 --> 00:59:48.000
of influential context tokens,

00:59:48.000 --> 00:59:53.000
relating to some influenced generated tokens, right?

00:59:53.000 --> 01:00:01.000
So now, I'll give you a very quick overview of how this works. So, let's assume our model, here it's an encoder-decoder, but it doesn't really matter.

01:00:01.000 --> 01:00:03.000
And, um…

01:00:03.000 --> 01:00:08.000
We are considering a generation task, so English to Italian generation,

01:00:08.000 --> 01:00:14.000
Uh, in which we have a context that the model needs to use to do the task correctly. So let's say here you have

01:00:14.000 --> 01:00:19.000
I ate the pizza, this is my context, it was quite tasty.

01:00:19.000 --> 01:00:27.000
Um, if I have to translate, it was quite tasty in Italian, I have to know whether the tasty is masculine or feminine, right?

01:00:27.000 --> 01:00:32.000
Depending on what I said before. In this case, pizza is feminine, so it needs to be buona.

01:00:32.000 --> 01:00:34.000
For example. Um…

01:00:34.000 --> 01:00:39.000
So, this is the… what we call the contextual variant of the input.

01:00:39.000 --> 01:00:42.000
Then we can have a non-contextual variant,

01:00:42.000 --> 01:00:49.000
that is passed as is, and would predict something different, right? So here, for example, we could go with the masculine as a default,

01:00:49.000 --> 01:00:53.000
Era molto buono, with an O at the end.

01:00:53.000 --> 01:00:57.000
Um, so what our method does is actually…

01:00:57.000 --> 01:01:03.000
is to take the contextual version of the output and force-decode it in the non-contextual case,

01:01:03.000 --> 01:01:12.000
By taking, uh, these kinds of information-theoretic metrics at every step of the generation, so here we enforce the same token

01:01:12.000 --> 01:01:19.000
at every step. And then we look in… for which one of these tokens the distribution would be the most skewed

01:01:19.000 --> 01:01:22.000
by the absence of input context, right?

01:01:22.000 --> 01:01:27.000
So, we would get a score per token that then can be

01:01:27.000 --> 01:01:35.000
discretized, uh, with some heuristics, for example, to get a label that is either positive or negative.

01:01:35.000 --> 01:01:38.000
Um, uh, just to have a…

01:01:38.000 --> 01:01:43.000
Uh, yes-no kind of perspective here. So in this case, we would find this last token

01:01:43.000 --> 01:01:49.000
that was generated is the one that is the most influenced by the context, right?
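
A minimal sketch of this first step, concatenating strings for readability (a real implementation would concatenate token ids; all names are illustrative):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def next_token_logprobs(model, tok, text):
    ids = tok(text, return_tensors="pt").input_ids
    return model(ids).logits[0, -1].log_softmax(-1)

@torch.no_grad()
def context_sensitivity(model, tok, ctx_prompt, noctx_prompt, output_text):
    """Force-decode the contextual output under both inputs; score each
    step by KL(P_ctx || P_noctx) over the next-token distribution."""
    out_ids = tok(output_text).input_ids
    scores = []
    for t in range(len(out_ids)):
        prefix = tok.decode(out_ids[:t])  # same forced prefix in both runs
        p = next_token_logprobs(model, tok, ctx_prompt + prefix)
        q = next_token_logprobs(model, tok, noctx_prompt + prefix)
        scores.append(F.kl_div(q, p, log_target=True, reduction="sum").item())
    # Discretize with a heuristic threshold to get the yes/no labels.
    return scores
```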

01:01:49.000 --> 01:01:52.000
Um, so, yeah.

01:01:52.000 --> 01:01:53.000
Oh, you saw me on mute, yes.

01:01:53.000 --> 01:01:54.000
David? You had a question? Yeah.

01:01:54.000 --> 01:02:02.000
Yeah, yeah, yeah. So you say force decode, so is this just… I didn't really understand the force decode, so you're sort of making two predictions.

01:02:02.000 --> 01:02:03.000
And you're asking,

01:02:03.000 --> 01:02:04.000
Yeah.

01:02:04.000 --> 01:02:07.000
Um…

01:02:07.000 --> 01:02:13.000
Let me think about this for a second. So, you're asking, what is…

01:02:13.000 --> 01:02:18.000
Or… I see. So you're putting the… you're putting the…

01:02:18.000 --> 01:02:21.000
prediction, uh…

01:02:21.000 --> 01:02:24.000
Y-hat…

01:02:24.000 --> 01:02:28.000
in… and you're using the model to evaluate

01:02:28.000 --> 01:02:29.000
Yeah. Yeah, exactly.

01:02:29.000 --> 01:02:34.000
that prediction. And then… and then… and then you're… and then you're making a heat map over…

01:02:34.000 --> 01:02:39.000
the evaluated tokens to say which one

01:02:39.000 --> 01:02:43.000
is most unlikely. And are you… so are you doing that with KL divergence, with the whole distribution, or…?

01:02:43.000 --> 01:02:51.000
Yeah, in this… in this case, we tried several metrics, and KL divergence was the one that was leading to the best results.

01:02:51.000 --> 01:02:52.000
I see.

01:02:52.000 --> 01:02:55.000
We also tried contrasting just the probability of the top token, um…

01:02:55.000 --> 01:02:56.000
like, it's a difference, yeah.

01:02:56.000 --> 01:03:01.000
I see, I see. I see. So it's not literally just the top token, it's just whatever the model was thinking there, you're taking that whole distribution,

01:03:01.000 --> 01:03:03.000
Yeah.

01:03:03.000 --> 01:03:07.000
And then you're putting it in, you're saying, how different is that from what the model wants to think?

01:03:07.000 --> 01:03:08.000
Exactly, yeah.

01:03:08.000 --> 01:03:14.000
Um, over… over in this situation. But you are… but for in… on the input side, you have to feed in,

01:03:14.000 --> 01:03:16.000
the whole sentence, and so you're just feeding in…

01:03:16.000 --> 01:03:17.000
Yeah.

01:03:17.000 --> 01:03:19.000
the whole sentence on the input side.

01:03:19.000 --> 01:03:20.000
I see. I see.

01:03:20.000 --> 01:03:22.000
Yeah, there are two alternatives, right? The one with the context, the one without.

01:03:22.000 --> 01:03:25.000
Yeah, I think this force decode can be a bit misleading here, because the example is very short, right? But…

01:03:25.000 --> 01:03:27.000
Okay. Right.

01:03:27.000 --> 01:03:40.000
Imagine if you had a long example that has many of these kind of keywords that were influenced by the context. The idea here is that I just wanted to express that you…

01:03:40.000 --> 01:03:41.000
Right.

01:03:41.000 --> 01:03:43.000
You will always keep the two cases identical, so you're kind of, like, adding

01:03:43.000 --> 01:03:51.000
the… whatever is there from the contextual case, regardless of what the non-contextual case would predict there, to ensure that the prefix is always matching, right?

01:03:51.000 --> 01:03:52.000
Um,

01:03:52.000 --> 01:03:54.000
Right, right. Because the output becomes part of the input as you go autoregressively.

01:03:54.000 --> 01:04:03.000
Exactly, exactly, and you have to match them, otherwise you would have a disagreement that might be mediated by different outputs, right?

01:04:03.000 --> 01:04:04.000
Yeah.

01:04:04.000 --> 01:04:05.000
Okay. Sorry to ask this question so fast. Is it…

01:04:05.000 --> 01:04:11.000
I want to let the students ask further if we've confused them.

01:04:11.000 --> 01:04:14.000
I don't know if this is clear enough.

01:04:14.000 --> 01:04:16.000
So the goal of this step

01:04:16.000 --> 01:04:19.000
is to basically get a heat map

01:04:19.000 --> 01:04:22.000
Over these Ys, over the output.

01:04:22.000 --> 01:04:23.000
Yep, yep, pretty much.

01:04:23.000 --> 01:04:24.000
Okay, great. Mm-hmm.

01:04:24.000 --> 01:04:29.000
Um, yeah, so as I said, these will be continuous scores, but then we would get

01:04:29.000 --> 01:04:32.000
Um, this kind of discrete yes-no labels, right?

01:04:32.000 --> 01:04:33.000
Sure.

01:04:33.000 --> 01:04:37.000
So the reason why we need these yes-no labels is for the step two,

01:04:37.000 --> 01:04:44.000
Where this will, uh, allow us to understand where to do the attribution, right?

01:04:44.000 --> 01:04:52.000
So, um, the key interesting thing that I think we introduce here is that, let's say that we… now we have our sequence,

01:04:52.000 --> 01:04:57.000
Uh, so this is the contextually generated output with these labels, right?

01:04:57.000 --> 01:05:01.000
Um, the key step here is that we want to force

01:05:01.000 --> 01:05:04.000
Um, the prefix that is the same

01:05:04.000 --> 01:05:09.000
for all the tokens that were found contextually sensitive, right?

01:05:09.000 --> 01:05:14.000
But then to sample from the non-contextual setting,

01:05:14.000 --> 01:05:17.000
the alternative, right?

01:05:17.000 --> 01:05:26.000
So, basically, here, this is just a data-driven way to get to these contrastive pairs that then would allow us to do contrastive attribution.

01:05:26.000 --> 01:05:29.000
Uh, by exploiting the…

01:05:29.000 --> 01:05:32.000
the same model without the context,

01:05:32.000 --> 01:05:36.000
to get what would be its prediction without the context, right?

01:05:36.000 --> 01:05:40.000
So now, we kind of bootstrapped this minimal pair of, like,

01:05:40.000 --> 01:05:49.000
um, words on which we can do the contrastive attribution that I showed before, so probability of one minus the other,

01:05:49.000 --> 01:05:53.000
and propagate this throughout the model back to the input context.
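
And step two in the same illustrative terms: at a context-sensitive position, the forced prefix is shared, and the no-context model's preferred token supplies the contrast for the pair (reusing next_token_logprobs from the sketch above):

```python
@torch.no_grad()
def contrast_pair(model, tok, noctx_prompt, out_ids, t):
    """For a sensitive step t: keep the contextual token y, and take what
    the model would predict at the same forced prefix without context.
    The pair (y, y_alt) then drives p(y) - p(y_alt) attribution."""
    prefix = tok.decode(out_ids[:t])
    y = out_ids[t]
    y_alt = next_token_logprobs(model, tok, noctx_prompt + prefix).argmax().item()
    return y, y_alt
```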

01:05:53.000 --> 01:05:56.000
So what we get here is, for example,

01:05:56.000 --> 01:06:00.000
these continuous scores at the token level, like I showed before,

01:06:00.000 --> 01:06:07.000
And these, again, would be discretized to get pairs. So here, the way that we would read that is

01:06:07.000 --> 01:06:13.000
the pizza in the source is influencing buona, and also the pizza in the target is influencing buona.

01:06:13.000 --> 01:06:16.000
Right?

01:06:16.000 --> 01:06:18.000
Um…

01:06:18.000 --> 01:06:20.000
is this clear enough?

01:06:20.000 --> 01:06:29.000
I know it's quite complex as a structure.

01:06:29.000 --> 01:06:32.000
Yeah, so what kind of decisions…

01:06:32.000 --> 01:06:36.000
can we make off of these results? Um…

01:06:36.000 --> 01:06:45.000
I think the interesting part that you… like, the interesting idea here would be that you can apply this to any kind of task, let's say, in a

01:06:45.000 --> 01:06:49.000
bit of a blind way, right? Like, you have a generation from the model,

01:06:49.000 --> 01:06:52.000
You could just, entirely in a data-driven way, run

01:06:52.000 --> 01:06:57.000
my PECoRe approach, get these relations between inputs and outputs,

01:06:57.000 --> 01:07:08.000
And then you could explore the outputs, right, and form hypotheses, right? So we had kind of interesting examples that we found when we did that on machine translation,

01:07:08.000 --> 01:07:12.000
Sadly, I don't have them here in the slides, but one…

01:07:12.000 --> 01:07:16.000
One example that I can mention that surprised me was that, um,

01:07:16.000 --> 01:07:21.000
we had a text where the model was receiving some information, like,

01:07:21.000 --> 01:07:24.000
the soccer match is at 10am.

01:07:24.000 --> 01:07:28.000
written, like, 10:00 AM, with a colon.

01:07:28.000 --> 01:07:39.000
And then, uh, it had a text that was saying something like, the match was a fierce competition, and it ended 26-0, right, for the blue team, for example.

01:07:39.000 --> 01:07:45.000
And the model was deciding to format 26 to 0 with a colon,

01:07:45.000 --> 01:07:56.000
Because the time in the context was using the same colon format to express the hour, right? Which is kind of a weird behavior, if you think about it. I would have never thought of that myself.

01:07:56.000 --> 01:07:59.000
Um, so I think that kind of highlights

01:07:59.000 --> 01:08:07.000
the importance of making this data-driven, right? So a big motivation for us was that most of this kind of evaluation here

01:08:07.000 --> 01:08:15.000
rely on this kind of hypothesis-based setup: I don't know, I expect my model to have a gender bias, so I craft my set of data

01:08:15.000 --> 01:08:18.000
with gender bias, and I test whether it's there or not, right?

01:08:18.000 --> 01:08:25.000
Well, here, you can just run it on anything, and then kind of post-hoc look at whether there is something interesting there, right?

01:08:25.000 --> 01:08:32.000
Um, so yeah. That's… that's the overall idea.

01:08:32.000 --> 01:08:33.000
Yep.

01:08:33.000 --> 01:08:35.000
That's… that's cool. And so, it's sort of…

01:08:35.000 --> 01:08:37.000
is a way to think about this is…

01:08:37.000 --> 01:08:40.000
There's… there's just too much information.

01:08:40.000 --> 01:08:46.000
in the all-pairs…

01:08:46.000 --> 01:08:47.000
Yep.

01:08:47.000 --> 01:08:52.000
uh… you know, sort of attribution, and you're doing things to try to winnow that down to a small number of edges that you really want to pay attention to. Is that right?

01:08:52.000 --> 01:08:58.000
Exactly. And I think, you know, more philosophically speaking, moving forward with, like,

01:08:58.000 --> 01:09:04.000
interpreting, you know, very complex scenarios. I think this kind of, um…

01:09:04.000 --> 01:09:16.000
narrowing down what we really care for will be super important in the future, too. You know, even if we want to do, I don't know, mechanistic analysis of, you know, something sketchy is going on here that we didn't expect, you know.

01:09:16.000 --> 01:09:23.000
I think in reasoning or, like, agents, you know, I think this will become more and more important to kind of narrow down

01:09:23.000 --> 01:09:26.000
what's going on there. Um…

01:09:26.000 --> 01:09:28.000
So, yeah.

01:09:28.000 --> 01:09:32.000
So that was the original idea here.

01:09:32.000 --> 01:09:40.000
So, the next reasonable step here was, wait, now we can connect outputs to inputs,

01:09:40.000 --> 01:09:41.000
Uh…

01:09:41.000 --> 01:09:50.000
Oh, did… did Jasmine get to ask her question? I see another text that flew by. Did you get to ask your question, Jasmine?

01:09:50.000 --> 01:09:51.000
You're all set. Okay, cool.

01:09:51.000 --> 01:09:55.000
Yeah. Yeah, yeah, yeah, I can, um, took it up, yeah.

01:09:55.000 --> 01:09:59.000
Yeah, yeah, so the next reasonable step here was, um,

01:09:59.000 --> 01:10:04.000
Well, now we can link outputs to inputs, can we use that to create citations, right?

01:10:04.000 --> 01:10:09.000
Uh, so this was very relevant. It was, uh, yeah, a couple years ago, we didn't have, again,

01:10:09.000 --> 01:10:12.000
these models that could cite themselves well.

01:10:12.000 --> 01:10:15.000
Uh, they weren't trained to do that, um…

01:10:15.000 --> 01:10:19.000
So, our idea was, can we just, looking at the internals,

01:10:19.000 --> 01:10:24.000
understand how the inputs are influencing the generation, right?

01:10:24.000 --> 01:10:27.000
So, Mirage works exactly in the same way as PECoRe,

01:10:27.000 --> 01:10:36.000
Uh, so here, our context that gets added or removed is the three documents that were retrieved by a retrieval system.

01:10:36.000 --> 01:10:38.000
And added to the prompt.

01:10:38.000 --> 01:10:41.000
And the functioning is exactly the same, we would

01:10:41.000 --> 01:10:47.000
see how these three documents shift the probability distribution of the model

01:10:47.000 --> 01:10:58.000
for specific tokens in the answer, and then trace this back to some specific tokens in one of the documents that are responsible for the shift.

01:10:58.000 --> 01:11:00.000
Um…

01:11:00.000 --> 01:11:07.000
So, I wanted to show this picture that was also in the paper that David referred you to.

01:11:07.000 --> 01:11:14.000
I think this is quite good evidence in relation to what Nikhil was asking before.

01:11:14.000 --> 01:11:23.000
that even though it's not super clean, this is entirely gradient-based, so this is raw gradients for attribution.

01:11:23.000 --> 01:11:31.000
Uh, and, um, you can see that here on the x-axis, you have the five documents that were given as context, and every point here

01:11:31.000 --> 01:11:38.000
is a word within a document that receives an attribution score, right? So the y-axis is kind of like the…

01:11:38.000 --> 01:11:41.000
the attribution intensity for that word.

01:11:41.000 --> 01:11:46.000
Um, uh, given the output here, 19, in the generation.

01:11:46.000 --> 01:11:51.000
Right? So you can see that this quite cleanly points at the two tokens,

01:11:51.000 --> 01:11:55.000
the exact match, $19 billion, in Document 1,

01:11:55.000 --> 01:12:00.000
as a motivation for predicting $19 billion in the answer.

01:12:00.000 --> 01:12:05.000
So yeah, even though it's not super clean, there is still, like, enough information to kind of, you know,

01:12:05.000 --> 01:12:07.000
cut out exactly what we want here.

01:12:07.000 --> 01:12:13.000
Um, and yeah, in the paper, we try different approaches, and

01:12:13.000 --> 01:12:20.000
we were using some heuristic, like, let's take the top 5% or the top 20% of these tokens,

01:12:20.000 --> 01:12:22.000
Uh, based on the attribution scores.

01:12:22.000 --> 01:12:30.000
But we also tried calibration, so kind of, like, trying to select what would be a good threshold to match a set of gold labels

01:12:30.000 --> 01:12:33.000
Uh, in a way that then we can just…

01:12:33.000 --> 01:12:44.000
find this threshold value and then apply it, you know, to unseen documents, and that seemed to help. So, in general, if you have a gold annotated dataset with citations that you want,

01:12:44.000 --> 01:12:53.000
Um, that probably would be a good way to… to select these.
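
A sketch of that top-percentage heuristic, merging the surviving tokens into citable spans; the calibrated variant would fit the cutoff on gold citation labels instead of using a fixed percentile:

```python
import numpy as np

def select_spans(scores, top_pct=5.0):
    """Keep the top-p% context tokens by attribution score and merge
    adjacent survivors into citable (start, end) token spans."""
    scores = np.asarray(scores)
    keep = scores >= np.percentile(scores, 100 - top_pct)
    spans, start = [], None
    for i, k in enumerate(keep):
        if k and start is None:
            start = i
        if not k and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(keep)))
    return spans  # e.g. [(12, 15), (40, 42)]
```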

01:12:53.000 --> 01:13:10.000
There were some questions related to Mirage, maybe people that are here and can ask them.

01:13:10.000 --> 01:13:17.000
Um, I was asking what would happen…

01:13:17.000 --> 01:13:23.000
If, like, attribution saw two documents that are similar, but, like, different in tone,

01:13:23.000 --> 01:13:24.000
Right.

01:13:24.000 --> 01:13:29.000
So, like, they may have, like, the same content, but presenting different viewpoints.

01:13:29.000 --> 01:13:44.000
Yeah, that… yeah, I thought that was a very good question. I think it depends a lot on the model, right? So it might be that the model will find that, you know, a more explicit mention of something is more…

01:13:44.000 --> 01:13:55.000
actionable to come to an answer, so it might be that that receives the higher importance. Um, actually, what I mentioned here is that recency is probably the most

01:13:55.000 --> 01:13:59.000
common trend that you see in language models, so if something is mentioned

01:13:59.000 --> 01:14:06.000
several times, the last… the last mention is probably gonna be the one that the model is mostly relying on.

01:14:06.000 --> 01:14:10.000
Um, but this all relates to…

01:14:10.000 --> 01:14:20.000
this thing that we were saying before, um, attribution pertains only to the current context, right? So, like, the fact that this token received a high importance,

01:14:20.000 --> 01:14:23.000
is not exactly a proxy for saying,

01:14:23.000 --> 01:14:31.000
if this token wasn't there, then everything would change, because, like, in this case of redundant mentions, if these tokens disappeared,

01:14:31.000 --> 01:14:43.000
then other tokens could take up the attribution in their place. Uh, so sometimes this can lead to these kinds of misleading interpretations, right?

01:14:43.000 --> 01:14:46.000
Yeah.

01:14:46.000 --> 01:14:47.000
And…

01:14:47.000 --> 01:14:49.000
Okay, that's great.

01:14:49.000 --> 01:14:56.000
Is Aria here?

01:14:56.000 --> 01:14:57.000
Yeah, pretty, pretty related.

01:14:57.000 --> 01:14:59.000
Yeah, but I think you've already answered my question, so I don't have to go through it again.

01:14:59.000 --> 01:15:10.000
Yeah. And was this last one?

01:15:10.000 --> 01:15:11.000
not here, I think.

01:15:11.000 --> 01:15:13.000
Shui?

01:15:13.000 --> 01:15:14.000
brochures here, but maybe, maybe not.

01:15:14.000 --> 01:15:15.000
Um… oh.

01:15:15.000 --> 01:15:17.000
Yeah, but I'm not sure.

01:15:17.000 --> 01:15:18.000
Oh, mic is not working. Okay, it's fine.

01:15:18.000 --> 01:15:23.000
Oh, right. Okay, okay. I can just summarize, yeah.

01:15:23.000 --> 01:15:28.000
Yeah, so the idea here was… is this procedure of, like, forcing

01:15:28.000 --> 01:15:32.000
the prefix when… when considering this difference from the context.

01:15:32.000 --> 01:15:38.000
actually breaking the Mirage framework, right, in a sense, because you're forcing a…

01:15:38.000 --> 01:15:42.000
a prefix of the output that is maybe not what the model would generate,

01:15:42.000 --> 01:15:45.000
when the context was absent, right?

01:15:45.000 --> 01:15:50.000
So definitely, this leads to some potentially out-of-distribution behavior there.

01:15:50.000 --> 01:15:55.000
But considering that that's mostly used to select

01:15:55.000 --> 01:15:57.000
cases where the context was influential.

01:15:57.000 --> 01:16:02.000
Um, actually, this can lead to some interesting things, like…

01:16:02.000 --> 01:16:04.000
if something becomes…

01:16:04.000 --> 01:16:07.000
explicitly mentioned in the… in the output,

01:16:07.000 --> 01:16:11.000
Uh, then it might be that whatever comes after…

01:16:11.000 --> 01:16:21.000
is now relying on the output of the model, rather than relying on the context, right? So maybe at first, you need to rely on the context, but from there onwards,

01:16:21.000 --> 01:16:29.000
now you're just relying on the output, uh, so all the subsequent mentions would not be picked out, right, as context-sensitive.

01:16:29.000 --> 01:16:31.000
So I think that's actually something interesting.

01:16:31.000 --> 01:16:37.000
Um, that actually reflects how the model operates, right?

01:16:37.000 --> 01:16:42.000
Um, yeah.

01:16:42.000 --> 01:16:46.000
Alright. Um…

01:16:46.000 --> 01:16:53.000
Yeah, and I just wanted to show you, um, so we have an API for, uh, PECoRe

01:16:53.000 --> 01:16:55.000
inside Inseq now.

01:16:55.000 --> 01:16:57.000
So that's… it's pretty…

01:16:57.000 --> 01:17:03.000
convenient to use, and it's, you know, it can be used within a Jupyter notebook. It's quite nice.

01:17:03.000 --> 01:17:08.000
Um, so we can have a look at this example together. I think it's quite, uh, quite informative.

01:17:08.000 --> 01:17:16.000
Um, so here, the input that the model receives is when was the most successful player in NBA history born.

01:17:16.000 --> 01:17:24.000
And this is not a great model, it's like a 300 million parameter model, so it's… it's pretty bad at doing language modeling, so here,

01:17:24.000 --> 01:17:28.000
The model is predicting 2015-2016.

01:17:28.000 --> 01:17:34.000
Uh, but then it's also saying the most successful player in NBA history is Steven John…

01:17:34.000 --> 01:17:36.000
something, okay? Um…

01:17:36.000 --> 01:17:42.000
So, we set our threshold when we apply the method, and we found that John

01:17:42.000 --> 01:17:47.000
In this case is one of these tokens that is context-sensitive, right?

01:17:47.000 --> 01:17:50.000
So, if we open this toggle here…

01:17:50.000 --> 01:17:52.000
Uh, what we see is

01:17:52.000 --> 01:18:00.000
three documents that the model received that were appended to the prompt, right? So this is a RAG setup, kind of like in Mirage, right?

01:18:00.000 --> 01:18:09.000
So, you can see that the tokens within these documents are colored based on how influential they were towards the prediction of John.

01:18:09.000 --> 01:18:13.000
Here. And we can see that the most influential ones are

01:18:13.000 --> 01:18:19.000
actually, Steven John, right? Which makes sense, so it means the third document here…

01:18:19.000 --> 01:18:22.000
is what is driving the prediction of

01:18:22.000 --> 01:18:24.000
Steven John here.

01:18:24.000 --> 01:18:26.000
But I think even more informative,

01:18:26.000 --> 01:18:29.000
Uh, is the fact that here you can also see

01:18:29.000 --> 01:18:32.000
what the model would have predicted in the non-contextual case.

01:18:32.000 --> 01:18:35.000
So here, the model would have said Steven Kerr,

01:18:35.000 --> 01:18:41.000
Which is probably Stephen Curry, right? I'm not a basketball expert, but I guess…

01:18:41.000 --> 01:18:43.000
That will make sense, uh…

01:18:43.000 --> 01:18:45.000
But it was kind of like…

01:18:45.000 --> 01:18:51.000
um, you know, sidetracked by the presence of this Steven John here in context.

01:18:51.000 --> 01:18:53.000
towards predicting Steven John.

01:18:53.000 --> 01:19:00.000
Here. So, I think this is interesting, because basically this example is pointing at the fact that this model

01:19:00.000 --> 01:19:02.000
is over-relying

01:19:02.000 --> 01:19:10.000
on the context versus its previously memorized information, so it probably would have said something reasonable if only using memory.

01:19:10.000 --> 01:19:16.000
But the context has such a big influence on what it's predicting, that it decided to go for something

01:19:16.000 --> 01:19:19.000
that maybe is not ideal here, right?

01:19:19.000 --> 01:19:26.000
Uh, so this is the kind of visualization you can get in a notebook. So, in the Inseq repository, um,

01:19:26.000 --> 01:19:31.000
we have a notebook with exactly this example, so to reproduce exactly this.
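
For orientation, the basic Inseq flow looks roughly like this (a sketch based on the library's documented core API; the context-attribution entry point used in the notebook may differ across versions):

```python
import inseq

# Attach an attribution method to any Hugging Face model.
model = inseq.load_model("gpt2", "saliency")

# Attribute a generation back to its input tokens and render the heatmap.
out = model.attribute("When was the most successful player in NBA history born?")
out.show()
```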

01:19:31.000 --> 01:19:37.000
Uh, and some analysis on reasoning, also.

01:19:37.000 --> 01:19:42.000
So yeah, so it's a new version that we released some time ago.

01:19:42.000 --> 01:19:50.000
Um, any question on that?

01:19:50.000 --> 01:19:51.000
No.

01:19:51.000 --> 01:19:53.000
Dude.

01:19:53.000 --> 01:19:54.000
All right.

01:19:54.000 --> 01:19:58.000
You know, I'll be interested in, you know,

01:19:58.000 --> 01:20:05.000
whether, you know, whatever methods get adopted in different research projects. You know, each one of these methods is…

01:20:05.000 --> 01:20:08.000
you know, exposing something different, so I'll just be really curious.

01:20:08.000 --> 01:20:09.000
Yeah. Yeah, yeah, yeah, that's true.

01:20:09.000 --> 01:20:14.000
Um, so… And I think it's fair, if people are like, oh, I, you know, I might…

01:20:14.000 --> 01:20:18.000
use some of these methods for my research project for people to just ask now while we have

01:20:18.000 --> 01:20:27.000
Gabriele here, uh, to… to sort of share his wisdom or opinions about different possible applications.

01:20:27.000 --> 01:20:29.000
Yeah.

01:20:29.000 --> 01:20:34.000
I mean, I just want to mention that my perspective on that is, like,

01:20:34.000 --> 01:20:39.000
this is great, like, if you have this kind of setup, right? My next question would be,

01:20:39.000 --> 01:20:44.000
Well, I found that Steven John now is influencing John, right?

01:20:44.000 --> 01:20:49.000
Let's look at the internals, right? Let's do, like, our circuit analysis, let's do our…

01:20:49.000 --> 01:20:57.000
You know, um, causal mediation, but now we have an anchor point, you know, we have some behavior that we identified that is

01:20:57.000 --> 01:21:03.000
there, right? And that would have been much more painful to do had we had to do causal mediation on the full

01:21:03.000 --> 01:21:08.000
sequence of documents here, right? So I think this is…

01:21:08.000 --> 01:21:16.000
great to get started on, like, finding interesting phenomena that then you can dig deeper with the mechanistic toolkits, right? So…

01:21:16.000 --> 01:21:17.000
Cool.

01:21:17.000 --> 01:21:22.000
Yeah, so if you intend to use these kind of things, that's probably my suggestion.

01:21:22.000 --> 01:21:25.000
Yeah.

01:21:25.000 --> 01:21:27.000
And…

01:21:27.000 --> 01:21:31.000
Oh, yeah, there were a couple final slides here, um…

01:21:31.000 --> 01:21:34.000
The first was about interactions.

01:21:34.000 --> 01:21:38.000
Um, so here, the idea is, um…

01:21:38.000 --> 01:21:41.000
like, as many of you ask about this,

01:21:41.000 --> 01:21:53.000
Uh, can we model these kinds of second-order effects, interactions? And indeed, there are several methods to do that. There is this Shapley interaction index,

01:21:53.000 --> 01:22:01.000
Uh, where the idea is that you try pairs and groups of features, increasing in size with these groups,

01:22:01.000 --> 01:22:04.000
to understand when a group is minimal, to predict some behavior.

01:22:04.000 --> 01:22:08.000
Um, and then you have also gradient-based methods, so this…

01:22:08.000 --> 01:22:16.000
Hessian or Integrated Hessians are basically the equivalent of gradients, but you're taking second-order derivatives, so you're asking, like,

01:22:16.000 --> 01:22:23.000
which other factors are more influential for the gradient of that factor to get to this magnitude, right?

01:22:23.000 --> 01:22:25.000
So, in this case,

01:22:25.000 --> 01:22:32.000
Um, yeah, it's the equivalent for interactions. The only problem with all of these methods is that they're quite expensive because you're

01:22:32.000 --> 01:22:38.000
potentially, you know, estimating all possible interactions with all possible, uh, groups.

01:22:38.000 --> 01:22:43.000
So that could become very expensive, unless you start from some, you know, assumption

01:22:43.000 --> 01:22:45.000
Of, like, how these groups should be formed.

01:22:45.000 --> 01:22:55.000
Um, I think some people were working at the level of syntax, for example, right? If something belongs to the same phrase, then it makes sense that they are kind of part of the same group.

01:22:55.000 --> 01:23:04.000
Um… but yeah, I think that's the only actionable way to study these kinds of things.
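
As a concrete instance of the occlusion flavor of this, a sketch of a pairwise interaction score with an illustrative predict function; the quadratic number of model calls is exactly the cost being described, which is why one restricts the candidate groups:

```python
from itertools import combinations

def pairwise_interaction(tokens, i, j, predict):
    """Occlusion-style interaction: how much the joint effect of dropping
    tokens i and j deviates from the sum of their individual effects.
    predict(tokens) -> probability of the predicted class (illustrative)."""
    def drop(idx):
        return [t for k, t in enumerate(tokens) if k not in idx]
    return (predict(tokens) - predict(drop({i}))
            - predict(drop({j})) + predict(drop({i, j})))

# All pairs is O(n^2) model calls, so restrict the candidates,
# e.g. to tokens within the same syntactic phrase:
# scores = {(i, j): pairwise_interaction(tokens, i, j, predict)
#           for i, j in combinations(phrase_token_idx, 2)}
```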

01:23:04.000 --> 01:23:06.000
Um…

01:23:06.000 --> 01:23:16.000
And the final thing, I really like this question from Jasmine that was, uh, like, why are we even doing attribution? Like, what's the… what's the final…

01:23:16.000 --> 01:23:23.000
goal, right? What's the end game of attribution? So I think it's a good way to kind of close on that.

01:23:23.000 --> 01:23:28.000
Um, and I just wanted to highlight some of the works that use that.

01:23:28.000 --> 01:23:35.000
I think one of the very convincing use cases was from people that do, uh, protein design,

01:23:35.000 --> 01:23:41.000
I took this from a presentation at ICLR, uh, two years ago.

01:23:41.000 --> 01:23:51.000
I found it very interesting, because there were these people from Genentech that do this kind of lab protein design, kind of, like, lab-in-the-loop protein design.

01:23:51.000 --> 01:24:02.000
And when they were giving their keynote, I was pretty surprised to find out that they were saying, yeah, you know, if we were to identify exactly which amino acids are responsible for, like, a specific

01:24:02.000 --> 01:24:13.000
gene expression, that would be very painful to do, like, by trying out all possible things. So we just do gradient-based attribution on that… on the protein sequence, and we…

01:24:13.000 --> 01:24:16.000
And we use that, right? So I found it quite compelling as an idea.

01:24:16.000 --> 01:24:21.000
Um, so there are some interesting directions here.

01:24:21.000 --> 01:24:28.000
And I feel like now people are starting to use these kind of methods also for other, more actionable purposes.

01:24:28.000 --> 01:24:32.000
So here, there is a recent paper where they're trying to…

01:24:32.000 --> 01:24:37.000
use attribution to, let's say, steer generation towards looking at

01:24:37.000 --> 01:24:42.000
specific regions of interest. So, like, if you define a constraint in the prompt,

01:24:42.000 --> 01:24:47.000
You could decide which tokens to pick based on which tokens are relying the most on that

01:24:47.000 --> 01:24:50.000
part of the prompt where you're defining your constraint.

01:24:50.000 --> 01:24:58.000
Um, yeah, it's interesting. I would have some criticism about this, probably, but I think it's potentially promising.

01:24:58.000 --> 01:25:03.000
Um, and finally, one thing that I wanted to highlight specifically, because

01:25:03.000 --> 01:25:10.000
Um, it feels like in the mechanistic community, people think, oh, you know, attribution is a thing of the past.

01:25:10.000 --> 01:25:13.000
And now, you know, we don't do this anymore.

01:25:13.000 --> 01:25:18.000
Uh, we did that on images, you know, in the 2010s.

01:25:18.000 --> 01:25:26.000
But actually, all the new methods that do circuit finding, so I think you have a class, David, in the upcoming weeks about that, right?

01:25:26.000 --> 01:25:35.000
But all these kind of methods that aim to find, you know, how components interact towards a prediction are using, effectively, some form of attribution, right?

01:25:35.000 --> 01:25:41.000
Uh, most of them now are gradient-based, actually, some form of integrated gradients, or…

01:25:41.000 --> 01:25:46.000
Um, so I think these questions are still, you know, um…

01:25:46.000 --> 01:25:52.000
very, very, uh, relevant and very important. It's just maybe they kind of became so…

01:25:52.000 --> 01:25:56.000
consolidated that now they moved away from the…

01:25:56.000 --> 01:26:02.000
from the spotlight, kind of, and now they're just the kind of tools that we use without even thinking about it, which is great, I guess.

01:26:02.000 --> 01:26:04.000
Yeah.

01:26:04.000 --> 01:26:06.000
It's still salient.

01:26:06.000 --> 01:26:10.000
Exactly, still important to know what's salient, yeah.

01:26:10.000 --> 01:26:13.000
Yep.

01:26:13.000 --> 01:26:14.000
Great.

01:26:14.000 --> 01:26:16.000
So, yeah, so I think that's it for me.

01:26:16.000 --> 01:26:24.000
Thank you so much for having me, and if you have any questions, I'm here to answer, of course.

01:26:24.000 --> 01:26:28.000
Thanks, Gabriele. It was really, really, really helpful.

01:26:28.000 --> 01:26:32.000
Great. I'm glad.

01:26:32.000 --> 01:26:33.000
Yay! Everybody liked it.

01:26:33.000 --> 01:26:35.000
So…

01:26:35.000 --> 01:26:40.000
So, yeah, so it's… so it's actually… so one, you know…

01:26:40.000 --> 01:26:46.000
it's… to sort of keep up with the theme of what we're doing, I, you know, encourage everybody to give a try

01:26:46.000 --> 01:26:52.000
to, uh, you know, these input attribution methods, there's… there's, uh, as you can see, there's a lot…

01:26:52.000 --> 01:26:55.000
Um, yes, I agree with what…

01:26:55.000 --> 01:27:00.000
Jasmine says: it feels particularly broadly useful in situations with high-stakes decisions.

01:27:00.000 --> 01:27:05.000
Um, you know, my intuition is a lot of these interdisciplinary questions that you guys are asking,

01:27:05.000 --> 01:27:09.000
Where you have a lot of organic text, and um…

01:27:09.000 --> 01:27:11.000
And you're asking, how is the model thinking during…

01:27:11.000 --> 01:27:14.000
During processing of complex text.

01:27:14.000 --> 01:27:19.000
Uh, I… I feel like these methods are really well-suited for it.

01:27:19.000 --> 01:27:25.000
Which is why I wanted to make sure… make sure we cover it, um, before spring break. And so, um…

01:27:25.000 --> 01:27:32.000
So yeah, so… so I… I think it'd be great. I'll be interested to see if, uh, if you're able to find anything.

01:27:32.000 --> 01:27:34.000
interesting in your projects.

01:27:34.000 --> 01:27:40.000
Using these methods. Um, and we have… we have Gabriele here this semester, so…

01:27:40.000 --> 01:27:41.000
Right.

01:27:41.000 --> 01:27:49.000
Uh, so, you know, so take advantage of him. Uh, he's directly helping out on one of the teams, but, you know, he's a general resource for the class.

01:27:49.000 --> 01:27:56.000
Yeah. Yeah, I'll also share the links, of course, to this library that we built in the…

01:27:56.000 --> 01:28:02.000
in the Discord channel, so that you can have a look.

01:28:02.000 --> 01:28:05.000
Great. Okay, guys.

01:28:05.000 --> 01:28:07.000
Stay safe in the snow out there.

01:28:07.000 --> 01:28:16.000
And we'll see you… I'm not sure if we'll see you in person on Thursday or not, hopefully the weather will cooperate, and we'll see you in person on Thursday.

01:28:16.000 --> 01:28:23.000
Thank you, bye-bye.

