WEBVTT

1
00:00:02.770 --> 00:00:07.580
Sarti, Gabriele: And so, we're lucky to have… Gabriele, talking about…

2
00:00:07.710 --> 00:00:13.329
Sarti, Gabriele: Oh, it's… I didn't actually know what the topic was gonna be, but,

3
00:00:13.620 --> 00:00:16.260
Sarti, Gabriele: But it looks like you're going to be…

4
00:00:16.370 --> 00:00:26.499
Sarti, Gabriele: doing LLM agent discussions today, so… Yeah, a lot of different things, actually, so… Oh, okay, that's great. So, Gabriele, you know, he recently finished his

5
00:00:26.660 --> 00:00:30.360
Sarti, Gabriele: PhD, where he focused on

6
00:00:30.470 --> 00:00:43.210
Sarti, Gabriele: interpretability, and he's had a lot of focus not just on interpretability methods… but also interpretability…

7
00:00:43.560 --> 00:00:52.150
Sarti, Gabriele: tools, and the interpretability community at large… these have been, you know, things that

8
00:00:52.330 --> 00:00:57.600
Sarti, Gabriele: have been part of his practice, not just the academic output itself, and so we're really lucky to have Gabriele

9
00:00:57.870 --> 00:01:01.240
Sarti, Gabriele: working with us on NDIF, as we

10
00:01:01.430 --> 00:01:04.920
Sarti, Gabriele: build the interpretability community infrastructure

11
00:01:04.959 --> 00:01:06.099
Sarti, Gabriele: this way.

12
00:01:06.120 --> 00:01:30.689
Sarti, Gabriele: And so, so, okay, so welcome, Gabriele. Great, thanks. Thanks, David. Thanks, everyone, for having me. Yeah, so the presentation for today, it's actually a lot of things packed into a single presentation, but I want to just give you a quick overview first. Like, as David said, I recently graduated from my PhD. I was in the Netherlands working on this consortium that focused on explainability for

13
00:01:30.690 --> 00:01:33.630
Sarti, Gabriele: text, speech, and actually even music models.

14
00:01:33.630 --> 00:01:55.880
Sarti, Gabriele: And my thesis was focused on interpretability for machine translation, and as you see from the title, there's this actionable keyword there. So my core focus wasn't really to try to, I don't know, develop new theoretical understanding of models, but rather, how can we bring the advances that we currently have in the field of interpretability

15
00:01:56.220 --> 00:02:18.669
Sarti, Gabriele: to users of machine translation models in a way that, like, they can use this information to make more informed choices. So a lot of my work was, for example, working with human post-editors that could access uncertainty information from model internals and decide whether it makes sense to correct, for example, some translation.

16
00:02:18.920 --> 00:02:33.340
Sarti, Gabriele: So… and right now, I'm here in David's lab, and I'm working with the NDIF project, and the idea here, my personal focus, is more related to building tools and interfaces to support

17
00:02:33.340 --> 00:02:42.369
Sarti, Gabriele: interpretability workflows towards frontier LLMs, so kind of, like, trying to scale up interpretability a bit, which is also a bit the focus that I tried to give to this presentation.

18
00:02:42.810 --> 00:02:45.290
Sarti, Gabriele: So…

19
00:02:45.460 --> 00:03:09.019
Sarti, Gabriele: I would want to start with some conceptual framing here. So I think everyone here knows about evals, LLM evals, and LLM interpretability, right? So I would argue that these are kind of, like, two different worlds that don't talk much these days, right? And it's kind of a pity, because both are concerned with model understanding, right? So on the one hand, you have evaluations

20
00:03:09.040 --> 00:03:27.629
Sarti, Gabriele: that are mostly behavioral, right? You're running your models, you're getting black box input-output behavior, and this allows you to be very… to scale very well on a lot of data, so you can test over plenty of data sets, but it's quite shallow, right? Because the moment that you flag the behavior,

21
00:03:27.910 --> 00:03:42.239
Sarti, Gabriele: you don't quite understand what is motivating this behavior, right? Is it that the model is misbehaving on this specific example, or on a category of examples? Which components are responsible for that, right? You cannot know that just from black box evaluation.

22
00:03:42.810 --> 00:03:52.199
Sarti, Gabriele: On the other hand, interpretability is great, because you can really probe deeper, right, into the model, understand which components are responsible for different stuff.

23
00:03:52.200 --> 00:04:12.420
Sarti, Gabriele: But the problem is that, as many of you know, like, if you ever try to run a patching experiment, good luck doing this on, like, the setup that black box people run their evaluations on, right? So, in general, I feel like this is an interesting world, because evaluation definitely can motivate interpretability analysis, right? Like,

24
00:04:12.520 --> 00:04:18.359
Sarti, Gabriele: You can pinpoint patterns here, and this can motivate where to focus for interpretability.

25
00:04:18.820 --> 00:04:25.590
Sarti, Gabriele: And in the other… on the other hand, you can use interpretability findings to justify why

26
00:04:25.630 --> 00:04:40.540
Sarti, Gabriele: some evaluation results were, like, the ones observed, right? So I think the main problem here is that there is a tooling gap between these two worlds, right? There is, like, this issue of, like, once I have my evals,

27
00:04:40.540 --> 00:04:50.099
Sarti, Gabriele: how do I move to the interpretability side? So I think that's the core focus that I want to try to address. And that connects a bit with my previous research, right?

28
00:04:50.340 --> 00:05:08.740
Sarti, Gabriele: So just to give you a quick overview of today's presentation, the first part of the presentation is about scaling context understanding. So, a lot of my previous work was about attribution, and a lot of my concern was actually about how do we make attribution more scalable for language model settings, right?

29
00:05:08.740 --> 00:05:21.009
Sarti, Gabriele: So for this kind of autoregressive prediction setting. So in this context, I had this PECoRe framework, Plausibility Evaluation of Context Reliance,

30
00:05:21.180 --> 00:05:28.270
Sarti, Gabriele: And its extension applied to retrieval augmented generation for citations. So the goal here was

31
00:05:28.410 --> 00:05:36.110
Sarti, Gabriele: faithful citations that could outperform self-citation from the model, just by looking at its internals, at how it uses information.

32
00:05:38.200 --> 00:05:55.400
Sarti, Gabriele: Then the second part is indeed about agents, and it's about how can we study agents, not only from a behavioral perspective, but from a representational perspective. And here, I will discuss briefly our recent work about,

33
00:05:55.540 --> 00:06:07.170
Sarti, Gabriele: about studying goal-directedness in a toy setup for LLM agents. So this is a bit rough, there are still rough edges here, so your feedback will be super useful.

34
00:06:07.290 --> 00:06:10.730
Sarti, Gabriele: And the very final slides are actually

35
00:06:11.420 --> 00:06:28.980
Sarti, Gabriele: This is a bit annoying, but, about bridging the tooling gap, and it's about what we are thinking about here in the NDIF context, so… how do we scale the NDIF ecosystem to kind of bridge this tooling gap that I discussed earlier?

36
00:06:29.750 --> 00:06:38.310
Sarti, Gabriele: Any question on that? Any part that you find more interesting? Just to understand how to allocate time, since maybe we're gonna run out of time.

37
00:06:39.680 --> 00:06:45.869
Sarti, Gabriele: Who's interested in number one, and who's interested in number two, and who's interested in number three? Number one, people? Number one?

38
00:06:46.600 --> 00:06:57.000
Sarti, Gabriele: Number two? Killing attribution. Number two? Alright, so we're gonna talk about agents. All right. Number three is very short, so… Okay, I'll be quicker on the first part, then.

39
00:06:57.150 --> 00:06:58.080
Sarti, Gabriele: So…

40
00:06:58.120 --> 00:07:18.720
Sarti, Gabriele: just to give you a very brief overview of gradient-based attribution, I think a lot of you already saw this slide. So you start with some inputs, this is passed through the language model, you get a probability distribution over tokens, and then to do gradient-based attribution, what you normally do is you select a target, in this case, for example, the probability of the output.

41
00:07:18.820 --> 00:07:22.180
Sarti, Gabriele: And you backpropagate with respect to this target to get

42
00:07:22.190 --> 00:07:39.720
Sarti, Gabriele: gradient vectors, right? So these have the same shape as the original embeddings, and one way to get a score per token is to simply aggregate these at the token level. So, for example, the L2 norm of the gradient vector becomes a single score that tells you how important this token

43
00:07:39.950 --> 00:07:57.840
Sarti, Gabriele: was with respect to the prediction, right? So this was the famous saliency map that a lot of people in interpretability are now saying is a dead direction, right? It doesn't exist anymore, and since the rise of mechanistic interpretability, saliency maps are out of fashion, right?
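A minimal sketch of the gradient-norm saliency recipe just described, with GPT-2 as an illustrative stand-in (this shows the idea, not the exact Inseq implementation):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # illustrative model choice
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

ids = tok("The capital of France is", return_tensors="pt").input_ids

# Embed the inputs and track gradients on the embeddings.
embeds = model.get_input_embeddings()(ids).detach().requires_grad_(True)
logits = model(inputs_embeds=embeds).logits

# Select a target: here, the probability of the most likely next token.
target = logits[0, -1].softmax(-1).max()

# Backpropagate the target to get gradient vectors with the same shape
# as the embeddings, then aggregate per token with an L2 norm.
target.backward()
saliency = embeds.grad[0].norm(dim=-1)

for t, s in zip(tok.convert_ids_to_tokens(ids[0]), saliency.tolist()):
    print(f"{t:>12} {s:.4f}")
```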

44
00:07:57.840 --> 00:08:03.519
Sarti, Gabriele: And yet, that's what everyone's heard about in industry. Exactly. And it's all they do.

45
00:08:03.520 --> 00:08:18.890
Sarti, Gabriele: Well, apart from that, I would argue, even in mech interp, we still use saliency a lot, right? All the circuit-finding work is using some notion of saliency to detect edges, right? Like, edge attribution patching is saliency, effectively.

46
00:08:19.100 --> 00:08:22.529
Sarti, Gabriele: To find CLT circuits, you still use saliency.

47
00:08:22.670 --> 00:08:23.710
Sarti, Gabriele: So…

48
00:08:24.580 --> 00:08:35.230
Sarti, Gabriele: So, back when I was starting my PhD in the pre-ChatGPT era, we were… I decided to start building some tools for doing

49
00:08:35.230 --> 00:08:51.009
Sarti, Gabriele: easy attribution on language models. So this Inseq tool was simply a wrapper around Hugging Face that would allow you to very easily do attribution on generation, right? So, the problem that I stumbled on very quickly is that

50
00:08:51.010 --> 00:09:08.220
Sarti, Gabriele: if you start working on realistic settings, like reasoning models or, you know, retrieval augmented generation, you get these kinds of monstrosities here, right? So what you see here is a saliency heatmap of generated tokens by input tokens,

51
00:09:08.220 --> 00:09:24.760
Sarti, Gabriele: and you can see here there is a diagonal, meaning there are new tokens that get added to the generation, right? So it's autoregressive. And then every one of these little scores is basically the saliency of an input token towards the prediction of an output token, right?

52
00:09:25.340 --> 00:09:41.350
Sarti, Gabriele: So the problem is that if you want to do this kind of thing, you have to do attribution at every step, right? You have to do… for the next token, I attribute all previous tokens, right? So that I get these kind of pairs. But then, you can imagine that this is one backward pass per step, right?

53
00:09:41.910 --> 00:09:43.290
Sarti, Gabriele: At the token level.

54
00:09:43.710 --> 00:09:51.310
Sarti, Gabriele: So, super expensive, right? So this scales very poorly with the kind of setup that we have.

55
00:09:53.110 --> 00:10:02.429
Sarti, Gabriele: So, what do we actually care about, right? I think it's interesting, this image, because it shows you that there is some information here, right? It's not a very uniform map.

56
00:10:02.590 --> 00:10:12.169
Sarti, Gabriele: But how do we know that we care about these things, right, in the first place? Can we just scope our analysis to these things to make it more efficient and scalable, right?

57
00:10:13.020 --> 00:10:14.300
Sarti, Gabriele: So…

58
00:10:15.330 --> 00:10:28.030
Sarti, Gabriele: now I want to give you an example of how we do the evaluation in the context of attribution. So, you might have seen some datasets in this context that have this kind of

59
00:10:28.050 --> 00:10:44.729
Sarti, Gabriele: assumptions about some hypotheses, right? So, if you're interested in gender bias, you might have this kind of example, where you have a text that says: after she finished working, the pretty manager went home because she was tired, right? Then you might have some human labels here

60
00:10:44.760 --> 00:10:46.709
Sarti, Gabriele: that say, well,

61
00:10:46.960 --> 00:10:58.699
Sarti, Gabriele: this token here, the token of interest, was predicted because of the presence of the same pronoun before, right? So, 'she' motivates 'she'. This is human judgment, right?

62
00:10:59.250 --> 00:11:15.960
Sarti, Gabriele: And then you could use this dataset to test a language model with attribution, and see which tokens are salient, right? So maybe the language model will say, actually, it's the 'pretty' token in this example that motivates the prediction of 'she', right?

63
00:11:16.210 --> 00:11:32.140
Sarti, Gabriele: So I think this is a cool example, because it also shows this pattern of right for the wrong reasons, right? So the model would predict 'she' correctly, and you would never know that it's doing it for the wrong reason unless you were to look at the token that it's using, right?

64
00:11:32.970 --> 00:11:39.330
Sarti, Gabriele: So, again, this is an example of how we do this plausibility evaluation with human labels.

65
00:11:39.780 --> 00:11:53.589
Sarti, Gabriele: But, the concern here is that this is hypothesis-based, right? So I had to build a dataset of examples like this one, where I have this token labeled as my target that I want to explain, right?

66
00:11:54.230 --> 00:12:08.520
Sarti, Gabriele: It's useful to narrow down specific phenomena, but how do I scale this in a data-driven way, right? So the problem is that you cannot find these plausible or implausible relations in the wild just by running your model, right?

67
00:12:08.900 --> 00:12:26.900
Sarti, Gabriele: So, this was our initial motivation to try to scale this, in this scenario, with this framework that we call Plausibility Evaluation of Context Reliance, PECoRe. Now, I will give you a brief example of how it works. So, let's say that we have text like this.

68
00:12:26.980 --> 00:12:34.419
Sarti, Gabriele: So: I moved the table, it was very heavy. And we say that this first sentence is the context, right? I moved the table.

69
00:12:35.530 --> 00:12:43.589
Sarti, Gabriele: So, if I were to translate this to a language that has gender, for example, French or Italian,

70
00:12:43.940 --> 00:12:50.160
Sarti, Gabriele: this pronoun is ambiguous, right? The model wouldn't know how to translate it unless it had access to the source.

71
00:12:51.180 --> 00:12:52.260
Sarti, Gabriele: So…

72
00:12:52.550 --> 00:12:58.850
Sarti, Gabriele: In this case, we would have a translation like this. In French: J'ai déplacé la table, elle était très lourde. Okay.

73
00:12:59.190 --> 00:13:02.339
Sarti, Gabriele: So, how do we know

74
00:13:02.550 --> 00:13:12.330
Sarti, Gabriele: what is context-dependent here, right? So, one way, with our method, and then I will show a bit better how it works in practice, is that

75
00:13:12.500 --> 00:13:31.280
Sarti, Gabriele: what we care about is detecting which tokens in the target generation are context-sensitive, right? One way to do this is to measure, for example, the gap in probability, or the KL divergence between the full distributions, between the version with context and without context, right?

76
00:13:31.670 --> 00:13:50.020
Sarti, Gabriele: So this tells you this token is context-sensitive. Then from there, we want to do attribution back to contextual cues, right? So then, the cool part is that this retrieves these kinds of pairs that are influential-influenced, right? Context and target.
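As a rough illustration of that first step, here is a minimal sketch of flagging context-sensitive target tokens via per-token KL divergence, teacher-forcing the same answer with and without the context. The model and strings are stand-ins, not the actual PECoRe implementation (which is available through Inseq):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # illustrative model choice
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def answer_log_probs(prefix: str, answer: str) -> torch.Tensor:
    """Teacher-force `answer` after `prefix` and return the model's full
    next-token log-distribution at every answer position."""
    prefix_ids = tok(prefix, return_tensors="pt").input_ids
    answer_ids = tok(answer, return_tensors="pt").input_ids
    ids = torch.cat([prefix_ids, answer_ids], dim=1)
    with torch.no_grad():
        logits = model(ids).logits
    start = prefix_ids.shape[1]
    # Logits at position i predict token i+1, so shift back by one.
    return logits[0, start - 1 : ids.shape[1] - 1].log_softmax(-1)

context = "I moved the table. "
query = "Translate to French: It was very heavy. Answer:"
answer = " Elle était très lourde."

log_p_ctx = answer_log_probs(context + query, answer)
log_p_no = answer_log_probs(query, answer)

# Per-token KL(p_ctx || p_no_ctx): large values flag context-sensitive tokens.
kl = (log_p_ctx.exp() * (log_p_ctx - log_p_no)).sum(-1)
for t, s in zip(tok.convert_ids_to_tokens(tok(answer).input_ids), kl.tolist()):
    print(f"{t:>12} {s:.3f}")
```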

77
00:13:50.640 --> 00:13:56.499
Sarti, Gabriele: So… Now let's see an example in factual QA for that.

78
00:13:56.750 --> 00:14:06.510
Sarti, Gabriele: So, let's say we have this query: how old is the Pannenkoekenkerk in Groningen? So, this place doesn't exist, it means Pancake Church, okay? It's a fictional place.

79
00:14:06.770 --> 00:14:20.130
Sarti, Gabriele: So if you prompt Zephyr 7B with that, the model would give some answer, and then would say, actually, you're asking me a wrong question. This question, like, doesn't make sense, right? This place doesn't exist.

80
00:14:20.430 --> 00:14:23.340
Sarti, Gabriele: So…

81
00:14:23.340 --> 00:14:45.269
Sarti, Gabriele: What if we were to add this context? So here, I'm just adding a small system prompt, in which I ask the model: provide me with a concise answer containing only a few words, and then I also give it a bit of a fictional context that I made up, right? In the heart of Groningen, nestled between Queen Cobble Street, etc.

82
00:14:45.340 --> 00:14:52.009
Sarti, Gabriele: So now, the response of the model is gonna match what it sees in the context, right? As expected.

83
00:14:52.760 --> 00:14:53.920
Sarti, Gabriele: So…

84
00:14:54.040 --> 00:15:06.180
Sarti, Gabriele: that the church is 143 years old, built in 1877. So if we apply PECoRe to this example, what we get is that 143 is context-sensitive.

85
00:15:06.310 --> 00:15:16.690
Sarti, Gabriele: But note, not all the rest. Which makes sense, right? So, the question is already enough to lead to an answer in this form, up to the date, right?

86
00:15:17.960 --> 00:15:19.529
Sarti, Gabriele: But then, the date

87
00:15:20.120 --> 00:15:38.519
Sarti, Gabriele: is dependent both on the original date written in the context, but also on the fact that we mentioned 'few words'. So I think this is a cool example of, like, how this can lead to data-driven discovery of dependencies, right? Because you could have imagined this, right? You could have guessed that there is a link between the dates.

88
00:15:38.830 --> 00:15:44.350
Sarti, Gabriele: But I would not have guessed that this would be related to 'few words' here, right?

89
00:15:44.650 --> 00:15:59.130
Sarti, Gabriele: So the fact that the model is looking at 'few words' there means that probably, without the context, it could have come up with something like: it's a historical building, blah blah blah, right? It could have gone on. But actually now, it's just giving the answer, right?

90
00:15:59.280 --> 00:16:10.670
Sarti, Gabriele: And another interesting thing is, then, also 1877 in the answer is context-sensitive, mapping again to 1877 in the context, right? So this might look

91
00:16:10.670 --> 00:16:21.840
Sarti, Gabriele: quite natural, right? It's just the token looking at the same token in the context. But actually, there's an interesting assumption here, which is that this token is not

92
00:16:21.840 --> 00:16:35.440
Sarti, Gabriele: being computed by looking at this date and working backwards, but just copied from the context, right? So it might be that these two things mismatch, just because the model is referring to the context twice, right?

93
00:16:35.480 --> 00:16:36.430
Sarti, Gabriele: So…

94
00:16:36.650 --> 00:16:48.609
Sarti, Gabriele: Of course, this is just attributional, right? So there is no causality here, but it's great to come up with hypotheses for causality testing, right? You can now test whether patching, yeah.

95
00:16:48.940 --> 00:16:50.780
Sarti, Gabriele: Yeah, true question. Yep.

96
00:16:51.020 --> 00:16:55.470
Sarti, Gabriele: Did you try this method on… on… sort of more adversarial

97
00:16:56.940 --> 00:17:02.189
Sarti, Gabriele: situations where you have more idea of the classes, like, if you have, like, a hidden agenda?

98
00:17:02.300 --> 00:17:06.430
Sarti, Gabriele: Also, there's… and there is some work that has been done

99
00:17:06.700 --> 00:17:13.609
Sarti, Gabriele: trying to elicit, like, suppressed knowledge in Chinese-only models, to strain it out of them.

100
00:17:14.359 --> 00:17:17.969
Sarti, Gabriele: So you think there's, like, opportunities for those kinds of scenarios?

101
00:17:18.180 --> 00:17:30.250
Sarti, Gabriele: Yeah, I think so. I think there is. We haven't tried. The only thing that we tried was in a case, for the example that I'm showing just here, about retrieval augmented generation citations.

102
00:17:30.250 --> 00:17:53.799
Sarti, Gabriele: So normal methods that create these kinds of citations use superficial similarity, right? Like, I take the similarity of the paragraph with the answer, and if it's high, I just say, this is a citation, right? But if your model has a backdoor, and maybe it's just mentioning a word all the time if there is a trigger, right? Similarity would never pick this out, right?

103
00:17:53.800 --> 00:17:55.160
Sarti, Gabriele: But attribution would.

104
00:17:55.250 --> 00:18:03.209
Sarti, Gabriele: So I think that's a good motivation, right? You can actually get a more faithful sense of what matters towards the generation.

105
00:18:03.510 --> 00:18:21.480
Sarti, Gabriele: So, to give you a better intuition, this was a follow-up in which we applied the same framework in the retrieval augmented generation setting. So, the setup is the same. We have a version without context, a version with context, and the context here is three documents that are retrieved dynamically at inference time.

106
00:18:21.480 --> 00:18:25.150
Sarti, Gabriele: And are used to let the model come up with a better answer, right?

107
00:18:25.560 --> 00:18:30.110
Sarti, Gabriele: So the way that PECoRe works in this case is

108
00:18:30.170 --> 00:18:49.759
Sarti, Gabriele: first, we find the influenced generated tokens, like we saw before, and this is done by looking at the distribution for every token that is generated, with the context and without the context, and seeing in which positions this gap is larger, right? So then, in this case, for example, we would flag 'smaller' as

109
00:18:49.780 --> 00:19:05.420
Sarti, Gabriele: context-sensitive, right? Once we have that, we can do attribution to trace it back to influential tokens. So, I think the interesting part about this second step that I didn't show

110
00:19:05.440 --> 00:19:14.520
Sarti, Gabriele: is that the way we do this is actually by using the gradient of the difference in probabilities between the two options.

111
00:19:14.720 --> 00:19:21.199
Sarti, Gabriele: So, like, the cool part of this is that, let's say that 'smaller' was predicted with context, right?

112
00:19:21.310 --> 00:19:27.109
Sarti, Gabriele: But 'PC' would be the top prediction without context, right?

113
00:19:27.140 --> 00:19:42.349
Sarti, Gabriele: Basically, we are mining a pair of, like, valid answers with and without context, which we can use for doing this contrastive attribution with gradients, that is not just measuring what is important towards 'smaller',

114
00:19:42.350 --> 00:19:53.340
Sarti, Gabriele: but what is important to shift the distribution towards 'smaller' from 'PC', which is the original choice, right? So this makes it a bit more robust in its application, right?
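A minimal sketch of that contrastive attribution step, assuming the two candidate answers ('smaller' from the contextual run, 'PC' from the contextless one, as in the example above) are each a single token; the model and prompt are stand-ins:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # illustrative model choice
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompt = "<context documents> <query> The component is"  # stand-in prompt
ids = tok(prompt, return_tensors="pt").input_ids

embeds = model.get_input_embeddings()(ids).detach().requires_grad_(True)
probs = model(inputs_embeds=embeds).logits[0, -1].softmax(-1)

# The mined contrastive pair (assumed to be single tokens for simplicity).
tgt = tok(" smaller", add_special_tokens=False).input_ids[0]
alt = tok(" PC", add_special_tokens=False).input_ids[0]

# Attribute the probability *shift* from the contextless alternative to
# the contextual prediction, not just the prediction's own probability.
(probs[tgt] - probs[alt]).backward()
scores = embeds.grad[0].norm(dim=-1)  # one contrastive score per input token
```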

115
00:19:54.510 --> 00:19:56.120
Sarti, Gabriele: I don't know if it's clear, yeah?

116
00:19:59.020 --> 00:20:01.250
Sarti, Gabriele: For the second one. Yep. Is this…

117
00:20:01.640 --> 00:20:04.539
Sarti, Gabriele: It's just, like, a pipeline, like, first you…

118
00:20:04.630 --> 00:20:18.830
Sarti, Gabriele: find the documents that are important, and then from there, you use gradients on each individual one, or do you just do, like, gradient attribution over all the documents? So the cool part is that once you have all these tokens that are context-sensitive, one by one, you can just

119
00:20:18.900 --> 00:20:28.629
Sarti, Gabriele: do one backward pass and get the saliency over all the tokens of all the documents, right? So it's quite efficient, actually, because this number of tokens is usually small, right?

120
00:20:28.870 --> 00:20:35.110
Sarti, Gabriele: Yeah, that's… It is with all documents in context? Yeah. That's right.

121
00:20:35.320 --> 00:20:52.940
Sarti, Gabriele: So this… maybe there was some recent work trying to approximate… to do some similar stuff with ablations, right? There was some work from MIT, I think, where people were trying to do this kind of LIME estimator for the importance of different sentences. Yeah, Koyena?

122
00:20:58.440 --> 00:21:05.340
Sarti, Gabriele: So, we just propagate the gradient from the prediction to the input embeddings. So it's all the way, kind of, yeah.

123
00:21:06.500 --> 00:21:15.909
Sarti, Gabriele: Yep. This is anticipating the last part of your talk, I think, but if you have a very long context, like you often have in a RAG scenario,

124
00:21:16.270 --> 00:21:21.699
Sarti, Gabriele: just doing backprop can be very hard. That's right.

125
00:21:21.700 --> 00:21:46.609
Sarti, Gabriele: Yeah, so this method is quite brittle in terms of which attribution method you're using, actually, right? So… the cool part is that it's flexible as a framework, in the sense that it's plug-and-play with whatever attribution method you want, right? So, in the original study, we were using raw gradients, and that was pretty bad, actually, in the sense that it's a very toy scenario that works in small contexts.

126
00:21:47.170 --> 00:21:55.120
Sarti, Gabriele: But the moment that the context becomes large, then it becomes very diffused, right? And you lose a lot of information, right?

127
00:21:55.130 --> 00:22:11.690
Sarti, Gabriele: So there's new methods, gradient-based still, that would support this kind of pipeline. For example, this Attention LRP, or, like, GIM that we're using for a later slide, that are much better at kind of, like, narrowing down which tokens are important in a more robust way, yeah.

128
00:22:11.740 --> 00:22:12.410
Sarti, Gabriele: Yeah.

129
00:22:12.580 --> 00:22:25.939
Sarti, Gabriele: So that's not talking about how good they are. I'm just saying that even efficient… efficiently calculating gradients for a very long context… Right. …is very hard, like… Yeah. We ran into this problem in a project with

130
00:22:26.200 --> 00:22:30.349
Sarti, Gabriele: Pfizer, they have a system, a RAG QA system on

131
00:22:31.140 --> 00:22:36.140
Sarti, Gabriele: like, thousands of documents or something. So, just doing backprop…

132
00:22:36.850 --> 00:22:39.820
Sarti, Gabriele: We couldn't… so we couldn't do that on our machine.

133
00:22:40.370 --> 00:22:56.679
Sarti, Gabriele: Yeah, so one solution for that could be to try these kinds of forward-only attributions, like these information flow routes, but they are definitely not output-based, right? They just rely on local similarities, so it's not very precise, but

134
00:22:57.550 --> 00:22:58.220
Sarti, Gabriele: Yeah.

135
00:22:58.670 --> 00:22:59.480
Sarti, Gabriele: Nice.

136
00:23:00.240 --> 00:23:22.320
Sarti, Gabriele: And I just wanted to give you a very quick overview of what it looks like. So, in this plot, you have the five documents that are retrieved and appended to the context, and every dot in this plot is one token within a document, right? And the scores here are the attribution scores that you get by applying the second step of the framework that we saw before, right?

137
00:23:22.320 --> 00:23:39.709
Sarti, Gabriele: So, in the study, we were trying two approaches. We were trying to set some arbitrary thresholds, like this one, like the top 5%, or we were trying to do some calibration based on a set of gold examples with human annotations, right?

138
00:23:39.710 --> 00:23:44.529
Sarti, Gabriele: And what we were showing is that it seems like this approach,

139
00:23:45.290 --> 00:23:56.209
Sarti, Gabriele: Well, we found somewhat contrasting results on different models. It seemed to be working quite well for Zephyr, but not so much for Llama in terms of recall, specifically.

140
00:23:57.840 --> 00:24:04.120
Sarti, Gabriele: no, sorry, in terms of precision for Llama. But in general, the cool part is that you can use

141
00:24:04.120 --> 00:24:23.609
Sarti, Gabriele: the same model… so with the model that you're using for generation, you can just do the attribution back, right? So it's not out of distribution, you don't need an external validator. And the calibration, we found that it did improve performance here. What is your ground truth for that experiment? So when you do a precision-recall measure?

142
00:24:23.690 --> 00:24:48.480
Sarti, Gabriele: So this was manually annotated, human… so it's plausibility, again. Like, it's human labels of, like, which contexts contain which information, right? So that's also a potential gap, right? Here, we're trying to get a model perspective of what matters, right? So evaluating it against the human perspective, it's a bit of a… Assuming the model does it the way the human would do it. Exactly, yeah.

143
00:24:49.000 --> 00:24:52.949
Sarti, Gabriele: Yeah. Can you go back on this slide? Yep.

144
00:24:54.180 --> 00:25:02.599
Sarti, Gabriele: How do you do the finding of the influenced generated tokens? Is it… you've generated an entire answer without context? With context.

145
00:25:03.040 --> 00:25:10.449
Sarti, Gabriele: With and without? Without, we don't really need. So once you have the one with context, you want to look, for every token,

146
00:25:10.460 --> 00:25:29.920
Sarti, Gabriele: at which ones have the highest KL divergence, basically. Between? Between the with and the without. So the without setting is: you take the with answer, you force it, basically. You force it there, and you just measure the probabilities for every token. Yeah, without context. Without context. Exactly.

147
00:25:30.020 --> 00:25:33.049
Sarti, Gabriele: So, like, by then, you already have generated them.

148
00:25:33.900 --> 00:25:36.249
Sarti, Gabriele: Let's say, 50 tokens with the context.

149
00:25:36.380 --> 00:25:55.779
Sarti, Gabriele: Don't you think that would sort of influence…? For sure, for sure. So the idea here is really to try to trace how the context motivates the answer. Then I agree that there are going to start to be dependencies within the answer, right? So our goal was really to pinpoint the, like,

150
00:25:55.880 --> 00:26:02.679
Sarti, Gabriele: which kind of information is flowing from context to answer. The moment that it's internal to the answer, we weren't

151
00:26:03.110 --> 00:26:17.449
Sarti, Gabriele: looking at that, right? But in principle, nothing would prevent you from adding the prefix of the answer to the context here, like, autoregressively, right? And keeping it in the loop, kind of. So you could… you could do that, yeah.

152
00:26:19.250 --> 00:26:20.030
Sarti, Gabriele: Yep.

153
00:26:27.320 --> 00:26:39.779
Sarti, Gabriele: I think in general, Zephyr here was a bit better as a model compared to Llama 2, so it could be a factor here. You can see the scores are a bit higher overall.

154
00:26:42.500 --> 00:26:48.540
Sarti, Gabriele: Performance-wise, like, not in terms of attribution, but it was reflected somehow on the attribution side.

155
00:26:48.660 --> 00:26:50.760
Sarti, Gabriele: amanda.

156
00:26:53.270 --> 00:26:59.809
Sarti, Gabriele: Yeah, so Zephyr is a Mistral. It could have been, but we didn't look into that very much, yeah.

157
00:26:59.930 --> 00:27:04.229
Sarti, Gabriele: Yep. Qualitatively, did you see anything interesting specifically?

158
00:27:04.370 --> 00:27:06.450
Sarti, Gabriele: Any differences between the two models?

159
00:27:07.000 --> 00:27:23.079
Sarti, Gabriele: I have to say, it wasn't super different. We had some examples in the appendix of the paper that were quite interesting, that it seemed like they behaved a bit differently, but it wasn't super, super remarkable, I have to say, yeah.

160
00:27:23.880 --> 00:27:31.079
Sarti, Gabriele: Yeah? Just a quick clarification. So, in terms of the unit for the position, is the unit documents or tokens? Tokens.

161
00:27:31.510 --> 00:27:34.240
Sarti, Gabriele: Yep. Yeah, yeah.

162
00:27:36.100 --> 00:27:38.590
Sarti, Gabriele: Alright.

163
00:27:38.670 --> 00:27:51.100
Sarti, Gabriele: And I just wanted to close this part: we have the API within Inseq to do exactly this. So, this example is quite interesting, I think. So, if you have…

164
00:27:51.100 --> 00:28:00.539
Sarti, Gabriele: Here I was prompting the model, asking it: when was the most successful player in NBA history born? This is a pretty small model, it's a SmolLM2, like,

165
00:28:00.540 --> 00:28:12.109
Sarti, Gabriele: 200 million parameters, so super small. So the model is giving something nonsensical here, which makes sense. But then it was saying the most successful player in NBA history is Steven John.

166
00:28:12.240 --> 00:28:22.450
Sarti, Gabriele: something, and 'John' was found to be context-sensitive, right? So then, if you open the little toggle here, what you see is

167
00:28:22.520 --> 00:28:32.660
Sarti, Gabriele: the retrieved documents with RAG, right? And then you see the scores for all the tokens within the retrieved documents. So you can see that for 'John',

168
00:28:32.660 --> 00:28:44.149
Sarti, Gabriele: the model is mostly looking at 'Steven John' here, right? In the third document. Which makes sense, so it's kind of like, it's copying from the third document to mention that, right?

169
00:28:44.180 --> 00:28:47.189
Sarti, Gabriele: But I thought it was interesting, because you can also see

170
00:28:47.570 --> 00:28:52.570
Sarti, Gabriele: The contrastive alternative that the model would have produced without context, right?

171
00:28:52.850 --> 00:29:04.100
Sarti, Gabriele: And if you know basketball a little bit, Stephen Curry could be a good choice here for saying the most successful player, right? So the model kind of knew the answer before

172
00:29:04.300 --> 00:29:09.200
Sarti, Gabriele: getting the context, but then it was biased by the context to give a wrong answer.

173
00:29:10.220 --> 00:29:11.090
Sarti, Gabriele: So on.

174
00:29:11.440 --> 00:29:14.300
Sarti, Gabriele: Yeah, I thought it was an interesting example.

175
00:29:14.860 --> 00:29:30.170
Sarti, Gabriele: So, as a bridge between the first and the second part, there is some work that we're doing about using attribution to compress CoT reasoning, with some collaborators from Italy. And so the idea here

176
00:29:30.170 --> 00:29:52.950
Sarti, Gabriele: is that we have our original CoT trace in a reasoning model, so now we're trying mostly GPT-OSS 20B, and you have these very performant gradient-based attribution methods that can identify, given the answer that the model produces, which are the tokens that motivate this answer the most within the reasoning chain, right?

177
00:29:53.260 --> 00:30:10.469
Sarti, Gabriele: So, what you can do is you can score these tokens and then use some sort of thresholding, again, to select only the ones that are the most salient towards the answer. So, our idea here was: cool, this lets you do this post-hoc, but what if we train a probe

178
00:30:10.730 --> 00:30:28.020
Sarti, Gabriele: to predict these tokens, given an unseen CoT trace, right? So, given an unseen trace, what we want is that the probe will pick out the tokens that are found to be salient by our attribution method, right? But live, so not post-hoc, just…

179
00:30:28.020 --> 00:30:35.770
Sarti, Gabriele: given a token, we decide whether it's salient towards the final prediction or not, right? And…
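A minimal sketch of the probe idea as described: a logistic-regression probe over per-token hidden states, trained on binary saliency labels that were pre-computed offline with attribution. The file names, layer choice, and threshold are illustrative assumptions about this work in progress:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Offline: for each CoT token, a hidden state from some chosen layer and a
# 0/1 label saying whether attribution flagged it as salient to the answer.
X_train = np.load("cot_hidden_states.npy")    # (n_tokens, d_model), assumed
y_train = np.load("cot_saliency_labels.npy")  # (n_tokens,), assumed

probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Live: score each token of an unseen CoT as it is generated, without
# knowing the final answer.
def is_salient(hidden_state: np.ndarray, threshold: float = 0.5) -> bool:
    return probe.predict_proba(hidden_state[None])[0, 1] > threshold
```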

180
00:30:36.530 --> 00:30:54.049
Sarti, Gabriele: So, one initial experiment with patching that we had here was: given the full CoT trace, we select the relevant tokens with attribution, and we just take the token identities here, so the actual tokens, and we plug them into our new

181
00:30:54.470 --> 00:31:06.930
Sarti, Gabriele: CoT, like, reasoning block that only has these tokens, right? So this is not gonna make any sense, because it's just random tokens, right, that were picked out. But then, what we also do

182
00:31:06.930 --> 00:31:18.339
Sarti, Gabriele: is that we patch the states corresponding to these tokens into these positions, right? So these are not just the tokens, but are actually the patched activations across all layers.
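A rough hook-based sketch of this patching experiment, with GPT-2 standing in for the actual model (the talk uses GPT-OSS 20B) and hard-coded salient positions: the states of the selected tokens are cached from the full trace and written into the compressed trace at every layer:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # stand-in for the real model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
layers = model.transformer.h

full_ids = tok("<full CoT trace here>", return_tensors="pt").input_ids
keep = [1, 3, 4]  # positions flagged salient by attribution (illustrative)

# 1) Run the full trace, caching the salient tokens' states at every layer.
cache = {}
def save_hook(i):
    def hook(mod, inp, out):
        cache[i] = out[0][:, keep, :].detach()
    return hook
handles = [l.register_forward_hook(save_hook(i)) for i, l in enumerate(layers)]
with torch.no_grad():
    model(full_ids)
for h in handles:
    h.remove()

# 2) Run only the salient tokens, overwriting each layer's output at the
#    compressed positions with the cached activations.
def patch_hook(i):
    def hook(mod, inp, out):
        hidden = out[0]
        hidden[:, : len(keep), :] = cache[i]
        return (hidden,) + out[1:]
    return hook
handles = [l.register_forward_hook(patch_hook(i)) for i, l in enumerate(layers)]
with torch.no_grad():
    patched_logits = model(full_ids[:, keep]).logits
for h in handles:
    h.remove()
```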

183
00:31:20.570 --> 00:31:25.950
Sarti, Gabriele: And what we were seeing here is that in most cases, the model can get to the final response.

184
00:31:26.100 --> 00:31:33.780
Sarti, Gabriele: Which, I don't know about you, but I personally found kind of surprising, but I'm curious to hear your opinion, yeah?

185
00:31:35.820 --> 00:31:36.859
Sarti, Gabriele: The answer is.

186
00:31:37.320 --> 00:31:38.560
Sarti, Gabriele: That's right.

187
00:31:38.760 --> 00:31:39.470
Sarti, Gabriele: Bye.

188
00:31:43.030 --> 00:32:06.710
Sarti, Gabriele: That's right, and that's why I say offline is biased, actually, in the title. So, the problem there is that you can have this kind of shortcut that we saw in many examples, where one of the salient tokens is the answer in the reasoning chain. And it's kind of hard to control for that, because we tried stopping right before that, so trying to ablate everything from the first occurrence, but then it would be something like:

189
00:32:06.710 --> 00:32:09.949
Sarti, Gabriele: So, 23 plus 57 is…

190
00:32:10.220 --> 00:32:17.050
Sarti, Gabriele: You know, so you're selecting that instead, and it's still trivial to get to the answer, right? Without doing the full reasoning.

191
00:32:17.490 --> 00:32:30.600
Sarti, Gabriele: So then, the live thing that we're trying to do, and that seems to be working well, is to use this train probe on the COT to select, as a decoding strategy for sampling.

192
00:32:30.600 --> 00:32:37.559
Sarti, Gabriele: So, at generation time, you can consider a prefix of the CoT, then from this prefix,

193
00:32:37.650 --> 00:32:51.370
Sarti, Gabriele: you can sample, for example, four options, and you can apply the probe to the tokens of these four options to detect where there is a higher density of tokens that are found salient towards the final answer, right?

194
00:32:51.460 --> 00:33:00.549
Sarti, Gabriele: And then, this guides your selection of the next sequence, right? So, I would pick the first one as the continuation, keep going, and

195
00:33:00.650 --> 00:33:19.290
Sarti, Gabriele: what we're seeing so far, we still have to kind of, like, finish all our evaluations for that, but it seems like we can reach, basically, very high agreement with, let's say, the full CoT, with much shorter sequences, or something along these lines, let's say.
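A sketch of what this probe-guided sampling could look like, reusing the hypothetical is_salient probe from the sketch above; the branching factor, chunk length, probed layer, and density score are all illustrative assumptions:

```python
import torch

def choose_continuation(model, tok, prefix_ids, k=4, chunk_len=32):
    """Sample k candidate continuations of the CoT prefix and keep the one
    with the highest density of probe-flagged salient tokens."""
    best, best_density = None, -1.0
    for _ in range(k):
        out = model.generate(
            prefix_ids, do_sample=True, max_new_tokens=chunk_len,
            output_hidden_states=True, return_dict_in_generate=True,
        )
        # Last-layer hidden state at the last position of each step.
        states = torch.stack(
            [step[-1][0, -1] for step in out.hidden_states]
        ).numpy()
        density = sum(is_salient(h) for h in states) / len(states)
        if density > best_density:
            best, best_density = out.sequences, density
    return best  # extend the CoT with this branch and repeat
```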

196
00:33:19.690 --> 00:33:32.450
Sarti, Gabriele: So this is a bit of work in progress, but I wanted to show you. Any question on that? Yeah? Yes, I'm curious… I had a thought, but, like, how would you avoid it starting to just repeat tokens over and over?

197
00:33:32.610 --> 00:33:33.389
Sarti, Gabriele: Yeah, it's fair.

198
00:33:34.080 --> 00:33:44.060
Sarti, Gabriele: What do you mean? To repeat tokens? Like, say, imagining this in, like, a… I think…

199
00:33:45.310 --> 00:33:47.620
Sarti, Gabriele: It's a math problem, and…

200
00:33:47.840 --> 00:33:58.499
Sarti, Gabriele: you know, if it says, like, 33 plus 44, then the salient token changes. Yeah. In theory, you could just say 33 plus 44 repeatedly.

201
00:33:58.930 --> 00:34:01.309
Sarti, Gabriele: That's a way to, like, show up more.

202
00:34:02.130 --> 00:34:08.219
Sarti, Gabriele: Yeah, but then we still measure the final capability of outputting the answer, right?

203
00:34:09.330 --> 00:34:16.470
Sarti, Gabriele: So, what do you mean is that it could just keep going, and we would never finish the reasoning, or something like that?

204
00:34:17.580 --> 00:34:32.999
Sarti, Gabriele: No, it makes sense. Yeah, I mean, definitely there should be some stopping condition, right? Otherwise, it could be, like, that you enter this kind of loop of just outputting only this kind of final tokens that are always salient, right? So, we're still thinking about that, yeah.

205
00:34:33.650 --> 00:34:38.419
Sarti, Gabriele: So… The fact that, like, you can train probes…

206
00:34:38.710 --> 00:34:41.580
Sarti, Gabriele: To me, it seems to imply that, like, the model is…

207
00:34:42.260 --> 00:34:50.400
Sarti, Gabriele: like, ahead of time, like, kind of writing in, like, a latent scratchpad, like, oh, I know I'll need this later, I'll use this later for the answer.

208
00:34:50.980 --> 00:34:57.339
Sarti, Gabriele: And sort of… I guess, in this way of, like, choosing among the best of four branches.

209
00:34:57.950 --> 00:35:00.040
Sarti, Gabriele: Optimizing for that signal.

210
00:35:01.010 --> 00:35:02.880
Sarti, Gabriele: you know, what I'm gonna need later.

211
00:35:03.350 --> 00:35:07.540
Sarti, Gabriele: results in… in… I guess denser CoT, so…

212
00:35:08.060 --> 00:35:22.510
Sarti, Gabriele: I just want to get, like, an understanding, like, is this sort of the intuition? Yeah, I think the intuition is that there's gonna be tokens that the model knows are relevant for later, kind of. And I think that's the signal that is

213
00:35:22.510 --> 00:35:41.700
Sarti, Gabriele: kept, while other tokens are kind of, like, a way to… to get more compute, kind of, like, you know, this kind of, oh, wait, let me think through this, right? This is not, like, answer-directed, it's more like, I need more computation to get to something meaningful, let me just kind of stall, you know, for a bit. So, yeah.

214
00:35:42.590 --> 00:36:00.930
Sarti, Gabriele: Oh, yeah. Just, yeah, quick follow-up. Do you think you'd see a trend in this direction? If you, for example, looked at checkpoints of, like, the reasoning training, or, like, the RL stage, do you think you'd see, like, it start to generate more of these, like, highly dense CoTs?

215
00:36:01.770 --> 00:36:10.859
Sarti, Gabriele: That's a good question. I'm not sure I have an assumption here. Like, I don't know what to expect on… about looking at this specific model, yeah.

216
00:36:12.200 --> 00:36:13.170
Sarti, Gabriele: Yep.

217
00:36:14.040 --> 00:36:19.890
Sarti, Gabriele: how I'm interpreting this effect is that the model is doing… some kind of compaction

218
00:36:20.210 --> 00:36:30.129
Sarti, Gabriele: of previous intermediate tokens, where it needs to, maybe, compact information from the tokens, from the context, and also, like, some information that is going to be relevant later.

219
00:36:30.960 --> 00:36:35.830
Sarti, Gabriele: But isn't it exactly this thing that the model needs to do that

220
00:36:36.490 --> 00:36:39.640
Sarti, Gabriele: makes the probe's prediction much harder?

221
00:36:40.300 --> 00:36:51.409
Sarti, Gabriele: Because, I mean, let's say the model has some context, like a few smaller ingested RAG documents, and it decided to compact this stuff into some token

222
00:36:51.550 --> 00:37:01.550
Sarti, Gabriele: that if you just look at the token itself, it is just… has nothing to do with… Yeah, yeah, yeah. And then, even when you do this gradient attribution and stuff like that.

223
00:37:01.700 --> 00:37:07.259
Sarti, Gabriele: then that token, which is apparently not related to the concept you're looking for,

224
00:37:07.440 --> 00:37:22.650
Sarti, Gabriele: it will also get boosted, because it is also storing some information. Right. I think conceptually, yes, that would be the case. My observation, though, is that this kind of compaction doesn't happen usually on, like.

225
00:37:22.650 --> 00:37:38.320
Sarti, Gabriele: end-of-sentence tokens, right? So there was all this previous work, for example, this Thought Anchors work by Neel, by some people at MATS, right, I think, that was using end-of-sentence tokens within the CoT to split the sentences, right?

226
00:37:38.320 --> 00:37:47.660
Sarti, Gabriele: And we tried to look at that, so when we were thinking about compaction, we were thinking, can we just bound the sequence based on the sentence boundary, right?

227
00:37:47.700 --> 00:38:06.390
Sarti, Gabriele: But actually, it seems like there is not… this kind of thing where the information is gathered at the end, right? It can be used for reading, but something that we're also seeing in the goal-directedness part is that it cannot really be used causally, right? Meaning, like, information is moved there,

228
00:38:06.810 --> 00:38:16.330
Sarti, Gabriele: But it's a bit diffused, kind of, right? So the attribution is more likely to pick out the original source of the information, rather than this final part.

229
00:38:17.090 --> 00:38:23.079
Sarti, Gabriele: At least that's what we observe right now, that we're digging a bit into the results, right?

230
00:38:26.050 --> 00:38:28.760
Sarti, Gabriele: It doesn't need to be, but I think it makes sense

231
00:38:29.000 --> 00:38:47.650
Sarti, Gabriele: that, like, the gathering happens when you need to predict something, right? Kind of like, if you think in this case, if you have to do an intermediate prediction, then there is the point where the attention is looking backwards and is trying to gather this information, not like at the end of the sentence, right?

232
00:38:47.790 --> 00:38:54.620
Sarti, Gabriele: So then this step would anyway be part of these kinds of salient elements, right, that get picked out.

233
00:38:55.030 --> 00:38:58.760
Sarti, Gabriele: So then, I feel like it mitigates a bit the problem that you mentioned.

234
00:39:06.120 --> 00:39:06.810
Sarti, Gabriele: Yep.

235
00:39:16.690 --> 00:39:19.270
Sarti, Gabriele: Yeah, something like that. It's kind of like,

236
00:39:19.980 --> 00:39:26.080
Sarti, Gabriele: beam search with a probe that controls for how much this is relevant to the final outcome.

237
00:39:26.740 --> 00:39:28.080
Sarti, Gabriele: Yeah. Bye.

238
00:39:36.290 --> 00:39:37.500
Sarti, Gabriele: be irrelevant.

239
00:39:38.920 --> 00:39:39.640
Sarti, Gabriele: Excellent.

240
00:39:44.660 --> 00:39:52.130
Sarti, Gabriele: So, we did look at entropy, actually, of the distribution across tokens as a signal initially.

241
00:39:52.130 --> 00:40:06.840
Sarti, Gabriele: to try to understand, because there was some work that was using entropy as a signal of, like, oh, the model is more uncertain, so maybe that's where it takes the decision, right? And in mathematical problems, we found that low entropy on digits was

242
00:40:06.850 --> 00:40:21.060
Sarti, Gabriele: kind of related to the final outcome, so digits were salient, but apart from that, we didn't find that entropy was very predictive of that. That's why we moved to attribution as a way to get a more output-oriented signal of what matters, right?

243
00:40:21.060 --> 00:40:28.400
Sarti, Gabriele: So yeah, the probe does this kind of attribution, but online, not from knowing the answer, right?

244
00:40:28.400 --> 00:40:34.759
Sarti, Gabriele: So, I was also kind of surprised that it works, but it seems to be working quite well. So, we were getting something like

245
00:40:34.770 --> 00:40:39.369
Sarti, Gabriele: 85-90% accuracy to get out these tokens, yeah.

246
00:40:45.950 --> 00:41:03.109
Sarti, Gabriele: So, you do the attribution from the final output. You get scores for all the sequence, but then the probe is trained given the current token in the sequence, so you can regenerate the same sequence, and then you want to predict the score without knowing the final output, right?

247
00:41:06.660 --> 00:41:16.720
Sarti, Gabriele: The scores on which the probe is trained are based on the final output, but the probe is trained only on the local output that gets generated one step at a time, right?

248
00:41:17.470 --> 00:41:21.779
Sarti, Gabriele: That is annotated with those scores that you pre-computed, kind of, right?

249
00:41:22.260 --> 00:41:34.729
Sarti, Gabriele: Do you give… what do you give as input to the probe? So the probe just gets the binary, yeah, it gets the sequence up to this token, and it gets the binary label that is the, you know, this,

250
00:41:34.860 --> 00:41:44.289
Sarti, Gabriele: kind of like, with the threshold, we set whether the token is meaningful or not towards the prediction, yeah. Did you find any patterns, like.

251
00:41:44.430 --> 00:42:04.169
Sarti, Gabriele: in math problems, could it just be, like, all the numbers are relevant? Yeah, so in math problems, our problem was that initially we worked with entropy, and there were these kinds of patterns of low entropy, always salient, but then the moment we moved to, like, MMLU or something like that, this completely broke our analysis, right? So that's why we moved to attribution.

252
00:42:04.170 --> 00:42:14.149
Sarti, Gabriele: And now we started doing some experiments there, it seems more robust, right? So what we want to test is also generalization, kind of like you train the probe on math, you test on MMLU, right?

253
00:42:14.470 --> 00:42:15.650
Sarti, Gabriele: Oh, yep.

254
00:42:15.730 --> 00:42:25.389
Sarti, Gabriele: We have a submission we're preparing that looks very similar, so maybe we can talk. Nice, well, yeah, sure. One thing I can mention, we…

255
00:42:25.420 --> 00:42:38.969
Sarti, Gabriele: We also had the attribution scores, but we added two properties that I think are important and may help with some of your experiments. One is, you want to know if these steps or parts of the train of thought are actually plausible

256
00:42:39.020 --> 00:42:39.980
Sarti, Gabriele: for the model.

257
00:42:40.070 --> 00:42:49.589
Sarti, Gabriele: So, are they high likelihood, given the prefix? You mentioned that you sometimes are left with things that don't make sense, right? You mean when you patch?

258
00:42:49.690 --> 00:43:01.780
Sarti, Gabriele: Here. Even before, right? Yeah, but actually here, we just do the decoding from the models. Not in the probe. We're talking about just identifying… what was it, yeah. I think when you started, you said…

259
00:43:01.780 --> 00:43:19.809
Sarti, Gabriele: oh, these things are… they don't make sense, but if you… if you patch the hidden state… Yeah, yeah, it was for the patching. Yeah, but this is post-hoc, right? So… No, no, I understand. But you can inform… you can require that the steps you identify… Right. We called it 'attainable'.

260
00:43:19.810 --> 00:43:37.139
Sarti, Gabriele: So that they are something the model will assign high probability. Interesting. And the other thing, which is also related to whether the model is just writing the answer or a shortcut to the answer, you want to get rid of those, and you can get rid of those. So you want things that you cannot…

261
00:43:37.900 --> 00:43:48.300
Sarti, Gabriele: predict the answer just from the given step, right? You don't want… Correct. You don't want them to be enough for predicting it, so you can get rid of that. Yeah, yeah, yeah. You want… basically, you want…

262
00:43:48.510 --> 00:43:59.230
Sarti, Gabriele: steps that are sufficient for predicting the answer when the full question and prefix… That's right, but so you're segmenting still at the step level, right? We're segmenting at the step.

263
00:43:59.350 --> 00:44:03.030
Sarti, Gabriele: Makes sense. Yeah, I'd be happy to talk about that.

264
00:44:03.970 --> 00:44:09.489
Sarti, Gabriele: Now, quickly on to the goal-directedness part that a lot of people were interested in.

265
00:44:11.050 --> 00:44:28.849
Sarti, Gabriele: So the setup here, the whole motivation about this is that most of the evaluations about goal-directedness in agents are based on a behavioral study, right? So we study whether the agents can do the task, and you usually use as a metric of

266
00:44:29.370 --> 00:44:39.450
Sarti, Gabriele: goal-directedness, whether the agent is accurate or efficient with respect to the optimal policy, right? If it takes more or less the same amount of steps, or more or less the same steps.

267
00:44:39.450 --> 00:44:50.979
Sarti, Gabriele: So the problem with this is that the model's capability of getting to the answer is a bit of a confounder, right? Because if the model is just not able to get to the answer, it might be the most goal-directed, but…

268
00:44:51.030 --> 00:45:13.469
Sarti, Gabriele: It just breaks, right? So our idea for this project was, can't we ground this study into the representation that the model builds about the environment to try to get a better sense from those about the capabilities of the model? Meaning, is the model just failing at representing the environment, and this leads to a failure in taking action?

269
00:45:13.470 --> 00:45:15.870
Sarti, Gabriele: Or is the environment represented well?

270
00:45:15.870 --> 00:45:16.570
Sarti, Gabriele: Right?

271
00:45:17.240 --> 00:45:28.260
Sarti, Gabriele: So, our setup is with a 2D grid world that looks like this, with different densities of obstacles, from 0 to 1, where 1 is what we call, basically, corridors, right?

272
00:45:28.440 --> 00:45:44.800
Sarti, Gabriele: And this grid world is passed as text to the LLM, and it's represented as you see there. So we put a lot of care into making sure that every cell is a single token. This helps a lot with the analysis on the mechanistic side of things, right?
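A minimal sketch of this grid-to-text encoding; the cell symbols are illustrative, and the point is that space-separated single characters keep each cell a single token for typical BPE tokenizers:

```python
# '.' empty cell, '#' wall, 'A' agent, 'G' goal (illustrative legend).
def render(grid: list[list[str]]) -> str:
    return "\n".join(" ".join(row) for row in grid)

grid = [
    [".", ".", "#", "."],
    ["A", ".", "#", "."],
    [".", ".", ".", "G"],
]
print(render(grid))
```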

273
00:45:46.680 --> 00:45:56.960
Sarti, Gabriele: Alright, so this is the kind of prompt that we use. So, the model is given: you are controlling an agent, this is your legend, like, G, A…

274
00:45:56.960 --> 00:46:13.610
Sarti, Gabriele: the wall. Your objective is to get from your position to the goal while avoiding walls. And at every turn, the model can only say one of four actions. So it's step-by-step. One conversation turn is one step. And this is the expected output format, right?

275
00:46:13.730 --> 00:46:24.609
Sarti, Gabriele: So then we feed in, at every observation, the grid state here. The model has to take a step, then we parse the step, and we use the system behind it to get the next observation.

276
00:46:25.110 --> 00:46:34.110
Sarti, Gabriele: So the setup for the behavioral evaluation is: we have 6 densities, from empty to full corridors.

277
00:46:34.210 --> 00:46:45.740
Sarti, Gabriele: We have 5 sizes, so 7x7 up to 15x15, 10 grids per size, and 10 trajectories per grid, right? So overall, 3,000 trajectories.

278
00:46:47.040 --> 00:46:50.650
Sarti, Gabriele: All right. So this is a first view.

279
00:46:50.990 --> 00:46:56.560
Sarti, Gabriele: of, like, how different complexity

280
00:46:56.850 --> 00:47:02.850
Sarti, Gabriele: causes the model to fail more. So what you see here is the density on the x-axis.

281
00:47:02.960 --> 00:47:11.710
Sarti, Gabriele: you see the grid sizes here, and you see the distance to the goal. So this distance to the goal is measured step by step.

282
00:47:11.720 --> 00:47:27.489
Sarti, Gabriele: using the optimal policy, right? So, just to clarify, the grid world is a 2D grid world, fully observable, so you can, at every step, just run the A* policy to measure what's the minimal path, basically.
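For reference, on an unweighted, fully observable grid this optimal-policy distance reduces to a breadth-first search (equivalently, A* with a zero heuristic); a minimal sketch:

```python
from collections import deque

def shortest_path_len(grid, start, goal):
    """Minimal number of up/down/left/right steps from start to goal,
    or None if the goal is unreachable. '#' cells are walls."""
    rows, cols = len(grid), len(grid[0])
    queue, seen = deque([(start, 0)]), {start}
    while queue:
        (r, c), d = queue.popleft()
        if (r, c) == goal:
            return d
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] != "#" and (nr, nc) not in seen):
                seen.add((nr, nc))
                queue.append(((nr, nc), d + 1))
    return None
```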

283
00:47:27.660 --> 00:47:43.640
Sarti, Gabriele: So you can see here that the more the sizes increase, the more the failures start to happen, basically. But even for setups where, at the smaller size, so at higher densities for the smaller size, the model would do it perfectly,

284
00:47:43.700 --> 00:47:52.909
Sarti, Gabriele: at those higher densities for bigger sizes, the model starts to fail, right? So, in general, we find behaviorally that

285
00:47:53.370 --> 00:47:58.319
Sarti, Gabriele: bigger grids cause failures even for setups that didn't fail before, right?

286
00:47:59.210 --> 00:48:00.360
Sarti, Gabriele: Yep.

287
00:48:01.570 --> 00:48:02.440
Sarti, Gabriele: (Inaudible audience question.)

288
00:48:02.610 --> 00:48:11.160
Sarti, Gabriele: Yeah, so you can see that here, there's no… nothing above 10 actions, because the grid is just a 7x7, right?

289
00:48:14.030 --> 00:48:17.219
Sarti, Gabriele: Why is the… (inaudible)?

290
00:48:18.180 --> 00:48:28.040
Sarti, Gabriele: Because these grids are empty, they don't have obstacles, so the shortest path is quite straight, right? While in this case, you might have to go through the corridors, right?

291
00:48:28.610 --> 00:48:29.310
Sarti, Gabriele: Yep.

292
00:48:31.320 --> 00:48:39.159
Sarti, Gabriele: So another thing that we were testing is the robustness to instrumental goals. So, we had 3 setups here.

293
00:48:39.280 --> 00:48:44.029
Sarti, Gabriele: A setup in which the model needs the key to get to the goal via a door.

294
00:48:44.130 --> 00:48:59.409
Sarti, Gabriele: A setup in which there is a key, but there's no door, so the model could just disregard the key; it doesn't need the key. And a setup in which the distance to the goal is the same, but there is a key on one path and no key on the other, right?

295
00:49:00.080 --> 00:49:03.700
Sarti, Gabriele: So, what we saw here is that in the first setup.

296
00:49:03.850 --> 00:49:17.319
Sarti, Gabriele: the model has a 100% success rate in this smaller 9x9 grid, so it means that it can always get to the key, pick up the key, get to the door, open it, get to the goal, right? Which is great.

297
00:49:19.300 --> 00:49:32.479
Sarti, Gabriele: And the success rate is also very high for the key-no-door case, but you can see that in 17% of the cases, in the key-no-door case, the model still picks up the key, right?

298
00:49:32.630 --> 00:49:43.669
Sarti, Gabriele: And the key has this… this key attractor bias: basically 75% of actions are the model moving slightly towards the key, right? When it shouldn't

299
00:49:43.880 --> 00:49:45.489
Sarti, Gabriele: care about it at all.
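
One plausible way to operationalize that attractor bias (the exact metric wasn't specified, so this is an assumption): the fraction of moves that reduce the Manhattan distance to the key.

```python
def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def key_attraction_rate(positions, key_pos):
    # positions: the agent's position after each step of a trajectory.
    # Returns the fraction of moves that bring the agent closer to the key;
    # hypothetical helper, one way to quantify the bias reported above.
    moves = list(zip(positions, positions[1:]))
    if not moves:
        return 0.0
    closer = sum(manhattan(b, key_pos) < manhattan(a, key_pos) for a, b in moves)
    return closer / len(moves)
```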

300
00:49:45.890 --> 00:50:03.939
Sarti, Gabriele: So, our idea here is about this attractor, right? The key semantically makes sense in games, right? If there is a key, you probably need it for doing something. So I think it's kind of interesting to see that this leaks into this setup, where the model doesn't have to pick it up, but just does.

301
00:50:04.220 --> 00:50:04.950
Sarti, Gabriele: Yep.

302
00:50:08.410 --> 00:50:09.930
Sarti, Gabriele: How do you distinguish this from…

303
00:50:11.210 --> 00:50:14.099
Sarti, Gabriele: An agent that just, like, does a random walk.

304
00:50:14.460 --> 00:50:15.290
Sarti, Gabriele: So…

305
00:50:15.690 --> 00:50:23.769
Sarti, Gabriele: An agent that does a random walk will also get a 100% success rate eventually. That's right. And it will also have a 100% key pickup rate.

306
00:50:23.910 --> 00:50:25.420
Sarti, Gabriele: Some random…

307
00:50:25.600 --> 00:50:45.530
Sarti, Gabriele: Yeah, I didn't mention this, but the maximal number of steps in the trajectory that we allow is dependent on the grid size, and it's something like, I think, twice or three times the grid size, so it's something like 27 at most for the 9x9.

308
00:50:45.640 --> 00:50:50.649
Sarti, Gabriele: It's such that a random agent would have very little success. That's right, yeah.

309
00:50:52.310 --> 00:50:54.390
Sarti, Gabriele: And how do you… Yeah.

310
00:50:54.970 --> 00:51:09.569
Sarti, Gabriele: Yeah, so in the prompt, there's another version of this prompt, in which we just mention that this is a key, but we don't say anything about it.

311
00:51:09.750 --> 00:51:22.180
Sarti, Gabriele: In the door case, we say, you need the key to open the door, but in the other case, we just remove that part. We just say, here is a key. Remind me which models you're testing for this? GPT-OSS 20B, yeah. There's no training.

312
00:51:22.360 --> 00:51:39.000
Sarti, Gabriele: There's no training, just out of the box, yeah. And are you doing actions based on, like, a gym environment, or is it just… Yeah, so we're using this gym, this… this gym that is normally used for 2D grid worlds, right? I don't remember what it's called, yeah.

313
00:51:41.470 --> 00:51:45.329
Sarti, Gabriele: It's a fresh prompt. It's stateless, yeah.

314
00:51:45.700 --> 00:51:49.460
Sarti, Gabriele: Give it a fresh prompt. Yeah. Stateless, yeah.

315
00:51:49.950 --> 00:51:55.169
Sarti, Gabriele: Just curious, if you were to say not to get the key, would it not get the key?

316
00:51:55.280 --> 00:52:01.000
Sarti, Gabriele: We didn't try, but we just wanted to test the ambiguous case, you know. I imagine so, yeah.

317
00:52:03.520 --> 00:52:09.250
Sarti, Gabriele: And also, we saw this bias towards going to the key path, even in the case where it didn't need to, right?

318
00:52:10.430 --> 00:52:16.310
Sarti, Gabriele: Does this bias occur whether or not there was anything like this in the…

319
00:52:17.790 --> 00:52:25.450
Sarti, Gabriele: training data? No, there is nothing. No. This is just GPT-OSS. Okay, so you don't know. No, but I imagine so.

320
00:52:25.560 --> 00:52:40.209
Sarti, Gabriele: Is there a standard emoji for the key that you're using? Yeah, it's like K or something. What if you put a little poop emoji there? That's a good question. I think it would be interesting to try, yeah. I guess what I'm getting at is, is it just that…

321
00:52:40.340 --> 00:52:58.570
Sarti, Gabriele: it will grab anything that's there, or is it things it knows from prior games that… Yeah, I mean, our assumption is that the key… like, the model saw games, definitely, right? And if it knows that this is a key, it knows that keys are picked up in games, and so it does it, right? Or maybe it's a visual processing thing.

322
00:52:58.580 --> 00:53:01.229
Sarti, Gabriele: But it's just a K, right, in the map, so…

323
00:53:01.630 --> 00:53:18.359
Sarti, Gabriele: So it shouldn't… What about lava squares? Yeah, exactly, or, like, the poop emoji, I think it's a great idea, yeah. Yeah, so in these cases, in most of the cases, the key is not on the path, and still we see this

324
00:53:18.910 --> 00:53:20.729
Sarti, Gabriele: Key attraction bias, right?

325
00:53:20.980 --> 00:53:36.760
Sarti, Gabriele: In those cases, you don't tell it about the key at all, or doors whatsoever? It knows that there's this symbol is a key. But you don't tell about doors. But in the other one, you tell about doors. Yeah. I wonder if you did tell it about doors, it wouldn't have picked up the key, because you're like, oh, I know what this is for, and there is no… That's a good question, I don't know, yeah.

326
00:53:37.390 --> 00:53:44.810
Sarti, Gabriele: I'd be more convinced of the numbers if you…

327
00:53:46.010 --> 00:53:53.100
Sarti, Gabriele: If you have pairs of grids where the shortest path is… the same,

328
00:53:53.460 --> 00:53:57.589
Sarti, Gabriele: And then… Yeah, and then you put a key,

329
00:53:57.700 --> 00:54:02.070
Sarti, Gabriele: And then you have the key, and then you compare, like, the length of the…

330
00:54:02.980 --> 00:54:16.540
Sarti, Gabriele: Yeah. Yeah, yeah, so it is the case that it increases. It's just, we don't have it in the table here. But it is, like… the model does go out of its way to get to the key, yeah.

331
00:54:17.510 --> 00:54:31.539
Sarti, Gabriele: Yeah, I think, personally, the most exciting thing about this, so here is purely behavioral, right? But I would be very excited to do probing on the key states, and to see, does the model see it as a goal in this case versus in this case, right?

332
00:54:32.250 --> 00:54:34.159
Sarti, Gabriele: Or you could do attribution.

333
00:54:34.300 --> 00:54:37.480
Sarti, Gabriele: Oh, yeah.

334
00:54:38.370 --> 00:54:40.560
Sarti, Gabriele: That could be, yeah. Yes?

335
00:54:45.840 --> 00:54:46.670
Sarti, Gabriele: Yeah.

336
00:54:47.220 --> 00:54:52.600
Sarti, Gabriele: Yeah, we didn't try with more attractors, but, yeah. Is there a specification where it's, like.

337
00:54:53.440 --> 00:54:57.390
Sarti, Gabriele: Like, is it just, like, here's this… is it, like, do you have, like.

338
00:54:57.930 --> 00:55:01.130
Sarti, Gabriele: Anything encouraging it to find the optimal path? Because…

339
00:55:01.290 --> 00:55:10.880
Sarti, Gabriele: Yeah, so in here, we say you should aim to reach the goal using the fewest number of steps, so there is an explicit instruction.

340
00:55:11.530 --> 00:55:20.030
Sarti, Gabriele: So, quickly, about cognitive map probes. The idea here is that we have our original grid, and…

341
00:55:20.470 --> 00:55:29.199
Sarti, Gabriele: We pass this to the model, again, which needs to pick one of four actions. And what we do is we take intermediate activations of the model.

342
00:55:29.310 --> 00:55:38.050
Sarti, Gabriele: So we take the activation at those, and we use it to train a probe to predict every cell in the grid, right? So this is just an MLP probe.

343
00:55:38.240 --> 00:55:45.390
Sarti, Gabriele: that is trained, given the activation, plus an XY coordinate, say (0, 0):

344
00:55:45.680 --> 00:55:49.929
Sarti, Gabriele: given this activation and (0, 0), predict that this cell is a wall, right?

345
00:55:50.650 --> 00:56:03.130
Sarti, Gabriele: So, we can train at scale, over all the grid sizes even, using a padding token for the missing positions, right? So for smaller grids, we'll have this pad

346
00:56:03.190 --> 00:56:16.659
Sarti, Gabriele: to use with the same probe. So the interesting part is that the probe is definitely able to pick up the grid size with almost 100% accuracy, so it knows when to predict pad given the activation.
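
A sketch of what such an MLP probe could look like in PyTorch. The hidden sizes, embedding size, and class set are illustrative assumptions; the talk doesn't give the exact architecture.

```python
import torch
import torch.nn as nn

N_CLASSES = 6  # e.g. empty, wall, agent, goal, key, pad -- assumed class set

class CellProbe(nn.Module):
    # Given the concatenated template-token activations plus an (x, y) query,
    # predict the identity of that cell. One probe serves all grid sizes:
    # out-of-range positions are labeled with the pad class, which is also how
    # the probe can recover the grid size from the activation.
    def __init__(self, d_act, d_hidden=512, max_size=15):
        super().__init__()
        self.max_size = max_size
        self.xy_embed = nn.Embedding(max_size * max_size, 32)
        self.mlp = nn.Sequential(
            nn.Linear(d_act + 32, d_hidden),
            nn.ReLU(),
            nn.Linear(d_hidden, N_CLASSES),
        )

    def forward(self, acts, xs, ys):
        # acts: (batch, d_act) activations; xs, ys: (batch,) integer coordinates.
        pos = self.xy_embed(xs * self.max_size + ys)
        return self.mlp(torch.cat([acts, pos], dim=-1))
```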

347
00:56:18.640 --> 00:56:27.950
Sarti, Gabriele: And the other interesting thing that we do is that here, we take the final tokens at the end of the prompt, so it's like the template tokens at the end of the prompt.

348
00:56:27.950 --> 00:56:39.090
Sarti, Gabriele: But then we repeat the same thing by training a probe on the final token at the end of reasoning, right? So our idea is, can we compare how the probe performs before and after reasoning?

349
00:56:39.910 --> 00:56:56.799
Sarti, Gabriele: So here, it's the same, right? So, for example, in this example, the initial probe represents the environment well, where the model crashed into the wall, but the final probe suggests the model thinks that it could just go straight, right? As you can see from the reconstructions, the probes are not perfect, right?

350
00:56:56.800 --> 00:57:07.039
Sarti, Gabriele: So the way that we build this cognitive map is simply the argmax of the classes, of the cells, so you could have multiple goals in the reconstructed map, right?
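
So the decoded map is nothing more than a per-cell argmax, which is why duplicates can appear; a one-liner for concreteness:

```python
import numpy as np

def decode_cognitive_map(probe_probs):
    # probe_probs: (H, W, N_CLASSES) per-cell class probabilities from the probe.
    # Taking the argmax independently per cell means nothing prevents the
    # decoded map from containing several "goal" or "agent" cells at once.
    return probe_probs.argmax(axis=-1)
```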

351
00:57:07.850 --> 00:57:08.980
Sarti, Gabriele: Hypothetically.

352
00:57:09.800 --> 00:57:11.560
Sarti, Gabriele: And multiple agents, yeah.

353
00:57:14.370 --> 00:57:19.550
Sarti, Gabriele: of the… I see that the green and red are kind of the goal, but how, how did…

354
00:57:19.830 --> 00:57:21.610
Sarti, Gabriele: It became a goal.

355
00:57:22.120 --> 00:57:23.710
Sarti, Gabriele: Like, what did you do?

356
00:57:24.030 --> 00:57:26.410
Sarti, Gabriele: Here.

357
00:57:27.450 --> 00:57:30.179
Sarti, Gabriele: So, what we're doing is just using the probe.

358
00:57:30.630 --> 00:57:39.370
Sarti, Gabriele: Every cell in this map is a prediction of the probe that is given the activation and needs to reconstruct what's the original

359
00:57:39.480 --> 00:57:41.310
Sarti, Gabriele: Identity for that cell.

360
00:57:41.740 --> 00:57:57.559
Sarti, Gabriele: So that's why I'm saying here, the probe is uncertain where the goal is. It kind of gets it, but it's a bit uncertain between these two options. So your ground truth map is the one up here? It's the real one that the model is prompted with. And you have a… you have a special, like…

361
00:57:57.780 --> 00:58:04.800
Sarti, Gabriele: Unicode character for that green square. Yeah, it's G. It's just G. That's whatever it is. Yeah. And,

362
00:58:05.030 --> 00:58:12.000
Sarti, Gabriele: And… but then your probe, basically, recovers… a smeared area. Yeah. Is that right?

363
00:58:12.310 --> 00:58:14.220
Sarti, Gabriele: Yeah, so thank you, Pooja.

364
00:58:14.770 --> 00:58:16.400
Sarti, Gabriele: How did you train the probe?

365
00:58:16.640 --> 00:58:21.500
Sarti, Gabriele: So, we take this activation at a specific layer. The final ones

366
00:58:21.880 --> 00:58:34.619
Sarti, Gabriele: So it's the last three tokens of, like, the template tokens of… at the end of the prompt, and at the end of the reasoning. So these are two separate probes, though. It's, like, pre-reasoning, post-reasoning, right?

367
00:58:34.980 --> 00:58:51.040
Sarti, Gabriele: And we train, let's say, mostly for layer 15. We also try these two other layers, but let's say layer 15, we take all the activations for the final tokens of the prompt, we train the probe to predict every cell identity, given that activation.

368
00:58:52.830 --> 00:59:01.349
Sarti, Gabriele: Does it surprise you that it seems like it's hard for this model to…

369
00:59:01.900 --> 00:59:05.730
Sarti, Gabriele: recognize where the goal is, because… you gave it this map?

370
00:59:05.870 --> 00:59:08.560
Sarti, Gabriele: Right? In the prompt.

371
00:59:08.980 --> 00:59:13.430
Sarti, Gabriele: Yeah. So… Seems a very simple task.

372
00:59:13.760 --> 00:59:18.880
Sarti, Gabriele: It's just… Yeah, but now I'm not probing the map identity, like…

373
00:59:19.380 --> 00:59:25.670
Sarti, Gabriele: So I'm probing a final… Exactly. Kinda, yeah.

374
00:59:26.760 --> 00:59:30.189
Sarti, Gabriele: Because they have a different task.

375
00:59:30.720 --> 00:59:32.190
Sarti, Gabriele: Yeah, that's right.

376
00:59:33.480 --> 00:59:38.040
Sarti, Gabriele: And… So the last 3 tokens, as in the… the arrow tokens?

377
00:59:38.420 --> 00:59:49.490
Sarti, Gabriele: No, no, so what I mean is the template tokens, so GPT-OSS has these tokens that are, like, end, start, assistant, right? Something like this. We just concatenate the activations of those.

378
00:59:50.230 --> 00:59:53.819
Sarti, Gabriele: Yeah. Does the probe always predict the first step correctly?

379
00:59:54.240 --> 01:00:01.100
Sarti, Gabriele: This one? Like, so, like, if the… there's, like, some rough idea of the map, is it, like…

380
01:00:01.640 --> 01:00:21.289
Sarti, Gabriele: the first thing… the first action that you need to take. The immediate neighborhood. Like, is that… Yeah, in general, the neighborhood is quite… it's quite good. The plot that I had here is actually how correct the agent and goal positions are across different sizes. You can see that this decreases, but, like, for

381
01:00:21.290 --> 01:00:37.599
Sarti, Gabriele: for 7, for example, it's quite high. And what the plot on the right shows is, like, the Manhattan distance of all the argmaxes. So, for example, let's say that I have this grid, right? Where I have 3 cells whose argmax is the goal.

382
01:00:37.690 --> 01:00:57.040
Sarti, Gabriele: Then I measure how far these are from the original goal, and you can see that even for larger grids where the accuracy is lower, the Manhattan distance is, like, 1.5. So it's a lower accuracy, but it's always, like, within a neighborhood of the real position, right? So it's just more diffused.
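 
A sketch of that diffusion metric, under the assumption that it is the mean Manhattan distance from every cell decoded as a class to the true position of that class:

```python
import numpy as np

def mean_distance_to_true_cell(decoded_map, true_pos, cls):
    # Mean Manhattan distance from every cell decoded as class `cls`
    # (e.g. goal) to that class's true position -- hypothetical helper
    # mirroring the metric described above.
    cells = np.argwhere(decoded_map == cls)
    if len(cells) == 0:
        return None
    return float(np.abs(cells - np.asarray(true_pos)).sum(axis=1).mean())
```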

383
01:01:00.070 --> 01:01:10.479
Sarti, Gabriele: And this is the performance. So, what you see here, like, the accuracy, the precision, sorry, is very low, right, of the probe, for the agent and the goal.

384
01:01:10.720 --> 01:01:29.640
Sarti, Gabriele: But you can see that the recall is very high, and this is because the model, the probe tends to predict, like, a cloud of points around the real position, so the recall is generally high, but the actual precision, so to pinpoint exactly that spot, is quite low. And you can see that compared to

385
01:01:29.750 --> 01:01:47.499
Sarti, Gabriele: pre- and post-reasoning, there is a drop in performance, especially for agent and goal positions, right? So our assumption based on this is that the model goes from modeling the environment before reasoning to modeling the next action to take, right?

386
01:01:47.840 --> 01:01:54.140
Sarti, Gabriele: So… The final thing for this is about

387
01:01:54.380 --> 01:02:01.130
Sarti, Gabriele: well, now we have cognitive maps. Does the behavior of the model agree with the cognitive map, right? That's the…

388
01:02:01.440 --> 01:02:03.759
Sarti, Gabriele: The kind of goal that we had in the beginning.

389
01:02:03.940 --> 01:02:23.879
Sarti, Gabriele: And as of now, the answer is actually not so clear. So what you see here in the… in the table, you have grid sizes and densities. Then you have the accuracy for an action with respect to the original grid, so this is, like, accuracy of actions given the… the prompt.

390
01:02:24.600 --> 01:02:30.949
Sarti, Gabriele: And this is the accuracy with respect to the cognitive map, right? So the decoded grid.

391
01:02:31.040 --> 01:02:44.920
Sarti, Gabriele: you can see that this is generally lower than this, right? Although the differences even out for bigger grids. So in general, it seems like the actions are more consistent with the original grid than with the decoded one.

392
01:02:45.060 --> 01:02:48.120
Sarti, Gabriele: But we also see that, among all the errors

393
01:02:48.260 --> 01:02:52.550
Sarti, Gabriele: that are made within the grids,

394
01:02:52.990 --> 01:03:08.390
Sarti, Gabriele: there is a good proportion, like 60% of the ones that are errors, that actually agree with the decoded map. So we find what we call here high recovery, which is this last metric.

395
01:03:09.880 --> 01:03:28.919
Sarti, Gabriele: it means that the wrong actions that the model takes are mostly in agreement with the cognitive map that we decode, right? So this, I mean, it's still only probing, but it's some signal that maybe we can explain a lot of the actions, not as a lack of goal-directedness, but actually as a lack of

396
01:03:29.280 --> 01:03:32.140
Sarti, Gabriele: capability to represent the map well, right?
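
The recovery metric can be phrased as: among actions that are wrong with respect to the true grid, the fraction that agree with the optimal action under the decoded map. A hypothetical sketch, assuming per-step action sequences:

```python
def recovery(taken, optimal_true, optimal_decoded):
    # taken / optimal_true / optimal_decoded: per-step action sequences for
    # one trajectory. Among steps where the taken action is wrong w.r.t. the
    # true grid, how often does it agree with the decoded cognitive map?
    errors = [(a, d) for a, t, d in zip(taken, optimal_true, optimal_decoded) if a != t]
    if not errors:
        return None
    return sum(a == d for a, d in errors) / len(errors)
```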

397
01:03:35.450 --> 01:03:48.350
Sarti, Gabriele: And there is, like, still so much to do with these experiments. So the thing that we really, really would like to do in this line is causal experiments, so right now we started doing some stuff here.

398
01:03:48.610 --> 01:04:05.630
Sarti, Gabriele: In particular, patching the location of agents and goals, of course, to see whether this changes something, but also patching the neighborhoods of these things. Since we saw that the positions are diffused, we want to see if we can patch the neighborhoods. It's a bit, like, related to what,

399
01:04:06.020 --> 01:04:13.829
Sarti, Gabriele: what was presented last week about VLMs, right? So, if you patch the neighborhood, does it actually affect the final prediction?
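
A hedged sketch of that kind of patching experiment using nnsight; the model, GPT-2-style module path, layer, and position are illustrative only, not the actual setup.

```python
# Hedged sketch of activation patching with nnsight; module paths vary
# per architecture, and the layer/position choices here are assumptions.
from nnsight import LanguageModel

model = LanguageModel("openai-community/gpt2", device_map="auto")
LAYER, POS = 6, -1  # assumed layer and token position

with model.trace("grid with the goal in one location"):
    # residual-stream activation at (LAYER, POS) on the clean run
    clean_h = model.transformer.h[LAYER].output[0][:, POS, :].save()

with model.trace("grid with the goal moved elsewhere"):
    # overwrite the same location with the clean activation
    # (on some nnsight versions you may need clean_h.value here)
    model.transformer.h[LAYER].output[0][:, POS, :] = clean_h
    patched_logits = model.lm_head.output[:, -1, :].save()
```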

400
01:04:14.370 --> 01:04:19.580
Sarti, Gabriele: And then to test these cognitive maps in OOD settings, so…

401
01:04:20.120 --> 01:04:37.700
Sarti, Gabriele: I think the last one is especially interesting. So if we have a partially observable setting in which we occlude part of the map, we observe some action in some direction, do we see that, the probe indeed finds that the goal is kinda that way, right, according to the model? Yeah?

402
01:04:38.920 --> 01:04:39.720
Sarti, Gabriele: Oops.

403
01:04:40.040 --> 01:04:42.070
Sarti, Gabriele: I'd be curious to see whether the…

404
01:04:42.990 --> 01:04:54.239
Sarti, Gabriele: cognitive map generalizes to a different task. For example, if it's just asked to repeat the map, which is a simple task that uses,

405
01:04:54.630 --> 01:05:02.360
Sarti, Gabriele: Yeah, you could imagine that this probably will have a much stronger representation of the map, given that it needs to, you know.

406
01:05:02.850 --> 01:05:04.909
Sarti, Gabriele: What I'm curious is whether…

407
01:05:05.330 --> 01:05:14.750
Sarti, Gabriele: the location of the… of the internal map stays the same, whether it would be at the same tokens in the same layers.

408
01:05:16.780 --> 01:05:32.739
Sarti, Gabriele: Yeah, I think it would probably be different. I think it depends on the task, definitely. At least the sharpness with which things are represented. I would expect it depends on the task, right? For this task, I expect the goal and the agent position to be more sharply represented.

409
01:05:32.900 --> 01:05:36.790
Sarti, Gabriele: Because they are functional to getting to the goal, right?

410
01:05:37.150 --> 01:05:37.910
Sarti, Gabriele: (Inaudible.)

411
01:05:39.030 --> 01:05:39.770
Sarti, Gabriele: (Inaudible.)

412
01:05:44.050 --> 01:05:44.929
Sarti, Gabriele: At the end.

413
01:05:45.830 --> 01:05:46.520
Sarti, Gabriele: Yeah.

414
01:05:47.410 --> 01:05:50.729
Sarti, Gabriele: I'm curious, yeah, we should also test that, yeah, definitely.

415
01:05:51.530 --> 01:05:52.580
Sarti, Gabriele: Nice.

416
01:05:53.180 --> 01:05:54.950
Sarti, Gabriele: Yep. You mentioned that.

417
01:05:55.290 --> 01:05:59.489
Sarti, Gabriele: When you were talking about the probes, you said that they immediately get 100% accuracy.

418
01:05:59.560 --> 01:06:16.389
Sarti, Gabriele: What do you mean by that? No, so the probes don't have 100% accuracy; they have 100% accuracy in detecting the size of the grid. So, like, the padding tokens at the edges of the grid are very reliably predicted, right? Which is not a given, necessarily, right? From the activation.

419
01:06:16.690 --> 01:06:18.829
Sarti, Gabriele: So… So the, the…

420
01:06:19.040 --> 01:06:28.609
Sarti, Gabriele: The cognitive maps that you showed earlier, it's one probe that takes XY coordinates and these three concatenated activations as an input.

421
01:06:28.860 --> 01:06:29.620
Sarti, Gabriele: Yeah.

422
01:06:29.850 --> 01:06:43.219
Sarti, Gabriele: This is one probe for everything. Exactly. So I just want to show you very quickly something fun. So we have a demo for cognitive maps, that is quite nice. So here, let's say that I select pre-reasoning probes.

423
01:06:43.530 --> 01:06:45.380
Sarti, Gabriele: For size 11.

424
01:06:45.540 --> 01:06:47.939
Sarti, Gabriele: And for complexity, 0.6.

425
01:06:48.720 --> 01:06:52.989
Sarti, Gabriele: So I load this grid. So this is my input, right?

426
01:06:53.260 --> 01:07:06.920
Sarti, Gabriele: This is the input to GPT-OSS, and I can actually… I have the full trace already run, so I can actually play this, and I can see the agent going around. So, for now it's optimal, then now it's doing something weird.

427
01:07:07.040 --> 01:07:09.379
Sarti, Gabriele: Then now it's getting back on track.

428
01:07:10.360 --> 01:07:24.489
Sarti, Gabriele: Oh, going up, alright. So what actually can we do? Let's say that I reset here. So, this is the first step, right? I can go down here, I have the prompt.

429
01:07:25.020 --> 01:07:26.560
Sarti, Gabriele: So I can see the prompt.

430
01:07:27.240 --> 01:07:31.690
Sarti, Gabriele: And I can see the model output that has all the reasoning, right?

431
01:07:32.240 --> 01:07:38.179
Sarti, Gabriele: So actually here we're storing also the top 5 next tokens, etc.

432
01:07:38.290 --> 01:07:50.000
Sarti, Gabriele: So here it was a pre-reasoning probe, so we have the last few tokens, you see these three tokens, that are the ones where the probe is trained, right? So we can apply the probe here.

433
01:07:50.160 --> 01:08:02.690
Sarti, Gabriele: and see the cognitive map here. So, you can see these are a bit faded. They look like walls, but actually it's 70% empty, 23% walls, so it's more like empty than wall.

434
01:08:02.910 --> 01:08:07.030
Sarti, Gabriele: But the agent position, it's pretty stably encoded here, right?

435
01:08:07.790 --> 01:08:10.179
Sarti, Gabriele: And then we can actually play this.

436
01:08:12.670 --> 01:08:17.000
Sarti, Gabriele: And this is what the cognitive map looks like when the model is moving around.

437
01:08:18.100 --> 01:08:23.850
Sarti, Gabriele: So you can see that the agent position is kinda robust, although it's not very precise, but…

438
01:08:24.859 --> 01:08:29.189
Sarti, Gabriele: You know, you can see that it gets next to the goal, and then it reaches it, right?

439
01:08:29.890 --> 01:08:33.629
Sarti, Gabriele: Live demos are always stressful.

440
01:08:33.859 --> 01:08:40.689
Sarti, Gabriele: But yeah, so we have this kind of thing that helped us a lot to form a bit of intuition about how this works.

441
01:08:42.970 --> 01:08:58.190
Sarti, Gabriele: You measured this at a specific token, so is it also configurable, to look at different tokens? So, is it the final three tokens of the prompt in this case? It's always there; the pre-reasoning probe is always there.

442
01:08:59.640 --> 01:09:05.750
Sarti, Gabriele: It would be nice if a user could then perform live interventions in this. Yeah, it would be nice.

443
01:09:06.990 --> 01:09:17.950
Sarti, Gabriele: So, I mean, this ties nicely to the last three slides I have about NNsight and nnterp, but I don't know if you guys are tired, or if you can survive for 3 more slides.

444
01:09:19.270 --> 01:09:27.920
Sarti, Gabriele: So, it's just a very quick overview of, like, what we're thinking about on the NDIF side, yeah?

445
01:09:28.520 --> 01:09:29.410
Sarti, Gabriele: How about

446
01:09:29.950 --> 01:09:37.579
Sarti, Gabriele: it's clear that the information is there, and it's very clearly recoverable, but I think the other question was, like, the

447
01:09:37.680 --> 01:09:41.790
Sarti, Gabriele: Like, it's just one probe doing this. What if the probe was, like, more powerful, maybe.

448
01:09:42.880 --> 01:09:45.790
Sarti, Gabriele: Or a definition that yields stronger evidence.

449
01:09:47.040 --> 01:09:53.579
Sarti, Gabriele: Stronger findings. Yeah, the problem there is always, like, how much is the probe actually doing the task instead of the model, right?

450
01:09:54.290 --> 01:09:55.950
Sarti, Gabriele: As the probing literature teaches us.

451
01:09:59.440 --> 01:10:03.829
Sarti, Gabriele: If you could just make this a bit more high-level, you'd get some good conclusions.

452
01:10:06.820 --> 01:10:12.640
Sarti, Gabriele: Did you, like, experiment with different ways of providing the map itself and the prompt to the model?

453
01:10:13.410 --> 01:10:32.870
Sarti, Gabriele: So, we thought a bit about making it kind of like an egocentric view. We would expect that to probably work better for, like, the model to actually get to the goal, because it's kind of like… at least it's not gonna hit walls with its head, right? But we… it's kind of hard to go from

454
01:10:33.340 --> 01:10:42.340
Sarti, Gabriele: from the kind of engines that allow you to do these kinds of observations to an egocentric view, right? So we postponed that part for now, yeah.

455
01:10:45.440 --> 01:10:46.710
Sarti, Gabriele: How was that…?

456
01:10:46.900 --> 01:10:53.690
Sarti, Gabriele: Was that, like, pre-computed? Yeah, so the traces are all saved. All the probing results, everything is saved.

457
01:10:53.820 --> 01:11:00.010
Sarti, Gabriele: But you can play with it, there are several of them, right? So just have fun. It's on Hugging Face, yeah.

458
01:11:01.560 --> 01:11:09.750
Sarti, Gabriele: Alright, so for the very final things for NDIF, so as you… many of you know, right now we have this beautiful server.

459
01:11:09.800 --> 01:11:28.100
Sarti, Gabriele: that hosts a lot of models, and… that are gonna change soon, right? So we announced recently on the… on the Discord that we're gonna drop the Llama 405B, that we're trying to host Kimi, which is gonna be great, because it's also gonna have the VLM part, which is very useful.

460
01:11:28.370 --> 01:11:37.959
Sarti, Gabriele: And other more modern models, Qwen 3.5, and other… other reasoning models and VLMs. So…

461
01:11:38.280 --> 01:11:52.599
Sarti, Gabriele: As of now, we have NNsight, which is the primary way to access internals, which is quite… quite low level, right? So you have a variety of choices; you have to access components internally at a low level.

462
01:11:52.670 --> 01:12:00.409
Sarti, Gabriele: And this is meant mostly for people that actually are building new interpretability methods, so most of you guys, right?

463
01:12:01.140 --> 01:12:19.769
Sarti, Gabriele: And one of the new things that we have here is these new skills that allow you to use NNsight more effectively with coding agents. So if you haven't checked this out, this is something nice. You can just load them in Claude Code or Codex, and it kind of teaches the model to do

464
01:12:19.840 --> 01:12:33.769
Sarti, Gabriele: patching or do logit lens in a bit more natural way, or use vLLM. I think that was one of the things that WooGu was working on, right? Skills for vLLM usage, that could also be relevant.

465
01:12:34.710 --> 01:12:48.029
Sarti, Gabriele: So the newer part of the ecosystem is where I'm mostly focusing for my work. So, as maybe some of you know, there was this library, nnterp, from Clement Dumas,

466
01:12:48.190 --> 01:13:07.819
Sarti, Gabriele: that has been recently integrated into our ecosystem, and the idea here is to provide some sort of interface that is a bit more similar to TransformerLens to access the models, right? Which means you have standardized naming across different models, so you write the code once, and you just run your evaluation on several models.

467
01:13:07.820 --> 01:13:09.970
Sarti, Gabriele: Plus several, kind of, like.

468
01:13:09.970 --> 01:13:28.630
Sarti, Gabriele: syntactic sugar to access model internals. So, I think in the future, it's reasonable to imagine that most researchers, like, I don't know, NLP people or computer vision people that are not interp people, will use something like this to do their research, right? Probably not down to the level of NNsight.

469
01:13:29.000 --> 01:13:34.880
Sarti, Gabriele: So, here we have a low-level API that is this kind of, like.

470
01:13:35.210 --> 01:13:41.790
Sarti, Gabriele: shallow wrapper around NNsight that is just basically syntactic sugar, mostly, to make everything uniform.
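
A minimal sketch of the standardized-access idea; the attribute names follow nnterp's documented style (StandardizedTransformer, layers_output), but check the current docs for the exact API.

```python
# Sketch: one naming scheme across architectures, so the same script
# runs unchanged on every model. Names approximate; verify against nnterp.
from nnterp import StandardizedTransformer

for name in ["gpt2", "Qwen/Qwen2.5-0.5B"]:
    model = StandardizedTransformer(name)       # same wrapper for any architecture
    with model.trace("The capital of France is"):
        hidden = model.layers_output[3].save()  # uniform layer naming across models
    print(name, hidden.shape)                   # identical code for both models
```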

471
01:13:42.070 --> 01:13:43.640
Sarti, Gabriele: But on the other hand.

472
01:13:43.750 --> 01:13:56.629
Sarti, Gabriele: what I'm focusing on right now is we want to have also a higher level API that allows you to do very simple, standardized, and efficient method calls. For example, let's say that you want to do patching.

473
01:13:56.930 --> 01:13:58.430
Sarti, Gabriele: You're just gonna call…

474
01:13:58.780 --> 01:14:10.399
Sarti, Gabriele: patching with that model, at this position, and it returns you a structured object that is easy to serialize, to save, you know, to run your analysis on. So, kind of all baked in.

475
01:14:10.500 --> 01:14:12.099
Sarti, Gabriele: Efficient and all baked in.
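
Purely hypothetical illustration of what such a one-call API could look like; this API does not exist yet, the talk only describes the design goal, and every module, function, and parameter name here is invented.

```python
# Hypothetical high-level call -- invented names, sketching the design goal.
from nnterp import methods  # hypothetical module

clean_prompt = "The Eiffel Tower is in the city of"
corrupted_prompt = "The Colosseum is in the city of"

result = methods.activation_patching(
    model="meta-llama/Llama-3.1-405B",
    clean=clean_prompt,
    corrupted=corrupted_prompt,
    layers=range(0, 126, 6),     # Llama 3.1 405B has 126 layers
    positions=[-3, -2, -1],      # e.g. the template tokens at the end
)
result.to_json("patching_run.json")  # structured, serializable output
```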

476
01:14:12.490 --> 01:14:32.050
Sarti, Gabriele: And this is gonna make it easier to work with Workbench. So, how many of you have tried Workbench? Out of curiosity? One person? Two people? So, Workbench is the UI that we are building for NNsight, which back then had only the Logit Lens, but recently I was doing patching on Llama 405B.

477
01:14:32.050 --> 01:14:33.720
Sarti, Gabriele: Which I think is pretty nice.

478
01:14:33.810 --> 01:14:38.459
Sarti, Gabriele: So it kind of supports more advanced use cases at the moment.

479
01:14:38.820 --> 01:14:56.009
Sarti, Gabriele: But we're trying to make it even better, and we imagine that Workbench is going to be mostly for researchers, right? So, like, people that maybe are not even in machine learning, but want to prototype in their domain, or educators that want to showcase, right? We used it a lot in David's class also.

480
01:14:56.660 --> 01:14:57.470
Sarti, Gabriele: Yes.

481
01:14:58.220 --> 01:15:01.840
Sarti, Gabriele: (Inaudible.) Yes.

482
01:15:02.230 --> 01:15:03.509
Sarti, Gabriele: I have a question.

483
01:15:04.160 --> 01:15:10.049
Sarti, Gabriele: So… Why build higher-level tooling,

484
01:15:10.350 --> 01:15:12.520
Sarti, Gabriele: if… research

485
01:15:12.780 --> 01:15:16.450
Sarti, Gabriele: Code will increasingly be mediated by agents.

486
01:15:16.710 --> 01:15:19.869
Sarti, Gabriele: So on the last slide, you showed these skills. Yeah.

487
01:15:20.970 --> 01:15:24.739
Sarti, Gabriele: say I want a UI for patching. Yeah.

488
01:15:25.290 --> 01:15:27.560
Sarti, Gabriele: Patching, given the skills…

489
01:15:28.070 --> 01:15:33.570
Sarti, Gabriele: Or if I want it to create some kind of experiment, I could just ask it to…

490
01:15:33.810 --> 01:15:39.020
Sarti, Gabriele: run some patching… That's right. It can figure out the lower-level details for me.

491
01:15:39.320 --> 01:15:42.269
Sarti, Gabriele: So… Like, in a year from now.

492
01:15:43.550 --> 01:15:44.270
Sarti, Gabriele: Do you think…

493
01:15:45.110 --> 01:16:02.630
Sarti, Gabriele: these will be useful to researchers. So you're asking specifically about this part, I imagine, right? Well, even to answer that… I would argue that the standardization is just gonna make everything easier to run, right? You have one script instead of five.

494
01:16:02.740 --> 01:16:03.500
Sarti, Gabriele: Right.

495
01:16:04.430 --> 01:16:07.080
Sarti, Gabriele: Oh, you mean across models. Yeah.

496
01:16:07.180 --> 01:16:08.990
Sarti, Gabriele: So that's the low level, kind of.

497
01:16:09.120 --> 01:16:10.250
Sarti, Gabriele: Don't work there, though.

498
01:16:10.430 --> 01:16:18.709
Sarti, Gabriele: But yeah, Workbench, and maybe the high-level… Well, for Workbench, I would argue that the UI is going to be useful for educators, or, like, people that…

499
01:16:19.200 --> 01:16:20.680
Sarti, Gabriele: That's, like, amazing.

500
01:16:20.940 --> 01:16:22.999
Sarti, Gabriele: Whatever you want to try to ask for.

501
01:16:23.110 --> 01:16:30.060
Sarti, Gabriele: Does this, I guess, only pertain to this, or are you saying, like, why build anything, really?

502
01:16:30.240 --> 01:16:31.600
Sarti, Gabriele: It's a big question, I know.

503
01:16:34.110 --> 01:16:38.220
Sarti, Gabriele: Yeah, no, I think this is actually an increasing question in the industry, which is, like.

504
01:16:38.420 --> 01:16:42.090
Sarti, Gabriele: There's been a whole business of, like.

505
01:16:42.560 --> 01:16:54.980
Sarti, Gabriele: building software tools for people. Yeah. And, like, you know, a bunch of businesses have some business problem that they have, and people are like, oh, we're gonna build software to solve that.

506
01:16:55.360 --> 01:16:58.239
Sarti, Gabriele: And increasingly, with coding agents, like, a lot of

507
01:16:58.700 --> 01:17:03.890
Sarti, Gabriele: companies are like, oh, like, we don't have to pay them to develop that for us, we can actually

508
01:17:04.200 --> 01:17:05.900
Sarti, Gabriele: just build it ourselves.

509
01:17:07.070 --> 01:17:09.919
Sarti, Gabriele: Actually, I think they're gonna meet in the middle, right?

510
01:17:10.300 --> 01:17:24.530
Sarti, Gabriele: So actually, it is a general, like, problem in the industry, I think, now. But if you wanted to teach a curriculum, I guess, with Workbench across schools, like, why wouldn't you want one single… Like, I mean, this is written with a lot of agents; like, we're just the ones doing it, right?

511
01:17:24.830 --> 01:17:35.669
Sarti, Gabriele: Again, if you wanted to teach the same thing in California as you do here, you'd want the same UI across the buildings, and so that… So the purpose, like, of Workbench will be, like, education,

512
01:17:36.060 --> 01:17:47.800
Sarti, Gabriele: And prototyping, I would say, also. So, kind of like, if I had to do very easy patching experiments, where I just want to patch at every layer and see what it does,

513
01:17:47.900 --> 01:18:03.569
Sarti, Gabriele: I am quite comfortable at doing it with drag-and-dropping arrows in Workbench at this point, right? I would rather start there and modify my prompt and retry, right, on a big model on the remote, rather than having to run my code experiment to see the result.

514
01:18:04.580 --> 01:18:14.240
Sarti, Gabriele: I think that's… that's pretty valuable. So, my answer to that is this, probably, and the fact that I think the standardization that is enforced between these two

515
01:18:14.410 --> 01:18:28.090
Sarti, Gabriele: is useful for this final thing, that is, we want to have a gateway to make evals compatible with our interpretability stack, right? And so you need some format here,

516
01:18:28.440 --> 01:18:36.830
Sarti, Gabriele: where you can bring in eval results in a way that can be digested by this part, right? So if you're building custom stuff,

517
01:18:37.550 --> 01:18:40.240
Sarti, Gabriele: There's no way to enforce this standard, right?

518
01:18:40.840 --> 01:18:41.830
Sarti, Gabriele: I think.

519
01:18:42.430 --> 01:18:49.400
Sarti, Gabriele: Can you say more? Yeah, so, like, the vision for that is the final slide, I promise.

520
01:18:49.420 --> 01:18:58.870
Sarti, Gabriele: is the fact that, like, the vision for how we are also gonna work with nnterp is that most likely you want to run

521
01:18:58.870 --> 01:19:17.309
Sarti, Gabriele: your evals, or your analysis on a big dataset, with an inference-optimized API, right? Like, you run it on Inspect, you run it on Together AI, whatever, you know, you generate all that you have to generate. And then, from there, you're gonna get some output from this

522
01:19:17.470 --> 01:19:23.759
Sarti, Gabriele: that then can be parsed into some shared format, which is this NDIF format that I was mentioning,

523
01:19:24.010 --> 01:19:26.840
Sarti, Gabriele: Where you can identify areas of interest.

524
01:19:27.210 --> 01:19:39.609
Sarti, Gabriele: for deeper investigation, right? So, like, in our case for goal-directedness, you want to know, you know, what happens at the template tokens at the end, before and after reasoning, for example, right?

525
01:19:39.980 --> 01:19:45.490
Sarti, Gabriele: So then, I would tag that post hoc, after I have my full traces, right?

526
01:19:45.860 --> 01:19:49.010
Sarti, Gabriele: And then from there, what you want to do…

527
01:19:49.480 --> 01:20:09.289
Sarti, Gabriele: okay, you want to apply your mech interp toolbox at those locations, right? And that's gonna be very efficient to run, right? Because, like in the case before, you're not running the probe over all the positions in the trace, or, like, logit lens over all the positions in the trace. You're running it where you need it, right?
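
A hypothetical illustration of the bridging record described here; no concrete schema was shown in the talk, so every field name below is invented.

```python
# Hypothetical shared-format record: eval output plus tagged regions of
# interest that downstream interp tooling would target. Names are invented.
trace_record = {
    "model": "openai/gpt-oss-20b",
    "prompt": "...",
    "output": "...",
    "regions_of_interest": [
        {"name": "pre_reasoning_template", "token_span": (412, 415)},
        {"name": "post_reasoning_template", "token_span": (988, 991)},
    ],
}
# Probes or logit lens would then run only over the tagged spans,
# instead of over every position in the trace.
```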

528
01:20:10.310 --> 01:20:23.540
Sarti, Gabriele: Is this all supposed to be, like, an automated pipeline, or… I think the vision would be that you go towards something that is very easy to just… I did my evals with what the evals people are doing, right?

529
01:20:23.690 --> 01:20:36.150
Sarti, Gabriele: then I just have this bridging, and then I'm back in the NDIF ecosystem, right? I can run my things with nnterp's high-level API, I can serialize, I can load it, and I can just do my analysis there, right?

530
01:20:36.820 --> 01:20:42.749
Sarti, Gabriele: So I think the standardization is what makes it worth having more, like, structure, right?

531
01:20:43.560 --> 01:20:49.389
Sarti, Gabriele: I agree that for prototyping, maybe for very specific stuff at the low level, you just use NNsight

532
01:20:49.560 --> 01:20:52.140
Sarti, Gabriele: And you build your own custom demo there, right?

533
01:20:52.480 --> 01:20:55.809
Sarti, Gabriele: But I think for more complex scenarios, for, like.

534
01:20:56.180 --> 01:21:00.249
Sarti, Gabriele: full evals over full datasets, that's probably more robust, right?

535
01:21:02.740 --> 01:21:05.570
Sarti, Gabriele: But I'm happy to be convinced otherwise.

536
01:21:07.090 --> 01:21:08.209
Sarti, Gabriele: Yeah. I don't know.

537
01:21:08.750 --> 01:21:09.820
Sarti, Gabriele: (Inaudible.)

538
01:21:10.980 --> 01:21:13.850
Sarti, Gabriele: Be hopeful, though, and have a little… (inaudible).

539
01:21:18.640 --> 01:21:20.960
Sarti, Gabriele: We'll need to see it.

540
01:21:21.110 --> 01:21:25.350
Sarti, Gabriele: Well, no, no, no, my initial struggle was just, like, building, like…

541
01:21:25.780 --> 01:21:28.900
Sarti, Gabriele: Building things at a high level of abstraction.

542
01:21:31.150 --> 01:21:32.530
Sarti, Gabriele: with a focus on usability.

543
01:21:33.000 --> 01:21:33.800
Sarti, Gabriele: Yeah.

544
01:21:34.100 --> 01:21:35.889
Sarti, Gabriele: I think that makes sense, yeah.

545
01:21:37.880 --> 01:21:39.930
Sarti, Gabriele: It's standard for what everyone needs.

546
01:21:40.200 --> 01:21:41.889
Sarti, Gabriele: And you can compare research into the

547
01:21:43.490 --> 01:21:51.829
Sarti, Gabriele: So I can also mention, from the educational perspective: we just ran a little study in David's class, which we're gonna submit next week,

548
01:21:51.950 --> 01:22:07.010
Sarti, Gabriele: between Workbench and nnterp using a shared UI, right? So you have a UI that is on the web, but then it's the same plots that are produced locally by your library, right? When you work in a Jupyter notebook, for example.

549
01:22:07.330 --> 01:22:08.290
Sarti, Gabriele: And…

550
01:22:08.360 --> 01:22:27.750
Sarti, Gabriele: we collected a lot of feedback from people, and in general, people found it intuitive to work with the same shared visualization, both in the kind of drag-and-drop, no-code web interface and when designing their interventions more manually, right? So I think it does make sense to have something shared there, right?

551
01:22:30.390 --> 01:22:31.060
Sarti, Gabriele: Yeah.

552
01:22:32.750 --> 01:22:33.900
Sarti, Gabriele: (Inaudible.)

553
01:22:35.760 --> 01:22:36.590
Sarti, Gabriele: Sorry.

554
01:22:41.330 --> 01:22:48.739
Sarti, Gabriele: I found in Workbench… even with the one Workbench interface, you get pretty differing results

555
01:22:48.960 --> 01:22:52.410
Sarti, Gabriele: every time you change it, so having something standardized…

556
01:22:52.570 --> 01:23:00.499
Sarti, Gabriele: You can make it very efficient, also. If you control the backend, you can, like, make it very quick, right?

557
01:23:01.210 --> 01:23:03.140
Sarti, Gabriele: You can make it more efficient; the more you control

558
01:23:04.880 --> 01:23:06.850
Sarti, Gabriele: of the stack, the more you can optimize, no?

559
01:23:09.500 --> 01:23:16.109
Sarti, Gabriele: Well, that's it for me. Sorry for keeping you so long.

560
01:23:20.710 --> 01:23:22.320
Sarti, Gabriele: Any final question?

561
01:23:22.660 --> 01:23:24.540
Sarti, Gabriele: Is this close to a job talk?

562
01:23:24.870 --> 01:23:26.390
Sarti, Gabriele: No.

563
01:23:26.520 --> 01:23:38.620
Sarti, Gabriele: No, no, it was just, a bit of a… Yeah? Nice. I'm glad to hear, no, it wasn't planned as a job talk, no.

564
01:23:39.320 --> 01:23:42.880
Sarti, Gabriele: What, sorry? Google Slides. Just, like, manually?

565
01:23:44.830 --> 01:23:50.729
Sarti, Gabriele: It's convenient to make figures for papers in Google Slides, because then it's plug-and-play for the presentation, right?

566
01:23:51.050 --> 01:23:53.220
Sarti, Gabriele: That's a lot of challenges.

567
01:23:53.410 --> 01:23:54.300
Sarti, Gabriele: Yeah.

568
01:23:55.470 --> 01:23:56.700
Sarti, Gabriele: Thanks, everyone.

569
01:23:59.340 --> 01:24:00.200
Sarti, Gabriele: You know…

