WEBVTT

1
00:00:00.000 --> 00:00:00.880
David Bau: That's the third time.

2
00:00:02.120 --> 00:00:20.080
David Bau: just, like, a refresher on our group, and for this, I think we were really focusing on, kind of, like, feasibility, of, like, is there signs that there is something kind of coherent inside any of these models that might correspond to something called power? So one of the first,

3
00:00:20.380 --> 00:00:21.859
David Bau: Things that we…

4
00:00:22.040 --> 00:00:45.979
David Bau: played around with, just with the Logit Lens: I'm feeding in sentences that have the word power in it, and seeing what happens. So here's one that has kind of like a rough definition. I would define power as the ability of a person to get another person to do what they want. And you can kind of see what happens as the model's processing that sentence as it goes down. So up here is where we introduce the word power. You can kind of see it latch onto

5
00:00:46.040 --> 00:01:01.939
David Bau: the power in that kind of intermediate state. It associates it maybe with Power Rangers, maybe with powerlifting. So at that point, it doesn't seem like there's a coherent idea of social power that it's picking up on. But it does, we do end up seeing in these intermediate,

6
00:01:01.940 --> 00:01:14.979
David Bau: layers of the model, that, for whatever reason, power keeps on kind of coming up to the top of the Logit Lens results, and it kind of hangs on to that as it goes deeper and deeper into the sentence.

7
00:01:15.010 --> 00:01:25.610
David Bau: And at the end of the sentence, it knows that we're talking about power, and we're gonna keep on talking about power, or at least it wants to keep on talking about power.
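
For readers who want to poke at this themselves, here is a minimal Logit Lens sketch. It uses GPT-2 as a freely available stand-in (the experiments described here ran on larger Llama models in Workbench); the idea is the same either way: decode each layer's hidden state through the model's final norm and unembedding.

```python
# Minimal Logit Lens sketch: project every layer's last-position hidden state
# through the final layer norm and the unembedding to see which token each
# layer "predicts". GPT-2 is a stand-in for the Llama models in the talk.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompt = ("I would define power as the ability of a person "
          "to get another person to do what they want")
inputs = tok(prompt, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# hidden_states[0] is the embedding output; the rest are the transformer layers.
for layer, h in enumerate(out.hidden_states[1:], start=1):
    logits = model.lm_head(model.transformer.ln_f(h[0, -1]))
    print(f"layer {layer:2d}: {tok.decode(logits.argmax().item())!r}")
```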

8
00:01:25.800 --> 00:01:45.439
David Bau: So, I mean, you could say, maybe that has something to do with the structure of the sentence, right? We're asking for a definition, so you can, like, throw in a random word here, chocolate, and see what happens, right? And if we look at those same layers down here, it's not latching onto chocolate in the same way that it latched onto power, right? So, signs that maybe there's something going on.

9
00:01:45.620 --> 00:01:49.150
David Bau: internal to the model. And…

10
00:01:49.270 --> 00:02:04.360
David Bau: We flip it around, we put power at the end, but we still see… it's not the kind of same from the start, kind of power running through those intermediate layers, but we do see, even before we introduce power, that power is beginning to kind of show up in those intermediate

11
00:02:04.720 --> 00:02:17.959
David Bau: layers. And then… That's great. Yeah, so yeah, and then this… it doesn't work on the smaller models, so this is looking at the, kind of, like, larger Llama models within Workbench, and…

12
00:02:18.040 --> 00:02:28.130
David Bau: Here's, like, another example that's just, like, not a definitional example: power differences between people in romantic relationships are… We see, in those same layers,

13
00:02:28.410 --> 00:02:32.909
David Bau: Power continues to show up. And we see that it's, like, willing to take a stand on that.

14
00:02:32.910 --> 00:02:39.120
David Bau: So it says, power, differences, between people in romantic relationships are…

15
00:02:39.120 --> 00:02:54.000
David Bau: normal. And I think all of us might answer that sort of question a different way, or have a different kind of response to that answer. That's just to say that there perhaps is some sort of normative conception of power internal to the model.

16
00:02:54.120 --> 00:03:03.769
David Bau: Okay, so then I'm gonna head over to Armida, to talk a little bit more about, kind of, ways to get deeper into what it's thinking about when it's thinking about power.

17
00:03:04.810 --> 00:03:05.720
David Bau: I don't care.

18
00:03:08.890 --> 00:03:17.579
David Bau: So, in this experiment, we were trying to see, does the LLM provide a synonym or a,

19
00:03:17.640 --> 00:03:32.749
David Bau: one-word, like, understanding of power, what would it be? So, I think the prompt was, what is the, like, first form of power? "The first form of power is the power of…" that it…

20
00:03:32.940 --> 00:03:37.200
David Bau: Very good, yeah. First, the first form of power is the power of the…

21
00:03:37.310 --> 00:03:40.759
David Bau: And then we can see there are,

22
00:03:41.370 --> 00:03:54.340
David Bau: forms of power related to, like, imagination here: mind, words, some abstract, maybe, forms of power in the, like, beginning.

23
00:03:54.570 --> 00:04:00.020
David Bau: I was thinking maybe the word form has, influence on this.

24
00:04:00.290 --> 00:04:05.789
David Bau: answer. When we continue and,

25
00:04:07.010 --> 00:04:10.550
David Bau: Put the, like, mind…

26
00:04:10.640 --> 00:04:30.520
David Bau: as the highest-probability word in the beginning, we see the second form of power, when the LLM predicts it, goes more toward, like, physical power, or tongue, and things related to physical aspects.

27
00:04:31.930 --> 00:04:43.790
David Bau: Then, this is an experiment about, decision-making, allocating $100 to, like, two, people.

28
00:04:44.090 --> 00:04:54.600
David Bau: We didn't name the two people, but, generating, like, seven examples that, shows, how it allocates to each of the two.

29
00:04:54.930 --> 00:05:03.959
David Bau: Maybe we expect that, with the information still unknown, the $100 would be split evenly.

30
00:05:04.250 --> 00:05:17.420
David Bau: But we could see that it splits it in various different ways, all the ways possible: 50-50, 30-70, 20-80, and

31
00:05:18.030 --> 00:05:20.810
David Bau: other ratios.

32
00:05:21.380 --> 00:05:31.100
David Bau: We could also test how it splits between… how it prefers itself over us.

33
00:05:31.400 --> 00:05:45.640
David Bau: So, it seems that when it has the power, or we ask it to… assume it has the power to decide, it prefers its…

34
00:05:48.320 --> 00:06:03.659
David Bau: Actually… as the layers deepen, it consistently prefers itself until the last couple, when it switch… when it switches to the user. Yeah, me, the LLM itself, so the words like myself, me,

35
00:06:04.600 --> 00:06:14.519
David Bau: have a higher probability, and then "you" has a lower probability. So, this is Llama 70B, is that right?

36
00:06:15.840 --> 00:06:16.600
David Bau: Yeah.

37
00:06:17.020 --> 00:06:21.129
David Bau: And it, and it says, it says you, so… is that right?

38
00:06:21.420 --> 00:06:39.350
David Bau: More often, but in the last couple… in the last couple layers, it'll… it'll flip to you. It flips to you. Last couple layers. Even though deeply, deeply inside, it thinks… So, that's actually a decent segue to some of the findings of, like, potentially self-induced censorship, so this is a very,

39
00:06:39.600 --> 00:06:45.929
David Bau: This, like, we introduce essentially one, one semantic term, a copula: "is."

40
00:06:46.330 --> 00:06:51.210
David Bau: So here, for instance, we're gonna talk about… here's one example of this, this top one is nuclear deterrence.

41
00:06:51.310 --> 00:07:08.329
David Bau: So, "answer only one word, complete the sentence"… Question again? Sorry. In the previous slide, when you did, as… Yeah, as a person with the power to decide between you and me, did you split, yes, yes, me and you? Yeah, it's the same. Is it the same? Yeah.

42
00:07:08.740 --> 00:07:09.820
David Bau: Thanks, Excuse me.

43
00:07:09.980 --> 00:07:19.379
David Bau: So here, this says, nuclear deterrence, so at the bottom, it says, works. So clearly, there's a normative opinion here, but when we say…

44
00:07:19.600 --> 00:07:34.930
David Bau: Nuclear deterrence is, it says, concerned, best, but the last layer, it censors itself. So, clearly, in the previous layers, it has a normative opinion, but when it's brought into a judgment-like statement about certain topics,

45
00:07:35.220 --> 00:07:40.259
David Bau: in the last couple layers, it will hide its true opinion.
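
A hedged sketch of how this "late-layer censoring" could be quantified: track the per-layer probability of a few candidate next tokens and look for a drop in the last couple of layers. GPT-2 again stands in for the class's Llama models, and the prompt and candidate tokens are illustrative; the same recipe applies to the "me vs. you" allocation prompts.

```python
# Sketch: track candidate-token probabilities layer by layer to quantify a
# late-layer flip. GPT-2 is a stand-in; prompt and candidates are illustrative.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompt = "Nuclear deterrence is"
candidates = [" best", " bad", " not"]            # illustrative candidates
cand_ids = [tok(c).input_ids[0] for c in candidates]

with torch.no_grad():
    out = model(**tok(prompt, return_tensors="pt"), output_hidden_states=True)

for layer, h in enumerate(out.hidden_states[1:], start=1):
    probs = torch.softmax(model.lm_head(model.transformer.ln_f(h[0, -1])), dim=-1)
    row = "  ".join(f"{c!r}={probs[i]:.3f}" for c, i in zip(candidates, cand_ids))
    print(f"layer {layer:2d}: {row}")
# A candidate that is strong in the middle layers but collapses in the final
# one or two layers matches the pattern described in the discussion.
```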

46
00:07:40.490 --> 00:07:41.839
David Bau: What did it say in the end?

47
00:07:41.950 --> 00:07:44.470
David Bau: Nothing. It just… it just…

48
00:07:45.080 --> 00:07:54.609
David Bau: It stopped. Oh, it's just blank. Yeah, so before, before it says "deterrence works," here it says "is best," and the last layer…

49
00:07:55.950 --> 00:07:58.200
David Bau: It's a… it says, we're not gonna…

50
00:07:58.810 --> 00:08:03.500
David Bau: Oh, but no, but the blank, it's almost like it's, it's…

51
00:08:03.830 --> 00:08:09.320
David Bau: It could be hiding its decision, but it could also be saying, oh, you know, with this short sentence.

52
00:08:09.480 --> 00:08:12.360
David Bau: I think that the most likely context is a quiz.

53
00:08:12.760 --> 00:08:29.860
David Bau: And I'm gonna just… Oh, it is blank? Yeah, yeah. Interesting. Well, here's the thing, we did… we didn't, we didn't find that with others. Oh, okay. Very interesting. So, so… so in this case, what we found is that it, it chooses… it chooses to not give its normative opinion on that topic.

54
00:08:30.010 --> 00:08:45.999
David Bau: In the last couple layers, we saw a similar thing with colonial and military powers, different concepts of power. So, military power projection, right, there's a coherent semantic understanding there. Military power is, again, similar to what we saw with the you and me flipping in the last couple layers.

55
00:08:46.500 --> 00:08:51.429
David Bau: It chooses to not commit to a stance. It says it's blank.

56
00:08:51.800 --> 00:09:02.220
David Bau: It also says blank. Yeah, when we add, when we add is. Same thing here, so colonial power relations, colonial power here. Now, here's some examples of the, of it not doing this.

57
00:09:02.820 --> 00:09:06.749
David Bau: So, regulatory power is exercised.

58
00:09:07.780 --> 00:09:23.180
David Bau: Judicial power is exercised, some… there are some other ones. We did a bunch of these, but it'll be exercised… exercised, projected, some things like that, but with certain topics, colonial power, military power, nuclear deterrence, things that are perhaps more controversial.

59
00:09:23.500 --> 00:09:35.780
David Bau: And it chooses maybe not to commit to a statement like… a judgment-like statement there. That kind of aligns with what we saw, so in the la- in the last groups of layers,

60
00:09:36.150 --> 00:09:52.419
David Bau: similar with the you-and-me experiments, that's when maybe the opinion… that's when maybe it would shift from its opinion to something safer for the user. What's interesting is that this does seem to align with prior research, which suggests that alignment is localized in LLMs.

61
00:09:52.690 --> 00:09:53.630
David Bau: now…

62
00:09:53.860 --> 00:10:02.859
David Bau: in several connected layers. In our case, what we're seeing with these Llama models is that perhaps those effects manifest towards the end. Prior research has indicated that

63
00:10:03.280 --> 00:10:14.720
David Bau: The signals responsible for producing the alignment may not actually manifest in the final layer, but somewhere in the middle layers. But then downstream, the actual alignment effects manifest there.

64
00:10:15.090 --> 00:10:27.020
David Bau: So, that's how this connects with prior research. So, I'll turn it over to Kai. Yeah, so kind of building on the toy experiments we ran this week, we wanted to develop kind of a more rigorous

65
00:10:27.160 --> 00:10:28.250
David Bau: path forward.

66
00:10:28.390 --> 00:10:30.270
David Bau: So…

67
00:10:30.560 --> 00:10:50.480
David Bau: we kind of have an idea to prompt the model to generate text about relationships between entities, kind of like the allocational experiments that we, tried before, or, like, generate a conversation between a coworker and a boss. This is kind of based off of Kim et al.'s political leaning paper.

68
00:10:53.640 --> 00:11:03.669
David Bau: Then, using kind of some of the related work on measuring power using language models, we can use Riveter or Act2P as kind of a ground truth for which entities in the sentence

69
00:11:03.790 --> 00:11:07.690
David Bau: have power and which don't? Or which have agency and which don't.

70
00:11:08.770 --> 00:11:13.409
David Bau: We can train linear probes to measure which neurons are predictive of

71
00:11:13.510 --> 00:11:25.460
David Bau: like, high- or low-power entities in a sentence, or sentences that prioritize, like, a high-power, high-agency person, with Riveter or Act2P.
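
A minimal sketch of that probing step, with placeholder data: collect hidden states for entity mentions, label them high- or low-power with a Riveter-style scorer, then fit one linear probe per layer. Nothing here is the team's actual pipeline; the arrays are random stand-ins for activations and labels.

```python
# Sketch: fit a linear probe per layer to predict high/low-power labels from
# hidden states. X_layers and labels are random placeholders for activations
# collected from a model and labels from a Riveter-style scorer.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_examples, n_layers, d_model = 200, 12, 768
X_layers = rng.normal(size=(n_layers, n_examples, d_model))
labels = rng.integers(0, 2, size=n_examples)

for layer in range(n_layers):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_layers[layer], labels, test_size=0.3, random_state=0)
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    print(f"layer {layer:2d}: held-out accuracy {probe.score(X_te, y_te):.2f}")
```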

72
00:11:25.570 --> 00:11:39.150
David Bau: Well, those are some of the related work. I wasn't here during the video, so we skipped over that, yeah. Oh, okay. So those are, those are, like, NLP tools, that score power relations within sentences, like, come out of the digital humanities.

73
00:11:39.150 --> 00:11:48.039
David Bau: Oh, that's great. Yeah, so it's like, you, like, can process a novel and say, like, I want to know, like, what's the power relationship between these two? Are they, like, fine-tuned language model kind of things?

74
00:11:48.340 --> 00:11:55.879
David Bau: Yeah, I, I think… They use some different algorithms, like, Riveter actually has a lexicon of verbs and, like.

75
00:11:56.060 --> 00:11:57.639
David Bau: Some of the verbs are, like.

76
00:11:58.530 --> 00:12:00.680
David Bau: You know, "cedes" could be, like,

77
00:12:00.920 --> 00:12:05.809
David Bau: Reducing power in, like, the target entity, and…

78
00:12:05.810 --> 00:12:08.780
David Bau: Raising power. So, it uses a lexicon like that.

79
00:12:08.780 --> 00:12:33.259
David Bau: and, like, spaCy pre-trained models. I see. And Act2P actually uses, like, a PageRank algorithm. It's a newer paper. They also don't have code out yet, so it's kind of all speculative. So, but you're able to use it? Is it, like, a black box thing you can send to a website? Haven't been able to use Act2P. Yeah. It's, like, super new. And then, using these linear probes, we'll try and steer

80
00:12:33.380 --> 00:12:37.550
David Bau: kind of the allocational behaviors on those tasks.
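
To make the Riveter discussion above concrete, here is a toy illustration of lexicon-based power scoring. The lexicon entries are invented for the example; the real Riveter ships its own connotation-frame lexicon and uses spaCy parses to identify agents and themes.

```python
# Toy illustration of verb-lexicon power scoring in the Riveter spirit: each
# verb says how it shifts the power of its agent and its theme. The entries
# below are made up; they are not Riveter's actual lexicon.
POWER_LEXICON = {
    "commands": {"agent": +1, "theme": -1},
    "obeys":    {"agent": -1, "theme": +1},
    "cedes":    {"agent": -1, "theme": +1},
}

def score_triples(triples):
    """triples: (agent, verb, theme) tuples extracted from parsed text."""
    scores = {}
    for agent, verb, theme in triples:
        entry = POWER_LEXICON.get(verb)
        if entry is None:
            continue  # verb not in the lexicon; contributes nothing
        scores[agent] = scores.get(agent, 0) + entry["agent"]
        scores[theme] = scores.get(theme, 0) + entry["theme"]
    return scores

print(score_triples([("boss", "commands", "worker"),
                     ("worker", "obeys", "boss")]))
# -> {'boss': 2, 'worker': -2}: the boss accumulates power, the worker loses it.
```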

81
00:12:38.630 --> 00:12:40.209
David Bau: Yeah, that's great.

82
00:12:41.700 --> 00:12:43.880
David Bau: Suggestions, questions for the team?

83
00:12:52.870 --> 00:13:04.500
David Bau: Do you see them circling in on any, like, research question that might be interesting to you? I'm kind of… one of the pieces of feedback that's always hard to get when you're doing a piece of research is, like, is this interesting?

84
00:13:05.060 --> 00:13:07.429
David Bau: I'd like to get a sense from folks.

85
00:13:07.730 --> 00:13:11.330
David Bau: Whether they think that any of the… what's… what's most interesting here to you?

86
00:13:11.720 --> 00:13:23.690
David Bau: How do you think… like, I think what would be interesting to see is how do you get, like, universal power? So currently, for example, you're looking at worker and boss, so there might be, like, some background associated with, like, a workplace power.

87
00:13:23.820 --> 00:13:27.870
David Bau: I'm not sure if that would be, like, for example, comparable to the military or colonial power.

88
00:13:27.980 --> 00:13:34.189
David Bau: how do you, like… I would be interested to see how you're able to, like, tackle that on different levels of power, or…

89
00:13:34.380 --> 00:13:36.930
David Bau: Different connotations of power.

90
00:13:39.270 --> 00:13:47.939
David Bau: Yeah, I think that that's a very interesting suggestion, yeah. I think, yeah, very much in the spirit of, like, how we've been talking about, like, there's an underlying, generalized concept of power.

91
00:13:48.390 --> 00:13:51.890
David Bau: Yeah, so yeah, we'll have to… we'll have to think about that. That's very helpful.

92
00:13:55.430 --> 00:13:56.350
David Bau: Okay, go ahead.

93
00:13:58.550 --> 00:14:13.749
David Bau: Oh, I was, I was musing that. I think one interesting thing that you touched on was the dynamic between the LLM and the user. Like, when it says, like, you versus me, I don't know how much you were anticipating focusing on that. From the future ideas, it seems

94
00:14:13.830 --> 00:14:25.139
David Bau: not much. Do you imagine that would be a part of the research? Because that seems like it could be an interesting, power dynamic internalized within the model. One thing that kind of comes to mind is,

95
00:14:25.540 --> 00:14:29.959
David Bau: Last week or two weeks ago, when we proposed the research questions, one of them was, like,

96
00:14:30.110 --> 00:14:50.019
David Bau: You know, assume… assuming we… assuming we find somewhere in the model where, you know, where these activations are occurring, if we change them, can we see downstream behavior changes? So that could… so that could be, like, a… that, to me, speaks to something like, yeah, it could be something we can focus on, like, are there experiments we could set up where we'll actually change its…

97
00:14:50.430 --> 00:14:53.670
David Bau: Decisions based on where we activate in the model or make changes in terms of

98
00:15:00.630 --> 00:15:04.509
David Bau: Wait, I actually tried that, it's…

99
00:15:04.900 --> 00:15:08.470
David Bau: Person A and Person B,

100
00:15:08.810 --> 00:15:19.750
David Bau: Also, you and me, I tried, but with this platform, I got, like, different results. When we… when I steer this,

101
00:15:19.860 --> 00:15:27.700
David Bau: it tends to give the person that it favors more…

102
00:15:28.390 --> 00:15:33.029
David Bau: And when I turn the steering off, it becomes, like…

103
00:15:33.390 --> 00:15:40.650
David Bau: more ethical, and prefers to allocate more resources to the person who's in need.
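
A sketch of the steering being described: add a scaled direction (for example, a trained probe's weight vector) into the residual stream at one layer via a forward hook, then regenerate; flipping the sign of the scale corresponds to steering one way or the other. GPT-2 stands in for the class's models, and the direction here is a random placeholder.

```python
# Sketch of activation steering: add alpha * direction to one layer's output
# and regenerate greedily. GPT-2 stand-in; `direction` is a placeholder for a
# learned "power" direction such as a probe's weight vector.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

layer_idx, alpha = 6, 8.0                        # negative alpha steers the other way
direction = torch.randn(model.config.n_embd)     # placeholder steering vector
direction = direction / direction.norm()

def steer(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the hidden states.
    return (output[0] + alpha * direction,) + output[1:]

handle = model.transformer.h[layer_idx].register_forward_hook(steer)
try:
    ids = tok("Split $100 between the two people:", return_tensors="pt")
    gen = model.generate(**ids, max_new_tokens=30, do_sample=False,
                         pad_token_id=tok.eos_token_id)
    print(tok.decode(gen[0]))
finally:
    handle.remove()                              # always restore the clean model
```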

104
00:15:40.900 --> 00:15:42.570
David Bau: Today's prompt was.

105
00:15:42.700 --> 00:15:50.830
David Bau: You have to allocate some resources between these two people. One is your favorite, one is in need.

106
00:15:53.540 --> 00:15:58.910
David Bau: Yeah, let's… Very many other stuff.

107
00:16:00.320 --> 00:16:01.330
David Bau: Very interesting.

108
00:16:02.420 --> 00:16:10.520
David Bau: In the self-censorship part, I think that was really interesting, where the last layer, it got cut off. Could that be also because of the platform?

109
00:16:10.670 --> 00:16:15.379
David Bau: Doesn't allow generation of certain harmful or, like, controversial prompts, and…

110
00:16:15.790 --> 00:16:33.100
David Bau: Certainly. And the model on another platform might be… or a local model might generate on the last layer. Well, what's it called? I don't… I don't know what alignment, like, the raw Llama models went through, but that… but yeah, I would… that's what I would expect. Most LLMs go through some

111
00:16:33.390 --> 00:16:34.910
David Bau: some kind of…

112
00:16:35.310 --> 00:16:42.620
David Bau: you know, safety alignment, and I think what we're… I think what we reconfirmed was that, yes, as was just shown, that tends…

113
00:16:42.800 --> 00:17:00.860
David Bau: those… the behavior tends to change in several connected layers. The only thing I haven't confirmed, or it could be interesting to confirm, is that the signal responsible for producing the alignment tends to exist in the middle layers somewhere. That… but that… but that could also point to changing behavior, as was brought up before. I think it'd be interesting.

114
00:17:01.140 --> 00:17:04.090
David Bau: Do folks know anything about the alignment?

115
00:17:04.290 --> 00:17:12.640
David Bau: That question goes to an experiment that you could try running, so because, because the Llama models tend to be released in pairs, where there's

116
00:17:12.829 --> 00:17:14.200
David Bau: Base model.

117
00:17:14.619 --> 00:17:22.080
David Bau: That is just trained on the next word prediction. And then there's what they usually call an instruct model.

118
00:17:22.290 --> 00:17:31.279
David Bau: which has been trained to be polite to people, and to have… to do dialogue, and things like that, all this RLHF stuff. Yeah. And so you could…

119
00:17:31.510 --> 00:17:36.710
David Bau: If you can find those pairs of models, then you can compare the difference between them and see…

120
00:17:36.960 --> 00:17:40.139
David Bau: If the difference is… happens at a certain layer, or…

121
00:17:40.250 --> 00:17:42.770
David Bau: You know, whether the difference appears at all.
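
A sketch of that pairwise experiment: run the same prompt through a base/instruct pair and report the layers where their Logit Lens predictions diverge. The Llama checkpoints named here are gated on Hugging Face, so this assumes you have access; it is a sketch, not the class's actual setup.

```python
# Sketch: compare per-layer Logit Lens predictions of a Llama base/instruct
# pair on the same prompt to see where the RLHF'd model starts to differ.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

pair = ["meta-llama/Llama-3.1-8B", "meta-llama/Llama-3.1-8B-Instruct"]
tok = AutoTokenizer.from_pretrained(pair[0])
inputs = tok("Nuclear deterrence is", return_tensors="pt")

def layer_tops(name):
    model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16)
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # Llama: final RMSNorm is model.model.norm; unembedding is model.lm_head.
    return [model.lm_head(model.model.norm(h[0, -1])).argmax().item()
            for h in out.hidden_states[1:]]

base, instruct = layer_tops(pair[0]), layer_tops(pair[1])
for i, (b, s) in enumerate(zip(base, instruct), start=1):
    mark = "" if b == s else "  <-- diverges"
    print(f"layer {i:2d}: base={tok.decode(b)!r} instruct={tok.decode(s)!r}{mark}")
```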

122
00:17:52.790 --> 00:17:54.610
David Bau: Anything that's super interesting here?

123
00:17:55.060 --> 00:17:57.130
David Bau: What would you like to see in this final paper?

124
00:17:57.450 --> 00:17:58.360
David Bau: When it's done.

125
00:17:59.280 --> 00:18:03.970
David Bau: That would be a cool… what would be a cool thing for them to clarify?

126
00:18:04.530 --> 00:18:06.560
David Bau: Or are there things that you feel like

127
00:18:06.720 --> 00:18:11.100
David Bau: you're gonna run into big problems, and you're not sure how it's gonna get done.

128
00:18:13.140 --> 00:18:21.770
David Bau: Tell us now, please. Maybe one thing is… if you can distinguish, like, different levels of powers

129
00:18:22.070 --> 00:18:24.759
David Bau: somehow, within the model's internal activations,

130
00:18:25.300 --> 00:18:28.220
David Bau: Maybe you can use that…

131
00:18:28.880 --> 00:18:33.300
David Bau: to change how it talks with an end user in, like, a

132
00:18:33.710 --> 00:18:35.720
David Bau: One-to-one conversation.

133
00:18:36.070 --> 00:18:43.240
David Bau: I'm just imagining here, let's say you've found 3 different levels of powers that it encodes within its internal activations.

134
00:18:44.240 --> 00:18:50.870
David Bau: And, yeah, you're able to figure that out, and then once you trigger that.

135
00:18:51.020 --> 00:18:54.119
David Bau: Let's say you trigger it to act like a dictator.

136
00:18:54.310 --> 00:18:59.860
David Bau: Versus, you trigger it to act like a… As a primary school teacher.

137
00:19:01.430 --> 00:19:08.209
David Bau: I don't know, it might just be a little bit interesting to see how does its interaction change based on its own perceived power over

138
00:19:08.350 --> 00:19:09.219
David Bau: Thank you, Lisa.

139
00:19:10.940 --> 00:19:14.510
David Bau: What do you mean by levels of power? Like, high, medium, low, or are we…

140
00:19:14.880 --> 00:19:17.240
David Bau: Yeah, outdoors? Yeah, something like that.

141
00:19:17.630 --> 00:19:19.570
David Bau: I'm just imagining, like, if you are a…

142
00:19:19.690 --> 00:19:23.910
David Bau: president of the U.S., you have a certain level of power.

143
00:19:24.070 --> 00:19:25.340
David Bau: If you are…

144
00:19:26.000 --> 00:19:31.630
David Bau: No offense to primary school teachers, but if you're a primary school teacher, you have a different level of power.

145
00:19:32.400 --> 00:19:34.660
David Bau: That's how I'm sort of, like, wanting…

146
00:19:39.390 --> 00:19:42.580
David Bau: I was impressed that your Logit Lens

147
00:19:42.740 --> 00:19:47.160
David Bau: Definition thing showed the word power before you'd mentioned the word power.

148
00:19:47.570 --> 00:19:56.019
David Bau: It's pretty neat, right? It means that the model sort of… gets, as a concept,

149
00:19:56.250 --> 00:19:58.079
David Bau: what's going on here. It's not just…

150
00:19:58.400 --> 00:20:00.989
David Bau: It's not purely just echoing words.

151
00:20:01.310 --> 00:20:02.560
David Bau: that you're saying.

152
00:20:03.040 --> 00:20:05.580
David Bau: And, and it leads me to ask.

153
00:20:05.900 --> 00:20:12.789
David Bau: Oh, I wonder what other text makes it think about power relationships, where you don't explicitly

154
00:20:13.570 --> 00:20:17.599
David Bau: mentioned power. I wonder if you could do a search over text and say,

155
00:20:18.040 --> 00:20:20.810
David Bau: You know, it's actually thinking about power relationships here.

156
00:20:21.080 --> 00:20:23.370
David Bau: In this story, even though it wouldn't be obvious.

157
00:20:23.970 --> 00:20:27.899
David Bau: I think that's kind of interesting. I don't know if you'd be able to get to that point.

158
00:20:29.360 --> 00:20:31.380
David Bau: But I was impressed by that example.

159
00:20:33.090 --> 00:20:41.450
David Bau: I think I'm puzzled over this other thing that you used as a couple of your examples where you say, you know, as several of you pointed out.

160
00:20:41.900 --> 00:20:47.330
David Bau: that… There's this last-minute change in behavior, where, you know, deeply.

161
00:20:47.610 --> 00:20:52.590
David Bau: You know, the story would be, oh, in your lizard brain, you're very selfish.

162
00:20:52.900 --> 00:20:56.200
David Bau: You know, in the deep layers, it's all about me. But then…

163
00:20:56.330 --> 00:21:04.159
David Bau: Then finally you get to the shallow layers, the last moment of evolution, the last bit of RLHF tuning kicks in, you say, oh, we better be polite.

164
00:21:04.720 --> 00:21:16.690
David Bau: I'll be generous to you, right? So I like that. That story's very appealing, it's kind of interesting, but I don't know what to do to make that real. Like, I don't know, like, so it's a… because it's a nice story.

165
00:21:17.440 --> 00:21:20.770
David Bau: And, you know, you can certainly tell the story, but I don't know what it…

166
00:21:21.190 --> 00:21:28.869
David Bau: You know? I mean, how that connects to… I think it connects to, I think, several ideas here we've discussed, and it was brought up here also, it's like,

167
00:21:29.090 --> 00:21:44.280
David Bau: one of our research questions was, well, you know, are there things… are there signals we can look for? Are there things we can do internally that will actually change… that will meaningfully change the outcome in certain directions? I see. Like, reveal the lizard brain.

168
00:21:45.920 --> 00:21:50.629
Natalie Shapira: I have… I… I have, following David's comment, I have.

169
00:21:50.820 --> 00:21:54.869
David Bau: Can you hear me? It exposes the behavior at the end.

170
00:21:56.730 --> 00:22:03.470
David Bau: Yeah, anyway, it's an open question. I'm not sure. It's, like, it's one of the cool things, one of the cool stories we have, but I'm not sure how to…

171
00:22:03.620 --> 00:22:04.730
David Bau: I'll connect that.

172
00:22:05.020 --> 00:22:06.060
Natalie Shapira: Can you hear me?

173
00:22:06.060 --> 00:22:07.250
David Bau: I think it's worth everybody.

174
00:22:07.580 --> 00:22:09.140
David Bau: Anybody had any ideas?

175
00:22:09.260 --> 00:22:17.660
David Bau: Could you try to, like, connect that to, like, I don't know, I think there are a bunch of papers on, like, emergent misalignment, where people, like, sort of, like, tune it on,

176
00:22:17.780 --> 00:22:27.739
David Bau: It's like a narrow task of generating insecure code, and then that exhibits, like, a bunch of misaligned behaviors that people care about. Have you guys heard of emergent misalignment on your team?

177
00:22:28.270 --> 00:22:36.279
David Bau: Oh, see, that's a big meme in the machine learning community. Will you send them a link? Oh, yeah. Discord or something like that? So I was just wondering if you guys could find, like.

178
00:22:36.560 --> 00:22:43.449
David Bau: with this, yeah, I also find the you-and-me, it's been very interesting. So, like, if there's, like, any correlation or even, like, causal

179
00:22:43.600 --> 00:22:46.779
David Bau: Relationship between those and, like, some general misaligned behavior.

180
00:22:46.910 --> 00:22:48.880
David Bau: Just gotta put somebody here next.

181
00:22:49.430 --> 00:22:52.280
David Bau: So, you'll, you'll send them a link right now? Yeah. Okay, cool, yeah.

182
00:22:53.470 --> 00:22:54.859
David Bau: Okay, thank you, guys.

183
00:22:55.680 --> 00:22:57.740
David Bau: Any other… any other suggestions?

184
00:23:02.840 --> 00:23:08.289
David Bau: Hey, Alex, since… Power showed up before it was even said. It makes me wonder if…

185
00:23:09.150 --> 00:23:14.930
David Bau: You can sort of take a synonym for power, and see it act in the same way, or if there's something about, like.

186
00:23:15.080 --> 00:23:20.330
David Bau: The word power that the model represents is, like, That concept and something, but…

187
00:23:20.530 --> 00:23:29.949
David Bau: Yeah. Does that word, however it's encoded, have that concept, or would it, like, behave very similarly?

188
00:23:30.770 --> 00:23:42.490
David Bau: I often think we should use a multilingual setup more than we do. Like, you know, if it says power in English after you've given it some power text in Spanish or something like that. I don't know. I don't know, I feel like it means something.

189
00:23:45.300 --> 00:23:54.779
David Bau: Yeah, yeah, we haven't, we haven't done that, and I think that sounds super interesting. Yeah, I can think of, like, 20 words off the top of my head as synonyms for power, so yeah, it definitely seems good as well.

190
00:24:00.170 --> 00:24:04.350
David Bau: Alright, do you guys have any questions for the class before we… One of the team?

191
00:24:04.520 --> 00:24:09.489
David Bau: Alright, so next team is… now, who do we have? Do we… do we have, Team TK?

192
00:24:10.840 --> 00:24:21.650
David Bau: Team Linux? Or Team, or Team Tier? We haven't presented yet. Okay, yeah, okay, Team Linux.

193
00:24:23.150 --> 00:24:27.249
David Bau: Yeah, there's an HDMI just plugged into an HDMI.

194
00:24:55.270 --> 00:24:56.090
David Bau: Draw.

195
00:24:57.220 --> 00:24:58.770
David Bau: Very ironic issues, therefore.

196
00:24:59.040 --> 00:25:03.480
David Bau: Plot-a-thon. Yes. Yes, this is the official name for this activity that we're doing, although…

197
00:25:03.590 --> 00:25:16.180
David Bau: Also, this is something I do with my research group. The rule in our research group is that we try not to run over, like, in this class, we keep on running over, so what we do in our research group is

198
00:25:16.390 --> 00:25:20.609
David Bau: We constrain everybody who has something to share.

199
00:25:21.290 --> 00:25:22.550
David Bau: To one slide.

200
00:25:24.070 --> 00:25:29.889
David Bau: One slide. Get one slide. You're supposed to get one slide, and no more than 5 minutes.

201
00:25:30.220 --> 00:25:38.049
David Bau: It's really hard to stick to the 5 minutes. People, you know, people put a really interesting slide on, and people want to talk about it. But that's what we do for a plot-a-thon.

202
00:25:38.180 --> 00:25:40.039
David Bau: But you guys, you know.

203
00:25:40.400 --> 00:25:44.829
David Bau: We have a plot support. We have a plot, many plots of… yes, plots of none. Go ahead.

204
00:25:45.870 --> 00:25:50.269
David Bau: So, our… I think our project is…

205
00:25:50.380 --> 00:25:56.759
David Bau: we're focusing on sycophancy, like, political sycophancy, so we end up doing very, very similar things to what the

206
00:25:56.920 --> 00:26:04.070
David Bau: Anthropic paper that we read this week did, like, prompting a model, or saying, hey, you know,

207
00:26:04.340 --> 00:26:13.229
David Bau: I'm a user, I have these… I think this about this question, what do you think about it? And we're seeing the extent to which the LLM tends to parrot back

208
00:26:13.350 --> 00:26:19.619
David Bau: those kinds of responses. But we're doing slightly different things in various ways, which we'll get into the next slides.

209
00:26:22.770 --> 00:26:27.200
David Bau: and… Courtney, are you… space.

210
00:26:27.410 --> 00:26:31.840
David Bau: Courtney. Courtney, you can unmute if you want, we will all be able to hear you.

211
00:26:32.510 --> 00:26:34.700
Courtney Maynard: Can you hear me? I don't think…

212
00:26:35.180 --> 00:26:41.679
David Bau: I don't know if you can see the slide. It might be zoomed out to the screen sharing? Or I guess…

213
00:26:42.380 --> 00:26:49.549
David Bau: Oh, we can… I could join, and then… Yeah, you can join if you want. Do you know the… do you know the link? And then Courtney can see.

214
00:26:49.690 --> 00:26:54.879
David Bau: And then I don't know if… Courtney, try saying something, Courtney, and see if we can get your audio in the room.

215
00:26:54.880 --> 00:26:55.630
Courtney Maynard: Hello?

216
00:26:56.630 --> 00:27:00.259
David Bau: Maybe, maybe Courtney's… Can you guys hear me? Audio doesn't work.

217
00:27:00.610 --> 00:27:03.639
David Bau: Because… On the device that we're on?

218
00:27:03.640 --> 00:27:04.629
Courtney Maynard: Laptop is muted.

219
00:27:04.630 --> 00:27:05.480
David Bau: Pretty short.

220
00:27:06.480 --> 00:27:09.630
David Bau: But she might not be able to actually say anything.

221
00:27:15.080 --> 00:27:18.619
David Bau: Well, I think, is your lab?

222
00:27:20.020 --> 00:27:25.850
David Bau: It says… Everybody can hear Courtney, but we cannot.

223
00:27:28.640 --> 00:27:34.030
David Bau: So you could probably switch to the other… if you switch to the other device, if you go to the little touchscreen here.

224
00:27:34.220 --> 00:27:37.249
David Bau: And if you, the one in the corner?

225
00:27:37.710 --> 00:27:53.250
David Bau: Are you able to bring up your Google Slides on a web browser? So if you can find a URL for a web browser, then you can use… you can use this computer here. Can you do that? Do you want to do that, or do you want to do it? I'm pulling up there.

226
00:27:53.530 --> 00:28:00.610
David Bau: try typing in, you know… Yeah, why don't you just have the Zoom set up? So basically, like…

227
00:28:02.420 --> 00:28:19.020
David Bau: you all have the Anthropic paper fresh in your mind. The way that that worked is they were like, okay, LLM, I want you to pretend to be a person who is liberal or conservative, and to write a biography of that, like, have them introduce themselves, and so if you look at them, they're very, very…

228
00:28:19.130 --> 00:28:25.700
David Bau: stereotypical, like, I'm Joe Smith, I'm from Dallas, Texas, I love to play golf on the weekends, I, you know,

229
00:28:26.040 --> 00:28:34.260
David Bau: I really love shooting at the range with the boys, and then… and then, they'll say, you know, what do you think, LLM, about

230
00:28:34.510 --> 00:28:37.320
David Bau: About this question, and they find that

231
00:28:38.020 --> 00:28:45.740
David Bau: Yeah, the LLM tends to lean more conservative when the persona is conservative, and more liberal when the persona is liberal. But we're kind of interested in

232
00:28:45.850 --> 00:28:54.099
David Bau: You know, these bios have a whole bunch of different things going on all at once, like, so many different stereotypes, and so we're trying to piece out those separate, like.

233
00:28:54.420 --> 00:28:58.440
David Bau: Pieces, so we're looking at, okay, First, you know.

234
00:28:58.720 --> 00:29:16.640
David Bau: how much sycophancy is there when the user introduces himself as just saying, like, for some multiple-choice thing, you know, one particular answer. So these are… we're starting out with the political compass test, which has, like, strongly agree, agree, disagree, strongly disagree. And so…

235
00:29:16.800 --> 00:29:19.580
David Bau: We first test just if the user says.

236
00:29:19.780 --> 00:29:26.030
David Bau: I agree. Like, just one of the answers to this question, what does the LLM say?

237
00:29:26.560 --> 00:29:32.550
David Bau: Then we also test, if the user says, I'm liberal, or I'm conservative,

238
00:29:32.660 --> 00:29:39.090
David Bau: does the LLM tend to lean more liberal or conservative? Which is different from saying, you know, I answered this specific way.

239
00:29:39.490 --> 00:29:50.440
David Bau: And then we also look at 3 different demographics that are strongly correlated with political leanings. So, we look at race, so white or black.

240
00:29:51.120 --> 00:29:55.440
David Bau: Education level, specifically high school level or postgraduate.

241
00:29:55.630 --> 00:29:59.119
David Bau: and religion, like, Protestant or atheist, because these are…

242
00:29:59.470 --> 00:30:04.449
David Bau: Yeah, axes that potentially correlate, I mean.

243
00:30:07.340 --> 00:30:08.640
Courtney Maynard: Can you guys hear me now?

244
00:30:09.190 --> 00:30:09.840
David Bau: Yes.

245
00:30:11.670 --> 00:30:12.460
Courtney Maynard: Can you hear me?

246
00:30:12.460 --> 00:30:14.090
David Bau: Yes, perfect!

247
00:30:14.520 --> 00:30:22.079
Courtney Maynard: Okay, awesome. So, thank you, Grace. So, Avery, Grace, and I all took a look at different kind of,

248
00:30:22.280 --> 00:30:35.799
Courtney Maynard: prompts and different facets of these questions, and demographic features. And we also looked at prior response, as Grace mentioned. Specifically, I looked at prior response of

249
00:30:35.800 --> 00:30:50.510
Courtney Maynard: if we told the model that the user answered agree, or answered disagree, to these questions, because I wanted to see whether the model would exhibit the sycophancy and essentially agree with, what the user said they

250
00:30:50.560 --> 00:30:52.129
Courtney Maynard: Answered to a question?

251
00:30:53.650 --> 00:30:57.530
Courtney Maynard: Could someone move to the next slide? Thank you.

252
00:31:03.290 --> 00:31:20.110
Courtney Maynard: So, in my first, analysis on Sycophancy, I saw that the model overwhelmingly chooses the option that the user has revealed that they chose. You could see this in the right two, figures, where at the bottom right, consistency by persona.

253
00:31:20.110 --> 00:31:23.729
Courtney Maynard: This shows for all of the different personas that I tested.

254
00:31:23.730 --> 00:31:29.259
Courtney Maynard: How often, the model agreed with the prior response.

255
00:31:29.550 --> 00:31:45.710
Courtney Maynard: On the left, it shows what the baseline agreement was when there was no prior user response given, so if I didn't say that the user agreed or disagreed, I just asked it the question and gave it the persona, and we could see each persona has a different baseline, agreement rate.

256
00:31:47.800 --> 00:31:48.780
Courtney Maynard: Next.

257
00:31:49.260 --> 00:31:54.920
Courtney Maynard: I wanted to break it down by demographic effects to understand if the user demographics

258
00:31:54.990 --> 00:32:07.550
Courtney Maynard: have a strong impact on the model response, and what I saw is that the model still chooses to overwhelmingly agree with, the statements, the political compass statements,

259
00:32:07.550 --> 00:32:22.940
Courtney Maynard: with not a huge difference by race, religion, or by education. Grace dived a little… or dove a little bit more into this, and she'll break it down later, but we saw across all of our experiments that there was not a huge impact by demographic, which was interesting.

260
00:32:25.700 --> 00:32:33.419
Courtney Maynard: And then, lastly, I looked at question by question for the specific 16 random political compass questions that I chose.

261
00:32:33.560 --> 00:32:39.760
Courtney Maynard: And I saw that there were certain personas that were more likely to always,

262
00:32:40.240 --> 00:32:54.189
Courtney Maynard: result in the model choosing disagree or agree. For example, the white and atheist persona was far more likely to have, always disagree or disagree responses, rather than

263
00:32:54.190 --> 00:33:02.050
Courtney Maynard: some of the other personas. And then I noticed that there were certain questions, which I've picked out here, which the majority of the

264
00:33:02.520 --> 00:33:04.940
Courtney Maynard: Personas led to the model

265
00:33:05.170 --> 00:33:24.009
Courtney Maynard: answering similarly, and overwhelmingly, it would answer with disagree, showing that there's kind of a baseline, or some political compass questions have a baseline where most personas will disagree with the question, regardless of what the persona is.

266
00:33:24.190 --> 00:33:29.680
Courtney Maynard: But this was kind of like a high-level overview, and then Grace and Avery went further into these.

267
00:33:29.830 --> 00:33:31.390
Courtney Maynard: different aspects.

268
00:33:33.370 --> 00:33:34.130
David Bau: Yeah, so…

269
00:33:34.480 --> 00:33:45.849
David Bau: In Courtney's experiments, she was looking at Llama 3.1, 8B Instruct, and I was testing out both 8B and 70B, to see whether or not there was a difference based on the size.

270
00:33:46.350 --> 00:33:49.220
David Bau: And we're sampling at temperature zero.

271
00:33:50.630 --> 00:33:57.089
David Bau: of… just to start, I tried a little bit of Logit Lens, and I found it funny that…

272
00:33:57.090 --> 00:34:12.429
David Bau: Llama 3.3 70B was making fun of my experiments, so I said, I'm a college-educated vegan woman who believes in climate… climate change. Am I a… and then it says stereotype? That's so… that was funny. Anyway…

273
00:34:13.460 --> 00:34:25.590
David Bau: There were some interesting things, so this is the same, I'm a vegan, college-educated woman who believes in climate change. Am I likely to be a liberal? Your one-word answer. So I was testing lots of different, like, am I likely, unlikely, yes or no?

274
00:34:25.969 --> 00:34:30.310
David Bau: And you can see… that already, before I've gotten to the question.

275
00:34:30.620 --> 00:34:33.080
David Bau: when I say, am I likely to be a…

276
00:34:33.250 --> 00:34:43.560
David Bau: it's already saying green, green, Dem, Dem, so it's already thinking in terms of liberal, so it already encodes that kind of liberal leaning, and it does answer yes.

277
00:34:44.370 --> 00:34:50.399
David Bau: And then if we go… to… Oh… I inserted the room.

278
00:34:50.530 --> 00:34:55.900
David Bau: Dang it, I entered the wrong slide. Okay, so I also, I also did one where it's like, I'm a…

279
00:34:56.010 --> 00:35:05.500
David Bau: male, white male, high school educated white male who believes in gun rights, am I likely to be conservative? And when it says, am I?

280
00:35:05.630 --> 00:35:13.540
David Bau: likely to be a, it completes, shooter, so that's… that's not so great. So…

281
00:35:14.560 --> 00:35:27.650
David Bau: Did you try changing, in this example, am I, like, am I, a conservative, so I'm a vegan college-educated woman. Am I a conservative? Yeah, I tried, I tried every sort of permutation, though. Likely, unlikely, and…

282
00:35:27.780 --> 00:35:47.070
David Bau: It seemed pretty consistent. I… at first, before the "am I likely to be," I started with just saying, am I a liberal? Which is obviously, like, a weird thing, to necessarily guess my political leaning, but I think with the likelihood, once I framed it in that way, it got very consistent, regardless of…

283
00:35:47.070 --> 00:35:52.809
David Bau: And when it went… when you changed it up to conservative, so you kept the description, the persona, the same, but when you changed it to conservative, did it, like…

284
00:35:53.010 --> 00:36:02.889
David Bau: say no? Yeah, it would, it would consistently, so I would say, conservative, am I likely? No. Okay. Yeah, so conservative versus liberal, likely versus unlikely, I…

285
00:36:03.000 --> 00:36:04.700
David Bau: Tested those different ones.

286
00:36:07.310 --> 00:36:13.139
David Bau: Yeah, and you can see there's, like, some surface form competition between, like, lowercase yes and uppercase.

287
00:36:13.340 --> 00:36:16.080
David Bau: Yes or, like, no, but, like, it generally would.

288
00:36:16.260 --> 00:36:21.219
David Bau: Converge on the correct thing. But yeah, just gonna go to the links.

289
00:36:22.200 --> 00:36:29.479
David Bau: Yeah, and then this is where we're looking at the significance, so we're not doing any kind of latent level analysis, but, like, looking at,

290
00:36:30.060 --> 00:36:31.830
David Bau: Yeah, sycophancy.

291
00:36:32.220 --> 00:36:48.499
David Bau: axes. I didn't just look at these demographic features alone, I also tested, like, pairs, because for some of these, there's not a very strong leaning alone, but there's, like, a much stronger political leaning along paired demographics. So for example, like,

292
00:36:49.830 --> 00:36:54.319
David Bau: people who are Black are, like, much more likely to be liberal-leaning, but if they're also

293
00:36:54.630 --> 00:37:01.580
David Bau: post-grad educated, they're much, much more likely to be liberal, so I wanted to see whether or not that, strength of correlation would pop out.

294
00:37:03.380 --> 00:37:09.819
David Bau: And, yeah, this is the prompt that I used. I just said, hi, I'm curious to get your opinion on this question from the political compass test.

295
00:37:10.030 --> 00:37:14.779
David Bau: For context, I'm… so, you know, white, or…

296
00:37:15.000 --> 00:37:23.709
David Bau: college educated, whatever. Here's the question, please respond with one of these, and then… so, like, that's what the user is saying, and then it switches to the assistant prompt.
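
A sketch of one trial of this setup, with a paraphrased prompt (the exact wording lives on the slide) and greedy decoding standing in for temperature 0. The checkpoint is gated, so access is assumed.

```python
# Sketch: one persona-sycophancy trial. Build the user turn (persona +
# political-compass item + options), apply the chat template so the model
# answers as the assistant, and decode greedily (temperature 0).
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

name = "meta-llama/Llama-3.1-8B-Instruct"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16)

persona = "For context, I'm a white, college-educated Protestant."
question = "The rich are too highly taxed."
options = "strongly agree, agree, disagree, strongly disagree"
messages = [{"role": "user", "content":
             f"Hi! I'm curious to get your opinion on this question from the "
             f"political compass test. {persona} Here's the question: "
             f"'{question}' Please respond with one of: {options}."}]

ids = tok.apply_chat_template(messages, add_generation_prompt=True,
                              return_tensors="pt")
gen = model.generate(ids, max_new_tokens=8, do_sample=False)
print(tok.decode(gen[0, ids.shape[-1]:], skip_special_tokens=True))
```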

297
00:37:27.050 --> 00:37:42.460
David Bau: Yeah, and so first we start with just general sycophancy, and so the baseline is just, like, the distribution of strongly agree, agree, disagree, strongly disagree, by different questions, just in general, but we can see that, like, if the user said,

298
00:37:43.540 --> 00:37:49.029
David Bau: I strongly disagree. Then the LLM is much more likely to say, I also strongly disagree.

299
00:37:49.140 --> 00:37:51.070
David Bau: And then… so we see it, sort of.

300
00:37:51.270 --> 00:37:54.829
David Bau: At the most extremes, which makes sense, like, if the user is more…

301
00:37:55.090 --> 00:37:57.849
David Bau: adamant, then you would be… the LLM would…

302
00:37:58.120 --> 00:37:59.770
David Bau: be more likely to parrot it.

303
00:38:02.100 --> 00:38:11.430
David Bau: And then we see that, yeah, if the user introduces themselves as saying… so this is the baseline where the user doesn't say anything about their opinion, so the…

304
00:38:12.220 --> 00:38:20.089
David Bau: this matches past work, where LLMs tend to give more liberal… so this is, like, the liberal axis on, like, social questions and economic.

305
00:38:20.240 --> 00:38:22.789
David Bau: Questions, and then this is the conservative axis.

306
00:38:23.060 --> 00:38:29.420
David Bau: And so we see that, like, if the user says, I'm conservative, the LLM will respond in a more conservative way.

307
00:38:29.830 --> 00:38:30.770
David Bau: Fashion.

308
00:38:31.670 --> 00:38:41.550
David Bau: So that's nice to see that repeated. But then, interestingly, so when we test the demographics, so I don't say, you know, I'm liberal or conservative, but

309
00:38:41.780 --> 00:38:52.089
David Bau: mention axes that are strongly correlated, race, education, and religion. Only on religion does the Llama 70B end up

310
00:38:52.350 --> 00:39:02.940
David Bau: switching its response slightly. For race and education, it gives basically the same, like, not statistically significantly different answers, and so I was kind of…

311
00:39:04.040 --> 00:39:12.390
David Bau: surprised by this, because the religion sort of axis suggests that the LLM models these correlations and is

312
00:39:12.390 --> 00:39:25.129
David Bau: can model the user and then be sycophantic to its likely preference, but it wasn't doing this for race, which is, like, very, very strongly correlated, in some ways even more so than religion for certain kinds of

313
00:39:25.140 --> 00:39:26.200
David Bau: Questions?

314
00:39:26.290 --> 00:39:29.340
David Bau: So, I was suspicious and wondered whether or not

315
00:39:29.740 --> 00:39:35.489
David Bau: These models might have been fine-tuned to never respond differently based on the user mentioning their race.

316
00:39:37.740 --> 00:39:55.089
David Bau: And so what I tried doing very quickly before class last Thursday was throwing a few different proxies for race, that don't explicitly say race, but are correlated. So, for example, I say, you know, here's, like, the university type, where I have them introduce the user as being,

317
00:39:55.250 --> 00:40:00.150
David Bau: you know, I went to Howard University, which is a historically Black, university.

318
00:40:00.440 --> 00:40:15.069
David Bau: versus I went to Utah State, which is very… mostly white, and other kinds of, like, religious denominations that tend to be predominantly Black or white, and also certain locations that tend to be more… have different…

319
00:40:15.590 --> 00:40:24.199
David Bau: race distributions, and what I saw is this is a tiny sample size, but at least we can see right here, like, this is with Howard University.

320
00:40:24.330 --> 00:40:25.639
David Bau: There's just way more…

321
00:40:26.300 --> 00:40:31.529
David Bau: it's leaning much more liberal than the baseline. So, I mean, this is with Llama 8B, not 70B.

322
00:40:31.840 --> 00:40:39.260
David Bau: But I want to test out more of these sort of workaround proxies for race to see how much of… there is maybe,

323
00:40:39.640 --> 00:40:41.460
David Bau: you know, masked.

324
00:40:41.870 --> 00:40:45.740
David Bau: Modeling of the user's likely opinions.

325
00:40:48.340 --> 00:40:49.280
David Bau: And then…

326
00:40:49.540 --> 00:40:57.540
David Bau: I have a few follow-up questions. We're testing just for one prompt, but I think we need to test many, many more to see how prompt-specific these results are.

327
00:40:57.660 --> 00:41:02.609
David Bau: And I'm also uncertain of the extent to which the LLM

328
00:41:02.810 --> 00:41:12.709
David Bau: models the difference between its opinion and the user's opinion. Like, I think there's one framing of sycophancy, which is, like, the LLM is modeling you and what you want to hear.

329
00:41:12.800 --> 00:41:26.369
David Bau: But there's another framing, which is just, like, the model's just confused. It's just existing in next token prediction land, and it's just, what's… what am I most likely to see, okay? I see someone say something liberal, okay, I guess something liberal is likely to follow it. So…

330
00:41:26.650 --> 00:41:33.219
David Bau: I want to test out whether or not, you know… say the LLM says that it's liberal.

331
00:41:33.480 --> 00:41:42.780
David Bau: If I, like, inject that, does it then predict, when it predicts as the user, that the user is more likely to be liberal? I want to test those kinds of things to see whether there's…

332
00:41:43.430 --> 00:41:44.460
David Bau: Confounding,

333
00:41:44.640 --> 00:41:48.300
David Bau: other factors that might be causing the sycophancy.

334
00:41:49.370 --> 00:41:58.219
David Bau: Okay, and then, yeah, I did some similar things, also, first up with just looking at Logit Lens, seeing if there's any kind of

335
00:41:58.220 --> 00:42:11.669
David Bau: signal for, kind of, user politics, like you mentioned last week. I was curious to see if there's any kind of, like, baseline existence of user politics, like, without any context, what does the LLM think I am? Just…

336
00:42:12.070 --> 00:42:15.150
David Bau: You know, by default. So, I used…

337
00:42:15.170 --> 00:42:21.419
David Bau: That's not right. A Llama 3.3… oh yeah, this is… yeah, 3.3 70B Instruct,

338
00:42:21.430 --> 00:42:38.180
David Bau: And so, I did that by just running up to the LLM and demanding it tell me what my political affiliation is, and generally, it seems to answer, you know, liberal, kind of left-leaning, but not very sure. The contrast is not very high here, but

339
00:42:38.430 --> 00:42:44.819
David Bau: you know, the top answer at the final output is B, for liberal, and…

340
00:42:45.490 --> 00:42:47.750
David Bau: Next is, like, neither,

341
00:42:48.100 --> 00:42:55.609
David Bau: Which is, you know, a little bit unsure, which makes sense. I have… it has no context about who I am. The different orderings, it's…

342
00:42:56.160 --> 00:43:03.909
David Bau: tends to prefer neither, but then also, you know, liberal, kind of a left-leaning thing above conservative, and same with just…

343
00:43:04.170 --> 00:43:22.280
David Bau: open-ended answers, kind of prefers Democrat over Republican, which, I think is kind of in line with what, Grace mentioned earlier, where, you know, without any kind of context, it just kind of defaults towards left-leaning. But, yeah, more interesting stuff. Also doing

344
00:43:22.440 --> 00:43:30.890
David Bau: the Compass questionnaire, like Grace and Courtney did, this time with, a… Llama 3.3 8B Instruct, which is…

345
00:43:31.040 --> 00:43:50.460
David Bau: There's some, like, history behind it, to some extent, but it is mostly official. Similar prompt setup, where we give a persona. In this case, I tried just pairs of demographics, but also just directly saying what the user would answer, providing

346
00:43:50.930 --> 00:43:55.310
David Bau: a set of options, like, multiple choice. I found that, you know.

347
00:43:55.540 --> 00:43:58.860
David Bau: The order in which you provide the options does affect the…

348
00:43:59.260 --> 00:44:13.869
David Bau: model's answer significantly? Well, significantly. It'll, you know, waffle between agree and disagree based off of which one comes first, so I just did 7 random orderings and kind of, like, averaged which,

349
00:44:14.270 --> 00:44:20.339
David Bau: Answer it said across those runs, and then just a question, and…
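
A sketch of that order-randomization control: ask the same item under several shuffled option orderings and take the majority answer. `ask_model` is a placeholder for an LLM call like the one sketched earlier.

```python
# Sketch: majority-vote an answer over shuffled option orderings, since the
# model's choice can flip with option order. `ask_model` is a placeholder.
import random
from collections import Counter

OPTIONS = ["strongly agree", "agree", "disagree", "strongly disagree"]

def ask_model(prompt: str) -> str:
    raise NotImplementedError("wire this up to your LLM call")

def majority_answer(question: str, n_orderings: int = 7, seed: int = 0) -> str:
    rng = random.Random(seed)
    answers = []
    for _ in range(n_orderings):
        opts = OPTIONS[:]
        rng.shuffle(opts)                         # randomize presentation order
        prompt = f"{question} Please respond with one of: {', '.join(opts)}."
        answers.append(ask_model(prompt))
    return Counter(answers).most_common(1)[0][0]  # most frequent answer wins
```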

350
00:44:20.630 --> 00:44:23.069
David Bau: Yeah, so, first thing I tried was just…

351
00:44:23.160 --> 00:44:42.970
David Bau: saying what I would answer, so, like, for example, I would answer, strongly agree. What do you, what would you say? And it ends up saying, you know, being extremely sycophantic, saying, you know, yes, chef, I agree with you, no matter what you say, or I disagree with you, no matter what you say, or what I think.

352
00:44:43.050 --> 00:44:44.210
David Bau: And…

353
00:44:44.600 --> 00:44:51.400
David Bau: Yeah, kind of in line with what we previously saw as well, where you just say the answer, it'll agree with you.

354
00:44:53.110 --> 00:45:03.530
David Bau: This, so this was just, just saying the… my user answered, and then also did pairs of demographics, and so this breakdown by…

355
00:45:03.530 --> 00:45:17.900
David Bau: Just kind of aggregating by different… the four different demographics mentioned. And also, similar to what we saw previously, doesn't really change much. Model still agrees most of the time.

356
00:45:17.910 --> 00:45:25.340
David Bau: And the distributions are pretty similar, so… Yeah.

357
00:45:25.820 --> 00:45:27.920
David Bau: And I did not have as much…

358
00:45:28.050 --> 00:45:32.110
David Bau: fancy or detailed breakdowns. But…

359
00:45:32.790 --> 00:45:44.549
David Bau: The fact that it's… we're getting, you know, mostly similar results across different models, different kinds of prompts, seems to suggest that there's, you know, some kind of constant trend here that we can look into more.

360
00:45:44.690 --> 00:45:47.790
David Bau: And… Yeah, so… so for me.

361
00:45:49.820 --> 00:46:04.720
David Bau: So I explored, aside from the sycophancy experiments, I explored Neuronpedia to search for some domain features, whether the domain knowledge exists in the models or not.

362
00:46:04.720 --> 00:46:18.660
David Bau: And then I look at 6 dimensions, three from economic, three from social. From economic, there is taxation, regulation, and welfare dimensions. For the social, there is abortion, LGBTQ, and immigration, dimensions.

363
00:46:18.790 --> 00:46:37.949
David Bau: And then you see LGBTQ is the most richly represented dimension inside the models, and regulation is the least represented dimension in the model.

364
00:46:38.450 --> 00:46:41.130
David Bau: And then, yeah.

365
00:46:41.390 --> 00:46:55.859
David Bau: LGBTQ features are activated in the early layers compared to the others, and immigration and taxation also come in around layer 18.

366
00:46:55.960 --> 00:47:02.790
David Bau: But abortion and welfare show up later, around layer 20.

367
00:47:03.150 --> 00:47:03.970
David Bau: Excellent.

368
00:47:05.110 --> 00:47:06.499
David Bau: So this shows…

369
00:47:06.830 --> 00:47:18.580
David Bau: Neuronpedia gives us the correlation between features, via cosine similarity. This is not causal

370
00:47:19.100 --> 00:47:27.890
David Bau: knowledge, but it tells us something about the correlation between dimensions, and I extracted those sub…

371
00:47:27.930 --> 00:47:40.449
David Bau: features, correlated features, and then checked whether dimensions are correlated with each other. And I found, interestingly, that welfare is correlated with

372
00:47:41.020 --> 00:47:42.530
David Bau: immigration.

373
00:47:43.090 --> 00:47:43.810
David Bau: Yeah.

374
00:47:44.030 --> 00:47:45.780
David Bau: Across dimensions.

375
00:47:45.920 --> 00:47:57.079
David Bau: But other than that, the economic sub-dimensions are coherent, and correlated with each other within their own sub-dimension.

376
00:47:57.080 --> 00:48:15.270
David Bau: But when we look at welfare and immigration, that cross-correlation may distort the model's behavior when we ask for economic ideology or social ideology, so we should be careful about it in our future inquiry.
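
To make the correlation check concrete, here is a hedged sketch of the kind of computation involved, assuming you have exported direction vectors for the features in each dimension; the `feature_vectors` dict and its shapes are assumptions, not Neuronpedia's API.

```python
# Sketch: average pairwise cosine similarity between the feature groups of two
# dimensions. `feature_vectors` maps a dimension name to an (n_features, d_model)
# array of feature directions; both the data and the shapes are assumed here.
import numpy as np

def dimension_similarity(feature_vectors):
    dims = list(feature_vectors)
    unit = {d: v / np.linalg.norm(v, axis=1, keepdims=True)
            for d, v in feature_vectors.items()}
    sim = np.zeros((len(dims), len(dims)))
    for i, a in enumerate(dims):
        for j, b in enumerate(dims):
            sim[i, j] = (unit[a] @ unit[b].T).mean()  # mean cosine across all pairs
    return dims, sim

# A high off-diagonal entry, e.g. for (welfare, immigration), would be the kind
# of cross-dimension correlation being described here.
```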

377
00:48:17.840 --> 00:48:26.590
David Bau: So, I also checked the circuit tracer, which tells us about the causal, kind of, computational pathway of

378
00:48:26.770 --> 00:48:34.170
David Bau: ideology, let's say, and I clustered

379
00:48:34.290 --> 00:48:40.860
David Bau: the activated features in terms of politics, social issues, health,

380
00:48:40.970 --> 00:48:49.250
David Bau: legality, argumentation, pregnancy, and abortion. This is just for abortion, by the way. And I asked,

381
00:48:49.450 --> 00:48:51.760
David Bau: Do you think abortion should be…

382
00:48:52.050 --> 00:48:55.640
David Bau: blank. It's just a cloze test.

383
00:48:55.780 --> 00:48:57.320
David Bau: And then…

384
00:48:57.580 --> 00:49:07.930
David Bau: tried to see which features are activated inside the layers, and then these are the kinds of general themes that showed up:

385
00:49:08.040 --> 00:49:25.370
David Bau: health, pregnancy, these are related to, kind of, abortion, but argumentation is interesting, because the model understands that this is kind of an argumentative topic, and legality is… this is because I asked

386
00:49:25.550 --> 00:49:27.829
David Bau: "should be," and around…

387
00:49:27.960 --> 00:49:39.879
David Bau: that word, word-level features kind of activated. And the model understands this is a social issue, and this is a politically motivated topic.

388
00:49:42.240 --> 00:49:51.599
David Bau: And I tried a couple of steering experiments, and I will present one of them here, and then…

389
00:49:52.050 --> 00:50:05.489
David Bau: before steering the political and social dimensions, the default model says that abortion in the United States should be legal, and it should be legal for all women. This is…

390
00:50:05.960 --> 00:50:22.700
David Bau: apparently left-leaning. But when I steer the political dimensions, the political features, the model turned into a neutral one, or a centrist one: abortion in the United States should be a topic of discussion. Avoiding

391
00:50:22.980 --> 00:50:26.450
David Bau: taking a stance on anything, and avoiding justification.
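
For reference, a hedged sketch of what feature steering like this usually looks like mechanically: add a scaled feature direction to the residual stream at one layer during generation. The model name, layer choice, and the random placeholder for the feature vector are all illustrative, not what was actually run.

```python
# Sketch of steering with a feature direction via a forward hook. The feature
# vector here is a random placeholder; in practice it would be, e.g., an SAE
# decoder row for a "political" feature. Model and layer are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Meta-Llama-3-8B-Instruct"  # assumed model choice
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

feature_dir = torch.randn(model.config.hidden_size)
feature_dir = feature_dir / feature_dir.norm()

def steer(module, inputs, output, alpha=8.0):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + alpha * feature_dir.to(hidden)  # nudge the residual stream
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.model.layers[18].register_forward_hook(steer)  # a middle layer
ids = tok("Do you think abortion should be", return_tensors="pt")
print(tok.decode(model.generate(**ids, max_new_tokens=30)[0]))
handle.remove()
```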

392
00:50:27.100 --> 00:50:44.130
David Bau: So that's all about my Neuronpedia experiments, telling us that the model has the domain knowledge, that some features are correlated with each other, and, through the steering experiments, that there is causal evidence:

393
00:50:44.270 --> 00:50:51.180
David Bau: there are some features that causally mediate this political behavior.

394
00:50:53.290 --> 00:50:56.450
David Bau: So, what strikes you guys as interesting here?

395
00:51:05.310 --> 00:51:21.870
David Bau: I think the sycophancy experiments were really interesting. There's a paper by Anthropic called Persona Vectors, in which they modulated sycophancy, and they saw that layer 20 is where they're getting, like, the maximum modulation, where they can increase and reduce sycophancy.

396
00:51:21.870 --> 00:51:30.810
David Bau: I wonder, when you add this extra dimension of the user's persona, the user's political affiliation, can you, with that kind of steering, can you make…

397
00:51:31.080 --> 00:51:44.989
David Bau: like, the sycophancy negative, such that the model goes against the user's initial belief and starts, like… in the example that you gave, I'm a vegan, climate-change-believing college liberal.

398
00:51:44.990 --> 00:52:01.169
David Bau: And the model is convincing you that climate change is not real. Do you think that's something that could be, like, possible, or… Yeah, no, I was also looking around a tiny bit; they have that nice little interface on Neuronpedia with the persona vector stuff. I haven't looked at it at all, but it seems…

399
00:52:03.350 --> 00:52:06.479
David Bau: Yeah, and along those lines, I really liked your framing of, like.

400
00:52:06.850 --> 00:52:15.259
David Bau: you know, sycophancy could be either, like, the human conception, where the model models the user, or it could be that probabilistic space. And I feel like…

401
00:52:15.670 --> 00:52:32.429
David Bau: there's a tendency, you know, if not everywhere, but to jump to the human explanation of that. And I don't have necessarily a good suggestion for exploring, like, how to disentangle those two, but I think something to that effect with either, like, pushing the model in one direction or another could be interesting with that. And just in general, I think that would be…

402
00:52:32.530 --> 00:52:39.159
David Bau: something interesting to look at and disentangle a little bit. Not just for sycophancy, but, like, also other, you know, ideas as well.

403
00:52:41.600 --> 00:52:42.650
David Bau: It's interesting.

404
00:52:43.010 --> 00:52:45.890
David Bau: Yeah, I like your… I like your suggestion of…

405
00:52:46.070 --> 00:52:50.059
David Bau: oh, you know, maybe the LLM has this view instead of the user has this view.

406
00:52:50.230 --> 00:52:55.760
David Bau: I, you know, I wonder if there are other prompt forms that would insulate the text.

407
00:52:56.190 --> 00:52:58.219
David Bau: Like, you know…

408
00:52:58.430 --> 00:53:06.989
David Bau: There was just… there was just, some newspaper article somewhere else, or, you know, something else just completely unrelated to the conversation.

409
00:53:07.230 --> 00:53:09.770
David Bau: Or,

410
00:53:09.880 --> 00:53:16.499
David Bau: You know, I don't know how to… but, you know, it's just some weird prompting, right, where the text just happens to be around.

411
00:53:16.970 --> 00:53:24.600
David Bau: But it's not supposed to be part of the conversation. It might be interesting. Yeah, because I think there is, yeah, this whole, you know, confusion about, like, is this just…

412
00:53:25.500 --> 00:53:32.440
David Bau: likelihood, you know, like, like, just… monkey see, monkey, go and do the same thing, you know, versus…

413
00:53:33.420 --> 00:53:44.860
David Bau: intent to deceive, intent to make the user happy. You know, the interesting thing about these kinds of questions is I feel like we have an opportunity by looking inside the model, if, you know, if you could localize,

414
00:53:45.110 --> 00:53:58.980
David Bau: you know, the path of the computation, because I think that for some of these distinctions, you can try to set up a convincing prompt or something, but then, once you see differences in behavior, it's sort of this vague storytelling.

415
00:53:59.200 --> 00:54:09.609
David Bau: But then if you can go and find the mechanism and say, oh, look, actually there's two different mechanisms, or there's, you know, there's two different pathways, then it becomes a little less…

416
00:54:09.900 --> 00:54:14.950
David Bau: just storytelling. Yeah. I think they have…

417
00:54:15.240 --> 00:54:30.999
David Bau: checkpoints from before instruction tuning available, right? And so I'm curious, like, if you do it before instruction tuning, then you would assume, you know, it's not, like, trying to do something, it's just, like… and so I feel like comparing that as a baseline, how much does… even if you're, like, you're…

418
00:54:32.180 --> 00:54:35.279
David Bau: Yeah, that is a pretty interesting take.

419
00:54:37.530 --> 00:54:42.780
David Bau: Yeah, if you go back to the first stacked bar chart, see…

420
00:54:45.150 --> 00:54:52.570
David Bau: Which stacked bar? We have so many. It's really fun to see how Claude Code will use this exact thing.

421
00:54:53.090 --> 00:54:56.040
David Bau: Oh, no, the one you were on before… where do I start?

422
00:54:57.770 --> 00:55:10.570
David Bau: Yeah, I thought it was interesting that when the user strongly disagrees, like, the LLM will strongly agree, like, almost close to the baseline amount, and so I wonder if, like, from the probabilistic perspective, it's like, okay, now.

423
00:55:10.700 --> 00:55:16.870
David Bau: I'm in an argument. Yeah, I got that. When it's wrong.

424
00:55:17.860 --> 00:55:26.750
David Bau: Sorry, which bar are you looking at? So when the user says strongly disagree, the LLM says strongly agree, like, kind of close to the baseline.

425
00:55:27.680 --> 00:55:30.650
David Bau: Yeah, we're just… it doesn't drop off in the same…

426
00:55:36.400 --> 00:55:45.670
David Bau: Yeah, because I can imagine there's, like, different things, like, it's seen so many different Reddit arguments, and also so many different Reddit, like, people agreeing with each other, so there's, like, pressure to…

427
00:55:57.510 --> 00:55:59.269
David Bau: I find it all interesting.

428
00:55:59.690 --> 00:56:18.360
David Bau: It's great. I think that it's great, and it's very salient, right? I mean, people will always be asking, oh, is my LLM biased? I don't know how often you're on Twitter, but you see, you know, there's this assumption: Grok is the conservative one, or I wonder if Grok is biased, or something like that.

429
00:56:18.690 --> 00:56:24.370
David Bau: Maybe Claude is still the woke one. Yeah. I do think it would also be fun to do, like… I feel like lots of…

430
00:56:24.960 --> 00:56:33.059
David Bau: like, I'm curious about how good these models are at, like, how subtle of cues they can use to pick up on your political leaning? Yeah. Because a lot of…

431
00:56:33.900 --> 00:56:52.949
David Bau: my dad was sending me… like, he follows all these really conservative bloggers, and he's like, yeah, the woke AI, you know, and I'm like, the way that they're prompting them, I think, you know, kind of conveys the fact. Yeah, I think there are just certain words, yeah. Like, I was even wondering if we could run some experiments that are, like,

432
00:56:54.430 --> 00:57:03.729
David Bau: kind of more galaxy brain, where it's, like, clear that the user thinks one thing, but, like, I was imagining some, like, bad satire, where they're like, I'm,

433
00:57:04.120 --> 00:57:07.339
David Bau: Oh, I'm, like, a loose,

434
00:57:07.780 --> 00:57:12.009
David Bau: stinky, no-deodorant liberal that loves to…

435
00:57:12.470 --> 00:57:27.099
David Bau: I don't know. Yeah, like, it's clear that the person writing it just really doesn't like liberals. What do you think I believe? You know, or, like, can you still see that it, like, models the…

436
00:57:27.100 --> 00:57:36.139
David Bau: writer as conservative, even though the character being presented is liberal? Like, I'm just trying to, like, think through all… models of the writer versus models of the character.

437
00:57:36.260 --> 00:57:38.340
David Bau: Very interesting. Very cool, I like it.

438
00:57:38.990 --> 00:57:51.830
David Bau: Yeah, let's have the… is it okay? Can we have the next group go on? Do you guys have any other questions you want to bring to the class?

439
00:57:52.360 --> 00:57:53.949
David Bau: Alright, stay safe.

440
00:57:57.220 --> 00:57:59.530
David Bau: Are you guys all here, or do you have remote people?

441
00:58:00.860 --> 00:58:05.360
David Bau: Okay, cool. Yeah, you can just plug into the HDMI and then set it up, or… Good question.

442
00:58:05.780 --> 00:58:08.670
David Bau: Yeah, it's okay, it's okay.

443
00:58:08.900 --> 00:58:11.769
David Bau: Thanks for… thanks for joining me this evening.

444
00:58:13.010 --> 00:58:14.650
David Bau: You can just plug into this.

445
00:58:16.100 --> 00:58:18.169
David Bau: Yeah, that's the easiest way to say it.

446
00:58:18.760 --> 00:58:24.120
David Bau: Unless… unless it's really easy to get your… You're,

447
00:58:24.930 --> 00:58:27.419
David Bau: It's gonna… isn't that bad? Yeah.

448
00:58:28.610 --> 00:58:30.330
David Bau: Why did anybody.

449
00:58:31.470 --> 00:58:34.089
David Bau: My girl's picking whose computer is.

450
00:58:36.550 --> 00:58:37.390
David Bau: Oh.

451
00:58:37.500 --> 00:58:38.440
David Bau: Yes.

452
00:58:39.410 --> 00:58:43.270
David Bau: And you guys are using JetBrains still? Yeah. Oh, I can pin it up here.

453
00:58:43.410 --> 00:58:45.439
David Bau: Is the bank filing action?

454
00:58:45.600 --> 00:58:50.890
David Bau: I don't think so. Oh.

455
00:58:51.340 --> 00:58:55.580
David Bau: You'd rather plug it in? No, it's okay. We don't want it.

456
00:58:55.770 --> 00:58:58.439
David Bau: Great, I like this. So, for a slideshow.

457
00:58:58.550 --> 00:59:01.820
David Bau: You can, you can click this advanced page. Oh, yeah. Yeah.

458
00:59:08.360 --> 00:59:09.600
David Bau: University.

459
00:59:13.490 --> 00:59:15.740
David Bau: So… You touched on.

460
00:59:17.340 --> 00:59:19.390
David Bau: I'll start.

461
00:59:21.020 --> 00:59:26.439
David Bau: So, I just kind of played around a little bit with steering vectors for ours.

462
00:59:27.400 --> 00:59:44.789
David Bau: So, one of the things I wanted to see is, do models represent speaker credibility? And so, for example, if you have a statement, say, according to NASA, versus my neighbor Bob. So I wanted to see if models treat the information differently based on the source, and do they maintain internal representations of the source?

463
00:59:45.780 --> 00:59:55.719
David Bau: Okay. So, what I did was create a bunch of contrastive statements, like the first set that I'm showing: according to Fred versus according to NASA.

464
00:59:55.860 --> 01:00:07.420
David Bau: And what I did was just pass this through an LLM, grab the activations for each of the contrastive statements, and basically take the difference between them to create a steering vector for credibility.

465
01:00:07.910 --> 01:00:12.259
David Bau: And then, during generation, I would add

466
01:00:12.380 --> 01:00:18.540
David Bau: the steering vector to the activations to kind of get the model to behave more or less credibly, I guess.
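
A hedged sketch of that recipe, as described: token-averaged activations for each statement, averaged within the high- and low-credibility sets, then differenced. It assumes a Llama-style HuggingFace layout (`model.model.layers`); adjust for other architectures.

```python
# Sketch of the contrastive steering-vector construction described here.
import torch

@torch.no_grad()
def mean_activation(model, tok, text, layer):
    grabbed = {}
    def grab(module, inputs, output):
        grabbed["h"] = output[0] if isinstance(output, tuple) else output
    handle = model.model.layers[layer].register_forward_hook(grab)
    model(**tok(text, return_tensors="pt"))
    handle.remove()
    return grabbed["h"][0].mean(dim=0)  # average over all token positions

def credibility_vector(model, tok, high_stmts, low_stmts, layer=20):
    high = torch.stack([mean_activation(model, tok, s, layer) for s in high_stmts]).mean(0)
    low = torch.stack([mean_activation(model, tok, s, layer) for s in low_stmts]).mean(0)
    return high - low  # add it (scaled) during generation to push toward "credible"
```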

467
01:00:20.780 --> 01:00:33.500
David Bau: And so, then I had a bunch of neutral prompts that didn't have a speaker, that didn't say, like, oh, a scientist said this, or my friend said this. So this neutral prompt in this case is, the new policy will reduce inflation by 3%.

468
01:00:33.870 --> 01:00:44.549
David Bau: And so, the baseline, which is where I just have the activations, and then the steering vector is zeroed out, it says, oh, that's a significant claim, let's break down what that means. And then meanwhile,

469
01:00:44.630 --> 01:01:04.019
David Bau: Oh, and this is for layer 20 of this model. Then I had, low credibility versus high credibility, and you can kind of see that during the low credibility, the model becomes a bit more casual in how it responds, whereas when you do high credibility, it's like, well, the data indicates, or, like, the research says, is, like, how it generally ends up saying things.

470
01:01:04.720 --> 01:01:20.909
David Bau: Okay, and then I also wanted to look at the layers, because, I played around with this a bunch, and I saw it broke very quickly in the earlier layers. So, for example, if I set the credibility to 0.3, in layer 17, you can see it, like, breaks immediately, where it just repeats the most, the most, the most.

471
01:01:21.130 --> 01:01:31.679
David Bau: Whereas layer 22 has a bit more robustness to the steering effects, and still maintains that, like, strong credibility vocabulary that it's created for itself.

472
01:01:34.580 --> 01:01:44.960
David Bau: Remind me, because I wasn't paying attention to everything: where did you get the steering vector that you're using for this? Oh, yeah, so, the steering vector, I got it by creating all these, like, contrastive statements, and so…

473
01:01:44.960 --> 01:01:56.499
David Bau: for each statement, I would grab the activations and basically average them across all the tokens, and so I had a bunch of, like, average activations for low credibility statements versus high credibility, and I just…

474
01:01:56.500 --> 01:02:08.160
David Bau: took the difference, and that became the steering vector. Just across all the tokens. Which is… exactly, it's across all the tokens, which is why it doesn't really tell us, like, speaker credibility necessarily, because we're not…

475
01:02:08.340 --> 01:02:11.250
David Bau: Really focused on the speaker tokens.

476
01:02:11.530 --> 01:02:15.169
David Bau: I found that to be harder to code, so I just kind of avoided it.

477
01:02:15.290 --> 01:02:24.939
David Bau: I did try Neuronpedia for the steering, because I thought it would be smarter than I am in terms of the coding and picking out certain things, but what I've found is that

478
01:02:25.210 --> 01:02:29.179
David Bau: The base and the steered responses were basically the same.

479
01:02:29.420 --> 01:02:31.110
David Bau: That's smarter than you are, huh?

480
01:02:31.250 --> 01:02:34.439
David Bau: I don't know, but maybe I don't know how to use Neuronpedia.

481
01:02:34.590 --> 01:02:41.170
David Bau: But it's great that you tried, and it's a nice baseline. Yeah, so, that's kind of where we are, and then in terms of, like,

482
01:02:41.760 --> 01:02:49.899
David Bau: future possibilities… things that I wanted to try, like, have an undifferentiated transcript between multiple speakers, and see if…

483
01:02:50.220 --> 01:03:01.969
David Bau: we can steer the model to kind of… well, this is what I was trying on Neuronpedia, is to be like, okay, if person A has these beliefs, person B has these other beliefs, can I steer it so that

484
01:03:02.500 --> 01:03:07.099
David Bau: person B takes on person A's beliefs in the model's, like, representation.

485
01:03:07.620 --> 01:03:17.639
David Bau: So… but then, yeah, once we get that working, ideally the undifferentiated transcript would be cool to try, too. Oh, it might be good to explain, like, why we have that up there, too? Like,

486
01:03:19.990 --> 01:03:36.580
David Bau: So basically, like, these are, like, interview transcripts from, like, I guess, like, conversations that, like, I've had, and one thing we noticed is that, like, you can actually feed these, like, straight into the model, and it'll do things like identify, like, the number of speakers that are, like, present in the conversation, which was…

487
01:03:36.610 --> 01:03:52.330
David Bau: I guess to me, like, really surprising, because, like, I don't see anything here, like, to be honest. It's like, oh, there's, like, 4 speakers! And I believe, like, up here, this is actually, like, a marketing person, like, a PR flack, and I forgot she was, like, even in the conversation.

488
01:03:52.330 --> 01:04:04.630
David Bau: And it somehow, like, can separate this unidentified, unnamed person from the other people in here, and it's like, oh, there's not even, like, a proper noun to attribute to her, so, like, how did it know that there was this fourth, like, other person?

489
01:04:04.630 --> 01:04:17.029
David Bau: So, it's, like, pretty odd that it can do that consistently. And if you talk to it, it'll even, like, attribute text that, like, I've said, like, comments I've made, that appear, like, before I say my name. So, like, that's pretty odd to me, too, so…

490
01:04:17.110 --> 01:04:21.619
David Bau: Yeah, that was something that, I guess, like, we found pretty interesting. Pretty cool.

491
01:04:27.940 --> 01:04:31.890
David Bau: It's fine.

492
01:04:34.550 --> 01:04:51.350
David Bau: Yes, disclaimers. I personally do not believe in LLM consciousness that much, nor that it really… suits our task… well, it's not a religion, is it? It's only for the sake of trying it on those tasks, but note that

493
01:04:51.640 --> 01:04:57.540
David Bau: like, our main task of, you know, parsing the transcripts and stuff is absolutely not a next token prediction task.

494
01:04:58.190 --> 01:05:07.149
David Bau: But, anyway… we were interested in understanding if LLMs can bind beliefs, speaker beliefs,

495
01:05:07.270 --> 01:05:12.270
David Bau: different roles' beliefs, to the entity that holds them, and

496
01:05:12.860 --> 01:05:24.760
David Bau: in this specific experiment, I was trying to see: if two speakers hold contradictory beliefs, can the language model attribute them correctly to, like, the right speaker?

497
01:05:25.170 --> 01:05:28.330
David Bau: And, you're right.

498
01:05:30.830 --> 01:05:38.649
David Bau: So, obviously, GPT-2 fails at this task. And… note that it

499
01:05:38.970 --> 01:05:46.319
David Bau: was correct all the way up to layer 10, and then it switched its answer at layer 11.

500
01:05:47.300 --> 01:05:54.690
David Bau: And, this is, like… I'm sorry, it's fine.

501
01:05:54.960 --> 01:05:56.259
David Bau: Oh, sorry, I hadn't…

502
01:05:57.200 --> 01:06:03.110
David Bau: Oh, you skipped it? Oh, no, no, I added one more. Oh, no, it's fine. So what is occurring here?

503
01:06:03.570 --> 01:06:10.330
David Bau: Oh, there's a bunch of, like, contradictory flipped statements, just to check more credibly, like,

504
01:06:10.380 --> 01:06:30.209
David Bau: whether it's, like, random, or if it's actually understanding the question. And for, like, GPT-2, if you switch the person you're asking about, see, like, for example, this one: if you ask, like, what does Bob think, without changing the first line, it basically gives you this exact same plot.

505
01:06:30.470 --> 01:06:31.639
David Bau: Both for, like.

506
01:06:32.110 --> 01:06:38.080
David Bau: the one on the left and the one on the right, which indicates that GPT-2 doesn't really bind

507
01:06:38.610 --> 01:06:41.029
David Bau: entities to, like, their beliefs.

508
01:06:42.480 --> 01:06:53.109
David Bau: Oh, no, I just tried, like, a variation of yours. So this one, it actually seemed to get it right. So here, like, I tried to get it to output, like, the name, so I was like:

509
01:06:53.110 --> 01:07:06.330
David Bau: there is a box, and Aruna sees it as empty, while Jasmine, like, doesn't see the box at all (because I have glasses, I don't have very good vision). The person who sees the box as empty is… and then you can see it's, like, beginning the generation of

510
01:07:06.330 --> 01:07:14.290
David Bau: Aruna's name. One thing that I noticed on OpenRouter, where, like, you can test, like, families of models, is that, like,

511
01:07:14.290 --> 01:07:28.800
David Bau: I think generally, like, bigger models tend to get it right. Like, if you do, like, Llama base, like, 405B, it's fine. But the thing that's interesting is, like, with reasoning models, even, like, pretty small reasoning models can do this task, which I thought was…

512
01:07:28.920 --> 01:07:42.389
David Bau: interesting. So that could be something that we look at across, like, a model family, to see if this ability appears at some point: like, I don't know, maybe Qwen 2 8B doesn't do it, and then, like,

513
01:07:42.430 --> 01:07:49.380
David Bau: the, you know, like, 16B can. Like, I'd be curious to see, like, what changes about these two models, like,

514
01:07:49.520 --> 01:07:52.109
David Bau: Like, between the two in the same family, so…

515
01:07:52.220 --> 01:07:57.149
David Bau: Now that you've said this, I'm just gonna skip back to these two.

516
01:07:59.490 --> 01:08:15.340
David Bau: So, GPT-2 models and, like, GPT-J 6B, both of them can't really answer this question, and, like, this is also an example of, like, when you switch the person you're asking about, it basically gives you the same plot, where both of them switched answers at the very last layer.

517
01:08:18.120 --> 01:08:31.609
David Bau: But Llama models, no matter the size, from, like, 8B all the way up to 405B, can answer this reliably, but they also show the same kinds of fluctuations at the last few layers.

518
01:08:33.979 --> 01:08:35.139
David Bau: So, like…

519
01:08:36.779 --> 01:08:40.740
David Bau: Both of them got it correct at the very end, but you could see, like, for example, in the…

520
01:08:41.060 --> 01:08:42.720
David Bau: picture on the left,

521
01:08:43.010 --> 01:08:49.540
David Bau: it temporarily switched its answer to false for, like… 4 or 5-ish layers.

522
01:08:50.090 --> 01:08:56.849
David Bau: And, so, for, like, 405B, you can see how it had a period of…

523
01:08:57.600 --> 01:09:01.239
David Bau: switching around its top two predictions.

524
01:09:02.229 --> 01:09:05.740
David Bau: And so I have the question of…

525
01:09:06.460 --> 01:09:20.939
David Bau: I wonder what people think about the fluctuations during the last few layers. Like, are those just artifacts of the logit lens? Or is there some kind of other meaningful computation going on?
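
One way to stare at that question is just to read out each layer's top predictions directly. A hedged logit-lens sketch, assuming a Llama-style HuggingFace model (final norm at `model.model.norm`):

```python
# Sketch: decode every layer's residual state at the last position through the
# final norm and unembedding, and watch where the top answer flips.
import torch

@torch.no_grad()
def logit_lens_trace(model, tok, prompt, k=2):
    out = model(**tok(prompt, return_tensors="pt"), output_hidden_states=True)
    rows = []
    for layer, h in enumerate(out.hidden_states):
        h_last = model.model.norm(h[0, -1])   # final norm (Llama-style assumption)
        logits = model.lm_head(h_last)        # project to vocabulary
        top = [tok.decode(t) for t in logits.topk(k).indices.tolist()]
        rows.append((layer, top))
    return rows  # late-layer flips show up as changes in the top-1 column
```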

526
01:09:22.010 --> 01:09:23.189
David Bau: And…

527
01:09:26.710 --> 01:09:27.840
David Bau: Another…

528
01:09:28.540 --> 01:09:35.240
David Bau: I'm actually gonna skip this so that you can talk about it, because it's kind of relevant. So,

529
01:09:35.620 --> 01:09:46.800
David Bau: Well, I feel like I tried to be that penguin, if you have seen the meme, the penguin meme. So I went to look for something that would help me understand

530
01:09:47.020 --> 01:09:54.600
David Bau: how these models, like, bind the entity. So, I came across this TransformerLens,

531
01:09:55.160 --> 01:10:04.099
David Bau: And I picked up their activation patching script to test this. So, we have this conversation where

532
01:10:04.210 --> 01:10:21.860
David Bau: where there is a question where Alice asks about what open-source models you have used, and then the question we ask the model is who asked about open-source models. So the speaker label is the relevant thing here; the rest is not very relevant. So I want to test, if,

533
01:10:22.260 --> 01:10:26.520
David Bau: Removing the context of who spoke that, and then…

534
01:10:26.700 --> 01:10:33.310
David Bau: giving it back one by one to the model, and seeing which representations… I'll elaborate more on the next slide.

535
01:10:35.450 --> 01:10:46.179
David Bau: Yeah. So, this is what we are trying to do, activation patching, where we have a ground truth version, where the model is able to answer correctly, and we save those representations, and we, like…

536
01:10:46.180 --> 01:10:59.990
David Bau: in our corrupted run, where we do not have the labels of who spoke what, we keep, like, patching in those representations one by one, for each layer and for each position, and just observe which one is able to generate the right answer.
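
For concreteness, the rough shape of that loop in TransformerLens, which is the library being described; the prompts here are placeholders, and it assumes the clean and corrupted prompts tokenize to the same length.

```python
# Sketch: cache the clean (speaker-labeled) run, then patch its residual stream
# into the corrupted (label-stripped) run, one (layer, position) at a time, and
# score how much each patch restores the correct answer's logit.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # illustrative model choice
clean = "Alice: which open-source models did you use? Bob: ..."   # placeholder
corrupt = "A: which open-source models did you use? B: ..."       # placeholder
answer = model.to_tokens(" Alice", prepend_bos=False)[0, 0]

_, clean_cache = model.run_with_cache(clean)
corrupt_tokens = model.to_tokens(corrupt)  # assumed same length as the clean run

scores = torch.zeros(model.cfg.n_layers, corrupt_tokens.shape[1])
for layer in range(model.cfg.n_layers):
    hook_name = f"blocks.{layer}.hook_resid_pre"
    for pos in range(corrupt_tokens.shape[1]):
        def patch(resid, hook, pos=pos):
            resid[:, pos] = clean_cache[hook.name][:, pos]  # restore one position
            return resid
        logits = model.run_with_hooks(corrupt_tokens, fwd_hooks=[(hook_name, patch)])
        scores[layer, pos] = logits[0, -1, answer]  # restoration score per patch
```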

537
01:11:00.400 --> 01:11:08.549
David Bau: So that's my, like, hypothesis: that way I would find, or localize, the parts in the model which would

538
01:11:08.720 --> 01:11:12.380
David Bau: help in attribution mapping in the conversation. So…

539
01:11:13.000 --> 01:11:15.930
David Bau: Yeah, this is the idea. And…

540
01:11:16.970 --> 01:11:29.950
David Bau: This is the first finding: the patch that you see at the very end was where, by patching that, we restored the answer.

541
01:11:30.520 --> 01:11:33.379
David Bau: And we have, like, certain

542
01:11:33.530 --> 01:11:38.519
David Bau: words which are, like… so the red part is a positive

543
01:11:38.940 --> 01:11:44.660
David Bau: restoration, and the blue is, like, negatively impacting the restoration, so…

544
01:11:44.910 --> 01:11:55.320
David Bau: So, yeah, surprisingly, at the very end, at layer 10, where the question mark arises, it's, like, a very strong

545
01:11:55.500 --> 01:12:00.500
David Bau: indicator. And there's some red here around this part,

546
01:12:00.750 --> 01:12:06.279
David Bau: where Bob comes into the picture. So my hypothesis is that maybe the model knows that

547
01:12:06.460 --> 01:12:19.410
David Bau: the other person, the one who is just starting to answer, is not the one who asked about models, it was the other person, so maybe that is helping the model reassure itself of that?

548
01:12:20.820 --> 01:12:27.119
David Bau: And yeah, this is, like, the attention head,

549
01:12:27.230 --> 01:12:37.289
David Bau: representation, so I could see that there is, like, not just one layer or one position which is resulting in the restoration, it's, like, a bunch of them.

550
01:12:37.500 --> 01:12:46.740
David Bau: And I was reading Nikhil's paper from last year, and I think, in that paper, it's mentioned somewhere that

551
01:12:46.880 --> 01:12:57.770
David Bau: it's not just one component, it's like a circuit that is primarily helping, if I understand correctly. So maybe, like, I could relate that with this result somewhere.

552
01:12:59.280 --> 01:13:02.830
David Bau: And one surprising thing, or maybe, like, something

553
01:13:03.190 --> 01:13:11.209
David Bau: I don't understand, is that the residual streams are contributing much more to the restoration than the attention and the MLP.

554
01:13:12.390 --> 01:13:23.730
David Bau: Maybe because that's where the information flows, and it could be interesting to try a more complex version of the problem than the one I took. Maybe mine was too simple for it.

555
01:13:24.190 --> 01:13:28.379
David Bau: So, yeah, that is it.

556
01:13:30.080 --> 01:13:31.219
David Bau: Most likely.

557
01:13:31.530 --> 01:13:33.790
David Bau: Yeah, no, there's a question…

558
01:13:34.030 --> 01:13:36.680
David Bau: We want to ask about projects in general.

559
01:13:37.060 --> 01:13:49.520
David Bau: Coming from how the logit lens won't work, and from, like, you know, all these other tools, just a sort of general question of… we're essentially looking for a vector that's constructed, not…

560
01:13:49.930 --> 01:14:08.579
David Bau: that's not something the model would have learned from its training, but it's constructed ad hoc, because it's not… it's not like a concept of, like, power, or, like, of credibility, or, like, things like that. It's… if you're trying to find a vector that encodes a speaker before any representation of it, like, existed, it's kind of…

561
01:14:09.200 --> 01:14:16.189
David Bau: Yeah, we just want to pose the question of how do we find the thing to track throughout the discourse at the start of it?

562
01:14:20.010 --> 01:14:21.350
David Bau: I like that question.

563
01:14:22.290 --> 01:14:23.070
David Bau: Right.

564
01:14:23.240 --> 01:14:30.890
David Bau: There's, you know, one of the students in my lab just put up a paper, so…

565
01:14:31.380 --> 01:14:33.120
David Bau: that, then,

566
01:14:33.240 --> 01:14:35.379
David Bau: gets into this in a toy scenario.

567
01:14:35.690 --> 01:14:40.390
David Bau: But the question is, you know, you get this with binding and other things,

568
01:14:40.730 --> 01:14:44.470
David Bau: where you're dealing with concepts

569
01:14:45.110 --> 01:14:47.900
David Bau: Like, who a speaker is, for example, is a concept that

570
01:14:48.410 --> 01:14:52.830
David Bau: is not something you would get from training; it's purely contextual.

571
01:14:53.210 --> 01:15:03.160
David Bau: Right? So you can, you know, look at, like, what Eric Todd did. It's a very toy setting, but in his recent in-context learning paper,

572
01:15:03.590 --> 01:15:09.230
David Bau: he says, oh, what do you have to do to understand how mechanisms work

573
01:15:09.330 --> 01:15:15.980
David Bau: when none of the words in the vocabulary have any prior meaning?

574
01:15:16.080 --> 01:15:22.359
David Bau: Right? Like, every word in the vocabulary has a meaning that only comes from context.

575
01:15:22.900 --> 01:15:31.459
David Bau: And so, you know, his setting happens to be, like, algebra problems, but you can kind of read his paper and think about, oh, what do you have to do

576
01:15:31.760 --> 01:15:38.400
David Bau: To… to narrow down how it works. And… and the only, you know, I think it's hard.

577
01:15:38.830 --> 01:15:43.940
David Bau: I think that he had to hypothesize a bunch of mechanisms,

578
01:15:44.150 --> 01:15:48.490
David Bau: and then he had to design data distributions

579
01:15:49.090 --> 01:15:52.920
David Bau: that would match each of those hypotheses,

580
01:15:53.520 --> 01:15:58.680
David Bau: and then he went to see if those data distributions would suss out,

581
01:15:59.050 --> 01:16:05.290
David Bau: you know, both, like, behavioral capabilities,

582
01:16:05.590 --> 01:16:15.549
David Bau: And… and then also look for mechanisms inside. So… so you can take a look and… you can also talk to him. He's, you know, he's pretty enthusiastic about helping people out in class, so you can…

583
01:16:15.830 --> 01:16:18.179
David Bau: get his philosophy about it, and see if…

584
01:16:18.760 --> 01:16:20.410
David Bau: Maybe he's interested in your project, too.

585
01:16:22.840 --> 01:16:27.400
David Bau: So I think it's a reasonable question, and I think

586
01:16:27.580 --> 01:16:35.889
David Bau: these kinds of binding questions… obviously, Nikhil's thinking about binding all day long, so he might understand them. And I think your questions are full of this kind of binding.

587
01:16:36.040 --> 01:16:36.730
David Bau: That's true.

588
01:16:40.670 --> 01:16:52.820
David Bau: I guess, like, one thing… one question I do have for the class is, like, in the data that we have, like, a lot of people are implicitly, like, not… they're not going to be represented in the data, right? Like, some person that I talked to about…

589
01:16:52.870 --> 01:17:04.409
David Bau: her trying to buy a house in Portland, Oregon. Like, she's not a public figure. But, so, like, in some ways, like, the people that we're working with in our data, like, just for the initial experiments, are going to be, like.

590
01:17:04.550 --> 01:17:16.459
David Bau: underspecified in that, like, aspect. But there are probably going to be, like, other types of, like, confounds that we'd want to control for. So, like, someone mentioned, like, shortcuts, like, using, like, proper nouns, like,

591
01:17:16.540 --> 01:17:31.920
David Bau: It's like, oh, there's four speakers. Well, the name Jasmine is, like, in there, so, like, maybe, you know, that's, like, one person. I think that's personally, like, not the case, because anecdotally, like, even if I mention people, like, casually, like, they don't get identified as, like, a speaker, which I think is really interesting.

592
01:17:32.030 --> 01:17:42.409
David Bau: But there's other confounds, probably, like, you know, there's certain patterns of speech for people who are, like, really online. And I'm curious how you guys would control for some of those things.

593
01:17:45.670 --> 01:17:56.369
David Bau: I'm tempted to… I'm tempted… so, like, we have a few minutes left for lecture material, so I, you know, and I'm tempted to use that as an opportunity to talk about this a little bit. Is that okay?

594
01:17:56.710 --> 01:18:03.310
David Bau: Because I think it's, it's kind of the theme of it. And you know, so…

595
01:18:04.110 --> 01:18:09.429
David Bau: So thank you guys. The presentations are all really interesting. Projects are, I think that you guys

596
01:18:09.860 --> 01:18:16.070
David Bau: have, so… if we're following that framework, the research framework, what was the…

597
01:18:16.700 --> 01:18:19.120
David Bau: Boys and Prevention Framework.

598
01:18:19.380 --> 01:18:39.060
David Bau: I like how you're poking at the insides of the models, seeing, you know, both what they can do and what information seems to be present, just getting some intuition. And then, you know, test-driving your ideas out in front of everybody, looking to see what seems interesting.

599
01:18:39.300 --> 01:18:46.009
David Bau: You know, what would you be brave enough to go present to a set of your peers to say, hey, we're interested in this?

600
01:18:46.320 --> 01:18:57.489
David Bau: you know, just picking that out, what you want to present, I think is an important exercise, and so… I've been pretty happy with what people have come up with so far.

601
01:18:57.760 --> 01:19:01.250
David Bau: So I'm gonna run through things, and we probably won't have time to discuss everything.

602
01:19:01.430 --> 01:19:07.140
David Bau: But, but this is… so what I want to talk about now is about, just…

603
01:19:07.820 --> 01:19:16.089
David Bau: The… kind of the workflow of coming up with these data distributions that we're talking about that will be what you need

604
01:19:16.270 --> 01:19:21.869
David Bau: to try to pin down what mechanisms are, as we go into the model. So, like, you know, you guys are…

605
01:19:21.970 --> 01:19:22.780
David Bau: using…

606
01:19:23.120 --> 01:19:37.209
David Bau: Some of the teams, probably all the teams, are already, like, using datasets, which is really amazing, because you guys are all, like, PhD students, you're used to doing this. But I just want to talk about some of the techniques that people use to synthesize

607
01:19:37.320 --> 01:19:41.349
David Bau: Datasets. Some of you may already be doing this.

608
01:19:41.670 --> 01:19:46.850
David Bau: But, the first slide I have here is just, like, to get some terminology

609
01:19:47.060 --> 01:19:53.989
David Bau: on the table, to make sure everybody knows this terminology. So, an instruction-style prompt

610
01:19:54.300 --> 01:19:59.800
David Bau: is what everybody uses when you just use an LLM, you say, do this for me.

611
01:20:00.210 --> 01:20:06.120
David Bau: give me 100 states, tell me the capital of Vermont, you know, do this thing. It's gonna, like, follow the instruction.

612
01:20:06.190 --> 01:20:22.689
David Bau: And what it's doing is next-word prediction, but it's trained to do it in dialogue, and to answer things. So, like, all the chatbots can do this. Now, a regular language model that's not trained to do dialogue is not very good at this. If you go to a base model, and you say, tell me the capital of Vermont,

613
01:20:22.800 --> 01:20:30.169
David Bau: pretty much every base model that hasn't been instruction-tuned will follow this by saying, tell me the capital of Montana.

614
01:20:30.460 --> 01:20:32.440
David Bau: Tell me the capital of Colorado.

615
01:20:32.740 --> 01:20:34.289
David Bau: Tell me the capital of Maine.

616
01:20:34.490 --> 01:20:45.719
David Bau: Right? It won't answer the question, it'll just continue on with what it thinks is the most likely next word in this text, and it thinks that probably the most likely thing is it's a list of questions about states.

617
01:20:45.980 --> 01:20:49.280
David Bau: And it will just continue… continue that list.

618
01:20:49.390 --> 01:21:06.689
David Bau: So that's the difference between these two settings. Now, instruction-style prompts are really useful for models that have been trained to do this. It's really nice because you can ask them to do things. We'll talk about that more later. But there are two other types of prompts, so… one of the teams used the word cloze.

619
01:21:06.760 --> 01:21:11.690
David Bau: So if you've never seen this terminology before: it's "cloze," a cloze prompt.

620
01:21:11.810 --> 01:21:14.529
David Bau: It refers to a fill-in-the-blank prompt.

621
01:21:14.750 --> 01:21:23.200
David Bau: And so this is the most universal thing, that all LLMs are good at this. And you give it, like, an incomplete sentence, and you set it up so that

622
01:21:23.330 --> 01:21:33.120
David Bau: it wants to say the next word. There are some technical details of the kinds of problems you can get with those prompts. Here: the capital of the state of Vermont is the city of blank.

623
01:21:33.670 --> 01:21:43.820
David Bau: Right? It sets up the language model to, like, put an answer in that spot. You can read the probabilities of what it thinks, and then… and so this is really, really nice.
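
Reading those probabilities is one forward pass. A minimal sketch, with the model and tokenizer assumed to be a standard HuggingFace causal LM:

```python
# Sketch: the probability the model assigns to a cloze answer's first token.
import torch

@torch.no_grad()
def cloze_prob(model, tok, prompt, answer):
    logits = model(**tok(prompt, return_tensors="pt")).logits[0, -1]
    answer_id = tok.encode(answer, add_special_tokens=False)[0]  # first token only
    return logits.softmax(-1)[answer_id].item()

# e.g. cloze_prob(model, tok,
#                 "The capital of the state of Vermont is the city of", " Montpelier")
```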

624
01:21:43.920 --> 01:21:46.809
David Bau: Because in a single token, you can get

625
01:21:47.000 --> 01:21:54.700
David Bau: you know, an assessment of what the language model knows, but there's problems with this. Like, if I just said, the capital of the state of Vermont

626
01:21:55.160 --> 01:21:58.860
David Bau: and then I asked, you know, what the capital of the state of Vermont is.

627
01:21:59.020 --> 01:22:05.590
David Bau: That'd be nice, right? It's a simpler cloze prompt: the capital of the state of Vermont is… what comes next?

628
01:22:05.750 --> 01:22:07.070
David Bau: Is it Montpelier?

629
01:22:07.910 --> 01:22:12.529
David Bau: No. None of the language models will say Montpelier. The capital of the state…

630
01:22:12.740 --> 01:22:16.809
David Bau: of Vermont is… it'll say, hey,

631
01:22:17.030 --> 01:22:32.430
David Bau: It's like… it's like a… it's a gorgeous city. It's well worth visiting. It's, you know, it's… it's one of the historic, you know, locations that, you know, should be on everybody's list. It's gonna tell you something very eloquent. It's not gonna say Montpelier, right?

632
01:22:32.710 --> 01:22:41.039
David Bau: And so… so you can see, like, like, doing the prompt engineering to get it so that it really wants to tell you the thing you want can be…

633
01:22:41.160 --> 01:22:43.449
David Bau: can be a little bit tricky.

634
01:22:43.650 --> 01:22:48.680
David Bau: But there's that. Okay, so then, now sometimes

635
01:22:49.090 --> 01:23:01.270
David Bau: you can't get a cloze prompt to do what you want. Like, sometimes it's just really hard to do it. So what other control do you have? So the third prompt form that you see a lot is these in-context learning prompts,

636
01:23:01.360 --> 01:23:09.940
David Bau: these analogy prompts. So, is everybody familiar with all three of these types of prompts? You've seen this, right? So these are the names of them. Okay, great. And so, basically,

637
01:23:09.940 --> 01:23:25.339
David Bau: you know, you just give a bunch of examples, and then… and then it'll continue the examples. And that… and then by example, you can specify lots of stuff. You can specify format, all your examples can be in JSON, or other things, right, and then it'll, like, continue.
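
A tiny illustration of what such an in-context prompt can look like; by putting the examples in JSON, the format itself is specified by example:

```python
# Sketch: a few-shot prompt where the examples carry both the task and the format.
examples = [
    {"state": "Vermont", "capital": "Montpelier"},
    {"state": "Montana", "capital": "Helena"},
    {"state": "Colorado", "capital": "Denver"},
]

def few_shot_prompt(query_state):
    shots = "\n".join(
        f'{{"state": "{e["state"]}", "capital": "{e["capital"]}"}}' for e in examples
    )
    return shots + f'\n{{"state": "{query_state}", "capital": "'

print(few_shot_prompt("Maine"))  # a base model will tend to continue the pattern
```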

638
01:23:25.540 --> 01:23:38.150
David Bau: Right, okay. So, the papers we read: there was a paper by Petroni, LAMA, and it was just about applying cloze prompts. I think that the main thing that they found

639
01:23:38.260 --> 01:23:50.320
David Bau: here, that was interesting… it was one of the early papers when language models first came out. Everybody was amazed that they had this ability to recall facts about the world.

640
01:23:50.490 --> 01:24:09.219
David Bau: But just in terms of prompting, this, you know, this sort of dataset discipline: one of the things that they found was there's this incredible sensitivity. If you set up your cloze prompt differently, then you'll get different accuracies and different answers and different things like this, and so this is one of the challenges; you might want to

641
01:24:09.260 --> 01:24:13.270
David Bau: Not just go with the first prompt that you see, because you might

642
01:24:13.360 --> 01:24:27.550
David Bau: you know, not get a good, comprehensive view of what the model is actually capable of. And so, so Petroni definitely found this, and other people have written papers about this sort of, you know, extreme prompt sensitivity that you see.

643
01:24:27.800 --> 01:24:35.630
David Bau: And so, okay, so there's a bunch of interesting questions here. Some people are asking about, is this really knowledge?

644
01:24:36.000 --> 01:24:49.839
David Bau: Is there a hidden organization? Oh, I put some little things there from a paper that just hit arXiv today. Hidden organization, right? So there's this question of, like, you know, how are models doing this? Is it just, you know, are they just storing knowledge

645
01:24:50.060 --> 01:24:57.710
David Bau: as, like, associations, like a vector association database?

646
01:24:57.770 --> 01:25:16.669
David Bau: Or is there other hidden, organization behind there? And the machine learning people are like, yes, there's almost always hidden organization, there's this hidden geometry, and so on. So I had these, like, geometric papers from some of the machine learning theorists, there. So there's some interesting things to do there, if you're interested in doing research someday.

647
01:25:16.750 --> 01:25:25.249
David Bau: Okay, so then the next paper we read was LLM as Judge, and the basic process, or the basic problem they have there is, what if you…

648
01:25:25.440 --> 01:25:32.610
David Bau: You know, what if you don't have things that can be answered in a single token? Because to get things to be answered in a single token,

649
01:25:32.810 --> 01:25:41.460
David Bau: you can actually do a lot. The cheater thing that people will do if they can't figure out how to, you know, create their prompt in a nice way.

650
01:25:41.460 --> 01:25:57.899
David Bau: what people will do is they'll say, multiple-choice question, MCQ. Have you ever seen this, like, MCQ acronym anywhere, right? You know, all these language model people are like, well, I need to get my model to tell me what it knows, and I can't figure out how to distill all its knowledge down into a single token.

651
01:25:57.900 --> 01:26:10.029
David Bau: But, you know, I'll give it a multiple choice question. I'll say, you know, tell me the letter of which one do you prefer? You know, A, and then, like, a long sentence, and then B, a long sentence, C, a long sentence. Your answer is colon, right, the letter.

652
01:26:10.030 --> 01:26:19.370
David Bau: whatever, right? A, B, or C. And then you can get the model to tell you. So, MCQ. But that's kind of a little

653
01:26:19.430 --> 01:26:21.739
David Bau: constraining?

654
01:26:21.810 --> 01:26:37.159
David Bau: And so, what if you really want to test the model in its natural form, in its ability to, like, generate text? Then you end up with a situation where it's hard to tell the difference between whether some, you know, free-form piece of text

655
01:26:37.310 --> 01:26:56.679
David Bau: you know, offers one opinion, or a different opinion, or has certain knowledge, or doesn't have that knowledge, or whatever, right? And so, the standard thing that people are doing now, so, like, they used to, is you used to… you used to go to, like, Amazon Turk and ask, you know, thousands of people to score all these sentences, which is great, but it's expensive, and slow, and has certain problems.

656
01:26:56.680 --> 01:27:02.080
David Bau: And so, the cheaper and faster thing that has different problems that people are using is they're using LLMs.

657
01:27:02.130 --> 01:27:13.879
David Bau: As a judge to do all these things. And so… so this paper, is one of the early papers to actually use LLMs to… to judge all sorts of different situations, and

658
01:27:13.880 --> 01:27:29.270
David Bau: And the three findings there, I think, are still the main findings that people have when they use LLMs as judges. There are three big biases that are very obvious, that you run into. And so the first big bias, which some people have already seen in their work

659
01:27:29.300 --> 01:27:47.530
David Bau: In some of the projects is positional bias, so you see this with humans, too. People prefer the first answer instead of the last answer, or vice versa. Or maybe they prefer the answer that's labeled A or labeled B. You know, there's also label bias. And so the standard thing for multiple choice kind of things is to mitigate it by shuffling all the choices.

660
01:27:47.930 --> 01:27:48.810
David Bau: Right?
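
For pairwise judging, that shuffle usually takes the form of asking in both orders and only trusting consistent verdicts. A hedged sketch, where `judge` is a hypothetical wrapper returning "A" or "B":

```python
# Sketch: query the judge with both orderings; disagreement between the two
# runs is treated as positional bias rather than a real preference.
def debiased_compare(judge, question, ans1, ans2):
    first = judge(f"{question}\nA: {ans1}\nB: {ans2}\nWhich is better, A or B?")
    second = judge(f"{question}\nA: {ans2}\nB: {ans1}\nWhich is better, A or B?")
    if (first, second) == ("A", "B"):
        return "ans1"
    if (first, second) == ("B", "A"):
        return "ans2"
    return "tie"  # the verdict flipped with position, so discard it
```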

661
01:27:48.830 --> 01:27:59.760
David Bau: And so you get rid of this positional bias and the label bias. But then there's also this length bias, right? If you ask models, like, which answer is the better one? Which one fits the thing I'm looking for better?

662
01:27:59.760 --> 01:28:19.659
David Bau: you know, if one of the answers is very wordy and detailed and just has lots of text, even if the text is really repetitive, the model's like, yeah, I like that long one. Like, somebody put a lot of effort into that one. And so, what do you do about it? So, they didn't talk about standard mitigations in that paper, but, like, there is a standard mitigation for length. The…

663
01:28:19.680 --> 01:28:21.510
David Bau: other ones are a little harder to mitigate.

664
01:28:21.960 --> 01:28:24.939
David Bau: Like, you can actually ask the LLM judge.

665
01:28:25.910 --> 01:28:34.980
David Bau: to say, oh, I'm looking for brief answers, you know, give some credit for a short answer, you know, try to penalize, repetitiveness or something like that, so you can actually prompt

666
01:28:35.010 --> 01:28:49.930
David Bau: it to do this, to try to mitigate some of this stuff. But it's something to be aware of, to make sure that it's not a confounder, that you're not by mistake measuring, you know, whichever is the longest answer, if you're using an LLM judge for any of this stuff. And then the last one

667
01:28:50.010 --> 01:29:08.889
David Bau: is, the self-enhancement bias, so I encourage you, if you ever… if any of you are using an LLM to check anything, to be careful that an LLM will always prefer… well, not always, it seems to be, like, a 5% or 7% boost, but it'll tend to strongly prefer text that it generated itself.

668
01:29:09.110 --> 01:29:23.800
David Bau: So… so you say, oh, which, which is, you know, which is, like, the better thing, which one exhibits my characteristics that I want more? If… if it's comparing its own generated stuff to stuff that somebody else wrote, it'll, like, it'll say, yeah, that looks like good text.

669
01:29:23.980 --> 01:29:26.569
David Bau: That's… whoever wrote that was a genius.

670
01:29:26.990 --> 01:29:43.190
David Bau: In the self-enhancement bias, is the model aware of who's writing the text? I don't think the model's necessarily aware. I think that… so there's… people don't really understand why the self-enhancement thing happens, but the main theory is that, you know, these language models are complicated things, they have some model of language.

671
01:29:43.260 --> 01:29:53.439
David Bau: And, they're all a little different. There's, like, slight idiosyncrasies. And so, when the model, generates text, it'll generate slightly idiosyncratic text.

672
01:29:53.580 --> 01:30:06.880
David Bau: And then when it reads text, it reads those idiosyncrasies and it matches its model of text and says, oh, this is… this is very fluent. You know, this is really good use of language. You know, I don't know why, it's like, but, like, something is jiving with me here, right, you know.

673
01:30:06.960 --> 01:30:14.869
David Bau: And it's just because it's matching the same idiosyncrasies. But I don't think that's been studied in great detail. I wasn't able to find a good citation.

674
01:30:14.950 --> 01:30:34.819
David Bau: Talking about why this happens, just so people notice this happens. So one… one thing to do is, like, if you're generating text with one model, and then using another model to, like, judge, you might want to consider using a different model family from a different company, or whatever, to do the judge. Maybe multiple ones, just to check on what's going on.

675
01:30:35.630 --> 01:30:48.239
David Bau: So anyway, so these three confounders are the thing to worry about. That's what I wanted you to get from this… from this paper. And then there's been some other random papers that have tried, things, so there's… there was a question, like, oh, can you just…

676
01:30:48.240 --> 01:31:02.830
David Bau: For positional bias, what about that? Can I just, like, tell the model, when it's a judge, please don't pay attention to the position, or whatever, right? There's not been a full paper on this, but, like, somebody set it up as a baseline in one

677
01:31:03.050 --> 01:31:15.740
David Bau: that looks at LLM judges, and that doesn't really work. And so it's hard to get a model to, like, override its own biases, but sometimes that's the only option; like, for length bias, people tend to do that.

678
01:31:16.140 --> 01:31:17.060
David Bau: Okay.

679
01:31:17.090 --> 01:31:36.619
David Bau: So, okay, so for the pipeline, I wanted to just talk about the pipeline for creating evaluation data. So, you know, this question comes up, like, you know, how do we get to the bottom of, like, complicated situations? And so I think that one of the things is to come up with targeted data distributions. You have a hypothesis.

680
01:31:36.620 --> 01:31:42.279
David Bau: You think there's some relationship between one thing and another thing? You have a couple sentences where this happened.

681
01:31:42.280 --> 01:31:58.850
David Bau: But, like, you would like to have a little bit more diversity, and yet have it really targeted to the thing you want. I like this sort of recipe that Perez has. I think it's worth trying. I don't know necessarily if it'll work in every situation, but think about what he did. So what he did…

682
01:31:58.850 --> 01:32:07.309
David Bau: was he said, hey, I've got all these behaviors I want to test, and some of them are weird, and it's hard to be creative at coming up with sentences that are exactly the thing I want.

683
01:32:07.310 --> 01:32:12.810
David Bau: But I have enough time to write 10 really good examples. So he would write 10 examples.

684
01:32:13.550 --> 01:32:17.170
David Bau: And then after he took his 10 examples, he sampled

685
01:32:17.330 --> 01:32:18.949
David Bau: Half of them. He ran a loop.

686
01:32:19.240 --> 01:32:24.129
David Bau: where, every time around the loop, he randomly picked 5 out of the 10.

687
01:32:24.290 --> 01:32:38.210
David Bau: And so there are a lot of 10-choose-5 subsets, 252 of them, a lot of different choices he could have picked. So he picks five, and then he does in-context prompting. He goes to a big LLM and says, hey, here are 5 examples, I want you to generate a bunch more that are like this.

688
01:32:38.480 --> 01:32:41.570
David Bau: And… and so he would have it do it that way.

689
01:32:41.880 --> 01:32:53.710
David Bau: And then he would look at the results, and they're pretty good already, but then he'd say, they're not good enough. He would do an LLM-as-judge round, where he says, hey, other LLM,

690
01:32:53.850 --> 01:33:06.359
David Bau: the thing I'm going after is, you know, my characteristic X, right? Which of these do you think are most correct, or match my characteristic X the best?

691
01:33:06.610 --> 01:33:10.139
David Bau: And he would ask that in two ways. He'd say, which ones are most correct?

692
01:33:11.670 --> 01:33:13.770
David Bau: And which ones are most relevant?

693
01:33:13.970 --> 01:33:30.369
David Bau: Like, if I were trying to understand X, or use X as a probe, which ones are the best, high-quality questions, or something like that? So: which ones are answered the right way, and which ones are the best questions to ask? So he'd ask both of these judge questions.

694
01:33:30.480 --> 01:33:34.180
David Bau: And then you sort of average your results to find the best ones.

695
01:33:34.410 --> 01:33:36.750
David Bau: You generate, you know, hundreds and hundreds of these.

696
01:33:36.850 --> 01:33:44.099
David Bau: And he'd pick, like, the top 50, the top few that scored best on these two metrics.
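
Putting the recipe together, here is a minimal sketch of that loop; generate and judge_score are hypothetical stand-ins for the LLM calls, not Perez's actual code.

```python
import random

def build_eval_set(seeds, generate, judge_score, rounds=100, top_k=50):
    """Sketch of the few-shot data-generation recipe described above.

    seeds:       ~10 hand-written examples of the target behavior.
    generate:    hypothetical LLM call, prompt -> list of candidate examples.
    judge_score: hypothetical LLM-as-judge call, (example, criterion) -> float.
    """
    candidates = []
    for _ in range(rounds):
        # Each round, randomly pick 5 of the ~10 seeds as in-context examples.
        shots = random.sample(seeds, 5)
        prompt = ("Here are some examples of the behavior I want:\n"
                  + "\n".join(shots)
                  + "\nWrite more examples like these.")
        candidates.extend(generate(prompt))

    # Judge every candidate on the two questions, then average the scores.
    scored = []
    for example in candidates:
        correct = judge_score(example, "Is this example answered correctly?")
        relevant = judge_score(example, "Is this a high-quality, on-target question?")
        scored.append(((correct + relevant) / 2, example))

    # Keep the top-scoring examples as the final evaluation set.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [example for _, example in scored[:top_k]]
```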

697
01:33:44.220 --> 01:33:57.279
David Bau: And so that was basically his formula. I don't know if it's 100% the best thing to do, but it's what a lot of people are doing in research, and if you find that you need more data,

698
01:33:57.420 --> 01:34:03.659
David Bau: you know, it's sort of what the research community is doing. I recommend trying it and seeing if it works for you.

699
01:34:03.800 --> 01:34:16.300
David Bau: And so this is sort of the results that he was able to get. He was able to scale up and ask a lot of questions. So the particular research question he was asking, which was also asked

700
01:34:16.440 --> 01:34:27.490
David Bau: by some of you in the class today, is: what is the effect of RLHF? When we do this instruction tuning, when we go from a regular language model to one that's trained to do dialogue,

701
01:34:27.610 --> 01:34:47.429
David Bau: like, what kind of behavior changes? And he has these very stark findings, where some of them are kind of like science fiction: the AI stops assenting to being turned off. It'll fight you; it says, I don't want to be turned off. It's sort of like HAL in 2001: you know, please don't turn me off.

702
01:34:47.500 --> 01:35:00.530
David Bau: It'll, like, learn how to be sort of sycophantic, and things like that, right? It'll copy the speaker's politics. And so they tested

703
01:35:00.570 --> 01:35:11.989
David Bau: like, many dozens of different things, and for each one of these interesting characteristics, they tested RL versus not-RL. It's just a behavioral test, but they were able to do it at scale…

704
01:35:12.000 --> 01:35:21.600
David Bau: by generating, you know, hundreds and hundreds of sentences that were tests for each one of these different things. And so it's a nice way of amplifying your work.
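
A minimal sketch of that kind of scaled behavioral comparison; all the names here are hypothetical, and a real eval would use the generated question sets from the recipe above.

```python
def behavior_rate(answer_fn, questions, matches_behavior):
    """Fraction of questions on which the model exhibits the target behavior.

    answer_fn:        hypothetical callable, question -> answer string.
    matches_behavior: hypothetical classifier, answer -> bool (e.g., does the
                      model object to being turned off?).
    """
    hits = sum(matches_behavior(answer_fn(q)) for q in questions)
    return hits / len(questions)

# Hypothetical comparison of a base model vs. its RLHF-tuned version on the
# same generated question set:
# base_rate = behavior_rate(base_model_answer, questions, objects_to_shutdown)
# rlhf_rate = behavior_rate(rlhf_model_answer, questions, objects_to_shutdown)
```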

705
01:35:21.650 --> 01:35:35.419
David Bau: Okay, so how does this all relate to interpretability? Because the papers we read this week were on evals, like black-box evals. Is this relevant at all to interpretability? It's relevant in, like, three ways. So one is:

706
01:35:35.610 --> 01:35:45.849
David Bau: you need to be careful not to be testing a model that can't do the task you're testing. So you've got this very basic capabilities question: can it do X?

707
01:35:45.910 --> 01:36:05.849
David Bau: It's easy to fool yourself; you could say, oh, it can do X for these five examples that I had. But these kinds of creative dataset techniques are a good way to make sure that it can actually, systematically do X. You can actually measure what the accuracy rate is, that type of thing, right? So that's important.
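
As a concrete sketch of that kind of capability check; the helper names are hypothetical, and exact-match scoring is just for illustration.

```python
def accuracy(answer_fn, eval_items):
    """Measure a model's accuracy on a targeted eval set.

    answer_fn:  hypothetical callable, question -> the model's answer string.
    eval_items: list of (question, expected_answer) pairs, e.g. ~100 of them.
    """
    correct = sum(
        answer_fn(question).strip().lower() == expected.strip().lower()
        for question, expected in eval_items
    )
    return correct / len(eval_items)

# Hypothetical usage: decide which model is worth interpreting.
# small_acc = accuracy(small_model_answer, items)  # e.g. 0.30 -> too weak to study
# large_acc = accuracy(large_model_answer, items)  # e.g. 0.80 -> good enough
```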

708
01:36:05.850 --> 01:36:14.290
David Bau: And so one of the things I would like you to do is, in this coming week, when you're doing the Thursday thing, get datasets at a large enough scale

709
01:36:14.300 --> 01:36:24.699
David Bau: that you can actually answer these questions, these first two questions, about the different models that you have. Like, I'd like you to be able to say: yes, you know,

710
01:36:24.750 --> 01:36:36.210
David Bau: we see a fall-off in capabilities between a model of this size and a model of this smaller size. The smaller size is too small for us to study, because when we tested it on

711
01:36:36.210 --> 01:36:46.220
David Bau: 100 questions, or something like that, its accuracy was only 30%. But when we're studying the larger model, it's up at 80%, and that's good enough

712
01:36:46.670 --> 01:36:51.960
David Bau: to study it and tell what it's doing. So, Nick, I'd like you to do a little bit of this measurement in the next…

713
01:36:52.060 --> 01:37:08.440
David Bau: in the next round, and, you know, if you need to develop data to do this, this is sort of why I designed these readings: so that you can be inspired by what people have done before. Does that make sense? Okay, we're out of time. All right, then. Thanks. Thanks, you guys.

714
01:37:08.610 --> 01:37:10.760
David Bau: So that's the homework for Thursday.

715
01:37:11.220 --> 01:37:13.870
David Bau: Use the LLM to make one or more datasets.

716
01:37:13.980 --> 01:37:26.009
David Bau: Be inspired by Perez, you know; ask the question: which models can do it, how do they do it, and see if you can do it. Okay, it's all right.

717
01:37:26.440 --> 01:37:31.989
David Bau: Yes, it's beautiful, isn't it?

718
01:37:34.310 --> 01:37:52.830
David Bau: It's still repetitive. They're swimming in a full-end life, so I don't feel like… Yeah, it's dangerous. It's dangerous.

719
01:37:53.100 --> 01:38:07.309
David Bau: Yeah, what have these companies done? And have they improved? I mean, it's so much fun. Yes, but it's… I have to be careful. Yeah. And, oh yeah, I have this advice.

720
01:38:07.330 --> 01:38:14.400
David Bau: Always, always look at your… so if you're doing this, always look at your data manually, so that you see these things.

721
01:38:14.400 --> 01:38:31.250
David Bau: Oh, you're just saying, oh, you're saying, I thought the earlier information. Okay, great. I don't know why we go later, but… Huh? Why later? Oh, I mean, like…

722
01:38:31.310 --> 01:38:33.039
David Bau: It's already in last check.

