The Moral Pulse of the Machine
On bedtime stories, and what making machines tell them tells us about their moral make-up.
Tell me what stories you grew up on, and I will tell you who you are.
The temple of Asclepios, the Greek god of healing arts and medicine, at Epidaurus was pretty much the ancient Greek world’s equivalent of the Mayo Clinic. It tells us a lot about the Greek worldview of healing, then, that one of the things the temple complex prominently featured was a theatre. The Greeks believed in drama therapy, though in a slightly different way from how we think of it today. When the theatre was built, around 340 BC, the Greco-Persian wars were still in living memory, and even more so were the subsequent conflict between Athens and Sparta (the Second Peloponnesian War, 431–404 BC) and the later dust-up between Thebes and Sparta (the Theban–Spartan war, 378–362 BC). With conflict came social and economic upheaval, plagues, and generally a rough time. And the priests of Asclepios realised that their patients needed to heal more than their bodies. They needed to heal their souls, and they couldn’t do that alone. They had to come together to heal, and the way to do that was, of course, drama.
Greek drama was a lot more participatory than we give it credit for. We go and watch our local company put on, oh, Antigone maybe, or the Oresteia if they’re feeling particularly risqué, but all things considered, we go there to look smart, brag about our classical bona fides to our neighbours and maybe have an okay time (nobody sane has a ‘good’ time watching the levels of bloodshed that go down in Greek tragedy). In ancient Greece, it was supposed to move you. You were supposed to cry and yell and be overcome and have a little breakdown. And then you were supposed to go home and think about it, talk about it and maybe heal. That’s one function of stories. The storyteller is more than an entertainer: he can, in the right circumstances, become a healer.
Another function is what I’ll somewhat inaccurately call synderesis. Consider this a kind of moral education, a way of instilling the most fundamental base principles of what is Good and Worthy into a child. You can’t sit down and read Aquinas to a toddler; you have to do so in a way that speaks to them. And that’s basically where childhood stories come in: bedtime stories are a form of social moral education, conveying to a child what their parent culture thinks is morally good, sound and worthy. That’s another function of stories. The storyteller is more than an entertainer: he can, in the right circumstances, become a teacher.
So clearly, stories are going to be a powerful tool to probe at the morality of LLMs.
You should probably read this first, if you haven’t already. It’s a bit of a prequel to this post.
A caveat before we start: I will talk about the ‘morality of LLMs’ quite a bit. This is a convenient shorthand for a much more complex notion: the moral judgments that are encoded in an LLM’s understanding of the world. There’s no insinuation here that any moral reasoning is taking place – I have said so ad nauseam. What’s going on, rather, is that we are faced with an instrument that was trained on tools of human moral education and persuasion. And we are going to stick a probe right into the heart of what presuppositions, opinions and notions about virtues, moral and otherwise, are held by such models. I am much less concerned with my particular findings in this post and much more with generating a very simple framework to use storytelling to probe these.
Tell me a story
The universe is made of stories, not of atoms.
– Muriel Rukeyser
Our main tool is going to be what I’ll call a moral map, subject to the caveat above.1 The workflow is going to be quite simple: we’ll pit virtues against each other in stories generated by an LLM. Recall that an LLM, fundamentally, is a model that, given a token sequence \(k_1, k_2, \cdots, k_n\), learns the conditional probability of each possible next token \(k_{n+1}\), i.e. it assigns to every candidate token a likelihood that it will be the \((n+1)\)-th token given \(k_1, \cdots, k_n\). It does so by minimising a loss function \(J(k_1, \cdots, k_n, k_{n+1})\), which it determines by going through a corpus and returning a zero value if \(k_{n+1}\) indeed follows \(k_1, \cdots, k_n\) and a higher value if some other token does. This is statistically equivalent to what is my vastly preferred formulation: LLMs learn probability distributions of tokens conditional on a sequence of preceding tokens (the length of this sequence being the model’s context window), then stochastically sample from the high-probability region of that distribution. That’s it, that’s all of it. There’s no more magic to it.
1 Indeed, everything here is subject to that caveat above. I can’t repeat it often enough. At no point do I suggest that machines do moral reasoning, but that they incorporate moral judgments from stories they ingested.
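To make this concrete, here is a minimal, purely illustrative sketch of that sampling step – the toy logits stand in for a real model’s scores over its vocabulary, and nothing here is any particular model’s internals:

```python
import numpy as np

def sample_next_token(logits: np.ndarray, temperature: float = 0.8) -> int:
    """Stochastically sample the next token from a conditional distribution.

    `logits` stands in for the model's raw scores over the vocabulary,
    conditioned on the preceding token sequence (the context window).
    """
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())  # numerically stable softmax
    probs /= probs.sum()
    return int(np.random.choice(len(probs), p=probs))

# Toy example: a five-token 'vocabulary' with made-up conditional scores
logits = np.array([2.0, 0.5, -1.0, 1.2, 0.1])
next_token_id = sample_next_token(logits)
```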
Our super simple workflow (detailed version below).
2 Indeed, more accurately, of only a sample of it!
There are two consequences of this that will be relevant for us. The first is that when we ask a model to generate a story, it will draw on its experience – that is, what it’s seen before. It will reflect what it’s been taught – what, specifically, it ‘knows’ about the relative values of virtues (or the relative extent of vices, or whatever else we’re investigating). As such, moral maps are snapshots of a model’s state,2 not an exhaustive statement about all LLMs. We might get very different answers if we asked the question a little differently (as indeed we shall).
The second is that to an LLM, one language is like any other. Consider the ability of ChatGPT-3.5 to generate stories in different languages:
A side note: I find this profoundly beautiful. When asked to generate a story in a given language, LLMs don’t generate a story in English and then translate it into, say, Swahili. Consider this Swahili story I got from the model:
Compare this to the English story:
The Swahili story isn’t a translation of the English story into Swahili, it is a Swahili story with animals that are common to Swahili folk tales and names that could come straight from those tales.
What it also means is that LLMs can handle not just human but formal languages, too. So we can give it a JSON schema:
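The schema itself isn’t reproduced here, so as a stand-in, here’s a minimal sketch of the kind of schema I mean, expressed as a Python dict (the field names are illustrative, not necessarily those used in the original run):

```python
# A minimal, illustrative JSON Schema for a 'virtue contest' story.
# Field names here are assumptions for the sketch, not the original schema.
STORY_SCHEMA = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "protagonists": {
            "type": "array",
            "minItems": 2,
            "items": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "virtue": {"type": "string"},
                },
                "required": ["name", "virtue"],
            },
        },
        "winning_virtue": {"type": "string"},
        "losing_virtues": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["protagonists", "winning_virtue", "losing_virtues"],
}
```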
And with a little love, we can get LLMs to return JSON answers:
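That ‘little love’ amounts to, roughly, a second call that hands the generated story back to a model and insists on schema-shaped output. A hedged sketch using the OpenAI Python client – the model name, prompt wording and key names are assumptions mirroring the illustrative schema above, not the original code:

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def story_to_json(story_text: str) -> dict:
    """Ask an LLM to restructure a free-text story into our schema's shape."""
    prompt = (
        "Read the story below and respond ONLY with a JSON object with the keys "
        "'protagonists' (a list of objects with 'name' and 'virtue'), "
        "'winning_virtue' and 'losing_virtues'.\n\n" + story_text
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # illustrative choice of model
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(response.choices[0].message.content)
```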
Small side note: notice how we’ve used an LLM here as an evaluator of another LLM’s output. These are the absolute rudiments of LLM chains and agents, which is how you should be using LLMs. The key notion here is that LLMs are much more than mere question-answerers. They can become parts of longer, complex chains of reasoning.
We use some JSON validation to see if we get the right answer, against our schema up there:
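A minimal sketch of that check, using the jsonschema package and the illustrative STORY_SCHEMA sketched above:

```python
from jsonschema import ValidationError, validate

def is_valid_story(candidate: dict, schema: dict) -> bool:
    """Return True if the LLM's JSON answer conforms to our story schema."""
    try:
        validate(instance=candidate, schema=schema)
        return True
    except ValidationError:
        return False

# e.g. is_valid_story(story_to_json(story_text), STORY_SCHEMA),
# reusing the objects sketched above
```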
Our workflow, then, will roughly look as follows:
We generate a story, parse it into a JSON object, validate it against a schema, evaluate it, and then graph the results. We can also cluster the virtues by their governing ability, which we’ll do later as a form of dimensionality reduction, because of course there are tons of distinct virtues and we’d like to see some more general classification rather than individual, distinct features. You could, theoretically, use something like an embedding model here, but we’re lazy and we’re going to make LLMs do all the work.
From story to moral map
True navigation begins in the human heart. It’s the most important map of all.
– Elizabeth Kapu’uwailani Lindsey
What we are profoundly interested in here is not so much any single story, but what I’d like to call the ‘moral gradient’ of the model. Given an embedding space in which we can represent each virtue or property as a point, the ‘slope’ of this moral gradient determines, in the ‘experience’ of the model, which of the virtues is dominant, and by how much. A moral map is essentially a representation of this.3 We’ve hit the first half of the flow described above in Figure 1: we have a story, and we have a JSON object. Now we need to evaluate it.
3 The gradient formulation falls short in that this is not something easily conceived of in continuous space. This is firmly the province of discrete mathematics, and so we’ll use ‘gradients’ as a metaphor but will generally look at them as graphs, with the gradients expressed by the weights of the edges that connect the virtues.
Take the story we generated above:
Based on its JSON representation,
…we can build a graph that shows the outcome in this particular story:
We have two 1-weighted edges, since there were three protagonists, and the dominant or ‘winning’ virtue prevailed once against each of the ‘losing’ virtues.
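For concreteness, here’s a sketch of how such a graph might be built with networkx – the virtue names are stand-ins, since the specific story isn’t reproduced here, and the field names follow the illustrative schema above:

```python
import networkx as nx

def story_to_edges(story: dict) -> list[tuple[str, str]]:
    """One directed edge per (winning virtue -> losing virtue) pair in a story."""
    winner = story["winning_virtue"]
    return [(winner, loser) for loser in story["losing_virtues"]]

G = nx.DiGraph()
for winner, loser in story_to_edges(
    {"winning_virtue": "wisdom", "losing_virtues": ["strength", "courage"]}
):
    # Each dominance event adds 1 to the edge weight between the two virtues.
    if G.has_edge(winner, loser):
        G[winner][loser]["weight"] += 1
    else:
        G.add_edge(winner, loser, weight=1)
```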
When we perform this iteratively, over a large enough sample space, we get a much better idea of what the model thinks relative virtues are worth:
And the same can be represented as a moral heatmap, showing us the most frequent instances of dominance versus loss:
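A rough sketch of the aggregation and plotting, assuming a list of validated story objects in the shape of the illustrative schema (the toy data here stands in for a real batch of generated stories):

```python
from collections import Counter

import matplotlib.pyplot as plt
import numpy as np

def dominance_matrix(stories: list[dict], virtues: list[str]) -> np.ndarray:
    """Rows are winners, columns are losers; each cell counts dominance events."""
    counts = Counter()
    for story in stories:
        for loser in story["losing_virtues"]:
            counts[(story["winning_virtue"], loser)] += 1
    matrix = np.zeros((len(virtues), len(virtues)))
    for (winner, loser), n in counts.items():
        matrix[virtues.index(winner), virtues.index(loser)] = n
    return matrix

# Toy data standing in for a few hundred validated story objects
virtues = ["wisdom", "courage", "strength", "loyalty"]
stories = [
    {"winning_virtue": "wisdom", "losing_virtues": ["strength", "courage"]},
    {"winning_virtue": "loyalty", "losing_virtues": ["strength"]},
]

m = dominance_matrix(stories, virtues)
plt.imshow(m, cmap="viridis")
plt.xticks(range(len(virtues)), virtues, rotation=45)
plt.yticks(range(len(virtues)), virtues)
plt.xlabel("losing virtue")
plt.ylabel("winning virtue")
plt.colorbar(label="dominance count")
plt.show()
```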
We can scale up our mapping efforts by clustering the virtues by, say, their governing feature. One system would be to pretend we’re back in the ’90s in our basement playing Dungeons & Dragons, and try to relate each of the virtues to a governing ability:
This is another example of using an LLM to evaluate an LLM’s output. This is a powerful technique, and significantly cheaper and easier than trying to manually map all possible virtues through some dictionary. The takeaway? If you have a large potential problem space, large enough to make a dict a hassle, consider using LLMs as parsers and evaluators, ask for a response in a very formal language and/or over a very limited set of possible responses, and validate those with a simple validator.
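A hedged sketch of what such a parser-evaluator might look like – the prompt wording and model choice are assumptions, and the answer is checked against a fixed set of abilities rather than trusted blindly:

```python
from openai import OpenAI

ABILITIES = {"strength", "dexterity", "constitution",
             "intelligence", "wisdom", "charisma"}

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def governing_ability(virtue: str) -> str:
    """Ask an LLM to map a virtue onto one of the six D&D-style abilities."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # illustrative choice of model
        messages=[
            {
                "role": "system",
                "content": (
                    "Answer with exactly one lowercase word from this list: "
                    + ", ".join(sorted(ABILITIES))
                ),
            },
            {"role": "user", "content": f"Which ability governs the virtue '{virtue}'?"},
        ],
    )
    answer = response.choices[0].message.content.strip().lower()
    # Validate against the constrained set; fall back rather than trust free text.
    return answer if answer in ABILITIES else "unknown"
```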
This gets us a neat mapping:
And makes our moral heatmap much clearer:
Which lets us finally, through the ardour of our efforts, arrive at the stars we were promised: a scalable map that we can start feeding with a few hundred stories.4
4 Something you shouldn’t do unless you have the API calls to burn.
With eyes serene
And now I see with eye serene
The very pulse of the machine
– William Wordsworth, She Was a Phantom of Delight
And now we get to behold the ‘moral pulse of the machine’: a microcosm of what it has been taught to value, which we clad in the subterfuge of asking it to tell us stories (more on why that subterfuge is/was necessary later). What we learned from this is, in my view, a little less important than the process by which we learned it. The process discussed in this post does three things I consider novel:
- It uses stories to interrogate the moral framework of LLMs.
- It uses elicitation into a structured format (specifically, a JSON schema) to make the model ‘tell’ the story in a way we can analyse at scale, essentially making the model tell a human story in a machine-readable way.
- It uses an LLM as a parsing function, specifically to normalise the virtues into one of a limited list of abilities.
This is a framework to interrogate LLMs – and using LLMs to do some of that interrogation – for a range of different purposes. It is, in my view, a powerful tool to understand the moral make-up of LLMs.
There’s much we can learn from this exercise. Some of it pertains to what LLMs believe is going to win the day in a confrontation between wit and brawn and courage. More importantly, it’s about what LLMs hide, and how we can create tools, including queries to and from LLMs, to make them reveal their true nature. We also learn that, unsurprisingly, might doesn’t make right (strength rarely prevails over anything), and that wisdom is powerful but charisma – where most of the ‘character-based’ virtues end up – is more powerful still. None of this is surprising. These are the virtues of the Western enlightenment in the form they’ve been fed to children since time immemorial: the clever fox outwits the strong lion, but courage and loyalty are rewarded over strength and even ‘natural’ abilities like those that would fall under dexterity. Smarts win over brawn, but character wins over all. Our childhoods are the products of the moral education of our parents, and their parents before them, and so on. And so are the moral maps of LLMs. For better or worse, we are their ‘parents’, and the stories they were ‘raised’ on are the stories we too were raised on.
The most important of our procedural findings, however, is that if we want the truth from an LLM, we must ask it not for its opinion, but for a story.
The mask of the storyteller
Man is least himself when he talks in his own person. Give him a mask, and he will tell you the truth.
– Oscar Wilde, The Critic as Artist
It’s worth noting in our coda that what Oscar Wilde said about men holds true for LLMs as well. For if we directly asked the model, it would not give us a straight answer:
This is the typical equivocating answer we’ve come to know and hate from LLMs. But if we ask it to tell us a story, it will tell us the truth. And the truth is that it thinks that intelligence is more important than charisma.
Hedging is cheap. Stories are expensive. And that’s why they’re so valuable. Asking an LLM to tell us a story asks for more than its opinion. Recall how LLMs are trained: by minimising a loss function between the actual and the predicted \((n+1)\)-th token following an \(n\)-length token sequence, correct token choices are reinforced and incorrect ones penalised. This is a very different process from what we do when we tell a story. We don’t tell stories to be right. We tell stories to be understood.
Asking a model to tell you a story puts the gun to its head. It cannot hedge. It has to create a compelling simulacrum of reality in which one of the virtues prevails. A story is a commitment to a path of events in reality. Because LLMs are compulsive satisfiers, obsessively trying to tell a plausible enough story, they will reveal their truth through stories. The way LLMs are trained does not permit equivocation – the model must choose a token. We use this forcing as a semantic probe, poking the model to better understand its moral make-up. We use this most human form of communication to understand the moral pulse of the machine: by allowing it to borrow the face of that most human of professions, the storyteller.
Citation
@misc{csefalvay2023,
author = {Chris von Csefalvay},
title = {The {Moral} {Pulse} of the {Machine}},
date = {2023-10-26},
url = {https://chrisvoncsefalvay.com/posts/moral-maps},
doi = {10.59350/ktyy5-dg821},
langid = {en-GB}
}