Rescuing the utility function

“Saving the phenomena” is the name for the rule that brilliant new scientific theories still need to reproduce our mundane old observations. The point of heliocentric astronomy is not to predict that the Sun careens crazily over the sky, but rather to explain why the Sun appears to rise and set each day: the same old mundane observations we had already. Similarly, quantum mechanics is not supposed to add up to a weird universe unlike the one we observe; it is supposed to add up to normality. New theories may have not-previously-predicted observational consequences in places we haven’t looked yet, but by default we expect the sky to look the same color.

“Rescuing the utility function” is an analogous principle meant to apply to naturalistic moral philosophy: new theories about which things are composed of which other things should, by default, not affect what we value. For example, if your values previously made mention of “moral responsibility” or “subjective experience”, you should go on valuing these things after discovering that people are made of parts.

As the above sentence contains the word “should”, the principle of “rescuing the utility function” is being asserted as a normative principle rather than a descriptive theory.

Metaphorical example: heat and kinetic energy

Suppose, for the sake of metaphor, that our species regarded “warmth” as a terminal value over the world. It wouldn’t just be nice to feel warm in a warm coat; instead you would prefer that the outside world actually be warm, in the same way that e.g. you prefer for your friends to actually be happy in the outside world, and ceteris paribus you wouldn’t be satisfied to only deceive yourself into thinking your friends were happy.

One day, scientists propose that “heat” may really be composed of “disordered kinetic energy”—that when we experience an object as warm, it’s because the particles comprising that object are vibrating and bumping into each other.

You imagine this possibility in your mind, and find that you don’t get any sense of lovely warmth out of imagining lots of objects moving around. No matter how fast you imagine an object vibrating, this imagination doesn’t seem to produce a corresponding imagined feeling of warmth. You therefore reject the idea that warmth is composed of disordered kinetic energy.

After this, science advances a bit further and proves that heat is composed of disordered kinetic energy.

One possible way to react to this revelation—again, assuming for the sake of argument that you cared about warmth as a terminal value—would be by experiencing great existential horror. Science has proven that the universe is devoid of ontologically basic heat! There’s really no such thing as heat! It’s all just temperature-less particles moving around!

Sure, if you dip your finger in hot water, it feels warm. But neuroscientists have shown that when our nerves tell us there’s heat, they’re really just being fooled into firing by being excited with kinetic energy. When we touch an object and it feels hot, this is just an illusion being produced by fast-moving particles activating our nerves. This is why our brains make us think that things are hot even though they’re just bouncing particles. Very sad, but at least now we know the truth.

Alternatively, you could react as follows:

  • Heat doesn’t have to be ontologically basic to be valuable. Valuable things can be made out of parts.

  • The parts that heat was made out of turn out to be disordered kinetic energy. Heat isn’t an illusion; vibrating particles just are heat. It’s not like you’re getting a consolation prize of vibrating particles when you really wanted heat. You have heat. It’s right there in the warm water.

  • From now on, you’ll think of warmth when you think of disordered kinetic energy and vibrating particles. Since your native emotions don’t automatically light up when you use the vibrating-particles visualization of heat, you will now adopt the rule that whenever you imagine disordered kinetic energy being present, you will imagine a sensation of warmth so as to go on binding your emotions to this new model of reality.

This reply would be “rescuing the utility function”.

Argument for rescuing the utility function (still in the heat metaphor)

Our minds have an innate and instinctive representation of the universe to which our emotions natively and automatically bind. Warmth and color are basic to that representation; we don’t instinctively imagine them as made out of parts. When we imagine warmth in our native model, our emotions automatically bind and give us the imagined feeling of warmth.

After learning more about how the universe works and how to imagine more abstract and non-native concepts, we can also visualize a lower-level model of the universe containing vibrating particles. But unsurprisingly, our emotions automatically bind only to our native, built-in mental models and not the learned abstract models of our universe’s physics. So if you imagine tiny billiard balls whizzing about, it’s no surprise that this mental picture doesn’t automatically trigger warm feelings.

It’s a descriptive statement about our universe, a way the universe is, that ‘heat’ is a high-level representation of the disordered kinetic energy of colliding and vibrating particles. But to advocate that we should re-bind our emotions to this non-native mental model by feeling cold is merely one possible normative statement among many.

Saying “There is really no such thing as heat!” is from this perspective a normative statement rather than a descriptive one. The real meaning is “If you’re in a universe where the observed phenomenon of heat turns out to be composed of vibrating particles, then you shouldn’t feel any warmth-related emotions about that universe.” Or, “Only ontologically basic heat can be valuable.” Or, “If you’ve only previously considered questions of value over your native representation, and that’s the only representation to which your emotions automatically bind without further work, then you should attach zero value to every possible universe whose physics don’t exactly match that representation.” This normative proposition is a different statement from the descriptive truth, “Our universe contains no ontologically basic heat.”

The stance of “rescuing the utility function” advocates that we have no right to expect the universe to function exactly like our native representation of it. According to this stance, it would be a strange and silly demand to make of the universe that its lowest level of operation correspond exactly to our built-in mental representations, and to insist that we’re not going to feel anything warm about reality unless heat is basic to physics. The high-level representations our emotions natively bind to could not reasonably have been fated to be identical with the raw low-level description of the universe. So if we couldn’t ‘rescue the utility function’ by identifying high-level heat with vibrating particles, this portion of our values would inevitably end in disappointment.

Once we can see as normative the question of how to feel about a universe that has dancing particles instead of ontologically basic warmth, we can see that going around wailing in existential despair about the coldness of the universe doesn’t seem like the right act or the right judgment. Instead we should rebind our native emotions to the non-instinctive but more accurate model of the universe.

If we aren’t self-modifying AIs and can’t actually rewrite our emotions to bind to learned abstract models, then we can come closer to normative reasoning by adopting the rule of visualizing warmth whenever we visualize whizzing particles.

On this stance, it is not a lie to visualize warmth when we visualize whizzing particles. We are not giving ourselves a sad consolation prize for the absence of ‘real’ heat. It’s not an act of self-deception to imagine a sensation of lovely warmth going along with a bunch of vibrating particles. That’s just what heat is in our universe. Similarly, our nerves are not lying to us when they make us feel that fast-vibrating water is warm water.

On the normative stance of “rescuing the utility function”, when X turns out to be composed of Y, then by default we should feel about Y the same way we felt about X. There might be other considerations that modify this, but that’s the starting point or default.

After ‘rescuing the utility function’, your new theory of how the universe operates (that heat is made up of kinetic energy) adds up to moral normality. If you previously thought that a warm summer was better than a cold winter, you will still think that a warm summer is better than a cold winter after you find out what heat is made out of.

(This is a good thing, since the point of a moral philosophy is not to be amazing and counterintuitive.)
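The default rule above can be illustrated with a minimal sketch. Everything in it is a hypothetical stand-in for the metaphor, not a claim about how real values are represented: the names LowLevelWorld, native_utility, bridge, rescued_utility and discarded_utility are invented for illustration, and the energy numbers are arbitrary. The sketch only shows that composing the old utility function with the reductionist identity preserves the old rankings, whereas discarding the term yields a function that is zero everywhere.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LowLevelWorld:
    """A world described only in low-level terms (illustrative units)."""
    mean_kinetic_energy: float


def native_utility(warmth: float) -> float:
    """The pretheoretic utility: defined over high-level 'warmth' only.

    By hypothesis of the metaphor, warmer is better.
    """
    return warmth


def bridge(world: LowLevelWorld) -> float:
    """The reductionist identity: warmth just is disordered kinetic energy."""
    return world.mean_kinetic_energy


def rescued_utility(world: LowLevelWorld) -> float:
    """Rescue: Y (kinetic energy) inherits X's (warmth's) role by default."""
    return native_utility(bridge(world))


def discarded_utility(world: LowLevelWorld) -> float:
    """No rescue: 'there is no ontologically basic heat, so nothing counts'."""
    return 0.0


# Arbitrary illustrative values, assumed for this sketch.
summer = LowLevelWorld(mean_kinetic_energy=3.0)
winter = LowLevelWorld(mean_kinetic_energy=1.0)

# The rescued function still ranks a warm summer above a cold winter.
print(rescued_utility(summer) > rescued_utility(winter))
# The discarded function cannot tell the two worlds apart.
print(discarded_utility(summer) == discarded_utility(winter))
```

The design choice worth noticing is that native_utility is left untouched; only the mapping from the new world-model into its inputs is added, which is one way of picturing why the result adds up to moral normality.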

Reason for using the heat metaphor

One reason to start from this metaphorical example is that “heat” has a relatively understandable correspondence between high-level and low-level models. On a high level, we can see heat melting ice and flowing from hotter objects to cooler objects. We can, by imagination, see how vibrating particles could actually constitute heat rather than causing a mysterious extra ‘heat’ property to be present. Vibrations might flow from fast-vibrating objects to slow-vibrating objects via the particles bumping into each other and transmitting their speed. Water molecules vibrating quickly enough in an ice cube might break whatever bonds were holding them together in a solid object.

Since it does happen to be relatively easy to visualize how heat is composed of kinetic energy, we can see in this case that we are not lying to ourselves by imagining that lovely warmth is present wherever vibrating particles are present.

For an even more transparent reductionist identity, consider, “You’re not really wearing socks, there are no socks, there’s only a bunch of threads woven together that looks like a sock.” Your visual cortex can represent this identity directly, so it feels immediately transparent that the sock just is the collection of threads; when you imagine sock-shaped woven threads, you automatically feel your visual model recognizing a sock.

If the relation between heat and kinetic energy were too complicated to visualize easily, it might instead feel like we were being given a blind, unjustified rule that reality contains mysterious “bridging laws” that make a separate quality of heat be present when particles vibrate quickly. Instructing ourselves to feel “warmth” as present when particles vibrate quickly would feel more like fooling ourselves, or self-deception. But on the position of “rescuing the utility function”, the same arguments ought to apply in this hypothetical, even if the level transition is less transparent.

The gap between mind and brain is larger than the gap between heat and vibration, which is why humanity understood heat as disordered kinetic energy long before anyone had any idea how ‘playing chess’ could be decomposed into non-mental simpler parts. In some cases, we may not know what the reductionist identity will be. Still, the advice of “rescuing the utility function” is not to morally panic about realizing that various emotionally valent things will turn out to be made of parts, or even that our mind’s representations in general may run somewhat skew to reality.

Complex rescues

In the heat metaphor, the lower level of the universe (jiggling particles) corresponds fairly exactly to the high-level notion of heat. We’d run into more complicated metamoral questions if we’d previously lumped together the ‘heat’ of chili peppers and the ‘heat’ of a fireplace as valuable ‘warmth’.

We might end up saying that there are two physical kinds of valuable warm things: ceteris paribus and by default, if X turns out to consist of Y, then Y inherits X’s role in the utility function. Alternatively, by some non-default line of reasoning, the discovery that chili peppers and fireplaces are warm in ontologically different ways might lead us to change how we feel about them on a high level as well. In this case we might have to carry out a more complicated rescue, where it’s not so immediately obvious which low-level Ys are to inherit X’s value.

Non-metaphorical utility rescues

We don’t actually have terminal values for things being warm (probably). Non-metaphorically, “rescuing the utility function” says that we should apply similar reasoning to phenomena that we do in fact value, whose corresponding native emotions we are having trouble reconciling with non-native, learned theories of the universe’s ontology.

Examples might include:

  • Moral responsibility. How can we hold anyone responsible for their actions, or even hold ourselves responsible for what we see as our own choices, when our acts have causal histories behind them?

  • Happiness. What’s the point of people being happy if it’s just neurons firing?

  • Goodness and shouldness. Can there be any right thing to do, if there isn’t an ontologically basic irreducible rightness property to correspond to our sense that some things are just right?

  • Wanting and helping. When a person wants different things at different times, and there are known experiments that expose circular and incoherent preferences, how can we possibly “help” anyone by “giving them what they want” or “extrapolating their volition”?

In cases like these, it may be that our native representation is in some sense running skew to the real universe. E.g., our minds insist that something called “free will” is very important to moral responsibility, but it seems impossible to define “free will” in a coherent way. The position of “rescuing the utility function” still takes the stance of “Okay, let’s figure out how to map this emotion onto a coherent universe as best we can”, not “Well, it looks like the human brain didn’t start out with a perfect representation of reality, therefore, normatively speaking, we should toss the corresponding emotions out the window.” If in your native representation the Sun goes around the Earth, and then we learn differently from astronomy, then your native representation is in fact wrong, but normatively we should (by default) re-bind to the enormous glowing fusion reactor rather than saying that there’s no Sun.

The role such concepts play in our values lends a special urgency to the question of how to rescue them. But on an even more general level, one might espouse that it is the job of good reductionists to say how things exist if they have any scrap of reality, rather than it being the job of reductionists to go around declaring that things don’t exist if we detect the slightest hint of fantasy. Leif K-Brooks presented this general idea as follows (with intended application to ‘free will’ in particular):

If you define a potato as a magic fairy orb and disprove the existence of magic fairy orbs, you still have a potato.

“Rescue” as resolving a degree of freedom in the pretheoretic viewpoint

A human child doesn’t start out endorsing any particular way of turning emotions into utility functions. As humans, we start out with no clear rules inside the chaos of our minds, and we have to make them up by considering various arguments appealing to our not-yet-organized intuitions. Only then can we even try to have coherent metaethical principles.

The core argument for “rescuing the utility function” can be seen as a base intuitive appeal to someone who hasn’t yet picked out any explicit rules, claiming that it isn’t especially sensible to end up as the kind of agent whose utility function ends up being zero everywhere.

In other words, rather than the rescue project needing to appeal to rules that would only be appealing to somebody who’d already accepted the rescue project ab initio—which would indeed be circular—the start of the argument is meant to also work as an intuitive appeal to the pretheoretic state of mind of a normal human. After that, we also hopefully find that these new rules are self-consistent.

In terms of the heat metaphor, if we’re considering whether to discard heat, we can consider three types of agents:

  • (1). A pretheoretic or confused state of intuition, which knows itself to be confused. An agent like this is not reflectively consistent—it wants to resolve the internal tension.

  • (2). An agent that has fully erased all emotions relating to warmth, as if it never had them. This type of agent is reflectively consistent; it doesn’t value warmth and doesn’t want to value warmth.

  • (3). An agent that values naturalistic heat, i.e., feels the way about disordered kinetic energy that a pretheoretic human feels about warmth. This type of agent has also resolved its issues and become reflectively consistent.

Since 2 and 3 are both internally consistent resolutions, there’s potentially a reflectively consistent degree of freedom in how (1) can resolve its current internal tension or inconsistency. That is, neither 2-agents nor 3-agents are the only kind of coherent agent, so a desire for coherence qua coherence can’t by itself tell a 1-agent whether it should become a 2-agent or a 3-agent. By advocating for rescuing the utility function, we’re appealing to a pretheoretic and maybe chaotic and confused mess of intuitions, aka a human, arguing that if you want to shake out the mess, it’s better to shake out as a 3-agent rather than a 2-agent.

In making this appeal, we can’t appeal to firm foundations that already exist, since a 1-agent hasn’t yet decided on firm philosophical foundations and there’s more than one set of possible foundations to adopt. An agent with firm foundations would already be reflectively coherent and have no further philosophical confusion left to resolve (except perhaps for a mere matter of calculation). An existing 2-agent is of course unmoved by any arguments that heat should be valued, in much the same way that humans would be unmoved by arguments in favor of valuing paperclips (or for that matter, things being hot). But to point this out is no argument for why a confused 1-agent should shake itself out as a consistent 2-agent rather than a consistent 3-agent; a 3-agent is equally unmoved by the argument that the best thing to do with an ontology identification problem is throw out all corresponding terms of the utility function.

It’s perhaps lucky that human beings can’t actually modify their own code, meaning that somebody partially talked into taking the 2-agent state as a new ideal to aspire to still actually has the pretheoretic emotions and can potentially “snap out of it”. Rather than becoming a 2-agent or a 3-agent, we become “a 1-agent that sees 2-agency as ideal” or “a 1-agent that sees 3-agency as ideal”. A 1-agent aspiring to be a 2-agent can still potentially be talked out of it: they may still feel the weight of arguments meant to appeal to 1-agents, even if they think they ought not to, and can potentially “just snap out” of taking 2-agency as an ideal, reverting to confused 1-agency or to taking 3-agency as a new ideal.

Looping through the meta-level in “rescuing the utility function”

Since human beings don’t have “utility functions” (coherent preferences over probabilistic outcomes), the notion of “rescuing the utility function” is itself a matter of rescue. Natively, it’s possible for psychology experiments to expose inconsistent preferences, but instead of throwing up our hands and saying “Well I guess nobody wants anything and we might as well let the universe get turned into paperclips!”, we try to back out some reasonably coherent preferences from the mess. This is, arguendo, normatively better than throwing up our hands and turning the universe into paperclips.
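As a toy illustration of “backing out” reasonably coherent preferences from an incoherent set, the sketch below takes pairwise choices that contain a cycle and produces a single ranking. The method used (summing the strengths of wins minus losses for each option) is a simple stand-in chosen for this example, not a procedure proposed in the text, and the option names, strengths, and the function name back_out_ranking are all hypothetical.

```python
from collections import defaultdict


def back_out_ranking(pairwise):
    """Turn possibly circular pairwise preferences into one ranking.

    pairwise: iterable of (preferred, dispreferred, strength) tuples.
    Each option scores +strength for every win and -strength for every
    loss; options are then sorted by score, ties broken alphabetically.
    """
    score = defaultdict(float)
    for better, worse, strength in pairwise:
        score[better] += strength
        score[worse] -= strength
    return sorted(score, key=lambda option: (-score[option], option))


# Hypothetical experimental results exposing a preference cycle:
# apple > banana > cherry > apple, with the last link only weakly held.
observed = [
    ("apple", "banana", 1.0),
    ("banana", "cherry", 1.0),
    ("cherry", "apple", 0.2),
]

ranking = back_out_ranking(observed)
print(ranking)
```

With these assumed inputs the weakly held link of the cycle is the one that gets overridden, so most of the observed preferences survive in the coherent ranking instead of all of them being thrown away.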

Similarly, according to the normative stance behind extrapolated volition, the very notion of “shouldness” is something that gets rescued. Many people seem to instinctively feel that ‘shouldness’ wants to map onto an ontologically basic, irreducible property of rightness, such that every cognitively powerful agent with factual knowledge about this property is thereby compelled to perform the corresponding actions. (“Moral internalism.”) But this demands an overly direct correspondence between our native sense that some acts have a compelling rightness quality about them, and wanting there to be an ontologically basic compelling rightness quality out there in the environment.

Despite the widespread appeal of moral internalism once people are exposed to it as an explicit theory, it still seems unfair to say that humans natively want or pretheoretically demand that this is what our sense of rightness corresponds to. E.g. a hunter-gatherer, or someone else who’s never debated metaethics, doesn’t start out with an explicit commitment about whether a feeling of rightness must correspond to universes that have irreducible rightness properties in them. If you’d grown up thinking that your feeling of rightness corresponded to computing a certain logical function over universes, this would seem natural and non-disappointing.

Since “shouldness” (the notion of normativity) is something that itself may need rescuing, this rescue of “shouldness” is in some sense being implicitly invoked by the normative assertion that we should try to “rescue the utility function”.

This could be termed circular, but we could equally say that it is self-consistent. Or rather, we are appealing to some chaotic, not-yet-rescued, pretheoretic notion of “should” in saying that we should try to rescue concepts like “shouldness” instead of throwing them out the window. Afterwards, once we’ve performed the rescue and have a more coherent notion of concepts like “better”, we can see that the loop through the meta-level has entered a consistent state. According to this new ideal (less confused, but perhaps also seeming more abstract), it remains better not to give up on concepts like “better”.

The “extrapolated volition” rescue of shouldness is meant to bootstrap, by appeal to a pretheoretic and potentially confused state of wondering what is the right thing to do (or how we even should resolve this whole “rightness” issue, and if maybe it would be better to just give up on it), into a more reflectively consistent state of mind. Afterwards we can see both pretheoretically, and also in the light of our new explicit theory, that we ought to try to rescue the concept of oughtness, and adopt the rescued form of reasoning as a less-confused ideal. We will believe that ideally, the best justification for extrapolated volition is to say that we know of no better candidate for what theory we’d arrive at if we thought about it for even longer. But since humans can perhaps thankfully not directly rewrite their own code, we will also remain aware of whether this seems like a good idea in the pretheoretic sense, and perhaps prepared to unwind or jump back out of the system if it turns out that the explicit theory has big problems we didn’t know about when we originally jumped to it.

Reducing tension / “as if you’d always known it”

We can possibly see the desired output of “rescuing the utility function” as being something like reducing a tension between native emotion-binding representations, and reality, with a minimum of added fuss and complexity.

This can look a lot like the intuition pump, “Suppose you’d grown up always knowing the true state of affairs, and nobody had suggested that you panic over it or experience any existential angst; what would you have grown up thinking?” If you grow up with warmth-related emotions and already knowing that heat is disordered kinetic energy, and nobody has suggested to you that anyone ought to wail in existential angst about this, then you’ll probably grow up valuing heat-as-disordered-kinetic-energy (and this will be a low-tension resolution).

For more confusing grades of cognitive reductionism, like free will, people might spontaneously have difficulty reconciling their internal sense of freedom with being told about deterministic physical laws. But a good “rescue” of the corresponding sense of moral responsibility ought to end up looking like the sort of thing that you’d take for granted as a quiet, obvious-seeming mapping of your sense of moral responsibility onto the physical universe, if you’d grown up taking those laws of physics for granted.

“Rescuing” pretheoretic emotions and intuitions versus “rescuing” explicit moral theories

Rewinding past factual-mistake-premised explicit moral theories

In the heat metaphor, suppose you’d previously adopted a ‘caloric fluid’ model of heat, and the explicit belief that this caloric fluid was what was valuable. You still have wordless intuitions about warmth and good feelings about warmth. You also have an explicit world-model that this heat corresponds to caloric fluid, and an explicit moral theory that this caloric fluid is what’s good about heat.

Then science discovers that heat is disordered kinetic energy. Should we try to rescue our moral feelings about caloric by looking for the closest thing in the universe to caloric fluid—electricity, maybe?

If we now reconsider the arguments for “rescuing the utility function”, we find that we have more choices beyond “looking for the closest thing to caloric” and “giving up entirely on warm feelings”. An additional option is to try to rescue the intuitive sense of warmth, but not the explicit beliefs and explicit moral theories about “caloric fluid”.

If we instead choose to rescue the pretheoretic emotion, we could see this as retracing our steps after being led down a garden-path of bad reasoning, aka, “not rescuing the garden path”. We started with intuitively good feelings about warmth, came to believe a false model about the causes of warmth, reacted emotionally to this false model, and developed an explicit moral theory about caloric fluid.

The extrapolated-volition model of normativity (what we would want* if we knew all the facts) suggests that we could see the reasoning after adopting the false caloric model as “mistaken” and not rescue it. When we’re dealing with explicit moral beliefs that grew up around a false model of the world, we have the third option to “rewind and rescue” rather than “rescue” or “give up”.

Nonmetaphorically: Suppose you believe in a divine command theory of metaethics; goodness is equivalent to God wanting something. Then one day you realize that there’s no God in which to ground your moral theory.

In this case we have three options for resolution, all of which are reflectively consistent within themselves, and whose arguments may appeal to our currently-confused pretheoretic state:

  • (a) Prefer to go about wailing in horror about the unfillable gap in the universe left by the absence of God.

  • (b) Try to rescue the explicit divine command theory, e.g. by looking for the closest thing to a God and re-anchoring the divine command theory there.

  • (c) Give up on the explicit model of divine command theory; instead, try to unwind past the garden path you went down after your native emotions reacted to the factually false model of God. Try to remap the pretheoretic emotions and intuitions onto your new model of the universe.

Again, (a) and (b) and (c) all seem reflectively consistent in the sense that a simple agent fully in one of these states will not want to enter either of the other two states. But given these three options, a confused agent might reasonably find either (b) or (c) more pretheoretically compelling than (a), but also find (c) more pretheoretically compelling than (b).

The notion of “rescuing” isn’t meant to erase the notion of “mistakes” and “saying oops” with respect to b-vs.-c alternatives. The arguments for “rescuing” warmth implicitly assumed that we were talking about a pretheoretic normative intuition (e.g. an emotion associated with warmth), not explicit models and theories about heat that could just as easily be revised.

Conversely, when we’re dealing with preverbal intuitions and emotions whose natively bound representations are in some way running skew to reality, we can’t rewind past the fact of our emotions binding to particular mental representations. We were literally born that way. Then our only obvious alternatives are to (a) give up entirely on that emotion and value, or (c) rescue the intuitions as best we can. In this case (c) seems more pretheoretically appealing, ceteris paribus and by default.

(For example, suppose you were an alien that had grown up accepting commands from a Hive Queen, and you had a pretheoretic sense of the Hive Queen as knowing everything, and you mostly operated on an emotional-level Hive-Queen-command theory of rightness. One day, you begin to suspect that the Hive Queen isn’t actually omniscient. Your alien version of “rescuing the utility function” might say to rescue the utility function by allowing valid commands to be issued by Hive Queens that knew a lot but weren’t omniscient. Or it might say to try and build a superintelligent Hive Queen that would know as much as possible, because in a pretheoretic sense that would feel better. The aliens can’t rewind past their analogue of divine command theory because, by hypothesis, the aliens’ equivalent of divine command metaethics is built into them on a pretheoretic and emotional level. Though of course, in this case, such aliens seem more likely to actually resolve their tension by asking the Hive Queen what to do about it.)

Possibility of rescuing non-mistake-premised explicit moral theories

Suppose Alice has an explicit belief that private property ought to be a thing, and this belief did not develop after she was told that objects had tiny XML tags declaring their irreducible objective owners, nor did she originally arrive at the conclusion based on a model in which God assigned transferable ownership of all objects at the dawn of time. We can suppose, somewhat realistically, that Alice is a human and has a pretheoretic concept of ownership as well as deserving rewards for effort, and was raised by small-l libertarian parents who told her true facts about how East Germany did worse economically than West Germany. Over time, she came to adopt an explicit moral theory of “private property”: ownership can only transfer by consent, and violations of this rule violate the just-rewards principle.

One day, Alice starts having trouble with her moral system because she’s realized that property is made of atoms, and that even the very flesh in her body is constantly exchanging oxygen and carbon dioxide with the publicly owned atmosphere. Can atoms really be privately owned?

The confused Alice now again sees three options, all of them reflectively consistent on their own terms if adopted:

  • (a) Give up on everything to do with ownership or deserving rewards for effort; regard these emotions as having no valid referents.

  • (b) Try to rescue the explicit moral theory by saying that, sure, atoms can be privately owned. Alice owns a changeable number of carbon atoms inside her body and she won’t worry too much about how they get exchanged with the atmosphere; that’s just the obvious way to map private property onto a particle-based universe.

  • (c) Try to rewind past the explicit moral theory, and figure out from scratch what to do with emotions about “deserves reward” or “owns”.

Leaving aside what you think of Alice’s explicit moral theory, it’s not obvious that Alice will end up preferring (c) to (b), especially since Alice’s current intuitive state is influenced by her currently-active explicit theory of private property.

Unlike the divine command theory, Alice’s private property theory was not (obviously to Alice) arrived at through a path that traversed wrong beliefs of simple fact. With the divine command theory, since it was critically premised on a wrong factual model, we face the prospect of having to stretch the theory quite a lot in order to rescue it, making it less intuitively appealing to a confused mind than the alternative prospect of stretching the pretheoretic emotions a lot less. Whereas from Alice’s perspective, she can just as easily pick up the whole moral theory and morph it onto reductionist physics with all the internal links intact, rather than needing to rewind past anything.

We at least have the apparent option of trying to rescue Alice’s utility function in a way that preserves her explicit moral theories not based on bad factual models—the steps of her previous explicit reasoning that did not, of themselves, introduce any new tensions or problems in mapping her emotions or morals onto a new representation. Whether or not we ought to do this, it’s a plausible possibility on the table.

Which explicit theories to rescue?

It’s not yet obvious where to draw the line on which explicit moral theories to rescue. So far as we can currently see, any of the following might be a reasonable way to tell a superintelligence to extrapolate someone’s volition:

  • Preserve explicit moral theories wherever it doesn’t involve an enormous stretch.

  • Be skeptical of explicit moral theories that were arrived at by fragile reasoning processes, even if they could be rescued in an obvious way.

  • Only extrapolate pretheoretic intuitions.

Again, all of these viewpoints are internally consistent (they are degrees of freedom in the metaphorical meta-utility-function), so the question is which rule for drawing the line seems most intuitively appealing in our present state of confusion:

Argument from adding up to normality

Arguendo: Preserving explicit moral theories is important for having the rescued utility function add up to normality. After rescuing my notion of “shouldness”, I should, by default, see mostly the same things as rescued-right.

Suppose Alice was previously a moral internalist, and thought that some things were inherently irreducibly right, such that her very notion of “shouldness” needed rescuing. That doesn’t necessarily introduce any difficulties into re-importing her beliefs about private property. Alice may have previously refused to consider some arguments against private property because she thought it was irreducibly right, but this is a separate issue in extrapolating her volition from throwing out her entire stock of explicit moral theories because they all used the word “should”. By default, after we’re done rescuing Alice, unless we are doing something that’s clearly and explicitly correcting an error, her rescued viewpoint should look as normal-relative-to-her-previous-perspective as possible.

Argument from helping

Arguendo: Preserving explicit moral theories where possible is an important aspect of how an ideal advisor or superintelligence ought to extrapolate someone else’s volition.

Suppose Alice was previously a moral internalist, and thought that some things were inherently irreducibly right, such that her very notion of “shouldness” needed rescuing. Alice may not regard it as “helping” her to throw away all of her explicit theories and try to re-extrapolate her emotions from scratch into new theories. If there weren’t any factual flaws involved, Alice is likely to see it as less than maximally helpful to her if we needlessly toss one of her cherished explicit moral theories.

Argument from incoherence, evil, chaos, and arbitrariness

Arguendo: Humans are really, really bad at systematizing explicit moral theories; a supermajority of explicit moral theories in today’s world will be incoherent, evil, or both. For example, explicit moral principles may de facto be chosen mostly on the basis of how hard they appear to cheer for a valorized group. An extrapolation dynamic that tries to take into account all these chaotic, arbitrarily-generated group beliefs will end up failing to cohere.

Argument from fragility of goodness

Arguendo: Most of what we see as the most precious and important parts of ourselves are explicit moral theories like “all sapient beings should have rights”, which aren’t built into human babies. We may well have arrived at that destination through a historical trajectory that went through factual mistakes, like believing that all human beings had souls created equal by God and were loved equally by God. (E.g. Christian theology seems to have been, as a matter of historical fact, causally important in the development of explicit anti-slavery sentiment.) Tossing the explicit moral theories is as unlikely to be good, from our perspective, as tossing our brains and trying to rerun the process of natural selection to generate new emotions.

Argument from dependency on empirical results

Arguendo: Which version of extrapolation we’ll actually find appealing will depend on which extrapolation algorithm turns out to have a reasonable answer. We don’t have enough computing power to guess, right now, whether:

  • Any reasonable-looking construal of “Toss out all the explicit cognitive content and redo by throwing pretheoretic emotions at true facts” leads to an extrapolated volition that lacks discipline and coherence, looking selfish and rather angry, failing to regenerate most of altruism or Fun Theory; or

  • Any reasonable-looking construal of “Try to preserve explicit moral theories” leads to an incoherent mess of assertions about various people going to Hell and capitalism being bad for you.

Since we can’t guess using our present computing power which rule would cause us to recoil in horror, but the actual horrified recoil would settle the question, we can only defer this single bit of information to one person who’s allowed to peek at the results.

(Counterargument: “Perhaps the ancient Greeks would have recoiled in horror if they saw how little the future would think of a glorious death in battle, thus picking the option we see as wrong, using the stated rule.”)