Rescuing the utility function

“Saving the phenomena” is the name for the rule that brilliant new scientific theories still need to reproduce our mundane old observations. The point of heliocentric astronomy is not to predict that the Sun careens crazily over the sky, but rather, to explain why the Sun appears to rise and set each day—the same old mundane observations we had already. Similarly, quantum mechanics is not supposed to add up to a weird universe unlike the one we observe; it is supposed to add up to normality. New theories may have not-previously-predicted observational consequences in places we haven’t looked yet, but by default we expect the sky to look the same color.

“Rescuing the utility function” is an analogous principle meant to apply to naturalistic moral philosophy: new theories about which things are composed of which other things should, by default, not affect what we value. For example, if your values previously made mention of “moral responsibility” or “subjective experience”, you should go on valuing these things after discovering that people are made of parts.

As the above sentence contains the word “should”, the principle of “rescuing the utility function” is being asserted as a normative principle rather than a descriptive theory.

Metaphorical example: heat and kinetic energy

Suppose, for the sake of metaphor, that our species regarded “warmth” as a terminal value over the world. It wouldn’t just be nice to feel warm in a warm coat; instead you would prefer that the outside world actually be warm, in the same way that e.g. you prefer for your friends to actually be happy in the outside world, and ceteris paribus you wouldn’t be satisfied to only deceive yourself into thinking your friends were happy.

One day, scientists propose that “heat” may really be composed of “disordered kinetic energy”—that when we experience an object as warm, it’s because the particles comprising that object are vibrating and bumping into each other.

You imagine this possibility in your mind, and find that you don’t get any sense of lovely warmth out of imagining lots of objects moving around. No matter how fast you imagine an object vibrating, this imagination doesn’t seem to produce a corresponding imagined feeling of warmth. You therefore reject the idea that warmth is composed of disordered kinetic motion.

After this, science advances a bit further and proves that heat is composed of disordered kinetic energy.

One possible way to react to this revelation—again, assuming for the sake of argument that you cared about warmth as a terminal value—would be by experiencing great existential horror. Science has proven that the universe is devoid of ontologically basic heat! There’s really no such thing as heat! It’s all just temperature-less particles moving around!

Sure, if you dip your finger in hot water, it feels warm. But neuroscientists have shown that when our nerves tell us there’s heat, they’re really just being fooled into firing by being excited with kinetic energy. When we touch an object and it feels hot, this is just an illusion being produced by fast-moving particles activating our nerves. This is why our brains make us think that things are hot even though they’re just bouncing particles. Very sad, but at least now we know the truth.

Alternatively, you could react as follows:

  • Heat doesn’t have to be ontologically basic to be valuable. Valuable things can be made out of parts.

  • The parts that heat was made out of turn out to be disordered kinetic energy. Heat isn’t an illusion; vibrating particles just are heat. It’s not like you’re getting a consolation prize of vibrating particles when you really wanted heat. You have heat. It’s right there in the warm water.

  • From now on, you’ll think of warmth when you think of disordered kinetic energy and vibrating particles. Since your native emotions don’t automatically light up when you use the vibrating-particles visualization of heat, you will now adopt the rule that whenever you imagine disordered kinetic energy being present, you will imagine a sensation of warmth, so as to go on binding your emotions to this new model of reality.

This reply would be “rescuing the utility function”.

Argument for rescuing the utility function (still in the heat metaphor)

Our minds have an innate and instinctive representation of the universe to which our emotions natively and automatically bind. Warmth and color are basic to that representation; we don’t instinctively imagine them as made out of parts. When we imagine warmth in our native model, our emotions automatically bind and give us the imagined feeling of warmth.

After learning more about how the universe works and how to imagine more abstract and non-native concepts, we can also visualize a lower-level model of the universe containing vibrating particles. But unsurprisingly, our emotions automatically bind only to our native, built-in mental models and not to the learned abstract models of our universe’s physics. So if you imagine tiny billiard balls whizzing about, it’s no surprise that this mental picture doesn’t automatically trigger warm feelings.

It’s a descriptive statement about our universe, a way the universe is, that ‘heat’ is a high-level representation of the disordered kinetic energy of colliding and vibrating particles. But to advocate that we should respond to this non-native mental model by feeling cold is merely one possible normative statement among many.

Saying “There is really no such thing as heat!” is from this perspective a normative statement rather than a descriptive one. The real meaning is “If you’re in a universe where the observed phenomenon of heat turns out to be composed of vibrating particles, then you shouldn’t feel any warmth-related emotions about that universe.” Or, “Only ontologically basic heat can be valuable.” Or, “If you’ve only previously considered questions of value over your native representation, and that’s the only representation to which your emotions automatically bind without further work, then you should attach zero value to every possible universe whose physics don’t exactly match that representation.” This normative proposition is a different statement than the descriptive truth, “Our universe contains no ontologically basic heat.”

The stance of “rescuing the utility function” holds that we have no right to expect the universe to function exactly like our native representation of it. According to this stance, it would be a strange and silly demand to make of the universe that its lowest level of operation correspond exactly to our built-in mental representations, and to insist that we’re not going to feel anything warm about reality unless heat is basic to physics. The high-level representations our emotions natively bind to could not reasonably have been fated to be identical with the raw low-level description of the universe. So if we couldn’t ‘rescue the utility function’ by identifying high-level heat with vibrating particles, this portion of our values would inevitably end in disappointment.

Once we can see as normative the question of how to feel about a universe that has dancing particles instead of ontologically basic warmth, we can see that going around wailing in existential despair about the coldness of the universe doesn’t seem like the right act or the right judgment. Instead we should rebind our native emotions to the non-instinctive but more accurate model of the universe.

If we aren’t self-modifying AIs and can’t actually rewrite our emotions to bind to learned abstract models, then we can come closer to normative reasoning by adopting the rule of visualizing warmth whenever we visualize whizzing particles.

On this stance, it is not a lie to visualize warmth when we visualize whizzing particles. We are not giving ourselves a sad consolation prize for the absence of ‘real’ heat. It’s not an act of self-deception to imagine a sensation of lovely warmth going along with a bunch of vibrating particles. That’s just what heat is in our universe. Similarly, our nerves are not lying to us when they make us feel that fast-vibrating water is warm water.

On the normative stance of “rescuing the utility function”, when X turns out to be composed of Y, then by default we should feel about Y the same way we felt about X. There might be other considerations that modify this, but that’s the starting point or default.
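
The default rule can be sketched as composing the old utility function with an “ontology bridge” that reads the high-level feature back out of the low-level description. This is a minimal toy formalization of my own, not anything from the essay; the state dictionaries, the kinetic-energy threshold, and all function names are illustrative assumptions:

```python
def old_utility(high_level_state):
    # Pretheoretic values, defined over the native representation.
    return 1.0 if high_level_state["warm"] else 0.0

def bridge(low_level_state):
    # Ontology identification: "warm" just is high mean kinetic energy.
    # The 2.0 threshold is an arbitrary illustrative choice.
    return {"warm": low_level_state["mean_kinetic_energy"] > 2.0}

def rescued_utility(low_level_state):
    # By default, feel about Y the way you felt about X:
    # the new utility is the old utility composed with the bridge.
    return old_utility(bridge(low_level_state))
```

On this sketch, the rescued function values low-level states exactly as the old one valued the high-level states they constitute; only the type of the input changed, not what is valued.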

After ‘rescuing the utility function’, your new theory of how the universe operates (that heat is made up of kinetic energy) adds up to moral normality. If you previously thought that a warm summer was better than a cold winter, you will still think that a warm summer is better than a cold winter after you find out what heat is made out of.

(This is a good thing, since the point of a moral philosophy is not to be amazing and counterintuitive.)

Reason for using the heat metaphor

One reason to start from this metaphorical example is that “heat” has a relatively understandable correspondence between high-level and low-level models. On a high level, we can see heat melting ice and flowing from hotter objects to cooler objects. We can, by imagination, see how vibrating particles could actually constitute heat rather than causing a mysterious extra ‘heat’ property to be present. Vibrations might flow from fast-vibrating objects to slow-vibrating objects via the particles bumping into each other and transmitting their speed. Water molecules vibrating quickly enough in an ice cube might break whatever bonds were holding them together in a solid object.

Since it does happen to be relatively easy to visualize how heat is composed of kinetic energy, we can see in this case that we are not lying to ourselves by imagining that lovely warmth is present wherever vibrating particles are present.
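
That visualization can even be run as a toy model. Below, the “hot” and “cold” objects are just lists of particle velocities, and each collision is a head-on 1D elastic collision between equal masses, which simply swaps the two velocities. This is an illustrative sketch under those simplifying assumptions, not a real molecular-dynamics simulation:

```python
import random

def mean_ke(velocities):
    # Kinetic energy per unit-mass particle, (1/2) v^2, averaged.
    return sum(0.5 * v * v for v in velocities) / len(velocities)

random.seed(0)
hot = [random.gauss(0, 3.0) for _ in range(500)]   # fast-vibrating object
cold = [random.gauss(0, 1.0) for _ in range(500)]  # slow-vibrating object

total_before = mean_ke(hot) + mean_ke(cold)
for _ in range(20000):
    i, j = random.randrange(500), random.randrange(500)
    # Head-on 1D elastic collision of equal masses: swap velocities.
    hot[i], cold[j] = cold[j], hot[i]
total_after = mean_ke(hot) + mean_ke(cold)
# "Heat" has flowed: the two mean kinetic energies approach each other,
# while every collision conserves the total exactly.
```

Nothing labeled “heat” appears anywhere in the model, yet heat flow from the hotter object to the cooler one falls out of nothing but velocity transfer, which is the point of the metaphor.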

For an even more transparent reductionist identity, consider, “You’re not really wearing socks, there are no socks, there’s only a bunch of threads woven together that looks like a sock.” Your visual cortex can represent this identity directly, so it feels immediately transparent that the sock just is the collection of threads; when you imagine sock-shaped woven threads, you automatically feel your visual model recognizing a sock.

If the relation between heat and kinetic energy were too complicated to visualize easily, it might instead feel like we were being given a blind, unjustified rule that reality contains mysterious “bridging laws” that make a separate quality of heat be present when particles vibrate quickly. Instructing ourselves to feel “warmth” as present when particles vibrate quickly would feel more like fooling ourselves, or self-deception. But on the position of “rescuing the utility function”, the same arguments ought to apply in this hypothetical, even if the level transition is less transparent.

The gap between mind and brain is larger than the gap between heat and vibration, which is why humanity understood heat as disordered kinetic energy long before anyone had any idea how ‘playing chess’ could be decomposed into non-mental simpler parts. In some cases, we may not know what the reductionist identity will be. Still, the advice of “rescuing the utility function” is not to morally panic about realizing that various emotionally valent things will turn out to be made of parts, or even that our mind’s representations in general may run somewhat skew to reality.

Complex rescues

In the heat metaphor, the lower level of the universe (jiggling particles) corresponds fairly exactly to the high-level notion of heat. We’d run into more complicated metamoral questions if we’d previously lumped together the ‘heat’ of chili peppers and the ‘heat’ of a fireplace as valuable ‘warmth’.

We might end up saying that there are simply two physical kinds of valuable warm things: ceteris paribus and by default, if X turns out to consist of Y, then Y inherits X’s role in the utility function. Alternatively, by some non-default line of reasoning, the discovery that chili peppers and fireplaces are warm in ontologically different ways might lead us to change how we feel about them on a high level as well. In this case we might have to carry out a more complicated rescue, where it’s not so immediately obvious which low-level Ys are to inherit X’s value.

Non-metaphorical utility rescues

We don’t actually have terminal values for things being warm (probably). Non-metaphorically, “rescuing the utility function” says that we should apply similar reasoning to phenomena that we do in fact value, whose corresponding native emotions we are having trouble reconciling with non-native, learned theories of the universe’s ontology.

Examples might include:

  • Moral responsibility. How can we hold anyone responsible for their actions, or even hold ourselves responsible for what we see as our own choices, when our acts have causal histories behind them?

  • Happiness. What’s the point of people being happy if it’s just neurons firing?

  • Goodness and shouldness. Can there be any right thing to do, if there isn’t an ontologically basic irreducible rightness property to correspond to our sense that some things are just right?

  • Wanting and helping. When a person wants different things at different times, and there are known experiments that expose circular and incoherent preferences, how can we possibly “help” anyone by “giving them what they want” or “extrapolating their volition”?

In cases like these, it may be that our native representation is in some sense running skew to the real universe. E.g., our minds insist that something called “free will” is very important to moral responsibility, but it seems impossible to define “free will” in a coherent way. The position of “rescuing the utility function” still takes the stance of “Okay, let’s figure out how to map this emotion onto a coherent universe as best we can”, not “Well, it looks like the human brain didn’t start out with a perfect representation of reality; therefore, normatively speaking, we should toss the corresponding emotions out the window.” If in your native representation the Sun goes around the Earth, and then we learn differently from astronomy, then your native representation is in fact wrong, but normatively we should (by default) re-bind to the enormous glowing fusion reactor rather than saying that there’s no Sun.

The role such concepts play in our values lends a special urgency to the question of how to rescue them. But on an even more general level, one might hold that it is the job of good reductionists to say how things exist if they have any scrap of reality, rather than the job of reductionists to go around declaring that things don’t exist if we detect the slightest hint of fantasy. Leif K-Brooks presented this general idea as follows (with intended application to ‘free will’ in particular):

If you define a potato as a magic fairy orb and disprove the existence of magic fairy orbs, you still have a potato.

“Rescue” as resolving a degree of freedom in the pretheoretic viewpoint

A human child doesn’t start out endorsing any particular way of turning emotions into utility functions. As humans, we start out with no clear rules inside the chaos of our minds, and we have to make them up by considering various arguments appealing to our not-yet-organized intuitions. Only then can we even try to have coherent metaethical principles.

The core argument for “rescuing the utility function” can be seen as a basic intuitive appeal to someone who hasn’t yet picked out any explicit rules, arguing that it isn’t especially sensible to end up as the kind of agent whose utility function is zero everywhere.

In other words, rather than the rescue project needing to appeal to rules that would only be appealing to somebody who’d already accepted the rescue project ab initio—which would indeed be circular—the start of the argument is meant to work as an intuitive appeal to the pretheoretic state of mind of a normal human. After that, we also hopefully find that the new rules are self-consistent.

In terms of the heat metaphor, if we’re considering whether to discard heat, we can consider three types of agents:

  • (1). A pretheoretic or confused state of intuition, which knows itself to be confused. An agent like this is not reflectively consistent—it wants to resolve the internal tension.

  • (2). An agent that has fully erased all emotions relating to warmth, as if it never had them. This type of agent is reflectively consistent; it doesn’t value warmth and doesn’t want to value warmth.

  • (3). An agent that values naturalistic heat, i.e., feels the way about disordered kinetic energy that a pretheoretic human feels about warmth. This type of agent has also resolved its issues and become reflectively consistent.

Since (2) and (3) are both internally consistent resolutions, there’s potentially a reflectively consistent degree of freedom in how (1) can resolve its current internal tension or inconsistency. That is, it’s not the case that the only coherent agents are 2-agents, nor that the only coherent agents are 3-agents; so a desire for coherence qua coherence can’t tell a 1-agent whether it should become a 2-agent or a 3-agent. By advocating for rescuing the utility function, we’re appealing to a pretheoretic and maybe chaotic and confused mess of intuitions, aka a human, arguing that if you want to shake out the mess, it’s better to shake out as a 3-agent rather than a 2-agent.
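
The claim that (2) and (3) are two distinct self-endorsing states, while (1) endorses nothing yet, can be made concrete in a toy way. In this sketch (my own formalization, purely illustrative), an agent is just an object-level value plus a meta-level endorsement, and “reflective consistency” means the two match:

```python
# Toy agents: an object-level value for warmth plus a meta-level
# endorsement of that value. None on the meta level stands for the
# confused, undecided state (1).

def reflectively_consistent(agent):
    # Consistent iff the agent values exactly what it endorses valuing.
    return agent["values_warmth"] == agent["endorses_valuing_warmth"]

agent_1 = {"values_warmth": True,  "endorses_valuing_warmth": None}   # confused
agent_2 = {"values_warmth": False, "endorses_valuing_warmth": False}  # warmth erased
agent_3 = {"values_warmth": True,  "endorses_valuing_warmth": True}   # warmth rescued
```

Both agent_2 and agent_3 pass the consistency check, so the check alone leaves a degree of freedom: it cannot tell agent_1 which way to resolve.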

In making this appeal, we can’t appeal to firm foundations that already exist, since a 1-agent hasn’t yet decided on firm philosophical foundations and there’s more than one set of possible foundations to adopt. An agent with firm foundations would already be reflectively coherent and have no further philosophical confusion left to resolve (except perhaps for a mere matter of calculation). An existing 2-agent is of course nonplussed by any arguments that heat should be valued, in much the same way that humans would be nonplussed by arguments in favor of valuing paperclips (or for that matter, things being hot). But to point this out is no argument for why a confused 1-agent should shake itself out as a consistent 2-agent rather than a consistent 3-agent; a 3-agent is equally nonplussed by the argument that the best thing to do with an ontology identification problem is to throw out all corresponding terms of the utility function.

It’s perhaps lucky that human beings can’t actually modify their own code, meaning that somebody partially talked into taking the 2-agent state as a new ideal to aspire to still actually has the pretheoretic emotions and can potentially “snap out of it”. Rather than becoming a 2-agent or a 3-agent, we become “a 1-agent that sees 2-agency as ideal” or “a 1-agent that sees 3-agency as ideal”. A 1-agent aspiring to be a 2-agent can still potentially be talked out of it—they may still feel the weight of arguments meant to appeal to 1-agents, even if they think they ought not to, and can potentially “just snap out” of taking 2-agency as an ideal, reverting to confused 1-agency or to taking 3-agency as a new ideal.

Looping through the meta-level in “rescuing the utility function”

Since human beings don’t have “utility functions” (coherent preferences over probabilistic outcomes), the notion of “rescuing the utility function” is itself a matter of rescue. Psychology experiments can expose inconsistent preferences in our native choices, but instead of throwing up our hands and saying “Well, I guess nobody wants anything and we might as well let the universe get turned into paperclips!”, we try to back out some reasonably coherent preferences from the mess. This is, arguendo, normatively better than throwing up our hands and turning the universe into paperclips.
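
The sense in which experiments “expose inconsistent preferences” can be made precise: elicited pairwise choices form a directed graph, and some single utility function can explain all of them only if that graph has no cycles. A minimal sketch; the drink example is my own illustrative assumption:

```python
def has_preference_cycle(choices):
    """True if the pairwise choices (preferred, rejected) contain a
    cycle, i.e. cannot all be explained by one utility function."""
    graph = {}
    for preferred, rejected in choices:
        graph.setdefault(preferred, []).append(rejected)
        graph.setdefault(rejected, [])
    state = {node: "new" for node in graph}

    def visit(node):
        if state[node] == "visiting":  # back-edge: a preference cycle
            return True
        if state[node] == "done":
            return False
        state[node] = "visiting"
        if any(visit(nxt) for nxt in graph[node]):
            return True
        state[node] = "done"
        return False

    return any(visit(node) for node in graph if state[node] == "new")

# Circular choices of the kind experiments elicit: no utility function
# rationalizes all three at once.
circular = [("tea", "coffee"), ("coffee", "cocoa"), ("cocoa", "tea")]
# Dropping one choice leaves a coherent ordering to back out.
coherent = [("tea", "coffee"), ("coffee", "cocoa")]
```

“Backing out reasonably coherent preferences” then roughly corresponds to discarding or down-weighting the fewest choices needed to make the graph acyclic, rather than declaring that nothing is wanted at all.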

Similarly, according to the normative stance behind extrapolated volition, the very notion of “shouldness” is something that gets rescued. Many people seem to instinctively feel that ‘shouldness’ wants to map onto an ontologically basic, irreducible property of rightness, such that every cognitively powerful agent with factual knowledge about this property is thereby compelled to perform the corresponding actions. (“Moral internalism.”) But this demands an overly direct correspondence between our native sense that some acts have a compelling rightness quality about them, and wanting there to be an ontologically basic compelling rightness quality out there in the environment.

Despite the widespread appeal of moral internalism once people are exposed to it as an explicit theory, it still seems unfair to say that humans natively want or pretheoretically demand that this is what our sense of rightness correspond to. E.g., a hunter-gatherer, or someone else who’s never debated metaethics, doesn’t start out with an explicit commitment about whether a feeling of rightness must correspond to universes that have irreducible rightness properties in them. If you’d grown up thinking that your feeling of rightness corresponded to computing a certain logical function over universes, this would seem natural and non-disappointing.

Since “shouldness” (the notion of normativity) is something that itself may need rescuing, this rescue of “shouldness” is in some sense being implicitly invoked by the normative assertion that we should try to “rescue the utility function”.

This could be termed circular, but we could equally say that it is self-consistent. Or rather, we are appealing to some chaotic, not-yet-rescued, pretheoretic notion of “should” in saying that we should try to rescue concepts like “shouldness” instead of throwing them out the window. Afterwards, once we’ve performed the rescue and have a more coherent notion of concepts like “better”, we can see that the loop through the meta-level has entered a consistent state. According to this new ideal (less confused, but perhaps also seeming more abstract), it remains better not to give up on concepts like “better”.

The “extrapolated volition” rescue of shouldness is meant to bootstrap, by appeal to a pretheoretic and potentially confused state of wondering what is the right thing to do (or how we even should resolve this whole “rightness” issue, and whether maybe it would be better to just give up on it), into a more reflectively consistent state of mind. Afterwards we can see, both pretheoretically and in the light of our new explicit theory, that we ought to try to rescue the concept of oughtness, and adopt the rescued form of reasoning as a less-confused ideal. We will believe that, ideally, the best justification for extrapolated volition is to say that we know of no better candidate for the theory we’d arrive at if we thought about it for even longer. But since humans perhaps thankfully cannot directly rewrite their own code, we will also remain aware of whether this seems like a good idea in the pretheoretic sense, and perhaps stand prepared to unwind or jump back out of the system if it turns out that the explicit theory has big problems we didn’t know about when we originally jumped to it.

Reducing tension / “as if you’d always known it”

We can possibly see the desired output of “rescuing the utility function” as something like reducing a tension between native emotion-binding representations and reality, with a minimum of added fuss and complexity.

This can look a lot like the intuition pump, “Suppose you’d grown up always knowing the true state of affairs, and nobody had suggested that you panic over it or experience any existential angst; what would you have grown up thinking?” If you grow up with warmth-related emotions, already knowing that heat is disordered kinetic energy, and nobody has suggested to you that anyone ought to wail in existential angst about this, then you’ll probably grow up valuing heat-as-disordered-kinetic-energy (and this will be a low-tension resolution).

For more confusing grades of cognitive reductionism, like free will, people might spontaneously have difficulty reconciling their internal sense of freedom with being told about deterministic physical laws. But a good “rescue” of the corresponding sense of moral responsibility ought to end up looking like the sort of thing you’d take for granted as a quiet, obvious-seeming mapping of your sense of moral responsibility onto the physical universe, if you’d grown up taking those laws of physics for granted.

“Rescuing” pretheoretic emotions and intuitions versus “rescuing” explicit moral theories

Rewinding past factual-mistake-premised explicit moral theories

In the heat metaphor, suppose you’d previously adopted a ‘caloric fluid’ model of heat, and the explicit belief that this caloric fluid was what was valuable. You still have wordless intuitions about warmth and good feelings about warmth. You also have an explicit world-model that heat corresponds to caloric fluid, and an explicit moral theory that this caloric fluid is what’s good about heat.

Then science discovers that heat is disordered kinetic energy. Should we try to rescue our moral feelings about caloric by looking for the closest thing in the universe to caloric fluid—electricity, maybe?

If we now reconsider the arguments for “rescuing the utility function”, we find that we have more choices beyond “looking for the closest thing to caloric” and “giving up entirely on warm feelings”. An additional option is to try to rescue the intuitive sense of warmth, but not the explicit beliefs and explicit moral theories about “caloric fluid”.

If we instead choose to rescue the pretheoretic emotion, we could see this as retracing our steps after being led down a garden path of bad reasoning, aka “not rescuing the garden path”. We started with intuitively good feelings about warmth, came to believe a false model about the causes of warmth, reacted emotionally to this false model, and developed an explicit moral theory about caloric fluid.

The extrapolated-volition model of normativity (what we would want* if we knew all the facts) suggests that we could see the reasoning that followed adoption of the false caloric model as “mistaken” and not rescue it. When we’re dealing with explicit moral beliefs that grew up around a false model of the world, we have the third option to “rewind and rescue” rather than “rescue” or “give up”.

Nonmetaphorically: Suppose you believe in a divine command theory of metaethics; goodness is equivalent to God wanting something. Then one day you realize that there’s no God in which to ground your moral theory.

In this case we have three options for resolution, all of which are reflectively consistent within themselves, and whose arguments may appeal to our currently-confused pretheoretic state:

  • (a) Go about wailing in horror about the unfillable gap in the universe left by the absence of God.

  • (b) Try to rescue the explicit divine command theory, e.g. by looking for the closest thing to a God and re-anchoring the divine command theory there.

  • (c) Give up on the explicit model of divine command theory; instead, try to unwind past the garden path you went down after your native emotions reacted to the factually false model of God. Try to remap the pretheoretic emotions and intuitions onto your new model of the universe.

Again, (a), (b), and (c) all seem reflectively consistent, in the sense that a simple agent fully in one of these states will not want to enter either of the other two states. But given these three options, a confused agent might reasonably find either (b) or (c) more pretheoretically compelling than (a), and also find (c) more pretheoretically compelling than (b).

The notion of “rescuing” isn’t meant to erase the notion of “mistakes” and “saying oops” with respect to (b)-vs.-(c) alternatives. The arguments for “rescuing” warmth implicitly assumed that we were talking about a pretheoretic normative intuition (e.g. an emotion associated with warmth), not explicit models and theories about heat that could just as easily be revised.

Conversely, when we’re dealing with preverbal intuitions and emotions whose natively bound representations are in some way running skew to reality, we can’t rewind past the fact of our emotions binding to particular mental representations. We were literally born that way. Then our only obvious alternatives are to (a) give up entirely on that emotion and value, or (c) rescue the intuitions as best we can. In this case (c) seems more pretheoretically appealing, ceteris paribus and by default.

(For example, suppose you were an alien that had grown up accepting commands from a Hive Queen; you had a pretheoretic sense of the Hive Queen as knowing everything, and you mostly operated on an emotional-level Hive-Queen-command theory of rightness. One day, you begin to suspect that the Hive Queen isn’t actually omniscient. Your alien version of “rescuing the utility function” might say to rescue the utility function by allowing valid commands to be issued by Hive Queens that know a lot but aren’t omniscient. Or it might say to try to build a superintelligent Hive Queen that would know as much as possible, because in a pretheoretic sense that would feel better. The aliens can’t rewind past their analogue of divine command theory because, by hypothesis, the aliens’ equivalent of divine command metaethics is built into them on a pretheoretic and emotional level. Though of course, in this case, such aliens seem more likely to actually resolve their tension by asking the Hive Queen what to do about it.)

Pos­si­bil­ity of res­cu­ing non-mis­take-premised ex­plicit moral theories

Sup­pose Alice has an ex­plicit be­lief that pri­vate prop­erty ought to be a thing, and this be­lief did not de­velop af­ter she was told that ob­jects had tiny XML tags declar­ing their ir­re­ducible ob­jec­tive own­ers, nor did she origi­nally ar­rive at the con­clu­sion based on a model in which God as­signed trans­fer­able own­er­ship of all ob­jects at the dawn of time. We can sup­pose, some­what re­al­is­ti­cally, that Alice is a hu­man and has a prethe­o­retic con­cept of own­er­ship as well as de­serv­ing re­wards for effort, and was raised by small-l liber­tar­ian par­ents who told her true facts about how East Ger­many did worse eco­nom­i­cally than West Ger­many. Over time, she came to adopt an ex­plicit moral the­ory of “pri­vate prop­erty”: own­er­ship can only trans­fer by con­sent, and that vi­o­la­tions of this rule vi­o­late the just-re­wards prin­ci­ple.

One day, Alice starts having trouble with her moral system because she’s realized that property is made of atoms, and that even the very flesh in her body is constantly exchanging oxygen and carbon dioxide with the publicly owned atmosphere. Can atoms really be privately owned?

The confused Alice now again sees three options, all of them reflectively consistent on their own terms if adopted:

  • (a) Give up on everything to do with ownership or deserving rewards for effort; regard these emotions as having no valid referents.

  • (b) Try to rescue the explicit moral theory by saying that, sure, atoms can be privately owned. Alice owns a changeable number of carbon atoms inside her body, and she won’t worry too much about how they get exchanged with the atmosphere; that’s just the obvious way to map private property onto a particle-based universe.

  • (c) Try to rewind past the explicit moral theory, and figure out from scratch what to do with emotions about “deserves reward” or “owns”.

Leaving aside what you think of Alice’s explicit moral theory, it’s not obvious that Alice will end up preferring (c) to (b), especially since Alice’s current intuitive state is influenced by her currently-active explicit theory of private property.

Unlike the divine command theory, Alice’s private property theory was not (obviously to Alice) arrived at through a path that traversed wrong beliefs of simple fact. With the divine command theory, since it was critically premised on a wrong factual model, we face the prospect of having to stretch the theory quite a lot in order to rescue it, making it less intuitively appealing to a confused mind than the alternative prospect of stretching the pretheoretic emotions a lot less. Whereas from Alice’s perspective, she can just as easily pick up the whole moral theory and morph it onto reductionist physics with all the internal links intact, rather than needing to rewind past anything.

We at least have the apparent option of trying to rescue Alice’s utility function in a way that preserves her explicit moral theories not based on bad factual models—the steps of her previous explicit reasoning that did not, of themselves, introduce any new tensions or problems in mapping her emotions or morals onto a new representation. Whether or not we ought to do this, it’s a plausible possibility on the table.

Which explicit theories to rescue?

It’s not yet obvious where to draw the line on which explicit moral theories to rescue. So far as we can currently see, any of the following might be a reasonable way to tell a superintelligence to extrapolate someone’s volition:

  • Preserve explicit moral theories wherever it doesn’t involve an enormous stretch.

  • Be skeptical of explicit moral theories that were arrived at by fragile reasoning processes, even if they could be rescued in an obvious way.

  • Only extrapolate pretheoretic intuitions.

Again, all of these viewpoints are internally consistent (they are degrees of freedom in the metaphorical meta-utility-function), so the question is which rule for drawing the line seems most intuitively appealing in our present state of confusion:

Argument from adding up to normality

Arguendo: Preserving explicit moral theories is important for having the rescued utility function add up to normality. After rescuing my notion of “shouldness”, I should, by default, still see mostly the same things as rescued-right.

Suppose Alice was previously a moral internalist, and thought that some things were inherently irreducibly right, such that her very notion of “shouldness” needed rescuing. That doesn’t necessarily introduce any difficulties into re-importing her beliefs about private property. Alice may have previously refused to consider some arguments against private property because she thought it was irreducibly right, but this issue in extrapolating her volition is separate from throwing out her entire stock of explicit moral theories because they all used the word “should”. By default, after we’re done rescuing Alice, unless we are doing something that’s clearly and explicitly correcting an error, her rescued viewpoint should look as normal-relative-to-her-previous-perspective as possible.

Argument from helping

Arguendo: Preserving explicit moral theories where possible is an important aspect of how an ideal advisor or superintelligence ought to extrapolate someone else’s volition.

Suppose Alice was previously a moral internalist, and thought that some things were inherently irreducibly right, such that her very notion of “shouldness” needed rescuing. Alice may not regard it as “helping” her to throw away all of her explicit theories and try to re-extrapolate her emotions from scratch into new theories. If there weren’t any factual flaws involved, Alice is likely to see it as less than maximally helpful to her if we needlessly toss one of her cherished explicit moral theories.

Argument from incoherence, evil, chaos, and arbitrariness

Arguendo: Humans are really, really bad at systematizing explicit moral theories; a supermajority of explicit moral theories in today’s world will be incoherent, evil, or both. Explicit moral principles may de facto be chosen mostly on the basis of, e.g., how hard they appear to cheer for a valorized group. An extrapolation dynamic that tries to take into account all these chaotic, arbitrarily-generated group beliefs will end up failing to cohere.

Argument from fragility of goodness

Arguendo: Most of what we see as the most precious and important parts of ourselves are explicit moral theories like “all sapient beings should have rights”, which aren’t built into human babies. We may well have arrived at that destination through a historical trajectory that went through factual mistakes, like believing that all human beings had souls created equal by God and were loved equally by God. (E.g., Christian theology seems to have been, as a matter of historical fact, causally important in the development of explicit anti-slavery sentiment.) Tossing the explicit moral theories is as unlikely to be good, from our perspective, as tossing our brains and trying to rerun the process of natural selection to generate new emotions.

Argument from dependency on empirical results

Arguendo: Which version of extrapolation we’ll actually find appealing will depend on which extrapolation algorithm turns out to have a reasonable answer. We don’t have enough computing power to guess, right now, whether:

  • Any reasonable-looking construal of “Toss out all the explicit cognitive content and redo by throwing pretheoretic emotions at true facts” leads to an extrapolated volition that lacks discipline and coherence, looking selfish and rather angry, failing to regenerate most of altruism or Fun Theory; or

  • Any reasonable-looking construal of “Try to preserve explicit moral theories” leads to an incoherent mess of assertions about various people going to Hell and capitalism being bad for you.

Since we can’t guess, using our present computing power, which rule would cause us to recoil in horror, but the actual horrified recoil would settle the question, we can only defer this single bit of information to one person who’s allowed to peek at the results.

(Counterargument: “Perhaps the ancient Greeks would have recoiled in horror if they saw how little the future would think of a glorious death in battle, thus picking the option we see as wrong, using the stated rule.”)