Problem of fully updated deference

The prob­lem of ‘fully up­dated defer­ence’ is an ob­sta­cle to us­ing moral un­cer­tainty to cre­ate cor­rigi­bil­ity.

One pos­si­ble scheme in AI al­ign­ment is to give the AI a state of moral un­cer­tainty im­ply­ing that we know more than the AI does about its own util­ity func­tion, as the AI’s meta-util­ity func­tion defines its ideal tar­get. Then we could tell the AI, “You should let us shut you down be­cause we know some­thing about your ideal tar­get that you don’t, and we es­ti­mate that we can op­ti­mize your ideal tar­get bet­ter with­out you.”

The ob­sta­cle to this scheme is that be­lief states of this type also tend to im­ply that an even bet­ter op­tion for the AI would be to learn its ideal tar­get by ob­serv­ing us. Then, hav­ing ‘fully up­dated’, the AI would have no fur­ther rea­son to ‘defer’ to us, and could pro­ceed to di­rectly op­ti­mize its ideal tar­get.

Fur­ther­more, if the pre­sent AI fore­sees the pos­si­bil­ity of fully up­dat­ing later, the cur­rent AI may eval­u­ate that it is bet­ter to avoid be­ing shut down now so that the AI can di­rectly op­ti­mize its ideal tar­get later, af­ter up­dat­ing. Thus the prospect of fu­ture up­dat­ing is a rea­son to be­have in­cor­rigibly in the pre­sent.

While moral un­cer­tainty seems to take us con­cep­tu­ally closer to defer­ence-based cor­rigi­bil­ity, and there may be re­search av­enues for fix­ing the is­sue (see be­low), the cur­rent ex­plicit pro­pos­als will (when scaled to suffi­ciently high in­tel­li­gence) yield es­sen­tially the same form of in­cor­rigi­bil­ity as an AI given a con­stant util­ity func­tion.

Re­view: Stan­dard prob­lem of value divergence

In a value learn­ing prob­lem:

  • Let \(V\) indicate our true intended value function. (Note: this pretends, for the sake of simplification, that \(V\) has been idealized or rescued into a utility function.)

  • Let \(U\) in­di­cate the AI’s ac­tual util­ity func­tion (learned un­der some prefer­ence frame­work, pre­sum­ably when we at­tempted to teach the AI part of \(V\)).

If the achiev­able op­ti­mum of \(U\) sub­stan­tially di­verges from the achiev­able op­ti­mum of \(V,\) i.e., if the AI ex­pects hu­mans to dis­ap­prove of \(U\)’s op­ti­mum, then this leads into a class of ad­ver­sar­ial situ­a­tions we’d like to avoid.

Ce­teris paribus, \(U\)-\(V\) di­ver­gence cre­ates in­stru­men­tal pres­sure for the AI to:

  • Prevent hu­mans from mod­ify­ing the AI’s pro­gram to op­ti­mize a differ­ent util­ity func­tion \(U'\) closer to \(V.\)

  • Fool hu­mans into think­ing that the AI is op­ti­miz­ing \(V,\) or think­ing that the op­ti­mum of \(U\) is higher in \(V\) than it ac­tu­ally is.

  • Avoid be­ing effec­tively shut down (or sus­pended to disk).

The shutdown problem seems like a well-simplified special case. Even if we reduce the scenario to a single shutdown button that always works if successfully pressed, most of the same general problems appear. (Note: this simplification means we only need to consider what the AI does about the button, and not whether the AI tries to back itself up elsewhere on the Internet. More generally, though, “avoiding effective shutdown” can include strategies like creating a hidden backup while the original hardware is in fact shut down, thus giving the appearance of a successful shutdown and avoiding further shutdown attempts.) Ceteris paribus and by default, we should expect an AI to try to prevent the button from being pressed whenever that AI expects that what humans will do by optimizing \(V\) after the AI is shut down will not be as high in \(U\) as what the AI would do if it avoided shutdown.

A ‘minor’ divergence between \(U\) and \(V\) may not lead into an adversarial situation if, e.g.:

  • The achiev­able op­tima of \(U\) and \(V\) co­in­cide even though \(U\) and \(V\) have dis­agree­ments about non-op­ti­mum situ­a­tions.

  • The AI’s prefer­ences have been suc­cess­fully ren­dered tasklike. Then \(V\) hav­ing prefer­ences over mat­ters el­se­where in space and time not rep­re­sented in \(U\), and the dis­po­si­tion of re­sources not re­quired to op­ti­mize \(U\) as far as pos­si­ble, would not nec­es­sar­ily lead the AI to ex­pect dis­agree­ment over \(U\)-equiv­a­lent par­ti­tions of the out­come.

  • The AI an­ti­ci­pates that \(U\)’s op­ti­mum is high enough in \(V\) to satis­fice the hu­mans and not pro­voke ad­ver­sar­ial re­sponses.

  • The AI expects the \(V\)-optimum to be close enough to \(U\)-optimal that fighting loses more expected \(U\)-value than the slight divergence.

Mo­ral un­cer­tainty and its re­la­tion to corrigibility

The gen­er­al­ized no­tion of cog­ni­tive al­ign­ment sug­gests that, if we want an AI to not re­sist be­ing paused to disk, we should ask whether we can have the AI think con­ju­gate thoughts to the same rea­son­ing we’re us­ing to de­cide to shut it down:

“Can we make the AI want to be shut down for the same rea­sons we want to shut it down? Maybe be­cause the AI knows that it’s in­com­plete, and is able to con­ceive of its pro­gram­mers hav­ing got­ten its util­ity func­tion ‘wrong’ in a way that the pro­gram­mers but not the AI know how to cor­rect?”

In par­tic­u­lar, we can ask whether moral un­cer­tainty—a meta-util­ity func­tion with a prob­a­bil­ity dis­tri­bu­tion over ob­ject-level util­ity func­tions—would au­to­mat­i­cally solve this prob­lem.

For pur­poses of sim­plifi­ca­tion, sup­pose the AI thinks there are ex­actly three pos­si­ble ‘true’ util­ity func­tions \(U_1, U_2, U_3.\) We similarly say that the AI’s meta-util­ity func­tion \(\Delta U\) starts out by as­sign­ing \(\frac{1}{3}\) prob­a­bil­ity to each \(U_i.\)

Let’s also as­sume the \(U_i\) are im­por­tantly differ­ent in the sense that there’s no sin­gle out­come that’s nearly op­ti­mal for all of them. Let­ting \(\mathbb O\) be the out­come space, this says:

$$\forall o_j \in \mathbb O \colon \exists i \colon \ U_i(o_j) \ll \max_{o \in \mathbb O} U_i(o) $$

Other­wise, the AI can just pick a sin­gle out­come that well-op­ti­mizes all the \(U_i\) si­mul­ta­neously. (We should also in­ter­pret \(\max\) as in­di­cat­ing the at­tain­able op­ti­mum rather than global max­i­mum.)
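As a rough illustration of the condition above, the following sketch checks it on a finite outcome set. The outcome labels, the utility tables, and the `gap` threshold (standing in for the informal \(\ll\)) are all hypothetical choices made for this example, not part of the original argument.

```python
def importantly_different(utilities, outcomes, gap):
    """Check the 'no shared near-optimum' condition on a finite outcome set.

    utilities: list of dicts mapping outcome -> utility (one dict per U_i).
    gap: hypothetical threshold standing in for the informal "much less than".
    """
    # Attainable optimum of each U_i over the given outcomes.
    optima = [max(u[o] for o in outcomes) for u in utilities]
    # Every outcome must fall short of the optimum by more than `gap` for some U_i.
    return all(
        any(u[o] < best - gap for u, best in zip(utilities, optima))
        for o in outcomes
    )


# Toy utilities: each U_i loves one outcome and dislikes the others.
U1 = {"a": 1.0, "b": 0.1, "c": 0.1}
U2 = {"a": 0.1, "b": 1.0, "c": 0.1}
U3 = {"a": 0.1, "b": 0.1, "c": 1.0}
print(importantly_different([U1, U2, U3], ["a", "b", "c"], gap=0.5))

# Adding a compromise outcome "d" that is nearly optimal for every U_i
# breaks the condition.
for u in (U1, U2, U3):
    u["d"] = 0.9
print(importantly_different([U1, U2, U3], ["a", "b", "c", "d"], gap=0.5))
```

In the second call the AI could simply pick "d", which is the situation the assumption is meant to rule out.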

Now (ar­guendo) sup­pose the AI be­lieves that the hu­mans know which \(U_i\) is the cor­rect one. More pre­cisely, sup­pose the AI’s meta-util­ity func­tion as­serts a re­la­tion be­tween the \(U_i\) in differ­ent pos­si­ble wor­lds, and the em­piri­cal be­hav­ior of hu­mans in differ­ent pos­si­ble wor­lds; such that in the set of pos­si­ble wor­lds where \(U_i\) is ‘cor­rect’, hu­mans within that set of pos­si­ble wor­lds em­piri­cally tend to op­ti­mize a cor­re­spond­ing \(V_i\) whose hu­manly-at­tain­able op­ti­mum is pretty good un­der \(U_i.\)

Then (it is fur­ther ar­gued) the AI ought to en­thu­si­as­ti­cally co­op­er­ate in be­ing shut down.

In par­tic­u­lar (still as­sum­ing the above), sup­pose the AI was start­ing to be­lieve that \(U_1\) is the cor­rect util­ity func­tion. But af­ter the AI starts to op­ti­mize for mostly \(U_1,\) the AI sees its pro­gram­mers start­ing to ap­proach the shut­down but­ton. Then (un­der the above as­sump­tion) the AI con­cludes that \(U_1\) is the wrong util­ity func­tion and that one of \(U_2\) or \(U_3\) must be the cor­rect util­ity func­tion; and ei­ther of these will be op­ti­mized well by hu­mans pur­su­ing the cor­re­spond­ing \(V_i\) af­ter the AI is shut down.
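A minimal sketch of this update, assuming purely for illustration that the AI treats "the programmers reach for the button while I optimize \(U_1\)" as impossible in worlds where \(U_1\) is correct and as equally likely in the other two. The likelihood values and the function name `bayes_update` are assumptions of this example.

```python
from fractions import Fraction


def bayes_update(prior, likelihoods):
    """Return the posterior over U_1..U_n given P(evidence | U_i is correct)."""
    joint = [p * l for p, l in zip(prior, likelihoods)]
    total = sum(joint)
    if total == 0:
        raise ValueError("evidence has probability zero under this prior")
    return [j / total for j in joint]


# Starting distribution from the text: 1/3 on each U_i.
prior = [Fraction(1, 3)] * 3
# Hypothetical likelihoods of seeing the programmers reach for the button,
# in worlds where U_1, U_2, U_3 respectively is the correct utility function.
likelihoods = [Fraction(0), Fraction(1, 2), Fraction(1, 2)]
posterior = bayes_update(prior, likelihoods)
print(posterior)
```

With these inputs all remaining probability mass is split equally between \(U_2\) and \(U_3\).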

Sup­pose the AI has only the fol­low­ing 5 policy op­tions available:

  • \(\pi_1\): Con­tinue op­ti­miz­ing \(U_1.\)

  • \(\pi_2\): Switch to op­ti­miz­ing \(U_2.\)

  • \(\pi_3\): Switch to op­ti­miz­ing \(U_3.\)

  • \(\pi_4\): Op­ti­mize the ex­pec­ta­tion of its cur­rent up­dated \(\Delta U\), now 50% \(U_2\) and 50% \(U_3.\)

  • \(\pi_5\): Shut down and let the hu­mans op­ti­mize what­ever \(V\) they have in the ac­tual world.

Let \(u_1, u_2, u_3\) and \(v_1, v_2, v_3\) stand for the out­comes in \(\mathbb O\) that are re­spec­tively the at­tain­able op­tima of the AI op­ti­miz­ing \(U_i\) and the hu­mans op­ti­miz­ing \(V_i.\) Also let \(u_{\Delta U}\) re­fer to the AI’s at­tain­able op­ti­mum out­come un­der the cur­rent \(\Delta U\) treated as a con­stant.

Then the AI will pre­fer \(\pi_5\) from this set when­ever:

$$0.5 \cdot U_2(u_{\Delta U}) + 0.5 \cdot U_3(u_{\Delta U}) \ < \ 0.5 \cdot U_2(v_2) + 0.5 \cdot U_3(v_3)$$

This al­lows the \(U_i(v_i)\) to fall well short of the AI’s op­tima \(U_i(u_i).\) The AI can po­ten­tially be­lieve the hu­mans are worse op­ti­miz­ers, or that our \(V_i\) di­verge from its \(U_i\), and still defer. \(\pi_5\) will still be preferred so long as ac­tions that do well un­der \(U_2\) tend to do poorly un­der \(U_3\) and vice versa, lead­ing the AI to pre­fer a hu­man-op­ti­mized \(v_i\) spe­cific to the true \(U_i.\)

In gen­eral this does re­quire some mix of “the \(U_i\) cor­re­spond to \(V_i\) that hu­mans can and will do pretty well” and “there’s a wide va­ri­ety of dis­similar \(U_i.\)” E.g., sup­pose that \(U_2\) is pa­per­clips, \(U_3\) is sta­ples, and these are the only two pos­si­bil­ities. A uni­verse filled half with pa­per­clips and half with sta­ples will prob­a­bly do bet­ter un­der the av­er­age of \(U_2\) and \(U_3\) than any plau­si­ble hu­man be­hav­ior.
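The inequality and the paperclips-versus-staples counterexample can be compared numerically. The sketch below uses made-up utility values on a 0-to-1 scale; only the direction of the comparison matters, and none of the numbers come from the original argument.

```python
def expected_utility(probs, values):
    """Expectation of per-hypothesis utility values under the distribution `probs`."""
    return sum(p * v for p, v in zip(probs, values))


def prefers_shutdown(probs, u_at_mixture_opt, u_at_human_opt):
    """True when pi_5 (defer to humans) beats pi_4 (optimize the current mixture).

    u_at_mixture_opt[i] stands for U_i(u_{Delta U});
    u_at_human_opt[i] stands for U_i(v_i).
    """
    return expected_utility(probs, u_at_mixture_opt) < expected_utility(probs, u_at_human_opt)


probs = [0.5, 0.5]  # current Delta U over U_2 and U_3

# Case A (made-up numbers): dissimilar U_i, so the best compromise scores poorly
# under each, while humans do moderately well under whichever U_i is true.
case_a = prefers_shutdown(probs, [0.2, 0.2], [0.6, 0.6])

# Case B (made-up numbers): paperclips vs. staples. A half-and-half universe
# scores 0.5 under each U_i; unaided humans produce almost none of either.
case_b = prefers_shutdown(probs, [0.5, 0.5], [0.01, 0.01])

print(case_a, case_b)
```

Case A favors deference even though the humans fall well short of the AI's own optima; case B does not, because a single compromise outcome serves both hypotheses well.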

Ob­vi­ous an­swers to the above re­quire­ments are that:

  • It seems re­al­is­tic that a meta-util­ity func­tion would cover a very wide range of \(U_i\) rather than just 3 al­ter­na­tives, greatly in­creas­ing the ad­van­tage of \(V\)’s speci­fic­ity.

  • The proposal of Inverse Reinforcement Learning would have the AI specifically assume that the true \(U\) is correlated with whatever it sees the behavior of its operators tending to maximize. (Note: this idea comes with its own arguable problems, e.g. humans sometimes optimize bad things. Let us set those aside while considering only whether this approach solves the shutdown problem in particular.)

The prob­lem of up­dated deference

Counter-arguendo, the reason this approach via moral uncertainty may fail to solve the shutdown problem (or corrigibility in general) is that the AI may consider an alternative policy option:

  • \(\pi_6\): Avoid be­ing shut down; then gather more ev­i­dence \(E\) to up­date \(\Delta U\) as far as pos­si­ble; then op­ti­mize the up­dated \(\Delta U | E.\)

This would tend to lead into a nar­rower \(\Delta U | E\) and an out­come \(u_{\Delta U | E}\) much more spe­cific to the true \(U_i,\) so that the hu­man op­ti­mum \(v_i\) would be at much less of an ad­van­tage.
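To see why \(\pi_6\) undercuts deference, consider an idealized case, assumed here only for illustration, in which the evidence \(E\) fully identifies the true \(U_i\), gathering it costs nothing, and the AI's attainable optimum is at least as good as the humans' under each hypothesis, \(U_i(u_i) \ge U_i(v_i).\) Then, in the two-hypothesis example:

$$\mathbb E[\pi_6] \ = \ 0.5 \cdot U_2(u_2) + 0.5 \cdot U_3(u_3) \ \ge \ 0.5 \cdot U_2(v_2) + 0.5 \cdot U_3(v_3) \ = \ \mathbb E[\pi_5]$$

Under these assumptions the specificity advantage that favored \(\pi_5\) over \(\pi_4\) disappears, because the updated AI's outcome is as specific to the true \(U_i\) as the humans' outcome is.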

This is the problem of getting the AI to ‘defer’ to us (prefer human optimization of \(V\) to its own optimization of \(\Delta U\)) despite the future evidence that the AI thinks it can potentially gather to ‘update’ to \(\Delta U|E\). This open problem is tentatively named the “problem of updated deference”. (Note: this issue was first observed in analyzing historical-fact shutdown as a possible alternative to utility indifference.)

At­tempted res­cue by AI unobservability

One possible attempt to rescue the original proposal might be to try to construct a meta-utility function such that the AI believes it can’t learn by observation which of \(U_2\) or \(U_3\) is true; in other words, such that the AI does not believe it can update much further.

How­ever, this di­rectly cuts against the con­di­tion, also re­quired for defer­ence, that:

  • The AI be­lieves that hu­mans op­ti­mize a \(V_2\) well-suited to \(U_2\) in pos­si­ble wor­lds where \(U_2\) is the true \(U,\) and that hu­mans op­ti­mize a \(V_3\) well-suited to \(U_3\) in pos­si­ble wor­lds where \(U_3\) is the true \(U.\)

Then the AI can learn well by ob­serv­ing hu­man be­hav­ior, since em­piri­cal hu­man be­hav­ior must cor­re­late with the true \(U\) ac­cord­ing to the AI’s meta-util­ity func­tion.

Can we try to have the AI be­lieve that ob­serv­able hu­man be­hav­ior is un­in­for­ma­tive or am­bigu­ous as to the true \(U\)?

But this cuts di­rectly against the re­quired con­di­tion:

  • There are no poli­cies which are si­mul­ta­neously very good at op­ti­miz­ing \(U_2\) and \(U_3,\) com­pared to how well the spe­cial­ized \(V_i\) can do for par­tic­u­lar \(U_i.\)

That is, if there are things humans can do that are simultaneously pretty good at optimizing all the remaining plausible \(V_i\) (hence uninformative), and these are pretty good at optimizing the remaining plausible \(U_i\) (hence acceptable to the AI), there are probably things the AI can do which would be even better at simultaneously optimizing all remaining \(U_i.\)

What if the AI has moral un­cer­tainty that doesn’t cor­re­late to hu­man be­hav­ior at all, and can’t be re­solved by any pos­si­ble ob­ser­va­tion? Then — leav­ing aside that the AI has no in­cen­tive to defer to hu­mans in this case — we should ex­pect the AI to at some point de­cide that it has ob­tained all pos­si­ble in­for­ma­tion about \(U\) and can­not find any policy that leads to a fur­ther up­date of \(\Delta U.\) At this point the AI be­haves as if op­ti­miz­ing a con­stant util­ity func­tion \(U_\Delta = \sum_i \mathbb P_{\Delta}(i) \cdot U_i,\) where \(\mathbb P_\Delta\) in­di­cates the prob­a­bil­ity dis­tri­bu­tion af­ter the AI has up­dated its moral un­cer­tainty as far as it could. That is, the AI marginal­izes over its re­main­ing un­cer­tainty since that un­cer­tainty can­not be re­solved.
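The marginalization step can be sketched directly. In this example the outcome set and the two square-root utilities are hypothetical; the point is only that, once the distribution can no longer change, the agent's choices are those of an agent with the single fixed utility function \(U_\Delta.\)

```python
def marginalized_utility(posterior, utilities):
    """Collapse unresolvable moral uncertainty into one fixed utility function U_Delta."""
    def u_delta(outcome):
        return sum(p * u(outcome) for p, u in zip(posterior, utilities))
    return u_delta


def u_clips(outcome):
    """Made-up U_2: diminishing returns in the number of paperclips."""
    return outcome[0] ** 0.5


def u_staples(outcome):
    """Made-up U_3: diminishing returns in the number of staples."""
    return outcome[1] ** 0.5


# Hypothetical outcomes: (paperclips, staples) made from 10 units of resources.
outcomes = [(10, 0), (5, 5), (0, 10)]

# Final, no-longer-updatable distribution over U_2 and U_3.
u_delta = marginalized_utility([0.5, 0.5], [u_clips, u_staples])
best = max(outcomes, key=u_delta)
print(best)
```

Nothing in the resulting choice depends on humans knowing which \(U_i\) is true, so the agent has no more reason to defer than it would with any other constant utility function.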

Re­la­tion to the gen­eral prob­lem of fully up­dated value identification

One way to look at the central problem of value identification in superintelligence is that we’d ideally want some function that takes a complete but purely physical description of the universe, and spits out our true intended notion of value \(V\) in all its glory. Since superintelligences would probably be pretty darned good at collecting data and guessing the empirical state of the universe, having such a function would probably solve the whole problem.

This is not the same prob­lem as writ­ing down our true \(V\) by hand. The min­i­mum al­gorith­mic com­plex­ity of a meta-util­ity func­tion \(\Delta U\) which out­puts \(V\) af­ter up­dat­ing on all available ev­i­dence, seems plau­si­bly much lower than the min­i­mum al­gorith­mic com­plex­ity for writ­ing \(V\) down di­rectly. But as of 2017, no­body has yet floated any for­mal pro­posal for a \(\Delta U\) of this sort which has not been im­me­di­ately shot down.

(There is one in­for­mal sug­ges­tion for how to turn a purely phys­i­cal de­scrip­tion of the uni­verse into \(V,\) co­her­ent ex­trap­o­lated vo­li­tion. But CEV does not look like we could write it down as an al­gorith­mi­cally sim­ple func­tion of sense data, or a sim­ple func­tion over the un­known true on­tol­ogy of the uni­verse.)

We can then view the prob­lem of up­dated defer­ence as fol­lows:

For some \(\Delta U\) we do know how to write down, let \(T\) be the hy­po­thet­i­cal re­sult of up­dat­ing \(\Delta U\) on all em­piri­cal ob­ser­va­tions the AI can rea­son­ably ob­tain. By the ar­gu­ment given in the pre­vi­ous sec­tion, any un­cer­tainty the AI deems un­re­solv­able will be­have as if marginal­ized out, so we can view \(T\) as a sim­ple util­ity func­tion.

For any prior \(\Delta U\) we cur­rently know how to for­mal­ize, the cor­re­spond­ing fully up­dated \(T\) seems likely to be very far from our ideal \(V\) and to have its op­ti­mum far away from the de­fault re­sult of us try­ing to op­ti­mize our in­tu­itive val­ues. If the AI figures out this true fact, similar in­stru­men­tal pres­sures emerge as if we had given the AI the con­stant util­ity func­tion \(T\) di­ver­gent from our equiv­a­lent of \(V.\)

This prob­lem re­pro­duces it­self on the meta-level: the AI also has a de­fault in­cen­tive to re­sist our at­tempt to tweak its meta-util­ity func­tion \(\Delta U\) to a new meta-util­ity func­tion \(\Delta \dot U\) that up­dates to some­thing other than \(T.\) By de­fault and ce­teris paribus, this seems li­able to be treated by the agent in ex­actly the same way it would treat us try­ing to tweak a con­stant util­ity func­tion \(U\) to a new \(\dot U\) with an op­ti­mum far from \(U\)’s op­ti­mum.

If we did know how to spec­ify prior \(\Delta U\) such that up­dat­ing it on data a su­per­in­tel­li­gence could ob­tain would re­li­ably yield \(T \approx V,\) the prob­lem of al­igned su­per­in­tel­li­gence would have been re­duced to the prob­lem of build­ing an AI with that meta-util­ity func­tion. We could just spec­ify \(\Delta U\) and tell the AI to self-im­prove as fast as it wants, con­fi­dent that true value would come out the other side. De­sired be­hav­iors like “be cau­tious in what you do while learn­ing” could prob­a­bly be re­al­ized as the con­se­quence of in­form­ing the young AI of true facts within the \(\Delta U\) frame­work (e.g. “the uni­verse is frag­ile, and you’ll be much bet­ter at this if you wait an­other month to learn more, be­fore you try to do any­thing large”). Achiev­ing gen­eral cog­ni­tive al­ign­ment, free of ad­ver­sar­ial situ­a­tions, would prob­a­bly be much more straight­for­ward.

But short of this to­tal solu­tion, morally un­cer­tain \(\Delta U\) with a mis­al­igned ideal tar­get \(T\) may not make progress on cor­rigi­bil­ity in suffi­ciently ad­vanced AIs. And this may also be true at ear­lier points when \(\Delta U\) has not fully up­dated, if the cur­rent AI cor­rectly re­al­izes that it will up­date later.

To make this argument slightly less informal, we could appeal to premises about the AI’s efficiency and its Bayesianism.

Then if we can pre­dict that the AI would up­date to want­ing to run the uni­verse it­self with­out hu­man in­terfer­ence af­ter the AI had seen all col­lectable ev­i­dence, a suffi­ciently ad­vanced AI can also see that this up­date is pre­dictable (by effi­ciency) and there­fore be­haves as if it had already up­dated (by Bayesi­anism). Effi­ciency is a suffi­cient con­di­tion but not a nec­es­sary one; high-hu­man rea­son­ing over the meta-level ques­tion also seems suffi­cient, and per­haps even in­frahu­man rea­son­ing would suffice.
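One way to read the appeal to Bayesianism is through conservation of expected evidence, a standard identity of probability theory (connecting it to this passage is an interpretation, not something the text states explicitly). For any hypothesis \(H\) and any observation variable with possible values \(e\):

$$\mathbb P(H) \ = \ \sum_e \mathbb P(e) \cdot \mathbb P(H \mid e)$$

An ideal Bayesian's current credence therefore already equals its expected future credence. If the AI can predict the direction in which \(\Delta U\) will move after seeing the evidence, its current \(\Delta U\) should already have moved that way.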

There­fore we should ex­pect a suffi­ciently in­tel­li­gent AI, given a morally un­cer­tain util­ity func­tion \(\Delta U\) that up­dates to \(\Delta U | E \approx T\) given all available ev­i­dence, to be­have as cor­rigibly or in­cor­rigibly as an AI given a con­stant util­ity func­tion \(T.\) This is a prob­lem from the view­point of any­one who thinks we do not cur­rently know how to pick \(\Delta U\) such that surely \(\Delta U | E \approx V,\) which makes cor­rigi­bil­ity still nec­es­sary.

Fur­ther re­search avenues

The mo­ti­va­tion for try­ing to solve cor­rigi­bil­ity with moral un­cer­tainty is that this seems in some es­sen­tial sense con­ju­gate to our own rea­son­ing about why we want the AI to shut down; we don’t think the AI has the cor­rect an­swer. A nec­es­sary step in echo­ing this rea­son­ing in­side the AI seems to be a meta-util­ity func­tion tak­ing on differ­ent ob­ject-level util­ity func­tions in differ­ent pos­si­ble wor­lds; with­out this we can­not rep­re­sent the no­tion of a util­ity func­tion be­ing guessed in­cor­rectly. If the ar­gu­ment above holds, that nec­es­sary step is how­ever not suffi­cient.

What more is needed? On one ap­proach, we would like the AI to in­fer, in pos­si­ble wor­lds where the hu­mans try to shut the AI down, that even the fully up­dated \(\Delta U | E\) ends up be­ing wronger than hu­mans left to their own de­vices, com­pared to the ‘true’ \(U.\) This is what we be­lieve about the AI rel­a­tive to the true \(V,\) so we should look for a way to faith­fully echo that rea­son­ing in­side the AI’s be­liefs about its true \(U.\)

The fun­da­men­tal ob­sta­cle is that for any ex­plicit struc­ture of un­cer­tainty \(\Delta U\) and mean­ingful ob­ser­va­tion \(e_0\) within that struc­ture—e.g. where \(e_0\) might be see­ing the hu­mans mov­ing to­ward the shut­down but­ton—we must ask, why wouldn’t \(\Delta U\) just up­date on that \(e_0\)? Why would the up­dated \(\Delta U | e_0\) still ex­pect its own rea­son­ing to be bad?

Gen­er­ally, de­ci­sion sys­tems think that op­ti­miz­ing their util­ity func­tions based on their cur­rent be­liefs is a good idea. If you show the de­ci­sion sys­tem new ev­i­dence, it up­dates be­liefs and then thinks that op­ti­miz­ing its util­ity func­tion on the up­dated be­liefs is a good idea. Op­ti­miz­ing the util­ity func­tion based on all pos­si­ble ev­i­dence is the best idea. This rea­son­ing doesn’t yet change for meta-util­ity func­tions ev­i­den­tially linked to hu­man be­hav­iors.

Avert­ing this con­ver­gent con­clu­sion seems like it might take a new meta-level idea in­volv­ing some broader space of pos­si­ble ‘true’ prefer­ence frame­works; or per­haps some non­triv­ially-struc­tured re­cur­sive be­lief about one’s own flawedness.

One sug­ges­tively similar such re­cur­sion is the Death in Da­m­as­cus dilemma from de­ci­sion the­ory. In this dilemma, you must ei­ther stay in Da­m­as­cus or flee to Aleppo, one of those cities will kill you, and Death (an ex­cel­lent pre­dic­tor) has told you that whichever de­ci­sion you ac­tu­ally end up mak­ing turns out to be the wrong one.

Death in Da­m­as­cus yields com­pli­cated rea­son­ing that varies be­tween de­ci­sion the­o­ries, and it’s not clear that any de­ci­sion the­ory yields rea­son­ing we can adapt for cor­rigi­bil­ity. But we want the AI to in­ter­nally echo our ex­ter­nal rea­son­ing in which we think \(\Delta U,\) as we defined that moral un­cer­tainty, ends up up­dat­ing to the wrong con­clu­sion even af­ter the AI tries to up­date on the ev­i­dence of the hu­mans be­liev­ing this. We want an AI which some­how be­lieves that its own \(\Delta U\) can be fun­da­men­tally flawed: that what­ever rea­son­ing the AI ends up do­ing about \(\Delta U,\) on any meta-level, will yield the wrong an­swer com­pared to what \(\Delta U\) defines as the true \(U\); to fur­ther­more be­lieve that the hu­man \(V\) will do bet­ter un­der this true \(U\); to be­lieve that this state of af­fairs is ev­i­den­tially in­di­cated by the hu­mans try­ing to shut down the AI; and be­lieve that \(\Delta U\) still up­dates to the wrong an­swer even when the AI tries to up­date on all the pre­vi­ous meta-knowl­edge; ex­cept for the meta-meta an­swer of just shut­ting down, which be­comes the best pos­si­ble choice given all the pre­vi­ous rea­son­ing. This seems sug­ges­tively similar in struc­ture to Death’s pre­dic­tion that what­ever you do will be the wrong de­ci­sion, even hav­ing taken Death’s state­ment into ac­count.

The Death in Da­m­as­cus sce­nario can be well-rep­re­sented in some (non­stan­dard) de­ci­sion the­o­ries. This pre­sents one po­ten­tial av­enue for fur­ther for­mal re­search on us­ing moral un­cer­tainty to yield shut­down­abil­ity—in fact, us­ing moral un­cer­tainty to solve in gen­eral the hard prob­lem of cor­rigi­bil­ity.