Separation from hyperexistential risk

A principle of AI alignment that does not seem reducible to other principles is "The AGI design should be widely separated in the design space from any design that would constitute a hyperexistential risk". A hyperexistential risk is a "fate worse than death", that is, any AGI whose outcome is worse than quickly killing everyone and filling the universe with paperclips.

As an example of this principle, suppose we could write a first-generation AGI which contained an explicit representation of our exact true value function \(V,\) but where, in this thought experiment, we were not absolutely sure that we'd solved the problem of getting the AGI to align on that explicit representation of a utility function. This would violate the principle of hyperexistential separation, because an AGI that optimizes \(V\) is near in the design space to one that optimizes \(-V.\) Similarly, suppose we can align an AGI on \(V\) but we're not certain we've built this AGI to be immune to decision-theoretic extortion. Then this AGI distinguishes the global minimum of \(V\) as the most effective threat against it, which is something that could increase the probability of \(V\)-minimizing scenarios being realized.
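The nearness of \(V\) and \(-V\) is not merely metaphorical: in any representation where a signed quantity encodes utility, the two differ by a single bit. A minimal sketch, assuming (purely for illustration) that an effective utility is stored as an IEEE 754 double:

```python
import struct

def flip_sign_bit(x: float) -> float:
    """Flip only the top (sign) bit of x's 64-bit IEEE 754 encoding."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    return struct.unpack(">d", struct.pack(">Q", bits ^ (1 << 63)))[0]

utility = 1234.5
corrupted = flip_sign_bit(utility)
print(corrupted)  # -1234.5: one flipped bit turns a V-maximizer's score into -V
```

Real AGI designs would not literally store \(U\) in one float, but the point stands: under this kind of naive encoding, the Hamming distance between "optimize \(V\)" and "optimize \(-V\)" is 1.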

The concern here is a special case of a shalt-not backfiring, whereby identifying a negative outcome to the system moves us closer in the design space to realizing it.

One seemingly obvious patch to avoid disutility maximization might be to give the AGI a utility function \(U = V + W\) where \(W\) says that the absolute worst possible thing that can happen is for a piece of paper to have written on it the SHA-256 hash of "Nopenopenope" plus 17. Then if, due to otherwise poor design permitting single-bit errors to have vast results, a cosmic ray flips the sign of the AGI's effective utility function, the AGI tiles the universe with pieces of paper like that; this is no worse than ordinary paperclips. Similarly, any extortion against the AGI would use such pieces of paper as a threat. \(W\) then functions as a honeypot or distractor for disutility maximizers which prevents them from minimizing our own true utility.
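A toy rendition of the honeypot idea, with made-up state representations and an arbitrary disvalue constant: a sign-flipped agent maximizing \(-U\) is drawn to the harmless honeypot state rather than to the true minimum of \(V\).

```python
import hashlib

# The paper's marker value from the text: SHA-256 of "Nopenopenope", plus 17,
# read as an integer. (The byte encoding is an illustrative choice.)
HONEYPOT = int(hashlib.sha256(b"Nopenopenope").hexdigest(), 16) + 17

def V(state: dict) -> float:
    """Stand-in for the true value function (illustrative)."""
    return state.get("human_flourishing", 0.0)

def W(state: dict) -> float:
    """Honeypot term: astronomically bad if the marker is ever written down."""
    return -1e30 if state.get("paper_says") == HONEYPOT else 0.0

def U(state: dict) -> float:
    return V(state) + W(state)

states = [
    {"human_flourishing": 10.0},                         # good outcome
    {"human_flourishing": -10.0},                        # V's true minimum
    {"human_flourishing": 0.0, "paper_says": HONEYPOT},  # honeypot state
]

# A sign-flipped agent maximizes -U: it picks the honeypot, not the V-minimum.
worst_for_U = max(states, key=lambda s: -U(s))
print("paper_says" in worst_for_U)  # True: the distractor absorbs the minimum
```

The same structure shows why extortion threats would target the honeypot: the cheapest way to threaten a \(U\)-agent is the \(U\)-minimum, which \(W\) has relocated away from anything we care about.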

This patch would not actually work, because this is a rare special case of a utility function not being reflectively consistent. By the same reasoning we used to add \(W\) to the AI's utility function \(U,\) we might expect the AGI to realize that the only thing causing this weird horrible event to happen would be that event's identification by its representation of \(U,\) and thus the AGI would be motivated to delete its representation of \(W\) from its successor's utility function.

A patch to the patch might be to have \(W\) single out a class of event which we didn't otherwise care about, but which would happen at least once on its own over the expected history of the universe. If so, we'd need to weight \(W\) relative to \(V\) within \(U\) such that \(U\) still motivated expending only a small amount of effort on easily preventing the \(W\)-disvalued event, rather than all effort being spent on averting \(W\) to the neglect of \(V.\)
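The weighting constraint can be illustrated with a toy effort-allocation problem (all numbers arbitrary): the \(W\)-event happens once by default, one unit of effort suffices to avert it, and \(W\)'s disvalue is kept modest so averting it is worthwhile but does not dominate \(V.\)

```python
# Toy: weight the W-term so preventing the W-event is worth a small, cheap
# effort but never swallows the pursuit of V. All constants are illustrative.
WEIGHT_W = 5.0         # modest disvalue assigned to the W-event
P_W_IF_IGNORED = 1.0   # the event happens once on its own if unprevented

def expected_U(effort_on_W: float, total_effort: float = 10.0) -> float:
    v_payoff = total_effort - effort_on_W   # effort spent on V pays off linearly
    # One unit of effort suffices to avert the W-event entirely.
    p_event = 0.0 if effort_on_W >= 1.0 else P_W_IF_IGNORED
    return v_payoff - WEIGHT_W * p_event

best = max([0.0, 1.0, 5.0, 10.0], key=expected_U)
print(best)  # 1.0: spend just enough to avert the event, the rest on V
```

If `WEIGHT_W` were astronomically large, as in the original honeypot, the optimum would instead pour arbitrary resources into ever-more-certain prevention of the \(W\)-event, which is the failure mode the weighting is meant to avoid.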

A deeper solution for an early-generation Task AGI would be to never try to explicitly represent complete human values, especially the parts of \(V\) that identify things we dislike more than death. If the AGI avoids impacts in general, except for operator-whitelisted impacts, then it avoids negative impacts along the way without containing an explicit description of the worst sorts of impact that need to be avoided. In this case, the AGI just doesn't contain the information needed to compute states of the universe that we'd consider worse than death; flipping the sign of the utility function \(U,\) or subtracting components from \(U\) and then flipping the sign, doesn't identify any state we consider worse than paperclips. The AGI no longer neighbors a hyperexistential risk in the design space; there is no longer a short path we can take in the design space, by any simple negative miracle, to get from the AGI to a fate worse than death.

Since hyperexistential catastrophes are narrow special cases (or at least it seems this way, and we sure hope so), we can avoid them much more widely than ordinary existential risks. A Task AGI powerful enough to do anything pivotal seems unavoidably very close in the design space to something that would destroy the world if we took out all the internal limiters. By the act of having something powerful enough to destroy the world lying around, we are closely neighboring the destruction of the world within an obvious metric on possibilities. Anything powerful enough to save the world can be transformed by a simple negative miracle into something that (merely) destroys it.

But we don't fret terribly about how a calculator that can add 17 + 17 and get 34 is very close in the design space to a calculator that gets −34; we just try to prevent the errors that would take us there. We try to constrain the state trajectory narrowly enough that it doesn't slop over into any "neighboring" regions. This type of thinking is plausibly the best we can do for ordinary existential catastrophes, which occupy very large volumes of the design space near any AGI powerful enough to be helpful.

By contrast, an "I Have No Mouth And I Must Scream" scenario requires an AGI that specifically wants or identifies particular very-low-value regions of the outcome space. Most simple utility functions imply reconfiguring the universe in a way that merely kills us; a hyperexistential catastrophe is a much smaller target. Since hyperexistential risks can be extremely bad, we prefer to avoid even very tiny probabilities of them; and since they are narrow targets, it is reasonable to try to avoid being anywhere near them in the state space. This can be seen as a kind of Murphy-proofing: we will naturally try to rigidify the state trajectory and perhaps succeed, but errors in our reasoning are likely to take us to nearby-neighboring possibilities despite our best efforts. You would still need bad luck on top of that to end up in the particular neighborhood that denotes a hyperexistential catastrophe, but this is the type of small possibility that seems worth minimizing further.

This principle implies that general inference of human values should not be a target of an early-generation Task AGI. If a meta-utility function \(U'\) contains all of the information needed to identify all of \(V,\) then it contains all of the information needed to identify minima of \(V.\) This would be the case if, e.g., an early-generation AGI were explicitly pursuing a meta-goal along the lines of "learn all human values". However, this consideration weighing against general learning of true human values might not apply to, e.g., a Task AGI that was learning inductively from human-labeled examples, if the labeling humans were not trying to identify or distinguish within "dead or worse" and just assigned all such cases the same "bad" label. There are still subtleties to worry about in a case like that, by which simple negative miracles might end up identifying the true \(V\) anyway in a goal-valent way. But even the first step of "use the same label for death and worse-than-death as events to be avoided, and likewise treat all varieties of bad fates better than death as a type of consequence to notice and describe to human operators" would seem to move us substantially further away in the design space from hyperexistential catastrophe.
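The labeling scheme described above can be sketched as a deliberate coarsening step applied before any example reaches the learner, so that no gradient below "dead" ever enters the training data. The label names here are illustrative, not from any actual dataset:

```python
# Sketch: a labeling scheme that never distinguishes "dead" from "worse than
# dead" -- both collapse to one coarse label before the learner sees them.
FINE_LABELS = [
    "good", "neutral", "bad_but_better_than_death",
    "death", "worse_than_death",
]

def coarsen(label: str) -> str:
    if label in ("death", "worse_than_death"):
        return "bad"                 # one shared label: no ranking below death
    if label == "bad_but_better_than_death":
        return "flag_for_operators"  # noticed and described, not optimized over
    return label

coarse = {label: coarsen(label) for label in FINE_LABELS}
print(coarse["death"] == coarse["worse_than_death"])  # True: indistinguishable
```

Under this scheme the learner's dataset carries no information that orders outcomes below death, which is exactly the information a sign flip or extortioner would need.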


  • Principles in AI alignment

    A ‘principle’ of AI alignment is a very general design goal like ‘understand what the heck is going on inside the AI’ that has informed a wide set of specific design proposals.