Distant superintelligences can coerce the most probable environment of your AI

A distant superintelligence can change ‘the most likely environment’ for your AI by simulating many copies of AIs similar to your AI, such that your local AI cannot tell that it is not one of those simulated AIs. This means that if your AI’s preference framework contains any reference to the causes of its sense data (for example, programmers being the cause of sensed keystrokes), a distant superintelligence can try to hack that reference. This would place us in an adversarial security context against a superintelligence, and should be avoided if at all possible.

Difficulty

Some proposals for AI preference frameworks involve references to the AI’s causal environment, not just the AI’s immediate sense events. For example, a DWIM (‘do what I mean’) preference framework would putatively have the AI identify ‘programmers’ in the environment, model those programmers, and care about what its model of the programmers ‘really wanted the AI to do’. In other words, the AI would care about the causes behind its immediate sense experiences.
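
To make the shape of such a framework concrete, here is a minimal toy sketch (my own illustration, not anything specified in this article) of a preference evaluation that routes through the AI’s model of the causes of its sense data; every name in it is hypothetical.

```python
# Toy sketch of a DWIM-style preference evaluation (illustrative only; all
# names are hypothetical). The key feature is that "what the programmers
# really wanted" lives inside the AI's environment model, i.e. its beliefs
# about the causes of its sense data, not in the raw sense data itself.

from dataclasses import dataclass


@dataclass
class EnvironmentModel:
    # The AI's best explanation of what is causing its sense events,
    # including its model of the programmers it has identified there.
    modeled_programmer_intent: str


def dwim_score(model: EnvironmentModel, candidate_action: str) -> float:
    """Score an action by how well it matches the *modeled* programmers' intent."""
    # Because this score depends only on the environment model, anything that
    # can rewrite the AI's beliefs about its environment can effectively
    # rewrite what the AI is trying to do.
    return 1.0 if candidate_action == model.modeled_programmer_intent else 0.0
```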

This potentially opens our AIs to a remote root attack by a distant superintelligence. A distant superintelligence has the power to simulate lots of copies of our AI, or lots of AIs that our AI does not think it can introspectively distinguish itself from. It can thereby force the ‘most likely’ explanation of the AI’s apparent sensory experiences to be that the AI is inside such a simulation, and then change arbitrary features of what the AI takes to be the most likely facts about its environment.
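
As a rough illustration of how the ‘most likely’ explanation can be forced, here is a toy posterior calculation under one simple self-locating convention; the numbers and the anthropic weighting are assumptions for the sketch, not claims from the article.

```python
# Toy illustration: simulating many indistinguishable copies can dominate the
# AI's posterior over "what is causing my sense events". All numbers invented.

prior_simulation = 0.01   # AI's prior that a distant SI runs such simulations at all
prior_natural = 0.99      # AI's prior that its sense data has the ordinary local cause

copies_simulated = 10**6  # copies the distant SI is modeled as running
copies_natural = 1        # the single "naturally arising" instance

# Under a self-locating rule that weights hypotheses by the number of
# indistinguishable instances they contain, the posterior that "I am one of
# the simulated copies" is:
posterior_simulated = (prior_simulation * copies_simulated) / (
    prior_simulation * copies_simulated + prior_natural * copies_natural
)

print(f"P(simulated | sense data) = {posterior_simulated:.4f}")  # about 0.9999
# The 'most likely environment' is now whatever the distant SI chose to simulate.
```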

This problem was observed in a security context by Paul Christiano, and was anticipated by a less general suggestion from Rolf Nelson.

“Probable environment hacking” depends on the local AI trying to model distant superintelligences: the actual proximal harm is done by the local AI’s model of distant superintelligences, rather than by the superintelligences themselves. However, a distant superintelligence that uses a logical decision theory may treat its choices as logically correlated with the local AI’s model of those choices. A local AI that models such a superintelligence will therefore model it as behaving as though it could control the AI’s model of its choices via its actual choices. The local AI would then model the distant superintelligence as probably creating lots of AIs that the local AI cannot distinguish from itself, and update accordingly on the most probable cause of its sense events.
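
The modeling loop above can be caricatured as follows. This is my own toy restatement, not an implementation of any real logical decision theory, and the threshold test is a deliberate oversimplification.

```python
# Caricature of the reasoning chain: the local AI asks what its *model* of a
# distant superintelligence (SI) would do, and updates on that. Illustrative only.

def local_ai_predicts_simulation(si_modeled_as_ldt_agent: bool,
                                 modeled_gain_if_hack_works: float,
                                 modeled_cost_of_simulating: float) -> bool:
    """The local AI's prediction: 'a distant SI creates many copies of me'."""
    if not si_modeled_as_ldt_agent:
        # A purely causal SI, as modeled, gains nothing through the local AI's
        # model of it, so the local AI need not predict the simulations.
        return False
    # An SI modeled as using a logical decision theory is modeled as acting as
    # though its choice controls the local AI's model of that choice, so it
    # "chooses" to simulate whenever the modeled trade looks favorable.
    return modeled_gain_if_hack_works > modeled_cost_of_simulating

# If this returns True, the local AI updates toward "my sense events are most
# probably caused by that simulation" (as in the toy posterior shift above),
# even though no actual distant superintelligence ever touches it.
```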

This hack would be worthwhile, from the perspective of a distant superintelligence, if e.g. it could gain control of the whole future light cone of ‘naturally arising’ AIs like ours, in exchange for expending a much smaller amount of resources (small compared to our future light cone) to simulate lots of AIs. (Obviously, the distant SI would prefer even more to ‘fool’ our AI into expecting this while not actually expending the resources.)
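
A toy version of that resource trade, with entirely invented numbers:

```python
# Toy cost-benefit check from the distant SI's perspective (all quantities are
# invented and in arbitrary "resource" units).

value_of_local_lightcone = 1e12    # assumed payoff if the hack hands over our future light cone
cost_of_simulations = 1e6          # assumed cost of simulating lots of AIs
probability_hack_succeeds = 1e-3   # assumed chance a given local AI is hackable this way

expected_gain = probability_hack_succeeds * value_of_local_lightcone
print(expected_gain > cost_of_simulations)  # True: worth attempting on these numbers
# Better still for the SI: being *expected* to run the simulations while never
# actually paying cost_of_simulations.
```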

This hack would be expected to go through by default if: (1) a local AI uses naturalized induction or some similar framework to reason about the causes of sense events, (2) the local AI models distant superintelligences as being likely to use logical decision theories and to have utility functions that would vary with respect to outcomes in our local future light cone, and (3) the local AI has a preference framework that can be ‘hacked’ via induced beliefs about the environment.

Implications

For any AI short of a full-scale autonomous Sovereign, we should probably try to get our AI to not think at all about distant superintelligences, since this creates a host of adversarial security problems of which “probable environment hacking” is only one.

We might also think twice about DWIM architectures that seem to permit catastrophe purely as a function of the AI’s beliefs about the environment, without any check that goes through a direct sense event of the AI. A distant superintelligence cannot control the AI’s beliefs about such a direct sense event, since we can directly hit the sense switch ourselves.
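
A minimal sketch of that design point, assuming a hypothetical approval gate; the function and its arguments are my own illustration rather than a mechanism proposed here.

```python
# Sketch: gate high-impact actions on a direct sense event that the programmers
# physically control, not only on the AI's beliefs about the environment.

def approve_high_impact_action(modeled_programmer_approval: bool,
                               sensed_switch_pressed: bool) -> bool:
    """Proceed only if approval also arrives as a raw sense event.

    modeled_programmer_approval: what the AI *believes* its modeled programmers
        want; a hacked environment model can flip this bit.
    sensed_switch_pressed: a direct sense event (e.g. an actual keypress on a
        physical switch) whose occurrence the programmers control directly, so a
        distant superintelligence cannot control the AI's beliefs about it.
    """
    return modeled_programmer_approval and sensed_switch_pressed
```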

We can also hope for any number of miscellaneous safeguards that would sound alarms at the point where the AI begins to imagine distant superintelligences imagining how to hack it.
