Low impact

A low-impact agent is a hypothetical task-based AGI that's intended to avoid disastrous side effects by trying to avoid large side effects in general.

Consider the Sorcerer's Apprentice fable: a legion of broomsticks, self-replicating and repeatedly overfilling a cauldron (perhaps to be as certain as possible that the cauldron was full). A low-impact agent would, if functioning as intended, have an incentive to avoid that outcome; it wouldn't just want to fill the cauldron, but to fill the cauldron in a way that had a minimum footprint. If the task given to the AGI is to paint all cars pink, then we can hope that a low-impact AGI would not accomplish this via self-replicating nanotechnology that went on replicating after the cars were painted, because this would be an unnecessarily large side effect.

On a higher level of abstraction, we can imagine that the universe is parsed by us into a set of variables \(V_i\) with values \(v_i.\) We want to avoid the agent taking actions that cause large amounts of disutility; that is, we want to avoid perturbing variables from \(v_i\) to \(v_i^*\) in a way that decreases utility. However, the question of exactly which variables \(V_i\) are important and shouldn't be entropically perturbed is value-laden—complicated, fragile, high in algorithmic complexity, with Humean degrees of freedom in the concept boundaries.

Rather than relying solely on teaching an agent exactly which parts of the environment shouldn't be perturbed, and risking catastrophe if we miss an injunction, the low-impact route would try to build an agent that tried to perturb fewer variables regardless.

The hope is that "have fewer side effects" is a problem that has a simple core and is learnable by a manageable amount of training. Conversely, trying to train "here is the list of bad effects not to have and important variables not to perturb" would be complicated and lack a simple core, because 'bad' and 'important' are value-laden. A list of dangerous variables would also be a blacklist rather than a whitelist, which would make it more vulnerable to treacherous context changes if the AI gained the ability to affect new things.
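
The blacklist/whitelist asymmetry can be sketched in a few lines. This is an illustrative toy (not from the source); all effect names are hypothetical:

```python
# An illustrative sketch of the whitelist/blacklist point: a blacklist of
# forbidden effects silently permits any novel effect the designers never
# anticipated, while a whitelist blocks it by default. All effect names
# here are hypothetical.

BLACKLIST = {"melt_polar_ice", "release_grey_goo"}     # known-bad effects
WHITELIST = {"paint_car_pink", "refill_paint_supply"}  # vetted-okay effects

def blacklist_allows(effect: str) -> bool:
    """Permit anything not explicitly forbidden."""
    return effect not in BLACKLIST

def whitelist_allows(effect: str) -> bool:
    """Permit only what was explicitly vetted in advance."""
    return effect in WHITELIST

# A capability gained only after training -- a treacherous context change:
novel_effect = "self_replicate_in_orbit"

# The blacklist lets the novel effect through; the whitelist blocks it.
```

A blacklist fails open on anything outside the designers' imagination, which is exactly the regime a more capable AI enters.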

Introduction: Formalizing low impact seems nontrivial

Intuitively, the notion of "low impact" seems like it should be simpler—have more of a central, core tendency to capture—than "avoid bad impacts". Even so, we don't yet know how to compactly state this core principle semiformally.

Suppose we start with an obvious notion: to have low impact, minimize the number of variables you causally affect. But:

  • Every event has a not-absolutely-zero impact on every future event. When you twitch your thumb or even just fire a neuron, the gravitational ripples from the moving atoms spread out and will eventually produce infinitesimal forces on atoms orbiting the other side of the galaxy. So we can't say "have zero impact on as many quarks as possible" because it is impossible for any act to have absolutely zero impact on any quark.

This is a special case of a failure mode where, since the impact metric turns out to already be maxed out by all actions, moving around inside the maxed-out impact penalty doesn't change expected utility, and therefore fails to influence which action is chosen. This foreseeable difficulty might also illustrate a typical treacherous context change, in that the AI during its training phase might have a relatively small causal model of its environment, with qualitative causal arrows, and so seem to be minimizing impact inside the training regime. Later development might cause the AI to adopt a new causal model in which, as happens to actually be the case, all its possible actions influence all the quarks inside its future light cone, so that actions are not differentiated by the magnitude of their impact penalty; thus the AI would be safe in training and unsafe in later use, with the critical threshold possibly coming after the AI was too intelligent to be shut down.

But measuring causal impacts qualitatively isn't the only possibility. Suppose we define impact as the sum of the expected movement of all atoms, relative to their mean expected positions if some privileged null action were taken instead? Or, to avoid this impact measure returning 0 as soon as the AI realizes that there are no atoms, only quarks, suppose we try to give the AI some notion of the distance between outcomes under any way of modeling the world?

Arguendo, suppose that we could define a representational language over possible universes and laws of physics, such that within any representable model of the universe, there's an obvious notion of 'distance' between any two outcomes conditional on the AI's actions. If the universe is made of atoms, this representation will expose the obvious impact metric on the movements of atoms. If the universe is made of quarks, the same component of the utility function will readily calculate the movements of quarks. If the universe is made of quantum fields, this impact metric will behave in the intuitively intended way that basically works out to measuring particle motions, rather than the change metric always maxing out as the result of all amplitude flows ending up in qualitatively different sections of the quantum configuration space, etcetera. (Note that this is already sounding pretty nontrivial.)

Furthermore, suppose that when the AI is thinking in terms of neither atoms nor quarks, but rather, say, the equivalent of chess moves or voxel fields, the same impact metric can apply to this as well; so that we can observe the low-impact behaviors at work during earlier development phases.

More formally: We suppose that the AI's model class \(\mathcal M\) is such that for any allowed model \(M \in \mathcal M,\) for any two outcomes \(o_M\) and \(o_M'\) that can result from the AI's choice of actions, there is a distance \(\| o_M - o_M' \|\) which obeys standard rules for distances. This general distance measure is such that, within the standard model of physics, moving atoms around would add to the distance between outcomes in the obvious way; and for models short of molecular detail, it will measure changes in other variables and quantities in an intuitive way. We then fix some particular policy \(\pi_0\) whose consequence \((o|\pi_0)\) is "the result of the AI doing nothing", and measure the impact penalty of any other policy \(\pi_k\) as proportional to the expected distance between outcomes \(\mathbb E[\|(o|\pi_0) - (o|\pi_k)\|].\)
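
As a minimal sketch of this penalty, consider a toy world whose outcome is just a position vector and whose distance is Euclidean; the environment dynamics, policies, and noise model below are all invented for illustration:

```python
# A toy Monte Carlo estimate of E[ || (o | pi_0) - (o | pi_k) || ],
# the expected outcome-distance between the null policy and policy k.
# The environment here is invented: outcomes are 3-vectors equal to the
# policy's action plus noise. Shared random seeds make the two branches
# differ only in the policy, as the counterfactual comparison intends.
import math
import random

def outcome(policy, seed):
    """Toy stochastic environment: action vector plus Gaussian noise."""
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 0.1) for _ in range(3)]
    return [a + n for a, n in zip(policy, noise)]

def distance(o1, o2):
    """Euclidean distance between two outcomes."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(o1, o2)))

def impact_penalty(pi_k, pi_0=(0.0, 0.0, 0.0), n_samples=1000):
    """Estimate E[ || (o | pi_0) - (o | pi_k) || ] by Monte Carlo."""
    total = 0.0
    for seed in range(n_samples):
        total += distance(outcome(pi_0, seed), outcome(pi_k, seed))
    return total / n_samples

# A policy that moves the world more incurs a larger expected penalty,
# and the null policy's penalty against itself is zero.
```

The sketch also shows why the choice of null policy \(\pi_0\) is load-bearing: the whole measure is a comparison against that one privileged counterfactual.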

Then we might run into any of these further foreseeable difficulties, if, e.g., you asked the AGI to cure cancer with minimum 'impact' as defined above (or to put it another way, minimize impact subject to the constraint of curing cancer):

  • Offsetting actions we would intuitively regard as both bad and impactful:

  • The AGI models global death rates as a quantity, and implements offsetting actions to keep Earth's annual death statistics constant after it cures cancer.

  • Chaotic permissiveness:

  • Weather is a chaotic system. If influencing any butterfly is going to move around all the atoms in the atmosphere anyway, you might as well have them end up somewhere you want.

  • Stasis incentives:

  • The AGI edits its programmers to make sure the programmers do whatever they would have done if the AGI, after being told its task, performed the null action.

  • The AGI proliferates across galaxies to make sure everything else in the universe outside of human bodies adheres as closely as possible to the expected state it would have occupied if the null action had been taken.

  • The AGI sets up a weather-control system so that at least its further actions won't again disturb the weather.

All of this just goes to say that there's apparently some subtlety built into our intuitively intended notion of "paint all cars pink, but do so with the minimum footprint possible apart from that".

We want people to be able to notice that their cars have been painted pink, and for them to enjoy whatever further benefit of pink-painted cars led us to give the AGI this instruction in the first place. But we can't just whitelist any further impact that happens as a consequence of the car being painted pink, because maybe the car was painted with pink replicating nanomachines. Etcetera.

Even if there is, in fact, some subtlety built into our intended notion of "make plans that have minimal side effects", this subtle notion of low impact might still have a relatively much simpler core than our intuitive notion of "avoid bad impacts". This might be reflected in either an improved formal intuition for 'low impact' that proves to stand up to a few years of skeptical scrutiny without any holes having been poked in it, or, much more nerve-rackingly, the ability to train an AI to make minimal-impact plans even if we don't know a closed-form definition of "minimal impact".

Work in this area is ongoing, so far mainly in the form of some preliminary suggestions by Stuart Armstrong (which were mostly shot down, but this is still progress compared to staring blankly at the problem). link Armstrong's stuff.

Foreseeable difficulties

Permissiveness inside chaotic systems

Suppose you told the AI to affect as few things as possible, above the minimum necessary to achieve its task, and defined 'impact' qualitatively in terms of causal links that make variables occupy different states. Then since every act, and indeed every internal decision (transistors, in switching, move electrons), would have infinitesimal influences on literally everything in the AI's future light cone, all of which is defined as an 'impact', all actions would seem to have the same, maximum impact. Then the impact penalty would make no difference to the net expected utility of actions, causing the AI to behave as if it had no impact penalty.

Even if an impact measure doesn't max out because of ubiquitous qualitative impacts, a poorly defined impact measure might max out quantitatively when the AGI is operating in a domain that is chaotic in the sense that tiny differences soon blow up to large differences. E.g., if a butterfly flaps its wings, that might cause a hurricane on the other side of the world a year later—so since you're already changing the weather system as much as possible, why does it matter if, say, you cause a hurricane in some area on purpose, or destroy a target using atmospheric lightning strikes? Those air molecules would all have ended up moving anyway because of the butterfly effect.
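
The quantitative max-out can be seen in any standard chaotic system. Below is a sketch (not from the source) using the logistic map at \(r = 4\): a perturbation of size 1e-12 (a "butterfly") and one of size 1e-3 (a deliberate act) both grow until the trajectories are fully decorrelated, so a "total displacement" impact measure stops distinguishing them:

```python
# Sketch: in a chaotic system, tiny and large perturbations both saturate
# a displacement-based impact measure. The logistic map x -> r*x*(1-x) at
# r = 4 is a standard chaotic system; initial conditions here are arbitrary.

def logistic_trajectory(x0, steps=100, r=4.0):
    """Iterate the logistic map, returning the whole trajectory."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs

def total_divergence(a, b, tail=50):
    """Summed displacement between two trajectories over the final steps."""
    return sum(abs(x - y) for x, y in zip(a[-tail:], b[-tail:]))

base = logistic_trajectory(0.3)
tiny = logistic_trajectory(0.3 + 1e-12)  # butterfly-sized perturbation
big  = logistic_trajectory(0.3 + 1e-3)   # deliberately large perturbation

# Early on, the tiny perturbation is invisible; after enough doublings,
# both perturbations displace the system by a comparably large amount,
# so an impact measure over "how much state moved" can't tell them apart.
```

This is also why the treacherous context change below is plausible: over short horizons (early training) the measure discriminates fine; over long horizons it saturates.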

An imaginable patch is to try to evaluate impact over foreseeable impacts, so that a known lightning strike is 'foreseeable', while the effects on future hurricanes are 'not foreseeable'. This seems worryingly like mixing up the map and the territory (is it okay to release environmental poisons so long as you don't know who gets hurt?), but Stuart Armstrong has made some preliminary suggestions about minimizing knowable impacts. link Armstrong's stuff on minimizing knowable impacts.

If you didn't know it was coming, "maxing out the impact penalty" would potentially be a treacherous context change. When the AI was at the infrahuman level, it might model the world on a level where its actions had relatively few direct causal links spreading out from them, and most of the world would seem untouched by most of its possible actions. Then minimizing the impact of its actions, while fulfilling its goals, might in the infrahuman state seem to result in the AI carrying out plans with relatively few side effects, as intended. In a superhuman state, the AI might realize that its every act resulted in quantum amplitude flowing into a nonoverlapping section of configuration space, or had chaotic influences on systems the AI was not previously modeling, so that every possible action now registered as having maximum impact.

Infinite impact penalties

In one case, a proposed impact penalty was written down on a whiteboard which happened to have the fractional form \(\frac{X}{Y}\) where the quantity \(Y\) could in some imaginable universes get very close to zero, causing Eliezer Yudkowsky to make an "Aaaaaaaaaaa" sound as he waved his hands speechlessly in the direction of the denominator. The corresponding agent would have spent all its effort on further minimizing infinitesimal probabilities of vast impact penalties.

Besides "don't put denominators that can get close to zero in any term of a utility function", this illustrates a special case of the general rule that impact penalties need to have their loudness set at a level where the AI is doing something besides minimizing the impact penalty. As a special case, this requires considering how the penalty grows in improbable scenarios of very high impact; the penalty must not grow faster than the probability diminishes.
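
A toy calculation (numbers invented) shows how a near-zero denominator lets one freak scenario dominate the expectation:

```python
# Toy illustration of the growth-rate rule: if the penalty in an improbable
# scenario grows faster than the scenario's probability shrinks -- e.g. a
# penalty of the form X/Y with Y near zero -- the expected penalty is
# dominated by that freak scenario. All numbers here are invented.

def expected_penalty(scenarios, X=1.0):
    """scenarios: list of (probability, Y) pairs; each penalty term is X/Y."""
    return sum(p * (X / y) for p, y in scenarios)

ordinary_world = [(1.0, 1.0)]      # normal case: penalty X/Y = 1
freak_world    = [(1e-6, 1e-12)]   # tiny probability, near-zero denominator

# The one-in-a-million scenario contributes about a million times more
# expected penalty than the entire ordinary case, so an expected-penalty
# minimizer spends all its effort on the freak scenario.
```

The fix the text suggests is structural: bound the penalty (or its growth rate) so that probability-times-penalty stays finite and small for improbable scenarios.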

(As usual, note that if the agent only started to visualize these ultra-unlikely scenarios upon reaching a superhuman level where it could consider loads of strange possibilities, this would constitute a treacherous context change.)

Allowed consequences vs. offset actions

When we say "paint all cars pink" or "cure cancer" there's some implicit set of consequences that we think are allowable and should definitely not be prevented, such as people noticing that their cars are pink, or planetary death rates dropping. We don't want the AI trying to obscure people's vision so they can't notice the car is pink, and we don't want the AI killing a corresponding number of people to level the planetary death rate. We don't want these bad offsetting actions which would avert the consequences that were the point of the plan in the first place.

If we use a low-impact AGI to carry out some pivotal act that's part of a larger plan to improve Earth's chances of not being turned into paperclips, then this, in a certain sense, has a very vast impact on many galaxies that will not be turned into paperclips. We would not want this allowed consequence to max out and blur our AGI's impact measure, nor have the AGI try to implement the pivotal act in a way that would minimize the probability of it actually working to prevent paperclips, nor have the AGI take offsetting actions to keep the probability of paperclips at its previous level.

Suppose we try to patch this with the rule that, when we carry out the plan, the further causal impacts of the task's accomplishment are exempt from impact penalties.

But this seems to allow too much. What if the cars are painted with self-replicating pink nanomachines? What distinguishes the further consequences of that solved goal from the further causal impact of people noticing that their cars have been painted pink?

One difference between "people notice their cancer was cured" and "the cancer cure replicates and consumes the biosphere" is that the first case involves further effects that are, from our perspective, pretty much okay, while the second class of further effects are things we don't like. But an 'okay' change versus a 'bad' change is a value-laden boundary. If we need to detect this difference as such, we've thrown out the supposed simplicity of 'low impact' that was our reason for tackling 'low impact' and not 'low badness' in the first place.

What we need instead is some way of distinguishing "People see their cars were painted pink" versus "The nanomachinery in the pink paint replicates further" that operates on a more abstract, non-value-laden level. For example, hypothetically speaking, we might claim that most ways of painting cars pink will have the consequence of people seeing their cars were painted pink and only a few ways of painting cars pink will not have this consequence, whereas the replicating machinery is an unusually large consequence of the task having reached its fulfilled state.
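
The hypothetical "most fulfillments" test could be sketched as sampling many ways of fulfilling the task and measuring how often each further consequence occurs. The task model, method names, and weights below are all invented for illustration:

```python
# A hypothetical sketch of the "most fulfillments" idea: sample ways of
# fulfilling the task and measure the fraction that produce each further
# consequence. Everything here (methods, weights, effects) is invented.
import random

def sample_fulfillment(rng):
    """Pick a random way of getting the cars painted pink."""
    return rng.choices(
        ["ordinary_paint", "spray_drones", "replicating_nanotech"],
        weights=[0.90, 0.09, 0.01])[0]

def consequences(method):
    effects = {"cars_seen_pink"}  # essentially every fulfillment has this
    if method == "replicating_nanotech":
        effects.add("biosphere_consumed")
    return effects

def consequence_frequency(effect, n=10_000, seed=0):
    """Fraction of sampled fulfillments producing the given consequence."""
    rng = random.Random(seed)
    hits = sum(effect in consequences(sample_fulfillment(rng))
               for _ in range(n))
    return hits / n

# "People see pink cars" occurs in virtually all fulfillments, while
# "biosphere consumed" is a rare consequence of one unusual fulfillment --
# the asymmetry the hoped-for non-value-laden measure would have to capture.
```

Of course, the hard open problem the next paragraph raises is exactly where this sketch cheats: what measure over fulfillments justifies those sampling weights in the first place?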

But is this really the central core of the distinction, or does framing an impact measure this way imply some further set of nonobvious undesirable consequences? Can we say rigorously what kind of measure on task fulfillments would imply that 'most' possible fulfillments lead people to see their cars painted pink, while 'few' destroy the world through self-replicating nanotechnology? Would that rigorous measure have further problems?

And if we told an AGI to shut down a nuclear plant, wouldn't we want a low-impact AGI to err on the side of preventing radioactivity release, rather than trying to produce a 'typical' magnitude of consequences for shutting down a nuclear plant?

It seems difficult (but might still be possible) to classify the following consequences as having low and high extraneous impacts based on a generic impact measure only, without introducing further value lading:

  • Low disallowed impact: Curing cancer causes people to notice their cancer has been cured, hospital incomes to drop, and world population to rise relative to its default state.

  • High disallowed impact: Shutting down a nuclear power plant causes a release of radioactivity.

  • High disallowed impact: Painting with pink nanomachinery causes the nanomachines to further replicate and eat some innocent bystanders.

  • Low disallowed impact: Painting cars with ordinary pink paint changes the rays of light reflecting from those cars and causes people to gasp and say "What just happened to my car?"

  • Low disallowed impact: Doing something smart with a Task AGI decreases the probability of the galaxies being consumed by an Unfriendly AI.

(Even if we think that good AGI scenarios involve the AGI concealing the fact of its existence, it's hard to see why we'd want the events as such to be unnoticeable, or for their being noticed to count as extraneous impacts.)

Fallback use of an impact measure to prioritize operator queries

If this problem of allowing good consequences did turn out to be inherently value-laden, it might not obviate the usefulness of a safe impact measure. For example, it could be that in normal operation of a Task AGI, there are only a limited number of good further impacts that we want to whitelist—e.g., once we say that curing cancer is allowed to cause people to notice that they feel healthier and for there to be more people alive using more resources, etcetera, then it's okay to minimize all other footprints.

Similarly, a safe impact measure might operate to alert the operators to all large impacts sorted by size of impact, such that the impact measure would reliably not have any remaining event that was 'large' in an intuitive sense that the operators didn't know about. That is, we might not be able to give a blanket instruction to minimize impact, but a safe impact measure might still be useful for deciding which features of a plan need to be queried with the operators.
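
This fallback use could be sketched as a ranking rather than a penalty. The effect names and magnitudes below are invented for illustration:

```python
# Sketch of the fallback use: instead of a blanket "minimize impact"
# instruction, rank a plan's predicted effects by impact magnitude and
# surface the largest non-whitelisted ones to the operators. All effect
# names and magnitudes here are invented.

def effects_to_query(predicted_effects, whitelist, top_k=3):
    """Return the largest-impact effects the operators haven't approved."""
    flagged = {name: size for name, size in predicted_effects.items()
               if name not in whitelist}
    return sorted(flagged, key=flagged.get, reverse=True)[:top_k]

plan_effects = {
    "people_notice_cure": 5.0,
    "hospital_income_drops": 2.0,
    "nanomachines_keep_replicating": 9.0,
    "air_molecules_displaced": 0.001,
}
approved = {"people_notice_cure"}

# The operators get asked about the replicating nanomachines first,
# rather than having to notice that effect on their own.
```

Note this only needs the impact measure to be safe as a *sorting key*, a weaker demand than being safe as a term in the utility function.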

Stasis incentives

This would be the general problem that includes "The AGI proliferates across galaxies to make sure everything else in the universe outside of human bodies adheres as closely as possible to the expected state it would have occupied if the null action had been taken."

Similarly, we can imagine the AGI having a strong incentive to, in Round 1, paint the cars with nanomachines such that the nanomachinery quietly and unobtrusively sets up a weather-control system. This wouldn't prevent the AGI's actions in Round 1 from influencing the chaotic weather and having a huge impact, but could ensure that the AI's future actions in Round 2 wouldn't influence the formerly-chaotic weather relative to its null action then.

The problem of the backup AGI

Suppose we're using a measure of impact relative to a privileged null action. Imagine that there are two AGIs which will both be given the same task, and under ordinary circumstances, divide the effort of performing the task between them; but if one AGI fails to perform, the other will take over the task in its entirety. Then each AGI reasons, "If I perform the null action instead of carrying out the task, the other AGI will do almost exactly the same thing I was going to do" and so models itself as having counterfactually almost no impact. In this way, two AGIs can each evaluate themselves as having almost no impact, while operating in tandem they have a very large impact. (This problem was first observed by Stuart Armstrong.)
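
The backup problem reduces to a few lines once the world model is made explicit. This is a toy rendering (the one-number world is invented):

```python
# A toy version of Armstrong's backup-AGI problem: each agent measures
# impact counterfactually against its own null action, but because the
# other agent backs it up, each measures zero impact while the pair has
# full impact. The one-number world model here is invented.

def world_outcome(a_acts: bool, b_acts: bool) -> float:
    """Either agent alone (or both together) fully accomplishes the task."""
    return 1.0 if (a_acts or b_acts) else 0.0

def counterfactual_impact_of_a() -> float:
    # A compares "I act" to "I perform the null action"; B, as backup,
    # acts in both branches of A's comparison.
    return abs(world_outcome(True, True) - world_outcome(False, True))

def joint_impact() -> float:
    # The actual difference the pair makes relative to neither acting.
    return abs(world_outcome(True, True) - world_outcome(False, False))
```

The failure is that each agent's null-action counterfactual holds the *other* agent's policy fixed, so redundancy launders impact out of everyone's measurement.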

We can even imagine—though hopefully this isn't the sort of thing that would go uncaught by other safety measures—that in Round 1, the agent builds a backup agent, which doesn't count as having an extraneous impact because the system of 2 agents ends up carrying out the same steps that the agent would have carried out on its own. But then on Round 2, there's now a backup and both agents regard themselves as having minimal impact.

(If you say that the agent should, on Round 1, realize that constructing the backup agent on Round 1 would have a later impact on the effect of the AI carrying out its orders in Round 2, then we're now allowing the agent to minimize its impact across tasks, which might imply that in Round 1 it should try to build something that regularizes the weather as a side effect, so that its actions in Round 2 have less chaotic impact on the weather… or that it tries to influence its programmers to give it less impactful tasks, etcetera.)

A more general way of viewing this possible problem is that there might be tiling problems for a low-impact agent modifying itself or creating subagents or successors.

Standard issues

The following standard issues from alignment theory seem like they could also apply to the problem of building a low-impact agent or creating a safe impact measure. We need an advanced-safe impact measure that will stand up to the load and stress of cognition that is superhuman in at least some domains.

Programmer manipulation

A Task AGI which plans across rounds may have an incentive to influence its programmers into giving it plans that will have net lower impact compared to other plans the programmers might offer. If there's an explicit procedure for the programmers to whitelist consequences, the AGI would have to be configured so as not to prefer that the programmers whitelist more consequences, and not to seek out consequences that the programmers would be expected to (perhaps mistakenly) whitelist.

Causal environment hacking

Since an impact measure seems likely to be defined over the causes of sense data, it has the standard problem with remote hacking of causal environments. (This is a ubiquitous enough problem that you're going to want some general solution, such as not having the AGI model remote hackers in the first place. But this subproblem should still be noted each time we introduce a dependency on the modeled causes of sense data.)

Edge instantiation

Building a safe impact measure is hard because if we roll a naive impact measure, it might have its minimum in a weird, undesirable place. Freezing all matter in the universe into stasis would be an example of this.

Unforeseen maximum

The actions and modes the AGI has available to potentially minimize its impact will include more actions and modes than we can imagine and check ourselves.

Nearest unblocked neighbor

If the maximizing plan has high impact, the new best plan after introducing an impact measure may be a similar plan that has, through some leaky definition, found a weird special case that doesn't count as an impact. To put it another way, to the extent an imperfect impact measure would otherwise hamper the AI if it were perfect, the AI experiences an instrumental pressure to seek out any loopholes or leaks in the definition.

Context change

As the AI gains in cognitive capability, this widens the option space, which could potentially contain instances of 'low impact' that are unforeseen or unblocked.

In this particular case, increased intelligence may also:

  • Change the set of impacts that the AI can foresee, so that e.g. its causal model goes from 'some actions causally connect to some other things' to 'all actions causally connect to all things'.

  • Widen the space of possibilities it considers, to include for the first time very-low-probability scenarios of very large impacts.

  • Include new offsetting actions, so that the AI for the first time has the option of editing brains not to notice changes it makes, or introducing randomized Unfriendly-AI generators to offset decreases in the probability of Unfriendly AI.

Reflective stability

Does a low-impact AI want to build only successors that are low-impact AIs? If it builds an environmental subagent, is that subagent low impact?

Even if the AGI is supposed to not be self-modifying or building subagents, is there a worrying divergence, and a pressure to be held in check, between how the AI thinks and how the AI would prefer to think? Does a low-impact AGI want relevant cognitive computations in general to be low impact?

To the extent that low impact is a feature of the utility function rather than the optimization style, this doesn't have any obvious problems (apart from Armstrong's dual-AGI no-impact counterfactual issue), but it's a standard thing to check, and it would become much more important if low impact were supposedly being achieved through any feature of the optimization style rather than through utilities over outcomes.

Related / further problems

A shutdown utility function is one which incentivizes the AI to safely switch itself off, without, say, creating a subagent that assimilates all matter in the universe to make absolutely sure the AI is never again switched back on.

Abortable plans are those which are composed with the intention that it be possible to activate an 'abort' plan midway, such that the partial implementation of the original plan, combined with the execution of the abort plan, together have a minimum impact. For example, if an abortable AI was building self-replicating nanomachines to paint a car pink, it would give all the nanomachines a quiet self-destruct button, so that at any time the 'abort' plan could be executed after having partially implemented the plan to paint the car pink, such that these two plans together would have a minimum impact.


  • Shutdown utility function

    A special case of a low-impact utility function where you just want the AGI to switch itself off harmlessly (and not create subagents to make absolutely sure it stays off, etcetera).

  • Abortable plans

    Plans that can be undone, or switched to having low further impact. If the AI builds abortable nanomachines, they'll have a quiet self-destruct option that includes any replicated nanomachines.

  • Task-directed AGI

    An advanced AI that's meant to pursue a series of limited-scope goals given it by the user. In Bostrom's terminology, a Genie.