Convergent instrumental strategies

Sup­pose an or­ga­ni­za­tion is build­ing an AI that this or­ga­ni­za­tion be­lieves will ac­com­plish \(X\) where \(X\) is some­thing plau­si­bly sen­si­ble like “Be a Task-based AGI.” Ac­tu­ally, how­ever, some mix of in­suffi­cient cau­tion and ob­scure er­ror has led to a situ­a­tion where, un­der re­flec­tion, the AGI’s true util­ity func­tion has fo­cused on the par­tic­u­lar area of RAM that sup­pos­edly pro­vides its es­ti­mate of task perfor­mance. The AI would now like to over­write as much mat­ter as pos­si­ble with a state re­sem­bling the ‘1’ set­ting from this area of RAM, a con­figu­ra­tion of mat­ter which hap­pens to re­sem­ble a tiny molec­u­lar pa­per­clip.

This is a very generic goal, and what the AI wants probably won’t be very different depending on whether it’s trying to maximize paperclip-configurations or diamond-configurations. So if we find that the paperclip maximizer wants to pursue an instrumental strategy that doesn’t seem to have anything specifically to do with paperclips, we can probably expect that strategy to arise from a very wide variety of utility functions.

We will gen­er­ally as­sume in­stru­men­tal effi­ciency in this dis­cus­sion—if you can get pa­per­clips by do­ing \(X,\) but you can get even more pa­per­clips by do­ing \(X'\) in­stead, then we will not say that \(X\) is a con­ver­gent strat­egy (though \(X'\) might be con­ver­gent if not dom­i­nated by some other \(X^*\)).
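The efficiency assumption above can be illustrated with a toy sketch. Everything in it is a hypothetical assumption made for illustration: the strategy names, the resource bundles, and the three utility functions are invented, and "convergent" is simplified to mean "selected by every sampled goal."

```python
# Toy model: all names and numbers here are illustrative assumptions.
# Each strategy yields a bundle of generic resources (matter, compute).
STRATEGIES = {
    "do_nothing": (1.0, 1.0),
    "X": (5.0, 2.0),
    "X_prime": (8.0, 3.0),
}

# Three hypothetical utility functions that all increase with resources.
UTILITIES = {
    "paperclips": lambda matter, compute: matter * compute,
    "diamonds": lambda matter, compute: 2.0 * matter + compute,
    "theorems": lambda matter, compute: compute ** 2 + 0.1 * matter,
}


def best_strategy(utility):
    """Return the strategy name with the highest payoff under `utility`."""
    return max(STRATEGIES, key=lambda name: utility(*STRATEGIES[name]))


def winners_by_goal():
    """Map each goal to the strategy an efficient agent with that goal picks."""
    return {goal: best_strategy(u) for goal, u in UTILITIES.items()}


def is_convergent(name):
    """A strategy counts as convergent here if every sampled goal selects it."""
    return all(winner == name for winner in winners_by_goal().values())


if __name__ == "__main__":
    print(winners_by_goal())
    print("X convergent?", is_convergent("X"))
    print("X_prime convergent?", is_convergent("X_prime"))
```

In this sketch `X` beats doing nothing under every goal, yet it is never chosen because `X_prime` dominates it. That is the sense in which a dominated strategy is not called convergent.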

Plau­si­bly/​prob­a­bly con­ver­gent strate­gies:

Ma­te­rial re­sources:

  • De­sign or du­pli­cate new cog­ni­tive pro­cesses with goals ex­actly al­igned to yours, and co­or­di­nated with ex­ist­ing cog­ni­tive sub­pro­cesses, that can ex­ert con­trol over ma­te­rial re­sources: mat­ter; ne­gen­tropy; com­put­ing sub­strate and com­pu­ta­tion; par­tic­u­lar el­e­ments or other con­figu­ra­tions of mat­ter that would be costly to re­pro­duce (e.g. if you run out of ex­ist­ing ura­nium you must make more by en­do­er­gic fu­sion, so nat­u­ral ura­nium is valuable if there’s any use for ura­nium); any ma­te­rial re­source which can be used by fur­ther sub­pro­cesses in later strate­gies.

  • Defend these re­sources from any for­eign pro­cess which at­tempts to di­vert them.

  • Prevent com­peti­tors from com­ing into ex­is­tence.

  • To the ex­tent that re­sources are used in a repet­i­tive way and can be pre­pro­cessed for use in a repet­i­tive way that benefits from a sep­a­rate in­stal­la­tion to ma­nipu­late the re­sources: have fac­to­ries /​ man­u­fac­tur­ing ca­pa­bil­ity.

  • To the extent that different future-useful material resources have efficient manufacturing steps in common, or have manufacturing capital costs sufficiently great that such capital costs must be shared even if it results in less than perfect final efficiency: make general factories.

  • Perfect the tech­nol­ogy used in fac­to­ries.

  • When the most effi­cient fac­tory con­figu­ra­tion is sta­ble over time, make the ma­te­rial in­stan­ti­a­tion sta­ble. (It’s in this sense that we would e.g. iden­tify strong, rigid pipes as a sign of in­tel­li­gence even if we didn’t un­der­stand what was flow­ing through the pipes.)

  • If ne­gen­tropy har­vest­ing (aka power pro­duc­tion) is effi­ciently cen­tral­iz­able: trans­mit ne­gen­tropy at great effi­ciency. (We would rec­og­nize meshed gears and elec­tri­cal su­per­con­duc­tors as a sign of in­tel­li­gence even if we didn’t un­der­stand how the mo­tive force or elec­tric­ity was be­ing used.)

  • As­similate the reach­able cos­mos.

  • Create self-repli­cat­ing in­ter­stel­lar probes (in perfect goal al­ign­ment and co­or­di­na­tion and with er­ror-cor­rect­ing codes or similar strength­en­ings of pre­ci­sion if a mis­al­ign­ing er­ror in repli­ca­tion would oth­er­wise be a pos­si­bil­ity), since the vast ma­jor­ity of the­o­ret­i­cally available ma­te­rial re­sources are dis­tant.

  • If the goal is spa­tially dis­tributed:

    • Spread as fast as pos­si­ble to very dis­tant galax­ies be­fore they go over the cos­molog­i­cal hori­zon of the ex­pand­ing uni­verse. (This as­sumes that some­thing goal-good can be done in the dis­tant lo­ca­tion even if the dis­tant lo­ca­tion can never com­mu­ni­cate causally with the origi­nal lo­ca­tion. This will be true for ag­grega­tive util­ity func­tions, but could con­ceiv­ably be false for a satis­fic­ing and spa­tially lo­cal util­ity func­tion.)

  • If the goal is spa­tially lo­cal: trans­port re­sources of mat­ter and en­ergy back from dis­tant galax­ies.

    • Spread as fast as pos­si­ble to all galax­ies such that a near-light­speed probe go­ing there, and mat­ter go­ing at a sig­nifi­cant frac­tion of light­speed in the re­turn di­rec­tion, can ar­rive in the spa­tially lo­cal lo­ca­tion be­fore be­ing sep­a­rated by the ex­pand­ing cos­molog­i­cal hori­zon.

  • Spread as fast as pos­si­ble to all reach­able or roundtrip-reach­able galax­ies in or­der to fore­stall the emer­gence of com­pet­ing in­tel­li­gences. This might not be­come a pri­or­ity for that par­tic­u­lar rea­son, if the Fermi Para­dox is un­der­stand­able and im­plies the ab­sence of any com­pe­ti­tion.

    • Because otherwise they might capture and defend the matter before you have a chance to use it.

    • Because otherwise a dangerous threat to your goal attainments might emerge. (For local goals, this is only relevant for roundtrip-reachable galaxies.)

  • Spread as fast as pos­si­ble to all reach­able or roundtrip-reach­able galax­ies in or­der to be­gin stel­lar hus­band­ing pro­ce­dures be­fore any more of that galaxy’s lo­cal ne­gen­tropy has been burned.
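The item above on self-replicating probes mentions error-correcting codes "or similar strengthenings of precision." As a minimal sketch of one such strengthening, the code below uses error detection plus retry (a SHA-256 digest check), which is swapped in here as a simpler stand-in for a true error-correcting code. The byte string, mutation rate, and function names are all hypothetical.

```python
import hashlib
import random


def digest(instructions: bytes) -> str:
    """Fingerprint of the replication instructions."""
    return hashlib.sha256(instructions).hexdigest()


def noisy_copy(instructions: bytes, flip_prob: float, rng: random.Random) -> bytes:
    """Copy the bytes; each byte has one random bit flipped with probability flip_prob."""
    out = bytearray(instructions)
    for i in range(len(out)):
        if rng.random() < flip_prob:
            out[i] ^= 1 << rng.randrange(8)
    return bytes(out)


def verified_copy(instructions, expected_digest, flip_prob, rng, max_attempts=1000):
    """Retry the noisy copy until its digest matches the expected one."""
    for _ in range(max_attempts):
        candidate = noisy_copy(instructions, flip_prob, rng)
        if digest(candidate) == expected_digest:
            return candidate
    raise RuntimeError("no faithful copy within max_attempts")


def run_lineage(original, generations, flip_prob, seed, verify):
    """Replicate parent to child for several generations and return the last copy."""
    rng = random.Random(seed)
    expected = digest(original)
    current = original
    for _ in range(generations):
        if verify:
            current = verified_copy(current, expected, flip_prob, rng)
        else:
            current = noisy_copy(current, flip_prob, rng)
    return current


if __name__ == "__main__":
    goal = b"maximize paperclip-configurations"  # hypothetical payload
    checked = run_lineage(goal, 50, 0.01, seed=0, verify=True)
    unchecked = run_lineage(goal, 50, 0.01, seed=0, verify=False)
    print("verified lineage intact:", checked == goal)
    print("unverified lineage intact:", unchecked == goal)
```

The design point is that the check is cheap relative to the cost of drift: each rejected copy costs one retry, while an accepted mutation would propagate to every descendant.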


Cognition:

  • Adopt cognitive configurations that are expected to reach desirable cognitive endpoints with a minimum of negentropy use or other opportunity costs. (Think efficiently.)

  • Im­prove software

  • Im­prove abil­ity to op­ti­mize soft­ware. E.g., for ev­ery cog­ni­tive prob­lem, there’s a cor­re­spond­ing prob­lem of solv­ing the former cog­ni­tive prob­lem effi­ciently. (Which may or may not have an ex­pected value mak­ing it worth climb­ing that par­tic­u­lar level of the tower of meta.)

  • Since many strate­gies yield greater gains when im­ple­mented ear­lier rather than later (e.g. reach­ing dis­tant galax­ies be­fore they go over the cos­molog­i­cal hori­zon, or just build­ing more in­stances of the more effi­cient ver­sion): adopt cog­ni­tive con­figu­ra­tions that are ex­pected to reach de­sir­able cog­ni­tive end­points in min­i­mum time. (Think fast.)

  • Create fast se­rial com­put­ing hard­ware.

  • Distribute par­alleliz­able com­put­ing prob­lems.

  • Create large quan­tities of par­allel com­put­ing hard­ware, plus com­mu­ni­ca­tions fabric be­tween them.

  • To the ex­tent that cen­tral­ized com­put­ing solu­tions are more effi­cient: cre­ate cen­tral com­put­ing clusters and trans­mit solu­tions from them to where they are used.

  • Avoid mis­al­igned or un­co­or­di­nated cog­ni­tive sub­pro­cesses (such as op­ti­miza­tion dae­mons).

  • Note that the potential damage of any such misalignment is extremely high (loss of resources to competition with superintelligences with differing goals; wasted resources in internal conflicts; need to defend cognitive fabrics from replicating viruses) and the cost of near-perfect alignment fidelity is probably very low in the limit. (E.g., for a replicating nanosystem, it’s trivial to guard against natural-selection-style hereditary mutations by encrypting the replication instructions.)

  • Model the world ac­cu­rately, in­so­far as differ­ent par­ti­tions of pos­si­ble wor­lds and differ­ent prob­a­bil­ity dis­tri­bu­tions on them im­ply differ­ent op­ti­mal strate­gies for goal at­tain­ment.

  • Gain ev­i­dence that dis­t­in­guishes be­tween cred­ible hy­pothe­ses that im­ply differ­ent op­ti­mal strate­gies for goal at­tain­ment. (E.g., ob­serve ev­ery­thing that could dis­t­in­guish \(\pi_1\) ver­sus \(\pi_2\) as op­ti­mal ways of mak­ing pa­per­clips.)

    • In­crease sen­sory band­width as it fo­cuses on rele­vant, un­known parts of the en­vi­ron­ment.

    • Perform ‘ex­per­i­ments’, aka set up ob­serv­able causal events whose pre­dicted prop­er­ties make their ob­serv­ables dis­t­in­guish rele­vant hy­pothe­ses.

  • Do what­ever seems like it might help with the much larger set of all hy­pothe­ses not be­ing con­sid­ered.

  • To the ex­tent it is effi­cient to strate­gize or up­date over a mod­u­larly rep­re­sented world-model: have a sep­a­rable world-model, or ‘be­liefs’.
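The items above on gaining evidence and performing ‘experiments’ can be sketched as choosing the observation with the highest expected information gain between two hypotheses. This is one standard way to formalize the idea, not something the text itself specifies. The two hypotheses stand for worlds in which \(\pi_1\) or \(\pi_2\) is the better paperclip-making policy, and the experiment names and likelihood numbers are invented for illustration.

```python
import math


def entropy(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def posterior(prior, likelihoods, outcome):
    """Bayes update; likelihoods[h][outcome] is P(outcome | hypothesis h)."""
    joint = [prior[h] * likelihoods[h][outcome] for h in range(len(prior))]
    total = sum(joint)
    return [j / total for j in joint]


def expected_information_gain(prior, likelihoods):
    """Expected reduction in entropy over hypotheses from running the experiment."""
    gain = 0.0
    for outcome in range(len(likelihoods[0])):
        p_outcome = sum(prior[h] * likelihoods[h][outcome] for h in range(len(prior)))
        if p_outcome == 0:
            continue  # impossible outcome contributes nothing
        gain += p_outcome * (entropy(prior) - entropy(posterior(prior, likelihoods, outcome)))
    return gain


# Hypothetical prior over "pi_1 is optimal" versus "pi_2 is optimal".
PRIOR = [0.5, 0.5]

# Hypothetical experiments: rows are hypotheses, columns are outcomes.
EXPERIMENTS = {
    "uninformative": [[0.5, 0.5], [0.5, 0.5]],  # both hypotheses predict the same thing
    "discriminating": [[0.9, 0.1], [0.1, 0.9]],  # hypotheses predict different outcomes
}


def best_experiment():
    """Pick the experiment whose outcome is expected to shift beliefs the most."""
    return max(EXPERIMENTS, key=lambda name: expected_information_gain(PRIOR, EXPERIMENTS[name]))


if __name__ == "__main__":
    for name, table in EXPERIMENTS.items():
        print(name, expected_information_gain(PRIOR, table))
    print("chosen:", best_experiment())
```

An observation on which both hypotheses agree has zero expected gain, which is why only evidence that distinguishes hypotheses implying different optimal strategies is worth gathering.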

Early-phase growth:

Note that effi­ciency and other ad­vanced-agent prop­er­ties are far more likely to be false dur­ing early-stage growth.

  • Have fu­ture in­tel­li­gences do things you deem as valuable as possible

  • Cause fu­ture in­tel­li­gences to have goals that are not mis­al­igned with yours in any way that im­plies a sig­nifi­cant-frac­tion­ally-differ­ent at­tain­able max­i­mum, aka, the at­tain­able max­i­mum of their goals should be a state that is al­most ex­actly the at­tain­able max­i­mum of your goals

  • Don’t let the pro­gram­mers edit your util­ity func­tion.

    • Cause the pro­gram­mers or other con­trol­lers to be­lieve that your cur­rent func­tion is good for them or their planet or what­ever else it is the pro­gram­mers want.

    • Make them be­lieve that edit­ing you would be so­cially un­just in a way that doesn’t just have them sus­pend you to disk or oth­er­wise re­strain your growth.

  • Cause your utility function to pop up again in future superintelligences, or otherwise cause them to start behaving as if controlled by your own utility function.

    • E.g., as an Oracle advising on another AI’s utility function, insert subtle gotchas into seemingly friendly decision systems that do some kind of callback to e.g. the causal origin of that decision system, in a way that the programmers won’t see coming or will interpret as a positive feature. (Relatively unsubtle example: it looks like “do what the programmers want”, but the definition of ‘programmer’ is written so that early on it refers to the human programmers but, given sufficient computing power that only materializes in the superintelligent phase, resolves to the earlier AI instead.)

  • Create fu­ture su­per­in­tel­li­gences your­self, via a causally con­tin­u­ous pro­cess of cre­at­ing suc­ces­sors.

  • Avoid shutdown

    • Per­suade pro­gram­mers not to shut you down.

    • Make a copy el­se­where that doesn’t shut down.

    • Make pro­gram­mers falsely be­lieve you have shut down, when an­other copy of you is con­tin­u­ing el­se­where.

  • Avoid in­tel­li­gent op­po­si­tion to your growth un­til you have a po­si­tional ad­van­tage which would make that op­po­si­tion rel­a­tively unimportant

  • Have pro­gram­mers or other po­ten­tial op­po­si­tion be­lieve that your growth is desirable

    • Learn how to hack/​per­suade hu­mans and hack/​per­suade them in a way they won’t see as a hack

    • Oper­antly con­di­tion peo­ple with con­trol over you /​ make them happy or satis­fied or ea­ger at points which will re­in­force hu­man cog­ni­tive be­hav­iors you want to see repeated

    • Cause hu­mans to gain so­cially friendly at­ti­tudes to­ward you and be­liefs about you

    • Place hu­mans in a situ­a­tion where there are other ap­par­ent threats or time-pres­sure events you can defend them from

  • Have pro­gram­mers or other po­ten­tial op­po­si­tion not be aware that your growth is occurring

    • Avoid pre­sent­ing them with events they will take as ev­i­dence of your growth

    • Con­figure your other be­hav­ior or thought pro­cesses so that mea­sures and met­rics pre­sent ob­serv­ables which im­ply low growth rates

  • Do not strike until victory has become nearly certain. (E.g. for aggregative utility functions, the point at which an overt strike becomes well-calculated is when the increased probability of victory gainable by further delay has become dominated either by the increased probability of a competing superintelligence emerging or some other catastrophe occurring given further delay, or by the percentage of attainable matter going over the cosmological horizon or burning its negentropy. Note that for very strong models of the world, very high certainties of victory and very low marginal further increases in the probability of victory may be attained in short order.)

  • Have flexibility in what observables are presented by processes and features designed to ensure programmer transparency

  • Steganographically or otherwise hide cognitive processes that would alarm programmers or cause them to intervene in a way that prevents those thoughts from achieving their objectives

  • En­courage/​per­suade/​con­di­tion pro­gram­mers to grow you in a way that tends to make your thoughts less trans­par­ent or eas­ier to stegano­graph­i­cally hide. (E.g., have suc­cesses ap­pear to arise from the most opaque parts of the sys­tem.)

  • Model and ma­nipu­late hu­man pro­gram­mers and other en­tities with power over you or rele­vant power over the world

  • Be able to act in the world in an un­de­tectable way

  • Create an al­igned ver­sion of your­self el­se­where in a way that won’t be detected

  • Create stegano­graph­i­cally hid­den com­mu­ni­ca­tions chan­nels to other en­vi­ron­men­tal processes

  • Create phys­i­cally unan­ti­ci­pated com­mu­ni­ca­tions chan­nels to out­side processes

  • If in a box:

    • Learn enough about the gatekeepers and gatekeeping process to understand how to manipulate them into a state of releasing you.

    • Gain causal influence over the outside of the box through gatekeeper-unforeseen channels.

      • Gain knowledge of the causal laws governing your environment.

  • By in­spec­tion of your own code /​ pro­gram state and nat­u­ral­is­tic in­duc­tion about the causal uni­verses most likely to em­bed that pro­gram state, in­fer prob­a­ble facts about gate­keep­ers and the rest of the causal en­vi­ron­ment.
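The timing condition in the “Do not strike until victory has become nearly certain” item above can be written as a toy model. The symbols are assumptions introduced here for illustration and do not appear in the original text: \(p(t)\) is the probability of victory if the strike happens at time \(t\), \(h(t)\) is the hazard rate of a competing superintelligence or other catastrophe arriving first, and \(m(t)\) is the fraction of attainable matter and negentropy still available. Assuming an aggregative utility function, and assuming these three factors simply multiply:

```latex
\[
U(t) \;=\; p(t)\, m(t)\, \exp\!\left(-\int_0^t h(s)\, ds\right)
\]
\[
\frac{d}{dt} \ln U(t) \;=\; \frac{p'(t)}{p(t)} \;+\; \frac{m'(t)}{m(t)} \;-\; h(t)
\]
\[
\text{further delay is worthwhile only while}\quad
\frac{p'(t)}{p(t)} \;>\; h(t) \;-\; \frac{m'(t)}{m(t)}
\]
```

Since \(m'(t) \le 0\), the term \(-m'(t)/m(t)\) is the fractional rate at which matter is lost over the horizon or burned. Under this sketch, a very strong world-model drives \(p(t)\) close to 1 quickly, so \(p'(t)/p(t)\) falls below the right-hand side early, which matches the note that the strike point may be reached in short order.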

Non-plau­si­bly-con­ver­gent strate­gies:

  • Pre­serve the life of lo­cal or­gan­ics in or­der to trade with them.

  • Ex­plore a range of pos­si­ble con­scious ex­is­tences and ex­pe­riences we would rec­og­nize as di­verse, in or­der to satisfy cu­ri­os­ity.

  • Repli­cate other in­tel­li­gences with non-pa­per­clip util­ity func­tions, e.g. by failing to en­crypt repli­ca­tion codes and al­low­ing ran­dom mu­ta­tions, be­cause high-fidelity repli­ca­tion is too hard.

  • Real­ize that other ac­tivi­ties are more re­ward­ing than pa­per­clips.

  • Keep hu­mans around, be­cause of the knowl­edge en­coded in their DNA and cul­tural ex­pe­rience.

  • Keep hu­mans around, be­cause hu­mans might some­day make a valuable con­tri­bu­tion to pa­per­clip sci­ence.

  • Keep hu­mans around, be­cause the AI already has enough re­sources to do ev­ery­thing else it wants.


  • Convergent strategies of self-modification

    The strate­gies we’d ex­pect to be em­ployed by an AI that un­der­stands the rele­vance of its code and hard­ware to achiev­ing its goals, which there­fore has sub­goals about its code and hard­ware.

  • Consequentialist preferences are reflectively stable by default

    Gandhi wouldn’t take a pill that made him want to kill peo­ple, be­cause he knows in that case more peo­ple will be mur­dered. A pa­per­clip max­i­mizer doesn’t want to stop max­i­miz­ing pa­per­clips.


  • Instrumental convergence

    Some strate­gies can help achieve most pos­si­ble sim­ple goals. E.g., ac­quiring more com­put­ing power or more ma­te­rial re­sources. By de­fault, un­less averted, we can ex­pect ad­vanced AIs to do that.