Instrumental convergence

Alter­na­tive introductions

In­tro­duc­tion: A ma­chine of un­known purpose

Sup­pose you landed on a dis­tant planet and found a struc­ture of gi­ant metal pipes, crossed by oc­ca­sional ca­bles. Fur­ther in­ves­ti­ga­tion shows that the ca­bles are elec­tri­cal su­per­con­duc­tors car­ry­ing high-voltage cur­rents.

You might not know what the huge struc­ture did. But you would nonethe­less guess that this huge struc­ture had been built by some in­tel­li­gence, rather than be­ing a nat­u­rally-oc­cur­ring min­eral for­ma­tion—that there were aliens who built the struc­ture for some pur­pose.

Your rea­son­ing might go some­thing like this: “Well, I don’t know if the aliens were try­ing to man­u­fac­ture cars, or build com­put­ers, or what. But if you con­sider the prob­lem of effi­cient man­u­fac­tur­ing, it might in­volve min­ing re­sources in one place and then effi­ciently trans­port­ing them some­where else, like by pipes. Since the most effi­cient size and lo­ca­tion of these pipes would be sta­ble, you’d want the shape of the pipes to be sta­ble, which you could do by mak­ing the pipes out of a hard ma­te­rial like metal. There’s all sorts of op­er­a­tions that re­quire en­ergy or ne­gen­tropy, and a su­per­con­duct­ing ca­ble car­ry­ing elec­tric­ity seems like an effi­cient way of trans­port­ing that en­ergy. So I don’t know what the aliens were ul­ti­mately try­ing to do, but across a very wide range of pos­si­ble goals, an in­tel­li­gent alien might want to build a su­per­con­duct­ing ca­ble to pur­sue that goal.”

That is: We can take an enor­mous va­ri­ety of com­pactly speci­fi­able goals, like “travel to the other side of the uni­verse” or “sup­port biolog­i­cal life” or “make pa­per­clips”, and find very similar op­ti­mal strate­gies along the way. To­day we don’t ac­tu­ally know if elec­tri­cal su­per­con­duc­tors are the most use­ful way to trans­port en­ergy in the limit of tech­nol­ogy. But what­ever is the most effi­cient way of trans­port­ing en­ergy, whether that’s elec­tri­cal su­per­con­duc­tors or some­thing else, the most effi­cient form of that tech­nol­ogy would prob­a­bly not vary much de­pend­ing on whether you were try­ing to make di­a­monds or make pa­per­clips.

Or to put it an­other way: If you con­sider the goals “make di­a­monds” and “make pa­per­clips”, then they might have al­most noth­ing in com­mon with re­spect to their end-states—a di­a­mond might con­tain no iron. But the ear­lier strate­gies used to make a lot of di­a­mond and make a lot of pa­per­clips might have much in com­mon; “the best way of trans­port­ing en­ergy to make di­a­mond” and “the best way of trans­port­ing en­ergy to make pa­per­clips” are much more likely to be similar.

From a Bayesian stand­point this is how we can iden­tify a huge ma­chine strung with su­per­con­duct­ing ca­bles as hav­ing been pro­duced by high-tech­nol­ogy aliens, even be­fore we have any idea of what the ma­chine does. We’re say­ing, “This looks like the product of op­ti­miza­tion, a strat­egy \(X\) that the aliens chose to best achieve some un­known goal \(Y\); we can in­fer this even with­out know­ing \(Y\) be­cause many pos­si­ble \(Y\)-goals would con­cen­trate prob­a­bil­ity into this \(X\)-strat­egy be­ing used.”

Con­ver­gence and its caveats

When you se­lect policy \(\pi_k\) be­cause you ex­pect it to achieve a later state \(Y_k\) (the “goal”), we say that \(\pi_k\) is your in­stru­men­tal strat­egy for achiev­ing \(Y_k.\) The ob­ser­va­tion of “in­stru­men­tal con­ver­gence” is that a widely differ­ent range of \(Y\)-goals can lead into highly similar \(\pi\)-strate­gies. (This be­comes truer as the \(Y\)-seek­ing agent be­comes more in­stru­men­tally effi­cient; two very pow­er­ful chess en­g­ines are more likely to solve a hu­manly solv­able chess prob­lem the same way, com­pared to two weak chess en­g­ines whose in­di­vi­d­ual quirks might re­sult in idiosyn­cratic solu­tions.)

If there’s a sim­ple way of clas­sify­ing pos­si­ble strate­gies \(\Pi\) into par­ti­tions \(X \subset \Pi\) and \(\neg X \subset \Pi\), and you think that for most com­pactly de­scrib­able goals \(Y_k\) the cor­re­spond­ing best poli­cies \(\pi_k\) are likely to be in­side \(X,\) then you think \(X\) is a “con­ver­gent in­stru­men­tal strat­egy”.

In other words, if you think that a su­per­in­tel­li­gent pa­per­clip max­i­mizer, di­a­mond max­i­mizer, a su­per­in­tel­li­gence that just wanted to keep a sin­gle but­ton pressed for as long as pos­si­ble, and a su­per­in­tel­li­gence op­ti­miz­ing for a flour­ish­ing in­ter­galac­tic civ­i­liza­tion filled with happy sapi­ent be­ings, would all want to “trans­port mat­ter and en­ergy effi­ciently” in or­der to achieve their other goals, then you think “trans­port mat­ter and en­ergy effi­ciently” is a con­ver­gent in­stru­men­tal strat­egy.

In this case “pa­per­clips”, “di­a­monds”, “keep­ing a but­ton pressed as long as pos­si­ble”, and “sapi­ent be­ings hav­ing fun”, would be the goals \(Y_1, Y_2, Y_3, Y_4.\) The cor­re­spond­ing best strate­gies \(\pi_1, \pi_2, \pi_3, \pi_4\) for achiev­ing these goals would not be iden­ti­cal—the poli­cies for mak­ing pa­per­clips and di­a­monds are not ex­actly the same. But all of these poli­cies (we think) would lie within the par­ti­tion \(X \subset \Pi\) where the su­per­in­tel­li­gence tries to “trans­port mat­ter and en­ergy effi­ciently” (per­haps by us­ing su­per­con­duct­ing ca­bles), rather than the com­ple­men­tary par­ti­tion \(\neg X\) where the su­per­in­tel­li­gence does not try to trans­port mat­ter and en­ergy effi­ciently.


  • Con­sider the set of com­putable and tractable util­ity func­tions \(\mathcal U_C\) that take an out­come \(o,\) de­scribed in some lan­guage \(\mathcal L\), onto a ra­tio­nal num­ber \(r\). That is, we sup­pose:

  • That the re­la­tion \(U_k\) be­tween de­scrip­tions \(o_\mathcal L\) of out­comes \(o\), and the cor­re­spond­ing util­ities \(r,\) is com­putable;

  • Fur­ther­more, that it can be com­puted in re­al­is­ti­cally bounded time;

  • Fur­ther­more, that the \(U_k\) re­la­tion be­tween \(o\) and \(r\), and the \(\mathbb P [o | \pi_i]\) re­la­tion be­tween poli­cies and sub­jec­tively ex­pected out­comes, are to­gether reg­u­lar enough that a re­al­is­tic amount of com­put­ing power makes it pos­si­ble to search for poli­cies \(\pi\) that are yield high ex­pected \(U_k(o)\).

  • Choose some sim­ple pro­gram­ming lan­guage \(\mathcal P,\) such as the lan­guage of Tur­ing ma­chines, or Python 2 with­out most of the sys­tem libraries.

  • Choose a sim­ple map­ping \(\mathcal P_B\) from \(\mathcal P\) onto bit­strings.

  • Take all pro­grams in \(\mathcal P_B\) be­tween 20 and 1000 bits in length, and filter them for bound­ed­ness and tractabil­ity when treated as util­ity func­tions, to ob­tain the filtered set \(U_K\).

  • Set 90% as an ar­bi­trary thresh­old.

If, given our be­liefs \(\mathbb P\) about our uni­verse and which poli­cies lead to which real out­comes, we think that in an in­tu­itive sense it sure looks like at least 90% of the util­ity func­tions \(U_k \in U_K\) ought to im­ply best find­able poli­cies \(\pi_k\) which lie within the par­ti­tion \(X\) of \(\Pi,\) we’ll allege that \(X\) is “in­stru­men­tally con­ver­gent”.

Com­pat­i­bil­ity with Vingean uncertainty

Vingean un­cer­tainty is the ob­ser­va­tion that, as we be­come in­creas­ingly con­fi­dent of in­creas­ingly pow­er­ful in­tel­li­gence from an agent with pre­cisely known goals, we be­come de­creas­ingly con­fi­dent of the ex­act moves it will make (un­less the do­main has an op­ti­mal strat­egy and we know the ex­act strat­egy). E.g., to know ex­actly where Deep Blue would move on a chess­board, you would have to be as good at chess as Deep Blue. How­ever, we can be­come in­creas­ingly con­fi­dent that more pow­er­ful chess­play­ers will even­tu­ally win the game—that is, steer the fu­ture out­come of the chess­board into the set of states des­ig­nated ‘win­ning’ for their color—even as it be­comes less pos­si­ble for us to be cer­tain about the chess­player’s ex­act policy.

In­stru­men­tal con­ver­gence can be seen as a caveat to Vingean un­cer­tainty: Even if we don’t know the ex­act ac­tions or the ex­act end goal, we may be able to pre­dict that some in­ter­ven­ing states or poli­cies will fall into cer­tain ab­stract cat­e­gories.

That is: If we don’t know whether a su­per­in­tel­li­gent agent is a pa­per­clip max­i­mizer or a di­a­mond max­i­mizer, we can still guess with some con­fi­dence that it will pur­sue a strat­egy in the gen­eral class “ob­tain more re­sources of mat­ter, en­ergy, and com­pu­ta­tion” rather than “don’t get more re­sources”. This is true even though Vinge’s Prin­ci­ple says that we won’t be able to pre­dict ex­actly how the su­per­in­tel­li­gence will go about gath­er­ing mat­ter and en­ergy.

Imag­ine the real world as an ex­tremely com­pli­cated game. Sup­pose that at the very start of this game, a highly ca­pa­ble player must make a sin­gle bi­nary choice be­tween the ab­stract moves “Gather more re­sources later” and “Never gather any more re­sources later”. Vingean un­cer­tainty or not, we seem jus­tified in putting a high prob­a­bil­ity on the first move be­ing preferred—a bi­nary choice is sim­ple enough that we can take a good guess at the op­ti­mal play.

Con­ver­gence su­per­venes on consequentialism

\(X\) be­ing “in­stru­men­tally con­ver­gent” doesn’t mean that ev­ery mind needs an ex­tra, in­de­pen­dent drive to do \(X.\)

Con­sider the fol­low­ing line of rea­son­ing: “It’s im­pos­si­ble to get on an air­plane with­out buy­ing plane tick­ets. So any­one on an air­plane must be a sort of per­son who en­joys buy­ing plane tick­ets. If I offer them a plane ticket they’ll prob­a­bly buy it, be­cause this is al­most cer­tainly some­body who has an in­de­pen­dent mo­ti­va­tional drive to buy plane tick­ets. There’s just no way you can de­sign an or­ganism that ends up on an air­plane un­less it has a buy­ing-tick­ets drive.”

The ap­pear­ance of an “in­stru­men­tal strat­egy” can be seen as im­plicit in re­peat­edly choos­ing ac­tions \(\pi_k\) that lead into a fi­nal state \(Y_k,\) and it so hap­pens that \(\pi_k \in X\). There doesn’t have to be a spe­cial \(X\)-mod­ule which re­peat­edly se­lects \(\pi_X\)-ac­tions re­gard­less of whether or not they lead to \(Y_k.\)

The flaw in the ar­gu­ment about plane tick­ets is that hu­man be­ings are con­se­quen­tial­ists who buy plane tick­ets just be­cause they wanted to go some­where and they ex­pected the ac­tion “buy the plane ticket” to have the con­se­quence, in that par­tic­u­lar case, of go­ing to the par­tic­u­lar place and time they wanted to go. No ex­tra “buy the plane ticket” mod­ule is re­quired, and es­pe­cially not a plane-ticket-buyer that doesn’t check whether there’s any travel goal and whether buy­ing the plane ticket leads into the de­sired later state.

More semifor­mally, sup­pose that \(U_k\) is the util­ity func­tion of an agent and let \(\pi_k\) be the policy it se­lects. If the agent is in­stru­men­tally effi­cient rel­a­tive to us at achiev­ing \(U_k,\) then from our per­spec­tive we can mostly rea­son about what­ever kind of op­ti­miza­tion it does as if it were ex­pected util­ity max­i­miza­tion, i.e.:

$$\pi_k = \underset{\pi_i \in \Pi}{\operatorname{argmax}} \mathbb E [ U_k | \pi_i ]$$

When we say that \(X\) is in­stru­men­tally con­ver­gent, we are stat­ing that it prob­a­bly so hap­pens that:

$$\big ( \underset{\pi_i \in \Pi}{\operatorname{argmax}} \mathbb E [ U_k | \pi_i ] \big ) \in X$$

We are not mak­ing any claims along the lines that for an agent to thrive, its util­ity func­tion \(U_k\) must de­com­pose into a term for \(X\) plus a resi­d­ual term \(V_k\) de­not­ing the rest of the util­ity func­tion. Rather, \(\pi_k \in X\) is the mere re­sult of un­bi­ased op­ti­miza­tion for a goal \(U_k\) that makes no ex­plicit men­tion of \(X.\)

(This doesn’t rule out that some spe­cial cases of AI de­vel­op­ment path­ways might tend to pro­duce ar­tifi­cial agents with a value func­tion \(U_e\) which does de­com­pose into some var­i­ant \(X_e\) of \(X\) plus other terms \(V_e.\) For ex­am­ple, nat­u­ral se­lec­tion on or­ganisms that spend a long pe­riod of time as non-con­se­quen­tial­ist policy-re­in­force­ment-learn­ers, be­fore they later evolve into con­se­quen­tial­ists, has had re­sults along these lines in the case of hu­mans. For ex­am­ple, hu­mans have an in­de­pen­dent, sep­a­rate “cu­ri­os­ity” drive, in­stead of just valu­ing in­for­ma­tion as a means to in­clu­sive ge­netic fit­ness.)

Re­quired ad­vanced agent properties

Dist­in­guish­ing the ad­vanced agent prop­er­ties that seem prob­a­bly re­quired for an AI pro­gram to start ex­hibit­ing the sort of rea­son­ing filed un­der “in­stru­men­tal con­ver­gence”, the most ob­vi­ous can­di­dates are:

That is: You don’t au­to­mat­i­cally see “ac­quire more com­put­ing power” as a use­ful strat­egy un­less you un­der­stand “I am a cog­ni­tive pro­gram and I tend to achieve more of my goals when I run on more re­sources.” Alter­na­tively, e.g., the pro­gram­mers adding more com­put­ing power and the sys­tem’s goals start­ing to be achieved bet­ter, af­ter which re­lated poli­cies are pos­i­tively re­in­forced and re­peated, could ar­rive at a similar end via the pseu­do­con­se­quen­tial­ist idiom of policy re­in­force­ment.

The ad­vanced agent prop­er­ties that would nat­u­rally or au­to­mat­i­cally lead to in­stru­men­tal con­ver­gence seem well above the range of mod­ern AI pro­grams. As of 2016, cur­rent ma­chine learn­ing al­gorithms don’t seem to be within the range where this pre­dicted phe­nomenon should start to be visi­ble.


An in­stru­men­tal con­ver­gence claim is about a de­fault or a ma­jor­ity of cases, not a uni­ver­sal gen­er­al­iza­tion.

If for what­ever rea­son your goal is to “make pa­per­clips with­out us­ing any su­per­con­duc­tors”, then su­per­con­duct­ing ca­bles will not be the best in­stru­men­tal strat­egy for achiev­ing that goal.

Any claim about in­stru­men­tal con­ver­gence says at most, “The vast ma­jor­ity of pos­si­ble goals \(Y\) would con­ver­gently im­ply a strat­egy in \(X,\) by de­fault and un­less oth­er­wise averted by some spe­cial case \(Y_i\) for which strate­gies in \(\neg X\) are bet­ter.”

See also the more gen­eral idea that the space of pos­si­ble minds is very large. Univer­sal claims about all pos­si­ble minds have many chances to be false, while ex­is­ten­tial claims “There ex­ists at least one pos­si­ble mind such that…” have many chances to be true.

If some par­tic­u­lar oak tree is ex­tremely im­por­tant and valuable to you, then you won’t cut it down to ob­tain wood. It is ir­rele­vant whether a ma­jor­ity of other util­ity func­tions that you could have, but don’t ac­tu­ally have, would sug­gest cut­ting down that oak tree.

Con­ver­gent strate­gies are not de­on­tolog­i­cal rules.

Imag­ine look­ing at a ma­chine chess-player and rea­son­ing, “Well, I don’t think the AI will sac­ri­fice its pawn in this po­si­tion, even to achieve a check­mate. Any chess-play­ing AI needs a drive to be pro­tec­tive of its pawns, or else it’d just give up all its pawns. It wouldn’t have got­ten this far in the game in the first place, if it wasn’t more pro­tec­tive of its pawns than that.”

Modern chess al­gorithms be­have in a fash­ion that most hu­mans can’t dis­t­in­guish from ex­pected-check­mate-max­i­miz­ers. That is, from your merely hu­man per­spec­tive, watch­ing a sin­gle move at the time it hap­pens, there’s no visi­ble differ­ence be­tween your sub­jec­tive ex­pec­ta­tion for the chess al­gorithm’s be­hav­ior, and your ex­pec­ta­tion for the be­hav­ior of an or­a­cle that always out­put the move with the high­est con­di­tional prob­a­bil­ity of lead­ing to check­mate. If you, a hu­man, you could dis­cern with your un­aided eye some sys­tem­atic differ­ence like “this al­gorithm pro­tects its pawn more of­ten than check­mate-achieve­ment would im­ply”, you would know how to make sys­tem­at­i­cally bet­ter chess moves; mod­ern ma­chine chess is too su­per­hu­man for that.

Often, this uniform rule of out­put-the-move-with-high­est-prob­a­bil­ity-of-even­tual-check­mate will seem to pro­tect pawns, or not throw away pawns, or defend pawns when you at­tack them. But if in some spe­cial case the high­est prob­a­bil­ity of check­mate is in­stead achieved by sac­ri­fic­ing a pawn, the chess al­gorithm will do that in­stead.


The rea­son­ing for an in­stru­men­tal con­ver­gence claim says that for many util­ity func­tions \(U_k\) and situ­a­tions \(S_i\) a \(U_k\)-con­se­quen­tial­ist in situ­a­tion \(S_i\) will prob­a­bly find some best policy \(\pi_k = \underset{\pi_i \in \Pi}{\operatorname{argmax}} \mathbb E [ U_k | S_i, \pi_i ]\) that hap­pens to be in­side the par­ti­tion \(X\). If in­stead in situ­a­tion \(S_k\)

$$\big ( \underset{\pi_i \in X}{\operatorname{argmax}} \mathbb E [ U_k | S_k, \pi_i ] \big ) \ < \ \big ( \underset{\pi_i \in \neg X}{\operatorname{argmax}} \mathbb E [ U_k | S_k, \pi_i ] \big )$$

…then a \(U_k\)-con­se­quen­tial­ist in situ­a­tion \(S_k\) won’t do any \(\pi_i \in X\) even if most other sce­nar­ios \(S_i\) make \(X\)-strate­gies pru­dent.

\(X\) would help ac­com­plish \(Y\)” is in­suffi­cient to es­tab­lish a claim of in­stru­men­tal con­ver­gence on \(X\).

Sup­pose you want to get to San Fran­cisco. You could get to San Fran­cisco by pay­ing me $20,000 for a plane ticket. You could also get to San Fran­cisco by pay­ing some­one else $400 for a plane ticket, and this is prob­a­bly the smarter op­tion for achiev­ing your other goals.

Estab­lish­ing “Com­pared to do­ing noth­ing, \(X\) is more use­ful for achiev­ing most \(Y\)-goals” doesn’t es­tab­lish \(X\) as an in­stru­men­tal strat­egy. We need to be­lieve that there’s no other policy in \(\neg X\) which would be more use­ful for achiev­ing most \(Y.\)

When \(X\) is phrased in very gen­eral terms like “ac­quire re­sources”, we might rea­son­ably guess that “don’t ac­quire re­sources” or “do \(Y\) with­out ac­quiring any re­sources” is in­deed un­likely to be a su­pe­rior strat­egy. If \(X_i\) is some nar­rower and more spe­cific strat­egy, like “ac­quire re­sources by min­ing them us­ing pick­axes”, it’s much more likely that some other strat­egy \(X_k\) or even a \(\neg X\)-strat­egy is the real op­ti­mum.

See also: Miss­ing the weird al­ter­na­tive, Cog­ni­tive un­con­tain­abil­ity.

That said, if we can see how a nar­row strat­egy \(X_i\) helps most \(Y\)-goals to some large de­gree, then we should ex­pect the ac­tual policy de­ployed by an effi­cient \(Y_k\)-agent to ob­tain at least as much \(Y_k\) as would \(X_i.\)

That is, we can rea­son­ably ar­gue: “By fol­low­ing the straight­for­ward strat­egy ‘spread as far as pos­si­ble, ab­sorb all reach­able mat­ter, and turn it into pa­per­clips’, an ini­tially un­op­posed su­per­in­tel­li­gent pa­per­clip max­i­mizer could ob­tain \(10^{55}\) pa­per­clips. Then we should ex­pect an ini­tially un­op­posed su­per­in­tel­li­gent pa­per­clip max­i­mizer to get at least this many pa­per­clips, what­ever it ac­tu­ally does. Any strat­egy in the op­po­site par­ti­tion ‘do not spread as far as pos­si­ble, ab­sorb all reach­able mat­ter, and turn it into pa­per­clips’ must seem to yield more than \(10^{55}\) pa­per­clips, be­fore we should ex­pect a pa­per­clip max­i­mizer to do that.”

Similarly, a claim of in­stru­men­tal con­ver­gence on \(X\) can be ce­teris paribus re­futed by pre­sent­ing some al­ter­nate nar­row strat­egy \(W_j \subset \neg X\) which seems to be more use­ful than any ob­vi­ous strat­egy in \(X.\) We are then not pos­i­tively con­fi­dent of con­ver­gence on \(W_j,\) but we should as­sign very low prob­a­bil­ity to the alleged con­ver­gence on \(X,\) at least un­til some­body pre­sents an \(X\)-ex­em­plar with higher ex­pected util­ity than \(W_j.\) If the pro­posed con­ver­gent strat­egy is “trade eco­nom­i­cally with other hu­mans and obey ex­ist­ing sys­tems of prop­erty rights,” and we see no way for Clippy to ob­tain \(10^{55}\) pa­per­clips un­der those rules, but we do think Clippy could get \(10^{55}\) pa­per­clips by ex­pand­ing as fast as pos­si­ble with­out re­gard for hu­man welfare or ex­ist­ing le­gal sys­tems, then we can ce­teris paribus re­ject “obey prop­erty rights” as con­ver­gent. Even if trad­ing with hu­mans to make pa­per­clips pro­duces more pa­per­clips than do­ing noth­ing, it may not pro­duce the most pa­per­clips com­pared to con­vert­ing the ma­te­rial com­pos­ing the hu­mans into more effi­cient pa­per­clip-mak­ing ma­chin­ery.

Claims about in­stru­men­tal con­ver­gence are not eth­i­cal claims.

Whether \(X\) is a good way to get both pa­per­clips and di­a­monds is ir­rele­vant to whether \(X\) is good for hu­man flour­ish­ing or eu­daimo­nia or fun-the­o­retic op­ti­mal­ity or ex­trap­o­lated vo­li­tion or what­ever. Whether \(X\) is, in an in­tu­itive sense, “good”, needs to be eval­u­ated sep­a­rately from whether it is in­stru­men­tally con­ver­gent.

In par­tic­u­lar: in­stru­men­tal strate­gies are not ter­mi­nal val­ues. In fact, they have a type dis­tinc­tion from ter­mi­nal val­ues. “If you’re go­ing to spend re­sources on think­ing about tech­nol­ogy, try to do it ear­lier rather than later, so that you can amor­tize your in­ven­tion over more uses” seems very likely to be an in­stru­men­tally con­ver­gent ex­plo­ra­tion-ex­ploita­tion strat­egy; but “spend cog­ni­tive re­sources sooner rather than later” is more a fea­ture of poli­cies rather than a fea­ture of util­ity func­tions. It’s definitely not plau­si­ble in a prethe­o­retic sense as the Mean­ing of Life. So a par­ti­tion into which most in­stru­men­tal best-strate­gies fall, is not like a uni­ver­sally con­vinc­ing util­ity func­tion (which you prob­a­bly shouldn’t look for in the first place).

Similarly: The nat­u­ral se­lec­tion pro­cess that pro­duced hu­mans gave us many in­de­pen­dent drives \(X_e\) that can be viewed as spe­cial var­i­ants of some con­ver­gent in­stru­men­tal strat­egy \(X.\) A pure pa­per­clip max­i­mizer would calcu­late the value of in­for­ma­tion (VoI) for learn­ing facts that could lead to it mak­ing more pa­per­clips; we can see learn­ing high-value facts as a con­ver­gent strat­egy \(X\). In this case, hu­man “cu­ri­os­ity” can be viewed as the cor­re­spond­ing emo­tion \(X_e.\) This doesn’t mean that the true pur­pose of \(X_e\) is \(X\) any more than the true pur­pose of \(X_e\) is “make more copies of the allele cod­ing for \(X_e\)” or “in­crease in­clu­sive ge­netic fit­ness”. That line of rea­son­ing prob­a­bly re­sults from a mind pro­jec­tion fal­lacy on ‘pur­pose’.

Claims about in­stru­men­tal con­ver­gence are not fu­tur­olog­i­cal pre­dic­tions.

Even if, e.g., “ac­quire re­sources” is an in­stru­men­tally con­ver­gent strat­egy, this doesn’t mean that we can’t as a spe­cial case de­liber­ately con­struct ad­vanced AGIs that are not driven to ac­quire as many re­sources as pos­si­ble. Rather the claim im­plies, “We would need to de­liber­ately build \(X\)-avert­ing agents as a spe­cial case, be­cause by de­fault most imag­in­able agent de­signs would pur­sue a strat­egy in \(X.\)

Of it­self, this ob­ser­va­tion makes no fur­ther claim about the quan­ti­ta­tive prob­a­bil­ity that, in the real world, AGI builders might want to build \(\neg X\)-agents, might try to build \(\neg X\)-agents, and might suc­ceed at build­ing \(\neg X\)-agents.

A claim about in­stru­men­tal con­ver­gence is talk­ing about a log­i­cal prop­erty of the larger de­sign space of pos­si­ble agents, not mak­ing a pre­dic­tion what hap­pens in any par­tic­u­lar re­search lab. (Though the ground facts of com­puter sci­ence are rele­vant to what hap­pens in ac­tual re­search labs.)

For dis­cus­sion of how in­stru­men­tal con­ver­gence may in prac­tice lead to fore­see­able difficul­ties of AGI al­ign­ment that re­sist most sim­ple at­tempts at fix­ing them, see the ar­ti­cles on Patch re­sis­tance and Near­est un­blocked strat­egy.

Cen­tral ex­am­ple: Re­source acquisition

One of the con­ver­gent strate­gies origi­nally pro­posed by Steve Omo­hun­dro in “The Ba­sic AI Drives” was re­source ac­qui­si­tion:

“All com­pu­ta­tion and phys­i­cal ac­tion re­quires the phys­i­cal re­sources of space, time, mat­ter, and free en­ergy. Al­most any goal can be bet­ter ac­com­plished by hav­ing more of these re­sources.”

We’ll con­sider this ex­am­ple as a tem­plate for other pro­posed in­stru­men­tally con­ver­gent strate­gies, and run through the stan­dard ques­tions and caveats.

• Ques­tion: Is this some­thing we’d ex­pect a pa­per­clip max­i­mizer, di­a­mond max­i­mizer, and but­ton-presser to do? And while we’re at it, also a flour­ish­ing-in­ter­galac­tic-civ­i­liza­tion op­ti­mizer?

  • Paper­clip max­i­miz­ers need mat­ter and free en­ergy to make pa­per­clips.

  • Di­a­mond max­i­miz­ers need mat­ter and free en­ergy to make di­a­monds.

  • If you’re try­ing to max­i­mize the prob­a­bil­ity that a sin­gle but­ton stays pressed as long as pos­si­ble, you would build fortresses pro­tect­ing the but­ton and en­ergy stores to sus­tain the fortress and re­pair the but­ton for the longest pos­si­ble pe­riod of time.

  • Nice su­per­in­tel­li­gences try­ing to build happy in­ter­galac­tic civ­i­liza­tions full of flour­ish­ing sapi­ent minds, can build marginally larger civ­i­liza­tions with marginally more hap­piness and marginally longer lifes­pans given marginally more re­sources.

To put it an­other way, for a util­ity func­tion \(U_k\) to im­ply the use of ev­ery joule of en­ergy, it is a suffi­cient con­di­tion that for ev­ery plan \(\pi_i\) with ex­pected util­ity \(\mathbb E [ U | \pi_i ],\) there is a plan \(\pi_j\) with \(\mathbb E [ U | \pi_j ] > \mathbb E [ U | \pi_i]\) that uses one more joule of en­ergy:

  • For ev­ery plan \(\pi_i\) that makes pa­per­clips, there’s a plan \(\pi_j\) that would make more ex­pected pa­per­clips if more en­ergy were available and ac­quired.

  • For ev­ery plan \(\pi_i\) that makes di­a­monds, there’s a plan \(\pi_j\) that makes slightly more di­a­mond given one more joule of en­ergy.

  • For ev­ery plan \(\pi_i\) that pro­duces a prob­a­bil­ity \(\mathbb P (press | \pi_i) = 0.999...\) of a but­ton be­ing pressed, there’s a plan \(\pi_j\) with a slightly higher prob­a­bil­ity of that but­ton be­ing pressed \(\mathbb P (press | \pi_j) = 0.9999...\) which uses up the mass-en­ergy of one more star.

  • For ev­ery plan that pro­duces a flour­ish­ing in­ter­galac­tic civ­i­liza­tion, there’s a plan which pro­duces slightly more flour­ish­ing given slightly more en­ergy.

• Ques­tion: Is there some strat­egy in \(\neg X\) which pro­duces higher \(Y_k\)-achieve­ment for most \(Y_k\) than any strat­egy in­side \(X\)?

Sup­pose that by us­ing most of the mass-en­ergy in most of the stars reach­able be­fore they go over the cos­molog­i­cal hori­zon as seen from pre­sent-day Earth, it would be pos­si­ble to pro­duce \(10^{55}\) pa­per­clips (or di­a­monds, or prob­a­bil­ity-years of ex­pected but­ton-stays-pressed time, or QALYs, etcetera).

It seems rea­son­ably un­likely that there is a strat­egy in­side the space in­tu­itively de­scribed by “Do not ac­quire more re­sources” that would pro­duce \(10^{60}\) pa­per­clips, let alone that the strat­egy pro­duc­ing the most pa­per­clips would be in­side this space.

We might be able to come up with a weird spe­cial-case situ­a­tion \(S_w\) that would im­ply this. But that’s not the same as as­sert­ing, “With high sub­jec­tive prob­a­bil­ity, in the real world, the op­ti­mal strat­egy will be in \(\neg X\).” We’re con­cerned with mak­ing a state­ment about de­faults given the most sub­jec­tively prob­a­ble back­ground states of the uni­verse, not try­ing to make a uni­ver­sal state­ment that cov­ers ev­ery con­ceiv­able pos­si­bil­ity.

To put it an­other way, if your policy choices or pre­dic­tions are only safe given the premise that “In the real world, the best way of pro­duc­ing the max­i­mum pos­si­ble num­ber of pa­per­clips in­volves not ac­quiring any more re­sources”, you need to clearly flag this as a load-bear­ing as­sump­tion.

• Caveat: The claim is not that ev­ery pos­si­ble goal can be bet­ter-ac­com­plished by ac­quiring more re­sources.

As a spe­cial case, this would not be true of an agent with an im­pact penalty term in its util­ity func­tion, or some other low-im­pact agent, if that agent also only had goals of a form that could be satis­fied in­side bounded re­gions of space and time with a bounded effort.

We might rea­son­ably ex­pect this spe­cial kind of agent to only ac­quire the min­i­mum re­sources to ac­com­plish its task.

But we wouldn’t ex­pect this to be true in a ma­jor­ity of pos­si­ble cases in­side mind de­sign space; it’s not true by de­fault; we need to spec­ify a fur­ther fact about the agent to make the claim not be true; we must ex­pend en­g­ineer­ing effort to make an agent like that, and failures of this effort will re­sult in re­ver­sion-to-de­fault. If we imag­ine some com­pu­ta­tion­ally sim­ple lan­guage for spec­i­fy­ing util­ity func­tions, then most util­ity func­tions wouldn’t hap­pen to have both of these prop­er­ties, so a ma­jor­ity of util­ity func­tions given this lan­guage and mea­sure would not by de­fault try to use fewer re­sources.

• Caveat: The claim is not that well-func­tion­ing agents must have ad­di­tional, in­de­pen­dent re­source-ac­quiring mo­ti­va­tional drives.

A pa­per­clip max­i­mizer will act like it is “ob­tain­ing re­sources” if it merely im­ple­ments the policy it ex­pects to lead to the most pa­per­clips. Clippy does not need to have any sep­a­rate and in­de­pen­dent term in its util­ity func­tion for the amount of re­source it pos­sesses (and in­deed this would po­ten­tially in­terfere with Clippy mak­ing pa­per­clips, since it might then be tempted to hold onto re­sources in­stead of mak­ing pa­per­clips with them).

• Caveat: The claim is not that most agents will be­have as if un­der a de­on­tolog­i­cal im­per­a­tive to ac­quire re­sources.

A pa­per­clip max­i­mizer wouldn’t nec­es­sar­ily tear apart a work­ing pa­per­clip fac­tory to “ac­quire more re­sources” (at least not un­til that fac­tory had already pro­duced all the pa­per­clips it was go­ing to help pro­duce.)

• Check: Are we ar­gu­ing “Ac­quiring re­sources is a bet­ter way to make a few more pa­per­clips than do­ing noth­ing” or “There’s no bet­ter/​best way to make pa­per­clips that in­volves not ac­quiring more mat­ter and en­ergy”?

As men­tioned above, the lat­ter seems rea­son­able in this case.

• Caveat: “Ac­quiring re­sources is in­stru­men­tally con­ver­gent” is not an eth­i­cal claim.

The fact that a pa­per­clip max­i­mizer would try to ac­quire all mat­ter and en­ergy within reach, does not of it­self bear on whether our own nor­ma­tive val­ues might per­haps com­mand that we ought to use few re­sources as a ter­mi­nal value.

(Though some of us might find pretty com­pel­ling the ob­ser­va­tion that if you leave mat­ter ly­ing around, it sits around not do­ing any­thing and even­tu­ally the pro­tons de­cay or the ex­pand­ing uni­verse tears it apart, whereas if you turn the mat­ter into peo­ple, it can have fun. There’s no rule that in­stru­men­tally con­ver­gent strate­gies don’t hap­pen to be the right thing to do.)

• Caveat: “Ac­quiring re­sources is in­stru­men­tally con­ver­gent” is not of it­self a fu­tur­olog­i­cal pre­dic­tion.

See above. Maybe we try to build Task AGIs in­stead. Maybe we suc­ceed, and Task AGIs don’t con­sume lots of re­sources be­cause they have well-bounded tasks and im­pact penalties.

Rele­vance to the larger field of value al­ign­ment theory

The list of ar­guably con­ver­gent strate­gies has its own page. How­ever, some of the key strate­gies that have been ar­gued as con­ver­gent in e.g. Omo­hun­dro’s “The Ba­sic AI Drives” and Bostrom’s “The Su­per­in­tel­li­gent Will: Mo­ti­va­tion and In­stru­men­tal Ra­tion­al­ity in Ad­vanced Ar­tifi­cial Agents” in­clude:

  • Ac­quiring/​con­trol­ling mat­ter and en­ergy.

  • En­sur­ing that fu­ture in­tel­li­gences with similar goals ex­ist. E.g., a pa­per­clip max­i­mizer wants the fu­ture to con­tain pow­er­ful, effec­tive in­tel­li­gences that max­i­mize pa­per­clips.

  • An im­por­tant spe­cial case of this gen­eral rule is self-preser­va­tion.

  • Another spe­cial case of this rule is pro­tect­ing goal-con­tent in­tegrity (not al­low­ing ac­ci­den­tal or de­liber­ate mod­ifi­ca­tion of the util­ity func­tion).

  • Learn­ing about the world (so as to bet­ter ma­nipu­late it to make pa­per­clips).

  • Car­ry­ing out rele­vant sci­en­tific in­ves­ti­ga­tions.

  • Op­ti­miz­ing tech­nol­ogy and de­signs.

  • En­gag­ing in an “ex­plo­ra­tion” phase of seek­ing op­ti­mal de­signs be­fore an “ex­ploita­tion” phase of us­ing them.

  • Think­ing effec­tively (treat­ing the cog­ni­tive self as an im­prov­able tech­nol­ogy).

  • Im­prov­ing cog­ni­tive pro­cesses.

  • Ac­quiring com­put­ing re­sources for thought.

This is rele­vant to some of the cen­tral back­ground ideas in AGI al­ign­ment, be­cause:

  • A su­per­in­tel­li­gence can have a catas­trophic im­pact on our world even if its util­ity func­tion con­tains no overtly hos­tile terms. A pa­per­clip max­i­mizer doesn’t hate you, it just wants pa­per­clips.

  • A con­se­quen­tial­ist AGI with suffi­cient big-pic­ture un­der­stand­ing will by de­fault want to pro­mote it­self to a su­per­in­tel­li­gence, even if the pro­gram­mers did not ex­plic­itly pro­gram it to want to self-im­prove. Even a pseu­do­con­se­quen­tial­ist may e.g. re­peat strate­gies that led to pre­vi­ous cog­ni­tive ca­pa­bil­ity gains.

This means that pro­gram­mers don’t have to be evil, or even de­liber­ately bent on cre­at­ing su­per­in­tel­li­gence, in or­der for their work to have catas­trophic con­se­quences.

The list of con­ver­gent strate­gies, by its na­ture, tends to in­clude ev­ery­thing an agent needs to sur­vive and grow. This sup­ports strong forms of the Orthog­o­nal­ity Th­e­sis be­ing true in prac­tice as well as in prin­ci­ple. We don’t need to filter on agents with ex­plicit ter­mi­nal val­ues for e.g. “sur­vival” in or­der to find sur­viv­ing pow­er­ful agents.

In­stru­men­tal con­ver­gence is also why we ex­pect to en­counter most of the prob­lems filed un­der Cor­rigi­bil­ity. When the AI is young, it’s less likely to be in­stru­men­tally effi­cient or un­der­stand the rele­vant parts of the big­ger pic­ture; but once it does, we would by de­fault ex­pect, e.g.:

  • That the AI will try to avoid be­ing shut down.

  • That it will try to build sub­agents (with iden­ti­cal goals) in the en­vi­ron­ment.

  • That the AI will re­sist mod­ifi­ca­tion of its util­ity func­tion.

  • That the AI will try to avoid the pro­gram­mers learn­ing facts that would lead them to mod­ify the AI’s util­ity func­tion.

  • That the AI will try to pre­tend to be friendly even if it is not.

  • That the AI will try to con­ceal hos­tile thoughts (and the fact that any con­cealed thoughts ex­ist).

This paints a much more effort­ful pic­ture of AGI al­ign­ment work than “Oh, well, we’ll just test it to see if it looks nice, and if not, we’ll just shut off the elec­tric­ity.”

The point that some un­de­sir­able be­hav­iors are in­stru­men­tally con­ver­gent gives rise to the Near­est un­blocked strat­egy prob­lem. Sup­pose the AGI’s most preferred policy starts out as one of these in­cor­rigible be­hav­iors. Sup­pose we cur­rently have enough con­trol to add patches to the AGI’s util­ity func­tion, in­tended to rule out the in­cor­rigible be­hav­ior. Then, af­ter in­te­grat­ing the in­tended patch, the new most preferred policy may be the most similar policy that wasn’t ex­plic­itly blocked. If you naively give the AI a term in its util­ity func­tion for “hav­ing an off-switch”, it may still build sub­agents or suc­ces­sors that don’t have off-switches. Similarly, when the AGI be­comes more pow­er­ful and its op­tion space ex­pands, it’s again likely to find new similar poli­cies that weren’t ex­plic­itly blocked.

Thus, in­stru­men­tal con­ver­gence is one of the two ba­sic sources of patch re­sis­tance as a fore­see­able difficulty of AGI al­ign­ment work.

write a tu­to­rial for the cen­tral ex­am­ple of a pa­per­clip max­i­mizer dis­t­in­guish that the propo­si­tion is con­ver­gent pres­sure, not con­ver­gent de­ci­sion the com­monly sug­gested in­stru­men­tal con­ver­gences sep­a­rately: figure out the ‘prob­le­matic in­stru­men­tal pres­sures’ list for Cor­rigi­bil­ity sep­a­rately: ex­plain why in­stru­men­tal pres­sures may be patch-re­sis­tant es­pe­cially in self-mod­ify­ing consequentialists


  • Paperclip maximizer

    This agent will not stop un­til the en­tire uni­verse is filled with pa­per­clips.

  • Instrumental

    What is “in­stru­men­tal” in the con­text of Value Align­ment The­ory?

  • Instrumental pressure

    A con­se­quen­tial­ist agent will want to bring about cer­tain in­stru­men­tal events that will help to fulfill its goals.

  • Convergent instrumental strategies

    Paper­clip max­i­miz­ers can make more pa­per­clips by im­prov­ing their cog­ni­tive abil­ities or con­trol­ling more re­sources. What other strate­gies would al­most-any AI try to use?

  • You can't get more paperclips that way

    Most ar­gu­ments that “A pa­per­clip max­i­mizer could get more pa­per­clips by (do­ing nice things)” are flawed.


  • Theory of (advanced) agents

    One of the re­search sub­prob­lems of build­ing pow­er­ful nice AIs, is the the­ory of (suffi­ciently ad­vanced) minds in gen­eral.