Instrumental goals are almost-equally as tractable as terminal goals

One counterargument to the Orthogonality Thesis asserts that agents with terminal preferences for goals such as resource acquisition will always be much better at those goals than agents which merely try to acquire resources on the way to doing something else, like making paperclips. Therefore, by filtering on real-world competent agents, we filter out all agents which do not have terminal preferences for acquiring resources.

A reply is that "figuring out how to do \(W_4\) on the way to \(W_3\), on the way to \(W_2\), on the way to \(W_1\), without that particular way of doing \(W_4\) stomping on your ability to later achieve \(W_2\)" is such a ubiquitous idiom of cognition or supercognition that (a) any competent agent must already do that all the time, and (b) it doesn't seem like adding one more straightforward target \(W_0\) to the end of the chain should usually result in greatly increased computational costs or greatly diminished ability to optimize \(W_4\).

E.g. contrast the necessary thoughts of a paperclip maximizer acquiring resources in order to turn them into paperclips, and an agent with a terminal goal of acquiring and hoarding resources.

The paperclip maximizer has a terminal utility function \(U_0\) which counts the number of paperclips in the universe (or rather, paperclip-seconds in the universe's history). The paperclip maximizer then identifies a sequence of subgoals and sub-subgoals \(W_1, W_2, W_3, \ldots, W_N\) corresponding to increasingly fine-grained strategies for making paperclips, each of which is subject to the constraint that it doesn't stomp on the previous elements of the goal hierarchy. (For simplicity of exposition we temporarily pretend that each goal has only one subgoal rather than a family of conjunctive and disjunctive subgoals.)

More concretely, we can imagine that \(W_1\) is "get matter under my control (in a way that doesn't stop me from making paperclips with it)"; that is, if we were to consider the naive or unconditional description \(W_1'\) "get matter under my 'control' (whether or not I can make paperclips with it)", we are here interested in a subset of states \(W_1 \subset W_1'\) such that \(\mathbb E[U_0|W_1]\) is high. Then \(W_2\) might be "explore the universe to find matter (in such a way that it doesn't interfere with bringing that matter under control or turning it into paperclips)", \(W_3\) might be "build interstellar probes (in such a way that …)", and as we go further into the hierarchy we will find \(W_{10}\) "gather all the materials for an interstellar probe in one place (in such a way that …)", \(W_{20}\) "lay the next 20 sets of rails for transporting the titanium cart", and \(W_{25}\) "move the left controller upward".
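The subset-filtering move above — keeping only those ways of achieving a subgoal under which the expected top-level utility \(\mathbb E[U_0|W_1]\) stays high — can be sketched in a few lines. This is a toy illustration, not a planning algorithm; all names and numbers below are hypothetical:

```python
# Toy sketch of subgoal refinement: each candidate way of achieving a
# subgoal is kept only if, conditional on it, the estimated top-level
# utility E[U_0 | subgoal] stays high (it doesn't "stomp on" higher goals).

def refine(candidate_subgoals, expected_utility, threshold):
    """Return the candidates that don't stomp on the higher goal.

    expected_utility(w) stands in for E[U_0 | w]: the agent's estimate
    of top-level utility conditional on pursuing candidate w.
    """
    return [w for w in candidate_subgoals if expected_utility(w) >= threshold]

# Unconditional candidates W_1': ways of "getting matter under control".
candidates = {
    "mine asteroid, smelt into stock": 0.9,  # matter stays usable for clips
    "vitrify matter into inert glass": 0.1,  # matter no longer convertible
}

usable = refine(candidates, candidates.get, 0.5)
print(usable)  # only the option that preserves E[U_0] survives
```

The conditioning is what distinguishes \(W_1\) from the naive \(W_1'\): both options "get matter under control", but only one remains in the refined subset.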

Of course by the time we're that deep in the hierarchy, any efficient planning algorithm is making some use of real independences, where we can reason relatively myopically about how to lay train tracks without worrying very much about what the cart of titanium is being used for. (Provided that the strategies are constrained enough in domain to not include any strategies that stomp distant higher goals, e.g. the strategy "build an independent superintelligence that just wants to lay train tracks"; if the system were optimizing that broadly it would need to check distant consequences and condition on them.)

The reply would then be that, in general, any feat of superintelligence requires making a ton of big, medium-sized, and little strategies all converge on a single future state, in virtue of all of those strategies having been selected sufficiently well to optimize the expectation \(\mathbb E[U|W_1, W_2, \ldots]\) for some \(U\). A ton of little and medium-sized strategies must have all managed not to collide with each other or with larger big-picture considerations. If you can't do this much then you can't win a game of Go or build a factory or even walk across the room without your limbs tangling up.
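The point that strategies must be selected jointly, not in isolation, can be made concrete with a toy joint choice (strategies and payoffs invented for illustration): the options that look best locally can collide, and optimizing the conditional expectation of a single \(U\) over the whole combination is what rules the collision out:

```python
# Toy illustration: picking each strategy by its own local score selects
# a colliding pair, while optimizing the joint expectation E[U | W_1, W_2]
# over whole combinations avoids the collision.

import itertools

LOCAL = {
    "expand fast": 0.6, "expand slow": 0.5,   # candidate W_1 strategies
    "stay stealthy": 0.3, "act openly": 0.1,  # candidate W_2 strategies
}

def utility(w1, w2):
    """Stand-in for E[U | W_1 = w1, W_2 = w2]."""
    if (w1, w2) == ("expand fast", "stay stealthy"):
        return 0.0  # these two strategies stomp on each other
    return LOCAL[w1] + LOCAL[w2]

# Greedy local selection: each strategy maximizes its own score.
greedy = (max(["expand fast", "expand slow"], key=LOCAL.get),
          max(["stay stealthy", "act openly"], key=LOCAL.get))

# Joint consequentialist selection over the whole combination.
joint = max(itertools.product(["expand fast", "expand slow"],
                              ["stay stealthy", "act openly"]),
            key=lambda ws: utility(*ws))

print(greedy, joint)  # greedy picks the colliding pair; joint does not
```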

Then there doesn't seem to be any good reason to expect an agent which is instead directly optimizing the utility function \(U_1\), "acquire and hoard resources", to do a very much better job of optimizing \(W_{10}\) or \(W_{25}\). When \(W_{25}\) already needs to be conditioned in such a way as to not stomp on all the higher goals \(W_2, W_3, \ldots\), it just doesn't seem that much less constraining to target \(U_1\) versus \(U_0\) via \(W_1\). Most of the cognitive labor in the sequence does not seem like it should be going into checking for \(U_0\) at the end instead of checking for \(U_1\) at the end. It should be going into, e.g., figuring out how to make any kind of interstellar probe and figuring out how to build factories.

It has not historically been the case that the most computationally efficient way to play chess is to have competing agents inside the chess algorithm trying to optimize different unconditional utility functions and bidding on the right to make moves in order to pursue their own local goal of "protect the queen, regardless of other long-term consequences" or "control the center, regardless of other long-term consequences". What we are actually trying to get is the chess move such that, conditioning on that chess move and the sort of future chess moves we are likely to make, our chance of winning is the highest. The best modern chess algorithms do their best to factor in anything that affects long-range consequences whenever they know about those consequences. The best chess algorithms don't try to factor things into lots of colliding unconditional urges, because sometimes that's not how "the winning move" factors. You can extremely often do better by doing a deeper consequentialist search that conditions multiple elements of your strategy on longer-term consequences in a way that prevents your moves from stepping on each other. It's not very much of an exaggeration to say that this is why humans with brains that can imagine long-term consequences are smarter than, say, armadillos.
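As a cartoon of the chess point (the positions, moves, and scores are all invented for illustration): scoring each move by a single long-range win probability can disagree with letting unconditional local urges bid against each other, and it is the former that finds the winning move:

```python
# Toy contrast between the two architectures: competing unconditional
# urges bidding on moves, versus a single consequentialist search that
# scores each move by its long-range win probability.

moves = {
    "guard queen":     {"queen_safety": 0.9, "center": 0.2, "win_prob": 0.30},
    "sac queen, mate": {"queen_safety": 0.0, "center": 0.5, "win_prob": 0.99},
}

# Urge-bidding: the loudest local drive wins, ignoring long-term consequences.
urge_pick = max(moves, key=lambda m: max(moves[m]["queen_safety"],
                                         moves[m]["center"]))

# Consequentialist: condition directly on the thing you actually want.
consequentialist_pick = max(moves, key=lambda m: moves[m]["win_prob"])

print(urge_pick, consequentialist_pick)  # the two schemes disagree here
```

The "protect the queen" urge outbids everything and refuses the queen sacrifice that mates; conditioning on the win probability accepts it.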

Sometimes there are subtleties we don't have the computing power to notice; we can't literally condition on the entire future. But "to make paperclips, acquire resources and use them to make paperclips" versus "to make paperclips, acquire resources regardless of whether they can be used to make paperclips" is not subtle. We'd expect a superintelligence that was efficient relative to humans to understand and correct at least those divergences between \(W_1\) and \(W_1'\) that a human could see, using at most the trivial amount of computing power represented by a human brain. To the extent that particular choices are being selected on over a domain that is likely to include choices with huge long-range consequences, one expends the computing power to check and condition on the long-range consequences; but a supermajority of choices shouldn't require checks of this sort; and even choices about how to design train tracks that do require longer-range checks are not going to be very much more tractable depending on whether the distant top of the goal hierarchy is something like "make paperclips" or "hoard resources".

Even supposing that there could be 5% more computational cost associated with checking instrumental strategies for stepping on "promote fun-theoretic eudaimonia", which might ubiquitously involve considerations like "make sure none of the computational processes you use to do this are themselves sentient", this doesn't mean you can't have competent agents that go ahead and spend 5% more computation. It's simply the correct choice to build subagents that expend 5% more computation to maintain coordination on achieving eudaimonia, rather than building subagents that expend 5% less computation to hoard resources and never give them back. It doesn't matter if the second kind of agents are less "costly" in some myopic sense; they are vastly less useful and indeed actively destructive. So nothing that is choosing so as to optimize its expectation of \(U_0\) will build a subagent that generally optimizes its own expectation of \(U_1\).


  • Orthogonality Thesis

    Will smart AIs automatically become benevolent, or automatically become hostile? Or do different AI designs imply different goals?