Nearest unblocked strategy

link to epistemic ver­sion http://​​less­​​lw/​​nki/​​jfk_was_not_as­sas­si­nated_prior_prob­a­bil­ity_zero/​​d9h3?con­text=3


‘Near­est un­blocked strat­egy’ seems like it should be a fore­see­able prob­lem of try­ing to get rid of un­de­sir­able AI be­hav­iors by adding spe­cific penalty terms to them, or oth­er­wise try­ing to ex­clude one class of ob­served or fore­seen bad be­hav­iors. Namely, if a de­ci­sion crite­rion thinks \(X\) is the best thing to do, and you add a penalty term \(P\) that you think ex­cludes ev­ery­thing in­side \(X,\) the next-best thing to do may be a very similar thing \(X'\) which is the most similar thing to \(X\) that doesn’t trig­ger \(P.\)

Ex­am­ple: Pro­duc­ing hap­piness.

Some very early pro­pos­als for AI al­ign­ment sug­gested that AIs be tar­geted on pro­duc­ing hu­man hap­piness. Leav­ing aside var­i­ous other ob­jec­tions, ar­guendo, imag­ine the fol­low­ing se­ries of prob­lems and at­tempted fixes:

  • By hy­poth­e­sis, the AI is suc­cess­fully in­fused with a goal of “hu­man hap­piness” as a util­ity func­tion over hu­man brain states. (Ar­guendo, this pred­i­cate is nar­rowed suffi­ciently that the AI does not just want to con­struct the tiniest, least re­source-in­ten­sive brains ex­pe­rienc­ing the largest amount of hap­piness per erg of en­ergy.)

  • Ini­tially, the AI seems to be pur­su­ing this goal in good ways; it or­ga­nizes files, tells funny jokes, helps land­ladies take out the garbage, etcetera.

  • En­couraged, the pro­gram­mers fur­ther im­prove the AI and add more com­put­ing power.

  • The AI gains a bet­ter un­der­stand­ing of the world, and the AI’s policy space ex­pands to in­clude con­ceiv­able op­tions like “ad­minister heroin”.

  • The AI starts plan­ning how to ad­minister heroin to peo­ple.

  • The pro­gram­mers no­tice this be­fore it hap­pens. (Ar­guendo, due to suc­cess­ful trans­parency fea­tures, or an im­per­a­tive to check plans with the users, which op­er­ated as in­tended at the AI’s cur­rent level of in­tel­li­gence.)

  • The pro­gram­mers edit the AI’s util­ity func­tion and add a penalty of −100 utilons for any event cat­e­go­rized as “the AI ad­ministers heroin to hu­mans”. (Ar­guendo, the AI’s cur­rent level of in­tel­li­gence does not suffice to pre­vent the pro­gram­mers from edit­ing its util­ity func­tion, de­spite the con­ver­gent in­stru­men­tal in­cen­tive to avoid this; nor does it suc­cess­fully de­ceive the pro­gram­mers.)

  • The AI gets slightly smarter. New con­ceiv­able op­tions en­ter the AI’s op­tion space.

  • The AI starts want­ing to ad­minister co­caine to hu­mans (in­stead of heroin).

  • The pro­gram­mers read through the cur­rent sched­ule of pro­hibited drugs and add penalty terms for ad­minis­ter­ing mar­ijuana, co­caine, etcetera.

  • The AI be­comes slightly smarter. New op­tions en­ter its policy space.

  • The AI starts think­ing about how to re­search a new hap­piness drug not on the list of drugs that its util­ity func­tion des­ig­nates as bad.

  • The pro­gram­mers, af­ter some work, man­age to de­velop a cat­e­gory for ‘The AI forcibly ad­minis­ter­ing any kind of psy­choac­tive drug to hu­mans’ which is broad enough that the AI stops sug­gest­ing re­search cam­paigns to de­velop things slightly out­side the cat­e­gory.

  • The AI wants to build an ex­ter­nal sys­tem to ad­minister heroin, so that it won’t be clas­sified in­side this set of bad events “the AI forcibly ad­minis­ter­ing drugs”.

  • The pro­gram­mers gen­er­al­ize the penalty pred­i­cate to in­clude “ma­chine sys­tems in gen­eral forcibly ad­minis­ter­ing heroin” as a bad thing.

  • The AI re­calcu­lates what it wants, and be­gins to want to pay hu­mans to ad­minister heroin.

  • The pro­gram­mers try to gen­er­al­ize the cat­e­gory of pe­nal­ized events to in­clude non-vol­un­tar­ily ad­minis­tra­tion of drugs in gen­eral that pro­duce hap­piness, whether done by hu­mans or AIs. The pro­gram­mers patch this cat­e­gory so that the AI is not try­ing to shut down at least the nicer parts of psy­chi­a­tric hos­pi­tals.

  • The AI be­gins plan­ning an ad cam­paign to per­suade peo­ple to use heroin vol­un­tar­ily.

  • The pro­gram­mers add a penalty of −100 utilons for “AIs per­suad­ing hu­mans to use drugs”.

  • The AI goes back to helping land­ladies take out the garbage. All seems to be well.

  • The AI con­tinues to in­crease in in­tel­li­gence, be­com­ing ca­pa­ble enough that the AI can no longer be ed­ited against its own will.

  • The AI no­tices the op­tion “Tweak hu­man brains to ex­press ex­tremely high lev­els of en­doge­nous opi­ates, then take care of their twitch­ing bod­ies to so they can go on be­ing happy”.

The over­all story is one where the AI’s prefer­ences on round \(i,\) de­noted \(U_i,\) are ob­served to ar­rive at an at­tain­able op­ti­mum \(X_i\) which the hu­mans see as un­de­sir­able. The hu­mans de­vise a penalty term \(P_i\) in­tended to ex­clude the un­de­sir­able parts of the policy space, and add this to \(U_i\) cre­at­ing a new util­ity func­tion \(U_{i+1},\) af­ter which the AI’s op­ti­mal policy set­tles into a new state \(X_i^*\) that seems ac­cept­able. How­ever, af­ter the next ex­pan­sion of the policy space, \(U_{i+1}\) set­tles into a new at­tain­able op­ti­mum \(X_{i+1}\) which is very similar to \(X_i\) and makes the min­i­mum ad­just­ment nec­es­sary to evade the bound­aries of the penalty term \(P_i,\) re­quiring a new penalty term \(P_{i+1}\) to ex­clude this new mis­be­hav­ior.

(The end of this story might not kill you if the AI had enough suc­cess­ful, ad­vanced-safe cor­rigi­bil­ity fea­tures that the AI would in­definitely go on check­ing novel poli­cies and novel goal in­stan­ti­a­tions with the users, not strate­gi­cally hid­ing its dis­al­ign­ment from the pro­gram­mers, not de­ceiv­ing the pro­gram­mers, let­ting the pro­gram­mers edit its util­ity func­tion, not do­ing any­thing dis­as­trous be­fore the util­ity func­tion had been ed­ited, etcetera. But you wouldn’t want to rely on this. You would not want in the first place to op­er­ate on the paradigm of ‘max­i­mize hap­piness, but not via any of these bad meth­ods that we have already ex­cluded’.)


Re­cur­rence of a nearby un­blocked strat­egy is ar­gued to be a fore­see­able difficulty given the fol­low­ing pre­con­di­tions:

• The AI is a con­se­quen­tial­ist, or is con­duct­ing some other search such that when the search is blocked at \(X,\) the search may hap­pen upon a similar \(X'\) that fits the same crite­rion that origi­nally pro­moted \(X.\) E.g. in an agent that se­lects ac­tions on the ba­sis of their con­se­quences, if an event \(X\) leads to goal \(G\) but \(X\) is blocked, then a similar \(X'\) may also have the prop­erty of lead­ing to \(G.\)

• The search is tak­ing place over a rich do­main where the space of rele­vant neigh­bors around X is too com­pli­cated for us to be cer­tain that we have de­scribed all the rele­vant neigh­bors cor­rectly. If we imag­ine an agent play­ing the purely ideal game of log­i­cal Tic-Tac-Toe, then if the agent’s util­ity func­tion hates play­ing in the cen­ter of the board, we can be sure (be­cause we can ex­haus­tively con­sider the space) that there are no Tic-Tac-Toe squares that be­have strate­gi­cally al­most like the cen­ter but don’t meet the ex­act defi­ni­tion we used of ‘cen­ter’. In the far more com­pli­cated real world, when you elimi­nate ‘ad­minister heroin’ you are very likely to find some other chem­i­cal or trick that is strate­gi­cally mostly equiv­a­lent to ad­minis­ter­ing heroin. See “Al­most all real-world do­mains are rich”.

• From our per­spec­tive on value, the AI does not have an ab­solute iden­ti­fi­ca­tion of value for the do­main, due to some com­bi­na­tion of “the do­main is rich” and “value is com­plex”. Chess is com­pli­cated enough that hu­man play­ers can’t ab­solutely iden­tify win­ning moves, but since a chess pro­gram can have an ab­solute iden­ti­fi­ca­tion of which end­states con­sti­tute win­ning, we don’t run into a prob­lem of un­end­ing patches in iden­ti­fy­ing which states of the board are good play. (How­ever, if we con­sider a very early chess pro­gram that (from our per­spec­tive) was try­ing to be a con­se­quen­tial­ist but wasn’t very good at it, then we can imag­ine that, if the early chess pro­gram con­sis­tently threw its queen onto the right edge of the board for strange rea­sons, for­bid­ding it to move the queen there might well lead it to throw the queen onto the left edge for the same strange rea­sons.)


‘Near­est un­blocked’ be­hav­ior is some­times ob­served in humans

Although hu­mans obey­ing the law make poor analo­gies for math­e­mat­i­cal al­gorithms, in some cases hu­man eco­nomic ac­tors ex­pect not to en­counter le­gal or so­cial penalties for obey­ing the let­ter rather than the spirit of the law. In those cases, af­ter a pre­vi­ously high-yield strat­egy is out­lawed or pe­nal­ized, the re­sult is very of­ten a near-neigh­bor­ing re­sult that barely evades the let­ter of the law. This illus­trates that the the­o­ret­i­cal ar­gu­ment also ap­plies in prac­tice to at least some pseudo-eco­nomic agents (hu­mans), as we would ex­pect given the stated pre­con­di­tions.

Com­plex­ity of value means we should not ex­pect to find a sim­ple en­cod­ing to ex­clude detri­men­tal strategies

To a hu­man, ‘poi­sonous’ is one word. In terms of molec­u­lar biol­ogy, the ex­act vol­ume of the con­figu­ra­tion space of molecules that is ‘non­poi­sonous’ is very com­pli­cated. By hav­ing a sin­gle word/​con­cept for poi­sonous-vs.-non­poi­sonous, we’re di­men­sion­ally re­duc­ing the space of ed­ible sub­stances—tak­ing a very squig­gly vol­ume of molecule-space, and map­ping it all onto a lin­ear scale from ‘non­poi­sonous’ to ‘poi­sonous’.

There’s a sense in which hu­man cog­ni­tion im­plic­itly performs di­men­sional re­duc­tion on our solu­tion space, es­pe­cially by sim­plify­ing di­men­sions that are rele­vant to some com­po­nent of our val­ues. There may be some psy­cholog­i­cal sense in which we feel like “do X, only not weird low-value X” ought to be a sim­ple in­struc­tion, and an agent that re­peat­edly pro­duces the next un­blocked weird low-value X is be­ing per­verse—that the agent, given a few ex­am­ples of weird low-value Xs la­beled as non­in­stances of the de­sired con­cept, ought to be able to just gen­er­al­ize to not pro­duce weird low-value Xs.

In fact, if it were pos­si­ble to en­code all rele­vant di­men­sions of hu­man value into the agent then we could just say di­rectly to “do X, but not low-value X”. By the defi­ni­tion of full cov­er­age, the agent’s con­cept for ‘low-value’ in­cludes ev­ery­thing that is ac­tu­ally of low value, so this one in­struc­tion would blan­ket all the un­de­sir­able strate­gies we want to avoid.

Con­versely, the truth of the com­plex­ity of value the­sis would im­ply that the sim­ple word ‘low-value’ is di­men­sion­ally re­duc­ing a space of tremen­dous al­gorith­mic com­plex­ity. Thus the effort re­quired to ac­tu­ally con­vey the rele­vant dos and don’ts of “X, only not weird low-value X” would be high, and a hu­man-gen­er­ated set of su­per­vised ex­am­ples la­beled ‘not the kind of X we mean’ would be un­likely to cover and sta­bi­lize all the di­men­sions of the un­der­ly­ing space of pos­si­bil­ities. Since the weird low-value X can­not be elimi­nated in one in­struc­tion or sev­eral patches or a hu­man-gen­er­ated set of su­per­vised ex­am­ples, the near­est un­blocked strat­egy prob­lem will re­cur in­cre­men­tally each time a patch is at­tempted and then the policy space is widened again.


Near­est un­blocked strat­egy be­ing a fore­see­able difficulty is a ma­jor con­trib­u­tor to wor­ry­ing that short-term in­cen­tives in AI de­vel­op­ment, to get to­day’s sys­tem work­ing to­day, or to have to­day’s sys­tem not ex­hibit­ing any im­me­di­ately visi­ble prob­lems to­day, will not lead to ad­vanced agents which are safe af­ter un­der­go­ing sig­nifi­cant gains in ca­pa­bil­ity.

More gen­er­ally, near­est un­blocked strat­egy is a fore­see­able rea­son why say­ing “Well just ex­clude X” or “Just write the code to not X” or “Add a penalty term for X” doesn’t solve most of the is­sues that crop up in AI al­ign­ment.

Even more gen­er­ally, this sug­gests that we want AIs to op­er­ate in­side a space of con­ser­va­tive cat­e­gories con­tain­ing ac­tively whitelisted strate­gies and goal in­stan­ti­a­tions, rather than hav­ing the AI op­er­ate in­side a (con­stantly ex­pand­ing) space of all con­ceiv­able poli­cies minus a set of black­listed cat­e­gories.


  • Advanced safety

    An agent is re­ally safe when it has the ca­pac­ity to do any­thing, but chooses to do what the pro­gram­mer wants.