Edge instantiation

The edge in­stan­ti­a­tion prob­lem is a hy­poth­e­sized patch-re­sis­tant prob­lem for safe value load­ing in ad­vanced agent sce­nar­ios where, for most util­ity func­tions we might try to for­mal­ize or teach, the max­i­mum of the agent’s util­ity func­tion will end up ly­ing at an edge of the solu­tion space that is a ‘weird ex­treme’ from our per­spec­tive.

Definition

On many classes of prob­lems, the max­i­miz­ing solu­tion tends to lie at an ex­treme edge of the solu­tion space. This means that if we have an in­tu­itive out­come X in mind and try to ob­tain it by giv­ing an agent a solu­tion fit­ness func­tion F that sounds like it should as­sign X a high value, the max­i­mum of F may be at an ex­treme edge of the solu­tion space that looks to us like a very un­nat­u­ral in­stance of X, or not an X at all. The Edge In­stan­ti­a­tion prob­lem is a spe­cial­iza­tion of un­fore­seen max­i­miza­tion which in turn spe­cial­izes Bostrom’s per­verse in­stan­ti­a­tion class of prob­lems.

It is hy­poth­e­sized (by e.g. Yud­kowsky) that many classes of solu­tion that have been pro­posed to patch Edge In­stan­ti­a­tion would fail to re­solve the en­tire prob­lem and that fur­ther Edge In­stan­ti­a­tion prob­lems would re­main. For ex­am­ple, even if we con­sider a satis­fic­ing util­ity func­tion with only val­ues 0 and 1 where ‘typ­i­cal’ X has value 1 and no higher score is pos­si­ble, an ex­pected util­ity max­i­mizer could still end up de­ploy­ing an ex­treme strat­egy in or­der to max­i­mize the prob­a­bil­ity that a satis­fac­tory out­come is ob­tained. Con­sid­er­ing sev­eral pro­posed solu­tions like this and their failures sug­gests that Edge In­stan­ti­a­tion is a re­sis­tant (not ul­ti­mately un­solv­able, but with many at­trac­tive-seem­ing solu­tions failing to work) for the deep rea­son that many pos­si­ble stages of an agent’s cog­ni­tion would po­ten­tially rank solu­tions and choose very-high-rank­ing solu­tions.

The propo­si­tion defined is true if Edge In­stan­ti­a­tion does in fact sur­face as a prag­mat­i­cally im­por­tant prob­lem for ad­vanced agent sce­nar­ios, and would in fact re­sur­face in the face of most ‘naive’ at­tempts to cor­rect it. The propo­si­tion is not that the Edge In­stan­ti­a­tion Prob­lem is un­re­solv­able, but that it’s real, im­por­tant, doesn’t have a sim­ple an­swer, and re­sists most sim­ple at­tempts to patch it.

Ex­am­ple 1: Smil­ing faces

When Bill Hib­bard was first be­gin­ning to con­sider the value al­ign­ment prob­lem, he sug­gested giv­ing AIs the goals of mak­ing hu­mans smile, a goal that could be trained by rec­og­niz­ing pic­tures of smil­ing hu­mans, and was in­tended to elicit hu­man hap­piness. Yud­kowsky replied by sug­gest­ing that the true be­hav­ior elic­ited would be to tile the fu­ture light cone with tiny molec­u­lar smiley faces. This is not be­cause the agent was per­verse, but be­cause among the set of all ob­jects that look like smiley faces, the solu­tion with the most ex­treme value for achiev­able nu­meros­ity (that is, the strat­egy which cre­ates the largest pos­si­ble num­ber of smil­ing faces) also sets the value for the size of in­di­vi­d­ual smil­ing faces to an ex­tremely small di­ame­ter. The tiniest pos­si­ble smil­ing faces are very un­like the archety­pal ex­am­ples of smil­ing faces that we had in mind when spec­i­fy­ing the util­ity func­tion; from a hu­man per­spec­tive, the in­tu­itively in­tended mean­ing has been re­placed by a weird ex­treme.

Stu­art Rus­sell ob­serves that max­i­miz­ing some as­pects of a solu­tion tends to set all un­con­strained as­pects of the solu­tion to ex­treme val­ues. The solu­tion that max­i­mizes the num­ber of smiles min­i­mizes the size of each in­di­vi­d­ual smile. The bad-seem­ing re­sult is not just an ac­ci­den­tal out­come of mere am­bi­guity in the in­struc­tions. The prob­lem wasn’t just that a wide range of pos­si­bil­ities cor­re­sponded to ‘smiles’ and a ran­domly se­lected pos­si­bil­ity from this space sur­prised us by not be­ing the cen­tral ex­am­ple we origi­nally had in mind. Rather, there’s a sys­tem­atic ten­dency for the high­est-scor­ing solu­tion to oc­cupy an ex­treme edge of the solu­tion space, which means that we are sys­tem­at­i­cally likely to see ‘ex­treme’ or ‘weird’ solu­tions rather than the ‘nor­mal’ ex­am­ples we had in mind.

Ex­am­ple 2: Sorcerer’s Apprentice

In the hy­po­thet­i­cal Sorcerer’s Ap­pren­tice sce­nario, you in­struct an ar­tifi­cial agent to add wa­ter to a cauldron, and it floods the en­tire work­place. Hy­po­thet­i­cally, you had in mind only adding enough wa­ter to fill the cauldron and then stop­ping, but some stage of the agent’s solu­tion-find­ing pro­cess op­ti­mized on a step where ‘flood­ing the work­place’ scored higher than ‘add 4 buck­ets of wa­ter and then shut down safely’, even though both of these qual­ify as ‘filling the cauldron’.

This could be be­cause (in the most naive case) the util­ity func­tion you gave the agent was in­creas­ing in the amount of wa­ter in con­tigu­ous con­tact with the cauldron’s in­te­rior—you gave it a util­ity func­tion that im­plied 4 buck­ets of wa­ter were good and 4,000 buck­ets of wa­ter were bet­ter.

Sup­pose that, hav­ing fore­seen in ad­vance the above pos­si­ble dis­aster, you try to patch the agent by in­struct­ing it not to move more than 50 kilo­grams of ma­te­rial to­tal. The agent promptly be­gins to build sub­agents (with the agent’s own mo­tions to build sub­agents mov­ing only 50 kilo­grams of ma­te­rial) which build fur­ther agents and again flood the work­place. You have run into a Near­est Un­blocked Neigh­bor prob­lem; when you ex­cluded one ex­treme solu­tion, the re­sult was not the cen­tral-feel­ing ‘nor­mal’ ex­am­ple you origi­nally had in mind. In­stead, the new max­i­mum lay on a new ex­treme edge of the solu­tion space.

Another solu­tion might be to define what you thought was a satis­fic­ing agent, with a util­ity func­tion that as­signed 1 in all cases where there were at least 4 buck­ets of wa­ter in the cauldron and 0 oth­er­wise. The agent then calcu­lates that it could in­crease the prob­a­bil­ity of this con­di­tion ob­tain­ing from 99.9% to 99.99% by repli­cat­ing sub­agents and re­peat­edly filling the cauldron, just in case one agent malfunc­tions or some­thing else tries to re­move wa­ter from the cauldron. Since 0.9999 > 0.999, there is then a more ex­treme solu­tion with greater ex­pected util­ity, even though the util­ity func­tion it­self is bi­nary and satis­fic­ing.

Premises

As­sumes: Orthog­o­nal­ity thesis

As with most as­pects of the value load­ing prob­lem, Orthog­o­nal­ity Th­e­sis is an im­plicit premise of the Edge In­stan­ti­a­tion prob­lem; for Edge In­stan­ti­a­tion to be a prob­lem for ad­vanced agents im­plies that ‘what we re­ally meant’ or the out­comes of high­est nor­ma­tive value are not in­her­ently picked out by ev­ery pos­si­ble max­i­miz­ing pro­cess; and that most pos­si­ble util­ity func­tions do not care ‘what we re­ally meant’ un­less ex­plic­itly con­structed to have a do what I mean be­hav­ior.

As­sumes: Com­plex­ity of values

If nor­ma­tive val­ues were ex­tremely sim­ple (of very low al­gorith­mic com­plex­ity), then they could be for­mally speci­fied in full, and the most ex­treme strat­egy that scored high­est on this for­mal mea­sure sim­ply would cor­re­spond with what we re­ally wanted, with no down­sides that hadn’t been taken into ac­count in the score.

Arguments

In­ter­ac­tion with near­est un­blocked neighbor

The Edge In­stan­ti­a­tion prob­lem has the Near­est un­blocked strat­egy pat­tern. If you fore­see one spe­cific ‘per­verse’ in­stan­ti­a­tion and try to pro­hibit it, the max­i­mum over the re­main­ing solu­tion space is again likely to be at an­other ex­treme edge of the solu­tion space that again seems ‘per­verse’.

In­ter­ac­tion with cog­ni­tive un­con­tain­abil­ity of ad­vanced agents

Ad­vanced agents search larger solu­tion spaces than we do. There­fore the pro­ject of try­ing to vi­su­al­ize all the strate­gies that might fit a util­ity func­tion, to try to ver­ify in our own minds that the max­i­mum is some­where safe, seems ex­cep­tion­ally un­trust­wor­thy (not Ad­vanced safety).

In­ter­ac­tion with con­text change problem

Agents that ac­quire new strate­gic op­tions or be­come able to search a wider range of the solu­tion space may go from hav­ing only ap­par­ently ‘nor­mal’ solu­tions to ap­par­ently ‘ex­treme’ solu­tions. This is known as the con­text change prob­lem. For ex­am­ple, an agent that in­duc­tively learns hu­man smiles as a com­po­nent of its util­ity func­tion, might as a non-ad­vanced agent have ac­cess only to strate­gies that make hu­mans happy in an in­tu­itive sense (thereby pro­duc­ing the ap­par­ent ob­ser­va­tion that ev­ery­thing is go­ing fine and the agent is work­ing as in­tended), and then af­ter self-im­prove­ment, ac­quire as an ad­vanced agent the strate­gic op­tion of trans­form­ing the fu­ture light cone into tiny molec­u­lar smiley­faces.

Strong pres­sures can arise at any stage of optimization

Sup­pose you tried to build an agent that was an ex­pected util­ity satis­ficer—rather than hav­ing a 0-1 util­ity func­tion and thus chas­ing prob­a­bil­ities of goal satis­fac­tion ever closer to 1, the agent searches for strate­gies that have at least 0.999 ex­pected util­ity. Why doesn’t this re­solve the prob­lem?

A bounded satis­ficer doesn’t rule out the solu­tion of filling the room with wa­ter, since this solu­tion also has >0.999 ex­pected util­ity. It only re­quires the agent to carry out one cog­ni­tive al­gorithm which has at least one max­i­miz­ing or highly op­ti­miz­ing stage, in or­der for ‘fill the room with wa­ter’ to be preferred to ‘add 4 buck­ets and shut down safely’ on that stage (while be­ing equally ac­cept­able at fu­ture satis­fic­ing stages). E.g., maybe you build an ex­pected util­ity satis­ficer and still end up with an ex­treme re­sult be­cause one of the cog­ni­tive al­gorithms sug­gest­ing solu­tions was try­ing to min­i­mize its own disk space us­age.

On a meta-level, we may run into prob­lems of for re­flec­tive agents. Maybe one sim­ple way of ob­tain­ing at least 0.999 ex­pected util­ity is to cre­ate a sub­agent that max­i­mizes ex­pected util­ity? It seems in­tu­itively clear why bounded max­i­miz­ers would build bound­edly max­i­miz­ing offspring, but a bounded satis­ficer doesn’t need to build bound­edly satis­fic­ing offspring—a bounded max­i­mizer might also be ‘good enough’. (In the cur­rent the­ory of TilingA­gents, we can prove that an ex­pected util­ity satis­ficer can tile to an ex­pected util­ity satis­ficer with some sur­pris­ing caveats, but the prob­lem is that it can tile to other things be­sides an ex­pected util­ity satis­ficer.)

Since it seems very easy for at least one stage of a self-mod­ify­ing agent to end up prefer­ring solu­tions that have higher scores rel­a­tive to some scor­ing rule, the edge in­stan­ti­a­tion prob­lem can be ex­pected to re­sist naive at­tempts to de­scribe an agent that seems to have an over­all be­hav­ior of ‘not try­ing quite so hard’. It’s also not clear how to make the in­struc­tion ‘don’t try so hard’ be Reflec­tive­lyCon­sis­tent, or ap­ply to ev­ery part of a con­sid­ered sub­agent. This is also why limited op­ti­miza­tion is an open prob­lem.

Disprefer­ring solu­tions with ‘ex­treme im­pacts’ in gen­eral is the open prob­lem of low im­pact AI. Cur­rently, no for­mal­iz­able util­ity func­tion is known that plau­si­bly has the right in­tu­itive mean­ing for this. (We’re work­ing on it.) Also note that not ev­ery ex­treme ‘tech­ni­cally an X’ that we think is ‘not re­ally an X’ has an ex­treme causal im­pact in an in­tu­itive sense, so not ev­ery case of the Edge In­stan­ti­a­tion prob­lem is blocked by dis­prefer­ring greater im­pacts.

Implications

One of limited op­ti­miza­tion, low Im­pact, or full cov­er­age value load­ing seems crit­i­cal for real-world agents in­sert prob­a­bil­ity bar

As Stu­art Rus­sell ob­serves, solv­ing an op­ti­miza­tion prob­lem where only some val­ues are con­strained or max­i­mized, will tend to set un­con­strained vari­ables to ex­treme val­ues. The uni­verse con­tain­ing the max­i­mum pos­si­ble num­ber of pa­per­clips con­tains no hu­mans; op­ti­miz­ing for as much hu­man safety as pos­si­ble will drive hu­man free­dom to zero.

Then we must ap­par­ently do at least one of the fol­low­ing:

  1. Build full cov­er­age ad­vanced agents whose util­ity func­tions lead them to ter­mi­nally dis­pre­fer stomp­ing on ev­ery as­pect of value that we care about (or would care about un­der re­flec­tion). In a full cov­er­age agent there are no un­con­strained vari­ables that we care about to be set to ex­treme val­ues that we would dis­like; the AI’s goal sys­tem knows and cares about all of these. It will not set hu­man free­dom to an ex­tremely low value in the course of fol­low­ing an in­struc­tion to op­ti­mize hu­man safety, be­cause it knows about hu­man free­dom and liter­ally ev­ery­thing else.

  2. Build pow­er­ful agents that are limited op­ti­miz­ers which pre­dictably in­vent only solu­tions we in­tu­itively con­sider ‘non-ex­treme’, whose op­ti­miza­tions are such as to not drive to an ex­treme on any sub­stage. This leaves us with just am­bi­guity as a (se­vere) prob­lem, but at least averts a sys­tem­atic drive to­ward ex­tremes that will sys­tem­at­i­cally ‘ex­ploit’ that am­bi­guity.

  3. Build pow­er­ful agents that are low im­pact and pre­fer to avoid solu­tions that pro­duce greater im­pacts on any­thing we in­tu­itively see as an im­por­tant pred­i­cate, in­clud­ing both ev­ery­thing we value and a great many more things we don’t par­tic­u­larly value.

  4. Find some other es­cape route from the value achieve­ment prob­lem.

In­suffi­ciently cau­tious at­tempts to build ad­vanced agents are likely to be highly de­struc­tive in­sert prob­a­bil­ity bar

Edge In­stan­ti­a­tion is one of the con­tribut­ing rea­sons why value load­ing is hard and naive solu­tions end up do­ing the equiv­a­lent of tiling the fu­ture light cone with pa­per­clips.

We’ve pre­vi­ously ob­served cer­tain par­ties propos­ing util­ity func­tions for ad­vanced agents that seem ob­vi­ously sub­ject to the Edge In­stan­ti­a­tion prob­lem. Con­fronted with the ob­vi­ous dis­aster fore­cast, they pro­pose patch­ing the util­ity func­tion to elimi­nate that par­tic­u­lar sce­nario (or rather, say that of course they would have writ­ten the util­ity func­tion to ex­clude that sce­nario) or claim that the agent will not ‘mis­in­ter­pret’ the in­struc­tions so egre­giously (deny­ing the Orthog­o­nal­ity Th­e­sis at least to the ex­tent of propos­ing a uni­ver­sal prefer­ence for in­ter­pret­ing in­struc­tions ‘as in­tended’). Mis­takes of this type also be­long to a class that po­ten­tially wouldn’t show up dur­ing early stages of the AI, or would show up in an ini­tially non­catas­trophic way that seemed eas­ily patched, so peo­ple ad­vo­cat­ing an em­piri­cal first method­ol­ogy would falsely be­lieve that they had learned to han­dle them or elimi­nated all such ten­den­cies already.

Thus the prob­lem of Edge In­stan­ti­a­tion (which is much less se­vere for non­ad­vanced agents than ad­vanced agents, will not be solved in the ad­vanced stage by patches that seem to fix weak early prob­lems, and has em­piri­cally ap­peared in pro­pos­als by mul­ti­ple speak­ers who re­jected at­tempts to point out the Edge In­stan­ti­a­tion prob­lem) is a sig­nifi­cant con­tribut­ing fac­tor to the over­all ex­pec­ta­tion that the de­fault out­come of de­vel­op­ing ad­vanced agents with cur­rent at­ti­tudes is dis­as­trous.

Rel­a­tive to cur­rent at­ti­tudes, small in­creases in safety aware­ness do not pro­duce sig­nifi­cantly less de­struc­tive fi­nal out­comes in­sert prob­a­bil­ity bar

Sim­ple patches to Edge In­stan­ti­a­tion fail and the only cur­rently known ap­proaches would take a lot of work to solve prob­lems like limited op­ti­miza­tion or full cov­er­age that are hard for deep rea­sons. In other words, Edge In­stan­ti­a­tion does not ap­pear to be the sort of prob­lem that an AI pro­ject can eas­ily avoid just by be­ing made aware of it. (E.g. MIRI knows about it but hasn’t yet come up with any solu­tion, let alone one eas­ily patched on to any cog­ni­tive ar­chi­tec­ture.)

This is one of the fac­tors con­tribut­ing to the gen­eral as­sess­ment that the curve of out­come good­ness as a func­tion of effort is flat for a sig­nifi­cant dis­tance around cur­rent lev­els of effort.