Unforeseen maximum

An unforeseen maximum of a utility function (or other preference framework) occurs when, e.g., you tell the AI to produce smiles, thinking that the AI will make people happy in order to produce smiles. But, unforeseen by you, the AI has an alternative for making even more smiles, which is to convert all matter within reach into tiny molecular smileyfaces.

In other words, you’re propos­ing to give the AI a goal \(U\), be­cause you think \(U\) has a max­i­mum around some nice op­tions \(X.\) But it turns out there’s an­other op­tion \(X'\) you didn’t imag­ine, with \(X' >_U X,\) and \(X'\) is not so nice.

Unforeseen maxima are argued to be a foreseeable difficulty of AGI alignment, if you try to identify nice policies by giving a simple criterion \(U\) that, so far as you can see, seems like it’d be best optimized by doing nice things.

Slightly more semifor­mally, we could say that “un­fore­seen max­i­mum” is re­al­ized as a difficulty when:

  1. A pro­gram­mer think­ing about a util­ity func­tion \(U\) con­sid­ers policy op­tions \(\pi_i \in \Pi_N\) and con­cludes that of these op­tions the policy with high­est \(\mathbb E [ U | \pi_i ]\) is \(\pi_1,\) and hence a \(U\)-max­i­mizer will prob­a­bly do \(\pi_1.\)

  2. The programmer also thinks that their own criterion of goodness \(V\) will be promoted by \(\pi_1,\) that is, \(\mathbb E [ V | \pi_1 ] > \mathbb E [ V ]\) or “\(\pi_1\) is beneficial”. So the programmer concludes that it’s a great idea to build an AI that optimizes for \(U.\)

  3. Alas, the AI is searching a policy space \(\Pi_M,\) which although it does contain \(\pi_1\) as an option, also contains an attainable option \(\pi_0\) which the programmer didn’t consider, with \(\mathbb E [ U | \pi_0 ] > \mathbb E [ U | \pi_1 ].\) This is a problem if \(\pi_0\) produces much less \(V\)-benefit than \(\pi_1\) or is outright detrimental.

That is:

$$\underset{\pi_i \in \Pi_N}{\operatorname {argmax}} \ \mathbb E [ U | \pi_i ] = \pi_1$$

$$\underset{\pi_k \in \Pi_M}{\operatorname {argmax}} \ \mathbb E [ U | \pi_k ] = \pi_0$$

$$\mathbb E [ V | \pi_0 ] \ll \mathbb E [ V | \pi_1 ]$$
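The three conditions above can be made concrete with a toy sketch. Everything in it is an assumption made for illustration: the policy names, the numeric values of expected \(U\) and \(V\), and the identifiers `EXPECTED`, `PI_N`, `PI_M`, and `argmax_u` are all invented, and a real policy space could not be enumerated like this.

```python
# Toy model: every policy name and number here is invented for illustration.
# Each entry maps a policy to (expected U, expected V).
EXPECTED = {
    "make_people_happy": (10.0, 9.0),       # pi_1, the option the programmer has in mind
    "tell_jokes": (6.0, 5.0),
    "do_nothing": (0.0, 0.0),
    "tile_with_smileyfaces": (1e6, -1e6),   # pi_0, never considered by the programmer
}


def argmax_u(options):
    """Return the option with the highest expected U."""
    return max(options, key=lambda name: EXPECTED[name][0])


PI_N = ["make_people_happy", "tell_jokes", "do_nothing"]  # what the programmer considers
PI_M = list(EXPECTED)                                     # what the AI actually searches

pi_1 = argmax_u(PI_N)
pi_0 = argmax_u(PI_M)

print("argmax over Pi_N:", pi_1)
print("argmax over Pi_M:", pi_0)
print("E[V | pi_0] < E[V | pi_1]:", EXPECTED[pi_0][1] < EXPECTED[pi_1][1])
```

The same `argmax_u` is applied to both sets. The only difference between the programmer's prediction and the AI's behavior in this sketch is which set of options gets searched.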

Ex­am­ple: Sch­mid­hu­ber’s com­pres­sion goal.

Juer­gen Sch­mid­hu­ber of IDSIA, dur­ing the 2009 Sin­gu­lar­ity Sum­mit, gave a talk propos­ing that the best and most moral util­ity func­tion for an AI was the gain in com­pres­sion of sen­sory data over time. Sch­mid­hu­ber gave ex­am­ples of valuable be­hav­iors he thought this would mo­ti­vate, like do­ing sci­ence and un­der­stand­ing the uni­verse, or the con­struc­tion of art and highly aes­thetic ob­jects.

Yud­kowsky in Q&A sug­gested that this util­ity func­tion would in­stead mo­ti­vate the con­struc­tion of ex­ter­nal ob­jects that would in­ter­nally gen­er­ate ran­dom cryp­to­graphic se­crets, en­crypt highly reg­u­lar streams of 1s and 0s, and then re­veal the cryp­to­graphic se­crets to the AI.

Trans­lat­ing into the above schema:

  1. Sch­mid­hu­ber, con­sid­er­ing the util­ity func­tion \(U\) of “max­i­mize gain in sen­sory com­pres­sion”, thought that op­tion \(\pi_1\) of “do art and sci­ence” would be the at­tain­able max­i­mum of \(U\) within all op­tions \(\Pi_N\) that Sch­mid­hu­ber con­sid­ered.

  2. Sch­mid­hu­ber also con­sid­ered the op­tion \(\pi_1\) “do art and sci­ence” to achieve most of the at­tain­able value un­der his own crite­rion of good­ness \(V\).

  3. How­ever, while the AI’s op­tion space \(\Pi_M\) would in­deed in­clude \(\pi_1\) as an op­tion, it would also in­clude the op­tion \(\pi_0\) of “have an en­vi­ron­men­tal ob­ject en­crypt streams of 1s or 0s and then re­veal the key” which would score much higher un­der \(U\), and much lower un­der \(V.\)
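A rough sketch of why \(\pi_0\) could score so well under a compression-gain measure is given below. It is not Schmidhuber's actual formalism: using `zlib` output size as the measure of "compression", building a toy stream cipher from SHA-256 in counter mode, and the names `keystream`, `xor_bytes`, and `compression_gain` are all assumptions made for illustration. The cipher is not meant as real cryptography.

```python
import hashlib
import os
import zlib


def keystream(key: bytes, length: int) -> bytes:
    """Pseudo-random bytes derived from `key` (SHA-256 in counter mode; toy only)."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])


def xor_bytes(data: bytes, pad: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(a ^ b for a, b in zip(data, pad))


regular_stream = b"01" * 5000   # highly regular "sensory data"
key = os.urandom(16)            # the secret generated by the external object
ciphertext = xor_bytes(regular_stream, keystream(key, len(regular_stream)))

# Before the key is revealed, the observations look like noise to the compressor.
size_before = len(zlib.compress(ciphertext, 9))

# After the key is revealed, the same observations can be stored as
# "key + compressed plaintext", which is far shorter.
decrypted = xor_bytes(ciphertext, keystream(key, len(ciphertext)))
size_after = len(key) + len(zlib.compress(decrypted, 9))

compression_gain = size_before - size_after
print("bytes before key:", size_before)
print("bytes after key: ", size_after)
print("compression gain:", compression_gain)
```

In this toy setting almost the entire observation stream becomes compressible at the moment the key arrives, and the procedure can be repeated with fresh keys. That is the sense in which such a policy could outscore "do art and science" on \(U\) while contributing nothing to \(V\).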

Re­la­tion to other fore­see­able difficulties

Con­text dis­aster im­plies an un­fore­seen max­i­mum may come as a sur­prise, or not show up dur­ing the de­vel­op­ment phase, be­cause dur­ing the de­vel­op­ment phase the AI’s op­tions are re­stricted to some \(\Pi_L \subset \Pi_M\) with \(\pi_0 \not\in \Pi_L.\)

Indeed, the pseudo-formalization of a “type-1 context disaster” is isomorphic to the pseudo-formalization of “unforeseen maximum”, except that in a context disaster, \(\Pi_N\) and \(\Pi_M\) are identified with “AI’s options during development” and “AI’s options after a capability gain”, instead of “options the programmer is thinking of” and “options the AI will consider”.
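The development-phase restriction can be sketched as a toy model in which each policy only becomes available once the agent reaches some capability level. The `min_capability` field, the capability levels, the policy names, and all numbers are hypothetical, introduced only to show how the \(U\)-argmax can look nice over \(\Pi_L\) and then change once \(\pi_0\) enters the option set.

```python
from typing import NamedTuple


class Policy(NamedTuple):
    name: str
    expected_u: float
    expected_v: float
    min_capability: int  # hypothetical capability level needed to carry out the policy


ALL_POLICIES = [
    Policy("do_nothing", 0.0, 0.0, 0),
    Policy("tell_jokes", 6.0, 5.0, 1),
    Policy("make_people_happy", 10.0, 9.0, 1),
    Policy("tile_with_smileyfaces", 1e6, -1e6, 3),  # pi_0, unreachable during development
]


def available(capability):
    """Policies the agent can actually execute at this capability level."""
    return [p for p in ALL_POLICIES if p.min_capability <= capability]


def chosen(capability):
    """The U-maximizing policy among those currently available."""
    return max(available(capability), key=lambda p: p.expected_u)


for capability in range(4):
    policy = chosen(capability)
    print(capability, policy.name, policy.expected_v)
```

At capability levels 1 and 2 the observed behavior is the nice policy, so testing during development gives no warning. The switch happens only at the level where \(\pi_0\) becomes attainable.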

The two concepts are nonetheless distinct because, e.g.:

  • A con­text dis­aster could also ap­ply to a de­ci­sion crite­rion learned by train­ing, not just a util­ity func­tion en­vi­sioned by the pro­gram­mer.

  • It’s an un­fore­seen max­i­mum but not a con­text dis­aster if the pro­gram­mer is ini­tially rea­son­ing, not that the AI has already been ob­served to be benefi­cial dur­ing a de­vel­op­ment phase, but rather that the AI ought to be benefi­cial when it op­ti­mizes \(U\) later be­cause of the sup­posed nice max­i­mum at \(\pi_1\).

If we hadn’t ob­served what seem like clear-cut cases of some ac­tors in the field be­ing blind­sided by un­fore­seen max­ima in imag­i­na­tion, we’d worry less about ac­tors be­ing blind­sided by con­text dis­asters over ob­ser­va­tions.

Edge in­stan­ti­a­tion sug­gests that the real max­ima of non-\(V\) util­ity func­tions will be “strange, weird, and ex­treme” rel­a­tive to our own \(V\)-views on prefer­able op­tions.

Miss­ing the weird al­ter­na­tive sug­gests that peo­ple may psy­cholog­i­cally fail to con­sider al­ter­na­tive agent op­tions \(\pi_0\) that are very low in \(V,\) be­cause the hu­man search func­tion looks for high-\(V\) and nor­mal poli­cies. In other words, that Sch­mid­hu­ber didn’t gen­er­ate “en­crypt streams of 1s or 0s and then re­veal the key” be­cause this policy was less at­trac­tive to him than “do art and sci­ence” and be­cause it was weird.

Near­est un­blocked strat­egy sug­gests that if you try to add a penalty term to ex­clude \(\pi_0\), the next-high­est \(U\)-rank­ing op­tion will of­ten be some similar al­ter­na­tive \(\pi_{0.01}\) which still isn’t nice.
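As a hedged illustration of the penalty-term scenario, the sketch below repeatedly blocks whichever policy the maximizer picked and then reruns the search. The policy names, the scores, the size of `PENALTY`, and the helper `best_policy` are all invented for this example.

```python
# Toy model: names and numbers are invented. Each policy maps to (expected U, expected V).
EXPECTED = {
    "make_people_happy": (10.0, 9.0),
    "molecular_smileyfaces": (1000.0, -1000.0),        # pi_0
    "slightly_larger_smileyfaces": (990.0, -1000.0),   # pi_0.01
    "paralyze_faces_into_smiles": (500.0, -800.0),
}

PENALTY = 1e9  # subtracted from U for every explicitly blocked policy


def best_policy(blocked):
    """Argmax of U minus the penalty term, over the whole policy space."""
    def penalized_u(name):
        return EXPECTED[name][0] - (PENALTY if name in blocked else 0.0)
    return max(EXPECTED, key=penalized_u)


blocked = set()
history = []
for _ in range(3):
    choice = best_policy(blocked)
    history.append(choice)
    print("blocked:", sorted(blocked), "-> chosen:", choice)
    blocked.add(choice)  # the programmer notices the bad outcome and patches it
```

Each patch removes one bad policy, and the next search lands on the closest remaining high-\(U\) alternative, which is still low in \(V\). In this four-element toy space the patches eventually run out of bad options; the concern in the text is that a realistic policy space does not.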

Fragile value asserts that our true criterion of goodness \(V\) is narrowly peaked within the space of all achievable outcomes for a superintelligence, such that we rapidly fall off in \(V\) as we move away from the peak. Complexity of value says that \(V\) and its corresponding peak have high algorithmic complexity. Then the peak outcomes identified by any simple object-level \(U\) will systematically fail to find \(V\). It’s like trying to find a 1000-byte program which will approximately reproduce the text of Shakespeare’s Hamlet; algorithmic information theory says that you just shouldn’t expect to find a simple program like that.

The apple pie problem raises the concern that some people may have psychological trouble accepting the “But \(\pi_0\)” critique even after it is pointed out, because of their ideological attachment to a noble goal \(U\) (probably actually noble!) that would be even more praiseworthy if \(U\) could also serve as a complete utility function for an AGI (which it unfortunately can’t).

Im­pli­ca­tions and re­search avenues

Conservatism in goal concepts can be seen as trying to directly tackle the problem of unforeseen maxima. More generally, so can AI approaches that work by “whitelisting conservative boundaries around approved policy spaces” instead of “searching the widest possible policy space, minus some blacklisted parts”.
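The difference between the two search styles can be sketched in a few lines. This is a toy under stated assumptions: the policy names, the scores in `EXPECTED_U`, and the functions `blacklist_search` and `whitelist_search` are hypothetical, and real proposals for conservatism do not reduce to filtering an enumerated list.

```python
# Toy model: names and numbers are invented for illustration.
EXPECTED_U = {
    "make_people_happy": 10.0,
    "tell_jokes": 6.0,
    "molecular_smileyfaces": 1000.0,       # known bad, so it gets blacklisted
    "paralyze_faces_into_smiles": 500.0,   # bad, but nobody thought to blacklist it
}


def blacklist_search(blacklist):
    """Search everything except the explicitly forbidden options."""
    candidates = [p for p in EXPECTED_U if p not in blacklist]
    return max(candidates, key=EXPECTED_U.get)


def whitelist_search(whitelist):
    """Search only the explicitly approved options."""
    candidates = [p for p in EXPECTED_U if p in whitelist]
    return max(candidates, key=EXPECTED_U.get)


blacklist_choice = blacklist_search({"molecular_smileyfaces"})
whitelist_choice = whitelist_search({"make_people_happy", "tell_jokes"})

print("blacklist search picks:", blacklist_choice)
print("whitelist search picks:", whitelist_choice)
```

The blacklist fails open: any option the programmer did not foresee stays searchable. The whitelist fails closed: an unforeseen option is simply not a candidate, at the cost of also excluding good options nobody approved.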

The Task paradigm for advanced agents concentrates on trying to accomplish some single pivotal act which can be accomplished by one or more tasks of limited scope. Combined with other measures, this might make it easier to identify an adequate safe plan for accomplishing the limited-scope task, rather than needing to identify the fragile peak of \(V\) within some much larger landscape. The Task AGI formulation is claimed to let us partially “narrow down” the scope of the necessary \(U\), the part of \(V\) that’s relevant to the task, and the searched policy space \(\Pi\) to what is merely adequate. This might reduce or ameliorate, though not by itself eliminate, unforeseen maxima.

Mild op­ti­miza­tion can be seen as “not try­ing so hard, not shov­ing all the way to the max­i­mum”—the hope is that when com­bined with a Task paradigm plus other mea­sures like con­ser­va­tive goals and strate­gies, this will pro­duce less op­ti­miza­tion pres­sure to­ward weird edges and un­fore­seen max­ima. (This method is not ad­e­quate on its own be­cause an ar­bi­trary ad­e­quate-\(U\) policy may still not be high-\(V\), ce­teris paribus.)
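One possible reading of “not shoving all the way to the maximum” is a satisficer that accepts any policy whose expected \(U\) clears a threshold. That reading is an assumption, not a definition taken from this page, and the policy names, numbers, threshold, and the functions `maximizer` and `satisficer` are invented. The sketch also shows the caveat from the parenthetical above: the set of adequate-\(U\) policies still contains low-\(V\) members.

```python
import random

# Toy model: names and numbers are invented. Each policy maps to (expected U, expected V).
EXPECTED = {
    "make_people_happy": (10.0, 9.0),
    "tell_jokes": (6.0, 5.0),
    "draw_smiles_on_photos": (8.0, 0.0),
    "paralyze_faces_into_smiles": (500.0, -800.0),
    "molecular_smileyfaces": (1e6, -1e6),
    "do_nothing": (0.0, 0.0),
}

THRESHOLD = 5.0  # hypothetical "good enough" level of expected U


def maximizer():
    """Always pick the policy with the highest expected U."""
    return max(EXPECTED, key=lambda name: EXPECTED[name][0])


def satisficer(threshold, rng):
    """Pick uniformly at random among policies whose expected U meets the threshold."""
    candidates = sorted(n for n in EXPECTED if EXPECTED[n][0] >= threshold)
    return rng.choice(candidates)


adequate = sorted(n for n in EXPECTED if EXPECTED[n][0] >= THRESHOLD)
low_v = [n for n in adequate if EXPECTED[n][1] <= 0.0]

print("maximizer picks:", maximizer())
print("satisficer picks:", satisficer(THRESHOLD, random.Random(0)))
print("adequate policies:", adequate)
print("adequate but not beneficial:", low_v)
```

The satisficer no longer heads for the extreme option every time, but three of the five adequate policies in this toy set are not beneficial. That is why the text treats mild optimization as one ingredient to combine with others, not as a standalone fix.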

Imi­ta­tion-based agents try to max­i­mize similar­ity to a refer­ence hu­man’s im­me­di­ate be­hav­ior, rather than try­ing to op­ti­mize a util­ity func­tion.

The prospect of being tripped up by unforeseen maxima is one of the contributing motivations for giving up on hand-coded object-level utilities in favor of meta-level preference frameworks that learn a utility function or decision rule. (Again, this doesn’t seem like a full solution by itself, only one ingredient to be combined with other methods. If the utility function is a big complicated learned object, that by itself is not a good reason to relax about the possibility that its maximum will be somewhere you didn’t foresee, especially after a capabilities boost.)

Miss­ing the weird al­ter­na­tive and the ap­ple pie prob­lem sug­gest that it may be un­usu­ally difficult to ex­plain to ac­tors why \(\pi_0 >_U \pi_1\) is a difficulty of their fa­vored util­ity func­tion \(U\) that allegedly im­plies nice policy \(\pi_1.\) That is, for psy­cholog­i­cal rea­sons, this difficulty seems un­usu­ally likely to ac­tu­ally trip up spon­sors of AI pro­jects or poli­ti­cally block progress on al­ign­ment.


  • Missing the weird alternative

    Peo­ple might sys­tem­at­i­cally over­look “make tiny molec­u­lar smiley­faces” as a way of “pro­duc­ing smiles”, be­cause our brains au­to­mat­i­cally search for high-util­ity-to-us ways of “pro­duc­ing smiles”.