Missing the weird alternative

The “Unforeseen maximum” problem is alleged to be a foreseeable difficulty of coming up with a good goal for an AGI (part of the alignment problem for advanced agents). Roughly, an “unforeseen maximum” happens when somebody thinks that “produce smiles” would be a great goal for an AGI, because you can produce lots of smiles by making people happy, and making people happy is good. However, while it’s true that making people happy by ordinary means will produce some smiles, what will produce even more smiles is administering regular doses of heroin or turning all matter within reach into tiny molecular smileyfaces.

“Missing the weird alternative” is an attempt to psychologize about why people talking about AGI utility functions might make this kind of oversight systematically. To avoid Bulverism, if you’re not yet convinced that missing a weird alternative would be a dangerous oversight, please read Unforeseen maximum first or instead.

In what follows we’ll use \(U\) to denote a proposed utility function for an AGI, \(V\) to denote our own normative values, \(\pi_1\) to denote the high-\(V\) policy that somebody thinks is the attainable maximum of \(U,\) and \(\pi_0\) to denote what somebody else suggests is a higher-\(U\), lower-\(V\) alternative.
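The notation can be made concrete with a toy sketch. All policy names and numbers below are invented for illustration; the structural point is that a \(U\)-maximizer’s selection criterion never consults \(V\):

```python
# Toy sketch of the notation above; every policy and number is made up.
# U is the proposed utility ("count smiles"); V is our normative values.
policies = {
    "make people happy":                 {"U": 100,   "V": 100},     # pi_1
    "administer regular heroin doses":   {"U": 10**4, "V": -10**3},
    "tile matter with tiny smileyfaces": {"U": 10**9, "V": -10**6},  # pi_0
}

# A U-maximizer picks the policy with the highest U; V never enters its criterion.
chosen = max(policies, key=lambda p: policies[p]["U"])
print(chosen)  # -> tile matter with tiny smileyfaces  (pi_0, not pi_1)
```

So long as some weird \(\pi_0\) scores higher on \(U\) than the intended \(\pi_1\), a pure \(U\)-maximizer takes \(\pi_0\), however low its \(V\).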

Alleged historical cases

Some historical instances of AGI goal systems, proposed in a publication or conference presentation, that have been argued to be “missing the weird alternative” are:

  • “Just program AIs to maximize their gains in compression of sensory data.” Proposed by Juergen Schmidhuber, director of IDSIA, in a presentation at the 2009 Singularity Summit; see the entry on Unforeseen maximum.

  • Claimed by Schmidhuber to motivate art and science.

  • Yudkowsky suggested that this would, e.g., motivate the AI to construct objects that encrypted streams of 1s or 0s, then revealed the encryption key to the AI.

  • Program an AI by showing it pictures/video of smiling faces to train (via supervised learning) which sensory events indicate good outcomes. Formally proposed twice, once by J. Storrs Hall in the book Beyond AI, and once in an ACM paper by somebody who has since exercised their sovereign right to change their mind.

  • Claimed to motivate an AI to make people happy.

  • Suggested by Yudkowsky to motivate tiling the universe with tiny molecular smileyfaces.

Many other instances of this alleged issue have allegedly been spotted in more informal discussions.

Psychologized reasons to miss a weird alternative

Psychologizing some possible reasons why some people might systematically “miss the weird alternative”, assuming that was actually happening:

Our brain doesn’t bother searching V-bad parts of policy space

Arguendo: The human brain is built to implicitly search for high-\(V\) ways to accomplish a goal. Or not actually high-\(V\), but high-\(W\), where \(W\) is what we intuitively want, which has something to do with \(V.\) “Tile the universe with tiny smileyfaces” is low-\(W\), so it doesn’t get considered.

Arguendo, your brain is built to search for policies it prefers. If you were looking for a way to open a stuck jar, your brain wouldn’t generate the option of detonating a stick of dynamite, because that policy would rank very low in your preference ordering. So what’s the point of searching that part of the policy space?

This argument seems to prove too much: it suggests that a chess player would be unable to search for their opponent’s most preferred moves, if human brains could only search for policies ranked high inside their own preference ordering. But there could be an explicit perspective-taking operation required, and somebody modeling an AI they had warm feelings about might fail to fully take the AI’s perspective; that is, they fail to carry out an explicit cognitive step needed to switch off the “only \(W\)-good policies” filter.

We might also have a limited native ability to take perspectives on goals not our own. I.e., without further training, our brain can readily imagine that a chess opponent wants us to lose, or imagine that an AI wants to kill us because it hates us, and consider “reasonable” policy options along those lines. But this expanded policy search still fails to consider policies along the lines of “turn everything into tiny smileyfaces” when asking for ways to produce smiles, because nobody in the ancestral environment would have wanted that option, and so our brain has a hard time natively modeling it.

Our brain doesn’t automatically search weird parts of policy space

Arguendo: The human brain doesn’t search “weird” (generalization-violating) parts of the policy space without an explicit effort.

The potential issue here is that “tile the galaxy with tiny smileyfaces” or “build environmental objects that encrypt streams of 1s or 0s, then reveal secrets” would be weird in the sense of violating generalizations that usually hold about policies or consequences in human experience. Not generalizations like “nobody wants smiles smaller than an inch”, but rather “most problems are not solved with tiny molecular things”.

Edge instantiation would tend to push the maximum (attainable optimum) of \(U\) in “weird” or “extreme” directions: e.g., the most smiles can be obtained by making them very small, if this variable is not otherwise constrained. So the unforeseen maxima might tend to violate implicit generalizations that usually govern most goals or policies and that our brains take for granted. In other words, the unforeseen maximum isn’t considered or generated by the policy search, because it’s weird.
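A minimal sketch of this edge-instantiation pressure, under the invented assumption that smile count is simply a fixed stock of matter divided by smile size:

```python
# Toy model (invented numbers): U counts smiles made from a fixed stock of
# matter, and nothing in U constrains how small a smile may be.
MATTER = 1.0  # total matter available, arbitrary units

def num_smiles(size):
    return MATTER / size  # smaller smiles -> more smiles

# The optimizer's search over candidate sizes lands on the extreme edge:
candidate_sizes = [1.0, 0.1, 0.001, 1e-9]
best = max(candidate_sizes, key=num_smiles)
print(best)  # -> 1e-09: the unconstrained variable gets pushed to its limit
```

Whatever the smallest size the search considers, that is the size chosen; the optimum sits on the boundary of the search space rather than anywhere “reasonable” inside it.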

Conflating the helpful with the optimal

Arguendo: Someone might simply get as far as “\(\pi_1\) increases \(U\)” and then stop there, concluding that a \(U\)-agent does \(\pi_1.\)

That is, they might not realize that the argument “an advanced agent optimizing \(U\) will execute policy \(\pi_1\)” requires “\(\pi_1\) is the best way to optimize \(U\)” and not just “ceteris paribus, doing \(\pi_1\) is better for \(U\) than doing nothing”. So they don’t realize that establishing “a \(U\)-agent does \(\pi_1\)” requires establishing that no other \(\pi_k\) produces higher expected \(U,\) and so they never search for a \(\pi_k\) like that.
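The missing step can be made explicit in a sketch (hypothetical helper names, made-up numbers): the invalid inference only compares \(\pi_1\) to a do-nothing baseline, while the valid one must quantify over every \(\pi_k\):

```python
def naive_inference(U, pi_1, baseline):
    # Mistaken step: "pi_1 beats doing nothing, so a U-agent does pi_1."
    return U[pi_1] > U[baseline]

def valid_inference(U, pi_1):
    # Required step: pi_1 must beat *every* alternative pi_k, not just the baseline.
    return all(U[pi_1] >= U[pi_k] for pi_k in U)

# Invented utilities over three policies:
U = {"do nothing": 0, "make people happy": 100, "tile with smileyfaces": 10**9}
print(naive_inference(U, "make people happy", "do nothing"))  # -> True
print(valid_inference(U, "make people happy"))                # -> False
```

The naive check passes and the valid check fails on the same data, which is exactly the gap between “\(\pi_1\) is helpful for \(U\)” and “a \(U\)-agent does \(\pi_1\)”.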

They might also be implicitly modeling \(U\)-agents as only weakly optimizing \(U,\) and hence not seeing a \(U\)-agent as facing tradeoffs or opportunity costs; that is, they implicitly model a \(U\)-agent as having no desire to produce any more \(U\) than \(\pi_1\) produces. Again psychologizing, it does sometimes seem like people try to mentally model a \(U\)-agent as “an agent that sorta wants to produce some \(U\) as a hobby, so long as nothing more important comes along” rather than “an agent whose action-selection criterion entirely consists of doing whatever action is expected to lead to the highest \(U\)”.

This would fit the alleged observation that people “overlooking the weird alternative” seem more as if they failed to search at all than as if they conducted a search but couldn’t think of anything.

Political persuasion instincts on convenient instrumental strategies

If the above hypothetical were true, i.e., if people just hadn’t thought of the possibility of a higher-\(U\) \(\pi_k\) existing, then we’d expect them to change their minds quickly once this was pointed out. Actually, it’s been empirically observed that there seems to be a lot more resistance than this.

One possible force that could produce resistance to the observation “\(\pi_0\) produces more \(U\)”, over and above the null hypothesis of ordinary pushback in argument (admittedly sometimes a very powerful force on its own), might be a brain running in a mode of “persuade another agent to execute a strategy \(\pi\) which is convenient to me, by arguing to the agent that \(\pi\) best serves the agent’s own goals”. E.g., someone who wants to persuade their boss to give them a raise would be wise to argue “you should give me a raise because it will make this project more efficient” rather than “you should give me a raise because I like money”. By the general schema of the political brain, we’d be very likely to have built-in support for searching for arguments that a policy \(\pi\) we just happen to like is a great way to achieve somebody else’s goal \(U.\)

Then, on the same schema, a competing policy \(\pi_0\) which is better at achieving the other agent’s \(U,\) but less convenient for us than \(\pi_1,\) is an “enemy soldier” in the political debate. We’ll automatically search for reasons why \(\pi_0\) is actually really bad for \(U\) and \(\pi_1\) is actually really good, and feel an instinctive dislike of \(\pi_0.\) By the standard schema on the self-deceptive brain, we’d probably convince ourselves that \(\pi_0\) is really bad for \(U\) and \(\pi_1\) is really best for \(U.\) It would not be advantageous to our persuasion to go around noticing all the reasons that \(\pi_0\) is good for \(U.\) And we definitely wouldn’t start spontaneously searching for \(\pi_k\) that are \(U\)-better than \(\pi_1,\) once we’d already found some \(\pi_1\) that was very convenient to us.

(For a general post on the “fear of third alternatives”, see here. That essay also suggests a good test for whether you might be suffering from “fear of third alternatives”: ask yourself whether you instinctively dislike, or automatically feel skeptical of, any proposed other options for achieving the stated criterion.)

The apple pie problem

Sometimes people propose that the only utility function an AGI needs is \(U,\) where \(U\) is something very good, like democracy or freedom or apple pie.

In this case, perhaps it sounds like a good thing to say about \(U\) that it is the only utility function an AGI needs; refusing to agree then amounts to not praising \(U\) as highly as possible, and hence sounds like an enemy soldier against \(U.\)

Or: The speaker may not realize that “\(U\) is really quite amazingly fantastically good” is not the same proposition as “an agent that maximizes \(U\) and nothing else is beneficial”, so they treat contradictions of the second statement as though they contradicted the first.

Or: Pointing out that \(\pi_0\) is high-\(U\) but low-\(V\) may sound like an argument against \(U,\) rather than an observation that apple pie is not the only good. “A universe filled with nothing but apple pie has low value” is not the same statement as “apple pie is bad and should not be in our utility function”.

If the “apple pie problem” is real, it seems likely to implicitly rely on, or interact with, some of the other alleged problems. For example, someone may not realize that their own complex values \(W\) contain a number of implicit filters \(F_1, F_2\) which act to filter out \(V\)-bad ways of achieving \(U,\) because they themselves are implicitly searching only for high-\(W\) ways of achieving \(U.\)


  • Unforeseen maximum

    When you tell an AI to produce world peace and it kills everyone. (Okay, some SF writers saw that one coming.)