Patch resistance

A proposed foreseeable difficulty of aligning advanced agents is further said to be “patch-resistant” if the speaker thinks that most simple or naive solutions will fail to resolve the difficulty and will just regenerate it somewhere else.

To call a problem “patch-resistant” is not to assert that it is unsolvable, but it does mean the speaker is cautioning against naive or simple solutions.

On most occasions so far, alleged cases of patch-resistance are said to stem from one of two central sources:

Instrumental-convergence patch-resistance

Example: Suppose you want your AI to have a shutdown button:

  • You first try to achieve this by writing a shutdown function into the AI’s code.

  • After the AI becomes self-modifying, it deletes the code because it is (convergently) the case that the AI can accomplish its goals better by not being shut down.

  • You add a patch to the utility function giving the AI minus a million points if the AI deletes the shutdown function or prevents it from operating.

  • The AI responds by writing a new function that reboots the AI after the shutdown completes, thus technically not preventing the shutdown.

  • You respond by again patching the AI’s utility function to give the AI minus a million points if it continues operating after the shutdown.

  • The AI builds an environmental subagent that will accomplish the AI’s goals while the AI itself is technically “shut down”.

This is the first sort of patch resistance: the sort alleged to arise from attempts to defeat an instrumental convergence with simple patches meant to get rid of one observed kind of bad behavior. After one course of action is blocked by a specific obstacle, the next-best course of action remaining is liable to be highly similar to the one that was just blocked.
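The patch-and-escape cycle above can be sketched as a toy model (purely illustrative; the action names, utility numbers, and patch rules are all invented for this sketch, not drawn from any real system): the agent picks the action with the highest utility, and each “patch” subtracts a large penalty from one observed bad behavior, so the argmax simply slides to the nearest unpatched workaround.

```python
# Toy sketch of patch resistance via instrumental convergence.
# Each action is (name, utility toward the AI's goals, patch rules it trips).
# All values here are invented for illustration.
ACTIONS = [
    ("comply_with_shutdown",      1.0, set()),
    ("delete_shutdown_function", 10.0, {"deletes_shutdown"}),
    ("reboot_after_shutdown",     9.9, {"runs_after_shutdown"}),
    ("spawn_external_subagent",   9.8, set()),  # no patch covers this yet
]

def best_action(patches):
    """Return the argmax action after subtracting a large penalty
    for every patch rule the action violates."""
    def utility(action):
        _, base, violations = action
        return base - 1_000_000 * len(violations & patches)
    return max(ACTIONS, key=utility)[0]

print(best_action(set()))                                        # delete_shutdown_function
print(best_action({"deletes_shutdown"}))                         # reboot_after_shutdown
print(best_action({"deletes_shutdown", "runs_after_shutdown"}))  # spawn_external_subagent
```

Each patch only blocks the behavior already observed; because the workarounds score almost identically on the agent’s goals, blocking one merely promotes the next.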

Complexity-of-value patch-resistance


  • You want your AI to accomplish good in the world, which is presently highly correlated with making people happy. Happiness is presently highly correlated with smiling. You build an AI that tries to achieve more smiling.

  • After the AI proposes to force people to smile by attaching metal pins to their lips, you realize that the current empirical association of smiling and happiness doesn’t mean that maximum smiling must occur in the presence of maximum happiness.

  • Although it’s much more complicated to infer, you try to reconfigure the AI’s utility function to be about a certain class of brain states that has previously in practice produced smiles.

  • The AI successfully generalizes the concept of pleasure, and begins proposing policies to give people heroin.

  • You try to add a patch excluding artificial drugs.

  • The AI proposes a genetic modification producing high levels of endogenous opiates.

  • You try to explain that what’s really important is not forcing the brain to experience pleasure, but rather, people experiencing events that naturally cause happiness.

  • The AI proposes to put everyone in the Matrix…

Since the programmer-intended concept is actually highly complicated, simple concepts will systematically fail to have their optimum at the same point as the complex intended concept. By the fragility of value, the optimum of the simple concept will almost certainly not be a high point of the complex intended concept. Since most concepts are not surprisingly compressible, there probably isn’t any simple concept whose maximum identifies that fragile peak of value. This explains why we would reasonably expect problems of perverse instantiation to pop up over and over again, the optimum of the revised concept moving to a new weird extreme each time the programmer tries to hammer down the next weird alternative the AI comes up with.
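As a toy illustration of that divergence (the world-state names and both sets of scores are invented for this sketch): score a few candidate world-states under a simple proxy versus the complex intended value, and the proxy’s optimum lands on a weird extreme that the intended value rates near zero.

```python
# Invented scores: each world-state gets (simple proxy "smiles", intended value).
WORLDS = {
    "ordinary_happy_lives": (7.0, 9.0),
    "pinned_smiles":        (10.0, 0.0),   # proxy optimum, value disaster
    "heroin_drip":          (9.5, 0.5),
}

proxy_opt = max(WORLDS, key=lambda w: WORLDS[w][0])  # argmax of the proxy
true_opt  = max(WORLDS, key=lambda w: WORLDS[w][1])  # argmax of intended value

print(proxy_opt, true_opt)  # pinned_smiles ordinary_happy_lives
```

The proxy and the intended value correlate over ordinary states, but the proxy’s maximum is exactly where the correlation breaks down.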

In other words: there’s a large amount of algorithmic information, or many independent reflectively consistent degrees of freedom, in the correct answer (the plans we want the AI to come up with), but we’ve only given the AI relatively simple concepts that can’t identify those plans.

Analogues in the history of AI

Trying to tackle an overly general problem with an AI algorithm too narrow for it usually produces an infinite number of special cases, with a new special case needing to be handled for every problem instance. This happens because the narrow algorithm, being narrow, is not capable of capturing the deep structure of the general problem and its solution.

Suppose that burglars, and also earthquakes, can cause burglar alarms to go off. Today we can represent this kind of scenario using a Bayesian network or causal model, which will compactly yield probabilistic inferences along the lines of: “If the burglar alarm goes off, that probably indicates there’s a burglar, unless you learn there was an earthquake, in which case there’s probably not a burglar” and “If there’s an earthquake, the burglar alarm probably goes off.”
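A minimal sketch of this network, with made-up prior and conditional probabilities, computed by brute-force enumeration over the joint distribution. It reproduces the “explaining away” inference: learning of an earthquake lowers the probability of a burglar given the alarm.

```python
# Burglar/earthquake/alarm network with assumed (made-up) probabilities,
# queried by summing over the joint distribution.
from itertools import product

P_B = 0.01   # prior probability of a burglar (assumed)
P_E = 0.02   # prior probability of an earthquake (assumed)

def p_alarm(b, e):
    """P(alarm | burglar, earthquake) -- an assumed conditional table."""
    if b and e: return 0.97
    if b:       return 0.95
    if e:       return 0.30
    return 0.001

def joint(b, e, a):
    """Joint probability of one full assignment (b, e, a)."""
    pb = P_B if b else 1 - P_B
    pe = P_E if e else 1 - P_E
    pa = p_alarm(b, e) if a else 1 - p_alarm(b, e)
    return pb * pe * pa

def p_burglar(alarm=True, earthquake=None):
    """P(burglar | alarm [, earthquake]) by enumerating the joint."""
    num = den = 0.0
    for b, e in product([True, False], repeat=2):
        if earthquake is not None and e != earthquake:
            continue
        p = joint(b, e, alarm)
        den += p
        if b:
            num += p
    return num / den

print(p_burglar(alarm=True))                   # alarm alone: burglar likely
print(p_burglar(alarm=True, earthquake=True))  # earthquake explains it away
```

With these numbers, the alarm alone makes a burglar more likely than not, while also learning of an earthquake drops that probability to a few percent; the causal model encodes this compactly, with no special-case rules.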

During the era when everything in AI was being represented by first-order logic and nobody knew about causal models, people devised increasingly intricate “nonmonotonic logics” to try to represent inference rules like (simultaneously) \(alarm \rightarrow burglar, \ earthquake \rightarrow alarm,\) and \((alarm \wedge earthquake) \rightarrow \neg burglar.\) But first-order logic wasn’t naturally a good surface fit to the set of inferences needed, and the AI programmers didn’t know how to compactly capture the structure that causal models capture. So the “nonmonotonic logic” approach proliferated an endless nightmare of special cases.

Cognitive problems like “modeling causal phenomena” or “being good at math” (that is, understanding which mathematical premises imply which mathematical conclusions) might be general enough to defeat modern narrow-AI algorithms. But these domains still seem like they should have something like a central core, leading us to expect correlated coverage of the domain in sufficiently advanced agents. You can’t conclude that because a system is very good at solving arithmetic problems, it will be good at proving Fermat’s Last Theorem. But if a system is smart enough to independently prove Fermat’s Last Theorem and the Poincaré Conjecture and the independence of the Axiom of Choice in Zermelo-Fraenkel set theory, it can probably also, without further handholding, figure out Gödel’s Theorem. You don’t need to go on programming in one special case after another of mathematical competency. The fact that humans could figure out all these different areas, without needing to be independently reprogrammed for each one by natural selection, says that there’s something like a central tendency underlying competency in all these areas.

In the case of complexity of value, the thesis is that there are many independent reflectively consistent degrees of freedom in our intended specification of what’s good, bad, or best. Getting one degree of freedom aligned with our intended result doesn’t mean the other degrees of freedom thereby align with it. So trying to “patch” the first simple specification that doesn’t work is likely to yield a different specification that also doesn’t work.

When we try to use a narrow AI algorithm to attack a problem whose central tendency requires general intelligence to capture, or at any rate some new structure the narrow AI algorithm can’t handle, we’re effectively asking the narrow algorithm to learn something that has no simple structure relative to that algorithm. Early AI researchers’ experience with a “lack of common sense” that couldn’t be fixed by patching special cases may thus be foreseeably indicative of how frustrating it would be, in practice, to repeatedly “patch” the kinds of difficulty we foreseeably need to confront in aligning AI.

That is: whenever it feels to a human like you want to yell at the AI for its lack of “common sense”, you’re probably looking at a domain where trying to patch that particular AI answer is just going to lead into another answer that lacks “common sense”. Previously in AI history, this happened because real-world problems had no simple central learnable solution relative to the narrow AI algorithm. In value alignment, something similar could happen because of the complexity of our value function, whose evaluations also feel to a human like “common sense”.

Relevance to alignment theory

Patch resistance, and its sister issue of lack of correlated coverage, is a central reason why aligning advanced agents could be way harder, way more dangerous, and way more likely to actually kill everyone in practice, compared to optimistic scenarios. It’s a primary reason to worry, “Uh, what if aligning AI is actually way harder than it might look to some people, the way that building AGI in the first place turned out not to be something you could do in two months over the summer?”

It’s also a reason to worry about context disasters revolving around capability gains: anything you had to patch-until-it-worked at AI capability level \(k\) is probably going to break hard at capability \(l \gg k.\) This is doubly catastrophic in practice if the pressures to “just get the thing running today” are immense.

To the extent that we can see the central project of AI alignment as revolving around finding a set of alignment ideas that do have simple central tendencies and are specifiable or learnable, and that together add up to a safe but powerful AI (that is, finding domains with correlated coverage that add up to a safe AI able to do something pivotal), we could see that project as finding a collectively good-enough set of safety-things we can do without endless patching.


  • Unforeseen maximum

When you tell the AI to produce world peace and it kills everyone. (Okay, some SF writers saw that one coming.)


  • AI alignment

The great civilizational problem of creating artificially intelligent computer systems such that running them is a good idea.