Sufficiently optimized agents appear coherent


Sum­mary: Vio­la­tions of co­her­ence con­straints in prob­a­bil­ity the­ory and de­ci­sion the­ory cor­re­spond to qual­i­ta­tively de­struc­tive or dom­i­nated be­hav­iors. Co­her­ence vi­o­la­tions so eas­ily com­puted as to be hu­manly pre­dictable should be elimi­nated by op­ti­miza­tion strong enough and gen­eral enough to re­li­ably elimi­nate be­hav­iors that are qual­i­ta­tively dom­i­nated by cheaply com­putable al­ter­na­tives. From our per­spec­tive this should pro­duce agents such that, ce­teris paribus, we do not think we can pre­dict, in ad­vance, any co­her­ence vi­o­la­tion in their be­hav­ior.

Co­her­ence vi­o­la­tions cor­re­spond to qual­i­ta­tively de­struc­tive behaviors

There is a cor­re­spon­dence be­tween, on the one hand, thought pro­cesses that seem to vi­o­late in­tu­itively ap­peal­ing co­her­ence con­straints from the Bayesian fam­ily, and on the other hand, se­quences of overt be­hav­iors that leave the agent qual­i­ta­tively worse off than be­fore or that seem in­tu­itively dom­i­nated by other be­hav­iors.

For ex­am­ple, sup­pose you claim that you pre­fer A to B, B to C, and C to A. This ‘cir­cu­lar prefer­ence’ (A > B > C > A) seems in­tu­itively un­ap­peal­ing; we can also see how to vi­su­al­ize it as an agent with a qual­i­ta­tively self-de­struc­tive be­hav­ior as fol­lows:

  • You pre­fer to be in San Fran­cisco rather than Berkeley, and if you are in Berkeley you will pay $50 for a taxi ride to San Fran­cisco.

  • You pre­fer San Jose to San Fran­cisco and if in San Fran­cisco will pay $50 to go to San Jose. (Still no prob­lem so far.)

  • You like Berkeley more than San Jose and if in San Jose will pay $50 to go to Berkeley.

The cor­re­spond­ing agent will spend $150 on taxi rides and then end up in the same po­si­tion, per­haps ready to spend even more money on taxi rides. The agent is strictly, qual­i­ta­tively worse off than be­fore. We can see this, in some sense, even though the agent’s prefer­ences are par­tially in­co­her­ent. As­sum­ing the agent has a co­her­ent prefer­ence for money or some­thing that can be bought with money, alongside its in­co­her­ent prefer­ence for lo­ca­tion, then the cir­cu­lar trip has left it strictly worse off (since in the end the lo­ca­tion was un­changed). The cir­cu­lar trip is still dom­i­nated by the op­tion of stay­ing in the same place.
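The taxi loop above can be sketched as a small simulation. This is only an illustration under stated assumptions: the names `FARE`, `PREFERRED_MOVE`, and `run_money_pump`, and the starting balance of $1000, are hypothetical and do not come from the original argument.

```python
# Hypothetical illustration of the circular-preference "money pump".
FARE = 50  # dollars per taxi ride, as in the example above

# From each city, the agent prefers (and will pay to reach) the next one.
PREFERRED_MOVE = {
    "Berkeley": "San Francisco",
    "San Francisco": "San Jose",
    "San Jose": "Berkeley",
}

def run_money_pump(start, money, rides):
    """Let the agent follow its pairwise preferences for a fixed number of rides."""
    city = start
    for _ in range(rides):
        city = PREFERRED_MOVE[city]  # take the locally preferred trip
        money -= FARE                # and pay for it
    return city, money

city, money = run_money_pump("Berkeley", money=1000, rides=3)
print(city, money)  # Berkeley 850
```

After three rides the location is unchanged and the balance is $150 lower, which is the sense in which the trip is dominated by staying put.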

(The above is a var­i­ant of an ar­gu­ment first pre­sented by Steve Omo­hun­dro.)

(Phenom­ena like this, known as ‘prefer­ence re­ver­sals’, are a com­mon em­piri­cal find­ing in be­hav­ioral psy­chol­ogy. Since a hu­man mind is an ever-chang­ing bal­ance of drives and de­sires that can be height­ened or weak­ened by changes of en­vi­ron­men­tal con­text, elic­it­ing in­con­sis­tent sets of prefer­ences from hu­mans isn’t hard and can con­sis­tently be done in the lab­o­ra­tory in eco­nomics ex­per­i­ments, es­pe­cially if the cir­cu­lar­ity is buried among other ques­tions or dis­trac­tors.)

As an­other illus­tra­tion, con­sider the Allais para­dox. As a sim­plified ex­am­ple, con­sider offer­ing sub­jects a choice be­tween hy­po­thet­i­cal Gam­ble 1A, a cer­tainty of re­ceiv­ing $1 mil­lion if a die comes up any­where from 00-99, and Gam­ble 1B, a 10% chance of re­ceiv­ing noth­ing (if the die comes up 00-09) and a 90% chance of re­ceiv­ing $5 mil­lion (if the die comes up 10-99). Most sub­jects choose Gam­ble 1A. So far, we have a sce­nario that could be con­sis­tent with a co­her­ent util­ity func­tion in which the in­ter­val of de­sir­a­bil­ity from re­ceiv­ing $0 to re­ceiv­ing $1 mil­lion is more than nine times the in­ter­val from re­ceiv­ing $1 mil­lion to re­ceiv­ing $5 mil­lion.

How­ever, sup­pose only half the sub­jects are ran­domly as­signed to this con­di­tion, and the other half are asked to choose be­tween Gam­ble 2A, a 90% chance of re­ceiv­ing noth­ing (00-89) and a 10% chance of re­ceiv­ing $1 mil­lion (90-99), ver­sus Gam­ble 2B, a 91% chance of re­ceiv­ing noth­ing (00-90) and a 9% chance of re­ceiv­ing $5 mil­lion (91-99). Most sub­jects in this case will pick Gam­ble 2B. This com­bi­na­tion of re­sults guaran­tees that at least some sub­jects must be­have in a way that doesn’t cor­re­spond to any con­sis­tent util­ity func­tion over out­comes.
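To see why no single utility function fits both majority choices, one can compare the expected-utility gaps of the two pairs of gambles. The sketch below is an assumption-laden illustration (the outcome labels, the helper `expected_utility`, and the random integer utilities are all hypothetical); it checks with exact rational arithmetic that the gap between 2A and 2B is always one tenth of the gap between 1A and 1B, so the two gaps share a sign.

```python
from fractions import Fraction as F
import random

def expected_utility(gamble, u):
    """gamble: list of (probability, outcome) pairs; u: dict outcome -> utility."""
    return sum(p * u[outcome] for p, outcome in gamble)

# The four gambles from the text, with exact probabilities.
G1A = [(F(1), "1M")]
G1B = [(F(1, 10), "0"), (F(9, 10), "5M")]
G2A = [(F(9, 10), "0"), (F(1, 10), "1M")]
G2B = [(F(91, 100), "0"), (F(9, 100), "5M")]

def allais_gaps(u):
    """Return (EU(1A) - EU(1B), EU(2A) - EU(2B)) for a utility assignment u."""
    gap1 = expected_utility(G1A, u) - expected_utility(G1B, u)
    gap2 = expected_utility(G2A, u) - expected_utility(G2B, u)
    return gap1, gap2

# Try many arbitrary utility assignments (hypothetical test values).
rng = random.Random(0)
for _ in range(1000):
    u = {k: F(rng.randint(-1000, 1000)) for k in ("0", "1M", "5M")}
    gap1, gap2 = allais_gaps(u)
    assert gap2 == gap1 / 10  # same sign: preferring 1A forces preferring 2A
```

Any expected-utility maximizer that strictly prefers 1A must therefore strictly prefer 2A, whatever utilities it assigns to the three prizes.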

The Allais Para­dox (in a slightly differ­ent for­mu­la­tion) was ini­tially cel­e­brated as show­ing that hu­mans don’t obey the ex­pected util­ity ax­ioms, and it was thought that maybe the ex­pected util­ity ax­ioms were ‘wrong’ in some sense. How­ever, in ac­cor­dance with the stan­dard fam­i­lies of co­her­ence the­o­rems, we can crank the co­her­ence vi­o­la­tion to ex­hibit a qual­i­ta­tively dom­i­nated be­hav­ior:

Suppose you show me a switch, set to “A”, that determines whether I will get Gamble 2A or Gamble 2B. You offer me a chance to pay you one penny to throw the switch from A to B, so I do so (I now have a 91% chance of nothing, and a 9% chance of $5 million). Then you roll one of two ten-sided dice to determine the percentile result, and the first die, the tens digit, comes up “9”. Before rolling the second die, you offer to throw the switch back from B to A in exchange for another penny. Since the result of the first die transforms the experiment into Gamble 1A vs. 1B, I take your offer. You now have my two cents on the subject. (If the result of the first die is anything but 9, I am indifferent to the setting of the switch since I receive $0 either way.)

Again, we see a man­i­fes­ta­tion of a pow­er­ful fam­ily of the­o­rems show­ing that agents which can­not be seen as cor­re­spond­ing to any co­her­ent prob­a­bil­ities and con­sis­tent util­ity func­tion will ex­hibit qual­i­ta­tively de­struc­tive be­hav­ior, like pay­ing some­one a cent to throw a switch and then pay­ing them an­other cent to throw it back.
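The switch scenario can be enumerated over all 100 dice outcomes. The sketch below assumes an agent with the majority Allais preferences (2B over 2A before any roll, 1A over 1B once the tens digit is known to be 9) that does not pay when indifferent; the function `play` and its parameters are hypothetical names introduced for illustration.

```python
def play(tens_digit, ones_digit, pays_to_switch):
    """Return (prize in dollars, cents paid) for one roll of the two ten-sided dice."""
    roll = 10 * tens_digit + ones_digit
    switch, cents_paid = "A", 0
    if pays_to_switch:
        # Before any die is rolled the agent prefers Gamble 2B: pay to flip A -> B.
        switch, cents_paid = "B", cents_paid + 1
        if tens_digit == 9:
            # Now facing 1A vs. 1B, it prefers the certainty: pay to flip B -> A.
            switch, cents_paid = "A", cents_paid + 1
    if switch == "A":
        prize = 1_000_000 if roll >= 90 else 0  # Gamble 2A pays on 90-99
    else:
        prize = 5_000_000 if roll >= 91 else 0  # Gamble 2B pays on 91-99
    return prize, cents_paid

outcomes = [(t, o) for t in range(10) for o in range(10)]
for t, o in outcomes:
    prize_pumped, cents = play(t, o, True)
    prize_stay, _ = play(t, o, False)
    # Same prize on every roll as never touching the switch, but at least a cent poorer.
    assert prize_pumped == prize_stay and cents >= 1

print(sum(play(t, o, True)[1] for t, o in outcomes))  # 110
```

On every possible roll the switching agent receives exactly what it would have received by leaving the switch at A, minus one or two cents, so the switching policy is dominated.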

There is a large liter­a­ture on differ­ent sets of co­her­ence con­straints that all yield ex­pected util­ity, start­ing with the Von Neu­mann-Mor­gen­stern The­o­rem. No other de­ci­sion for­mal­ism has com­pa­rable sup­port from so many fam­i­lies of differ­ently phrased co­her­ence con­straints.

There is similarly a large liter­a­ture on many classes of co­her­ence ar­gu­ments that yield clas­si­cal prob­a­bil­ity the­ory, such as the Dutch Book the­o­rems. There is no sub­stan­tively differ­ent ri­val to prob­a­bil­ity the­ory and de­ci­sion the­ory which is com­pet­i­tive when it comes to (a) plau­si­bly hav­ing some bounded analogue which could ap­pear to de­scribe the un­cer­tainty of a pow­er­ful cog­ni­tive agent, and (b) seem­ing highly mo­ti­vated by co­her­ence con­straints, that is, be­ing forced by the ab­sence of qual­i­ta­tively harm­ful be­hav­iors that cor­re­spond to co­her­ence vi­o­la­tions.
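As a minimal sketch of the kind of loss a Dutch Book argument describes, suppose an agent treats a ticket paying $1 if an outcome occurs as worth its stated probability for that outcome, and will buy at that price. The numbers and the function name `net_payoff_from_buying_all` are hypothetical, chosen only for illustration.

```python
from fractions import Fraction as F

def net_payoff_from_buying_all(credences):
    """Net dollars for an agent that buys a $1 ticket on every outcome at its own price.

    credences: dict mapping mutually exclusive, exhaustive outcomes to the
    agent's stated probabilities. Returns a dict: actual outcome -> net dollars.
    """
    total_paid = sum(credences.values())
    # Exactly one ticket pays out $1, whichever outcome actually occurs.
    return {outcome: 1 - total_paid for outcome in credences}

incoherent = {"rain": F(6, 10), "no rain": F(6, 10)}  # probabilities sum to 6/5
coherent = {"rain": F(6, 10), "no rain": F(4, 10)}    # probabilities sum to 1

print(net_payoff_from_buying_all(incoherent))  # loses 1/5 dollar in every outcome
print(net_payoff_from_buying_all(coherent))    # breaks even in every outcome
```

If the stated probabilities summed to less than 1, the same construction would run in reverse, with the agent selling the tickets at its own prices and again losing in every outcome.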

Generic op­ti­miza­tion pres­sures, if suffi­ciently strong and gen­eral, should be ex­pected to elimi­nate be­hav­iors that are dom­i­nated by clearly visi­ble al­ter­na­tives.

Even an in­co­her­ent col­lec­tion of shift­ing drives and de­sires may well rec­og­nize, af­ter hav­ing paid their two cents or $150, that they are wast­ing money, and try to do things differ­ently (self-mod­ify). An AI’s pro­gram­mers may rec­og­nize that, from their own per­spec­tive, they would rather not have their AI spend­ing money on cir­cu­lar taxi rides. This im­plies a path from in­co­her­ent non-ad­vanced agents to co­her­ent ad­vanced agents as more and more op­ti­miza­tion power is ap­plied to them.

A suffi­ciently ad­vanced agent would pre­sum­ably catch on to the ex­is­tence of co­her­ence the­o­rems and see the ab­stract pat­tern of the prob­lems (as hu­mans already have). But it is not nec­es­sary to sup­pose that these qual­i­ta­tively de­struc­tive be­hav­iors are be­ing tar­geted be­cause they are ‘ir­ra­tional’. It suffices for the in­co­heren­cies to be tar­geted as ‘prob­lems’ be­cause par­tic­u­lar cases of them are rec­og­nized as hav­ing pro­duced clear, qual­i­ta­tive losses.

Without knowing in advance the exact specifics of the optimization pressures being applied, it seems that, in advance and ceteris paribus, we should expect that paying a cent to throw a switch and then paying again to switch it back, or throwing away $150 on circular taxi rides, are qualitatively destructive behaviors that optimization would tend to eliminate. E.g., one expects that, ceteris paribus and given the option of eliminating the behavior at a reasonable computational cost, a consequentialist goal-seeking agent would prefer to eliminate the behavior that corresponds to incoherence, a policy reinforcement learner would be reinforced for eliminating it, a fitness criterion would assign greater fitness to eliminating it, and so on.

If there is a par­tic­u­lar kind of op­ti­miza­tion pres­sure that seems suffi­cient to pro­duce a cog­ni­tively highly ad­vanced agent, but which also seems sure to over­look some par­tic­u­lar form of in­co­her­ence, then this would pre­sent a loop­hole in the over­all ar­gu­ment and yield a route by which an ad­vanced agent with that par­tic­u­lar in­co­her­ence might be pro­duced (al­though the agent’s in­ter­nal op­ti­miza­tion must also be pre­dicted to tol­er­ate the same in­co­her­ence, as oth­er­wise the agent will self-mod­ify away from it).

Elimi­nat­ing be­hav­iors that are dom­i­nated by cheaply com­putable al­ter­na­tive be­hav­iors will pro­duce cog­ni­tion that looks Bayesian-co­her­ent from our per­spec­tive.

Perfect epistemic and instrumental coherence is too computationally expensive for bounded agents to achieve. Consider e.g. the conjunction rule of probability that P(A&B) ≤ P(A). If A is a theorem, and B is a lemma very helpful in proving A, then asking the agent for the probability of A alone may elicit a lower answer than asking the agent about the joint probability of A&B (since thinking of B as a lemma increases the subjective probability of A). This is not a full-blown form of conjunction fallacy since there is no particular time at which the agent explicitly assigns lower probability to P(A&B ∨ A&~B) than to P(A&B). But even for an advanced agent, if a human were watching the series of probability assignments, the human might be able to say some equivalent of, “Aha, even though the agent was exposed to no new outside evidence, it assigned probability X to P(A) at time t, and then assigned probability Y>X to P(A&B) at time t+2.”
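The check a human observer might run over such a series of assignments can be sketched as follows. The log format, the function name `conjunction_violations`, and the example probabilities are assumptions made for illustration; the sketch also assumes that no new outside evidence arrives between log entries.

```python
def conjunction_violations(log):
    """Return pairs of log entries that jointly violate P(A&B) <= P(A).

    log: list of (time, conjuncts, probability) tuples, where conjuncts is a
    frozenset of atomic statement names read as their conjunction.
    """
    violations = []
    for weaker in log:
        for stronger in log:
            # `stronger` asserts everything `weaker` does and more (proper superset),
            # yet was given a strictly higher probability.
            if stronger[1] > weaker[1] and stronger[2] > weaker[2]:
                violations.append((weaker, stronger))
    return violations

log = [
    (0, frozenset({"A"}), 0.30),       # P(A) = X at time t
    (2, frozenset({"A", "B"}), 0.45),  # P(A&B) = Y > X at time t+2
]
print(len(conjunction_violations(log)))  # 1
```

The comparison is over pairs of assignments made at different times, which is why it can flag an inconsistency even though the agent never states both probabilities at once.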

Two no­tions of “suffi­ciently op­ti­mized agents will ap­pear co­her­ent (to hu­mans)” that might be sal­vaged from the above ob­jec­tion are as fol­lows:

  • There will be some bounded notion of Bayesian rationality that incorporates e.g. a theory of logical uncertainty which agents will appear from a human perspective to strictly obey. All departures from this bounded coherence that humans can understand using their own computing power will have been eliminated.

  • Optimized agents appear coherent: It will not be possible for humans to specifically predict in advance any large coherence violation, such as the above intertemporal conjunction fallacy. Anything simple enough and computable cheaply enough for humans to predict in advance will also be computationally possible for the agent to eliminate in advance. Any predictable coherence violation which is significant enough to be humanly worth noticing will also be damaging enough to be worth eliminating.

Although the first no­tion of sal­vage­able co­her­ence above seems to us quite plau­si­ble, it has a large gap with re­spect to what this bounded analogue of ra­tio­nal­ity might be. In­so­far as op­ti­mized agents ap­pear­ing co­her­ent has prac­ti­cal im­pli­ca­tions, these im­pli­ca­tions should prob­a­bly rest upon the sec­ond line of ar­gu­ment.

One pos­si­ble loop­hole of the sec­ond line of ar­gu­ment might be some pre­dictable class of in­co­her­ences which are not at all dam­ag­ing to the agent and hence not worth spend­ing even rel­a­tively tiny amounts of com­put­ing power to elimi­nate. If so, this would im­ply some pos­si­ble hu­manly pre­dictable in­co­her­ences of ad­vanced agents, but these in­co­her­ences would not be ex­ploitable to cause any fi­nal out­come that is less than max­i­mally preferred by the agent, in­clud­ing sce­nar­ios where the agent spends re­sources it would not oth­er­wise spend, etc.

A fi­nal im­plicit step is the as­sump­tion that when all hu­manly-visi­ble agent-dam­ag­ing co­her­ence vi­o­la­tions have been elimi­nated, the agent should look to us co­her­ent; or that if we can­not pre­dict spe­cific co­her­ence vi­o­la­tions in ad­vance, then we should rea­son about the agent as if it is co­her­ent. We don’t yet see a rele­vant case where this would fail, but any failure of this step could also pro­duce a loop­hole in the over­all ar­gu­ment.


Some pos­si­ble mind de­signs may evade the de­fault expectation

Since mind de­sign space is large, we should ex­pect with high prob­a­bil­ity that there are at least some ar­chi­tec­tures that evade the above ar­gu­ments and de­scribe highly op­ti­mized cog­ni­tive sys­tems, or re­flec­tively sta­ble sys­tems, that ap­pear to hu­mans to sys­tem­at­i­cally de­part from bounded Bayesi­anism.

There could be some su­pe­rior al­ter­na­tive to prob­a­bil­ity the­ory and de­ci­sion the­ory that is Bayesian-incoherent

When it comes to the ac­tual out­come for ad­vanced agents, the rele­vant fact is not whether there are cur­rently some even more ap­peal­ing al­ter­na­tives to prob­a­bil­ity the­ory or de­ci­sion the­ory, but whether these ex­ist in prin­ci­ple. The hu­man species has not been around long enough for us to be sure that this is not the case.

Re­mark one: To ad­vance-pre­dict spe­cific in­co­her­ence in an ad­vanced agent, (a) we’d need to know what the su­pe­rior al­ter­na­tive was and (b) it would need to lead to the equiv­a­lent of go­ing around in loops from San Fran­cisco to San Jose to Berkeley.

Remark two: If on some development methodology it might prove catastrophic for there to exist some generic unknown superior to probability theory or decision theory, then we should perhaps be worried on this score, especially since we can be reasonably sure that an advanced agent cannot actually use probability theory and decision theory, and must use some bounded analogue if it uses any analogue at all.

A cog­ni­tively pow­er­ful agent might not be suffi­ciently optimized

Scenarios that negate the claim that relevant powerful agents will be highly optimized, such as brute-forcing non-recursive intelligence, can potentially evade the ‘sufficiently optimized’ condition required to yield predicted coherence. E.g., it might be possible to create a cognitively powerful system by overdriving some fixed set of algorithms, and then to prevent this system from optimizing itself or creating offspring agents in the environment. This could allow the creation of a cognitively powerful system that does not appear to us as a bounded Bayesian. (If, for some reason, that was a good idea.)


If the probability of this thesis is high: The predictions we make today about behaviors of generic advanced agents should not depict them as being visibly, specifically incoherent from a probability-theoretic or decision-theoretic perspective.

If the probability is not extremely high: If it were somehow necessary or helpful for safety to create an incoherent agent architecture, this might be possible, though difficult. The development methodology would need to contend with both the optimization pressures producing the agent, and the optimization pressures that the agent itself might apply to itself or to environmental subagents. Successful intelligence brute-forcing scenarios, in which a cognitively powerful agent is produced by using a great deal of computing power on known algorithms and the agent is then somehow forbidden from self-modifying or creating other environmental agents, might be able to yield predictably incoherent agents.

If the probability is not extremely high: The assumption that an advanced agent will become Bayesian-coherent should not be a load-bearing premise of a safe development methodology unless there are further safeguards or fallbacks. A safe development methodology should not fail catastrophically if there exists a generic, unknown superior to probability theory or decision theory.

