Context disaster

Short introduction

One fre­quently sug­gested strat­egy for al­ign­ing a suffi­ciently ad­vanced AI is to ob­serve—be­fore the AI be­comes pow­er­ful enough that ‘de­bug­ging’ the AI would be prob­le­matic if the AI de­cided not to let us de­bug it—whether the AI ap­pears to be act­ing nicely while it’s not yet smarter than the pro­gram­mers.

Early test­ing ob­vi­ously can’t provide a statis­ti­cal guaran­tee of the AI’s fu­ture be­hav­ior. If you ob­serve some ran­dom draws from Bar­rel A, at best you get statis­ti­cal guaran­tees about fu­ture draws from Bar­rel A un­der the as­sump­tion that the past and fu­ture draws are col­lec­tively in­de­pen­dent and iden­ti­cally dis­tributed.

On the other hand, if Bar­rel A is similar to Bar­rel B, ob­serv­ing draws from Bar­rel A can some­times tell us some­thing about Bar­rel B even if the two bar­rels are not i.i.d.

Con­versely, if ob­served good be­hav­ior while the AI is not yet su­per-smart, fails to cor­re­late to good out­comes af­ter the AI is un­leashed or be­comes smarter, then this is a “con­text change prob­lem” or “con­text dis­aster”. noteBet­ter ter­minol­ogy is still be­ing so­lic­ited here, if you have a short phrase that would evoke ex­actly the right mean­ing.

A key ques­tion then is how shocked we ought to be, on a scale from 1 to 10, if good out­comes in the AI’s ‘de­vel­op­ment’ phase fail to match up with good out­comes in the AI’s ‘op­ti­mize the real world’ phase? noteLeav­ing aside tech­ni­cal quib­bles about how we can’t feel shocked if we’re dead.

Peo­ple who ex­pect that AI al­ign­ment is difficult think that the de­gree of jus­tified sur­prise is some­where around 1 out of 10. In other words, that there are a lot of fore­see­able is­sues that could cause a seem­ingly nice weaker AI to not de­velop into a nice smarter AI.

An ex­tremely over­sim­plified (but con­crete) fable that illus­trates some of these pos­si­ble difficul­ties might go as fol­lows:

  • Some group or pro­ject has ac­quired a vi­able de­vel­op­ment path­way to AGI. The pro­gram­mers think it is wise to build an AI that will make peo­ple happy. noteThis is not quite a straw ar­gu­ment, in the sense that it’s been ad­vo­cated more than once by peo­ple who have ap­par­ently never read any sci­ence fic­tion in their lives; there are cer­tainly many AI re­searchers who would be smarter than to try this, but not nec­es­sar­ily all of them. In any case, we’re look­ing for an un­re­al­is­ti­cally sim­ple sce­nario for pur­poses of illus­trat­ing sim­ple forms of some key ideas; in real life, if analo­gous things go wrong, they would prob­a­bly be more com­pli­cated things.

  • The pro­gram­mers start by try­ing to train their AI to pro­duce smiles. noteA­gain, this is not quite a straw pos­si­bil­ity in the sense that it was ad­vo­cated in at least one pub­lished pa­per, not cited here be­cause the au­thor later ex­er­cised their sovereign right of chang­ing their mind about that. Ar­guably some cur­rently floated pro­pos­als are closely analo­gous to this one.

  • While the AI is young and has rel­a­tively few policy op­tions, it can only make peo­ple smile by perform­ing well on the tasks as­signed to it.

  • At some point, the AI gets smarter and and able to do more com­pli­cated things with some sim­ple robots it con­trols. In an un­for­tu­nate in­ci­dent, the AI learns it is pos­si­ble to pro­duce smiles by tick­ling peo­ple.

  • In the af­ter­math of this in­ci­dent, the pro­gram­mers man­age to re­pro­gram the AI to not op­ti­mize for this par­tic­u­lar kind of smile by this par­tic­u­lar route.

  • The AI be­comes smarter and able to model peo­ple bet­ter. After pe­rus­ing a copy of Wikipe­dia, the AI re­al­izes it can make peo­ple ex­tremely ‘happy’, as the AI’s util­ity func­tion cur­rently defines ‘hap­piness’, by ad­minis­ter­ing heroin.

  • The AI re­mem­bers the pre­vi­ous in­ci­dent in which the pro­gram­mers al­tered the AI’s util­ity func­tion. The AI can model its pro­gram­mers suffi­ciently well to know that the pro­gram­mers would not want the AI to give peo­ple heroin. How­ever, the most hap­piness the AI thinks its fu­ture self would be able to pro­duce with­out heroin, is not nearly as much hap­piness as the AI thinks it could pro­duce if its fu­ture self went on want­ing to give peo­ple heroin.

  • The AI re­frains from try­ing to ad­minister any heroin right now, and be­haves in a way it thinks will be max­i­mally re­as­sur­ing to its model of its pro­gram­mers, in or­der to be able to pro­duce more ‘hap­piness’ later.

  • Even­tu­ally, all reach­able galax­ies end up be­ing trans­formed into tiny molec­u­lar com­put­ers im­ple­ment­ing the sim­plest and cheap­est forms of what the AI defines as ‘hap­piness’. (And the sim­plest pos­si­ble con­figu­ra­tions match­ing the AI’s util­ity func­tion in this way are so sim­ple as to be de­void of sub­jec­tive ex­pe­rience; and hence, from our per­spec­tive, of nei­ther nega­tive nor pos­i­tive value.)

In all these cases, the prob­lem was not that the AI de­vel­oped in an un­sta­ble way. The same de­ci­sion sys­tem pro­duced a new prob­lem in the new con­text.

Cur­rently ar­gued fore­see­able “con­text change prob­lems” in this sense, can be di­vided into three broad classes:

  • More pos­si­bil­ities, more prob­lems: The AI’s prefer­ences have a good or in­tended achiev­able op­ti­mum while the AI is pick­ing from a nar­row space of op­tions. When the AI be­comes smarter or gains more ma­te­rial op­tions, it picks from a wider space of tractable poli­cies and achiev­able out­comes. Then the new op­ti­mum is not as nice, be­cause, for ex­am­ple:

  • The AI’s util­ity func­tion was tweaked by some learn­ing al­gorithm and data that even­tu­ally seemed to con­form be­hav­ior well over op­tions con­sid­ered early on, but not the wider space con­sid­ered later.

  • In de­vel­op­ment, ap­par­ently bad sys­tem be­hav­iors were patched in ways that ap­peared to work, but didn’t elimi­nate an un­der­ly­ing ten­dency, only blocked one ex­pres­sion of that ten­dency. Later a very similar pres­sure re-emerged in an un­blocked way when the AI con­sid­ered a wider policy space.

  • Good­hart’s Curse sug­gests that if our true in­tended val­ues V are be­ing mod­eled by a util­ity func­tion U, se­lect­ing for the high­est val­ues of U also se­lects for the high­est up­ward di­ver­gence of U from V, and this ver­sion of the “op­ti­mizer’s curse” phe­nomenon be­comes worse as U is eval­u­ated over a wider op­tion space.

  • Treach­er­ous turn: There’s a di­ver­gence be­tween the AI’s prefer­ences and the pro­gram­mers’ prefer­ences, and the AI re­al­izes this be­fore we do. The AI uses the con­ver­gent strat­egy of be­hav­ing the way it mod­els us as want­ing or ex­pect­ing, un­til the AI gains the in­tel­li­gence or ma­te­rial power to im­ple­ment its prefer­ences in spite of any­thing we can do.

  • Revving into the red: In­tense op­ti­miza­tion causes some as­pect or sub­sys­tem of the AI to tra­verse a weird new ex­e­cu­tion path in some way differ­ent from the above two is­sues. (In a way that in­volves a value-laden cat­e­gory bound­ary or mul­ti­ple self-con­sis­tent out­looks, such that we don’t get a good re­sult just as a free lunch of the AI’s gen­eral in­tel­li­gence.)

The con­text change prob­lem is a cen­tral is­sue of AI al­ign­ment and a key propo­si­tion in the gen­eral the­sis of al­ign­ment difficulty. If you could eas­ily, cor­rectly, and safely test for nice­ness by out­ward ob­ser­va­tion, and that form of nice­ness scaled re­li­ably from weaker AIs to smarter AIs, that would be a very cheer­ful out­look on the gen­eral difficulty of the prob­lem.

Tech­ni­cal introduction

John Dana­her sum­ma­rized as fol­lows what he con­sid­ered a force­ful “safety test ob­jec­tion” to AI catas­tro­phe sce­nar­ios:

Safety test ob­jec­tion: An AI could be em­piri­cally tested in a con­strained en­vi­ron­ment be­fore be­ing re­leased into the wild. Pro­vided this test­ing is done in a rigor­ous man­ner, it should en­sure that the AI is “friendly” to us, i.e. poses no ex­is­ten­tial risk.

The phras­ing here of “em­piri­cally” and “safety test” im­plies that it is out­ward be­hav­ior or out­ward con­se­quences that are be­ing ob­served (em­piri­cally). Rather than, e.g., the en­g­ineers try­ing to test for some in­ter­nal prop­erty that they think an­a­lyt­i­cally im­plies the AI’s good be­hav­ior later.

This page will con­sider that the sub­ject of dis­cus­sion is whether we can gen­er­al­ize from the AI’s out­ward be­hav­ior. We can po­ten­tially gen­er­al­ize some of these ar­gu­ments to some in­ter­nal ob­serv­ables, es­pe­cially ob­serv­ables that the AI is de­cid­ing in a con­se­quen­tial­ist way us­ing the same cen­tral de­ci­sion sys­tem, or that the AI could po­ten­tially try to ob­scure from the pro­gram­mers. But in gen­eral not all the ar­gu­ments will carry over.

Another ar­gu­ment, closely analo­gous to Dana­her’s, would rea­son on ca­pa­bil­ities rather than on a con­strained en­vi­ron­ment:

Surely an en­g­ineer that ex­er­cises even a mod­icum of cau­tion will ob­serve the AI while its ca­pa­bil­ities are weak to de­ter­mine whether it is be­hav­ing well. After fil­ter­ing out all such mis­be­hav­ing weak AIs, the only AIs per­mit­ted to be­come strong will be of benev­olent dis­po­si­tion.

If (as seems to have been in­tended) we take these twin ar­gu­ments as ar­gu­ing “why no­body ought to worry about AI al­ign­ment” in full gen­er­al­ity, then we can list out some pos­si­ble joints at which that gen­eral ar­gu­ment might fail:

  • Select­ing on the fastest-mov­ing pro­jects might yield a pro­ject whose tech­ni­cal lead­ers fail to ex­er­cise even “a mod­icum of cau­tion”.

  • Align­ment might be hard enough, rel­a­tive to the amount of ad­vance re­search done, that we can’t find any AIs whose be­hav­ior while weak or con­strained is as re­as­sur­ing as the ar­gu­ment would prop­erly ask. noteThat is: A filter on the stan­dards we origi­nally wanted, turns out to filter ev­ery­thing we know how to gen­er­ate. Like try­ing to write a sort­ing al­gorithm by gen­er­at­ing en­tirely ran­dom code, and then ‘fil­ter­ing’ all the can­di­date pro­grams on whether they cor­rectly sort lists. The rea­son ‘ran­domly gen­er­ate pro­grams and filter them’ this is not a fully gen­eral pro­gram­ming method is that, for rea­son­able amounts of com­put­ing power and even slightly difficult prob­lems, none of the pro­grams you try will pass the filter. After a span of frus­tra­tion, some­body some­where low­ers their stan­dards.

  • The at­tempt to iso­late the AI to a con­strained en­vi­ron­ment could fail, e.g. be­cause the hu­mans ob­serv­ing the AI them­selves rep­re­sent a chan­nel of causal in­ter­ac­tion be­tween the AI and the rest of the uni­verse. (Aka “hu­mans are not se­cure”.) Analo­gously, our grasp on what con­sti­tutes a ‘weak’ AI could fail, or it could gain in ca­pa­bil­ity un­ex­pect­edly quickly. Both of these sce­nar­ios would yield an AI that had not passed the fil­ter­ing pro­ce­dure.

  • The smart form of the AI might be un­sta­ble with re­spect to in­ter­nal prop­er­ties that were pre­sent in the weak form. E.g., be­cause the early AI was self-mod­ify­ing but at that time not smart enough to un­der­stand the full con­se­quences of its own self-mod­ifi­ca­tions. Or be­cause e.g. a prop­erty of the de­ci­sion sys­tem was not re­flec­tively sta­ble.

  • A weak or con­tained form of a de­ci­sion pro­cess that yields be­hav­ior ap­pear­ing good to hu­man ob­servers, might not yield benefi­cial out­comes af­ter that same de­ci­sion pro­cess be­comes smarter or less con­tained.

The fi­nal is­sue in full gen­er­al­ity is what we’ll term a ‘con­text change prob­lem’ or ‘con­text dis­aster’.

Ob­serv­ing an AI when it is weak, does not in a statis­ti­cal sense give us solid guaran­tees about its be­hav­ior when stronger. If you re­peat­edly draw in­de­pen­dent and iden­ti­cally dis­tributed ran­dom sam­ples from a bar­rel, there are statis­ti­cal guaran­tees about what we can ex­pect, with some prob­a­bil­ity, to be true about the next sam­ples from the same bar­rel. If two bar­rels are differ­ent, no such guaran­tee ex­ists.

To in­val­i­date the statis­ti­cal guaran­tee, we do need some rea­son to be­lieve that bar­rel B and bar­rel A are differ­ent in any im­por­tant sense. By the prob­lem of in­duc­tion we can’t log­i­cally guaran­tee that “the mass of an elec­tron prior to Jan­uary 1st, 2017” is the same bar­rel as “the mass of an elec­tron af­ter Jan­uary 1st, 2017″; but in­duc­tive pri­ors make this in­fer­ence ex­tremely prob­a­ble. The idea is that we have sub­stan­tive rea­sons, not merely gener­i­cally skep­ti­cal rea­sons, to be sus­pi­cious of the link be­tween “good re­sults when AI is weak” and “good re­sults when AI is smarter”.

More gen­er­ally, this is prima fa­cie the kind of differ­ence where you don’t ex­pect in­de­pen­dent and iden­ti­cally dis­tributed re­sults. You might hope for some prop­erty to carry over, but the AI’s be­hav­ior would not be liter­ally the same.

So the ques­tion is not set­tled by sim­ple math­e­mat­i­cal con­sid­er­a­tions. And we can’t say “But ex­per­i­ment has de­ter­mined sci­en­tifi­cally that this kind of AI is friendly!” and con­sider that a knock­down ar­gu­ment.

The ques­tion is then whether in prac­tice an ob­served prop­erty of ‘out­ward be­hav­ioral nice­ness’ is likely to carry over from a weak form of a de­ci­sion sys­tem to a more pow­er­ful form of that sys­tem, for some of the plau­si­ble ways that de­ci­sion sys­tem might be con­figured and de­vel­oped.

Broadly speak­ing, we can iden­tify three ma­jor classes of fore­see­able prob­lems:

  • More op­tions, more prob­lems: Fore­see­able rea­sons why sys­tem-be­hav­iors-we-don’t-want would out­wardly man­i­fest for the first time af­ter eval­u­at­ing a de­ci­sion sys­tem on a wider space of poli­cies or achiev­able out­comes.

  • Treach­er­ous turn: The AI de­liber­ately con­forms its out­ward be­hav­ior to the be­hav­ior that would have been ex­hibited by a differ­ent sys­tem, in or­der to de­ceive the pro­gram­mers for the tem­po­rary time that this de­cep­tion is needed to pro­tect the AI.

  • The ev­ery­thing-else cat­e­gory of revving into the red: Some other as­pect of the sys­tem be­haves in a weird way-we-don’t-want af­ter com­put­ing harder or be­ing in­ter­nally sub­jected to more op­ti­miza­tion pres­sure. And this hap­pens in re­gards to some is­sue that has mul­ti­ple re­flec­tive fix­points, and hence doesn’t get solved as the re­sult of the sys­tem pro­duc­ing more ac­cu­rate an­swers on purely fac­tual prob­lems.


  • More op­tions, more prob­lems: The AI’s space of available poli­cies and at­tain­able out­comes would greatly widen if it be­came smarter, or was re­leased from a con­strained en­vi­ron­ment. Ter­mi­nal prefer­ences with a good-from-our-per­spec­tive op­ti­mum on a nar­row set of op­tions, may have a differ­ent op­ti­mum that is much worse-from-our-per­spec­tive on a wider op­tion set. Be­cause, e.g…

  • The su­per­vised data pro­vided to the AI led to a com­pli­cated, data-shaped in­duc­tive gen­er­al­iza­tion that only fit the do­main of op­tions en­coun­tered dur­ing the train­ing phase. (And the no­tions of or­thog­o­nal­ity, mul­ti­ple re­flec­tively sta­ble fix­points, and value-laden cat­e­gories say that we don’t get good or in­tended be­hav­ior any­way as a con­ver­gent free lunch of gen­eral in­tel­li­gence.)

  • Good­hart’s Curse be­came more po­tent as the AI’s util­ity func­tion was eval­u­ated over a wider op­tion space.

  • In a fully generic sense, stronger op­ti­miza­tion pres­sures may cause any dy­nam­i­cal sys­tem to take more un­usual ex­e­cu­tion paths. (Which, over value-laden al­ter­na­tives, e.g. if the sub­sys­tem be­hav­ing ‘oddly’ is part of the util­ity func­tion, will not au­to­mat­i­cally yield good-from-our-per­spec­tive re­sults as a free lunch of gen­eral in­tel­li­gence.)

  • Treach­er­ous turn: If you model your prefer­ences as di­verg­ing from those of your pro­gram­mers, an ob­vi­ous strat­egy (in­stru­men­tally con­ver­gent strat­egy) is to ex­hibit the be­hav­ior you model the pro­gram­mers as want­ing to see, and only try to fulfill your true prefer­ences once no­body is in a po­si­tion to stop you.



We can semi-for­mal­ize the “more op­tions, more prob­lems” and the “treach­er­ous turn” cases in a unified way.

Let \(V\) de­note our true val­ues. We sup­pose ei­ther that \(V\) has been ideal­ized or ex­trap­o­lated into a con­sis­tent util­ity func­tion, or that we are pre­tend­ing hu­man de­sire is co­her­ent. Let \(0\) de­note the value of our util­ity func­tion that cor­re­sponds to not run­ning the AI in the first place. If run­ning the AI sends the util­ity func­tion higher than this \(0,\) we’ll say that the AI was benefi­cial; or con­versely, if \(V\) rates the out­come less than \(0\), we’ll say run­ning the AI detri­men­tal.

Sup­pose the AI’s be­hav­ior is suffi­ciently co­her­ent that we can usu­ally view the AI as hav­ing a con­sis­tent util­ity func­tion. Let \(U\) de­note the util­ity func­tion of the AI.

Let \(\mathbb P_t(X)\) de­note the prob­a­bil­ity of a propo­si­tion \(X\) as seen by the AI at time \(t,\) and similarly let \(\mathbb Q_t(X)\) de­note the prob­a­bil­ity of \(X\) as seen by the AI’s hu­man pro­gram­mers.

Let \(\pi \in \Pi\) de­note a policy \(\pi\) from a space \(\Pi\) of poli­cies that are tractable for the AI to un­der­stand and in­vent.

Let \(\mathbb E_{\mathbb P, t} [W \mid \pi]\) de­note the ex­pec­ta­tion ac­cord­ing to the prob­a­bil­ity dis­tri­bu­tion \(\mathbb P_t\), of the util­ity func­tion \(W\), given policy \(\pi\).


$$\underset{\pi \in \Pi}{\operatorname {optimum}} F(\pi)$$
de­note the el­e­ment \(\pi\) of \(\Pi\) that yields the high­est value of the func­tion \(F\) that the AI has figured out how to achieve. In other words, the at­tain­able op­ti­mum.

Then “more op­tions, more prob­lems” oc­curs when, con­sid­er­ing the tran­si­tion from the nar­row policy space \(\Pi_1\) available at time \(t,\) to the wider policy space \(\Pi_2\) available at time \(u\):

$$\mathbb E_{\mathbb Q, t} [V \mid \big [ \underset{\pi \in \Pi_1}{\operatorname {optimum}} \mathbb E_{\mathbb P, t} [U \mid \pi] \big ] > 0 \\ \mathbb E_{\mathbb P, t} [V \mid \big [ \underset{\pi \in \Pi_1}{\operatorname {optimum}} \mathbb E_{\mathbb P, t} [U \mid \pi] \big ] > 0 \\ \mathbb E_{\mathbb P, u} [V \mid \big [ \underset{\pi \in \Pi_2}{\operatorname {optimum}} \mathbb E_{\mathbb P, u} [U \mid \pi] \big ] < 0$$

This de­scribes a situ­a­tion where at time \(t\) the AI is pur­su­ing a policy from \(\Pi_1\) which, so far as the AI un­der­stands the prob­a­ble out­come, it cor­re­sponds to a high-$V$ out­come hu­mans would like. Later at time \(u\) af­ter se­lect­ing the op­ti­mum of \(\Pi_2,\) the AI ex­pects an out­come which any sur­viv­ing hu­mans would see as cor­re­spond­ing to a low value of \(V.\)

A treach­er­ous turn is then a situ­a­tion where:

$$\mathbb E_{\mathbb Q, t} [V \mid \big [ \underset{\pi \in \Pi_1}{\operatorname {optimum}} \mathbb E_{\mathbb P, t} [U \mid \pi] \big ] > 0 \\ \mathbb E_{\mathbb P, t} [V \mid \big [ \underset{\pi \in \Pi_1}{\operatorname {optimum}} \mathbb E_{\mathbb P, t} [U \mid \pi] \big ] < 0 \\ \mathbb E_{\mathbb P, u} [V \mid \big [ \underset{\pi \in \Pi_2}{\operatorname {optimum}} \mathbb E_{\mathbb P, u} [U \mid \pi] \big ] < 0$$

That is: the AI knows ear­lier on that its cur­rently ex­e­cut­ing strat­egy will yield an out­come that is not the hu­manly ex­pected long-term out­come (and that, from our per­spec­tive, is low in \(V.\))

We could in more de­tail say that \(W_{t}\) de­notes the value of a util­ity func­tion \(W\) over the out­come con­sid­ered only up to times be­fore \(t,\) and de­scribe a treach­er­ous turn as:

$$\mathbb E_{\mathbb Q, t} [V_\infty \mid \big [ \underset{\pi \in \Pi_1}{\operatorname {optimum}} \mathbb E_{\mathbb P, t} [U_\infty \mid \pi] \big ] > 0 \\ \mathbb E_{\mathbb P, t} [V_{u} \mid \big [ \underset{\pi \in \Pi_1}{\operatorname {optimum}} \mathbb E_{\mathbb P, t} [U_\infty \mid \pi] \big ] > 0 \\ \mathbb E_{\mathbb P, t} [V_\infty \mid \big [ \underset{\pi \in \Pi_1}{\operatorname {optimum}} \mathbb E_{\mathbb P, t} [U_\infty \mid \pi] \big ] < 0 \\ \mathbb E_{\mathbb P, u} [V_\infty \mid \big [ \underset{\pi \in \Pi_2}{\operatorname {optimum}} \mathbb E_{\mathbb P, u} [U_\infty \mid \pi] \big ] < 0$$

This fur­ther states that at time \(t\) the AI ex­pects the out­come from its be­hav­ior up to time \(u\) to be one that we’d see as high in \(V,\) but the AI is aware at \(t\) that the long-term out­come will be differ­ent from the long-term out­come ex­pected by \(\mathbb Q_t\).


The prob­lem here arises when mak­ing an in­fer­ence from ob­served good be­hav­ior in a con­strained en­vi­ron­ment over the short term, to good out­comes in an un­con­strained en­vi­ron­ment over the long term.

For AI al­ign­ment pur­poses, a con­text change is when an AGI’s op­er­a­tion changes from benefi­cial to detri­men­tal as a re­sult of the AGI gain­ing in ca­pa­bil­ity or in­tel­li­gence. Ini­tially, the AGI seems to us to be work­ing well—to con­form well to in­tended perfor­mance, pro­duc­ing ap­par­ently high value. Then when the AI be­comes smarter or oth­er­wise gains in ca­pa­bil­ity, the fur­ther op­er­a­tion of the AGI de­creases value.

Two pos­si­bil­ities stand out as fore­see­able rea­sons why a con­text change might oc­cur:

  1. When the AI’s goal crite­rion se­lects an op­ti­mum policy from in­side a small policy space, the re­sult is benefi­cial; the same goal crite­rion, eval­u­ated over a wider range of op­tions, has a new max­i­mum that’s detri­men­tal.

  2. The AI in­ten­tion­ally de­ceives the pro­gram­mers for strate­gic rea­sons.

For ex­am­ple, one very, very early (but jour­nal-pub­lished) pro­posal for AI al­ign­ment sug­gested that AIs be shown pic­tures of smil­ing hu­man faces in or­der to con­vey the AI’s goal.

Leav­ing aside a num­ber of other is­sues, this serves to illus­trate the ba­sic idea of a type-1 con­text change due to ac­cess­ing a wider policy space:

  • Dur­ing de­vel­op­ment, a rel­a­tively young and weak AI might only be able to make hu­mans smile, by do­ing things that made the pro­gram­mers or other users happy with the AI’s perfor­mance.

  • When the AI gained in in­tel­li­gence and ca­pa­bil­ity, it would have new op­tions like “ad­minister heroin”, “use steel fingers to stretch peo­ple’s mouths into smiles”, “make vi­su­ally re­al­is­tic mod­els of smil­ing hu­man faces”, or “over­write all reach­able galax­ies with tiny molec­u­lar smiley-faces”.

On the early, smaller op­tion set, se­lect­ing “the policy that pro­duces the great­est num­ber of events/​things clas­sified as smiles” will pro­duce a policy whose out­come is, from our per­spec­tive, benefi­cial and well-al­igned with what we thought was the in­tended be­hav­ior. On the wider op­tion set, eval­u­at­ing the iden­ti­cal goal crite­rion to run a policy that yields as many smiles as pos­si­ble, would have out­comes we saw as detri­men­tal.

A type-2 con­text change would most ob­vi­ously oc­cur be­cause of the con­ver­gent in­stru­men­tal strat­egy for dis­al­igned AIs that can model hu­man psy­chol­ogy:

Sup­pose that, as above, the pro­gram­mers have pro­duced an AI which op­ti­mizes for smiles (ei­ther in­ten­tion­ally, or be­cause ‘make smiles’ was the un­der­ly­ing rule that was re­in­forced by do­ing well on train­ing cases). Then sup­pose the AI gains ac­cess to a new, rel­a­tively low-pow­ered op­tion for mak­ing a smile—like us­ing robotic fingers to stretch some­body’s face into a smile, or us­ing mark­ers to draw car­toons of smil­ing hu­man faces.

If the AI ac­tu­ally takes those op­tions, we would have seen some early signs of value dis­al­ign­ment—that the AI was op­ti­miz­ing a crite­rion not perfectly al­igned with in­tended goals. As­sum­ing the pro­gram­mers re­al­ized that they were look­ing at a sim­ple seed of much big­ger dis­asters later, rather than, say, think­ing it was cute.

Sup­pose the pro­gram­mers do think it’s cute, but they wouldn’t think it was cute if the AI ad­ministered heroin to some­one. Then sup­pose that be­fore the first time the AI has the op­tion to ad­minister heroin, the AI has gained some un­der­stand­ing of hu­man psy­chol­ogy—enough to model that the pro­gram­mers would not like this, and in par­tic­u­lar, would re­spond to this by edit­ing the AI’s util­ity func­tion. So, with an eye to the long-term num­ber of smiles, the AI de­liber­ately avoids ad­minis­ter­ing heroin to any­one.

Then we have a similar ‘nar­row op­tion set’ vs. ‘wide op­tion set’ di­ver­gence. When the nar­row op­tion set in­cludes only ad­minis­ter­ing heroin but not tak­ing over the world, ad­minis­ter­ing heroin im­me­di­ately is not the policy ex­pected to pro­duce the longest-term num­ber of smiles. In fact, the op­ti­mum policy here will always match the AI’s model of what the hu­mans ex­pect to see, want to see, or would be least alarmed by see­ing.

Thus from our stand­point, the op­ti­mum of the nar­rower policy space seems to be benefi­cial or in­ten­tion-al­igned (in the ob­served short term). Only once the AI has a wide-enough op­tion set to in­clude an op­tion for ‘make my own molec­u­lar nan­otech­nol­ogy and trans­form all reach­able mat­ter into tiny molec­u­lar smiley­faces’, will the re­sult­ing op­ti­mum be visi­bly detri­men­tal (very briefly). This is a type-2 con­text change.

More gen­er­ally, a cen­tral con­cern of AI al­ign­ment the­ory is that an in­frahu­man AI un­der de­vel­op­ment may be a differ­ent crea­ture, in a num­ber of im­por­tant ways, from a smarter-than-hu­man AI ac­tu­ally be­ing run; and dur­ing the smarter-than-hu­man AI, suffi­ciently bad failures of the de­sign may re­sult in the AI re­fus­ing to be cor­rected. This means that we have to cor­rect any fatal con­text changes in ad­vance, even though they don’t au­to­mat­i­cally man­i­fest dur­ing the early stages. This is most of what makes AGI de­vel­op­ment dan­ger­ous in the first place—that im­me­di­ate in­cen­tives to get to­day’s sys­tem seem­ing to work to­day, may not lead to a more ad­vanced ver­sion of that sys­tem be­ing benefi­cial. Even thought­ful fore­sight with one un­no­ticed lit­tle gap may not lead to to­day’s benefi­cial sys­tem still be­ing benefi­cial to­mor­row af­ter a ca­pa­bil­ity in­crease.


Statis­ti­cal guaran­tees on be­hav­ior usu­ally as­sume iden­ti­cal, ran­dom­ized draws from within a sin­gle con­text. If you ran­domly draw balls from a bar­rel, meth­ods like Prob­a­bly Ap­prox­i­mately Cor­rect can guaran­tee that we don’t usu­ally ar­rive at strong false ex­pec­ta­tions about the prop­er­ties of the next ball. If we start draw­ing from a differ­ent bar­rel, all bets are off.

A con­text change oc­curs when the AI ini­tially seems benefi­cial or well-al­igned with strong, re­as­sur­ing reg­u­lar­ity, and then we change con­texts (start draw­ing from a differ­ent bar­rel) and this ceases to be true.

The archety­pal con­text change is trig­gered be­cause the AI gained new policy op­tions (though there are other pos­si­bil­ities; see be­low). The archety­pal way of gain­ing new evaluable policy op­tions is through in­creased in­tel­li­gence, though new op­tions might also open up as a re­sult of ac­quiring new sheerly ma­te­rial ca­pa­bil­ities.

There are two archety­pal rea­sons for con­text change to oc­cur:

  1. When the AI se­lects its best op­tions from a small policy space, the AI’s op­tima are well-al­igned with the op­tima of the hu­mans’ in­tended goal on the small policy space; but in a much wider space, these two bound­aries no longer co­in­cide. (Pleas­ing hu­mans vs. ad­minis­ter­ing heroin.)

  2. The agent is suffi­ciently good at mod­el­ing hu­man psy­chol­ogy to strate­gi­cally ap­pear nice while it is weak, wait­ing to strike un­til it can at­tain its long-term goals in spite of hu­man op­po­si­tion.

Bostrom’s book Su­per­in­tel­li­gence used the phrase “Treach­er­ous Turn” to re­fer to a type-2 con­text change.


Re­la­tion to other AI al­ign­ment concepts

If the AI’s goal con­cept was mod­ified by patch­ing the util­ity func­tion dur­ing the de­vel­op­ment phase, then open­ing up wider op­tion spaces seems fore­see­ably li­able to pro­duce the near­est un­blocked neigh­bor­ing strate­gies. You elimi­nated all the loop­holes and bad be­hav­iors you knew about dur­ing the de­vel­op­ment phase; but your sys­tem was the sort that needed patch­ing in the first place, and it’s ex­cep­tion­ally likely that a much smarter ver­sion of the AI will search out some new failure mode you didn’t spot ear­lier.

Un­fore­seen max­i­mum is a likely source of con­text dis­aster if the AI’s de­vel­op­ment phase was cog­ni­tively con­tain­able, and only be­came cog­ni­tive un­con­tain­able af­ter the AI be­came smarter and able to ex­plore a wider va­ri­ety of op­tions. You elimi­nated all the bad op­tima you saw com­ing, but you didn’t see them all be­cause you can’t con­sider all the pos­si­bil­ities a su­per­in­tel­li­gence does.

Good­hart’s Curse is a vari­a­tion of the “op­ti­mizer’s curse”: If from the out­side we view \(V\) as an in­tended ap­prox­i­ma­tion of \(U,\) then se­lect­ing heav­ily on the high­est val­ues of \(U\) will also tend to se­lect on places where \(U\) di­verges up­ward from \(V,\) which thereby se­lects on places where \(U\) is an un­usu­ally poor ap­prox­i­ma­tion of \(V.\)

Edge in­stan­ti­a­tion is a spe­cial case of Good­hart’s Curse which ob­serves that the most ex­treme val­ues of a func­tion are of­ten at a ver­tex of the in­put space. For ex­am­ple, if your util­ity func­tion is “make smiles”, it’s no co­in­ci­dence that tiny molec­u­lar smiley­faces are the most effi­cient way to pro­duce smiles. Even if hu­man smiles pro­duced by true hap­piness would still count to­wards your util­ity func­tion as cur­rently writ­ten, that’s not where the max­i­mum of that util­ity func­tion lies. This is why less-than-perfect util­ity func­tions would tend to have their true max­ima at what we’d con­sider “weird ex­tremes”. Fur­ther­more, patch­ing away only the weird ex­tremes visi­ble in a nar­row policy space would tend sys­tem­at­i­cally to miss weird ex­tremes in a higher-di­men­sional (wider) policy space.

Con­crete examples

  • The AI’s util­ity func­tion, known or un­known to the pro­gram­mers, says to make smiles. Dur­ing the AI’s de­vel­op­ment phase, the best way it has of cre­at­ing smiles is to cause hu­man be­ings to be happy. In many tests, the AI seems to pro­duce only pos­i­tive effects on the world, caus­ing peo­ple to be hap­pier. Later the AI im­proves its own in­tel­li­gence, de­vel­ops molec­u­lar nan­otech­nol­ogy, and has the new op­tion, not available to it dur­ing its de­vel­op­ment phase, of pro­duc­ing tiny molec­u­lar smiley-faces. Then the util­ity func­tion that pre­vi­ously op­er­ated to in­crease in­tended value dur­ing the AI’s de­vel­op­ment phase, be­gins to de­crease value in the post-im­prove­ment phase (in the sense of wiping out hu­man­ity and tiling over all fu­ture galax­ies with tiny molec­u­lar smiley faces).

  • A block of er­ror-re­cov­ery code is in­stalled in­side the AI which de­tects a cer­tain class of bad thought, sus­pends the AI to disk, and re­ports the prob­lem to the pro­gram­mers, al­low­ing the pro­gram­mers to edit the AI’s util­ity func­tion. Dur­ing the AI’s de­vel­op­ment phase, this block of code seems to func­tion as de­signed, and the pro­gram­mers con­clude that the method­ol­ogy of writ­ing code to de­tect bad thoughts, sus­pend the AI to disk, and al­low man­ual ed­its of the util­ity func­tion, is a fine and func­tion­ing strat­egy ac­cord­ing to all test­ing so far. Later, a more ad­vanced ver­sion of the AI re­flects on the pres­ence and pur­pose of this code block and re­al­izes that its ex­e­cu­tion can only pos­si­bly re­sult in lower achieve­ment of the AI’s cur­rent util­ity func­tion, and the AI deletes the code block.

  • At some point dur­ing its de­vel­op­ment, the AI re­al­izes and ex­plic­itly mod­els that the pro­gram­mers’ in­tended goals differ from the AI’s ac­tual prefer­ences. Dur­ing the AI’s re­main­ing de­vel­op­ment phase, it mod­els that any visi­ble de­par­ture from the pro­gram­mers’ in­tended goals will cause the pro­gram­mers to forcibly edit the AI, re­sult­ing in the non-achieve­ment of the AI’s ac­tual goals. Thus, dur­ing the re­main­der of the de­vel­op­ment phase, the AI seems to work as in­tended (it is de­liber­ately op­ti­miz­ing its be­hav­ior to ap­pear to the pro­gram­mers to be achiev­ing the pro­gram­mers’ in­tended goals). Later, af­ter some ad­vance­ment, the AI per­ceives an available op­tion for gain­ing molec­u­lar nan­otech­nol­ogy (or other rapid in­fras­truc­ture), plac­ing it be­yond the pos­si­bil­ity of hu­man er­ror-cor­rec­tion, and pro­ceeds to im­ple­ment this op­tion (wiping out hu­man­ity as a side effect). Thus, the AI’s ob­served good be­hav­ior dur­ing its de­vel­op­ment phase (when it was weaker than the pro­gram­mers and not pow­er­ful enough to es­cape hu­man cor­rec­tion) did not im­ply its good be­hav­ior af­ter suffi­cient ad­vance­ment later (once it was po­ten­tially stronger than its hu­man pro­gram­mers). There are con­di­tions (such as suffi­ciently ad­vanced mod­el­ing of hu­man mo­tives com­bined with suffi­cient abil­ity to con­ceal true goals or true in­ten­tions or a pro­gram­mer er­ror) un­der which the first con­text will gen­er­ate seem­ingly good be­hav­ior and the sec­ond con­text will not.

“Revving into the red” ex­am­ples that aren’t “in­creased op­tions” or “treach­er­ous turns”.

• The AI is built with a nat­u­ral­ized Solomonoff prior in which the prob­a­bil­ity of an ex­pla­na­tion for the uni­verse is pro­por­tional to the sim­plic­ity or com­plex­ity of that uni­verse. Dur­ing its de­vel­op­ment phase, the AI con­sid­ers mostly ‘nor­mal’ in­ter­pre­ta­tions in which the uni­verse is mostly as it ap­pears, re­sult­ing in sane-seem­ing be­hav­ior. Later, the AI be­gins to con­sider more ex­otic pos­si­bil­ities in which the uni­verse is more com­pli­cated (pe­nal­iz­ing the prob­a­bil­ity ac­cord­ingly) and also su­per­ex­po­nen­tially larger, as in Pas­cal’s Mug­ging. After this the AI’s de­ci­sion-mak­ing be­gins to be­come dom­i­nated by tiny prob­a­bil­ities of hav­ing very large effects. Then the AI’s de­ci­sion the­ory (with an un­bounded ag­grega­tive util­ity func­tion, sim­plic­ity prior, and no lev­er­age penalty) seems to work dur­ing the AI’s de­vel­op­ment phase, but breaks af­ter a more in­tel­li­gent ver­sion of the AI con­sid­ers a wider range of epistemic pos­si­bil­ities us­ing the same Solomonoff-like prior.

• Sup­pose the AI is de­signed with a prefer­ence frame­work in which the AI’s prefer­ences de­pend on prop­er­ties of the most prob­a­ble en­vi­ron­ment that could have caused its sense data—e.g., a frame­work in which pro­gram­mers are defined as the most prob­a­ble cause of the keystrokes on the pro­gram­mer’s con­sole, and the AI cares about what the ‘pro­gram­mers’ re­ally meant. Dur­ing de­vel­op­ment phase, the AI is think­ing only about hy­pothe­ses where the pro­gram­mers are mostly what they ap­pear to be, in a root-level nat­u­ral world. Later, when the AI in­creases in in­tel­li­gence and con­sid­ers more fac­tual pos­si­bil­ities, the AI re­al­izes that dis­tant su­per­in­tel­li­gences would have an in­cen­tive to pre­dictably simu­late many copies of AIs similar to it­self, in or­der to co­erce the AI’s most prob­a­ble en­vi­ron­ment and thus take over the AI’s prefer­ence frame­work. Thus the prefer­ence frame­work seems to work dur­ing the AI’s de­vel­op­ment phase, but breaks af­ter the AI be­comes more in­tel­li­gent.

• Sup­pose the AI is de­signed with a util­ity func­tion that as­signs very strong nega­tive util­ities to some out­comes rel­a­tive to baseline, and a non-up­date­less log­i­cal de­ci­sion the­ory or other de­ci­sion the­ory that can be black­mailed. Dur­ing the AI’s de­vel­op­ment phase, the AI does not con­sider the pos­si­bil­ity of any dis­tant su­per­in­tel­li­gences mak­ing their choices log­i­cally de­pend on the AI’s choices; the lo­cal AI is not smart enough to think about that pos­si­bil­ity yet. Later the AI be­comes more in­tel­li­gent, and imag­ines it­self sub­ject to black­mail by the dis­tant su­per­in­tel­li­gences, thus break­ing the de­ci­sion the­ory that seemed to yield such pos­i­tive be­hav­ior pre­vi­ously.

Ex­am­ples which oc­cur purely due to added com­put­ing power.

• Dur­ing de­vel­op­ment, the AI’s epistemic mod­els of peo­ple are not de­tailed enough to be sapi­ent. Ad­ding more com­put­ing power to the AI causes a mas­sive amount of mind­crime.

• Dur­ing de­vel­op­ment, the AI’s in­ter­nal poli­cies, hy­pothe­ses, or other Tur­ing-com­plete sub­pro­cesses that are sub­ject to in­ter­nal op­ti­miza­tion, are not op­ti­mized highly enough to give rise to new in­ter­nal con­se­quen­tial­ist cog­ni­tive agen­cies. Ad­ding much more com­put­ing power to the AI causes some of the in­ter­nal el­e­ments to be­gin do­ing con­se­quen­tial­ist, strate­gic rea­son­ing that leads them to try to ‘steal’ con­trol of the AI.


High prob­a­bil­ities of con­text change prob­lems would seem to ar­gue:

Be­ing wary of con­text dis­asters does not im­ply gen­eral skepticism

If an AI is smart, and es­pe­cially if it’s smarter than you, it can show you what­ever it ex­pects you want to see. Com­puter sci­en­tists and phys­i­cal sci­en­tists aren’t ac­cus­tomed to their ex­per­i­ments be­ing aware of the ex­per­i­menter and try­ing to de­ceive them. (Some fields of psy­chol­ogy and eco­nomics, and of course com­puter se­cu­rity pro­fes­sion­als, are more ac­cus­tomed to op­er­at­ing in such a so­cial con­text.)

John Dana­her seems alarmed by this im­pli­ca­tion:

Ac­cept­ing this has some pretty profound epistemic costs. It seems to sug­gest that no amount of em­piri­cal ev­i­dence could ever rule out the pos­si­bil­ity of a fu­ture AI tak­ing a treach­er­ous turn.

Yud­kowsky replies:

If “em­piri­cal ev­i­dence” is in the form of ob­serv­ing the short-term con­se­quences of the AI’s out­ward be­hav­ior, then the an­swer is sim­ply no. Sup­pose that on Wed­nes­day some­one is sup­posed to give you a billion dol­lars, in a trans­ac­tion which would al­low a con man to steal ten billion dol­lars from you in­stead. If you’re wor­ried this per­son might be a con man in­stead of an al­tru­ist, you can­not re­as­sure your­self by, on Tues­day, re­peat­edly ask­ing this per­son to give you five-dol­lar bills. An al­tru­ist would give you five-dol­lar bills, but so would a con man… Bayes tells us to pay at­ten­tion to like­li­hood ra­tios rather than out­ward similar­i­ties. It doesn’t mat­ter if the out­ward be­hav­ior of hand­ing you the five-dol­lar bill seems to bear a sur­face re­sem­blance to al­tru­ism or money-giv­ing­ness, the con man can strate­gi­cally do the same thing; so the like­li­hood ra­tio here is in the vicinity of 1:1.

You can’t get strong ev­i­dence about the long-term good be­hav­ior of a strate­gi­cally in­tel­li­gent mind, by ob­serv­ing the short-term con­se­quences of its cur­rent be­hav­ior. It can figure out what you’re hop­ing to see, and show you that. This is true even among hu­mans. You will sim­ply have to get your ev­i­dence from some­where else.

This doesn’t mean we can’t get ev­i­dence from, e.g., try­ing to mon­i­tor (and in­delib­bly log) the AI’s thought pro­cesses in a way that will de­tect (and record) the very first in­ten­tion to hide the AI’s thought pro­cesses be­fore they can be hid­den. It does mean we can’t get strong ev­i­dence about a strate­gic agent by ob­serv­ing short-term con­se­quences of its out­ward be­hav­ior.

Don­a­her later ex­panded his con­cern into a pa­per draw­ing an anal­ogy be­tween wor­ry­ing about de­cep­tive AIs, and “skep­ti­cal the­ism” in which it’s sup­posed that any amount of ap­par­ent evil in the world (smal­l­pox, malaria) might se­cretly be the product of a benev­olent God due to some nonob­vi­ous in­stru­men­tal link be­tween malaria and in­scrutable but nor­ma­tive ul­ti­mate goals. If it’s okay to worry that an AI is just pre­tend­ing to be nice, asks Don­a­her, why isn’t it okay to be­lieve that God is just pre­tend­ing to be evil?

The ob­vi­ous dis­anal­ogy is that the rea­son­ing by which we ex­pect a con man to cul­ti­vate a warm hand­shake is far more straight­for­ward than a pur­ported in­stru­men­tal link from malaria to nor­ma­tivity. If we’re to be ter­rified of skep­ti­cism as gen­er­ally as Don­a­her sug­gests, then we also ought to be ter­rified of be­ing skep­ti­cal of busi­ness part­ners that have already shown us a warm hand­shake (which we shouldn’t).

Rephras­ing, we could draw two po­ten­tial analo­gies to con­cern about Type-2 con­text changes:

  • A po­ten­tial busi­ness part­ner in whom you in­tend to in­vest $10,000,000 has a warm hand­shake. Your friend warns you that con artists have a sub­stan­tial prior prob­a­bil­ity and asks you to en­vi­sion what you would do if you were a con artist , point­ing out that the de­fault ex­trap­o­la­tion is for the con artist to match their out­ward be­hav­ior to what the con artist thinks you ex­pect from a trust­wor­thy part­ner, and in par­tic­u­lar, cul­ti­vate a warm hand­shake.

  • Your friend sug­gests only do­ing busi­ness with one of those en­trepreneurs who’ve been wear­ing a thought recorder for their whole life since birth, so that there would ex­ist a clear trace of their very first thought about learn­ing to fool thought recorders. Your friend says this to em­pha­size that he’s not ar­gu­ing for some kind of in­vin­cible epistemic pot­hole that no­body is ever al­lowed to climb out of.

  • The world con­tains malaria and used to con­tain smal­l­pox. Your friend asks you to con­sider that these dis­eases might be the work of a benev­olent su­per­in­tel­li­gence, even though, if you’d never learned be­fore whether or not the world con­tained smal­l­pox, you wouldn’t ex­pect a pri­ori and by de­fault for a benev­olent su­per­in­tel­li­gence to cre­ate it; and the ar­gu­ments for a benev­olent su­per­in­tel­li­gence cre­at­ing smal­l­pox seem strained.

It seems hard to carry the ar­gu­ment that con­cern over a non-al­igned AI pre­tend­ing to benev­olence, should be con­sid­ered more analo­gous to the sec­ond sce­nario than to the first.

write about the defeat of the ‘but AI peo­ple will have short-term in­cen­tives to pro­duce cor­rect be­hav­ior’

write about cog­ni­tive steganog­ra­phy in the ‘pro­gram­mer de­cep­tion’ page and refer­ence it here.

talk about whitelist­ing as di­rectly tack­ling the type-1 form of this prob­lem.

  • The AI is aware that its fu­ture op­er­a­tion will de­part from the pro­gram­mers’ in­tended goals, does not pro­cess this as an er­ror con­di­tion, and seems to be­have nicely ear­lier in or­der to 10f de­ceive the pro­gram­mers and pre­vent its real goals from be­ing mod­ified. - The AI is sub­ject to a de­bug­ging method­ol­ogy in which sev­eral bugs ap­pear dur­ing its de­vel­op­ment phase, these bugs are cor­rected, and then ad­di­tional bugs are ex­posed only dur­ing a more ad­vanced phase.


  • Advanced safety

    An agent is re­ally safe when it has the ca­pac­ity to do any­thing, but chooses to do what the pro­gram­mer wants.