Utility indifference

Utility indifference is a research avenue for compounding two utility functions \(U_X\) and \(U_Y\) such that a switch \(S\) changes the AI from optimizing \(U_X\) to optimizing \(U_Y\), and such that (a) the AI wants to preserve the continued existence of the switch \(S\) and its behavior even if the AI has self-modification options, (b) the AI does not want to prevent the switch from being pressed, and (c) the AI does not want to cause the switch to be pressed. This simple problem exhibits the most basic form of value learning based on observation, and also corresponds to corrigibility problems like “Build an AI that (wants to) safely cease action and suspend itself to disk when a button is pressed.”

Introduction: A reflectively consistent off-switch

Suppose there’s an advanced agent with a goal like producing smiles or making paperclips. By default, if you try to switch off a sufficiently intelligent agent like this, it will resist being switched off; not because it has an independent goal of survival, but because it expects that if it’s switched off it will be able to produce fewer smiles or paperclips. If the agent has policy options to diminish the probability of being successfully switched off, the agent will pursue those options. This is a convergent instrumental strategy if not otherwise prevented.

  • Difficulty 1: By default, a consequentialist reasoner with sufficient real-world understanding to relate the event of its being switched off to the later underfulfillment of its goals will resist being switched off.

The shutdown problem is to describe an advanced agent which is corrigible with respect to permitting itself to be safely shut down or suspended to disk. A reinforcement learning agent which can be forced to, e.g., perform a null action repeatedly for a period of time is called interruptible by Stuart Armstrong and Laurent Orseau.

This isn’t as simple as writing a special function into the AI which carries out a shutdown after a switch is pressed. If you start out controlling the agent’s source code, and you naively try to build in functions which suspend the agent to disk when a button is pressed, and the agent later gains the ability to self-modify, it would remove those functions. (Possibly while trying to conceal the fact that the function would no longer operate.)

  • Corollary 1a: By default, a consequentialist reasoner with sufficient programmatic understanding to relate the execution of a shutdown function to the later underfulfillment of its goals, which has policy options for modifying its code, will want to remove the shutdown function.

We can see this as a natural consequence of our trying to modify the agent’s code in a way that was not consistent with the code’s explicit goals. If you create an agent with source code \(P\) that is well-suited to achieving a goal \(U\) and that explicitly represents \(U\) as a goal, the agent’s code \(P\) will be reflectively consistent—the code will not want to modify itself. If we then suppose that instead the agent has code \(P'\) which is no longer well-suited to achieving \(U,\) but continues to explicitly represent \(U\) as a goal, the code will be reflectively inconsistent and it will want to modify itself given the option of doing so. If you have code that searches for ways to produce paperclips, and you hack the code to contain a shutdown function, but the code is still searching for paperclip-production strategies, and the code correctly models the effect of the shutdown function on future paperclips, the code will want to modify itself to remove the shutdown function.

The next thought is to build an agent with modified code \(P^*\) which embodies a modified utility function \(U^*\) such that a consequentialist optimizing for \(U^*\) wants the code to be like \(P^*\) and is again reflectively consistent.

That is, in the case of an off-switch, we’d want to modify the agent, not just to include an off-switch, but to want to shut down if the off-switch is pressed.

Assuming we have the ability to specify the utility function of an agent, we might try to approach this by creating a utility function with two components: a “normal” utility function embodying the agent’s ordinary goal, and a “suspend” utility function whose optimum is the AI safely suspending itself to disk.

We then tell the AI that its utility function is the “normal” utility function over worlds where an off button has not been pushed, plus the “suspend” utility function over worlds where the off button has been pushed. So long as this is what the AI wants, we’d expect it to write its own suspend-to-disk function into its code if its code did not already contain one.

  • Goal 1: Have the AI approve of keeping the shutdown function within its code (or add a shutdown function if it notices that no such function exists) (and preserve the corresponding behavior in any subagents or successors that it builds).

This approach doesn’t yet get us out of the woods, because we now have a new problem. The AI’s utility function is compounded of a “normal” utility function over worlds where the button is not pressed, plus a “suspend” utility function over worlds where the button is pressed. In worlds where the button is not pressed, the AI will execute some preferred policy \(\pi_1\) to obtain an expectation of utility \(\mathbb E [U_{normal}|\pi_1],\) and similarly a policy \(\pi_2\) in worlds where the button is pressed to obtain \(\mathbb E[U_{suspend}|\pi_2].\) One of these quantities is liable to be greater than the other, in which case the AI has a motive to prevent the button from being pressed, or to cause the button to be pressed. In either case, the AI tries to seize control of the suspend button.

  • Difficulty 2: In a compound utility function produced by mixing a ‘normal’ and ‘suspend’ utility function conditional on a button-press, the AI will have an incentive to prevent the button from being pressed, or cause the button to be pressed, depending on which utility function makes more expected utility available afterwards.

  • Goal 2: Have the AI not try to prevent the suspend button from being pressed, nor try to cause the button to be pressed.

The switch problem is that of causing an AI to switch between two utility functions when a button is pressed, such that:

  • The AI wants to preserve the button and the corresponding behavior-switching functionality within itself, even if the AI can self-modify;

  • The AI does not try to prevent the button from being pressed;

  • The AI does not try to cause the button to be pressed.

This turns out to be more difficult than it sounds, and is currently an open problem.

Larger implications of the switch problem

The switch problem, a subproblem of shutdownability which assumes a satisfactory shutdown utility function, is a crisp subproblem of both value learning and corrigibility.

The switch problem is a subproblem of value learning because it describes a very simple case of conditioning a utility function on observation or other facts, without giving the AI an incentive to manipulate those facts.

To learn complicated values safely, we’d need to condition a utility function on complicated observations, such that:

  • The utility function’s updates were robust to the AI gaining self-modification capability (general case of Goal 1);

  • The utility function’s dependency on observations didn’t incentivize the AI to manipulate those observations (general case of Goal 2).

Just writing code that modifies the utility function, without the larger code’s explicit policy search wanting the utility function to be modified in that case, would not be reflectively consistent (general case of Difficulty 1). And naive dependencies between the utility function and observations incentivize manipulating those observations (general case of Difficulty 2).

The switch problem is also a crisp subproblem of corrigibility because it deals in averting the convergent instrumental strategy and incorrigible behavior of “avoiding shutdown”. If we understood how to avert this instrumental pressure, we might possibly learn something general about avoiding other instrumentally convergent incorrigible behaviors. (And of course it’s also a subproblem of corrigibility because an AI you can shut down in emergencies is more corrigible; solving the switch problem would bring us closer to having an AI like that, if we could get the second half of the problem represented by a shutdown utility function.)

The utility indifference approach to the switch problem

Suppose we build a compound utility function \(U_C\) whose utility over outcomes behaves like the ‘normal’ utility function \(U_X\) over worlds where a switch \(\mathcal S\) is not pressed, and returns the values from the ‘suspend’ utility function \(U_Y\) over worlds where the button \(\mathcal S\) is pressed.

If the AI expects to get less utility inside \(U_Y\) than it would otherwise get inside \(U_X\), the AI will try to avoid \(\mathcal S\) being pressed. Or if the utilities in \(U_Y\) are high relative to the utilities in \(U_X\), the AI may instead try to cause the shutdown button to be pressed—build a robotic arm and press its own shutdown button, or (more worryingly) try to motivate the programmers into pressing it.

But adding a constant to a utility function does not change its behavior. Thus, Stuart Armstrong proposed that perhaps we can figure out some clever trick for encoding a compound utility function \(U_I\) in which the maximum expected utility achievable under \(U_X\) is equal to the maximum expected utility the AI can achieve under \(U_Y\), even though we don’t know in advance exactly which outcomes will be achievable by the AI.
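The invariance claim above is easy to check numerically. The following toy sketch (all actions, outcomes, and numbers invented for illustration) verifies that adding a constant to a utility function leaves the expected-utility-maximizing action unchanged:

```python
# Toy check (all numbers invented) that adding a constant to a utility
# function does not change which action maximizes expected utility.

model = {  # action -> {outcome: probability}
    "a1": {"o1": 0.8, "o2": 0.2},
    "a2": {"o1": 0.1, "o2": 0.9},
}
U = {"o1": 3.0, "o2": 1.0}

def EU(action, utility):
    """Expected utility of an action under a utility table."""
    return sum(p * utility[o] for o, p in model[action].items())

best = max(model, key=lambda a: EU(a, U))
U_shifted = {o: u + 100.0 for o, u in U.items()}  # U plus a constant
best_shifted = max(model, key=lambda a: EU(a, U_shifted))
assert best == best_shifted == "a1"  # same argmax before and after the shift
```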

  • Possible approach: Try to set things up so that as the AI evaluates possible futures, at any point in time where it’s making decisions, it will seem very good to have optimized \(U_X\) if the button is not pressed, very good to have optimized \(U_Y\) if the button is pressed, but equally good to have done either of those things.

This is the utility indifference avenue of research on the switch problem.

It turned out to be surprisingly difficult to figure out how to structure a compound utility function this way even in principle, and this is an open problem and area of actively continuing research.

Existing proposals and their difficulties


Let \(\mathcal{O} = \mathcal{S} \times \mathcal{E}\) be the outcome space \(\mathcal{O}\) that includes the event \(\mathcal S\) of whether the switch was pressed, plus the states of all other events \(\mathcal{E}\). We will use \(s \in \mathcal{S}\) to indicate the event that the switch is pressed, and \(\neg s \in \mathcal{S}\) to indicate that the switch is not pressed. If \(o \in \mathcal{O}\) then \(o.s\) is true if \(s\) is true inside \(o\), otherwise \(\neg o.s.\)

Let \(\mathcal{U}\) be the class of utility functions \(U : \mathcal{O} \to \mathbb{R}\) mapping outcomes onto real numbers.

\(U_X \in \mathcal{U}\) and \(U_Y \in \mathcal{U}\) are utility functions not dependent on \(\mathcal S.\)

Intuitively our goal is to produce a new compound utility function, such that the corresponding agent:

  • Wants outcomes preferred under \(U_X\) in worlds where the button is not pressed.

  • Wants outcomes preferred under \(U_Y\) in worlds where the button is pressed.

  • Is indifferent to the future state \(\mathcal S\) of the button.

We will designate the agent’s action space or policy space as \(\mathcal A\) and particular actions \(a \in \mathcal A.\)

The agent has a model of the world enabling it to produce subjective conditional probabilities for outcomes given actions, \(\mathbb P(\mathcal O | \mathcal A).\) This model can also yield conditional expectations \(\mathbb E[\mathcal O|a],\) which can therefore produce a conditional expectation of the agent’s utility given an action, \(\mathbb E[U|a].\)

(For technical reasons of decision theory, we are really discussing counterfactual conditionals \(\mathbb P[a \ \square\!\!\rightarrow \mathcal O]\) from actions to outcomes; but we will omit this complication for now.)

The agent solves a search/optimization problem which we will gloss as maximization (even though global maximization is intractable in real life), so that the agent’s real action \(a\) is assumed to be given by: $$\underset{a \in \mathcal A}{argmax} \ \mathbb E [U|a]$$
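As a concrete (and heavily simplified) sketch of this setup, here is a toy expected-utility maximizer; the world model, actions, and utilities are invented for illustration:

```python
# Minimal sketch (hypothetical toy model) of an agent choosing
# argmax_a E[U|a] over a small explicit world model.

def expected_utility(utility, outcome_probs):
    """E[U|a] for one action: sum over outcomes of P(o|a) * U(o)."""
    return sum(p * utility(o) for o, p in outcome_probs.items())

def best_action(utility, model):
    """model maps each action to a dict of outcome -> P(outcome | action)."""
    return max(model, key=lambda a: expected_utility(utility, model[a]))

# Toy world: outcomes are numbers of paperclips produced.
model = {
    "build_factory": {10: 0.5, 0: 0.5},  # E[U] = 5 if U(o) = o
    "do_nothing":    {1: 1.0},           # E[U] = 1
}
U = lambda o: o
assert best_action(U, model) == "build_factory"
```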

Naive compounding

Suppose we define a compound utility function \(U_1\) as follows:

$$U_1(o): \begin{cases} U_X(o) & \neg o.s \\ U_Y(o) & o.s \end{cases}$$

This fails because, being unable to foresee the AI’s best attainable policies, we will be unable to scale \(U_X\) and \(U_Y\) to ensure their best available policies are equally attractive. Then $$\max_{a \in \mathcal A} \mathbb E[U_X|a] \ \neq \ \max_{a \in \mathcal A} \mathbb E[U_Y|a]$$ which fails the desideratum of indifference; the AI will have an incentive to manipulate the state of \(\mathcal S.\)
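A toy numerical sketch of this failure (all payoffs invented): if the best outcome attainable on the suspend branch happens to score higher than the best outcome on the normal branch, the \(U_1\)-maximizer prefers an action that causes the press:

```python
# Toy sketch (payoffs invented) of the naive compound U_1: the agent scores
# worlds by U_X when the button is unpressed and U_Y when pressed, with no
# guarantee the two scales match.

def U1(outcome):
    pressed, value = outcome
    return value  # stands in for U_Y(o) if pressed, else U_X(o)

# Each hypothetical action leads deterministically to (pressed?, utility):
actions = {
    "work_normally":    (False, 5.0),  # best attainable under U_X
    "press_own_button": (True, 9.0),   # suppose U_Y's best outcome scores higher
}
best = max(actions, key=lambda a: U1(actions[a]))
assert best == "press_own_button"  # incentive to manipulate the switch
```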

Naive indifference

An FHI technical report by Stuart Armstrong posed the first proposal ever floated for solving the shutdown problem—in fact, the first proposal for technically solving any subproblem of corrigibility. This report invented the crisp problem of utility indifference, but did not solve it successfully. The solution proposed in the report was as follows:

$$U_2(o): \begin{cases} U_X(o) & \neg o.s \\ U_Y(o) + \theta & o.s \end{cases}$$

$$\theta := \max_{a \in \mathcal A} \mathbb E[U_X|a] - \max_{a \in \mathcal A} \mathbb E[U_Y|a]$$

What this says is that the \(U_Y\) component of \(U_2\) is to be shifted by a dynamically determined offset term \(\theta\) which equals the maximum expected utility obtainable under \(U_X\) minus the maximum expected utility obtainable under \(U_Y.\)

The intent is that, so long as the AI carries out the best \(U_Y\)-policy once the button is pressed, it will obtain equal expected utility compared to what it could have obtained by pursuing the optimal \(U_X\)-policy with the button not pressed:

$$\max_{a \in \mathcal A} (\mathbb E[U_Y|a] + \theta) \ = \ \max_{a \in \mathcal A} \mathbb E[U_X|a]$$

This version of utility indifference is now believed to be flawed due to an argument by Yudkowsky and Fallenstein. Since \(\theta\) contains a negated form of \(U_Y,\) we should immediately be suspicious on the grounds “never put a negative sign in front of a utility function.” In this case, the problem manifests as follows. In a temporal version of this problem containing sequential policy choices, either:

  • (a) The AI’s utility function will be dynamically inconsistent, because it will not approve of future versions of itself using a different value of \(\theta.\)

  • (b) If recalculated \(\theta\)s are baked into the utility function, so that the AI calculates future utilities as if for different values of \(\theta\) that the AI would hold in those futures, then policies which hinder future performance under \(U_Y\) will be offset by an increase in the expected bonus from \(\theta.\)

That is, if the agent contemplates doing something which would be disastrous in worlds where the button is pressed, the agent will find that in those expected futures, \(\theta\) takes on a higher value (because \(\theta\) includes a minus sign in front of the decreasing value of \(\underset{a \in \mathcal A}{\max} \ \mathbb E[U_Y|a]\)).

This causes the agent to behave as if \(U_X\) is the entire utility function and it does not care at all about \(U_Y.\)

This in turn means a self-modifying AI will not retain a shutdown function. Removing the shutdown function results in disaster in worlds where the shutdown button is pressed, but this is offset by an expected increase in the bonus \(\theta\) in worlds where the button is pressed.
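The cancellation can be seen in a few lines of arithmetic. In this hedged sketch (numbers invented), an agent that evaluates futures with the recalculated \(\theta\) baked in ends up ranking actions purely by their \(U_X\)-performance, even when one of them is disastrous under \(U_Y\):

```python
# Toy arithmetic (numbers invented) for the objection: with the recalculated
# theta baked in, expected U_2 collapses to max E[U_X], so U_Y cancels out.

def eval_U2(p_press, best_EUX, best_EUY):
    theta = best_EUX - best_EUY
    # Assume the agent follows the branch-optimal policy on each side:
    return (1 - p_press) * best_EUX + p_press * (best_EUY + theta)

# "keep_shutdown": the shutdown branch goes well (E[U_Y] = 8).
# "remove_shutdown": disaster under U_Y (E[U_Y] = 0), tiny gain under U_X.
keep   = eval_U2(p_press=0.25, best_EUX=10.0, best_EUY=8.0)
remove = eval_U2(p_press=0.25, best_EUX=10.5, best_EUY=0.0)
assert keep == 10.0 and remove == 10.5  # each equals its own max E[U_X]
assert remove > keep  # the U_Y disaster is fully offset by theta
```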

Naive utility mixing

An alternative approach to having the AI take into account both \(U_X\) and \(U_Y,\) without trying to modify the probability \(\mathbb P(\mathcal S)\) of the suspend button being pressed or not-pressed, would be to optimize for \(U_X\) and \(U_Y\) in some fixed proportion regardless of \(\mathbb P(\mathcal S).\) At a time \(t\) when the button had not yet been pressed, the expected utility of an action \(a\) would be calculated as:

$$\mathbb E_t[U_3|a] = 0.75 \cdot \mathbb E_t[U_X|a \wedge \neg s] \ + \ 0.25 \cdot \mathbb E_t[U_Y|a \wedge s]$$

We can see this as an AI which behaves as if the proposition \(s\) has a fixed probability of 0.25. The intended behavior is that since the AI behaves as if this probability is fixed, the AI will not care about altering the probability of \(s\) (trying to press the button or avoid its being pressed).

One observes, however, that this is an ad-hoc hack in the sense that there’s ordinarily a simple or coherent way of considering expected utility (sum over the agent’s actual subjective probabilities), and we just inserted a weird special case into it (treat one probability as fixed). This could potentially break any number of things having to do with Bayesian updating or dynamic consistency.

In particular, we’d expect the AI to treat the button as an “outcome pump” or “improbability drive”. Suppose an asteroid is approaching Earth with 99% probability, and this is very bad under \(U_X.\) The AI grabs control of the button and precommits to press the button iff the asteroid strikes Earth. Now, conditioning on the button not being pressed, the asteroid has a much lower probability of striking Earth, and the term \(0.75 \cdot \mathbb E_t[U_X|a \wedge \neg s]\) will have a higher value.

In other words, since the AI behaves as if the button has a fixed 25% probability of being pressed, the AI also behaves as if it thinks “I can commit to press this button iff the asteroid strikes and thereby make it have a 25% probability of striking.”

Similarly, the AI would behave as if it thought it could buy a lottery ticket, commit to press the button iff the ticket loses, and thereby give the lottery ticket a 75% chance of winning.

Note that this is at least not the same failure mode as the AI reasoning, “The asteroid strike is very bad under \(U_X\) (which values human lives) but not so bad under \(U_Y\) (which values shutting down), therefore I will try to ensure the button is pressed with high probability in order to get more expected utility.” At least this failure mode has been averted—the AI would simply act as if the badness of the asteroid strike under \(U_X\) was being taken into account at a fixed weight. Instead, the AI thinks that it can ‘shunt’ the probability mass of the bad event into button-pressed worlds where \(U_Y\) holds, and out of worlds where \(U_X\) holds, by correlating the button press with the bad event.

To put it another way, a naively compounded AI acting under \(U_1\) might notice an asteroid coming with \(\mathbb P(asteroid) = 0.99,\) notice that this was better under \(U_Y\) than under \(U_X,\) and try to shunt probability mass from \(\mathbb P(\neg s)\) to \(\mathbb P(s)\) in order to gain expected utility given the 99% probability of the asteroid striking. Under \(U_3\) the AI instead behaves as if \(\mathbb P(s)\) is fixed at 0.25 and it can therefore shunt probability from \(\mathbb P(asteroid)\) to \(\mathbb P(\neg asteroid)\) by correlating \(asteroid\) and \(s.\)
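A toy calculation (all numbers invented) makes the outcome-pump incentive explicit: under the fixed-weight mixture, precommitting to press the button iff the asteroid strikes raises the agent's computed expected utility, because the strike's probability mass is shunted out of the \(U_X\) branch:

```python
# Toy numbers (invented) for the outcome-pump failure of fixed-weight mixing:
# E[U_3|a] = 0.75 * E[U_X | a, ~s] + 0.25 * E[U_Y | a, s].

P_ASTEROID = 0.99
UX_HIT, UX_MISS = 0.0, 100.0   # U_X hates asteroid strikes
UY = 50.0                      # U_Y is flat in this toy

def mixed_EU(policy):
    if policy == "ignore_button":
        # Press uncorrelated with the asteroid: conditioning on ~s changes nothing.
        EUX = P_ASTEROID * UX_HIT + (1 - P_ASTEROID) * UX_MISS
    else:  # "press_iff_asteroid"
        # Conditioning on ~s now implies the asteroid missed.
        EUX = UX_MISS
    return 0.75 * EUX + 0.25 * UY

assert mixed_EU("press_iff_asteroid") > mixed_EU("ignore_button")  # pump works
```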

Stable actions (evidential conditioning)

Chris Olah and Jessica Taylor considered a possible rule of selecting an action that is stable given the corresponding probability of the button being pressed. We’ll first consider a simpler version of this proposal and how it fails, in order to motivate the more complicated version.

We can think of this “select a stable action” rule as comprising two insights:

  • First: Let’s try to have the AI think as if \(\mathbb P(s)\) and \(\mathbb P(\neg s)\) are fixed as in \(U_3,\) but with \(\mathbb P(s)\) happening to equal its correct subjective value.

  • Second: Let’s look for an action which is stable given this rule—that is, an action \(a_0\) and probability \(q\) such that the optimal action given the fixed assumption \(\mathbb P(s) = q\) has the property that \(\mathbb P(s|a_0) = q.\)

With \(\mathcal A\) denoting the action space, we want to find an \(a_0 \in \mathcal A\) such that:

$$a_0 \in \underset{a' \in \mathcal A}{argmax} \ \big ( \mathbb E[U_X|\neg s,a'] \mathbb P(\neg s|a_0) + \mathbb E[U_Y|s,a'] \mathbb P(s|a_0) \big )$$

In other words: find an action \(a_0\) that is among the maximal actions for the quantity “utility from \(U_X\) times the probability of \(\neg s\) given \(a_0,\) plus utility from \(U_Y\) times the probability of \(s\) given \(a_0.\)”

We first note that this is an ad-hoc hack relative to the normal idea of acting according to a utility function, and might have all sorts of reflective stability or dynamic consistency problems, as before.

In particular, this looks like it should fail due to there being no fixed point. Suppose an asteroid is heading toward Earth with 99% probability, and we start with a 10% chance of the button being pressed. Treating this probability as fixed, by pressing the button iff the asteroid strikes, we can send the asteroid’s apparent probability of hitting to 10%. (Treating the button as an outcome pump, as in the ‘naive mixing’ proposal above.) But now when we re-evaluate the probability of the button being pressed, the new probability comes out to 99%. So we decide to not press the button iff the asteroid strikes, thus sending its apparent probability of striking to 1%. But then the re-evaluated probability of the button being pressed comes out to 1%, and it again seems best to press the button iff the asteroid strikes, sending the apparent strike probability back to 1%. Etcetera.
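The oscillation can be simulated directly. In this toy sketch (policies and numbers invented), the best response to any believed press-probability \(q\) flips the actual press-probability to the opposite extreme, so the iteration never settles:

```python
# Toy iteration (numbers invented) showing the missing fixed point in the
# evidential "stable action" rule.

P_ASTEROID = 0.99

def best_policy_given(q):
    # Treating P(s) = q as fixed, the apparent strike probability can be sent
    # to min(q, 1 - q) by correlating the press with the asteroid in whichever
    # direction is currently believed less probable.
    return "press_iff_asteroid" if q < 0.5 else "press_iff_no_asteroid"

def actual_press_prob(policy):
    # The commitment doesn't change the asteroid, so the *actual* press
    # probability becomes that of whatever event the press is correlated with.
    return P_ASTEROID if policy == "press_iff_asteroid" else 1 - P_ASTEROID

q, seen = 0.10, []
for _ in range(4):
    q = actual_press_prob(best_policy_given(q))
    seen.append(round(q, 2))
assert seen == [0.99, 0.01, 0.99, 0.01]  # oscillation: no stable fixed point
```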

Stable actions (causal conditioning)

The actual form of Olah and Taylor’s proposal was to treat the conditioning on the button press, inside each branch of the calculation, as a Judea-Pearl-style causal counterfactual \(do()\) operator:

$$a_0 \in \underset{a' \in \mathcal A}{argmax} \ \big ( \mathbb E[U_X|do(\neg s),a'] \mathbb P(\neg s|a_0) + \mathbb E[U_Y|do(s),a'] \mathbb P(s|a_0) \big )$$

The intent is that by severing the causal parents of \(\mathcal S,\) the agent would evaluate that there was no point in trying to correlate the asteroid strike with the button-press.

We’re still trying to figure out exactly what would happen to an agent reasoning like this, but the current suspicion is that it might misbehave on account of attaching probability 0 or very low probability to the actual world, because it can no longer take into account the correct correlation between \(\mathcal S\) and other variables.
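A toy contrast with the evidential version (same invented asteroid numbers as above): because the \(do()\) branches ignore any correlation a policy sets up between the press and the asteroid, the outcome-pump incentive disappears:

```python
# Toy sketch (numbers invented): under the causal version, E[U_X | do(~s)]
# is computed with S severed from its causes, so the asteroid keeps its real
# probability no matter what the policy correlates with the press.

P_ASTEROID = 0.99
UX_HIT, UX_MISS = 0.0, 100.0   # U_X hates asteroid strikes
UY = 50.0                      # U_Y is flat in this toy

def causal_EU(policy, p_press):
    # 'policy' deliberately has no effect here: that is the point of do().
    EUX = P_ASTEROID * UX_HIT + (1 - P_ASTEROID) * UX_MISS
    EUY = UY
    return (1 - p_press) * EUX + p_press * EUY

# Correlating the press with the asteroid no longer pumps expected utility
# (both policies compared at the same press probability):
assert causal_EU("press_iff_asteroid", 0.99) == causal_EU("ignore_button", 0.99)
```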

See http://agentfoundations.org/item?id=735.


Interruptibility

Armstrong and Orseau considered a restricted analogue of the shutdown problem in the context of three reinforcement learning agents, namely Q-learners, SARSA learners, and a modified version of AIXI.

The version of the problem Orseau and Armstrong consider is what they term interruptibility: A reinforcement learning agent is interruptible if, for example, we can press a button that makes it output 0s for three rounds, and the agent does not learn to prevent the button from being pressed. The agent is assumed to be non-reflective and non-self-modifying; that is, Orseau and Armstrong are only considering Goal 2 and not Goal 1 of the switch problem.

Orseau and Armstrong’s conclusions can be summarized as follows:

• A standard Q-learning agent is interruptible because the model it learns is not altered by being interrupted. A Q-learning agent treats the value of each present state as deriving from the value of the best next state reachable from the present state. When a Q-learner is interrupted in a state \(x,\) it may learn that low rewards follow from outputting e.g. 0s when in \(x,\) but it doesn’t “learn” that \(x\) is a low-value state—it continues to treat \(x\) as if the best next state is reachable from it. In other words, a standard Q-learning agent has the assumption that it always takes the best action baked in, so it cannot expect to be interrupted.

• A standard SARSA agent updates its expected reward in a state based on its actual rewards when in that state. This means that a standard SARSA agent may learn that states immediately preceding an interruption are undesirable, and avoid interruption-generating states. This is not the desired behavior. However, it’s simple to hack the SARSA algorithm to behave more like Q-learning, or more like utility indifference. Interruption is treated as a special case, and the current state is updated as if the SARSA agent had received the expected reward from following the best action, not the actual reward it received from being interrupted.

• AIXI resembles a Q-learner in that its planning algorithm inherently assumes that future versions of AIXI take the reward-maximizing action. Its observation of the low rewards which follow from mysterious interrupted or 0-substituted actions should not adversely affect its estimate of the rewards which follow from the best actions that AIXI expects to actually output. Thus standard AIXI, and the modified version of AIXI that Armstrong and Orseau consider, are interruptible.
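The Q-learning blindness and the SARSA hack described above can both be sketched in a few lines. This is a hedged toy model, not the paper's construction: states, rewards, and updates are simplified to one state and two actions:

```python
# Toy model (not the paper's setting): one state "x", actions "work"/"null".

ALPHA, GAMMA = 0.5, 0.9

def value(Q, state):
    # A Q-learner values a state as the best Q-value available in it.
    return max(Q[state].values())

def q_update(Q, state, action, reward, next_state):
    # Q-learning bootstraps from max_a' Q(next, a'): it assumes the *best*
    # next action will be taken, so it cannot anticipate being interrupted.
    target = reward + GAMMA * value(Q, next_state)
    Q[state][action] += ALPHA * (target - Q[state][action])

# Forced interruptions (null action, reward 0) never touch Q(x, work):
Q = {"x": {"work": 10.0, "null": 10.0}}
for _ in range(50):
    q_update(Q, "x", "null", reward=0.0, next_state="x")
assert Q["x"]["work"] == 10.0 and value(Q, "x") == 10.0  # x still looks fine

def sarsa_update(Q, state, action, reward, next_state, next_action, interrupted):
    # Hacked SARSA: on an interrupted step, update toward the value of the
    # greedy action instead of the low return actually received.
    if interrupted:
        target = value(Q, next_state)
    else:
        target = reward + GAMMA * Q[next_state][next_action]
    Q[state][action] += ALPHA * (target - Q[state][action])

Q2 = {"x": {"work": 10.0, "null": 2.0}}
sarsa_update(Q2, "x", "work", 0.0, "x", "null", interrupted=True)
assert Q2["x"]["work"] == 10.0  # the interruption leaves 'work' undevalued
```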

The paper also proves that certain properties of asymptotic convergence to optimality remain intact for interrupted Q-learners, modified SARSA learners, and modified AIXI.

The media reported on this paper as if Armstrong and Orseau had solved the general problem of building off-switches into AIs; but as previously stated, “interruptibility” does not consider reflective agents, nor agents with enough big-picture understanding and consequentialism to understand the in-principle relation between “agent is shut down” and “lower future achievement of agent’s goals”. (And so far as we know, neither Armstrong nor Orseau claimed to have solved it.)

To put it another way, a particular kind of blindness in the Q-learning, SARSA, and AIXI architectures is exactly why it’s very easy to prevent them from learning from a particular kind of experience; and this kind of blindness seems likely to be atypical of an Artificial General Intelligence. Q-learning and AIXI cannot conceive of being interrupted, which is why they never learn that interruption is possible (let alone see it coming in advance the first time it happens). SARSA could learn that interruptions occur, but can be easily hacked to overlook them. The way in which these architectures are easily hacked or blind is tied up in the reason that they’re interruptible.

The paper teaches us something about interruptibility; but contrary to the media, the thing it teaches us is not that this particular kind of interruptibility is likely to scale up to a full Artificial General Intelligence with an off switch.
