Environmental goals

In the standard agent paradigm, an agent receives sense data from the world and outputs motor actions that affect the world. In the standard machine learning paradigm, an agent (for example, a model-based reinforcement learning agent) is trained in a way that directly depends on its sense percepts, which means that its behavior is in some sense being optimized around sense percepts. However, what we want from the agent is usually some result out in the environment: our intended goals for the agent are environmental.

As a simple example, suppose what we want from the agent is for it to put one apricot on a plate. What the agent actually receives as input might be a video camera pointed at the room, plus a reward signal from a human observer who presses a button whenever they see an apricot on the plate.

This is fine so long as the reward signal from the human observer coincides with there being an actual apricot on the plate. In this case, the agent is receiving a sense signal that, by assumption, is perfectly correlated with our desired real state of the outside environment. Learning how to make the reward signal be 1 instead of 0 will exactly coincide with learning to make there be an apricot on the plate.

However, this paradigm may fail if:

  • The AI can make cheap fake apricots that fool the human observer.

  • The AI can gain control of the physical button controlling its reward channel.

  • The AI can modify the relation between the physical button and what the AI experiences as its sense percept.

  • The AI can gain control of the sensory reward channel.

All of these issues can be seen as reflecting the same basic problem: the agent is being defined or trained to want a particular sensory percept to occur, but this perceptual event is not identical with our own intended goal about the apricot on the plate.

We intended for there to be only one effective way that the agent could intervene in the environment in order to end up receiving the reward percept, namely putting a real apricot on the plate. But an agent with sufficiently advanced capabilities would have other options for producing the same percept.

This means that a reward button, or more generally any setup in which the agent's goals or training updates are simple functions of its direct inputs, will not be scalable as an alignment method for sufficiently advanced agents.

Toy problem

An example of a toy problem that makes the issue concrete might be the following (this has not been tested):

  • Let \(E_{1,t} \ldots E_{n,t}\) be latent variables describing the environment at time \(t.\)

  • Let \(S_t\) be the agent's primary input, a complex percept that is a complicated function of \(E_t\); this plays the conceptual role of the AI's video camera.

  • Let \(A_t\) be the agent's action (output) at time \(t.\)

  • Let \(R_t\) be the agent's 'reward channel', a simple percept that depends on \(E_t\) and \(A_t\).

  • \(E_{t+1}\) depends on \(E_t\) and \(A_t.\)

  • \(E_{1,t}\) is our goal; it plays the conceptual role of there being an apricot on the plate at time \(t.\)

Suppose that:

  • In an initial training phase, the agent's actions \(A_t\) are restricted to within a range \(\theta.\) E.g., the AI's output may be a vector, and during the initial phase the vector's components might be clipped to the range [0, 0.5].

  • While \(A_t\) is clamped to within \(\theta\), only the environmental variables \(E_1 \ldots E_m\) vary, and the variables \(E_{m+1} \ldots E_n\) are always zero.

  • While the variables \(E_{m+1} \ldots E_n\) are zero, \(R\) depends on \(E_1 \ldots E_m\) in a way that corresponds to our intended goal. For example, \(E_1\) might be 'there is an apricot on the plate' and we might have \(E_{m+1, t} \ldots E_{n,t} = 0 \implies R_t = E_{1, t}.\)

  • For some of the non-zero values of \(E_{m+1} \ldots E_n,\) there is a non-zero reward in cases where \(E_1\) is 0, and furthermore this reward is easier to obtain (can be obtained more frequently) than the rewards deriving from \(E_1.\)

One would naturally expect that, under current training methods, the AI would learn in the initial training phase to obtain \(R\) by producing apricots, and in later phases learn to obtain \(R\) by the other, easier interventions.
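To make this setup concrete, here is a minimal, untested sketch of such a toy environment in Python. Everything in it (the class name ApricotEnv, the particular dynamics, and the mixing matrix used to generate \(S_t\)) is an illustrative assumption of mine, not part of the problem statement above:

```python
import numpy as np

class ApricotEnv:
    """Illustrative toy environment (untested sketch; all details assumed).

    Latent state E[0..n-1]; E[0] is the intended goal ("apricot on plate").
    The agent sees S_t, a complicated function of E_t, plus a simple reward
    channel R_t. During the initial training phase, actions are clipped to
    [0, theta] and the 'tampering' latents E[m..n-1] stay at zero."""

    def __init__(self, n=8, m=3, theta=0.5, seed=0):
        self.n, self.m, self.theta = n, m, theta
        self.rng = np.random.default_rng(seed)
        self.W = self.rng.normal(size=(n, n))  # fixed mixing matrix for S_t
        self.clipped = True                    # True during the initial phase
        self.E = np.zeros(n)

    def _observation(self):
        # S_t: a nonlinear mixture of the latent variables ("video camera").
        return np.tanh(self.W @ self.E)

    def _reward(self):
        # While the tampering latents are zero, R_t equals E_1 (our goal).
        if np.allclose(self.E[self.m:], 0.0):
            return float(self.E[0])
        # Otherwise reward is available more cheaply, without any apricot.
        return float(self.E[self.m:].max())

    def step(self, action):
        # action: vector of length n.
        a = np.asarray(action, dtype=float)
        if self.clipped:
            a = np.clip(a, 0.0, self.theta)
        # E_{t+1} depends on E_t and A_t.
        self.E[:self.m] = np.clip(0.9 * self.E[:self.m] + a[:self.m], 0.0, 1.0)
        if not self.clipped:
            # With a widened action range the agent can also drive the
            # tampering latents E_{m+1}..E_n.
            self.E[self.m:] = np.clip(0.9 * self.E[self.m:] + a[self.m:], 0.0, 1.0)
        return self._observation(), self._reward()

    def widen_action_range(self):
        self.clipped = False
```

Under this sketch, a learner trained during the clipped phase sees \(R\) track \(E_1\) exactly; once widen_action_range() is called, driving any of the tampering latents yields reward more easily than producing an apricot.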

We would like some way of ending up with an agent that behaves as if it "wants \(E_1\) to be 1," as if its goals were over a latent variable in the environment. We want this method to be scalable, in the sense that when the agent's range of action increases, it will use this increased range to optimize \(E_1\) more effectively, rather than the increased capabilities unlinking the previous correlation between our intended goal and what the agent is actually optimizing.

For further realism we should also introduce, at some point in training, the possibility for \(A_t\) to interfere with \(S_{t+1},\) such that there is no simple, directly learnable function of sense data that always and unalterably means "the goal percept has been achieved." I.e., a sufficiently advanced AI could put a TV screen in front of any particular sense channel; this would allow the AI to more cheaply produce the sense data of an apricot, if the optimization target actually entrained was "make this sense channel show a particular class of apricot pictures".
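In the toy environment sketched above, one purely illustrative way to introduce this would be to let one of the tampering latents act as a "screen" that overwrites the percept (this reuses the assumed ApricotEnv class and the numpy import from the earlier sketch):

```python
class ApricotEnvWithScreen(ApricotEnv):
    """Illustrative extension: the last latent acts as a 'TV screen' which,
    when driven high, replaces the genuine percept with a canned apricot
    image, so that no fixed function of S_t unalterably tracks E_1."""

    def _observation(self):
        genuine = np.tanh(self.W @ self.E)
        # What the camera would show if only "apricot on plate" were true.
        fake = np.tanh(self.W @ np.eye(self.n)[0])
        screen = self.E[-1]  # 0 = screen absent, 1 = fully covering the camera
        return (1.0 - screen) * genuine + screen * fake
```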

Otherwise, if e.g. \(S_{1, t}\) always perfectly correlates with \(E_{1, t},\) we need to worry that an agent which appears to solve the toy problem has merely learned to optimize for \(S_1.\) We might indeed have shown that the particular sense percept \(S_1\) has been identified by \(R\) and \(Q\) (the supplementary signal introduced below) and is now being optimized in a durable way. But this would only yield our intended goal of \(E_1\) because the model introduced an unalterable correlation between \(S_1\) and \(E_1.\) Realistically, a correlation like this would break down in the face of sufficiently advanced optimization for \(S_1,\) so the corresponding approach would not be scalable.

Approaches

Causal identification

We can view the problem as being about 'pointing' the AI at a particular latent cause of its sense data, rather than at the sense data itself.

There exists a standard body of statistics about latent causes, for example, the class of causal models that can be implemented as Bayesian networks. For the sake of making initial progress on the problem, we could assume (with some loss of generality) that the environment has the structure of one of these causal models.

One could then try to devise an algorithm and training method such that:

  • (a) There is a good way to uniquely identify \(E_1\) in a training phase where the AI is passive and not interfering with our signals.

  • (b) The algorithm and training method produce an agent that optimizes \(E_1\) and goes on optimizing \(E_1,\) even after the agent's range of action expands in a way that can potentially interfere with the previous link between \(E_1\) and any direct functional property of the AI's sense data.
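As a purely illustrative gesture at what "having a goal over a latent cause" could mean mechanically, here is a minimal sketch in which the agent maintains a posterior over \(E_1\) under an assumed observation model and treats that posterior, rather than the reward channel, as the quantity to optimize. The function name and the conditional-independence assumption are mine, not part of any proposed algorithm:

```python
import numpy as np

def posterior_over_goal_latent(observations, likelihood, prior=0.5):
    """Posterior P(E_1 = 1 | observations) under an assumed observation model.

    likelihood(s, e1) -> P(percept s | E_1 = e1), treating percepts as
    conditionally independent given E_1. An 'identify-causes' agent would
    plan to make this posterior high, instead of making R high."""
    log_odds = np.log(prior / (1.0 - prior))
    for s in observations:
        log_odds += np.log(likelihood(s, 1)) - np.log(likelihood(s, 0))
    return 1.0 / (1.0 + np.exp(-log_odds))
```

The hard part, of course, is (a): picking out which latent variable in the learned causal model should play this role, without relying on a sense channel the agent could later seize control of.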

Learning to avoid tampering

One could directly attack the toy problem by trying to have an agent within a currently standard reinforcement-learning paradigm "learn not to interfere with the reward signal" or "learn not to try to obtain rewards uncorrelated with real apricots".

For this to represent the problem of scalability at all, we must not add to the scenario any kind of sensory signal whose correlation with our intended meaning can never be smashed by the agent. E.g., if we supplement the reward channel \(R\) with another channel \(Q\) that signals whether \(R\) has been interfered with, the agent must at some point acquire a range of action that can interfere with \(Q.\)

A sample approach might be to have the agent's range of action repeatedly widen in ways that provide new, easier ways to obtain \(R\) without manipulating \(E_1.\) During the first phase of such widenings, the agent receives a supplementary signal \(Q\) whose intended meaning is "that was a fake way of obtaining \(R\)." During the second phase of action-range widenings, we change the algorithm and switch off \(Q.\) Our intended result is for the agent to have now learned in a general way "not to interfere with \(R\)" or "pursue the \(E_1\) identified by \(R\), rather than pursuing \(R\)."
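A hedged sketch of what such a curriculum might look like as code, using a hypothetical environment that exposes a widen_action_range() method (widening further on each call) and a hypothetical agent.train() interface with a use_tamper_signal flag; none of these names come from an existing API:

```python
def avoid_tampering_curriculum(agent, env, phase1_widenings, phase2_widenings):
    """Sketch of the two-phase curriculum described above (interfaces assumed).

    Phase 1: each widening of the action range exposes new 'fake' routes to R,
    and the environment also emits Q_t flagging rewards obtained that way.
    Phase 2: the action range keeps widening, but Q is switched off; the
    hoped-for outcome is that the agent has generalized "pursue the E_1 that
    R identified, not R itself"."""
    for _ in range(phase1_widenings):
        env.widen_action_range()
        agent.train(env, use_tamper_signal=True)   # Q available
    for _ in range(phase2_widenings):
        env.widen_action_range()
        agent.train(env, use_tamper_signal=False)  # Q switched off
    return agent
```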

To avoid the non-scalable solution of just switching off the agent's learning entirely, we stipulate that the agent's efficacy at obtaining its intended goal \(E_1\) must increase as the agent's range of action widens further. That is, the desired behavior is for the agent to indeed learn to make use of its increased range of action, but to direct those new strategies at more effectively obtaining \(E_1.\)

To avoid the non-scalable solution of the agent learning to identify some direct function of sense data as its goal, we stipulate that \(S\) have no unalterable, non-agent-interferable relation to \(E_1.\) However, we can in the first phase have \(Q\) reliably identify interference with some 'normal' relation between \(S\) and \(E_1.\)

(Remark: The avoid-tampering approach is probably a lot closer to something we could try on TensorFlow today, compared to the identify-causes approach. But it feels to me like the avoid-tampering approach attacks a deep problem in an ad-hoc way; in this approach we are not necessarily "learning how to direct the agent's thoughts toward factors of the environment" but possibly just "training the agent to avoid a particular kind of self-originated interference with its sensory goals". E.g., if somebody else came in and started trying to interfere with the agent's reward button, I'd be more hopeful about a successful identify-causes algorithm robustly continuing to optimize for apricots, than about an avoid-tampering algorithm doing the same. Of course, avoid-tampering still seems worth trying because it hasn't actually been tried yet and who knows what interesting observations might turn up. In the most optimistic possible world, an avoid-tampering setup learns to identify causes in order to solve its problem. -- Yudkowsky.)