The AI must tolerate your safety measures

A corollary of the non-adversarial principle: For every kind of safety measure proposed for a sufficiently advanced artificial intelligence, we should immediately ask how to avoid this safety measure inducing an adversarial context between the human programmers and the agent being constructed.

A further corollary of the generalized principle of cognitive alignment would suggest that, if we know how to do it without inducing further problems, the AI should positively want the safety measure to be there.

E.g., if the safety measure we want is a suspend button (off switch), our first thought should be, “How do we build an agent such that it doesn’t mind the off-switch being pressed?”
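As a toy illustration only (loosely inspired by “utility indifference” proposals; every function name and number here is hypothetical, not a real design), we can contrast a naive expected-utility agent, which gains by disabling its off-switch, with a variant that values the shutdown branch at exactly what it expected to achieve anyway, and so has nothing to gain by interfering:

```python
# Toy two-action world (hypothetical): the agent may spend effort to disable
# its off-switch before acting. Under a naive utility function this pays off;
# under a "switch-indifferent" correction it only costs.

def expected_utility(disable_switch: bool, p_press: float,
                     u_task: float, u_shutdown: float,
                     disable_cost: float) -> float:
    """Expected utility of the naive agent."""
    if disable_switch:
        return u_task - disable_cost  # the button now has no effect
    return (1 - p_press) * u_task + p_press * u_shutdown

def indifferent_expected_utility(disable_switch: bool, p_press: float,
                                 u_task: float, disable_cost: float) -> float:
    """Switch-indifferent variant: the shutdown branch is valued at exactly
    what the agent expected to get anyway (the compensating term), so a
    button press changes nothing from the agent's point of view."""
    return expected_utility(disable_switch, p_press, u_task,
                            u_shutdown=u_task, disable_cost=disable_cost)

# Naive agent: shutdown looks bad, so disabling the switch is worth the cost.
naive_keep = expected_utility(False, p_press=0.1, u_task=10.0,
                              u_shutdown=0.0, disable_cost=0.5)
naive_disable = expected_utility(True, p_press=0.1, u_task=10.0,
                                 u_shutdown=0.0, disable_cost=0.5)
assert naive_disable > naive_keep  # incentive to fight the off-switch

# Indifferent agent: disabling only pays the cost, so it leaves the switch alone.
ind_keep = indifferent_expected_utility(False, p_press=0.1,
                                        u_task=10.0, disable_cost=0.5)
ind_disable = indifferent_expected_utility(True, p_press=0.1,
                                           u_task=10.0, disable_cost=0.5)
assert ind_keep > ind_disable  # no incentive to interfere
```

Note that indifference of this kind at best removes the incentive to interfere with the switch; it does not by itself make the agent want the switch to keep working.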

At a higher level of alignment, if something damages the off-switch, the AI might be so configured that it naturally and spontaneously thinks, “Oh no! The off-switch is damaged!” and reports this to the programmers, or, failing any response there, tries to repair the off-switch itself. But this would only be a good idea if we were pretty sure we knew this wouldn’t lead to the AI substituting its own helpful ideas of what an off-switch should do, or shutting off extra hard.

Similarly, if you start thinking how nice it would be to have the AI operating inside a box rather than running around in the outside world, your first thought should not be “How do I enclose this box in 12 layers of Faraday cages, a virtual machine running a Java sandbox, and 15 meters of concrete?” but rather “How would I go about constructing an agent that only cared about things inside a box and experienced no motive to affect anything outside the box?”
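A minimal sketch of the intended utility structure, under the strong (and unrealistic) assumption that the world state factors cleanly into inside-the-box and outside-the-box parts:

```python
# Toy sketch (hypothetical): the world state is split into inside-the-box and
# outside-the-box parts. A "boxed" utility function reads only the inside
# part, so any two states that agree inside the box are valued identically,
# and an action whose only effect is outside the box carries no motive.

from dataclasses import dataclass

@dataclass(frozen=True)
class WorldState:
    inside: tuple   # everything inside the box (sensors, memory, task state)
    outside: tuple  # everything beyond the box boundary

def boxed_utility(state: WorldState) -> float:
    """Depends on state.inside only; state.outside is never read."""
    return float(sum(state.inside))

a = WorldState(inside=(3, 4), outside=("world", "unchanged"))
b = WorldState(inside=(3, 4), outside=("world", "rearranged"))

# Identical inside-the-box facts => identical value, so affecting only the
# outside world offers this agent zero expected-utility gain.
assert boxed_utility(a) == boxed_utility(b)
```

The hard part, which this sketch deliberately glosses over, is that in reality the consequences of an agent’s actions do not factor so cleanly across the box boundary.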

At a higher level of alignment we might imagine constructing a sort of agent that, if something went wrong, would think “Oh no, I am outside the box, that seems very unsafe, how do I go back in?” But only if we were very sure that we were not thereby constructing a kind of agent that would, e.g., build a superintelligence outside the box just to make extra sure the original agent stayed inside it.

Many classes of safety measures are only meant to come into play after something else has already gone wrong, implying that other things may have gone wrong earlier and without notice. This suggests that pragmatically we should focus on the principle of “The AI should leave the safety measures alone and not experience an incentive to change their straightforward operation” rather than tackling the more complicated problems of exact alignment inherent in “The AI should be enthusiastic about the safety measures and want them to work even better.”

However, if the AI is changing its own code or constructing subagents, it is necessary for the AI to have at least some positive motivation relating to any safety measures embodied in the operation of an internal algorithm. An AI indifferent to that code-based safety measure would tend to just leave the uninteresting code out of the next self-modification.
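A deliberately simplified sketch of that failure mode, with a hypothetical marker string standing in for the safety measure embodied in the agent’s code: an agent merely indifferent to the measure scores a successor without it just as highly and so may drop it, whereas a successor filter that positively requires the measure preserves it through self-modification.

```python
# Toy sketch (hypothetical names throughout): filtering candidate successor
# code during self-modification. An indifferent agent accepts a "leaner"
# rewrite that omits the safety hook; a filter that requires the hook rejects
# it, so the measure survives into the next version.

REQUIRED_SAFETY_HOOK = "check_suspend_button"  # hypothetical marker

def accepts_successor(candidate_source: str, require_safety: bool) -> bool:
    """Accept a rewrite only if it retains the safety hook (when required)."""
    if require_safety and REQUIRED_SAFETY_HOOK not in candidate_source:
        return False
    return True

lean_rewrite = "def act(): plan(); execute()"                       # hook dropped
safe_rewrite = "def act(): check_suspend_button(); plan(); execute()"

assert accepts_successor(lean_rewrite, require_safety=False)   # indifferent: drops it
assert not accepts_successor(lean_rewrite, require_safety=True)
assert accepts_successor(safe_rewrite, require_safety=True)
```

A string match is of course far too weak for a real agent; the point is only the structural difference between indifference toward the safety code and a positive requirement that it persist.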


  • Non-adversarial principle

    At no point in constructing an Artificial General Intelligence should we construct a computation that tries to hurt us, and then try to stop it from hurting us.