Principles in AI alignment

A ‘principle’ of AI alignment is something we want in a broad sense for the whole AI, which has informed narrower design proposals for particular parts or aspects of the AI.

For ex­am­ple:

  • The Non-adversarial principle says that the AI should never be searching for a way to defeat our safety measures or do something else we don’t want, even if we think this search will come up empty; it’s just the wrong thing for us to program computing power to do.

  • This informs the proposal of Value alignment problem: we ought to build an AI that wants to attain the class of outcomes we want to see.

  • This informs the proposal of Corrigibility, subproposal Utility indifference: if we build a suspend button into the AI, we need to make sure the AI experiences no instrumental pressure to disable the suspend button.

  • The Minimality principle says that when we are building the first aligned AGI, we should try to do the least that is necessary, using the least dangerous cognitive computations possible, in order to prevent the default outcome of the world being destroyed by the first unaligned AGI.

  • This informs the proposal of Mild optimization and Taskishness: we are safer if all goals and subgoals of the AI are formulated in such a way that they can be achieved to the degree we prefer using a bounded amount of effort, and the AI exerts only enough effort to do that.

  • This informs the proposal of Behaviorism: it seems like there are some pivotal-act proposals that don’t require the AI to understand and predict humans in great detail, just to master engineering; and it seems like we can head off multiple thorny problems by not having the AI try to model humans or other minds in great detail.
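The utility-indifference subproposal above can be illustrated with a toy expected-utility calculation (a minimal sketch under simplifying assumptions, not any published construction; the function name and numbers are purely illustrative). The idea: add a compensation term to the suspended branch so that both branches are worth the same, leaving the agent with no instrumental reason to raise or lower the probability that the button is pressed.

```python
# Toy sketch of utility indifference (illustrative only).
# A compensation bonus is added to the suspended branch so the agent's
# expected utility is the same whether or not the button is pressed.

def indifferent_utility(u_normal, u_suspended, p_press):
    """Expected utility after adding a bonus to the suspended branch
    that makes it exactly as valuable as the normal branch."""
    bonus = u_normal - u_suspended  # makes the two branches equal in value
    return (1 - p_press) * u_normal + p_press * (u_suspended + bonus)

# Expected value no longer depends on p_press, so actions that would
# change p_press (e.g. disabling the button) gain the agent nothing:
assert indifferent_utility(10.0, 2.0, 0.25) == indifferent_utility(10.0, 2.0, 0.5)
```

In this toy setting, the derivative of expected utility with respect to `p_press` is zero, which is the sense in which the agent feels “no instrumental pressure” about the button; real proposals are far more involved.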

Please be guarded about declaring things to be ‘principles’ unless they have already informed more than one specific design proposal and more than one person thinks they are a good idea. You could call them ‘proposed principles’ and post them under your own domain if you personally think they are a good idea. There are a lot of possible ‘broad design wishes’, or things that people think are ‘broad design wishes’, and the principles that have actually already informed specific design proposals would otherwise get lost in the crowd.


  • Non-adversarial principle

    At no point in constructing an Artificial General Intelligence should we construct a computation that tries to hurt us, and then try to stop it from hurting us.

  • Minimality principle

    The first AGI ever built should save the world in a way that requires the least amount of the least dangerous cognition.

  • Understandability principle

    The more you understand what the heck is going on inside your AI, the safer you are.

  • Separation from hyperexistential risk

    The AI should be widely separated in the design space from any AI that would constitute a “hyperexistential risk” (anything worse than death).


  • AI alignment

    The great civilizational problem of creating artificially intelligent computer systems such that running them is a good idea.