Principles in AI alignment

A ‘principle’ of AI alignment is something we want in a broad sense for the whole AI, which has informed narrower design proposals for particular parts or aspects of the AI.

For example:

  • The Non-adversarial principle says that the AI should never be searching for a way to defeat our safety measures or do something else we don’t want, even if we think this search will come up empty; it’s just the wrong thing for us to program computing power to do.

    • This informs the proposal of Value alignment problem: we ought to build an AI that wants to attain the class of outcomes we want to see.

    • This informs the proposal of Corrigibility, subproposal Utility indifference: if we build a suspend button into the AI, we need to make sure the AI experiences no instrumental pressure to disable the suspend button.

  • The Minimality principle says that when we are building the first aligned AGI, we should try to do as little as necessary, using the least dangerous cognitive computations possible, to prevent the default outcome of the world being destroyed by the first unaligned AGI.

    • This informs the proposals of Mild optimization and Taskishness: we are safer if all goals and subgoals of the AI are formulated so that they can be achieved as fully as we would prefer using a bounded amount of effort, and the AI exerts only enough effort to do that.

    • This informs the proposal of Behaviorism: some pivotal-act proposals seem not to require the AI to understand and predict humans in great detail, only to master engineering; and it seems we can head off multiple thorny problems by not having the AI try to model humans or other minds in as much detail as possible.
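
As a loose illustration of the Mild optimization and Taskishness item above, here is a toy sketch of a search that stops at the first "good enough" option and never exceeds a fixed effort budget. It is not taken from any of the proposals named here and says nothing about how a real AGI would be built. The function name `bounded_satisfice`, its parameters, and the example data are all hypothetical, chosen only to show the shape of the idea.

```python
from typing import Callable, Iterable, Optional, Tuple, TypeVar

T = TypeVar("T")


def bounded_satisfice(
    candidates: Iterable[T],
    score: Callable[[T], float],
    good_enough: float,
    max_evaluations: int,
) -> Tuple[Optional[T], int]:
    """Toy satisficer (hypothetical helper, for illustration only).

    Returns the first candidate whose score reaches `good_enough`,
    together with the number of candidates evaluated. It never scores
    more than `max_evaluations` candidates, and it does not keep
    looking for a better option once the threshold is met.
    """
    evaluations = 0
    for candidate in candidates:
        if evaluations >= max_evaluations:
            break  # effort budget exhausted: give up rather than push harder
        evaluations += 1
        if score(candidate) >= good_enough:
            return candidate, evaluations  # task achieved: stop here
    return None, evaluations


# Hypothetical example: "plans" are just numbers scored by their own value.
plans = [3, 7, 12, 40, 1000]
chosen, effort_used = bounded_satisfice(
    plans, score=float, good_enough=10.0, max_evaluations=4
)
print(chosen, effort_used)  # 12 3
```

Note the contrast with a maximizer, which would scan every plan and return 1000. The sketch settles for 12 after three evaluations, and it returns nothing at all if the budget runs out first.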

Please be guarded about declaring things to be ‘principles’ unless they have already informed more than one specific design proposal and more than one person thinks they are a good idea. You could call them ‘proposed principles’ and post them under your own domain if you personally think they are a good idea. There are a lot of possible ‘broad design wishes’, or things that people think are ‘broad design wishes’, and the principles that have actually already informed specific design proposals would otherwise get lost in the crowd.


  • Non-adversarial principle

    At no point in constructing an Artificial General Intelligence should we construct a computation that tries to hurt us, and then try to stop it from hurting us.

  • Minimality principle

    The first AGI ever built should save the world in a way that requires the least amount of the least dangerous cognition.

  • Understandability principle

    The more you understand what the heck is going on inside your AI, the safer you are.

  • Separation from hyperexistential risk

    The AI should be widely separated in the design space from any AI that would constitute a “hyperexistential risk” (anything worse than death).


  • AI alignment

    The great civilizational problem of creating artificially intelligent computer systems such that running them is a good idea.