AI alignment

The “alignment problem for advanced agents” or “AI alignment” is the overarching research topic of how to develop sufficiently advanced machine intelligences such that running them produces good outcomes in the real world.

Both ‘advanced agent’ and ‘good’ should be understood as metasyntactic placeholders for complicated ideas still under debate. The term ‘alignment’ is intended to convey the idea of pointing an AI in a direction: once you build a rocket, it still has to be pointed in a particular direction.

“AI alignment theory” is meant as an overarching term to cover the whole research field associated with this problem, including, e.g., the much-debated attempt to estimate how rapidly an AI might gain in capability once it goes over various particular thresholds.

Other terms that have been used to describe this research problem include “robust and beneficial AI” and “Friendly AI”. The term “value alignment problem” was coined by Stuart Russell to refer to the primary subproblem of aligning AI preferences with (potentially idealized) human preferences.

Some alternative terms for this general field of study, such as ‘control problem’, can sound adversarial, as if the rocket were already pointed in a bad direction and you needed to wrestle with it. Other terms, like ‘AI safety’, understate the degree to which, on the advocated view, alignment ought to be an intrinsic part of building advanced agents. E.g., there isn’t a separate theory of “bridge safety” for how to build bridges that don’t fall down. Pointing the agent in a particular direction ought to be seen as part of the standard problem of building an advanced machine agent. The problem does not divide into “building an advanced AI” and then separately “somehow causing that AI to produce good outcomes”; the problem is “getting good outcomes via building a cognitive agent that brings about those good outcomes”.

A good introductory article or survey paper for this field does not presently exist. If you have no idea what this problem is about, consider reading Nick Bostrom’s popular book Superintelligence.

You can explore this Arbital domain by following this link. See also the List of Value Alignment Topics on Arbital, although it is not up to date.


Overview

Supporting knowledge

If you’re willing to spend time on learning this field and are not previously familiar with the basics of decision theory and probability theory, it’s worth reading the Arbital introductions to those first. In particular, it may be useful to become familiar with the notion of priors and belief revision, and with the coherence arguments for expected utility.

If you have time to read a textbook to gain general familiarity with AI, “Artificial Intelligence: A Modern Approach” is highly recommended.

Key terms

  • Advanced agent

  • Value / goodness / beneficial

  • Artificial General Intelligence

  • Superintelligence

Key factual propositions

  • Orthogonality thesis

  • Instrumental convergence thesis

  • Capability gain

  • Complexity and fragility of cosmopolitan value

  • Alignment difficulty

Advocated basic principles

  • Nonadversarial principle

  • Minimality principle

  • Intrinsic safety

  • Unreliable monitoring (treat human monitoring as expensive, unreliable, and fragile)

Advocated methodology

  • Generalized security mindset

  • Foreseeable difficulties

Advocated design principles

  • Value alignment

  • Corrigibility

  • Mild optimization / Taskishness

  • Conservatism

  • Transparency

  • Whitelisting

  • Redzoning

Current research areas

  • Cooperative inverse reinforcement learning

  • Interruptibility

  • Utility switching

  • Stable self-modification & Vingean reflection

  • Mirror models

  • Robust machine learning

Open problems

  • Environmental goals

  • Other-izer problem

  • Impact penalty

  • Shutdown utility function & abortability

  • Fully updated deference

  • Epistemic exclusion

Partially solved problems

  • Logical uncertainty for doubly-exponential reflective agents

  • Interruptibility for non-reflective non-consequentialist Q-learners

  • Strategy-determined Newcomblike problems

  • Leverage prior

Future work

  • Ontology identification problem

  • Behaviorism

  • Averting instrumental strategies

  • Reproducibility

<div>

Children: