AI alignment

The “alignment problem for advanced agents” or “AI alignment” is the overarching research topic of how to develop sufficiently advanced machine intelligences such that running them produces good outcomes in the real world.

Both ‘advanced agent’ and ‘good’ should be understood as metasyntactic placeholders for complicated ideas still under debate. The term ‘alignment’ is intended to convey the idea of pointing an AI in a direction, just as a rocket, once built, has to be pointed in a particular direction.

“AI alignment theory” is meant as an overarching term to cover the whole research field associated with this problem, including, e.g., the much-debated attempt to estimate how rapidly an AI might gain in capability once it goes over various particular thresholds.

Other terms that have been used to describe this research problem include “robust and beneficial AI” and “Friendly AI”. The term “value alignment problem” was coined by Stuart Russell to refer to the primary subproblem of aligning AI preferences with (potentially idealized) human preferences.

Some alternative terms for this general field of study, such as ‘control problem’, can sound adversarial, as though the rocket were already pointed in a bad direction and you needed to wrestle with it. Other terms, like ‘AI safety’, understate the advocated degree to which alignment ought to be an intrinsic part of building advanced agents. E.g., there isn’t a separate theory of “bridge safety” for how to build bridges that don’t fall down. Pointing the agent in a particular direction ought to be seen as part of the standard problem of building an advanced machine agent. The problem does not divide into “building an advanced AI” and then separately “somehow causing that AI to produce good outcomes”; rather, the problem is “getting good outcomes via building a cognitive agent that brings about those good outcomes”.

A good introductory article or survey paper for this field does not presently exist. If you have no idea what this problem is about, consider reading Nick Bostrom’s popular book Superintelligence.

You can explore this Arbital domain by following this link. See also the List of Value Alignment Topics on Arbital, although this list is not up-to-date.



Supporting knowledge

If you’re willing to spend time on learning this field and are not previously familiar with the basics of decision theory and probability theory, it’s worth reading the Arbital introductions to those first. In particular, it may be useful to become familiar with the notion of priors and belief revision, and with the coherence arguments for expected utility.
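As a concrete toy illustration of those two notions (the hypotheses, numbers, and utilities below are invented for illustration, not taken from this article), a Bayesian agent revises a prior over hypotheses in light of evidence, then chooses the action with the highest expected utility under its posterior belief:

```python
# Toy sketch of belief revision (Bayes' rule) and expected utility choice.
# All hypotheses, likelihoods, and payoffs here are made-up examples.

# Prior over two hypotheses about a coin.
prior = {"fair": 0.5, "biased": 0.5}

# Likelihood of observing "heads" under each hypothesis.
likelihood_heads = {"fair": 0.5, "biased": 0.8}

def update(prior, likelihood):
    """Bayes' rule: posterior is proportional to prior times likelihood."""
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    total = sum(unnormalized.values())
    return {h: p / total for h, p in unnormalized.items()}

# After seeing one "heads", belief shifts toward the biased hypothesis.
posterior = update(prior, likelihood_heads)

# Payoff of each available bet, conditional on which hypothesis is true.
utilities = {
    "bet_heads": {"fair": 0.0, "biased": 1.0},
    "bet_tails": {"fair": 0.5, "biased": -1.0},
}

def expected_utility(action, belief):
    """Probability-weighted average payoff of an action under a belief."""
    return sum(belief[h] * utilities[action][h] for h in belief)

# An expected-utility maximizer picks the action with the highest score.
best = max(utilities, key=lambda a: expected_utility(a, posterior))
```

The coherence arguments mentioned above say, roughly, that any agent not exploitable by sure-loss bets must act as if it were running some version of this prior-update-maximize loop.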

If you have time to read a textbook to gain general familiarity with AI, “Artificial Intelligence: A Modern Approach” is highly recommended.

Key terms

  • Advanced agent

  • Value / goodness / beneficial

  • Artificial General Intelligence

  • Superintelligence

Key factual propositions

  • Orthogonality thesis

  • Instrumental convergence thesis

  • Capability gain

  • Complexity and fragility of cosmopolitan value

  • Alignment difficulty

Advocated basic principles

  • Nonadversarial principle

  • Minimality principle

  • Intrinsic safety

  • Unreliable monitoring %note: Treat human monitoring as expensive, unreliable, and fragile.%

Advocated methodology

  • Generalized security mindset

  • Foreseeable difficulties

Advocated design principles

  • Value alignment

  • Corrigibility

  • Mild optimization / Taskishness

  • Conservatism

  • Transparency

  • Whitelisting

  • Redzoning

Current research areas

  • Cooperative inverse reinforcement learning

  • Interruptibility

  • Utility switching

  • Stable self-modification & Vingean reflection

  • Mirror models

  • Robust machine learning

Open problems

  • Environmental goals

  • Other-izer problem

  • Impact penalty

  • Shutdown utility function & abortability

  • Fully updated deference

  • Epistemic exclusion

Partially solved problems

  • Logical uncertainty for doubly-exponential reflective agents

  • Interruptibility for non-reflective non-consequentialist Q-learners

  • Strategy-determined Newcomblike problems

  • Leverage prior

Future work

  • Ontology identification problem

  • Behaviorism

  • Averting instrumental strategies

  • Reproducibility