AI alignment

Eliezer Yudkowsky26 Mar 2015 23:12 UTC

Executable philosophy
Philosophical discourse aimed at producing a trustworthy answer or meta-answer, in limited time, which can used in constructing an Artificial Intelligence.
Some computations are people
It’s possible to have a conscious person being simulated inside a computer or other substrate.
Researchers in value alignment theory
Who’s working full-time in value alignment theory?
- Nick Bostrom
  Nick Bostrom, secretly the inventor of Friendly AI
The rocket alignment problem
If people talked about the problem of space travel the way they talked about AI…
Vingean reflection
The problem of thinking about your future self when it’s smarter than you.
- Vinge's Principle
  An agent building another agent must usually approve its design without knowing the agent’s exact policy choices.
- Reflective stability
  Wanting to think the way you currently think, building other agents and self-modifications that think the same way.
  - Reflectively consistent degree of freedom
    When an instrumentally efficient, self-modifying AI can be like X or like X’ in such a way that X wants to be X and X’ wants to be X’, that’s a reflectively consistent degree of freedom.
    - Humean degree of freedom
      A concept includes ‘Humean degrees of freedom’ when the intuitive borders of the human version of that concept depend on our values, making that concept less natural for AIs to learn.
    - Value-laden
      Cure cancer, but avoid any bad side effects? Categorizing “bad side effects” requires knowing what’s “bad”. If an agent needs to load complex human goals to evaluate something, it’s “value-laden”.
  - Other-izing (wanted: new optimization idiom)
    Maximization isn’t possible for bounded agents, and satisficing doesn’t seem like enough. What other kind of ‘izing’ might be good for realistic, bounded agents?
  - Consequentialist preferences are reflectively stable by default
    Gandhi wouldn’t take a pill that made him want to kill people, because he knows in that case more people will be murdered. A paperclip maximizer doesn’t want to stop maximizing paperclips.
- Tiling agents theory
  The theory of self-modifying agents that build successors that are very similar to themselves, like repeating tiles on a tesselated plane.
- Reflective consistency
  A decision system is reflectively consistent if it can approve of itself, or approve the construction of similar decision systems (as well as perhaps approving other decision systems too).
Correlated coverage
In which parts of AI alignment can we hope that getting many things right, will mean the AI gets everything right?
Modeling distant superintelligences
The several large problems that might occur if an AI starts to think about alien superintelligences.
- Distant superintelligences can coerce the most probable environment of your AI
  Distant superintelligences may be able to hack your local AI, if your AI’s preference framework depends on its most probable environment.
Strategic AGI typology
What broad types of advanced AIs, corresponding to which strategic scenarios, might it be possible or wise to create?
- Known-algorithm non-self-improving agent
  Possible advanced AIs that aren’t self-modifying, aren’t self-improving, and where we know and understand all the component algorithms.
- Autonomous AGI
  The hardest possible class of Friendly AI to build, with the least moral hazard; an AI intended to neither require nor accept further direction.
- Task-directed AGI
  An advanced AI that’s meant to pursue a series of limited-scope goals given it by the user. In Bostrom’s terminology, a Genie.
  - Behaviorist genie
    An advanced agent that’s forbidden to model minds in too much detail.
  - Epistemic exclusion
    How would you build an AI that, no matter what else it learned about the world, never knew or wanted to know what was inside your basement?
  - Open subproblems in aligning a Task-based AGI
    Open research problems, especially ones we can model today, in building an AGI that can “paint all cars pink” without turning its future light cone into pink-painted cars.
  - Low impact
    The open problem of having an AI carry out tasks in ways that cause minimum side effects and change as little of the rest of the universe as possible.
    - Shutdown utility function
      A special case of a low-impact utility function where you just want the AGI to switch itself off harmlessly (and not create subagents to make absolutely sure it stays off, etcetera).
    - Abortable plans
      Plans that can be undone, or switched to having low further impact. If the AI builds abortable nanomachines, they’ll have a quiet self-destruct option that includes any replicated nanomachines.
  - Conservative concept boundary
    Given N example burritos, draw a boundary around what is a ‘burrito’ that is relatively simple and allows as few positive instances as possible. Helps make sure the next thing generated is a burrito.
  - Querying the AGI user
    Postulating that an advanced agent will check something with its user, probably comes with some standard issues and gotchas (e.g., prioritizing what to query, not manipulating the user, etc etc).
  - Mild optimization
    An AGI which, if you ask it to paint one car pink, just paints one car pink and doesn’t tile the universe with pink-painted cars, because it’s not trying that hard to max out its car-painting score.
  - Task identification problem
    If you have a task-based AGI (Genie) then how do you pinpoint exactly what you want it to do (and not do)?
    - Look where I'm pointing, not at my finger
      When trying to communicate the concept “glove”, getting the AGI to focus on “gloves” rather than “my user’s decision to label something a glove” or “anything that depresses the glove-labeling button”.
  - Safe plan identification and verification
    On a particular task or problem, the issue of how to communicate to the AGI what you want it to do and all the things you don’t want it to do.
    - Do-What-I-Mean hierarchy
      Successive levels of “Do What I Mean” or AGIs that understand their users increasingly well
  - Faithful simulation
    How would you identify, to a Task AGI (aka Genie), the problem of scanning a human brain, and then running a sufficiently accurate simulation of it for the simulation to not be crazy or psychotic?
  - Task (AI goal)
    When building the first AGIs, it may be wiser to assign them only goals that are bounded in space and time, and can be satisfied by bounded efforts.
  - Limited AGI
    Task-based AGIs don’t need unlimited cognitive and material powers to carry out their Tasks; which means their powers can potentially be limited.
  - Oracle
    System designed to safely answer questions.
    - Zermelo-Fraenkel provability oracle
      We might be able to build a system that can safely inform us that a theorem has a proof in set theory, but we can’t see how to use that capability to save the world.
  - Boxed AI
    Idea: what if we limit how AI can interact with the world. That’ll make it safe, right??
    - Zermelo-Fraenkel provability oracle
      We might be able to build a system that can safely inform us that a theorem has a proof in set theory, but we can’t see how to use that capability to save the world.
- Oracle
  System designed to safely answer questions.
  - Zermelo-Fraenkel provability oracle
    We might be able to build a system that can safely inform us that a theorem has a proof in set theory, but we can’t see how to use that capability to save the world.
Sufficiently optimized agents appear coherent
If you could think as well as a superintelligence, you’d be at least that smart yourself.
Relevant powerful agents will be highly optimized
Strong cognitive uncontainability
An advanced agent can win in ways humans can’t understand in advance.
Advanced safety
An agent is really safe when it has the capacity to do anything, but chooses to do what the programmer wants.
- Methodology of unbounded analysis
  What we do and don’t understand how to do, using unlimited computing power, is a critical distinction and important frontier.
  - AIXI
    How to build an (evil) superintelligent AI using unlimited computing power and one page of Python code.
    - AIXI-tl
      A time-bounded version of the ideal agent AIXI that uses an impossibly large finite computer instead of a hypercomputer.
  - Solomonoff induction
    A simple way to superintelligently predict sequences of data, given unlimited computing power.
    - Solomonoff induction: Intro Dialogue (Math 2)
      An introduction to Solomonoff induction for the unfamiliar reader who isn’t bad at math
  - Hypercomputer
    Some formalisms demand computers larger than the limit of all finite computers
  - Unphysically large finite computer
    The imaginary box required to run programs that require impossibly large, but finite, amounts of computing power.
  - Cartesian agent
    Agents separated from their environments by impermeable barriers through which only sensory information can enter and motor output can exit.
    - Cartesian agent-environment boundary
      If your agent is separated from the environment by an absolute border that can only be crossed by sensory information and motor outputs, it might just be a Cartesian agent.
  - Mechanical Turk (example)
    The 19th-century chess-playing automaton known as the Mechanical Turk actually had a human operator inside. People at the time had interesting thoughts about the possibility of mechanical chess.
  - No-Free-Lunch theorems are often irrelevant
    There’s often a theorem proving that some problem has no optimal answer across every possible world. But this may not matter, since the real world is a special case. (E.g., a low-entropy universe.)
- AI safety mindset
  Asking how AI designs could go wrong, instead of imagining them going right.
  - Valley of Dangerous Complacency
    When the AGI works often enough that you let down your guard, but it still has bugs. Imagine a robotic car that almost always steers perfectly, but sometimes heads off a cliff.
  - Show me what you've broken
    To demonstrate competence at computer security, or AI alignment, think in terms of breaking proposals and finding technically demonstrable flaws in them.
  - Ad-hoc hack (alignment theory)
    A “hack” is when you alter the behavior of your AI in a way that defies, or doesn’t correspond to, a principled approach for that problem.
  - Don't try to solve the entire alignment problem
    New to AI alignment theory? Want to work in this area? Already been working in it for years? Don’t try to solve the entire alignment problem with your next good idea!
  - Flag the load-bearing premises
    If somebody says, “This AI safety plan is going to fail, because X” and you reply, “Oh, that’s fine because of Y and Z”, then you’d better clearly flag Y and Z as “load-bearing” parts of your plan.
  - Directing, vs. limiting, vs. opposing
    Getting the AI to compute the right action in a domain; versus getting the AI to not compute at all in an unsafe domain; versus trying to prevent the AI from acting successfully. (Prefer 1 & 2.)
- Optimization daemons
  When you optimize something so hard that it crystalizes into an optimizer, like the way natural selection optimized apes so hard they turned into human-level intelligences
- Nearest unblocked strategy
  If you patch an agent’s preference framework to avoid an undesirable solution, what can you expect to happen?
- Safe but useless
  Sometimes, at the end of locking down your AI so that it seems extremely safe, you’ll end up with an AI that can’t be used to do anything interesting.
- Distinguish which advanced-agent properties lead to the foreseeable difficulty
  Say what kind of AI, or threshold level of intelligence, or key type of advancement, first produces the difficulty or challenge you’re talking about.
- Goodness estimate biaser
  Some of the main problems in AI alignment can be seen as scenarios where actual goodness is likely to be systematically lower than a broken way of estimating goodness.
- Goodhart's Curse
  The Optimizer’s Curse meets Goodhart’s Law. For example, if our values are V, and an AI’s utility function U is a proxy for V, optimizing for high U seeks out ‘errors’—that is, high values of U—V.
- Context disaster
  Some possible designs cause your AI to behave nicely while developing, and behave a lot less nicely when it’s smarter.
- Methodology of foreseeable difficulties
  Building a nice AI is likely to be hard enough, and contain enough gotchas that won’t show up in the AI’s early days, that we need to foresee problems coming in advance.
- Actual effectiveness
  If you want the AI’s so-called ‘utility function’ to actually be steering the AI, you need to think about how it meshes up with beliefs, or what gets output to actions.
Relevant powerful agent
An agent is relevant if it completely changes the course of history.
Informed oversight
Incentivize a reinforcement learner that’s less smart than you to accomplish some task
Safe training procedures for human-imitators
How does one train a reinforcement learner to act like a human?
Reliable prediction
How can we train predictors that reliably predict observable phenomena such as human behavior?
Selective similarity metrics for imitation
Can we make human-imitators more efficient by scoring them more heavily on imitating the aspects of human behavior we care about more?
Relevant limited AI
Can we have a limited AI, that’s nonetheless relevant?
Value achievement dilemma
How can Earth-originating intelligent life achieve most of its potential value, whether by AI or otherwise?
- Moral hazards in AGI development
  “Moral hazard” is when owners of an advanced AGI give in to the temptation to do things with it that the rest of us would regard as ‘bad’, like, say, declaring themselves God-Emperor.
- Coordinative AI development hypothetical
  What would safe AI development look like if we didn’t have to worry about anything else?
- Pivotal act
  Which types of AIs, if they work, can do things that drastically change the nature of the further game?
- Cosmic endowment
  The ‘cosmic endowment’ consists of all the stars that could be reached from probes originating on Earth; the sum of all matter and energy potentially available to be transformed into life and fun.
- Aligning an AGI adds significant development time
  Aligning an advanced AI foreseeably involves extra code and extra testing and not being able to do everything the fastest way, so it takes longer.
VAT playpen
Playpen page for VAT domain.
Nick Bostrom's book Superintelligence
The current best book-form introduction to AI alignment theory.
List: value-alignment subjects
Bullet point list of core VAT subjects.
AI arms races
AI arms races are bad
Corrigibility
“I can’t let you do that, Dave.”
- Programmer deception
  - Cognitive steganography
    Disaligned AIs that are modeling human psychology and trying to deceive their programmers will want to hide their internal thought processes from their programmers.
- Utility indifference
  How can we make an AI indifferent to whether we press a button that changes its goals?
- Averting instrumental pressures
  Almost-any utility function for an AI, whether the target is diamonds or paperclips or eudaimonia, implies subgoals like rapidly self-improving and refusing to shut down. Can we make that not happen?
- Averting the convergent instrumental strategy of self-improvement
  We probably want the first AGI to not improve as fast as possible, but improving as fast as possible is a convergent strategy for accomplishing most things.
- Shutdown problem
  How to build an AGI that lets you shut it down, despite the obvious fact that this will interfere with whatever the AGI’s goals are.
  - You can't get the coffee if you're dead
    An AI given the goal of ‘get the coffee’ can’t achieve that goal if it has been turned off; so even an AI whose goal is just to fetch the coffee may try to avert a shutdown button being pressed.
- User manipulation
  If not otherwise averted, many of an AGI’s desired outcomes are likely to interact with users and hence imply an incentive to manipulate users.
  - User maximization
    A sub-principle of avoiding user manipulation—if you see an argmax over X or ‘optimize X’ instruction and X includes a user interaction, you’ve just told the AI to optimize the user.
- Hard problem of corrigibility
  Can you build an agent that reasons as if it knows itself to be incomplete and sympathizes with your wanting to rebuild or correct it?
- Problem of fully updated deference
  Why moral uncertainty doesn’t stop an AI from defending its off-switch.
- Interruptibility
  A subproblem of corrigibility under the machine learning paradigm: when the agent is interrupted, it must not learn to prevent future interruptions.
Unforeseen maximum
When you tell AI to produce world peace and it kills everyone. (Okay, some SF writers saw that one coming.)
- Missing the weird alternative
  People might systematically overlook “make tiny molecular smileyfaces” as a way of “producing smiles”, because our brains automatically search for high-utility-to-us ways of “producing smiles”.
Patch resistance
One does not simply solve the value alignment problem.
- Unforeseen maximum
  When you tell AI to produce world peace and it kills everyone. (Okay, some SF writers saw that one coming.)
  - Missing the weird alternative
    People might systematically overlook “make tiny molecular smileyfaces” as a way of “producing smiles”, because our brains automatically search for high-utility-to-us ways of “producing smiles”.
Coordinative AI development hypothetical
What would safe AI development look like if we didn’t have to worry about anything else?
Safe impact measure
What can we measure to make sure an agent is acting in a safe manner?
AI alignment open problem
Tag for open problems under AI alignment.
Natural language understanding of "right" will yield normativity
What will happen if you tell an advanced agent to do the “right” thing?
Identifying ambiguous inductions
What do a “red strawberry”, a “red apple”, and a “red cherry” have in common that a “yellow carrot” doesn’t? Are they “red fruits” or “red objects”?
Value
The word ‘value’ in the phrase ‘value alignment’ is a metasyntactic variable that indicates the speaker’s future goals for intelligent life.
- Extrapolated volition (normative moral theory)
  If someone asks you for orange juice, and you know that the refrigerator contains no orange juice, should you bring them lemonade?
  - Rescuing the utility function
    If your utility function values ‘heat’, and then you discover to your horror that there’s no ontologically basic heat, switch to valuing disordered kinetic energy. Likewise ‘free will’ or ‘people’.
- Coherent extrapolated volition (alignment target)
  A proposed direction for an extremely well-aligned autonomous superintelligence—do what humans would want, if we knew what the AI knew, thought that fast, and understood ourselves.
- 'Beneficial'
  Really actually good. A metasyntactic variable to mean “favoring whatever the speaker wants ideally to accomplish”, although different speakers have different morals and metaethics.
- William Frankena's list of terminal values
  Life, consciousness, and activity; health and strength; pleasures and satisfactions of all or certain kinds; happiness, beatitude, contentment, etc.; truth; knowledge and true opinions…
- 'Detrimental'
  The opposite of beneficial.
- Immediate goods
- Cosmopolitan value
  Intuitively: Value as seen from a broad, embracing standpoint that is aware of how other entities may not always be like us or easily understandable to us, yet still worthwhile.
Linguistic conventions in value alignment
How and why to use precise language and words with special meaning when talking about value alignment.
- Utility
  What is “utility” in the context of Value Alignment Theory?
Development phase unpredictable
- Unforeseen maximum
  When you tell AI to produce world peace and it kills everyone. (Okay, some SF writers saw that one coming.)
  - Missing the weird alternative
    People might systematically overlook “make tiny molecular smileyfaces” as a way of “producing smiles”, because our brains automatically search for high-utility-to-us ways of “producing smiles”.
Complexity of value
There’s no simple way to describe the goals we want Artificial Intelligences to want.
- Underestimating complexity of value because goodness feels like a simple property
  When you just want to yell at the AI, “Just do normal high-value X, dammit, not weird low-value X!” and that ‘high versus low value’ boundary is way more complicated than your brain wants to think.
- Meta-rules for (narrow) value learning are still unsolved
  We don’t currently know a simple meta-utility function that would take in observation of humans and spit out our true values, or even a good target for a Task AGI.
Value alignment problem
You want to build an advanced AI with the right values… but how?
- Total alignment
  We say that an advanced AI is “totally aligned” when it knows exactly which outcomes and plans are beneficial, with no further user input.
- Preference framework
  What’s the thing an agent uses to compare its preferences?
  - Moral uncertainty
    A meta-utility function in which the utility function as usually considered, takes on different values in different possible worlds, potentially distinguishable by evidence.
    - Ideal target
      The ‘ideal target’ of a meta-utility function is the value the ground-level utility function would take on if the agent updated on all possible evidence; the ‘true’ utilities under moral uncertainty.
  - Meta-utility function
    Preference frameworks built out of simple utility functions, but where, e.g., the ‘correct’ utility function for a possible world depends on whether a button is pressed.
  - Attainable optimum
    The ‘attainable optimum’ of an agent’s preferences is the best that agent can actually do given its finite intelligence and resources (as opposed to the global maximum of those preferences).
Object-level vs. indirect goals
Difference between “give Alice the apple” and “give Alice what she wants”.
Value identification problem
- Happiness maximizer
- Edge instantiation
  When you ask the AI to make people happy, and it tiles the universe with the smallest objects that can be happy.
- Identifying causal goal concepts from sensory data
  If the intended goal is “cure cancer” and you show the AI healthy patients, it sees, say, a pattern of pixels on a webcam. How do you get to a goal concept about the real patients?
- Goal-concept identification
  Figuring out how to say “strawberry” to an AI that you want to bring you strawberries (and not fake plastic strawberries, either).
- Ontology identification problem
  How do we link an agent’s utility function to its model of the world, when we don’t know what that model will look like?
  - Diamond maximizer
    How would you build an agent that made as much diamond material as possible, given vast computing power but an otherwise rich and complicated environment?
  - Ontology identification problem: Technical tutorial
    Technical tutorial for ontology identification problem.
- Environmental goals
  The problem of having an AI want outcomes that are out in the world, not just want direct sense events.
Intended goal
Mindcrime
Might a machine intelligence contain vast numbers of unhappy conscious subprocesses?
- Mindcrime: Introduction
- Nonperson predicate
  If we knew which computations were definitely not people, we could tell AIs which programs they were definitely allowed to compute.
Task-directed AGI
An advanced AI that’s meant to pursue a series of limited-scope goals given it by the user. In Bostrom’s terminology, a Genie.
- Behaviorist genie
  An advanced agent that’s forbidden to model minds in too much detail.
- Epistemic exclusion
  How would you build an AI that, no matter what else it learned about the world, never knew or wanted to know what was inside your basement?
- Open subproblems in aligning a Task-based AGI
  Open research problems, especially ones we can model today, in building an AGI that can “paint all cars pink” without turning its future light cone into pink-painted cars.
- Low impact
  The open problem of having an AI carry out tasks in ways that cause minimum side effects and change as little of the rest of the universe as possible.
  - Shutdown utility function
    A special case of a low-impact utility function where you just want the AGI to switch itself off harmlessly (and not create subagents to make absolutely sure it stays off, etcetera).
  - Abortable plans
    Plans that can be undone, or switched to having low further impact. If the AI builds abortable nanomachines, they’ll have a quiet self-destruct option that includes any replicated nanomachines.
- Conservative concept boundary
  Given N example burritos, draw a boundary around what is a ‘burrito’ that is relatively simple and allows as few positive instances as possible. Helps make sure the next thing generated is a burrito.
- Querying the AGI user
  Postulating that an advanced agent will check something with its user, probably comes with some standard issues and gotchas (e.g., prioritizing what to query, not manipulating the user, etc etc).
- Mild optimization
  An AGI which, if you ask it to paint one car pink, just paints one car pink and doesn’t tile the universe with pink-painted cars, because it’s not trying that hard to max out its car-painting score.
- Task identification problem
  If you have a task-based AGI (Genie) then how do you pinpoint exactly what you want it to do (and not do)?
  - Look where I'm pointing, not at my finger
    When trying to communicate the concept “glove”, getting the AGI to focus on “gloves” rather than “my user’s decision to label something a glove” or “anything that depresses the glove-labeling button”.
- Safe plan identification and verification
  On a particular task or problem, the issue of how to communicate to the AGI what you want it to do and all the things you don’t want it to do.
  - Do-What-I-Mean hierarchy
    Successive levels of “Do What I Mean” or AGIs that understand their users increasingly well
- Faithful simulation
  How would you identify, to a Task AGI (aka Genie), the problem of scanning a human brain, and then running a sufficiently accurate simulation of it for the simulation to not be crazy or psychotic?
- Task (AI goal)
  When building the first AGIs, it may be wiser to assign them only goals that are bounded in space and time, and can be satisfied by bounded efforts.
- Limited AGI
  Task-based AGIs don’t need unlimited cognitive and material powers to carry out their Tasks; which means their powers can potentially be limited.
- Oracle
  System designed to safely answer questions.
  - Zermelo-Fraenkel provability oracle
    We might be able to build a system that can safely inform us that a theorem has a proof in set theory, but we can’t see how to use that capability to save the world.
- Boxed AI
  Idea: what if we limit how AI can interact with the world. That’ll make it safe, right??
  - Zermelo-Fraenkel provability oracle
    We might be able to build a system that can safely inform us that a theorem has a proof in set theory, but we can’t see how to use that capability to save the world.
Principles in AI alignment
A ‘principle’ of AI alignment is a very general design goal like ‘understand what the heck is going on inside the AI’ that has informed a wide set of specific design proposals.
- Non-adversarial principle
  At no point in constructing an Artificial General Intelligence should we construct a computation that tries to hurt us, and then try to stop it from hurting us.
  - Omnipotence test for AI safety
    Would your AI produce disastrous outcomes if it suddenly gained omnipotence and omniscience? If so, why did you program something that wants to hurt you and is held back only by lacking the power?
  - Niceness is the first line of defense
    The first line of defense in dealing with any partially superhuman AI system advanced enough to possibly be dangerous is that it does not want to hurt you or defeat your safety measures.
  - Directing, vs. limiting, vs. opposing
    Getting the AI to compute the right action in a domain; versus getting the AI to not compute at all in an unsafe domain; versus trying to prevent the AI from acting successfully. (Prefer 1 & 2.)
  - The AI must tolerate your safety measures
    A corollary of the nonadversarial principle is that “The AI must tolerate your safety measures.”
  - Generalized principle of cognitive alignment
    When we’re asking how we want the AI to think about an alignment problem, one source of inspiration is trying to have the AI mirror our own thoughts about that problem.
- Minimality principle
  The first AGI ever built should save the world in a way that requires the least amount of the least dangerous cognition.
- Understandability principle
  The more you understand what the heck is going on inside your AI, the safer you are.
  - Effability principle
    You are safer the more you understand the inner structure of how your AI thinks; the better you can describe the relation of smaller pieces of the AI’s thought process.
- Separation from hyperexistential risk
  The AI should be widely separated in the design space from any AI that would constitute a “hyperexistential risk” (anything worse than death).
Theory of (advanced) agents
One of the research subproblems of building powerful nice AIs, is the theory of (sufficiently advanced) minds in general.
- Instrumental convergence
  Some strategies can help achieve most possible simple goals. E.g., acquiring more computing power or more material resources. By default, unless averted, we can expect advanced AIs to do that.
  - Paperclip maximizer
    This agent will not stop until the entire universe is filled with paperclips.
    - Paperclip
      A configuration of matter that we’d see as being worthless even from a very cosmopolitan perspective.
    - Random utility function
      A ‘random’ utility function is one chosen at random according to some simple probability measure (e.g. weight by Kolmorogov complexity) on a logical space of formal utility functions.
  - Instrumental
    What is “instrumental” in the context of Value Alignment Theory?
  - Instrumental pressure
    A consequentialist agent will want to bring about certain instrumental events that will help to fulfill its goals.
  - Convergent instrumental strategies
    Paperclip maximizers can make more paperclips by improving their cognitive abilities or controlling more resources. What other strategies would almost-any AI try to use?
    - Convergent strategies of self-modification
      The strategies we’d expect to be employed by an AI that understands the relevance of its code and hardware to achieving its goals, which therefore has subgoals about its code and hardware.
    - Consequentialist preferences are reflectively stable by default
      Gandhi wouldn’t take a pill that made him want to kill people, because he knows in that case more people will be murdered. A paperclip maximizer doesn’t want to stop maximizing paperclips.
  - You can't get more paperclips that way
    Most arguments that “A paperclip maximizer could get more paperclips by (doing nice things)” are flawed.
- Orthogonality Thesis
  Will smart AIs automatically become benevolent, or automatically become hostile? Or do different AI designs imply different goals?
  - Paperclip maximizer
    This agent will not stop until the entire universe is filled with paperclips.
    - Paperclip
      A configuration of matter that we’d see as being worthless even from a very cosmopolitan perspective.
    - Random utility function
      A ‘random’ utility function is one chosen at random according to some simple probability measure (e.g. weight by Kolmorogov complexity) on a logical space of formal utility functions.
  - Mind design space is wide
    Imagine all human beings as one tiny dot inside a much vaster sphere of possibilities for “The space of minds in general.” It is wiser to make claims about some minds than all minds.
  - Instrumental goals are almost-equally as tractable as terminal goals
    Getting the milk from the refrigerator because you want to drink it, is not vastly harder than getting the milk from the refrigerator because you inherently desire it.
- Advanced agent properties
  How smart does a machine intelligence need to be, for its niceness to become an issue? “Advanced” is a broad term to cover cognitive abilities such that we’d need to start considering AI alignment.
  - Big-picture strategic awareness
    We start encountering new AI alignment issues at the point where a machine intelligence recognizes the existence of a real world, the existence of programmers, and how these relate to its goals.
  - Superintelligent
    A “superintelligence” is strongly superhuman (strictly higher-performing than any and all humans) on every cognitive problem.
  - Intelligence explosion
    What happens if a self-improving AI gets to the point where each amount x of self-improvement triggers >x further self-improvement, and it stays that way for a while.
  - Artificial General Intelligence
    An AI which has the same kind of “significantly more general” intelligence that humans have compared to chimpanzees; it can learn new domains, like we can.
  - Advanced nonagent
    Hypothetically, cognitively powerful programs that don’t follow the loop of “observe, learn, model the consequences, act, observe results” that a standard “agent” would.
  - Epistemic and instrumental efficiency
    An efficient agent never makes a mistake you can predict. You can never successfully predict a directional bias in its estimates.
    - Time-machine metaphor for efficient agents
      Don’t imagine a paperclip maximizer as a mind. Imagine it as a time machine that always spits out the output leading to the greatest number of future paperclips.
  - Standard agent properties
    What’s a Standard Agent, and what can it do?
    - Bounded agent
      An agent that operates in the real world, using realistic amounts of computing power, that is uncertain of its environment, etcetera.
  - Real-world domain
    Some AIs play chess, some AIs play Go, some AIs drive cars. These different ‘domains’ present different options. All of reality, in all its messy entanglement, is the ‘real-world domain’.
  - Sufficiently advanced Artificial Intelligence
    ‘Sufficiently advanced Artificial Intelligences’ are AIs with enough ‘advanced agent properties’ that we start needing to do ‘AI alignment’ to them.
  - Infrahuman, par-human, superhuman, efficient, optimal
    A categorization of AI ability levels relative to human, with some gotchas in the ordering. E.g., in simple domains where humans can play optimally, optimal play is not superhuman.
  - General intelligence
    Compared to chimpanzees, humans seem to be able to learn a much wider variety of domains. We have ‘significantly more generally applicable’ cognitive abilities, aka ‘more general intelligence’.
  - Corporations vs. superintelligences
    Corporations have relatively few of the advanced-agent properties that would allow one mistake in aligning a corporation to immediately kill all humans and turn the future light cone into paperclips.
  - Cognitive uncontainability
    ‘Cognitive uncontainability’ is when we can’t hold all of an agent’s possibilities inside our own minds.
    - Rich domain
      - Logical game
        Game’s mathematical structure at its purest form.
      - Almost all real-world domains are rich
        Anything you’re trying to accomplish in the real world can potentially be accomplished in a lot of different ways.
  - Vingean uncertainty
    You can’t predict the exact actions of an agent smarter than you—so is there anything you can say about them?
    - Vinge's Law
      You can’t predict exactly what someone smarter than you would do, because if you could, you’d be that smart yourself.
    - Deep Blue
      The chess-playing program, built by IBM, that first won the world chess championship from Garry Kasparov in 1996.
  - Consequentialist cognition
    The cognitive ability to foresee the consequences of actions, prefer some outcomes to others, and output actions leading to the preferred outcomes.
Difficulty of AI alignment
How hard is it exactly to point an Artificial General Intelligence in an intuitively okay direction?
Glossary (Value Alignment Theory)
Words that have a special meaning in the context of creating nice AIs.
- Friendly AI
  Old terminology for an AI whose preferences have been successfully aligned with idealized human values.
- Cognitive domain
  An allegedly compact unit of knowledge, such that ideas inside the unit interact mainly with each other and less with ideas in other domains.
  - Distances between cognitive domains
    Often in AI alignment we want to ask, “How close is ‘being able to do X’ to ‘being able to do Y’?”
- 'Concept'
  In the context of Artificial Intelligence, a ‘concept’ is a category, something that identifies thingies as being inside or outside the concept.
Programmer
Who is building these advanced agents?