AI alignment

  • Executable philosophy

    Philosophical discourse aimed at producing a trustworthy answer or meta-answer, in limited time, which can be used in constructing an Artificial Intelligence.

  • Some computations are people

    It’s possible for a conscious person to be simulated inside a computer or other substrate.

  • Researchers in value alignment theory

    Who’s working full-time in value alignment theory?

    • Nick Bostrom

      Nick Bostrom, secretly the inventor of Friendly AI

  • The rocket alignment problem

    If people talked about the problem of space travel the way they talked about AI…

  • Vingean reflection

    The problem of thinking about your future self when it’s smarter than you.

    • Vinge's Principle

      An agent building another agent must usually approve its design without knowing the agent’s exact policy choices.

    • Reflective stability

      Wanting to think the way you currently think, and to build other agents and self-modifications that think the same way.

      • Reflectively consistent degree of freedom

        When an instrumentally efficient, self-modifying AI can be like X or like X’ in such a way that X wants to be X and X’ wants to be X’, that’s a reflectively consistent degree of freedom.

        • Humean degree of freedom

          A concept includes ‘Humean degrees of freedom’ when the intuitive borders of the human version of that concept depend on our values, making that concept less natural for AIs to learn.

        • Value-laden

          Cure cancer, but avoid any bad side effects? Categorizing “bad side effects” requires knowing what’s “bad”. If an agent needs to load complex human goals to evaluate something, it’s “value-laden”.

      • Other-izing (wanted: new optimization idiom)

        Maximization isn’t possible for bounded agents, and satisficing doesn’t seem like enough. What other kind of ‘izing’ might be good for realistic, bounded agents?
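
        One candidate idiom that has been proposed is quantilization: instead of taking the argmax, sample from the top q fraction of actions as ranked by expected utility. A minimal sketch, with a made-up scoring function, and no claim that this is the answer the page is asking for:

        ```python
        import random

        def quantilize(actions, utility, q=0.1, rng=random):
            """Sample uniformly from the top q fraction of actions ranked by utility.

            q = 1.0 picks a uniformly random action; as q -> 0 this approaches pure
            maximization, so q tunes how hard the agent pushes on its own score.
            """
            ranked = sorted(actions, key=utility, reverse=True)
            top = ranked[: max(1, int(len(ranked) * q))]
            return rng.choice(top)

        # Toy usage with a made-up scoring function.
        actions = list(range(100))
        print(quantilize(actions, utility=lambda a: -(a - 42) ** 2))
        ```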

      • Consequentialist preferences are reflectively stable by default

        Gandhi wouldn’t take a pill that made him want to kill people, because he knows that more people would then be murdered. A paperclip maximizer doesn’t want to stop maximizing paperclips.

    • Tiling agents theory

      The theory of self-modifying agents that build successors that are very similar to themselves, like repeating tiles on a tessellated plane.

    • Reflective consistency

      A decision system is reflectively consistent if it can approve of itself, or approve the construction of similar decision systems (as well as perhaps approving other decision systems too).

  • Correlated coverage

    In which parts of AI alignment can we hope that getting many things right will mean the AI gets everything right?

  • Modeling distant superintelligences

    The several large problems that might occur if an AI starts to think about alien superintelligences.

  • Strategic AGI typology

    What broad types of advanced AIs, corresponding to which strategic scenarios, might it be possible or wise to create?

    • Known-algorithm non-self-improving agent

      Possible advanced AIs that aren’t self-modifying, aren’t self-improving, and where we know and understand all the component algorithms.

    • Autonomous AGI

      The hardest possible class of Friendly AI to build, with the least moral hazard; an AI intended to neither require nor accept further direction.

    • Task-directed AGI

      An advanced AI that’s meant to pursue a series of limited-scope goals given it by the user. In Bostrom’s terminology, a Genie.

      • Behaviorist genie

        An advanced agent that’s forbidden to model minds in too much detail.

      • Epistemic exclusion

        How would you build an AI that, no matter what else it learned about the world, never knew or wanted to know what was inside your basement?

      • Open subproblems in aligning a Task-based AGI

        Open research problems, especially ones we can model today, in building an AGI that can “paint all cars pink” without turning its future light cone into pink-painted cars.

      • Low impact

        The open problem of having an AI carry out tasks in ways that cause minimum side effects and change as little of the rest of the universe as possible.

        • Shutdown utility function

          A special case of a low-impact utility function where you just want the AGI to switch itself off harmlessly (and not create subagents to make absolutely sure it stays off, etcetera).

        • Abortable plans

          Plans that can be undone, or switched to having low further impact. If the AI builds abortable nanomachines, they’ll have a quiet self-destruct option that includes any replicated nanomachines.

      • Conservative concept boundary

        Given N example burritos, draw a boundary around what is a ‘burrito’ that is relatively simple and allows as few positive instances as possible. Helps make sure the next thing generated is a burrito.
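
        A minimal sketch of the flavor of this idea, under the illustrative assumption that examples are numeric feature vectors: accept only instances that fall inside the tightest axis-aligned box around the positive examples, rather than inside some broader, more ‘reasonable-looking’ generalization:

        ```python
        import numpy as np

        def conservative_box(positives):
            """Tightest axis-aligned bounding box containing all positive examples."""
            positives = np.asarray(positives, dtype=float)
            return positives.min(axis=0), positives.max(axis=0)

        def is_inside(x, box):
            lo, hi = box
            return bool(np.all(x >= lo) and np.all(x <= hi))

        # Toy usage: 2-D "burrito" feature vectors (e.g. size, spiciness).
        box = conservative_box([[1.0, 0.2], [1.2, 0.4], [0.9, 0.3]])
        print(is_inside(np.array([1.1, 0.3]), box))   # True: close to the examples
        print(is_inside(np.array([5.0, 0.3]), box))   # False: outside the boundary
        ```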

      • Querying the AGI user

        Postulating that an advanced agent will check something with its user probably comes with some standard issues and gotchas (e.g., prioritizing what to query, not manipulating the user, and so on).

      • Mild optimization

        An AGI which, if you ask it to paint one car pink, just paints one car pink and doesn’t tile the universe with pink-painted cars, because it’s not trying that hard to max out its car-painting score.

      • Task identification problem

        If you have a task-based AGI (Genie), then how do you pinpoint exactly what you want it to do (and not do)?

        • Look where I'm pointing, not at my finger

          When trying to communicate the concept “glove”, getting the AGI to focus on “gloves” rather than “my user’s decision to label something a glove” or “anything that depresses the glove-labeling button”.

      • Safe plan identification and verification

        On a particular task or problem, the issue of how to communicate to the AGI what you want it to do and all the things you don’t want it to do.

      • Faithful simulation

        How would you identify, to a Task AGI (aka Genie), the problem of scanning a human brain, and then running a sufficiently accurate simulation of it for the simulation to not be crazy or psychotic?

      • Task (AI goal)

        When building the first AGIs, it may be wiser to assign them only goals that are bounded in space and time, and can be satisfied by bounded efforts.

      • Limited AGI

        Task-based AGIs don’t need unlimited cognitive and material powers to carry out their Tasks; which means their powers can potentially be limited.

      • Oracle

        System designed to safely answer questions.

        • Zermelo-Fraenkel provability oracle

          We might be able to build a system that can safely inform us that a theorem has a proof in set theory, but we can’t see how to use that capability to save the world.

      • Boxed AI

        Idea: what if we limit how the AI can interact with the world? That’ll make it safe, right?

        • Zermelo-Fraenkel provability oracle

          We might be able to build a system that can safely inform us that a theorem has a proof in set theory, but we can’t see how to use that capability to save the world.

    • Oracle

      System designed to safely answer questions.

      • Zermelo-Fraenkel provability oracle

        We might be able to build a system that can safely inform us that a theorem has a proof in set theory, but we can’t see how to use that capability to save the world.

  • Sufficiently optimized agents appear coherent

    If you could think as well as a superintelligence, you’d be at least that smart yourself.

  • Relevant powerful agents will be highly optimized
  • Strong cognitive uncontainability

    An advanced agent can win in ways humans can’t understand in advance.

  • Advanced safety

    An agent is really safe when it has the capacity to do anything, but chooses to do what the programmer wants.

    • Methodology of unbounded analysis

      What we do and don’t understand how to do, using unlimited computing power, is a critical distinction and important frontier.

      • AIXI

        How to build an (evil) superintelligent AI using unlimited computing power and one page of Python code.

        • AIXI-tl

          A time-bounded version of the ideal agent AIXI that uses an impossibly large finite computer instead of a hypercomputer.

      • Solomonoff induction

        A simple way to superintelligently predict sequences of data, given unlimited computing power.
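
        The core recipe can be written down directly: weight every program that reproduces the data seen so far by 2^-length, and predict with the mixture. A deliberately impractical toy version, assuming a hypothetical `run(program, n_bits)` interpreter over a small space of bitstring programs (the real thing sums over all programs for a universal machine and is uncomputable):

        ```python
        def solomonoff_predict(data, programs, run):
            """Predict the next bit of `data` (a '0'/'1' string) by a 2^-length
            weighted vote over every program whose output starts with `data`.

            `programs` is an iterable of bitstring programs; `run(p, n)` is a
            hypothetical interpreter returning the first n output bits of p,
            or None if it fails to produce them.
            """
            weights = {"0": 0.0, "1": 0.0}
            for p in programs:
                out = run(p, len(data) + 1)
                if out is not None and len(out) > len(data) and out.startswith(data):
                    weights[out[len(data)]] += 2.0 ** (-len(p))
            total = weights["0"] + weights["1"]
            return {bit: w / total for bit, w in weights.items()} if total else None
        ```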

      • Hypercomputer

        Some formalisms demand computers larger than the limit of all finite computers.

      • Unphysically large finite computer

        The imaginary box required to run programs that require impossibly large, but finite, amounts of computing power.

      • Cartesian agent

        Agents separated from their environments by impermeable barriers through which only sensory information can enter and motor output can exit.

        • Cartesian agent-environment boundary

          If your agent is separated from the environment by an absolute border that can only be crossed by sensory information and motor outputs, it might just be a Cartesian agent.

      • Mechanical Turk (example)

        The 18th-century chess-playing automaton known as the Mechanical Turk actually had a human operator inside. People at the time had interesting thoughts about the possibility of mechanical chess.

      • No-Free-Lunch theorems are often irrelevant

        There’s often a theorem proving that some problem has no optimal answer across every possible world. But this may not matter, since the real world is a special case. (E.g., a low-entropy universe.)

    • AI safety mindset

      Asking how AI designs could go wrong, instead of imagining them going right.

      • Valley of Dangerous Complacency

        When the AGI works often enough that you let down your guard, but it still has bugs. Imagine a robotic car that almost always steers perfectly, but sometimes heads off a cliff.

      • Show me what you've broken

        To demonstrate competence at computer security, or AI alignment, think in terms of breaking proposals and finding technically demonstrable flaws in them.

      • Ad-hoc hack (alignment theory)

        A “hack” is when you alter the behavior of your AI in a way that defies, or doesn’t correspond to, a principled approach for that problem.

      • Don't try to solve the entire alignment problem

        New to AI alignment theory? Want to work in this area? Already been working in it for years? Don’t try to solve the entire alignment problem with your next good idea!

      • Flag the load-bearing premises

        If somebody says, “This AI safety plan is going to fail, because X” and you reply, “Oh, that’s fine because of Y and Z”, then you’d better clearly flag Y and Z as “load-bearing” parts of your plan.

      • Directing, vs. limiting, vs. opposing

        Getting the AI to compute the right action in a domain; versus getting the AI to not compute at all in an unsafe domain; versus trying to prevent the AI from acting successfully. (Prefer the first two.)

    • Optimization daemons

      When you optimize something so hard that it crystallizes into an optimizer, like the way natural selection optimized apes so hard that they turned into human-level intelligences.

    • Nearest unblocked strategy

      If you patch an agent’s preference framework to avoid an undesirable solution, what can you expect to happen?

    • Safe but useless

      Sometimes, at the end of locking down your AI so that it seems extremely safe, you’ll end up with an AI that can’t be used to do anything interesting.

    • Distinguish which advanced-agent properties lead to the foreseeable difficulty

      Say what kind of AI, or threshold level of intelligence, or key type of advancement, first produces the difficulty or challenge you’re talking about.

    • Goodness estimate biaser

      Some of the main problems in AI alignment can be seen as scenarios where actual goodness is likely to be systematically lower than a broken way of estimating goodness.

    • Goodhart's Curse

      The Optimizer’s Curse meets Goodhart’s Law. For example, if our values are V and an AI’s utility function U is a proxy for V, then optimizing for high U systematically seeks out ‘errors’: outcomes where U − V is large.
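
      A small simulation of the effect, under the illustrative assumption that U is V plus independent Gaussian noise: picking the option with the highest U systematically selects for large proxy error, so the realized V falls short of what the proxy promised.

      ```python
      import random

      random.seed(0)
      true_values = [random.gauss(0.0, 1.0) for _ in range(1000)]    # V
      proxies = [v + random.gauss(0.0, 1.0) for v in true_values]    # U = V + error

      chosen = max(range(len(proxies)), key=lambda i: proxies[i])
      print("proxy score U of the chosen option:", round(proxies[chosen], 2))
      print("true value V of the chosen option:", round(true_values[chosen], 2))
      # The chosen option's U typically far exceeds its V: optimizing the proxy
      # preferentially selects options whose error U - V happens to be large.
      ```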

    • Context disaster

      Some possible designs cause your AI to behave nicely while developing, and behave a lot less nicely when it’s smarter.

    • Methodology of foreseeable difficulties

      Building a nice AI is likely to be hard enough, and contain enough gotchas that won’t show up in the AI’s early days, that we need to foresee problems coming in advance.

    • Actual effectiveness

      If you want the AI’s so-called ‘utility function’ to actually be steering the AI, you need to think about how it meshes with the AI’s beliefs and how it actually gets translated into actions.

  • Relevant powerful agent

    An agent is relevant if it completely changes the course of history.

  • Informed oversight

    How can you incentivize a reinforcement learner that’s less smart than you to accomplish some task?

  • Safe training procedures for human-imitators

    How does one train a reinforcement learner to act like a human?

  • Reliable prediction

    How can we train predictors that reliably predict observable phenomena such as human behavior?

  • Selective similarity metrics for imitation

    Can we make human-imitators more efficient by scoring them more heavily on imitating the aspects of human behavior we care about more?

  • Relevant limited AI

    Can we have a limited AI, that’s nonetheless relevant?

  • Value achievement dilemma

    How can Earth-originating intelligent life achieve most of its potential value, whether by AI or otherwise?

    • Moral hazards in AGI development

      “Moral hazard” is when owners of an advanced AGI give in to the temptation to do things with it that the rest of us would regard as ‘bad’, like, say, declaring themselves God-Emperor.

    • Coordinative AI development hypothetical

      What would safe AI development look like if we didn’t have to worry about anything else?

    • Pivotal act

      Which types of AIs, if they work, can do things that drastically change the nature of the further game?

    • Cosmic endowment

      The ‘cosmic endowment’ consists of all the stars that could be reached from probes originating on Earth; the sum of all matter and energy potentially available to be transformed into life and fun.

    • Aligning an AGI adds significant development time

      Aligning an advanced AI foreseeably involves extra code and extra testing and not being able to do everything the fastest way, so it takes longer.

  • VAT playpen

    Playpen page for VAT domain.

  • Nick Bostrom's book Superintelligence

    The current best book-form introduction to AI alignment theory.

  • List: value-alignment subjects

    Bullet point list of core VAT subjects.

  • AI arms races

    AI arms races are bad.

  • Corrigibility

    “I can’t let you do that, Dave.”

    • Programmer deception
      • Cognitive steganography

        Disaligned AIs that are modeling human psychology and trying to deceive their programmers will want to hide their internal thought processes from their programmers.

    • Utility indifference

      How can we make an AI indifferent to whether we press a button that changes its goals?
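
      One formalization that has been discussed adds a compensating term so that, at the decision point, the agent’s expected utility is the same whether or not the button is pressed, leaving it no incentive to cause or prevent the press. A minimal sketch under that assumption (the particular correction shown is illustrative, not a vetted solution):

      ```python
      def indifferent_utility(outcome, button_pressed, u_normal, u_shutdown, correction):
          """Button-dependent meta-utility plus a correction term chosen so that,
          at the moment of decision, expected utility is equal either way."""
          if button_pressed:
              return u_shutdown(outcome) + correction
          return u_normal(outcome)

      def correction_term(expected_u_normal, expected_u_shutdown):
          # Illustrative choice: C = E[U_normal | no press] - E[U_shutdown | press].
          return expected_u_normal - expected_u_shutdown
      ```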

    • Averting instrumental pressures

      Almost any utility function for an AI, whether the target is diamonds or paperclips or eudaimonia, implies subgoals like rapidly self-improving and refusing to shut down. Can we make that not happen?

    • Averting the convergent instrumental strategy of self-improvement

      We probably want the first AGI to not improve as fast as possible, but improving as fast as possible is a convergent strategy for accomplishing most things.

    • Shutdown problem

      How to build an AGI that lets you shut it down, despite the obvious fact that this will interfere with whatever the AGI’s goals are.

      • You can't get the coffee if you're dead

        An AI given the goal of ‘get the coffee’ can’t achieve that goal if it has been turned off; so even an AI whose goal is just to fetch the coffee may try to avert a shutdown button being pressed.

    • User manipulation

      If not otherwise averted, many of an AGI’s desired outcomes are likely to interact with users and hence imply an incentive to manipulate users.

      • User maximization

        A sub-principle of avoiding user manipulation—if you see an argmax over X or ‘optimize X’ instruction and X includes a user interaction, you’ve just told the AI to optimize the user.

    • Hard problem of corrigibility

      Can you build an agent that reasons as if it knows itself to be incomplete and sympathizes with your wanting to rebuild or correct it?

    • Problem of fully updated deference

      Why moral uncertainty doesn’t stop an AI from defending its off-switch.

    • Interruptibility

      A subproblem of corrigibility under the machine learning paradigm: when the agent is interrupted, it must not learn to prevent future interruptions.

  • Unforeseen maximum

    When you tell the AI to produce world peace and it kills everyone. (Okay, some SF writers saw that one coming.)

    • Missing the weird alternative

      People might systematically overlook “make tiny molecular smileyfaces” as a way of “producing smiles”, because our brains automatically search for high-utility-to-us ways of “producing smiles”.

  • Patch resistance

    One does not simply solve the value alignment problem.

    • Unforeseen maximum

      When you tell the AI to produce world peace and it kills everyone. (Okay, some SF writers saw that one coming.)

      • Missing the weird alternative

        People might systematically overlook “make tiny molecular smileyfaces” as a way of “producing smiles”, because our brains automatically search for high-utility-to-us ways of “producing smiles”.

  • Coordinative AI development hypothetical

    What would safe AI development look like if we didn’t have to worry about anything else?

  • Safe impact measure

    What can we measure to make sure an agent is acting in a safe manner?
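
    One common shape for a candidate measure, sketched here with illustrative placeholders rather than as this page’s specific proposal, is to subtract a weighted penalty for how far the agent’s action moves the world away from a baseline state:

    ```python
    def penalized_score(utility, impact, state, baseline, weight=10.0):
        """Task utility minus a weighted impact penalty relative to a baseline.

        The feature-dict states and the particular distance below are placeholders;
        finding a measure that captures the impacts we actually care about is
        exactly the open problem.
        """
        return utility(state) - weight * impact(state, baseline)

    # Toy usage: two plans for the task "paint one car pink".
    baseline = {"cars_pink": 0, "buildings_demolished": 0}
    plan_a   = {"cars_pink": 1, "buildings_demolished": 0}
    plan_b   = {"cars_pink": 1, "buildings_demolished": 3}
    utility = lambda s: s["cars_pink"]
    impact  = lambda s, b: sum(abs(s[k] - b[k]) for k in b if k != "cars_pink")
    print(penalized_score(utility, impact, plan_a, baseline))   # 1.0
    print(penalized_score(utility, impact, plan_b, baseline))   # -29.0
    ```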

  • AI alignment open problem

    Tag for open problems under AI alignment.

  • Natural language understanding of "right" will yield normativity

    What will happen if you tell an advanced agent to do the “right” thing?

  • Identifying ambiguous inductions

    What do a “red strawberry”, a “red apple”, and a “red cherry” have in common that a “yellow carrot” doesn’t? Are they “red fruits” or “red objects”?

  • Value

    The word ‘value’ in the phrase ‘value alignment’ is a metasyntactic variable that indicates the speaker’s future goals for intelligent life.

    • Extrapolated volition (normative moral theory)

      If someone asks you for orange juice, and you know that the refrigerator contains no orange juice, should you bring them lemonade?

      • Rescuing the utility function

        If your utility function values ‘heat’, and then you discover to your horror that there’s no ontologically basic heat, switch to valuing disordered kinetic energy. Likewise ‘free will’ or ‘people’.

    • Coherent extrapolated volition (alignment target)

      A proposed direction for an extremely well-aligned autonomous superintelligence—do what humans would want, if we knew what the AI knew, thought that fast, and understood ourselves.

    • 'Beneficial'

      Really actually good. A metasyntactic variable to mean “favoring whatever the speaker wants ideally to accomplish”, although different speakers have different morals and metaethics.

    • William Frankena's list of terminal values

      Life, consciousness, and activity; health and strength; pleasures and satisfactions of all or certain kinds; happiness, beatitude, contentment, etc.; truth; knowledge and true opinions…

    • 'Detrimental'

      The opposite of beneficial.

    • Immediate goods
    • Cosmopolitan value

      Intuitively: Value as seen from a broad, embracing standpoint that is aware of how other entities may not always be like us or easily understandable to us, yet still worthwhile.

  • Linguistic conventions in value alignment

    How and why to use precise language and words with special meaning when talking about value alignment.

    • Utility

      What is “utility” in the context of Value Alignment Theory?

  • Development phase unpredictable
    • Unforeseen maximum

      When you tell the AI to produce world peace and it kills everyone. (Okay, some SF writers saw that one coming.)

      • Missing the weird alternative

        People might systematically overlook “make tiny molecular smileyfaces” as a way of “producing smiles”, because our brains automatically search for high-utility-to-us ways of “producing smiles”.

  • Complexity of value

    There’s no simple way to describe the goals we want Artificial Intelligences to want.

  • Value alignment problem

    You want to build an advanced AI with the right values… but how?

    • Total alignment

      We say that an advanced AI is “totally aligned” when it knows exactly which outcomes and plans are beneficial, with no further user input.

    • Preference framework

      What’s the thing an agent uses to compare its preferences?

      • Moral uncertainty

        A meta-utility function in which the utility function, as usually considered, takes on different values in different possible worlds, potentially distinguishable by evidence (sketched below).

        • Ideal target

          The ‘ideal target’ of a meta-utility function is the value the ground-level utility function would take on if the agent updated on all possible evidence; the ‘true’ utilities under moral uncertainty.
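
        As a concrete, illustrative rendering of the two definitions above: the agent scores an action by the credence-weighted average of the world-dependent utility functions, and the ‘ideal target’ is what that average collapses to once the evidence singles out one world.

        ```python
        def expected_meta_utility(action, credence, utility_in):
            """Score an action under moral uncertainty: a credence-weighted average
            of world-dependent utility functions (illustrative rendering)."""
            return sum(p * utility_in[world](action) for world, p in credence.items())

        # Toy usage: the agent is unsure which of two utility functions is 'correct'.
        credence = {"w1": 0.7, "w2": 0.3}
        utility_in = {"w1": lambda a: a["paperclips"], "w2": lambda a: a["staples"]}
        print(expected_meta_utility({"paperclips": 2, "staples": 10}, credence, utility_in))
        # 0.7*2 + 0.3*10 = 4.4. Updating on all possible evidence would concentrate
        # the credence on a single world, recovering that world's utility function:
        # the 'ideal target'.
        ```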

      • Meta-utility function

        Preference frameworks built out of simple utility functions, but where, e.g., the ‘correct’ utility function for a possible world depends on whether a button is pressed.

      • Attainable optimum

        The ‘attainable optimum’ of an agent’s preferences is the best that agent can actually do given its finite intelligence and resources (as opposed to the global maximum of those preferences).

  • Object-level vs. indirect goals

    Difference between “give Alice the apple” and “give Alice what she wants”.

  • Value identification problem
  • Intended goal
  • Mindcrime

    Might a machine intelligence contain vast numbers of unhappy conscious subprocesses?

  • Task-directed AGI

    An advanced AI that’s meant to pursue a series of limited-scope goals given it by the user. In Bostrom’s terminology, a Genie.

    • Behaviorist genie

      An advanced agent that’s forbidden to model minds in too much detail.

    • Epistemic exclusion

      How would you build an AI that, no matter what else it learned about the world, never knew or wanted to know what was inside your basement?

    • Open subproblems in aligning a Task-based AGI

      Open research problems, especially ones we can model today, in building an AGI that can “paint all cars pink” without turning its future light cone into pink-painted cars.

    • Low impact

      The open problem of having an AI carry out tasks in ways that cause minimum side effects and change as little of the rest of the universe as possible.

      • Shutdown utility function

        A special case of a low-impact utility function where you just want the AGI to switch itself off harmlessly (and not create subagents to make absolutely sure it stays off, etcetera).

      • Abortable plans

        Plans that can be undone, or switched to having low further impact. If the AI builds abortable nanomachines, they’ll have a quiet self-destruct option that includes any replicated nanomachines.

    • Conservative concept boundary

      Given N example burritos, draw a boundary around what is a ‘burrito’ that is relatively simple and allows as few positive instances as possible. Helps make sure the next thing generated is a burrito.

    • Querying the AGI user

      Postulating that an advanced agent will check something with its user probably comes with some standard issues and gotchas (e.g., prioritizing what to query, not manipulating the user, and so on).

    • Mild optimization

      An AGI which, if you ask it to paint one car pink, just paints one car pink and doesn’t tile the universe with pink-painted cars, because it’s not trying that hard to max out its car-painting score.

    • Task identification problem

      If you have a task-based AGI (Genie), then how do you pinpoint exactly what you want it to do (and not do)?

      • Look where I'm pointing, not at my finger

        When trying to communicate the concept “glove”, getting the AGI to focus on “gloves” rather than “my user’s decision to label something a glove” or “anything that depresses the glove-labeling button”.

    • Safe plan identification and verification

      On a particular task or problem, the issue of how to communicate to the AGI what you want it to do and all the things you don’t want it to do.

    • Faithful simulation

      How would you identify, to a Task AGI (aka Genie), the problem of scanning a human brain, and then running a sufficiently accurate simulation of it for the simulation to not be crazy or psychotic?

    • Task (AI goal)

      When building the first AGIs, it may be wiser to assign them only goals that are bounded in space and time, and can be satisfied by bounded efforts.

    • Limited AGI

      Task-based AGIs don’t need unlimited cognitive and material powers to carry out their Tasks; which means their powers can potentially be limited.

    • Oracle

      System designed to safely answer questions.

      • Zermelo-Fraenkel provability oracle

        We might be able to build a system that can safely inform us that a theorem has a proof in set theory, but we can’t see how to use that capability to save the world.

    • Boxed AI

      Idea: what if we limit how the AI can interact with the world? That’ll make it safe, right?

      • Zermelo-Fraenkel provability oracle

        We might be able to build a system that can safely inform us that a theorem has a proof in set theory, but we can’t see how to use that capability to save the world.

  • Principles in AI alignment

    A ‘principle’ of AI alignment is a very general design goal like ‘understand what the heck is going on inside the AI’ that has informed a wide set of specific design proposals.

    • Non-adversarial principle

      At no point in constructing an Artificial General Intelligence should we construct a computation that tries to hurt us, and then try to stop it from hurting us.

      • Omnipotence test for AI safety

        Would your AI produce disastrous outcomes if it suddenly gained omnipotence and omniscience? If so, why did you program something that wants to hurt you and is held back only by lacking the power?

      • Niceness is the first line of defense

        The first line of defense in dealing with any partially superhuman AI system advanced enough to possibly be dangerous is that it does not want to hurt you or defeat your safety measures.

      • Directing, vs. limiting, vs. opposing

        Getting the AI to compute the right action in a domain; versus getting the AI to not compute at all in an unsafe domain; versus trying to prevent the AI from acting successfully. (Prefer the first two.)

      • The AI must tolerate your safety measures

        A corollary of the nonadversarial principle is that “The AI must tolerate your safety measures.”

      • Generalized principle of cognitive alignment

        When we’re asking how we want the AI to think about an alignment problem, one source of inspiration is trying to have the AI mirror our own thoughts about that problem.

    • Minimality principle

      The first AGI ever built should save the world in a way that requires the least amount of the least dangerous cognition.

    • Understandability principle

      The more you understand what the heck is going on inside your AI, the safer you are.

      • Effability principle

        You are safer the more you understand the inner structure of how your AI thinks, and the better you can describe the relations among the smaller pieces of the AI’s thought process.

    • Separation from hyperexistential risk

      The AI should be widely separated in the design space from any AI that would constitute a “hyperexistential risk” (anything worse than death).

  • Theory of (advanced) agents

    One of the research subproblems of building powerful nice AIs is the theory of (sufficiently advanced) minds in general.

    • Instrumental convergence

      Some strategies can help achieve most possible simple goals. E.g., acquiring more computing power or more material resources. By default, unless averted, we can expect advanced AIs to do that.

      • Paperclip maximizer

        This agent will not stop until the entire universe is filled with paperclips.

        • Paperclip

          A configuration of matter that we’d see as being worthless even from a very cosmopolitan perspective.

        • Random utility function

          A ‘random’ utility function is one chosen at random according to some simple probability measure (e.g., weighting by Kolmogorov complexity) on a logical space of formal utility functions.
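
          A toy rendering of that definition, with programs as bitstrings, a hypothetical `decode` step mapping bitstrings to formal utility functions, and 2^-length standing in for the Kolmogorov-style measure:

          ```python
          import itertools, random

          def sample_random_utility(max_len, decode, rng=random):
              """Sample a utility function with probability proportional to 2^-len(code).

              `decode(bits)` is a hypothetical map from bitstrings to formal
              utility functions; 2^-length is a stand-in simplicity measure.
              """
              codes = ["".join(bits) for n in range(1, max_len + 1)
                       for bits in itertools.product("01", repeat=n)]
              weights = [2.0 ** (-len(c)) for c in codes]
              return decode(rng.choices(codes, weights=weights, k=1)[0])
          ```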

      • Instrumental

        What is “instrumental” in the context of Value Alignment Theory?

      • Instrumental pressure

        A consequentialist agent will want to bring about certain instrumental events that will help to fulfill its goals.

      • Convergent instrumental strategies

        Paperclip maximizers can make more paperclips by improving their cognitive abilities or controlling more resources. What other strategies would almost-any AI try to use?

      • You can't get more paperclips that way

        Most arguments that “A paperclip maximizer could get more paperclips by (doing nice things)” are flawed.

    • Orthogonality Thesis

      Will smart AIs automatically become benevolent, or automatically become hostile? Or do different AI designs imply different goals?

      • Paperclip maximizer

        This agent will not stop until the entire universe is filled with paperclips.

        • Paperclip

          A configuration of matter that we’d see as being worthless even from a very cosmopolitan perspective.

        • Random utility function

          A ‘random’ utility function is one chosen at random according to some simple probability measure (e.g., weighting by Kolmogorov complexity) on a logical space of formal utility functions.

      • Mind design space is wide

        Imagine all human beings as one tiny dot inside a much vaster sphere of possibilities for “The space of minds in general.” It is wiser to make claims about some minds than all minds.

      • Instrumental goals are almost-equally as tractable as terminal goals

        Getting the milk from the refrigerator because you want to drink it is not vastly harder than getting the milk from the refrigerator because you inherently desire it.

    • Advanced agent properties

      How smart does a machine intelligence need to be, for its niceness to become an issue? “Advanced” is a broad term to cover cognitive abilities such that we’d need to start considering AI alignment.

      • Big-picture strategic awareness

        We start encountering new AI alignment issues at the point where a machine intelligence recognizes the existence of a real world, the existence of programmers, and how these relate to its goals.

      • Superintelligent

        A “superintelligence” is strongly superhuman (strictly higher-performing than any and all humans) on every cognitive problem.

      • Intelligence explosion

        What happens if a self-improving AI gets to the point where each amount x of self-improvement triggers >x further self-improvement, and it stays that way for a while.

      • Artificial General Intelligence

        An AI which has the same kind of “significantly more general” intelligence that humans have compared to chimpanzees; it can learn new domains, like we can.

      • Advanced nonagent

        Hypothetically, cognitively powerful programs that don’t follow the loop of “observe, learn, model the consequences, act, observe results” that a standard “agent” would.

      • Epistemic and instrumental efficiency

        An efficient agent never makes a mistake you can predict. You can never successfully predict a directional bias in its estimates.

      • Standard agent properties

        What’s a Standard Agent, and what can it do?

        • Bounded agent

          An agent that operates in the real world, using realistic amounts of computing power, that is uncertain of its environment, etcetera.

      • Real-world domain

        Some AIs play chess, some AIs play Go, some AIs drive cars. These different ‘domains’ present different options. All of reality, in all its messy entanglement, is the ‘real-world domain’.

      • Sufficiently advanced Artificial Intelligence

        ‘Sufficiently advanced Artificial Intelligences’ are AIs with enough ‘advanced agent properties’ that we start needing to do ‘AI alignment’ to them.

      • Infrahuman, par-human, superhuman, efficient, optimal

        A categorization of AI ability levels relative to human, with some gotchas in the ordering. E.g., in simple domains where humans can play optimally, optimal play is not superhuman.

      • General intelligence

        Compared to chimpanzees, humans seem to be able to learn a much wider variety of domains. We have ‘significantly more generally applicable’ cognitive abilities, aka ‘more general intelligence’.

      • Corporations vs. superintelligences

        Corporations have relatively few of the advanced-agent properties that would allow one mistake in aligning a corporation to immediately kill all humans and turn the future light cone into paperclips.

      • Cognitive uncontainability

        ‘Cognitive uncontainability’ is when we can’t hold all of an agent’s possibilities inside our own minds.

      • Vingean uncertainty

        You can’t predict the exact actions of an agent smarter than you—so is there anything you can say about them?

        • Vinge's Law

          You can’t predict exactly what someone smarter than you would do, because if you could, you’d be that smart yourself.

        • Deep Blue

          The chess-playing program, built by IBM, that defeated world chess champion Garry Kasparov in their 1997 match.

      • Consequentialist cognition

        The cognitive ability to foresee the consequences of actions, prefer some outcomes to others, and output actions leading to the preferred outcomes.

  • Difficulty of AI alignment

    How hard is it exactly to point an Artificial General Intelligence in an intuitively okay direction?

  • Glossary (Value Alignment Theory)

    Words that have a special meaning in the context of creating nice AIs.

    • Friendly AI

      Old terminology for an AI whose preferences have been successfully aligned with idealized human values.

    • Cognitive domain

      An allegedly compact unit of knowledge, such that ideas inside the unit interact mainly with each other and less with ideas in other domains.

    • 'Concept'

      In the context of Artificial Intelligence, a ‘concept’ is a category, something that identifies thingies as being inside or outside the concept.

  • Programmer

    Who is building these advanced agents?