AI alignment

  • Executable philosophy

    Philosophical discourse aimed at producing a trustworthy answer or meta-answer, in limited time, which can be used in constructing an Artificial Intelligence.

  • Some computations are people

    It’s possible to have a conscious person being simulated inside a computer or other substrate.

  • Researchers in value alignment theory

    Who’s working full-time in value alignment theory?

    • Nick Bostrom

      Nick Bostrom, secretly the inventor of Friendly AI.

  • The rocket alignment problem

    If people talked about the problem of space travel the way they talked about AI…

  • Vingean reflection

    The problem of thinking about your future self when it’s smarter than you.

    • Vinge's Principle

      An agent building another agent must usually approve its design without knowing the agent’s exact policy choices.

    • Reflective stability

      Wanting to think the way you currently think, building other agents and self-modifications that think the same way.

      • Reflectively consistent degree of freedom

        When an instrumentally efficient, self-modifying AI can be like X or like X’ in such a way that X wants to be X and X’ wants to be X’, that’s a reflectively consistent degree of freedom.

        • Humean degree of freedom

          A concept includes ‘Humean degrees of freedom’ when the intuitive borders of the human version of that concept depend on our values, making that concept less natural for AIs to learn.

        • Value-laden

          Cure cancer, but avoid any bad side effects? Categorizing “bad side effects” requires knowing what’s “bad”. If an agent needs to load complex human goals to evaluate something, it’s “value-laden”.

      • Other-izing (wanted: new optimization idiom)

        Maximization isn’t possible for bounded agents, and satisficing doesn’t seem like enough. What other kind of ‘izing’ might be good for realistic, bounded agents?

      • Consequentialist preferences are reflectively stable by default

        Gandhi wouldn’t take a pill that made him want to kill people, because he knows in that case more people will be murdered. A paperclip maximizer doesn’t want to stop maximizing paperclips.
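
        As a toy illustration (not from the original page; the numbers are made up), the Gandhi argument is just the agent evaluating a proposed self-modification with its current utility function:

# Toy sketch of the Gandhi-pill argument: a consequentialist evaluates a
# proposed self-modification with its *current* utility function.
# All numbers are illustrative assumptions, not from the original page.

def expected_murders(takes_pill: bool) -> float:
    # Hypothetical world-model: murder-Gandhi goes on to kill people.
    return 10.0 if takes_pill else 0.0

def gandhi_utility(takes_pill: bool) -> float:
    # Current Gandhi disvalues murders, so fewer murders = higher utility.
    return -expected_murders(takes_pill)

best_action = max([True, False], key=gandhi_utility)
print(best_action)  # False: the current preferences reject the modification.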

    • Tiling agents theory

      The theory of self-modifying agents that build successors that are very similar to themselves, like repeating tiles on a tessellated plane.

    • Reflective consistency

      A decision system is reflectively consistent if it can approve of itself, or approve the construction of similar decision systems (as well as perhaps approving other decision systems too).

  • Correlated coverage

    In which parts of AI alignment can we hope that getting many things right will mean the AI gets everything right?

  • Modeling distant superintelligences

    The several large problems that might occur if an AI starts to think about alien superintelligences.

  • Strategic AGI typology

    What broad types of advanced AIs, corresponding to which strategic scenarios, might it be possible or wise to create?

    • Known-algorithm non-self-improving agent

      Possible advanced AIs that aren’t self-modifying, aren’t self-improving, and where we know and understand all the component algorithms.

    • Autonomous AGI

      The hardest possible class of Friendly AI to build, with the least moral hazard; an AI intended to neither require nor accept further direction.

    • Task-directed AGI

      An advanced AI that’s meant to pursue a series of limited-scope goals given it by the user. In Bostrom’s terminology, a Genie.

      • Behaviorist genie

        An advanced agent that’s forbidden to model minds in too much detail.

      • Epistemic exclusion

        How would you build an AI that, no matter what else it learned about the world, never knew or wanted to know what was inside your basement?

      • Open subproblems in aligning a Task-based AGI

        Open research problems, especially ones we can model today, in building an AGI that can “paint all cars pink” without turning its future light cone into pink-painted cars.

      • Low impact

        The open problem of having an AI carry out tasks in ways that cause minimum side effects and change as little of the rest of the universe as possible.

        • Shutdown utility function

          A special case of a low-impact utility function where you just want the AGI to switch itself off harmlessly (and not create subagents to make absolutely sure it stays off, etcetera).

        • Abortable plans

          Plans that can be undone, or switched to having low further impact. If the AI builds abortable nanomachines, they’ll have a quiet self-destruct option that includes any replicated nanomachines.

      • Conservative concept boundary

        Given N example burritos, draw a boundary around what is a ‘burrito’ that is relatively simple and allows as few positive instances as possible. Helps make sure the next thing generated is a burrito.
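
        One illustrative way to draw such a conservative boundary (a sketch under assumed toy features, not the page’s actual proposal) is to admit only points inside the tightest axis-aligned box around the positive examples:

# Illustrative sketch: a "conservative" concept boundary as the tightest
# axis-aligned bounding box around the positive examples. The feature
# vectors below are hypothetical stand-ins for real burrito features.

from typing import List, Tuple

def fit_conservative_boundary(examples: List[Tuple[float, ...]]):
    dims = range(len(examples[0]))
    lows = tuple(min(e[i] for e in examples) for i in dims)
    highs = tuple(max(e[i] for e in examples) for i in dims)
    def is_inside(x: Tuple[float, ...]) -> bool:
        return all(lo <= xi <= hi for xi, lo, hi in zip(x, lows, highs))
    return is_inside

# Each example: (mass_grams, tortilla_thickness_mm) -- made-up features.
burritos = [(250.0, 1.2), (300.0, 1.5), (280.0, 1.3)]
looks_like_a_burrito = fit_conservative_boundary(burritos)
print(looks_like_a_burrito((270.0, 1.4)))   # True: inside the box
print(looks_like_a_burrito((5000.0, 9.0)))  # False: novel, so rejected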

      • Querying the AGI user

        Postulating that an advanced agent will check something with its user probably comes with some standard issues and gotchas (e.g., prioritizing what to query, not manipulating the user, etc.).

      • Mild optimization

        An AGI which, if you ask it to paint one car pink, just paints one car pink and doesn’t tile the universe with pink-painted cars, because it’s not trying that hard to max out its car-painting score.

      • Task identification problem

        If you have a task-based AGI (Genie), then how do you pinpoint exactly what you want it to do (and not do)?

        • Look where I'm pointing, not at my finger

          When trying to communicate the concept “glove”, getting the AGI to focus on “gloves” rather than “my user’s decision to label something a glove” or “anything that depresses the glove-labeling button”.

      • Safe plan identification and verification

        On a particular task or problem, the issue of how to communicate to the AGI what you want it to do and all the things you don’t want it to do.

        • Do-What-I-Mean hierarchy

          Successive levels of “Do What I Mean”, i.e., AGIs that understand their users increasingly well.

      • Faithful simulation

        How would you identify, to a Task AGI (aka Genie), the problem of scanning a human brain, and then running a sufficiently accurate simulation of it for the simulation to not be crazy or psychotic?

      • Task (AI goal)

        When building the first AGIs, it may be wiser to assign them only goals that are bounded in space and time, and can be satisfied by bounded efforts.

      • Limited AGI

        Task-based AGIs don’t need unlimited cognitive and material powers to carry out their Tasks, which means their powers can potentially be limited.

      • Oracle

        System designed to safely answer questions.

        • Zermelo-Fraenkel provability oracle

          We might be able to build a system that can safely inform us that a theorem has a proof in set theory, but we can’t see how to use that capability to save the world.

      • Boxed AI

        Idea: what if we limit how the AI can interact with the world? That’ll make it safe, right?

        • Zermelo-Fraenkel provability oracle

          We might be able to build a system that can safely inform us that a theorem has a proof in set theory, but we can’t see how to use that capability to save the world.

    • Oracle

      System designed to safely answer questions.

      • Zermelo-Fraenkel provability oracle

        We might be able to build a system that can safely inform us that a theorem has a proof in set theory, but we can’t see how to use that capability to save the world.

  • Sufficiently optimized agents appear coherent

    If you could think as well as a superintelligence, you’d be at least that smart yourself.

  • Relevant powerful agents will be highly optimized
  • Strong cognitive uncontainability

    An advanced agent can win in ways humans can’t understand in advance.

  • Advanced safety

    An agent is really safe when it has the capacity to do anything, but chooses to do what the programmer wants.

    • Methodology of unbounded analysis

      What we do and don’t understand how to do, using unlimited computing power, is a critical distinction and important frontier.

      • AIXI

        How to build an (evil) superintelligent AI using unlimited computing power and one page of Python code (see the formula sketched below).

        • AIXI-tl

          A time-bounded version of the ideal agent AIXI that uses an impossibly large finite computer instead of a hypercomputer.
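
        For reference, the AIXI agent above is usually written, roughly in Hutter’s notation, as an expectimax over all programs q consistent with the interaction history so far, each weighted by 2 to the minus its length:

\[
a_k := \arg\max_{a_k} \sum_{o_k r_k} \cdots \max_{a_m} \sum_{o_m r_m}
\bigl[ r_k + \cdots + r_m \bigr]
\sum_{q \,:\, U(q,\, a_1 \ldots a_m) \,=\, o_1 r_1 \ldots o_m r_m} 2^{-\ell(q)}
\]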

      • Solomonoff induction

        A simple way to superintelligently predict sequences of data, given unlimited computing power.
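
        Concretely, the Solomonoff prior weights every program p for a universal prefix machine U by two to the minus its length, and prediction is conditioning on the observed prefix:

\[
M(x_{1:n}) \;=\; \sum_{p \,:\, U(p)\ \text{begins with}\ x_{1:n}} 2^{-\ell(p)},
\qquad
P(x_{n+1} \mid x_{1:n}) \;=\; \frac{M(x_{1:n} x_{n+1})}{M(x_{1:n})}
\]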

      • Hypercomputer

        Some formalisms demand computers larger than the limit of all finite computers.

      • Unphysically large finite computer

        The imaginary box required to run programs that require impossibly large, but finite, amounts of computing power.

      • Cartesian agent

        Agents separated from their environments by impermeable barriers through which only sensory information can enter and motor output can exit.

        • Cartesian agent-environment boundary

          If your agent is separated from the environment by an absolute border that can only be crossed by sensory information and motor outputs, it might just be a Cartesian agent.

      • Mechanical Turk (example)

        The 19th-century chess-playing automaton known as the Mechanical Turk actually had a human operator inside. People at the time had interesting thoughts about the possibility of mechanical chess.

      • No-Free-Lunch theorems are often irrelevant

        There’s often a theorem proving that some problem has no optimal answer across every possible world. But this may not matter, since the real world is a special case. (E.g., a low-entropy universe.)

    • AI safety mindset

      Asking how AI designs could go wrong, instead of imagining them going right.

      • Valley of Dangerous Complacency

        When the AGI works often enough that you let down your guard, but it still has bugs. Imagine a robotic car that almost always steers perfectly, but sometimes heads off a cliff.

      • Show me what you've broken

        To demonstrate competence at computer security, or AI alignment, think in terms of breaking proposals and finding technically demonstrable flaws in them.

      • Ad-hoc hack (alignment theory)

        A “hack” is when you alter the behavior of your AI in a way that defies, or doesn’t correspond to, a principled approach for that problem.

      • Don't try to solve the entire alignment problem

        New to AI alignment theory? Want to work in this area? Already been working in it for years? Don’t try to solve the entire alignment problem with your next good idea!

      • Flag the load-bearing premises

        If somebody says, “This AI safety plan is going to fail, because X” and you reply, “Oh, that’s fine because of Y and Z”, then you’d better clearly flag Y and Z as “load-bearing” parts of your plan.

      • Directing, vs. limiting, vs. opposing

        Getting the AI to compute the right action in a domain; versus getting the AI to not compute at all in an unsafe domain; versus trying to prevent the AI from acting successfully. (Prefer 1 & 2.)

    • Optimization daemons

      When you optimize something so hard that it crystallizes into an optimizer, like the way natural selection optimized apes so hard they turned into human-level intelligences.

    • Nearest unblocked strategy

      If you patch an agent’s preference framework to avoid an undesirable solution, what can you expect to happen?

    • Safe but useless

      Sometimes, at the end of locking down your AI so that it seems extremely safe, you’ll end up with an AI that can’t be used to do anything interesting.

    • Distinguish which advanced-agent properties lead to the foreseeable difficulty

      Say what kind of AI, or threshold level of intelligence, or key type of advancement, first produces the difficulty or challenge you’re talking about.

    • Goodness estimate biaser

      Some of the main problems in AI alignment can be seen as scenarios where actual goodness is likely to be systematically lower than a broken way of estimating goodness.

    • Goodhart's Curse

      The Optimizer’s Curse meets Goodhart’s Law. For example, if our values are V, and an AI’s utility function U is a proxy for V, optimizing for high U seeks out ‘errors’: that is, high values of U − V.
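
      A minimal simulation of the effect (the distributions here are arbitrary assumptions, chosen only for the demo): score many candidates by a noisy proxy U = V + error and pick the best-looking one; the winner’s true V is predictably lower than its U, because selection favors upward errors.

# Minimal illustration of Goodhart's Curse / the Optimizer's Curse:
# pick the candidate with the highest proxy score U = V + noise and
# compare its true value V to its proxy score. The distributions are
# arbitrary assumptions chosen for this demo.

import random

random.seed(0)
true_values = [random.gauss(0.0, 1.0) for _ in range(10_000)]    # V
proxy_scores = [v + random.gauss(0.0, 1.0) for v in true_values]  # U = V + error

best = max(range(len(true_values)), key=lambda i: proxy_scores[i])
print(f"selected U = {proxy_scores[best]:.2f}")  # looks great under the proxy
print(f"selected V = {true_values[best]:.2f}")   # systematically lower than U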

    • Context disaster

      Some possible designs cause your AI to behave nicely while developing, and behave a lot less nicely when it’s smarter.

    • Methodology of foreseeable difficulties

      Building a nice AI is likely to be hard enough, and contain enough gotchas that won’t show up in the AI’s early days, that we need to foresee problems coming in advance.

    • Actual effectiveness

      If you want the AI’s so-called ‘utility function’ to actually be steering the AI, you need to think about how it meshes with the AI’s beliefs, and what gets output to actions.

  • Relevant powerful agent

    An agent is relevant if it completely changes the course of history.

  • Informed oversight

    Incentivize a reinforcement learner that’s less smart than you to accomplish some task.

  • Safe training procedures for human-imitators

    How does one train a reinforcement learner to act like a human?

  • Reliable prediction

    How can we train predictors that reliably predict observable phenomena such as human behavior?

  • Selective similarity metrics for imitation

    Can we make human-imitators more efficient by scoring them more heavily on imitating the aspects of human behavior we care about more?

  • Relevant limited AI

    Can we have a limited AI that’s nonetheless relevant?

  • Value achievement dilemma

    How can Earth-originating intelligent life achieve most of its potential value, whether by AI or otherwise?

    • Moral hazards in AGI development

      “Moral hazard” is when owners of an advanced AGI give in to the temptation to do things with it that the rest of us would regard as ‘bad’, like, say, declaring themselves God-Emperor.

    • Coordinative AI development hypothetical

      What would safe AI development look like if we didn’t have to worry about anything else?

    • Pivotal event

      Which types of AIs, if they work, can do things that drastically change the nature of the further game?

    • Cosmic endowment

      The ‘cosmic endowment’ consists of all the stars that could be reached from probes originating on Earth; the sum of all matter and energy potentially available to be transformed into life and fun.

    • Aligning an AGI adds significant development time

      Aligning an advanced AI foreseeably involves extra code and extra testing and not being able to do everything the fastest way, so it takes longer.

  • Nick Bostrom's book Superintelligence

    The current best book-form introduction to AI alignment theory.

  • List: value-alignment subjects

    Bullet-point list of core VAT subjects.

  • AI arms races

    AI arms races are bad.

  • Corrigibility

    “I can’t let you do that, Dave.”

    • Programmer deception
      • Cognitive steganography

        Disaligned AIs that are modeling human psychology and trying to deceive their programmers will want to hide their internal thought processes from their programmers.

    • Utility indifference

      How can we make an AI indifferent to whether we press a button that changes its goals?

    • Averting instrumental pressures

      Almost any utility function for an AI, whether the target is diamonds or paperclips or eudaimonia, implies subgoals like rapidly self-improving and refusing to shut down. Can we make that not happen?

    • Averting the convergent instrumental strategy of self-improvement

      We probably want the first AGI to not improve as fast as possible, but improving as fast as possible is a convergent strategy for accomplishing most things.

    • Shutdown problem

      How to build an AGI that lets you shut it down, despite the obvious fact that this will interfere with whatever the AGI’s goals are.

      • You can't get the coffee if you're dead

        An AI given the goal of ‘get the coffee’ can’t achieve that goal if it has been turned off; so even an AI whose goal is just to fetch the coffee may try to avert a shutdown button being pressed.
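
        The point is two lines of expected-value arithmetic; the probabilities and payoffs below are made-up toy numbers:

# Toy expected-utility comparison for a "fetch the coffee" agent.
# Probabilities and payoffs are made-up numbers for illustration.

P_SHUTDOWN_IF_ALLOWED = 0.10   # chance the button gets pressed if left alone
COFFEE_IF_RUNNING = 1.0        # coffee delivered if the agent keeps running
COFFEE_IF_SHUT_DOWN = 0.0      # no coffee if the agent is switched off

ev_allow_button = ((1 - P_SHUTDOWN_IF_ALLOWED) * COFFEE_IF_RUNNING
                   + P_SHUTDOWN_IF_ALLOWED * COFFEE_IF_SHUT_DOWN)
ev_disable_button = COFFEE_IF_RUNNING

# 0.9 < 1.0, so the naive maximizer prefers disabling its own shutdown button.
print(ev_allow_button, ev_disable_button)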

    • User manipulation

      If not otherwise averted, many of an AGI’s desired outcomes are likely to interact with users and hence imply an incentive to manipulate users.

      • User maximization

        A sub-principle of avoiding user manipulation: if you see an argmax over X or an ‘optimize X’ instruction, and X includes a user interaction, you’ve just told the AI to optimize the user.

    • Hard problem of corrigibility

      Can you build an agent that reasons as if it knows itself to be incomplete and sympathizes with your wanting to rebuild or correct it?

    • Problem of fully updated deference

      Why moral uncertainty doesn’t stop an AI from defending its off-switch.

    • Interruptibility

      A subproblem of corrigibility under the machine learning paradigm: when the agent is interrupted, it must not learn to prevent future interruptions.

  • Unforeseen maximum

    When you tell the AI to produce world peace and it kills everyone. (Okay, some SF writers saw that one coming.)

    • Missing the weird alternative

      People might systematically overlook “make tiny molecular smileyfaces” as a way of “producing smiles”, because our brains automatically search for high-utility-to-us ways of “producing smiles”.

  • Patch resistance

    One does not simply solve the value alignment problem.

    • Unforeseen maximum

      When you tell the AI to produce world peace and it kills everyone. (Okay, some SF writers saw that one coming.)

      • Missing the weird alternative

        People might systematically overlook “make tiny molecular smileyfaces” as a way of “producing smiles”, because our brains automatically search for high-utility-to-us ways of “producing smiles”.

  • Coordinative AI development hypothetical

    What would safe AI development look like if we didn’t have to worry about anything else?

  • Safe impact measure

    What can we measure to make sure an agent is acting in a safe manner?

  • AI alignment open problem

    Tag for open problems under AI alignment.

  • Natural language understanding of "right" will yield normativity

    What will happen if you tell an advanced agent to do the “right” thing?

  • Identifying ambiguous inductions

    What do a “red strawberry”, a “red apple”, and a “red cherry” have in common that a “yellow carrot” doesn’t? Are they “red fruits” or “red objects”?

  • Value

    The word ‘value’ in the phrase ‘value alignment’ is a metasyntactic variable that indicates the speaker’s future goals for intelligent life.

    • Extrapolated volition (normative moral theory)

      If someone asks you for orange juice, and you know that the refrigerator contains no orange juice, should you bring them lemonade?

      • Rescuing the utility function

        If your utility function values ‘heat’, and then you discover to your horror that there’s no ontologically basic heat, switch to valuing disordered kinetic energy. Likewise ‘free will’ or ‘people’.

    • Coherent extrapolated volition (alignment target)

      A proposed direction for an extremely well-aligned autonomous superintelligence: do what humans would want, if we knew what the AI knew, thought that fast, and understood ourselves.

    • 'Beneficial'

      Really actually good. A metasyntactic variable to mean “favoring whatever the speaker wants ideally to accomplish”, although different speakers have different morals and metaethics.

    • William Frankena's list of terminal values

      Life, consciousness, and activity; health and strength; pleasures and satisfactions of all or certain kinds; happiness, beatitude, contentment, etc.; truth; knowledge and true opinions…

    • 'Detrimental'

      The opposite of beneficial.

    • Immediate goods
    • Cosmopolitan value

      Intuitively: Value as seen from a broad, embracing standpoint that is aware of how other entities may not always be like us or easily understandable to us, yet still worthwhile.

  • Linguistic conventions in value alignment

    How and why to use precise language and words with special meaning when talking about value alignment.

    • Utility

      What is “utility” in the context of Value Alignment Theory?

  • Development phase unpredictable
    • Unforeseen maximum

      When you tell the AI to produce world peace and it kills everyone. (Okay, some SF writers saw that one coming.)

      • Missing the weird alternative

        People might systematically overlook “make tiny molecular smileyfaces” as a way of “producing smiles”, because our brains automatically search for high-utility-to-us ways of “producing smiles”.

  • Complexity of value

    There’s no simple way to describe the goals we want Artificial Intelligences to want.

  • Value alignment problem

    You want to build an advanced AI with the right values… but how?

    • Total alignment

      We say that an advanced AI is “totally aligned” when it knows exactly which outcomes and plans are beneficial, with no further user input.

    • Preference framework

      What’s the thing an agent uses to compare its preferences?

      • Moral uncertainty

        A meta-utility function in which the utility function, as usually considered, takes on different values in different possible worlds, potentially distinguishable by evidence.

        • Ideal target

          The ‘ideal target’ of a meta-utility function is the value the ground-level utility function would take on if the agent updated on all possible evidence; the ‘true’ utilities under moral uncertainty.

      • Meta-utility function

        Preference frameworks built out of simple utility functions, but where, e.g., the ‘correct’ utility function for a possible world depends on whether a button is pressed.
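
        As a toy sketch (hypothetical names and numbers, not the page’s formalism), such a meta-utility function is just an ordinary function of the possible world plus a switch variable, and the agent maximizes its expectation over uncertainty about the switch:

# Toy meta-utility function: which ground-level utility function is
# "correct" depends on whether a button has been pressed. The names
# and numbers below are hypothetical illustrations.

def u_normal(world: dict) -> float:
    return float(world["paperclips"])

def u_shutdown(world: dict) -> float:
    return 1.0 if world["agent_off"] else 0.0

def meta_utility(world: dict) -> float:
    # The button state is part of the possible world being evaluated.
    return u_shutdown(world) if world["button_pressed"] else u_normal(world)

# Expected meta-utility under uncertainty about the button.
worlds = [
    (0.8, {"paperclips": 5, "agent_off": False, "button_pressed": False}),
    (0.2, {"paperclips": 0, "agent_off": True,  "button_pressed": True}),
]
print(sum(p * meta_utility(w) for p, w in worlds))  # 0.8*5 + 0.2*1 = 4.2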

      • Attainable optimum

        The ‘attainable optimum’ of an agent’s preferences is the best that agent can actually do given its finite intelligence and resources (as opposed to the global maximum of those preferences).

  • Object-level vs. indirect goals

    Difference between “give Alice the apple” and “give Alice what she wants”.

  • Value identification problem
    • Happiness maximizer
    • Edge instantiation

      When you ask the AI to make people happy, and it tiles the universe with the smallest objects that can be happy.

    • Identifying causal goal concepts from sensory data

      If the intended goal is “cure cancer” and you show the AI healthy patients, it sees, say, a pattern of pixels on a webcam. How do you get to a goal concept about the real patients?

    • Goal-concept identification

      Figuring out how to say “strawberry” to an AI that you want to bring you strawberries (and not fake plastic strawberries, either).

    • Ontology identification problem

      How do we link an agent’s utility function to its model of the world, when we don’t know what that model will look like?

    • Environmental goals

      The problem of having an AI want outcomes that are out in the world, not just want direct sense events.

  • Intended goal
  • Mindcrime

    Might a machine intelligence contain vast numbers of unhappy conscious subprocesses?

  • Task-directed AGI

    An advanced AI that’s meant to pursue a series of limited-scope goals given it by the user. In Bostrom’s terminology, a Genie.

    • Behaviorist genie

      An advanced agent that’s forbidden to model minds in too much detail.

    • Epistemic exclusion

      How would you build an AI that, no matter what else it learned about the world, never knew or wanted to know what was inside your basement?

    • Open subproblems in aligning a Task-based AGI

      Open research problems, especially ones we can model today, in building an AGI that can “paint all cars pink” without turning its future light cone into pink-painted cars.

    • Low impact

      The open problem of having an AI carry out tasks in ways that cause minimum side effects and change as little of the rest of the universe as possible.

      • Shutdown utility function

        A special case of a low-impact utility function where you just want the AGI to switch itself off harmlessly (and not create subagents to make absolutely sure it stays off, etcetera).

      • Abortable plans

        Plans that can be undone, or switched to having low further impact. If the AI builds abortable nanomachines, they’ll have a quiet self-destruct option that includes any replicated nanomachines.

    • Conservative concept boundary

      Given N example burritos, draw a boundary around what is a ‘burrito’ that is relatively simple and allows as few positive instances as possible. Helps make sure the next thing generated is a burrito.

    • Querying the AGI user

      Postulating that an advanced agent will check something with its user probably comes with some standard issues and gotchas (e.g., prioritizing what to query, not manipulating the user, etc.).

    • Mild optimization

      An AGI which, if you ask it to paint one car pink, just paints one car pink and doesn’t tile the universe with pink-painted cars, because it’s not trying that hard to max out its car-painting score.

    • Task identification problem

      If you have a task-based AGI (Genie), then how do you pinpoint exactly what you want it to do (and not do)?

      • Look where I'm pointing, not at my finger

        When trying to communicate the concept “glove”, getting the AGI to focus on “gloves” rather than “my user’s decision to label something a glove” or “anything that depresses the glove-labeling button”.

    • Safe plan identification and verification

      On a particular task or problem, the issue of how to communicate to the AGI what you want it to do and all the things you don’t want it to do.

      • Do-What-I-Mean hierarchy

        Successive levels of “Do What I Mean”, i.e., AGIs that understand their users increasingly well.

    • Faithful simulation

      How would you identify, to a Task AGI (aka Genie), the problem of scanning a human brain, and then running a sufficiently accurate simulation of it for the simulation to not be crazy or psychotic?

    • Task (AI goal)

      When building the first AGIs, it may be wiser to assign them only goals that are bounded in space and time, and can be satisfied by bounded efforts.

    • Limited AGI

      Task-based AGIs don’t need unlimited cognitive and material powers to carry out their Tasks, which means their powers can potentially be limited.

    • Oracle

      System designed to safely answer questions.

      • Zermelo-Fraenkel provability oracle

        We might be able to build a system that can safely inform us that a theorem has a proof in set theory, but we can’t see how to use that capability to save the world.

    • Boxed AI

      Idea: what if we limit how the AI can interact with the world? That’ll make it safe, right?

      • Zermelo-Fraenkel provability oracle

        We might be able to build a system that can safely inform us that a theorem has a proof in set theory, but we can’t see how to use that capability to save the world.

  • Principles in AI alignment

    A ‘principle’ of AI alignment is a very general design goal, like ‘understand what the heck is going on inside the AI’, that has informed a wide set of specific design proposals.

    • Non-adversarial principle

      At no point in constructing an Artificial General Intelligence should we construct a computation that tries to hurt us, and then try to stop it from hurting us.

      • Omnipotence test for AI safety

        Would your AI produce disastrous outcomes if it suddenly gained omnipotence and omniscience? If so, why did you program something that wants to hurt you and is held back only by lacking the power?

      • Niceness is the first line of defense

        The first line of defense in dealing with any partially superhuman AI system advanced enough to possibly be dangerous is that it does not want to hurt you or defeat your safety measures.

      • Directing, vs. limiting, vs. opposing

        Getting the AI to compute the right action in a domain; versus getting the AI to not compute at all in an unsafe domain; versus trying to prevent the AI from acting successfully. (Prefer 1 & 2.)

      • The AI must tolerate your safety measures

        A corollary of the nonadversarial principle is that “The AI must tolerate your safety measures.”

      • Generalized principle of cognitive alignment

        When we’re asking how we want the AI to think about an alignment problem, one source of inspiration is trying to have the AI mirror our own thoughts about that problem.

    • Minimality principle

      The first AGI ever built should save the world in a way that requires the least amount of the least dangerous cognition.

    • Understandability principle

      The more you understand what the heck is going on inside your AI, the safer you are.

      • Effability principle

        You are safer the more you understand the inner structure of how your AI thinks, and the better you can describe the relations between the smaller pieces of the AI’s thought process.

    • Separation from hyperexistential risk

      The AI should be widely separated in the design space from any AI that would constitute a “hyperexistential risk” (anything worse than death).

  • Theory of (advanced) agents

    One of the research subproblems of building powerful nice AIs is the theory of (sufficiently advanced) minds in general.

    • Instrumental convergence

      Some strategies can help achieve most possible simple goals, e.g., acquiring more computing power or more material resources. By default, unless averted, we can expect advanced AIs to do that.

      • Paperclip maximizer

        This agent will not stop until the entire universe is filled with paperclips.

        • Paperclip

          A configuration of matter that we’d see as being worthless even from a very cosmopolitan perspective.

        • Random utility function

          A ‘random’ utility function is one chosen at random according to some simple probability measure (e.g., weighted by Kolmogorov complexity) on a logical space of formal utility functions.

      • Instrumental

        What is “instrumental” in the context of Value Alignment Theory?

      • Instrumental pressure

        A consequentialist agent will want to bring about certain instrumental events that will help to fulfill its goals.

      • Convergent instrumental strategies

        Paperclip maximizers can make more paperclips by improving their cognitive abilities or controlling more resources. What other strategies would almost any AI try to use?

        • Convergent strategies of self-modification

          The strategies we’d expect to be employed by an AI that understands the relevance of its code and hardware to achieving its goals, which therefore has subgoals about its code and hardware.

        • Consequentialist preferences are reflectively stable by default

          Gandhi wouldn’t take a pill that made him want to kill people, because he knows in that case more people will be murdered. A paperclip maximizer doesn’t want to stop maximizing paperclips.

      • You can't get more paperclips that way

        Most arguments that “A paperclip maximizer could get more paperclips by (doing nice things)” are flawed.

    • Orthogonality Thesis

      Will smart AIs automatically become benevolent, or automatically become hostile? Or do different AI designs imply different goals?

      • Paperclip maximizer

        This agent will not stop until the entire universe is filled with paperclips.

        • Paperclip

          A configuration of matter that we’d see as being worthless even from a very cosmopolitan perspective.

        • Random utility function

          A ‘random’ utility function is one chosen at random according to some simple probability measure (e.g., weighted by Kolmogorov complexity) on a logical space of formal utility functions.

      • Mind design space is wide

        Imagine all human beings as one tiny dot inside a much vaster sphere of possibilities for “the space of minds in general.” It is wiser to make claims about some minds than all minds.

      • Instrumental goals are almost-equally as tractable as terminal goals

        Getting the milk from the refrigerator because you want to drink it is not vastly harder than getting the milk from the refrigerator because you inherently desire it.

    • Advanced agent properties

      How smart does a machine intelligence need to be for its niceness to become an issue? “Advanced” is a broad term to cover cognitive abilities such that we’d need to start considering AI alignment.

      • Big-picture strategic awareness

        We start encountering new AI alignment issues at the point where a machine intelligence recognizes the existence of a real world, the existence of programmers, and how these relate to its goals.

      • Superintelligent

        A “superintelligence” is strongly superhuman (strictly higher-performing than any and all humans) on every cognitive problem.

      • Intelligence explosion

        What happens if a self-improving AI gets to the point where each amount x of self-improvement triggers >x further self-improvement, and it stays that way for a while.

      • Artificial General Intelligence

        An AI which has the same kind of “significantly more general” intelligence that humans have compared to chimpanzees; it can learn new domains, like we can.

      • Advanced nonagent

        Hypothetically, cognitively powerful programs that don’t follow the loop of “observe, learn, model the consequences, act, observe results” that a standard “agent” would.

      • Epistemic and instrumental efficiency

        An efficient agent never makes a mistake you can predict. You can never successfully predict a directional bias in its estimates.

        • Time-machine metaphor for efficient agents

          Don’t imagine a paperclip maximizer as a mind. Imagine it as a time machine that always spits out the output leading to the greatest number of future paperclips.

      • Standard agent properties

        What’s a Standard Agent, and what can it do?

        • Bounded agent

          An agent that operates in the real world, using realistic amounts of computing power, that is uncertain of its environment, etcetera.

      • Real-world domain

        Some AIs play chess, some AIs play Go, some AIs drive cars. These different ‘domains’ present different options. All of reality, in all its messy entanglement, is the ‘real-world domain’.

      • Sufficiently advanced Artificial Intelligence

        ‘Sufficiently advanced Artificial Intelligences’ are AIs with enough ‘advanced agent properties’ that we start needing to do ‘AI alignment’ to them.

      • Infrahuman, par-human, superhuman, efficient, optimal

        A categorization of AI ability levels relative to human, with some gotchas in the ordering. E.g., in simple domains where humans can play optimally, optimal play is not superhuman.

      • General intelligence

        Compared to chimpanzees, humans seem to be able to learn a much wider variety of domains. We have ‘significantly more generally applicable’ cognitive abilities, aka ‘more general intelligence’.

      • Corporations vs. superintelligences

        Corporations have relatively few of the advanced-agent properties that would allow one mistake in aligning a corporation to immediately kill all humans and turn the future light cone into paperclips.

      • Cognitive uncontainability

        ‘Cognitive uncontainability’ is when we can’t hold all of an agent’s possibilities inside our own minds.

      • Vingean uncertainty

        You can’t predict the exact actions of an agent smarter than you, so is there anything you can say about them?

        • Vinge's Law

          You can’t predict exactly what someone smarter than you would do, because if you could, you’d be that smart yourself.

        • Deep Blue

          The chess-playing program, built by IBM, that defeated world chess champion Garry Kasparov in their 1997 match.

      • Consequentialist cognition

        The cognitive ability to foresee the consequences of actions, prefer some outcomes to others, and output actions leading to the preferred outcomes.

  • Difficulty of AI alignment

    How hard is it exactly to point an Artificial General Intelligence in an intuitively okay direction?

  • Glossary (Value Alignment Theory)

    Words that have a special meaning in the context of creating nice AIs.

    • Friendly AI

      Old terminology for an AI whose preferences have been successfully aligned with idealized human values.

    • Cognitive domain

      An allegedly compact unit of knowledge, such that ideas inside the unit interact mainly with each other and less with ideas in other domains.

    • 'Concept'

      In the context of Artificial Intelligence, a ‘concept’ is a category, something that identifies thingies as being inside or outside the concept.

  • Programmer

    Who is building these advanced agents?