AI alignment

The “alignment problem for advanced agents” or “AI alignment” is the overarching research topic of how to develop sufficiently advanced machine intelligences such that running them produces good outcomes in the real world.

Both ‘advanced agent’ and ‘good’ should be understood as metasyntactic placeholders for complicated ideas still under debate. The term ‘alignment’ is intended to convey the idea of pointing an AI in a direction, just as a rocket, once built, has to be pointed in a particular direction.

“AI alignment theory” is meant as an overarching term to cover the whole research field associated with this problem, including, e.g., the much-debated attempt to estimate how rapidly an AI might gain in capability once it goes over various particular thresholds.

Other terms that have been used to describe this research problem include “robust and beneficial AI” and “Friendly AI”. The term “value alignment problem” was coined by Stuart Russell to refer to the primary subproblem of aligning AI preferences with (potentially idealized) human preferences.

Some alternative terms for this general field of study, such as ‘control problem’, can sound adversarial, as though the rocket were already pointed in a bad direction and you needed to wrestle with it. Other terms, like ‘AI safety’, understate the advocated degree to which alignment ought to be an intrinsic part of building advanced agents. E.g., there isn’t a separate theory of “bridge safety” for how to build bridges that don’t fall down. Pointing the agent in a particular direction ought to be seen as part of the standard problem of building an advanced machine agent. The problem does not divide into “building an advanced AI” and then separately “somehow causing that AI to produce good outcomes”; rather, the problem is “getting good outcomes via building a cognitive agent that brings about those good outcomes”.

A good introductory article or survey paper for this field does not presently exist. If you have no idea what this problem is about, consider reading Nick Bostrom’s popular book Superintelligence.

You can explore this Arbital domain by following this link. See also the List of Value Alignment Topics on Arbital, although this list is not up-to-date.



Supporting knowledge

If you’re willing to spend time on learning this field and are not previously familiar with the basics of decision theory and probability theory, it’s worth reading the Arbital introductions to those first. In particular, it may be useful to become familiar with the notion of priors and belief revision, and with the coherence arguments for expected utility.
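As a concrete toy illustration of those two notions (the hypotheses, numbers, and utilities below are invented for illustration, not taken from this article), a Bayesian agent revises a prior over hypotheses in light of evidence, then chooses the action with the highest expected utility under its posterior belief:

```python
# Toy sketch of belief revision (Bayes' rule) and expected utility choice.
# All hypotheses, likelihoods, and payoffs here are made-up examples.

# Prior over two hypotheses about a coin.
prior = {"fair": 0.5, "biased": 0.5}

# Likelihood of observing "heads" under each hypothesis.
likelihood_heads = {"fair": 0.5, "biased": 0.8}

def update(prior, likelihood):
    """Bayes' rule: posterior is proportional to prior times likelihood."""
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    total = sum(unnormalized.values())
    return {h: p / total for h, p in unnormalized.items()}

# After seeing one "heads", belief shifts toward the biased hypothesis.
posterior = update(prior, likelihood_heads)

# Payoff of each available bet, conditional on which hypothesis is true.
utilities = {
    "bet_heads": {"fair": 0.0, "biased": 1.0},
    "bet_tails": {"fair": 0.5, "biased": -1.0},
}

def expected_utility(action, belief):
    """Probability-weighted average payoff of an action under a belief."""
    return sum(belief[h] * utilities[action][h] for h in belief)

# An expected-utility maximizer picks the action with the highest score.
best = max(utilities, key=lambda a: expected_utility(a, posterior))
```

The coherence arguments mentioned above say, roughly, that any agent not exploitable by sure-loss bets must act as if it were running some version of this prior-update-maximize loop.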

If you have time to read a textbook to gain general familiarity with AI, “Artificial Intelligence: A Modern Approach” is highly recommended.

Key terms

  • Advanced agent

  • Value / goodness / beneficial

  • Artificial General Intelligence

  • Superintelligence

Key factual propositions

  • Orthogonality thesis

  • Instrumental convergence thesis

  • Capability gain

  • Complexity and fragility of cosmopolitan value

  • Alignment difficulty

Advocated basic principles

  • Nonadversarial principle

  • Minimality principle

  • Intrinsic safety

  • Unreliable monitoring %note: Treat human monitoring as expensive, unreliable, and fragile.%

Advocated methodology

  • Generalized security mindset

  • Foreseeable difficulties

Advocated design principles

  • Value alignment

  • Corrigibility

  • Mild optimization / Taskishness

  • Conservatism

  • Transparency

  • Whitelisting

  • Redzoning

Current research areas

  • Cooperative inverse reinforcement learning

  • Interruptibility

  • Utility switching

  • Stable self-modification & Vingean reflection

  • Mirror models

  • Robust machine learning

Open problems

  • Environmental goals

  • Other-izer problem

  • Impact penalty

  • Shutdown utility function & abortability

  • Fully updated deference

  • Epistemic exclusion

Partially solved problems

  • Logical uncertainty for doubly-exponential reflective agents

  • Interruptibility for non-reflective non-consequentialist Q-learners

  • Strategy-determined Newcomblike problems

  • Leverage prior

Future work

  • Ontology identification problem

  • Behaviorism

  • Averting instrumental strategies

  • Reproducibility