In the context of value alignment as a subject, the word ‘value’ is a speaker-dependent variable that indicates our ultimate goal—the property or meta-property that the speaker wants or ‘should want’ to see in the final outcome of Earth-originating intelligent life. E.g., human flourishing, fun, coherent extrapolated volition, normativity.

Different viewpoints are still being debated on this topic; people sometimes change their minds about their views. We don’t yet have full knowledge of which views are ‘reasonable’ in the sense that people with good cognitive skills might retain them even in the limit of ongoing discussion. Some subtypes of potentially internally coherent views may not be sufficiently interpersonalizable for even very small AI projects to cooperate on them; if, e.g., Alice wants to own the whole world and will go on believing that in the limit of continuing contemplation, this is not a desideratum on which Alice, Bob, and Carol can all cooperate. Thus, using ‘value’ as a potentially speaker-dependent variable isn’t meant to imply that everyone has their own ‘value’ and that no further debate or cooperation is possible; people can and do talk each other out of positions which are then regarded as having been mistaken, and completely incommunicable stances seem unlikely to be reified even into a very small AI project. But since this debate is ongoing, there is not yet any one definition of ‘value’ that can be regarded as settled.

Nonetheless, very similar technical problems of value alignment arise on many of the views currently being advocated. We would need to figure out how to identify the objects of value to the AI, robustly ensure that the AI’s preferences remain stable as the AI self-modifies, and create corrigible ways of recovering from errors in the way we tried to identify and specify the objects of value.

To centralize the very similar discussions of these technical problems while the outer debate about reasonable end goals is ongoing, the word ‘value’ acts as a metasyntactic placeholder for different views about the target of value alignment.

Similarly, in the larger value achievement dilemma, the question of what the end goals should be, and the policy difficulties of getting ‘good’ goals to be adopted in name by the builders or creators of AI, are factored out as the value selection problem. The output of this process is taken to be an input into the value loading problem, and ‘value’ is a name referring to this output.

‘Value’ is not assumed to be what the AI is given as its utility function or preference framework. On many views implying that value is complex or otherwise difficult to convey to an AI, the AI may be, e.g., a Genie where some stress is taken off the proposition that the AI exactly understands value and put onto human ability to use the Genie well.

Consider a Genie with an explicit preference framework targeted on a Do What I Know I Mean system for making checked wishes. The word ‘value’ in any discussion thereof should still only be used to refer to whatever the AI creators are targeting for real-world outcomes. We would say the ‘value alignment problem’ had been successfully solved to the extent that running the Genie produced high-value outcomes as seen from the humans’ viewpoint on ‘value’, not to the extent that the outcome matched the Genie’s preference framework for how to follow orders.
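The distinction above—that alignment success is judged by the creators’ notion of ‘value’ applied to real-world outcomes, not by the AI’s own internal objective—can be sketched as a toy example. All names here are hypothetical illustrations, not real APIs:

```python
# Toy sketch: the success criterion for 'value alignment' is the humans'
# value function applied to the real-world outcome, NOT the agent's own
# internal objective. All names here are hypothetical.

def genie_internal_score(outcome: dict) -> float:
    """The Genie's preference framework: how well the order was followed."""
    return 1.0 if outcome["order_followed"] else 0.0

def human_value(outcome: dict) -> float:
    """What the creators actually care about in real-world outcomes."""
    return 1.0 if outcome["humans_flourish"] else 0.0

# An outcome can satisfy the Genie's internal framework while still
# failing on 'value' in the sense this article uses the word:
outcome = {"order_followed": True, "humans_flourish": False}

internal_success = genie_internal_score(outcome)   # the Genie is satisfied
alignment_success = human_value(outcome)           # but 'value' was not achieved
```

The point of the sketch is only that the two scoring functions are distinct objects, and ‘value’ in ‘value alignment’ always refers to the second one.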

Specific views on value

Obviously, a listing like this will only summarize long debates. But that summary at least lets us point to some examples of views that have been advocated, and not indefinitely defer the question of what ‘value’ could possibly refer to.

Again, keep in mind that by technical definition, ‘value’ is what we are using or should use to rate the ultimate real-world consequences of running the AI, not the explicit goals we are giving the AI.

Some of the major views that have been advocated by more than one person are as follows:

  • Reflective equilibrium. We can talk about ‘what I should want’ as a concept distinct from ‘what I want right now’ by construing some limit of how our present desires would directionally change given more factual knowledge, time to consider more knowledge, better self-awareness, and better self-control. Modeling this process is called extrapolation, a term reserved to mean this process in the context of discussing preferences. Value would consist in, e.g., whatever properties a supermajority of humans would agree, in the limit of reflective equilibrium, are desirable. See also coherent extrapolated volition.

  • Standard desires. An object-level view that identifies value with qualities that we currently find very desirable, enjoyable, fun, and preferable, such as Frankena’s list of desiderata (including truth, happiness, aesthetics, love, challenge and achievement, etc.). On the closely related view of Fun Theory, such desires may be further extrapolated, without changing their essential character, into forms suitable for transhuman minds. Advocates may agree that these object-level desires will be subject to unknown normative corrections by reflective-equilibrium-type considerations, but still believe that some form of Fun or standardly desirable outcome is a likely result. Therefore (on this view) it is reasonable to speak of value as probably mostly consisting in turning most of the reachable universe into superintelligent life enjoying itself, creating transhuman forms of art, etcetera.

  • Immediate goods. E.g., “Cure cancer” or “Don’t transform the world into paperclips.” Such replies arguably have problems as ultimate criteria of value from a human standpoint (see linked discussion), but for obvious reasons, lists of immediate goods are a common early thought when first considering the subject.

  • Deflationary moral error theory. There is no good way to construe a normative concept apart from what particular people want. AI programmers are just doing what they want, and confused talk of ‘fairness’ or ‘rightness’ cannot be rescued. The speaker would nonetheless personally prefer not to be turned into paperclips. (This mostly ends up at an ‘immediate goods’ theory in practice, plus some beliefs relevant to the value selection debate.)

  • Simple purpose. Value can easily be identified with X, for some X. X is the main thing we should be concerned about passing on to AIs. Seemingly desirable things besides X are either (a) improper to care about, (b) relatively unimportant, or (c) instrumentally implied by pursuing X, qua X.

The following versions of desiderata for AI outcomes would tend to imply that the value alignment / value loading problem is an entirely wrong way of looking at the issue, which might make it disingenuous to claim that ‘value’ in ‘value alignment’ can cover them as a metasyntactic variable as well:

  • Moral internalist value. The normative is inherently compelling to all, or almost all, cognitively powerful agents. Whatever is not thus compelling cannot be normative or a proper object of human desire.

  • AI rights. The primary thing is to ensure that the AI’s natural and intrinsic desires are respected. The ideal is to end up in a diverse civilization that respects the rights of all sentient beings, including AIs. (Generally linked are the views that no special selection of AI design is required to achieve this, or that special selection of AI design to shape particular motivations would itself violate AI rights.)

Modularity of ‘value’

Alignable values

Many issues in value alignment seem to generalize very well across the Reflective Equilibrium, Fun Theory, Intuitive Desiderata, and Deflationary Error Theory viewpoints. In all cases we would have to consider stability of self-modification, the Edge Instantiation problem in value identification, and most of the rest of ‘standard’ value alignment theory. This seemingly good generalization of the resulting technical problems across such wide-ranging viewpoints, and especially the fact that it (arguably) covers the case of intuitive desiderata, is what justifies treating ‘value’ as a metasyntactic variable in ‘value loading problem’.

A neutral term for referring to all the values in this class might be ‘alignable values’.

Simple purpose

In the simple purpose case, the key difference from an Immediate Goods scenario is that the desideratum is usually advocated to be simple enough to negate Complexity of Value and make value identification easy.

E.g., Juergen Schmidhuber stated at the 20XX Singularity Summit that he thought the only proper and normative goal of any agent was to increase compression of sensory information [find exact quote, exact Summit]. Conditioned on this being the sum of all normativity, ‘value’ is algorithmically simple. Then the problems of Edge Instantiation, Unforeseen Maximums, and Nearest Unblocked Neighbor are all moot. (Except perhaps as there is an Ontology Identification problem for defining exactly what constitutes ‘sensory information’ for a self-modifying agent.)

Even in the simple purpose case, the value loading problem would still exist (it would still be necessary to make an AI that cared about the simple purpose rather than paperclips), along with associated problems of reflective stability (it would be necessary to make an AI that went on caring about X through self-modification). Nonetheless, the overall problem difficulty and immediate technical priorities would be different enough that the Simple Purpose case seems importantly distinct from e.g. Fun Theory on a policy level.

Moral internalism

Some viewpoints on ‘value’ deliberately reject Orthogonality. Strong versions of the moral internalist position in metaethics claim as an empirical prediction that every sufficiently powerful cognitive agent will come to pursue the same end, which end is to be identified with normativity, and is the only proper object of human desire. If true, this would imply that the entire value alignment problem is moot for advanced agents.

Many people who advocate ‘simple purposes’ also claim these purposes are universally compelling. In a policy sense, this seems functionally similar to the Moral Internalist case regardless of the simplicity or complexity of the universally compelling value. Hence an alleged simple universally compelling purpose is categorized for present purposes as Moral Internalist rather than Simple Purpose.

The special case of a Simple Purpose claimed to be universally instrumentally convergent also seems functionally identical to Moral Internalism from a policy standpoint.

AI Rights

Someone might believe as a proposition of fact that all (accessible) AI designs would have ‘innate’ desires, believe as a proposition of fact that no AI would gain enough advantage to wipe out humanity or prevent the creation of other AIs, and assert as a matter of morality that a good outcome consists of everyone being free to pursue their own value and trade. In this case the value alignment problem is implied to be an entirely wrong way to look at the problem, with all associated technical issues moot. Thus, it again might be disingenuous to have ‘value’ as a metasyntactic variable try to cover this case.

  • Extrapolated volition (normative moral theory)

    If someone asks you for orange juice, and you know that the refrigerator contains no orange juice, should you bring them lemonade?

  • Coherent extrapolated volition (alignment target)

    A proposed direction for an extremely well-aligned autonomous superintelligence—do what humans would want, if we knew what the AI knew, thought that fast, and understood ourselves.

  • 'Beneficial'

    Really actually good. A metasyntactic variable to mean “favoring whatever the speaker wants ideally to accomplish”, although different speakers have different morals and metaethics.

  • William Frankena's list of terminal values

    Life, consciousness, and activity; health and strength; pleasures and satisfactions of all or certain kinds; happiness, beatitude, contentment, etc.; truth; knowledge and true opinions…

  • 'Detrimental'

    The opposite of beneficial.

  • Immediate goods
  • Cosmopolitan value

    Intuitively: Value as seen from a broad, embracing standpoint that is aware of how other entities may not always be like us or easily understandable to us, yet still worthwhile.

  • AI alignment

    The great civilizational problem of creating artificially intelligent computer systems such that running them is a good idea.