Linguistic conventions in value alignment

A central page to list the language conventions in value alignment theory. See also Glossary (Value Alignment Theory).

Language dealing with wants, desires, utility, preference, and value.

We need a language rich enough to distinguish at least the following as different intensional concepts, even if their extensions end up being identical:

  • A. What the programmers explicitly, verbally said they wanted to achieve by building the AI.

  • B. What the programmers wordlessly, intuitively meant; the actual criterion they would use for rating the desirability of outcomes, if they could actually look at those outcomes and assign ratings.

  • C. What programmers should want from the AI (from within some view on normativity, shouldness, or rightness).

  • D. The AI’s explicitly represented cognitive preferences, if any.

  • E. The property that running the AI tends to produce in the world; the property that the AI behaves in such fashion as to bring about.

So far, the following reserved terms have been advocated for the subject of value alignment:

  • Value and valuable to refer to C. On views which identify C with B, it thereby refers to B.

  • Optimization target to mean only E. We can also say, e.g., that natural selection has an ‘optimization target’ of inclusive genetic fitness. ‘Optimization target’ is meant to be an exceedingly general term that can talk about irrational agents and nonagents.

  • Utility to mean a Von Neumann-Morgenstern utility function, reserved to talk about agents that behave like some bounded analogue of expected utility optimizers. Utility is explicitly not assumed to be normative. E.g., if speaking of a paperclip maximizer, we will say that an outcome has higher utility iff it contains more paperclips. Thus ‘utility’ is reserved to refer to D or E.

  • Desire to mean anthropomorphic human-style desires, referring to A or B rather than C, D, or E. (‘Wants’ are general over humans and AIs.)

  • Preference and prefer to be general terms that can be used for both humans and AIs. ‘Preference’ refers to B or D rather than A, C, or E. It means ‘what the agent explicitly and cognitively wants’ rather than ‘what the agent should want’ or ‘what the agent mistakenly thinks it wants’ or ‘what the agent’s behavior tends to optimize’. Someone can be said to prefer their extrapolated volition to be implemented rather than their current desires, but if so they must explicitly, cognitively prefer that, or accept it in an explicit choice between options.

  • Preference framework to be an even more general term that can refer to e.g. meta-utility functions that change based on observations, or to meta-preferences about how one’s own preferences should be extrapolated. A ‘preference framework’ should refer to constructs more coherent than the human mass of desires and ad-hoc reflections, but not as strictly restricted as a VNM utility function. Stuart Armstrong’s utility indifference framework for value learning is an example of a preference framework that is not a vanilla/ordinary utility function.

  • Goal remains a generic, unreserved term that could refer to any of A-E, and also to particular things an agent wants to get done for instrumental reasons.

  • Intended goal to refer to B only.

  • Want remains a generic, unreserved term that could refer to humans or other agents, or to terminal or instrumental goals.
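
The ‘utility’ convention above can be made concrete with a minimal sketch (names and data structures are illustrative, not from the source): a VNM-style utility is just a function from outcomes to real numbers, and for a paperclip maximizer that function counts paperclips, with no claim that the ranking it induces is normative.

```python
# Illustrative sketch: 'utility' as a function from outcomes to reals.
# For a paperclip maximizer, more paperclips means higher utility;
# nothing here asserts that this utility is what anyone *should* want.

def paperclip_utility(outcome):
    """Utility of an outcome, measured purely as its paperclip count."""
    return outcome["paperclips"]

def expected_utility(lottery, utility):
    """Expected utility of a probability distribution over outcomes,
    given as (outcome, probability) pairs."""
    return sum(p * utility(o) for o, p in lottery)

world_a = {"paperclips": 10, "happy_humans": 1_000_000}
world_b = {"paperclips": 1_000, "happy_humans": 0}

# The maximizer assigns higher utility to B iff B has more paperclips,
# regardless of any other feature of the outcome:
assert paperclip_utility(world_b) > paperclip_utility(world_a)
```

The point of the convention is visible in the last line: ‘higher utility’ is a descriptive fact about the agent’s ranking (D or E), not a judgment that world_b is more valuable in sense C.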
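
The ‘preference framework’ convention can likewise be sketched: a construct more structured than a raw mass of desires, but not a single fixed VNM utility function. The toy example below is an assumption of this rewrite (it is not Armstrong’s actual utility indifference construction): a meta-utility function whose operative utility function depends on what has been observed.

```python
# Hedged sketch of a 'preference framework': a meta-utility function
# that changes which utility function is operative based on observations.
# All names here are hypothetical illustrations, not from the source.

def count_paperclips(outcome):
    return outcome.get("paperclips", 0)

def count_staples(outcome):
    return outcome.get("staples", 0)

def meta_utility(observation):
    """Return the operative utility function given an observation.

    A vanilla/ordinary utility function would ignore the observation;
    this framework's effective preferences depend on it."""
    if observation == "switch pressed":
        return count_staples
    return count_paperclips

# Before the switch is pressed, outcomes are ranked by paperclips;
# after, by staples -- a single fixed utility function cannot say this.
u = meta_utility("no switch event")
assert u({"paperclips": 3, "staples": 7}) == 3
```

This is the sense in which a preference framework is ‘more general’: it specifies how preferences respond to observations or reflection, rather than one static ranking of outcomes.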

‘Terminal’ and ‘instrumental’ have their standard contrasting meanings.


  • Utility

    What is “utility” in the context of Value Alignment Theory?


  • AI alignment

    The great civilizational problem of creating artificially intelligent computer systems such that running them is a good idea.