Linguistic conventions in value alignment

This is the central page listing the language conventions used in value alignment theory. See also Glossary (Value Alignment Theory).

Language dealing with wants, desires, utility, preference, and value.

We need a language rich enough to distinguish at least the following as different intensional concepts, even if their extensions end up being identical:

  • A. What the programmers explicitly, verbally said they wanted to achieve by building the AI.

  • B. What the programmers wordlessly, intuitively meant; the actual criterion they would use for rating the desirability of outcomes, if they could actually look at those outcomes and assign ratings.

  • C. What programmers should want from the AI (from within some view on normativity, shouldness, or rightness).

  • D. The AI’s explicitly represented cognitive preferences, if any.

  • E. The property that running the AI tends to produce in the world; that is, the property the AI behaves so as to bring about.

So far, the following reserved terms have been advocated for the subject of value alignment:

  • Value and valuable to refer to C. On views which identify C with B, these terms thereby also refer to B.

  • Optimization target to mean only E. We can also say, e.g., that natural selection has an ‘optimization target’ of inclusive genetic fitness. ‘Optimization target’ is meant to be an exceedingly general term that can talk about irrational agents and nonagents.

  • Utility to mean a von Neumann-Morgenstern (VNM) utility function, reserved to talk about agents that behave like some bounded analogue of expected utility optimizers. Utility is explicitly not assumed to be normative. E.g., if speaking of a paperclip maximizer, we will say that an outcome has higher utility iff it contains more paperclips. Thus ‘utility’ is reserved to refer to D or E.

  • Desire to mean anthropomorphic human-style desires, referring to A or B rather than C, D, or E. (‘Want’, by contrast, is general over humans and AIs.)

  • Preference and prefer to be general terms that can be used for both humans and AIs. ‘Preference’ refers to B or D rather than A, C, or E. It means ‘what the agent explicitly and cognitively wants’ rather than ‘what the agent should want’ or ‘what the agent mistakenly thinks it wants’ or ‘what the agent’s behavior tends to optimize’. Someone can be said to prefer their extrapolated volition to be implemented rather than their current desires, but if so they must explicitly, cognitively prefer that, or accept it in an explicit choice between options.

  • Preference framework to be an even more general term that can refer to e.g. meta-utility functions that change based on observations, or to meta-preferences about how one’s own preferences should be extrapolated. A ‘preference framework’ should refer to constructs more coherent than the human mass of desires and ad-hoc reflections, but not as strictly restricted as a VNM utility function. Stuart Armstrong’s utility indifference framework for value learning is an example of a preference framework that is not a vanilla/ordinary utility function.

  • Goal remains a generic, unreserved term that could refer to any of A-E, as well as to particular things an agent wants to get done for instrumental reasons.

  • Intended goal to refer to B only.

  • Want remains a generic, unreserved term that can apply to humans or other agents, and to either terminal or instrumental goals.

‘Terminal’ and ‘instrumental’ have their standard contrasting meanings.
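
As a reading aid, the reserved terms above can be laid out as a lookup from each term to the concepts A-E it is allowed to denote. The following is only a minimal sketch: the names Concept, RESERVED, UNRESERVED, referents, and terms_for are hypothetical and introduced here for illustration. It encodes just the default readings given in the list, so ‘value’ maps to C only (views that identify C with B are not modeled), ‘preference framework’ is omitted because the list does not tie it to a letter, and treating ‘want’ as able to denote any of A-E is a simplifying assumption.

```python
from enum import Enum


class Concept(Enum):
    """The five intensional concepts distinguished on this page."""
    A = "what the programmers explicitly said they wanted"
    B = "what the programmers intuitively meant"
    C = "what the programmers should want"
    D = "the AI's explicitly represented preferences"
    E = "the property that running the AI tends to produce"


# Enum classes iterate in definition order, so this unpacks A through E.
A, B, C, D, E = Concept

# Reserved terms and the concepts each may refer to (default readings only).
RESERVED = {
    "value": {C},
    "optimization target": {E},
    "utility": {D, E},
    "desire": {A, B},
    "preference": {B, D},
    "intended goal": {B},
}

# Generic terms. Assumption: modeled as able to denote any of A-E.
UNRESERVED = {"goal", "want"}


def referents(term):
    """Return the set of concepts a term may denote under these conventions."""
    key = term.lower()
    if key in RESERVED:
        return set(RESERVED[key])
    if key in UNRESERVED:
        return set(Concept)
    raise KeyError(f"no convention recorded for {term!r}")


def terms_for(concept):
    """Return the reserved terms that may denote the given concept, sorted."""
    return sorted(t for t, cs in RESERVED.items() if concept in cs)
```

For example, referents("utility") yields the set containing D and E, and terms_for(Concept.B) lists ‘desire’, ‘intended goal’, and ‘preference’, which makes it easy to see where two reserved terms can overlap in reference while still differing in meaning.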

