In the context of value alignment as a subject, the word ‘value’ is a speaker-dependent variable that indicates our ultimate goal: the property or meta-property that the speaker wants or ‘should want’ to see in the final outcome of Earth-originating intelligent life. E.g., coherent extrapolated volition.
Different viewpoints are still being debated on this topic. We don’t yet have full knowledge of which views are ‘reasonable’ in the sense that people with good cognitive skills might retain them even in the limit of ongoing discussion. Some subtypes of potentially internally coherent views may not be sufficiently shareable for even very small AI projects to cooperate on them; if e.g. Alice wants to own the whole world and will go on wanting that in the limit of continuing contemplation, this is not a desideratum on which Alice, Bob, and Carol can all cooperate. Thus, using ‘value’ as a potentially speaker-dependent variable isn’t meant to imply that everyone has their own ‘value’ and that no further debate or cooperation is possible; people can and do talk each other out of positions which are then regarded as having been mistaken, and completely incommunicable stances seem unlikely to be reified even into a very small AI project. But since this debate is ongoing, there is not yet any one definition of ‘value’ that can be regarded as settled.
Nonetheless, very similar technical problems of value alignment seem to arise on many of the views currently being advocated. We would need to figure out how to identify the objects of value to the AI, robustly assure that the AI’s preferences are stable as the AI self-modifies, or create corrigible ways of recovering from errors in the way we tried to identify and specify the objects of value.
To centralize the very similar discussions of these technical problems while the outer debate about reasonable end goals is ongoing, the word ‘value’ acts as a metasyntactic placeholder for different views about the target of value alignment.
Similarly, in the larger value achievement dilemma, the question of what the end goals should be, and the policy difficulties of getting ‘good’ goals to be adopted in name by the builders or creators of AI, are factored out as a separate problem. The output of this process is taken to be an input into the value loading problem, and ‘value’ is a name referring to this output.
‘Value’ is not assumed to be what the AI is given as its utility function or preference framework. On many views implying that value is complex or otherwise difficult to convey to an AI, the AI may be, e.g., a Genie where some stress is taken off the proposition that the AI exactly understands value and put onto human ability to use the Genie well.
Consider a Genie with an explicit preference framework targeted on how to follow orders. The word ‘value’ in any discussion thereof should still only be used to refer to whatever the AI creators are targeting for real-world outcomes. We would say the ‘value alignment problem’ had been successfully solved to the extent that running the Genie produced high-value outcomes in the sense of the humans’ viewpoint on ‘value’, not to the extent that the outcome matched the Genie’s preference framework for how to follow orders.
Specific views on value
Obviously, a listing like this will only summarize long debates. But that summary at least lets us point to some examples of views that have been advocated, and not indefinitely defer the question of what ‘value’ could possibly refer to.
Again, keep in mind that by technical definition, ‘value’ is what we are using or should use to rate the ultimate real-world consequences of running the AI, not the explicit goals we are giving the AI.
Some of the major views that have been advocated by more than one person are as follows:
Reflective equilibrium. We can talk about ‘what I should want’ as a concept distinct from ‘what I want right now’ by construing some limit of how our present desires would directionally change given more factual knowledge, time to consider more knowledge, better self-awareness, and better self-control. Modeling this process is called extrapolation, a term reserved for this process in the context of discussing preferences. Value would consist in, e.g., whatever properties a supermajority of humans would agree, in the limit of reflective equilibrium, are desirable. See also coherent extrapolated volition.
Standard desires. An object-level view that identifies value with qualities that we currently find very desirable, enjoyable, fun, and preferable, such as Frankena’s list of terminal values (including truth, happiness, aesthetics, love, challenge and achievement, etc.). On the closely related view of Fun Theory, such desires may be further extrapolated, without changing their essential character, into forms suitable for transhuman minds. Advocates may agree that these object-level desires will be subject to unknown normative corrections by reflective-equilibrium-type considerations, but still believe that some form of Fun or standardly desirable outcome is a likely result. Therefore (on this view) it is reasonable to speak of value as probably mostly consisting in turning most of the reachable universe into superintelligent life enjoying itself, creating transhuman forms of art, etc.
Immediate goods. E.g., “Cure cancer” or “Don’t transform the world into paperclips.” Such replies arguably have problems as ultimate criteria of value from a human standpoint (see linked discussion), but for obvious reasons, lists of immediate goods are a common early thought when first considering the subject.
Deflationary moral error theory. There is no good way to construe a normative concept apart from what particular people want. AI programmers are just doing what they want, and confused talk of ‘fairness’ or ‘rightness’ cannot be rescued. The speaker would nonetheless personally prefer not to be turned into paperclips. (This mostly ends up at an ‘immediate goods’ theory in practice, plus some beliefs relevant to the outer debate about end goals.)
Simple purpose. Value can easily be identified with X, for some X. X is the main thing we should be concerned about passing on to AIs. Seemingly desirable things besides X are either (a) improper to care about, (b) relatively unimportant, or (c) instrumentally implied by pursuing X, qua X.
The following versions of desiderata for AI outcomes would tend to imply that the value alignment / value loading problem is an entirely wrong way of looking at the issue, which might make it disingenuous to claim that ‘value’ in ‘value alignment’ can cover them as a metasyntactic variable as well:
Moral internalist value. The normative is inherently compelling to all, or almost all, cognitively powerful agents. Whatever is not thus compelling cannot be normative or a proper object of human desire.
AI rights. The primary thing is to ensure that the AI’s natural and intrinsic desires are respected. The ideal is to end up in a diverse civilization that respects the rights of all sentient beings, including AIs. (Generally linked are the views that no special selection of AI design is required to achieve this, or that special selection of AI design to shape particular motivations would itself violate AI rights.)
Modularity of ‘value’
Many issues in value alignment seem to generalize very well across the Reflective Equilibrium, Fun Theory, Intuitive Desiderata, and Deflationary Error Theory viewpoints. In all cases we would have to consider stability of self-modification, the Edge Instantiation problem in value identification, and most of the rest of ‘standard’ value alignment theory. This seemingly good generalization of the resulting technical problems across such wide-ranging viewpoints, and especially that it (arguably) covers the case of intuitive desiderata, is what justifies treating ‘value’ as a metasyntactic variable in ‘value loading problem’.
A neutral term for referring to all the values in this class might be ‘alignable values’.
E.g., Juergen Schmidhuber stated at the 20XX Singularity Summit that he thought the only proper and normative goal of any agent was to increase compression of sensory information. Conditioned on this being the sum of all normativity, ‘value’ is algorithmically simple. Then the problems of Edge Instantiation, Unforeseen Maximums, and Nearest Unblocked Neighbor are all moot. (Except perhaps insofar as there is an Ontology Identification problem for defining exactly what constitutes ‘sensory information’ for such an agent.)
Even in the Simple Purpose case, the value loading problem would still exist (it would still be necessary to make an AI that cared about the simple purpose rather than paperclips) along with associated problems of reflective stability (it would be necessary to make an AI that went on caring about X through self-modification). Nonetheless, the overall problem difficulty and immediate technical priorities would be different enough that the Simple Purpose case seems importantly distinct from e.g. Fun Theory on a policy level.
Some viewpoints on ‘value’ deliberately reject Orthogonality. Strong versions of this view claim, as an empirical prediction, that every sufficiently powerful cognitive agent will come to pursue the same end, that this end is to be identified with normativity, and that it is the only proper object of human desire. If true, this would imply that the entire value alignment problem is moot for advanced agents.
Many people who advocate ‘simple purposes’ also claim these purposes are universally compelling. In a policy sense, this seems functionally similar to the Moral Internalist case regardless of the simplicity or complexity of the universally compelling value. Hence an alleged simple universally compelling purpose is categorized for these purposes as Moral Internalist rather than Simple Purpose.
The special case of a Simple Purpose claimed to be universally instrumentally convergent also seems functionally identical to Moral Internalism from a policy standpoint.
Someone might believe as a proposition of fact that all (accessible) AI designs would have ‘innate’ desires, believe as a proposition of fact that no AI would gain enough advantage to wipe out humanity or prevent the creation of other AIs, and assert as a matter of morality that a good outcome consists of everyone being free to pursue their own value and trade. In this case the value alignment problem is implied to be an entirely wrong way to look at the problem, with all associated technical issues moot. Thus, it again might be disingenuous to have ‘value’ as a metasyntactic variable try to cover this case.
- Extrapolated volition (normative moral theory)
If someone asks you for orange juice, and you know that the refrigerator contains no orange juice, should you bring them lemonade?
- Coherent extrapolated volition (alignment target)
A proposed direction for an extremely well-aligned autonomous superintelligence—do what humans would want, if we knew what the AI knew, thought that fast, and understood ourselves.
- Beneficial
Really actually good. A metasyntactic variable to mean “favoring whatever the speaker wants ideally to accomplish”, although different speakers have different morals and metaethics.
- William Frankena's list of terminal values
Life, consciousness, and activity; health and strength; pleasures and satisfactions of all or certain kinds; happiness, beatitude, contentment, etc.; truth; knowledge and true opinions…
- Immediate goods
- Cosmopolitan value
Intuitively: Value as seen from a broad, embracing standpoint that is aware of how other entities may not always be like us or easily understandable to us, yet still worthwhile.
- AI alignment
The great civilizational problem of creating artificially intelligent computer systems such that running them is a good idea.