An advanced agent can be said to be “totally aligned” when it can assess the exact value of well-described outcomes and hence the exact subjective value of actions, policies, and plans; where value has its overridden meaning of a metasyntactic variable standing in for “whatever we really do or really should value in the world or want from an Artificial Intelligence” (this is the same as “normative” if the speaker believes in normativity). That is: It’s an advanced agent that captures all the distinctions we would make or should make within which outcomes are good or bad; it has “full coverage” of the true or intended goals; it correctly resolves every reflectively consistent degree of freedom.
We don’t need to try and give such an AI simplified orders like, e.g., “try to have a lower impact” because we’re worried about, e.g., a nearest unblocked strategy problem on trying to draw exact boundaries around what constitutes a bad impact. The AI knows everything worth knowing about which impacts are bad, and even if it thinks of a really weird exotic plan, it will still be able to figure out which aspects of this plan match our intended notion of value or a normative notion of value.
If this agent does not systematically underestimate the probability of bad outcomes / overestimate the probability of good outcomes, and its maximization over policies is not subject to adverse subjection, its estimates of expected value will be well-calibrated even from our own outside standpoint.
- Value alignment problem
You want to build an advanced AI with the right values… but how?