Moral uncertainty

“Moral uncertainty” in the context of AI refers to an agent having an “uncertain utility function”. That is, we can view the agent as pursuing a utility function that takes on different values in different subsets of possible worlds.

For example, an agent might have a meta-utility function saying that eating cake has a utility of €8 in worlds where Lee Harvey Oswald shot John F. Kennedy and that eating cake has a utility of €10 in worlds where it was the other way around. This agent will be motivated to inquire into political history to find out which utility function is probably the ‘correct’ one (relative to this meta-utility function), though it will never be absolutely sure.
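One way to see why such an agent is motivated to inquire is to give it a choice whose best answer depends on which world it is in. The sketch below is only illustrative. It uses the cake utilities from the example and adds a hypothetical alternative action, eating pie, with an assumed utility of 9.5 in every world. It also idealizes inquiry as perfectly revealing the world, which the text notes never fully happens.

```python
# Cake utilities are taken from the example above.
CAKE_UTILITY = {"oswald_shot_jfk": 8.0, "other_way_around": 10.0}

# Hypothetical alternative action, assumed to be worth the same in every world.
PIE_UTILITY = 9.5


def expected_cake_utility(p_oswald):
    """Expected utility of eating cake, given credence p_oswald that Oswald shot JFK."""
    return (p_oswald * CAKE_UTILITY["oswald_shot_jfk"]
            + (1 - p_oswald) * CAKE_UTILITY["other_way_around"])


def best_without_inquiry(p_oswald):
    """The agent must pick one action now, using only its current credence."""
    return max(expected_cake_utility(p_oswald), PIE_UTILITY)


def best_with_perfect_inquiry(p_oswald):
    """Idealization: the agent learns which world it is in, then picks the best action."""
    return (p_oswald * max(CAKE_UTILITY["oswald_shot_jfk"], PIE_UTILITY)
            + (1 - p_oswald) * max(CAKE_UTILITY["other_way_around"], PIE_UTILITY))


def value_of_inquiry(p_oswald):
    """Expected gain from resolving the moral uncertainty before acting."""
    return best_with_perfect_inquiry(p_oswald) - best_without_inquiry(p_oswald)


if __name__ == "__main__":
    print(value_of_inquiry(0.5))
```

Under these assumed numbers the value of inquiry is positive at a 50% credence, because learning the answer would change which action the agent takes. If no available action's ranking depended on the answer, the same calculation would give zero.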

Moral uncertainty must be resolvable by some conceivable observation in order to function as uncertainty. Suppose, for example, that an agent’s probability distribution \(\Delta U\) over the ‘true’ utility function \(U\) asserts a dependency on a fair quantum coin that was flipped inside a sealed box and then destroyed by explosives: the utility function is \(U_1\) over outcomes in the worlds where the coin came up heads, and \(U_2\) in the worlds where it came up tails. If the agent thinks it has no way of ever figuring out what happened inside the box, then no observation can move its credence away from 50/50, and it will thereafter behave as if it had a single, constant, certain utility function equal to \(0.5 \cdot U_1 + 0.5 \cdot U_2.\)
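The equivalence in the sealed-box example follows from the linearity of expected utility, and a small numeric check can make it concrete. In the sketch below, the outcomes, the actions, and all utility and probability values are hypothetical and chosen only for illustration. Only the fair coin and the \(0.5 \cdot U_1 + 0.5 \cdot U_2\) mixture come from the text.

```python
P_HEADS = 0.5  # fair coin, as in the text

# Hypothetical utility functions over outcomes.
U1 = {"cake": 8.0, "pie": 2.0, "nothing": 0.0}  # applies if the coin came up heads
U2 = {"cake": 1.0, "pie": 6.0, "nothing": 0.0}  # applies if the coin came up tails

# Hypothetical actions, each giving a probability distribution over outcomes.
ACTIONS = {
    "bake_cake": {"cake": 0.9, "nothing": 0.1},
    "bake_pie": {"pie": 0.8, "nothing": 0.2},
}


def eu_under_uncertainty(action):
    """Average over the unobservable coin, then over the action's outcomes."""
    dist = ACTIONS[action]
    eu_heads = sum(p * U1[outcome] for outcome, p in dist.items())
    eu_tails = sum(p * U2[outcome] for outcome, p in dist.items())
    return P_HEADS * eu_heads + (1 - P_HEADS) * eu_tails


def mixture_utility(outcome):
    """The single, certain utility function 0.5 * U1 + 0.5 * U2."""
    return P_HEADS * U1[outcome] + (1 - P_HEADS) * U2[outcome]


def eu_under_mixture(action):
    """Expected utility for an agent that simply has the mixture utility function."""
    return sum(p * mixture_utility(outcome) for outcome, p in ACTIONS[action].items())


if __name__ == "__main__":
    for action in ACTIONS:
        print(action, eu_under_uncertainty(action), eu_under_mixture(action))
```

Both calculations assign each action the same expected utility, so the two agents rank actions identically. The equivalence would break only if some action could produce evidence about the coin, which is exactly the case the example rules out.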


  • Ideal target

    The ‘ideal target’ of a meta-utility function is the value the ground-level utility function would take on if the agent updated on all possible evidence; that is, the ‘true’ utilities under moral uncertainty.