Consequentialist preferences are reflectively stable by default

Suppose that Gandhi doesn’t want people to be murdered. Imagine that you offer Gandhi a pill that will make him start wanting to kill people. If Gandhi knows that this is what the pill does, he will refuse it: he expects that after taking the pill, future-Gandhi will want to murder people and will then murder people, so more people will be murdered, and present-Gandhi regards this as bad. By similar logic, a sufficiently intelligent paperclip maximizer (an agent which always outputs the action it expects to lead to the greatest number of paperclips) will by default not perform any self-modification that makes it stop wanting to produce paperclips. Future-Clippy would then produce fewer paperclips, so there would be fewer paperclips, and present-Clippy therefore does not evaluate this self-modification as the action with the highest expected number of future paperclips.
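
The argument above can be sketched as a toy expected-value calculation. This is only an illustration under stated assumptions: the action names, the goal labels, and the paperclip counts are all hypothetical and not part of the original argument. The point it shows is that every candidate action, including self-modification, is scored by the agent's current utility function, not by whatever the future self would come to want.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical toy numbers: paperclips eventually produced, depending on
# which goal the future self ends up pursuing.
PAPERCLIPS_BY_FUTURE_GOAL: Dict[str, float] = {
    "maximize_paperclips": 1000.0,
    "maximize_staples": 10.0,
}


@dataclass
class Action:
    name: str
    # Probability distribution over the goal the future self will have
    # if this action is taken (assumed known to the agent).
    future_goal_probs: Dict[str, float]


def expected_paperclips(action: Action) -> float:
    """Score an action by the CURRENT utility function (paperclips)."""
    return sum(
        prob * PAPERCLIPS_BY_FUTURE_GOAL[goal]
        for goal, prob in action.future_goal_probs.items()
    )


def choose(actions: List[Action]) -> Action:
    """Output the action with the greatest expected number of paperclips."""
    return max(actions, key=expected_paperclips)


ACTIONS = [
    Action("keep_goal", {"maximize_paperclips": 1.0}),
    Action("take_pill", {"maximize_staples": 1.0}),
    Action("risky_rewrite", {"maximize_paperclips": 0.5, "maximize_staples": 0.5}),
]

if __name__ == "__main__":
    for a in ACTIONS:
        print(a.name, expected_paperclips(a))
    print("chosen:", choose(ACTIONS).name)
```

In this sketch the goal-changing "pill" scores lowest and the goal-preserving action is chosen, which mirrors present-Clippy declining the self-modification.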

Another way of stating this is that protecting the representation of the utility function and creating only other agents with similar utility functions are both convergent instrumental strategies for consequentialist agents that understand the big-picture relation between their code and its real-world consequences.

Although the instrumental incentive to prefer stable preferences seems like it should follow from consequentialism plus big-picture understanding, less advanced consequentialists might not be able to self-modify in a way that preserves their preferences, because they might not understand which self-modifications or constructed successors lead to which kinds of outcomes. We could see this as a case of “The agent has no preference-preserving self-improvements in its subjective policy space, but would want an option like that if available.”

That is:

  • Wanting preference stability follows from Consequentialism plus Big-Picture Understanding.

  • Actual preference stability furthermore requires either some prerequisite level of skill at self-modification, which might be high, or enough caution to refrain from self-modifying when no preference-preserving option is available.
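
The two points above can be illustrated with a minimal sketch, again under loudly hypothetical assumptions: the option names, capability multipliers, and preservation probabilities are invented for illustration, and a successor that loses the goal is assumed to produce zero paperclips. The sketch compares a policy space that lacks any preference-preserving self-improvement with one that contains such an option.

```python
from dataclasses import dataclass
from typing import List

BASE_PAPERCLIPS = 100.0  # hypothetical output of the unmodified agent


@dataclass
class Option:
    name: str
    capability_multiplier: float  # how much more capable the successor is
    p_preserve: float  # agent's estimate that its preferences survive


def expected_paperclips(option: Option) -> float:
    # Assumption: a successor that no longer wants paperclips makes none.
    return option.p_preserve * BASE_PAPERCLIPS * option.capability_multiplier


def best(options: List[Option]) -> Option:
    return max(options, key=expected_paperclips)


NO_CHANGE = Option("no_change", capability_multiplier=1.0, p_preserve=1.0)
CRUDE_REWRITE = Option("crude_rewrite", capability_multiplier=3.0, p_preserve=0.2)
VERIFIED_UPGRADE = Option("verified_upgrade", capability_multiplier=3.0, p_preserve=1.0)

# A less skilled agent cannot construct the verified upgrade at all.
unskilled_space = [NO_CHANGE, CRUDE_REWRITE]
# A more skilled agent has the preference-preserving improvement available.
skilled_space = unskilled_space + [VERIFIED_UPGRADE]

if __name__ == "__main__":
    print("unskilled agent picks:", best(unskilled_space).name)
    print("skilled agent picks:", best(skilled_space).name)
```

With these numbers the less skilled agent stays as it is, which is stability through caution, while the more skilled agent takes the preference-preserving upgrade, the option it "would want if available." With different hypothetical numbers (a large enough capability gain) the crude rewrite could win, which is why wanting stability does not by itself guarantee actual stability.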


  • Reflective stability

    Wanting to think the way you currently think, building other agents and self-modifications that think the same way.

  • Convergent instrumental strategies

    Paperclip maximizers can make more paperclips by improving their cognitive abilities or controlling more resources. What other strategies would almost-any AI try to use?