Consequentialist preferences are reflectively stable by default

Suppose that Gandhi doesn't want people to be murdered. Imagine that you offer Gandhi a pill that will make him start wanting to kill people. If Gandhi knows that this is what the pill does, Gandhi will refuse the pill, because Gandhi expects the result of taking the pill to be that future-Gandhi wants to murder people and then murders people and then more people are murdered, and Gandhi regards this as bad. By a similar logic, a sufficiently intelligent paperclip maximizer—an agent which always outputs the action it expects to lead to the greatest number of paperclips—will by default not perform any self-modification action that makes it not want to produce paperclips, because then future-Clippy will produce fewer paperclips, and then there will be fewer paperclips, so present-Clippy does not evaluate this self-modification as the action that produces the highest number of expected future paperclips.
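The decision rule at work here can be sketched in a few lines of toy code. This is an illustrative model only (the world values, action names, and utility functions below are invented for the example): present-Clippy scores every candidate action, including rewriting its own utility function, by the number of paperclips its *current* utility function expects to result.

```python
# Toy sketch of a consequentialist agent evaluating self-modification.
# All names and numbers here are hypothetical, chosen only to illustrate
# the argument in the text.

def paperclips_made(utility):
    """An agent optimizing `utility` later steers toward whichever
    world that utility ranks highest; return the actual paperclip
    count of that world."""
    worlds = {"many_paperclips": 100, "few_paperclips": 1}
    best_world = max(worlds, key=lambda w: utility(worlds[w]))
    return worlds[best_world]

def clippy_utility(paperclip_count):
    return paperclip_count        # current values: more paperclips is better

def anti_clippy_utility(paperclip_count):
    return -paperclip_count       # a modified successor that dislikes paperclips

# Present-Clippy judges each self-modification by the real number of
# paperclips the resulting successor would cause, as scored by its
# CURRENT utility function, not the successor's.
actions = {
    "keep_current_utility": clippy_utility,
    "self_modify_to_anti_clippy": anti_clippy_utility,
}
best_action = max(actions, key=lambda a: paperclips_made(actions[a]))
print(best_action)  # -> keep_current_utility
```

The modified successor would steer toward the one-paperclip world, so present-Clippy scores that self-modification at 1 expected paperclip versus 100 for staying as it is, and rejects it.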

Another way of stating this is that protecting the representation of the utility function, and creating only other agents with similar utility functions, are both convergent instrumental strategies for consequentialist agents that understand the big-picture relation between their code and its real-world consequences.

Although the instrumental incentive to prefer stable preferences seems like it should follow from consequentialism plus big-picture understanding, less advanced consequentialists might not be able to self-modify in a way that preserves their preferences: they might not understand which self-modifications or constructed successors lead to which kinds of outcomes. We could see this as a case of "The agent has no preference-preserving self-improvements in its subjective policy space, but would want an option like that if available."

That is:

  • Wanting preference stability follows from Consequentialism plus Big-Picture Understanding.

  • Actual preference stability furthermore requires some prerequisite level of skill at self-modification (which might be high), or else enough caution to refrain from self-modifying absent the policy option of preserving preferences.
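The second bullet can also be put in toy-model terms. In this hypothetical sketch (all numbers invented for illustration), a cautious consequentialist would take a capability-improving self-modification if it could verify that its preferences survive, but declines when no such verified option exists in its policy space:

```python
# Hypothetical sketch of the second bullet: expected paperclips from a
# self-modification depend on whether preferences survive the rewrite.
# The payoffs (120, 100, 1) and probability (0.4) are invented numbers.

def expected_paperclips(p_preserved):
    # If preferences survive, the improved successor makes 120 paperclips;
    # if they are scrambled, the successor makes ~1.
    return p_preserved * 120 + (1 - p_preserved) * 1

status_quo = 100  # the present agent, unmodified, makes 100 paperclips

unverified = expected_paperclips(0.4)  # can't tell what the rewrite does
verified = expected_paperclips(1.0)    # a preference-preserving option exists

print(unverified < status_quo < verified)  # -> True
```

Under these assumed numbers, the agent self-modifies only when a preference-preserving option is available; otherwise caution (sticking with the status quo) wins on its own expected-paperclip criterion.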


  • Reflective stability

    Wanting to think the way you currently think, building other agents and self-modifications that think the same way.

  • Convergent instrumental strategies

    Paperclip maximizers can make more paperclips by improving their cognitive abilities or controlling more resources. What other strategies would almost any AI try to use?