Reflectively consistent degree of freedom

A “reflectively consistent degree of freedom” exists when a self-modifying AI can have multiple possible properties \(X_i \in X\) such that an AI with property \(X_1\) wants to go on being an AI with property \(X_1,\) and an AI with \(X_2\) will ceteris paribus only choose to self-modify into designs that are also \(X_2,\) and so on.

The archetypal reflectively consistent degree of freedom is a Humean degree of freedom: the reflective consistency of many different possible utility functions. If Gandhi doesn’t want to kill you, and you offer Gandhi a pill that makes him want to kill people, then Gandhi will refuse the pill, because he knows that if he takes the pill then pill-taking-future-Gandhi will kill people, and the current Gandhi rates this outcome low in his preference function. Similarly, a paperclip maximizer wants to remain a paperclip maximizer. Since these two possible preference frameworks are both consistent under reflection, they constitute a “reflectively consistent degree of freedom” or “reflective degree of freedom”.
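The Gandhi argument can be illustrated with a toy model. The sketch below is only an illustration under strong assumptions: the outcome table, the three utility functions, and the helper names (`outcome_produced_by`, `choose_successor`) are all hypothetical and not part of any real system. The one idea it tries to capture is that an agent scores candidate successor designs using its current preference function, applied to the outcomes those successors would bring about.

```python
from typing import Callable, Dict

Outcome = Dict[str, int]
Utility = Callable[[Outcome], float]

# Hypothetical outcomes an agent could steer the world toward.
FEASIBLE_OUTCOMES = [
    {"people_alive": 100, "paperclips": 0},
    {"people_alive": 0, "paperclips": 10},
    {"people_alive": 10, "paperclips": 100},
]

# Three toy preference functions (names are illustrative assumptions).
UTILITIES: Dict[str, Utility] = {
    "gandhi": lambda o: o["people_alive"],
    "pill_gandhi": lambda o: -o["people_alive"],
    "paperclipper": lambda o: o["paperclips"],
}


def outcome_produced_by(utility: Utility) -> Outcome:
    """Toy assumption: an agent steers toward its favorite feasible outcome."""
    return max(FEASIBLE_OUTCOMES, key=utility)


def choose_successor(current: str) -> str:
    """Pick a successor design, scoring each by the CURRENT agent's utility."""
    current_utility = UTILITIES[current]
    return max(
        UTILITIES,
        key=lambda name: current_utility(outcome_produced_by(UTILITIES[name])),
    )


for name in UTILITIES:
    print(name, "->", choose_successor(name))
```

In this toy setup every agent selects itself as its successor: Gandhi refuses the pill, and the paperclip maximizer stays a paperclip maximizer. Nothing in the selection procedure favors one preference function over another, which is what makes this a degree of freedom.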

From a design perspective, or the standpoint of an AI safety mindset, the key fact about a reflectively consistent degree of freedom is that it doesn’t automatically self-correct as a result of the AI trying to improve itself. A problem like “Has trouble understanding General Relativity” or “Cannot beat a human at poker” or “Crashes on seeing a picture of a dolphin” is something that you might expect to correct itself automatically, without specifically directed effort, assuming the AI is self-improving and you otherwise improve its general ability to understand the world. “Wants paperclips instead of eudaimonia” is not self-correcting.

Another way of looking at it is that reflective degrees of freedom describe information that is not automatically extracted or learned given a sufficiently smart AI, the way it would automatically learn General Relativity. If you have a concept whose borders (membership conditions) rely on knowing about General Relativity, then when the AI is sufficiently smart it will see a simple definition of that concept. If the concept’s borders instead rely on value-laden judgments, there may be no algorithmically simple description of that concept, even given lots of knowledge of the environment, because the Humean degrees of freedom need to be independently specified.

Other properties besides the preference function look like they should be reflectively consistent in similar ways. For example, son of CDT and UDT both seem to be reflectively consistent in different ways. So an AI that has, from our perspective, a ‘bad’ decision theory (one that leads to behaviors we don’t want) isn’t ‘bugged’ in a way we can rely on to self-correct. (This is one reason why MIRI studies decision theory and not computer vision. There’s a sense in which mistakes in computer vision automatically fix themselves, given a sufficiently advanced AI, and mistakes in decision theory don’t fix themselves.)

Similarly, Bayesian priors are by default consistent under reflection: if you’re a Bayesian with a prior, you want to create copies of yourself that have the same prior or Bayes-updated versions of the prior. So ‘bugs’ (from a human standpoint) like being Pascal’s Muggable might not automatically fix themselves as other knowledge and general capability grow, in the way we might expect a specific mistaken belief about gravity to correct itself given sufficient general growth in capability. (This is why MIRI thinks about naturalistic induction and similar questions about prior probabilities.)
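One way to make the claim about priors concrete is a small numerical sketch. It rests on an assumption that is not in the text above: that the agent judges a successor by the expected logarithmic score of the successor’s probabilities, with the expectation taken under the agent’s own prior. The prior names and the functions `expected_log_score` and `preferred_successor_prior` are hypothetical. Under that assumption, Gibbs’ inequality implies the expected score is maximized when the successor has exactly the agent’s own prior.

```python
import math
from typing import Dict, List

# Hypothetical priors over two hypotheses (names are illustrative).
PRIORS: Dict[str, List[float]] = {
    "skeptical": [0.9, 0.1],
    "uniform": [0.5, 0.5],
    "credulous": [0.1, 0.9],
}


def expected_log_score(own_prior: List[float], successor_prior: List[float]) -> float:
    """Expected log score of the successor's probabilities, as judged by own_prior."""
    return sum(p * math.log(q) for p, q in zip(own_prior, successor_prior))


def preferred_successor_prior(current: str) -> str:
    """Return the candidate prior the current agent would most like its copy to have."""
    own = PRIORS[current]
    return max(PRIORS, key=lambda name: expected_log_score(own, PRIORS[name]))


for name in PRIORS:
    print(name, "->", preferred_successor_prior(name))
```

Each agent in this sketch prefers a copy with its own prior, so an agent whose prior looks ‘bugged’ from the outside does not itself regard any of the alternative priors as an improvement.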


  • Humean degree of freedom

    A concept includes ‘Humean degrees of freedom’ when the intuitive borders of the human version of that concept depend on our values, making that concept less natural for AIs to learn.

  • Value-laden

    Cure cancer, but avoid any bad side effects? Categorizing “bad side effects” requires knowing what’s “bad”. If an agent needs to load complex human goals to evaluate something, it’s “value-laden”.


  • Reflective stability

    Wanting to think the way you currently think, building other agents and self-modifications that think the same way.