Reflectively consistent degree of freedom

A “reflectively consistent degree of freedom” arises when a self-modifying AI can have any of several possible properties \(X_i \in X\) such that an AI with property \(X_1\) wants to go on being an AI with property \(X_1,\) an AI with \(X_2\) will ceteris paribus only choose to self-modify into designs that are also \(X_2,\) and so on.

The archetypal reflectively consistent degree of freedom is a Humean degree of freedom: the reflective consistency of many different possible utility functions. If Gandhi doesn’t want to kill you, and you offer Gandhi a pill that makes him want to kill people, then Gandhi will refuse the pill, because he knows that if he takes the pill then pill-taking-future-Gandhi will kill people, and the current Gandhi rates this outcome low in his preference function. Similarly, a paperclip maximizer wants to remain a paperclip maximizer. Since these two possible preference frameworks are both consistent under reflection, they constitute a “reflectively consistent degree of freedom” or “reflective degree of freedom”.
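The pill argument can be sketched as a toy model. Everything in the block below is a hypothetical illustration rather than anything specified in this article: the outcome names, the numeric utilities, and the helper functions `outcome_pursued_by` and `choose_successor` are assumptions made for the example. The one idea taken from the text is that each agent scores candidate successor designs using its current utility function, not the successor's.

```python
# Toy model of the Gandhi pill argument.
# All names and numbers here are illustrative assumptions.

OUTCOMES = ["people_live", "people_die", "paperclips_everywhere"]

# Each design is characterized only by its utility function over outcomes.
UTILITIES = {
    "gandhi":       {"people_live": 1.0,  "people_die": -1.0, "paperclips_everywhere": -0.5},
    "murder_pill":  {"people_live": -1.0, "people_die": 1.0,  "paperclips_everywhere": 0.0},
    "paperclipper": {"people_live": 0.0,  "people_die": 0.0,  "paperclips_everywhere": 1.0},
}


def outcome_pursued_by(design: str) -> str:
    """Predict the outcome a design would steer toward: its own favorite."""
    utility = UTILITIES[design]
    return max(OUTCOMES, key=lambda outcome: utility[outcome])


def choose_successor(current: str, candidates: list[str]) -> str:
    """Pick the successor whose predicted outcome the CURRENT design rates highest."""
    current_utility = UTILITIES[current]
    return max(candidates, key=lambda d: current_utility[outcome_pursued_by(d)])


if __name__ == "__main__":
    designs = list(UTILITIES)
    for design in designs:
        print(design, "chooses to become", choose_successor(design, designs))
```

In this sketch every design selects itself as its successor, so nothing in the self-modification step moves an agent from one utility function to another. That is the sense in which the choice of utility function is a free parameter that reflection alone does not settle.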

From a design perspective, or the standpoint of an AI safety mindset, the key fact about a reflectively consistent degree of freedom is that it doesn’t automatically self-correct as a result of the AI trying to improve itself. The problem “Has trouble understanding General Relativity” or “Cannot beat a human at poker” or “Crashes on seeing a picture of a dolphin” is something that you might expect to correct automatically and without specifically directed effort, assuming you otherwise improved the AI’s general ability to understand the world and that it was self-improving. “Wants paperclips instead of eudaimonia” is not self-correcting.

Another way of looking at it is that reflective degrees of freedom describe information that is not automatically extracted or learned given a sufficiently smart AI, the way it would automatically learn General Relativity. If you have a concept whose borders (membership condition) rely on knowing about General Relativity, then when the AI is sufficiently smart it will see a simple definition of that concept. If the concept’s borders instead rely on value-laden judgments, there may be no algorithmically simple description of that concept, even given lots of knowledge of the environment, because the Humean degrees of freedom need to be independently specified.

Other properties besides the preference function look like they should be reflectively consistent in similar ways. For example, son of CDT and UDT both seem to be reflectively consistent in different ways. So an AI that has, from our perspective, a ‘bad’ decision theory (one that leads to behaviors we don’t want), isn’t ‘bugged’ in a way we can rely on to self-correct. (This is one reason why MIRI studies decision theory and not computer vision. There’s a sense in which mistakes in computer vision automatically fix themselves, given a sufficiently advanced AI, and mistakes in decision theory don’t fix themselves.)

Similarly, Bayesian priors are by default consistent under reflection: if you’re a Bayesian with a prior, you want to create copies of yourself that have the same prior or Bayes-updated versions of the prior. So ‘bugs’ (from a human standpoint) like being Pascal’s Muggable might not automatically fix themselves in a way that correlates with sufficient growth in other knowledge and general capability, in the way we might expect a specific mistaken belief about gravity to correct itself as general capability grows. (This is why MIRI thinks about naturalistic induction and similar questions about prior probabilities.)
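One way to make the claim about priors concrete is a small sketch under stated assumptions. It supposes, purely for illustration, that an agent rates a candidate successor by the expected logarithmic score of the successor's probabilities, with the expectation taken under the agent's own prior. The log score is a stand-in chosen for this example, not a criterion given in this article, and the priors and function names are hypothetical. Under that assumption, Gibbs' inequality implies each agent rates its own prior highest.

```python
import math

# Hypothetical priors over two hypotheses; the numbers are illustrative only.
PRIORS = {
    "A": [0.9, 0.1],
    "B": [0.5, 0.5],
    "C": [0.999999, 0.000001],  # near-dogmatic prior, a 'bug' from our standpoint
}


def expected_log_score(own_prior: list[float], successor_prior: list[float]) -> float:
    """Expected log score of the successor's probabilities,
    with the expectation taken under the evaluating agent's own prior."""
    return sum(p * math.log(q) for p, q in zip(own_prior, successor_prior) if p > 0)


def preferred_successor(current: str) -> str:
    """Return the prior that the current agent would most like its successor to have."""
    own = PRIORS[current]
    return max(PRIORS, key=lambda name: expected_log_score(own, PRIORS[name]))


if __name__ == "__main__":
    for name in PRIORS:
        print(name, "prefers a successor with prior", preferred_successor(name))
```

Note that the near-dogmatic prior `C` endorses itself just as firmly as the others do. Judged from the inside, no prior looks like a mistake, which is why such 'bugs' need not shrink as the agent grows more capable.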


  • Humean degree of freedom

    A concept includes ‘Humean degrees of freedom’ when the intuitive borders of the human version of that concept depend on our values, making that concept less natural for AIs to learn.

  • Value-laden

    Cure cancer, but avoid any bad side effects? Categorizing “bad side effects” requires knowing what’s “bad”. If an agent needs to load complex human goals to evaluate something, it’s “value-laden”.


  • Reflective stability

    Wanting to think the way you currently think, building other agents and self-modifications that think the same way.