Effability principle

A proposed principle of AI alignment stating, “The more insight you have into the deep structure of an AI’s cognitive operations, the more likely you are to succeed in aligning that AI.”

As an example of increased effability, consider the difference between having the idea of expected utility while building your AI, versus having never heard of expected utility. The idea of expected utility is so well-known that it may not seem salient as an insight anymore, but consider the difference between having this idea and not having it.

Staring at the expected utility principle, and at how it decomposes into a utility function and a probability distribution, leads to a potentially obvious-sounding but still rather important insight:

Rather than all behaviors, policies, and goals needing to be up-for-grabs in order for an agent to adapt itself to a changing and unknown world, the agent can have a stable utility function and a changing probability distribution.

E.g., when the agent tries to grab the cheese and discovers that the cheese is too high, we can view this as an update to the agent’s beliefs about how to get cheese, without changing the fact that the agent wants cheese.
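
The cheese example can be made concrete with a small sketch. Everything in it is an illustrative assumption rather than part of the original argument: the names `CheeseAgent`, `UTILITY`, and `run_episode`, the two available actions, the Beta-style pseudo-counts used as beliefs, and the toy world in which reaching always fails. The point it tries to show is only that the agent's behavior changes while its utility function stays untouched.

```python
from dataclasses import dataclass, field

# Stable preferences: the agent wants cheese. Nothing below ever modifies this.
UTILITY = {"cheese": 1.0, "no_cheese": 0.0}


@dataclass
class CheeseAgent:
    # Beliefs as [successes, failures] pseudo-counts per action (hypothetical priors).
    beliefs: dict = field(default_factory=lambda: {
        "reach": [4.0, 1.0],      # initially expects reaching to work 80% of the time
        "climb_box": [3.0, 2.0],  # initially expects climbing to work 60% of the time
    })

    def p_success(self, action):
        successes, failures = self.beliefs[action]
        return successes / (successes + failures)

    def expected_utility(self, action):
        p = self.p_success(action)
        return p * UTILITY["cheese"] + (1.0 - p) * UTILITY["no_cheese"]

    def choose(self):
        # Pick the action with the highest expected utility under current beliefs.
        return max(self.beliefs, key=self.expected_utility)

    def observe(self, action, got_cheese):
        # Only the probability distribution is updated; the utility is not.
        self.beliefs[action][0 if got_cheese else 1] += 1.0


def run_episode(max_steps=10):
    agent = CheeseAgent()
    history = []
    for _ in range(max_steps):
        action = agent.choose()
        # Assumed toy world: the cheese is too high to reach, but climbing works.
        got_cheese = (action == "climb_box")
        agent.observe(action, got_cheese)
        history.append(action)
        if got_cheese:
            break
    return history, agent
```

Calling `run_episode()` returns the sequence of actions the agent tried together with the agent itself, so the final beliefs can be inspected. In this toy setup the agent stops reaching once failed attempts have lowered its estimate for that action below its estimate for climbing, and `UTILITY` is the same at the end as at the start.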

Similarly, if we want superhuman performance at playing chess, we can ask for an AI that has a known, stable, understandable preference to win chess games, together with a probability distribution that has been refined to greater-than-human accuracy about which policies yield a greater probability of winning from a given chess position.

Then contrast this to the state of mind where you haven’t decomposed your understanding of cognition into preference-ish parts and belief-ish parts. In this state of mind, for all you know, every aspect of the AI’s behavior, every goal it has, might need to change in order for the AI to deal with a changing world; otherwise the AI will just be stuck executing the same behaviors over and over… right? Obviously, this notion of AI with unchangeable preferences is just a fool’s errand. Any AI like that would be too stupid to make a major difference for good or bad. (Note: The idea of instrumental convergence is also important here; e.g., scientific curiosity is already an instrumental strategy for ‘make as many paperclips as possible’, so an AI does not need a separate terminal preference about scientific curiosity in order to ever engage in it.)

(This argument has indeed been encountered in the wild many times.)

Probability distributions and utility functions have now been known for a relatively long time and are understood relatively well; people have made many, many attempts to poke at their structure, imagine potential variations, and show what goes wrong with those variations. An enormous family of coherence theorems is now known, stating that “Strategies which are not qualitatively dominated can be viewed as coherent with some consistent probability distribution and utility function.” This suggests that we can in a broad sense expect that, as a sufficiently advanced AI’s behavior is more heavily optimized for not qualitatively shooting itself in the foot, that AI will end up exhibiting some aspects of expected-utility reasoning. We have some idea of why a sufficiently advanced AI would have expected-utility-ish things going on somewhere inside it, or at least behave that way so far as we could tell by looking at the AI’s external actions.
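
One simple flavor of what "qualitatively dominated" can mean is the classic money-pump argument against cyclic preferences. The sketch below is a toy illustration of that single argument, not a statement of the coherence theorems themselves. The function `run_money_pump`, the items `A`, `B`, `C`, the fee, and the sequence of offers are all hypothetical choices made for the example.

```python
def run_money_pump(prefers, start_item, offers, fee=1.0, money=10.0):
    """Offer a sequence of swaps; the agent pays `fee` for each swap it strictly prefers."""
    item = start_item
    for offered in offers:
        if prefers(offered, item):
            item, money = offered, money - fee
    return item, money


# Incoherent agent: strict preferences that form a cycle (A over B, B over C, C over A).
CYCLE = {("A", "B"), ("B", "C"), ("C", "A")}


def cyclic_prefers(x, y):
    return (x, y) in CYCLE


# Coherent agent: preferences induced by a utility function (values are arbitrary).
UTILITY = {"A": 3.0, "B": 2.0, "C": 1.0}


def utility_prefers(x, y):
    return UTILITY[x] > UTILITY[y]


# The same offers are made to both agents, three rounds in a row.
OFFERS = ["B", "A", "C"] * 3

cyclic_result = run_money_pump(cyclic_prefers, "C", OFFERS)
coherent_result = run_money_pump(utility_prefers, "C", OFFERS)
```

Under these assumptions the cyclic agent accepts every offer and ends up holding the item it started with while having paid a fee on every swap, whereas the utility-based agent trades up to its favorite item and then declines all further offers. The cyclic agent's strategy is dominated in the sense that it could have kept its starting item and all of its money.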

So we can say, “Look, if you don’t explicitly write in a utility function, the AI is probably going to end up with something like a utility function anyway; you just won’t know where it is. It seems considerably wiser to know what that utility function says and write it in on purpose. Heck, even if you say you explicitly don’t want your AI to have a stable utility function, you’d need to know all the coherence theorems you’re trying to defy by saying that!”

The Effability Principle states (or rather hopes) that as we get marginally more of this general kind of insight into an AI’s operations, we become marginally more likely to be able to align the AI.

The example of expected utility arguably suggests that if there are any more ideas like that lying around, which we don’t yet have, our lack of those ideas may entirely doom the AI alignment project or at least make it far more difficult. We can in principle imagine someone who is just using a big reinforcement learner to try to execute some large pivotal act, who has no idea where the AI is keeping its consequentialist preferences or what those preferences are; and yet this person was so paranoid and had the resources to put in so much monitoring and had so many tripwires and safeguards and was so conservative in how little they tried to do, that they succeeded anyway. But it doesn’t sound like a good idea to try in real life.

The search for increased effability has generally motivated the “Agent Foundations” agenda of research within MIRI. While this is not the only aspect of AI alignment, one concern is that this kind of deep insight may be a heavily serially-loaded task in which researchers need to develop one idea after another, compared to relatively shallow ideas in AI alignment that require less serial time to create. That is, this kind of research is among the most important kinds of research to start early.

The chief rival to effability is the Supervisability Principle, which, while not directly opposed to effability, tends to focus our understanding of the AI at a much larger grain size. For example, the Supervisability Principle says, “Since the AI’s behaviors are the only thing we can train by direct comparison with something we know to be already aligned, namely human behaviors, we should focus on ensuring the greatest possible fidelity at that point, rather than any smaller pieces whose alignment cannot be directly determined and tested in the same way.” Note that both principles agree that it’s important to understand certain facts about the AI as well as possible, but they disagree about which aspects of the AI should be our design priority to render maximally understandable.