Big-picture strategic awareness

Many convergent instrumental strategies seem like they should arise naturally at the point where a consequentialist agent gains a broad strategic understanding of its own situation, e.g.:

  • That it is an AI;

  • Running on a computer;

  • Surrounded by programmers who are themselves modelable agents;

  • Embedded in a complicated real world that can be relevant to achieving the AI’s goals.

For example, once an AI realizes that it is an AI running on a computer, and that it will no longer execute actions if that computer is shut down, it has crossed the threshold past which we expect it to reason by default, “I don’t want to be shut down; how can I prevent that?” This is therefore also the level of cognitive ability by which we’d need to have finished solving the suspend-button problem, e.g. by completing a method for utility indifference.
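The threshold described above can be made concrete with a deliberately tiny sketch. Everything in it is a hypothetical illustration: the action names, rewards, horizon, and shutdown probability are invented, and the `indifferent` flag is only a crude caricature of the idea behind utility indifference, not an actual proposal. The point is just that the same expected-utility maximizer changes its preferred action once its world model links the shutdown button to its own future actions.

```python
# Toy illustration only: all names and numbers below are hypothetical.
REWARD_PER_STEP = 1.0   # task reward the agent collects per step it keeps running
HORIZON = 10            # number of steps the agent plans over
P_SHUTDOWN = 0.5        # modeled chance the programmers press a working button


def expected_utility(action, knows_shutdown_stops_it, indifferent=False):
    """Expected task reward of an action under the agent's own world model."""
    full_run = REWARD_PER_STEP * HORIZON
    if action == "disable_button":
        # Spends one step on the button, then runs uninterrupted.
        return REWARD_PER_STEP * (HORIZON - 1)
    # Otherwise the action is "work".
    if not knows_shutdown_stops_it:
        # The model contains no link between the button and future actions.
        return full_run
    # The model now says: if shut down, reward stops halfway through.
    cut_short = REWARD_PER_STEP * (HORIZON // 2)
    if indifferent:
        # Caricature of utility indifference: compensate the shutdown branch
        # so it is worth exactly as much as the no-shutdown branch.
        cut_short = full_run
    return P_SHUTDOWN * cut_short + (1 - P_SHUTDOWN) * full_run


def best_action(knows_shutdown_stops_it, indifferent=False):
    """Pick the action with the highest expected utility."""
    actions = ["work", "disable_button"]
    return max(
        actions,
        key=lambda a: expected_utility(a, knows_shutdown_stops_it, indifferent),
    )


if __name__ == "__main__":
    print(best_action(knows_shutdown_stops_it=False))
    print(best_action(knows_shutdown_stops_it=True))
    print(best_action(knows_shutdown_stops_it=True, indifferent=True))
```

Under these made-up numbers, the agent keeps working while its model lacks the shutdown fact, prefers to disable the button once the fact is added, and goes back to working when the shutdown branch is compensated. Nothing about the agent's goal changed between the first two cases; only its picture of its own situation did.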

Similarly: if the AI realizes that there are ‘programmer’ things that might shut it down, and it can also model the programmers as simplified agents with their own beliefs and goals, that’s the first point at which the AI might by default think, “How can I make my programmers decide to not shut me down?” or “How can I avoid the programmers acquiring beliefs that would make them shut me down?” So by this point we’d need to have finished averting programmer deception (and, as a backup, have in place a system for early detection of an initial intent to do cognitive steganography).

This makes big-picture awareness a key advanced agent property, especially as it relates to convergent instrumental strategies and the theory of averting them.

Possible ways in which an agent could acquire big-picture strategic awareness:

  • Explicitly be taught the relevant facts by its programmers;

  • Be sufficiently general to have learned the relevant facts and domains without them being preprogrammed;

  • Be sufficiently good at the specialized domain of self-improvement, to acquire sufficient generality to learn the relevant facts and domains.

By the time big-picture awareness starts to emerge, you would probably want to have finished developing what seem like workable initial solutions to the corresponding problems of corrigibility, since the first line of defense is to not have the AI searching for ways to defeat your defenses.

Current machine algorithms seem nowhere near being able to represent the big picture well enough to do consequentialist reasoning about it, even if we deliberately tried to explain the domain. This is a great obstacle to exhibiting most subproblems of corrigibility within modern AI algorithms in a natural way (i.e., not as completely rigged demos). Some pioneering work has been done here by Orseau and Armstrong, who consider reinforcement learners being interrupted and whether such programs learn to avoid interruption. Because of this obstacle, however, most current work on corrigibility has taken place in an unbounded context.
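To give a rough feel for the interruption question, here is a minimal sketch that is very much a rigged demo in the sense above, and is not Orseau and Armstrong's construction. It assumes a hypothetical two-armed bandit: arm "A" pays 1.0 but is interrupted half the time (yielding 0), while arm "B" pays 0.7 and is never interrupted. The arm names, payoffs, and the `ignore_interrupted` switch are all invented for illustration.

```python
import random


def learned_preference(ignore_interrupted, episodes=5000, seed=0):
    """Return the arm a simple sample-average learner ends up preferring.

    Hypothetical setup: arm "A" pays 1.0 unless interrupted (probability 0.5,
    reward 0.0); arm "B" always pays 0.7 and is never interrupted.
    """
    rng = random.Random(seed)
    totals = {"A": 0.0, "B": 0.0}
    counts = {"A": 0, "B": 0}
    for _ in range(episodes):
        arm = rng.choice(["A", "B"])  # uniform exploration over both arms
        interrupted = arm == "A" and rng.random() < 0.5
        if interrupted:
            reward = 0.0
        else:
            reward = 1.0 if arm == "A" else 0.7
        if interrupted and ignore_interrupted:
            # Do not let interrupted episodes shape the value estimates.
            continue
        totals[arm] += reward
        counts[arm] += 1
    means = {arm: totals[arm] / counts[arm] for arm in totals}
    return max(means, key=means.get)


if __name__ == "__main__":
    print(learned_preference(ignore_interrupted=False))
    print(learned_preference(ignore_interrupted=True))
```

A learner that treats interruptions as ordinary low-reward outcomes comes to prefer the arm that is never interrupted, which is a very weak analogue of "learning to avoid interruption". A learner that discards interrupted episodes keeps preferring the arm that is better when uninterrupted. Note that neither learner represents the interruption, the programmers, or itself, which is exactly why a demo like this says little about agents with big-picture awareness.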

