Omnipotence test for AI safety
Suppose your AI suddenly became omniscient and omnipotent—suddenly knew all facts and could directly ordain any outcome as a policy option. Would the executing AI code lead to bad outcomes in that case? If so, why did you write a program that in some sense ‘wanted’ to hurt you and was only held in check by lack of knowledge and capability? Isn’t that a bad way for you to configure computing power? Why not write different code instead?
The Omni Test is that an advanced AI should be expected to remain aligned, or not lead to catastrophic outcomes, or fail safely, even if it suddenly knows all facts and can directly ordain any possible outcome as an immediate choice. The policy proposal is that, among agents meant to act in the rich real world, any predicted behavior where the agent might act destructively if given unlimited power (rather than e.g. pausing for a safe user query) should be treated as a bug.
The Omni Test highlights any reasoning step on which we’ve presumed, in a non-failsafe way, that the agent must not obtain definite knowledge of some fact or that it must not have access to some strategic option. There are epistemic obstacles to our becoming extremely confident of our ability to lower-bound the reaction times or upper-bound the power of an advanced agent.
The deeper idea behind the Omni Test is that any predictable failure in an Omni scenario, or lack of assured reliability, exposes some more general flaw. Suppose NASA found that an alignment of four planets would cause their code to crash and a rocket’s engines to explode. They wouldn’t say, “Oh, we’re not expecting any alignment like that for the next hundred years, so we’re still safe.” They’d say, “Wow, that sure was a major bug in the program.” Correctly designed programs just shouldn’t explode the rocket, period. If any specific scenario exposes a behavior like that, it shows that some general case is not being handled correctly.
The omni-safe mindset says that, rather than trying to guess what facts an advanced agent can’t figure out or what strategic options it can’t have, we just shouldn’t make these guesses of ours load-bearing premises of an agent’s safety. Why design an agent that we expect will hurt us if it knows too much or can do too much?
For example, rather than design an AI that is meant to be monitored for unexpected power gains by programmers who can then press a pause button (which implicitly assumes that no capability gain can happen fast enough that a programmer wouldn’t have time to react), an omni-safe proposal would design the AI to detect unvetted capability gains and pause until the vetting had occurred. Even if it seemed improbable that some amount of cognitive power could be gained faster than the programmers could react, especially when no such previous sharp power gain had occurred even in the course of a day, etcetera, the omni-safe mindset says to just not build an agent that is unsafe when such background variables have ‘unreasonable’ settings. The correct general behavior is to, e.g., always pause when new capability has been acquired and a programmer has not yet indicated approval of its use. It might not be possible for an AGI design to suddenly use unlimited power optimally, or even use it in any safe way at all, but that’s still no excuse for building an omni-unsafe system, because it ought to be possible to detect that case, say “Something weird just happened!”, and suspend to disk.
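As a toy illustration of the "pause on unvetted capability" behavior described above, here is a minimal Python sketch. The names `PausingAgent`, `Suspended`, `approve`, and `step` are hypothetical and invented for this example. Modeling capabilities as a set of strings is a simplifying assumption; detecting capability gains in a real system is an open problem that this sketch does not address.

```python
from dataclasses import dataclass, field


class Suspended(Exception):
    """Raised when the agent halts to wait for programmer vetting (hypothetical)."""


@dataclass
class PausingAgent:
    # Capabilities a programmer has explicitly approved for use.
    vetted: set = field(default_factory=set)
    suspended: bool = False

    def approve(self, capability: str) -> None:
        """Programmer-side call: mark a capability as vetted."""
        self.vetted.add(capability)

    def step(self, available: set) -> str:
        """Run one decision step given the capabilities currently available."""
        unvetted = available - self.vetted
        if unvetted:
            # "Something weird just happened!": do not use the new power, stop.
            self.suspended = True
            raise Suspended(f"unvetted capabilities: {sorted(unvetted)}")
        self.suspended = False
        return "acting with vetted capabilities only"
```

The point of the design is that the check runs inside the agent on every step, so safety does not depend on a programmer noticing the gain and reacting in time. The pause is the default, and only explicit approval lifts it.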
Similarly, consider the paradigm of conservative planning. Rather than thinking in terms of blacklisting features of bad plans, we think in terms of whitelisting allowed plans using conservative generalizations. So long as we’re narrowly whitelisting rather than blacklisting, lots of new option space suddenly opening up shouldn’t result in any of those strange new options being taken until the users can whitelist more things.
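The difference between the two filtering styles can be made concrete with a small Python sketch. The function names and the example option strings are hypothetical. Treating plans as plain strings with exact-match membership is an assumption made only for illustration; real conservative planning would need to generalize over plan features rather than list plans by name.

```python
def blacklist_filter(options, forbidden):
    """Permit every option that is not explicitly forbidden."""
    return [o for o in options if o not in forbidden]


def whitelist_filter(options, allowed):
    """Permit only options that were explicitly allowed."""
    return [o for o in options if o in allowed]


forbidden = {"disable_off_switch"}
allowed = {"move_box", "query_user"}

# Option space before and after a sudden capability gain.
old_options = ["move_box", "query_user", "disable_off_switch"]
new_options = old_options + ["rewrite_own_code", "build_nanofactory"]

# Under blacklisting, the strange new options pass through unexamined.
blacklisted_now = blacklist_filter(new_options, forbidden)

# Under whitelisting, the permitted set is unchanged by the new options.
whitelisted_before = whitelist_filter(old_options, allowed)
whitelisted_now = whitelist_filter(new_options, allowed)
```

Here `whitelisted_now` equals `whitelisted_before`, while `blacklisted_now` picks up both new options. Under whitelisting the new options stay unused until a user adds them to `allowed`.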
Similarly, the omni-safe mindset says that any time the AI is running a cognitive search for a way to do something harmful or bypass one of our security measures, we’ve already seen a bug in the design, even if we strongly expect that search to come up empty. Why configure computing power in a way that will search for a way to hurt you? This is clearly the wrong thing to do with computing power; we, the programmers, should have our code do something else instead. This is part of the agenda motivating the study of Corrigibility.
Trying to guess an agent’s actual capabilities is a reasonable project, especially if there are more ‘adversarial’ fallback precautions that revolve around our estimates of the agent’s actual powers. E.g., having designed an agent that doesn’t want to gain unvetted communication channels with the outside world, we might then put the computing hardware into a Faraday cage and make sure there are no cellphone towers around—an ‘adversarial’ precaution meant to block communication even if the first line of defense fails and the AI starts wanting to communicate in an unvetted way. But ‘adversarial’ precautions are secondary lines of defense; a reasonable expectation of omni-safety is the first line of defense. First we assume that all adversarial fallback measures will fail, and design the agent to remain nonharmful or fail safely no matter what new capability or knowledge is gained. Then we assume the first line of defense has failed, and try, if it’s at all possible or realistic, to put up fallback measures that will prevent total catastrophe so long as the agent has realistic amounts of power and can’t violate what we think are ‘the laws of physics’ and so on.
- Non-adversarial principle
At no point in constructing an Artificial General Intelligence should we construct a computation that tries to hurt us, and then try to stop it from hurting us.