Non-adversarial principle

The ‘Non-Adversarial Principle’ is a proposed design rule for sufficiently advanced Artificial Intelligence stating that:

By design, the human operators and the AGI should never come into conflict.

Special cases of this principle include Niceness is the first line of defense and The AI wants your safety measures.

According to this principle, if the AI has an off-switch, our first thought should not be, “How do we have guards with guns defending this off-switch so the AI can’t destroy it?” but “How do we make sure the AI wants this off-switch to exist?”

If we think the AI is not ready to act on the Internet, our first thought should not be “How do we airgap the AI’s computers from the Internet?” but “How do we construct an AI that wouldn’t try to do anything on the Internet even if it got access?” Afterwards we may go ahead and still not connect the AI to the Internet, but only as a fallback measure. Like the containment shell of a nuclear power plant, the plan shouldn’t call for the fallback measure to ever become necessary. E.g., nuclear power plants have containment shells in case the core melts down. But this is not because we’re planning to have the core melt down on Tuesday and have that be okay because there’s a containment shell.

Why run code that does the wrong thing?

Ultimately, every event inside an AI—every RAM access and CPU instruction—is an event set in motion by our own design. Even if the AI is modifying its own code, the modified code is a causal outcome of the original code (or the code that code wrote etcetera). Everything that happens inside the computer is, in some sense, our fault and our choice. Given that responsibility, we should not be constructing a computation that is trying to hurt us. At the point that computation is running, we’ve already done something foolish—willfully shot ourselves in the foot. Even if the AI doesn’t find any way to do the bad thing, we are, at the very least, wasting computing power.

No aspect of the AI’s design should ever put us in an adversarial position vis-a-vis the AI, or pit the AI’s wits against our wits. If a computation starts looking for a way to outwit us, then the design and methodology has already failed. We just shouldn’t be putting an AI in a box and then having the AI search for ways to get out of the box. If you’re building a toaster, you don’t build one element that heats the toast and then add a tiny refrigerator that cools down the toast.

Not running searches for harmful strategies

Under the agents as searches viewpoint, we can see an AI as embodying computations that search out strategies within a domain. E.g., an AI carrying out the “put one damn strawberry on a damn plate” task is searching the space of imaginable strategies for a strategy such that the consequence of executing that strategy is one strawberry being on the plate.

The non-adversarial principle then says that we should at no point be running a computation which is searching for ways to hurt us.

Suppose you don’t want the AI to go on the Internet. If your first thought is airgapping the AI’s hardware from the Internet or otherwise boxing the AI, then what you’re trying to do is ensure that a computation searching for strategies to reach the Internet returns no result.

If the primary plan is “the AI searches, but doesn’t find anything”, then we need to worry that the AI’s search considers some class of strategies we didn’t imagine. We are pitting our own wits against a potentially weird or superhuman search. Even if we did successfully rule out all possibilities for escape, it would be hard for us to ever be confident that we had done so. It would be very hard to rule out the conceptual possibility of some entirely unexpected side-channel attack like the AI producing GSM cellphone signals by cleverly modulating its memory accesses, or the AI persuading its human operators to let it out of the box. (Except that neither of those two particular possibilities would be unexpected; any specific strategy we write down will be the kind of possibility that humans consider.)

Even if the AI’s search does come up empty, you were tempting fate and wasting computing power by instantiating that search in the first place.

Oppositional measures are fallbacks

After we ensure that the escape-search computation is not running in the first place, we can think about fallback oppositional lines of defense; e.g. putting the AI’s processors inside a Faraday cage.

But the plan is not that the AI tries to escape and we successfully prevent the escape. If we imagine the AI trying to escape, we ought to feel very surprised in our imagination. If the AI actually does try to escape, we should be crying ‘What?’ and tracking down which strong-seeming assumption has failed.

No matter what clever design we use or how strong it seems, we’d probably still be wise to airgap an AI under development, not say “Oh it’s supposed to be safe!” and connect it to the Internet and saunter on. But that doesn’t mean we rely on the airgap. It doesn’t mean the plan calls for the airgap to be necessary.

Nuclear power plants have concrete containment units in case the core melts down; but the design is not that they melt down. The master plan doesn’t say “And then on Tuesday the core melts down, but that’s fine because of the containment unit.” By design, that enormous concrete shell isn’t supposed to actually ever become necessary. And then we build it anyway, because the best-laid plans etcetera.

Similarly, when designing an AI, we should pretend that the airgap doesn’t exist or that the AI will suddenly get Internet access anyway on Tuesday; our primary thought should be to design AI that doesn’t need an airgap to be safe. And then we add the airgap, making sure that we’re not thinking the equivalent of “Oh, it doesn’t really matter if the core melts down, because we’ve got a containment structure there anyway.”

Challenges in implementing non-adversarialism

The main difficulties foreseen so far for implementing the non-adversarial principle, tend to center around Instrumental convergence plus Nearest unblocked strategy behavior.

For example, if you build a shutdown button for a Task AGI that suspends the AI to disk when pressed, the nonadversarial principle implies you must also ensure:

  • That the AI wants there to be a shutdown button;

  • That the AI wants to be suspended to disk after this button is pressed;

  • That the AI wants the state of this shutdown button to reflect the dynamic decisions of the human operators;

  • That the AI does not want to influence the operators to decide to not press the switch, or to press it;

  • That the AI does not want anything besides an orderly suspend-to-disk to happen, or not happen, after this button is pressed.


  • The AI does not think about or make plans involving the shutdown button, e.g. because that domain was not whitelisted for cognition.

  • None of the AI’s other models end up reflecting the existence of the shutdown button or none of its other plans end up taking into account that part of the model.

The difficulties here center around “You can’t fetch the coffee if you’re dead”. This reasoning is very general, so even if we try to make it not apply at one point, it tends to pop up somewhere else:

  • If you naively try to add in a special-case clause to the utility function for wanting a shutdown button to exist, the AI wants the shutdown button to not be pressed.

  • If you successfully add a special case saying that the AI wants the button to be pressed if the humans want that button to be pressed, the AI wants the humans to not want to press the button.

  • If you naively try to add in a special clause for the AI wanting to shut down after the button is pressed, the AI wants to create a subagent to make sure the coffee gets pressed anyway.

  • If you try to make an AI that doesn’t think about the shutdown button or model it at all, this seems potentially difficult because in reality the best hypothesis to explain the world does contain a shutdown button. A general search for good hypotheses may tend to create cognitive tokens that represent the shutdown button, and it’s not clear (yet) how this could in general be prevented by trying to divide the world into domains.

More generally: by default a lot of high-level searches we do want to run, have subsearches we’d prefer not to run. If we run an agent that searches in general for ways to fetch the coffee, that search would, by default and if smart enough, also search for ways to prevent itself from being shut down.

How exactly to implement the non-adversarial principle is thus a major open problem. We may need to be more clever about shaping which computations give rise to which other computations than the default “Search for any action in any domain which achieves X.”

See also


  • Omnipotence test for AI safety

    Would your AI produce disastrous outcomes if it suddenly gained omnipotence and omniscience? If so, why did you program something that wants to hurt you and is held back only by lacking the power?

  • Niceness is the first line of defense

    The first line of defense in dealing with any partially superhuman AI system advanced enough to possibly be dangerous is that it does not want to hurt you or defeat your safety measures.

  • Directing, vs. limiting, vs. opposing

    Getting the AI to compute the right action in a domain; versus getting the AI to not compute at all in an unsafe domain; versus trying to prevent the AI from acting successfully. (Prefer 1 & 2.)

  • The AI must tolerate your safety measures

    A corollary of the nonadversarial principle is that “The AI must tolerate your safety measures.”

  • Generalized principle of cognitive alignment

    When we’re asking how we want the AI to think about an alignment problem, one source of inspiration is trying to have the AI mirror our own thoughts about that problem.


  • Principles in AI alignment

    A ‘principle’ of AI alignment is a very general design goal like ‘understand what the heck is going on inside the AI’ that has informed a wide set of specific design proposals.