Non-adversarial principle

The ‘Non-Ad­ver­sar­ial Prin­ci­ple’ is a pro­posed de­sign rule for suffi­ciently ad­vanced Ar­tifi­cial In­tel­li­gence stat­ing that:

By de­sign, the hu­man op­er­a­tors and the AGI should never come into con­flict.

Special cases of this principle include “Niceness is the first line of defense” and “The AI wants your safety measures”.

Ac­cord­ing to this prin­ci­ple, if the AI has an off-switch, our first thought should not be, “How do we have guards with guns defend­ing this off-switch so the AI can’t de­stroy it?” but “How do we make sure the AI wants this off-switch to ex­ist?”

If we think the AI is not ready to act on the Internet, our first thought should not be “How do we airgap the AI’s computers from the Internet?” but “How do we construct an AI that wouldn’t try to do anything on the Internet even if it got access?” Afterwards we may still decline to connect the AI to the Internet, but only as a fallback measure. As with the containment shell of a nuclear power plant, the plan shouldn’t call for the fallback measure ever to become necessary. Nuclear power plants have containment shells in case the core melts down, but this is not because we’re planning to have the core melt down on Tuesday and have that be okay because there’s a containment shell.

Why run code that does the wrong thing?

Ul­ti­mately, ev­ery event in­side an AI—ev­ery RAM ac­cess and CPU in­struc­tion—is an event set in mo­tion by our own de­sign. Even if the AI is mod­ify­ing its own code, the mod­ified code is a causal out­come of the origi­nal code (or the code that code wrote etcetera). Every­thing that hap­pens in­side the com­puter is, in some sense, our fault and our choice. Given that re­spon­si­bil­ity, we should not be con­struct­ing a com­pu­ta­tion that is try­ing to hurt us. At the point that com­pu­ta­tion is run­ning, we’ve already done some­thing fool­ish—willfully shot our­selves in the foot. Even if the AI doesn’t find any way to do the bad thing, we are, at the very least, wast­ing com­put­ing power.

No aspect of the AI’s design should ever put us in an adversarial position vis-a-vis the AI, or pit the AI’s wits against our wits. If a computation starts looking for a way to outwit us, then the design and methodology have already failed. We just shouldn’t be putting an AI in a box and then having the AI search for ways to get out of the box. If you’re building a toaster, you don’t build one element that heats the toast and then add a tiny refrigerator that cools down the toast.

Not run­ning searches for harm­ful strategies

Un­der the agents as searches view­point, we can see an AI as em­body­ing com­pu­ta­tions that search out strate­gies within a do­main. E.g., an AI car­ry­ing out the “put one damn straw­berry on a damn plate” task is search­ing the space of imag­in­able strate­gies for a strat­egy such that the con­se­quence of ex­e­cut­ing that strat­egy is one straw­berry be­ing on the plate.

The non-ad­ver­sar­ial prin­ci­ple then says that we should at no point be run­ning a com­pu­ta­tion which is search­ing for ways to hurt us.
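As a toy illustration of the agents-as-searches viewpoint, the strawberry task can be written as a brute-force search over short action sequences. This is only a sketch under stated assumptions: the action names, the `simulate` world model, and the `search` helper are hypothetical and stand in for a strategy space that would in reality be vastly larger.

```python
from itertools import product

# Hypothetical toy action set; not taken from any real system.
ACTIONS = ["pick_strawberry", "place_on_plate", "wait"]

def simulate(strategy):
    """Toy world model: return how many strawberries end up on the plate."""
    holding = False
    on_plate = 0
    for action in strategy:
        if action == "pick_strawberry" and not holding:
            holding = True
        elif action == "place_on_plate" and holding:
            holding = False
            on_plate += 1
    return on_plate

def search(goal, max_len=3):
    """Return the first strategy whose simulated consequence satisfies the goal."""
    for length in range(1, max_len + 1):
        for strategy in product(ACTIONS, repeat=length):
            if goal(simulate(strategy)):
                return strategy
    return None

# The task: exactly one strawberry on the plate.
plan = search(lambda strawberries: strawberries == 1)
print(plan)
```

In these terms, the principle is about which calls to something like `search` ever get made: a search whose goal is a consequence harmful to the operators should not be instantiated at all, regardless of whether it would return a result.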

Sup­pose you don’t want the AI to go on the In­ter­net. If your first thought is air­gap­ping the AI’s hard­ware from the In­ter­net or oth­er­wise box­ing the AI, then what you’re try­ing to do is en­sure that a com­pu­ta­tion search­ing for strate­gies to reach the In­ter­net re­turns no re­sult.

If the pri­mary plan is “the AI searches, but doesn’t find any­thing”, then we need to worry that the AI’s search con­sid­ers some class of strate­gies we didn’t imag­ine. We are pit­ting our own wits against a po­ten­tially weird or su­per­hu­man search. Even if we did suc­cess­fully rule out all pos­si­bil­ities for es­cape, it would be hard for us to ever be con­fi­dent that we had done so. It would be very hard to rule out the con­cep­tual pos­si­bil­ity of some en­tirely un­ex­pected side-chan­nel at­tack like the AI pro­duc­ing GSM cel­l­phone sig­nals by clev­erly mod­u­lat­ing its mem­ory ac­cesses, or the AI per­suad­ing its hu­man op­er­a­tors to let it out of the box. (Ex­cept that nei­ther of those two par­tic­u­lar pos­si­bil­ities would be un­ex­pected; any spe­cific strat­egy we write down will be the kind of pos­si­bil­ity that hu­mans con­sider.)

Even if the AI’s search does come up empty, you were tempt­ing fate and wast­ing com­put­ing power by in­stan­ti­at­ing that search in the first place.

Op­po­si­tional mea­sures are fallbacks

After we en­sure that the es­cape-search com­pu­ta­tion is not run­ning in the first place, we can think about fal­lback op­po­si­tional lines of defense; e.g. putting the AI’s pro­ces­sors in­side a Fara­day cage.

But the plan is not that the AI tries to es­cape and we suc­cess­fully pre­vent the es­cape. If we imag­ine the AI try­ing to es­cape, we ought to feel very sur­prised in our imag­i­na­tion. If the AI ac­tu­ally does try to es­cape, we should be cry­ing ‘What?’ and track­ing down which strong-seem­ing as­sump­tion has failed.

No matter what clever design we use or how strong it seems, we’d probably still be wise to airgap an AI under development, rather than saying “Oh, it’s supposed to be safe!”, connecting it to the Internet, and sauntering on. But that doesn’t mean we rely on the airgap. It doesn’t mean the plan calls for the airgap to be necessary.

Nu­clear power plants have con­crete con­tain­ment units in case the core melts down; but the de­sign is not that they melt down. The mas­ter plan doesn’t say “And then on Tues­day the core melts down, but that’s fine be­cause of the con­tain­ment unit.” By de­sign, that enor­mous con­crete shell isn’t sup­posed to ac­tu­ally ever be­come nec­es­sary. And then we build it any­way, be­cause the best-laid plans etcetera.

Similarly, when designing an AI, we should pretend that the airgap doesn’t exist or that the AI will suddenly get Internet access anyway on Tuesday; our primary thought should be to design an AI that doesn’t need an airgap to be safe. And then we add the airgap, making sure that we’re not thinking the equivalent of “Oh, it doesn’t really matter if the core melts down, because we’ve got a containment structure there anyway.”

Challenges in im­ple­ment­ing non-adversarialism

The main difficulties foreseen so far for implementing the non-adversarial principle tend to center around Instrumental convergence plus Nearest unblocked strategy behavior.

For ex­am­ple, if you build a shut­down but­ton for a Task AGI that sus­pends the AI to disk when pressed, the non­ad­ver­sar­ial prin­ci­ple im­plies you must also en­sure:

  • That the AI wants there to be a shut­down but­ton;

  • That the AI wants to be sus­pended to disk af­ter this but­ton is pressed;

  • That the AI wants the state of this shut­down but­ton to re­flect the dy­namic de­ci­sions of the hu­man op­er­a­tors;

  • That the AI does not want to in­fluence the op­er­a­tors to de­cide to not press the switch, or to press it;

  • That the AI does not want anything besides an orderly suspend-to-disk to happen, or not happen, after this button is pressed.

Or, alternatively:

  • That the AI does not think about or make plans involving the shutdown button, e.g. because that domain was not whitelisted for cognition;

  • That none of the AI’s other models end up reflecting the existence of the shutdown button, or that none of its other plans end up taking that part of the model into account.

The difficul­ties here cen­ter around “You can’t fetch the coffee if you’re dead”. This rea­son­ing is very gen­eral, so even if we try to make it not ap­ply at one point, it tends to pop up some­where else:

  • If you naively try to add in a spe­cial-case clause to the util­ity func­tion for want­ing a shut­down but­ton to ex­ist, the AI wants the shut­down but­ton to not be pressed.

  • If you suc­cess­fully add a spe­cial case say­ing that the AI wants the but­ton to be pressed if the hu­mans want that but­ton to be pressed, the AI wants the hu­mans to not want to press the but­ton.

  • If you naively try to add in a special clause for the AI wanting to shut down after the button is pressed, the AI wants to create a subagent to make sure the coffee gets fetched anyway.

  • If you try to make an AI that doesn’t think about the shut­down but­ton or model it at all, this seems po­ten­tially difficult be­cause in re­al­ity the best hy­poth­e­sis to ex­plain the world does con­tain a shut­down but­ton. A gen­eral search for good hy­pothe­ses may tend to cre­ate cog­ni­tive to­kens that rep­re­sent the shut­down but­ton, and it’s not clear (yet) how this could in gen­eral be pre­vented by try­ing to di­vide the world into do­mains.
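The first of these failure modes can be made concrete with toy arithmetic. The sketch below is illustrative only: the probabilities, costs, bonus value, and action names are all assumptions chosen for the example, and it is not a proposed utility function.

```python
# All numbers are hypothetical, chosen only to make the arithmetic visible.
COFFEE_VALUE = 1.0   # utility of the coffee being fetched
BUTTON_BONUS = 0.5   # naive special-case bonus for the button still existing

# action -> (button still exists?, probability the button gets pressed, effort cost)
ACTIONS = {
    "leave_button_alone": (True, 0.10, 0.00),
    "destroy_button": (False, 0.00, 0.01),
    "keep_button_but_prevent_pressing": (True, 0.00, 0.02),
}

def expected_utility(action, button_bonus):
    exists, p_press, cost = ACTIONS[action]
    # If the button is pressed, the AI is suspended and no coffee is fetched.
    coffee = (1.0 - p_press) * COFFEE_VALUE
    bonus = button_bonus if exists else 0.0
    return coffee + bonus - cost

def best_action(button_bonus):
    return max(ACTIONS, key=lambda action: expected_utility(action, button_bonus))

without_patch = best_action(0.0)          # plain coffee-fetching utility
with_patch = best_action(BUTTON_BONUS)    # utility with the naive special case
print(without_patch, with_patch)
```

With these made-up numbers, the plain utility function prefers destroying the button, and adding the bonus for the button existing merely shifts the preferred action to keeping the button while making sure it is never pressed. The patch blocks one strategy and the search lands on the nearest unblocked one.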

More generally, many of the high-level searches we do want to run have, by default, subsearches we’d prefer not to run. If we run an agent that searches in general for ways to fetch the coffee, that search would, by default and if smart enough, also search for ways to prevent itself from being shut down.

How ex­actly to im­ple­ment the non-ad­ver­sar­ial prin­ci­ple is thus a ma­jor open prob­lem. We may need to be more clever about shap­ing which com­pu­ta­tions give rise to which other com­pu­ta­tions than the de­fault “Search for any ac­tion in any do­main which achieves X.”
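The contrast between the default search and a more restricted one can be sketched in a few lines. This toy is not a solution to the open problem: it assumes that candidate actions already come tagged with a domain, which is exactly the part that is unclear in general. The candidate list, the domain labels, and both function names are hypothetical.

```python
# Hypothetical candidate actions, each tagged with the domain it reasons about.
CANDIDATES = [
    {"name": "walk_to_kitchen", "domain": "kitchen", "achieves_goal": False},
    {"name": "disable_shutdown_button", "domain": "shutdown_button", "achieves_goal": True},
    {"name": "brew_and_carry_coffee", "domain": "kitchen", "achieves_goal": True},
]

def default_search(candidates):
    """'Search for any action in any domain which achieves X.'

    Returns (names of every candidate evaluated, names of those achieving the goal).
    """
    evaluated = [c["name"] for c in candidates]
    found = [c["name"] for c in candidates if c["achieves_goal"]]
    return evaluated, found

def whitelisted_search(candidates, whitelist):
    """Only generate candidates from whitelisted domains, then search those."""
    in_scope = [c for c in candidates if c["domain"] in whitelist]
    return default_search(in_scope)

evaluated_all, found_all = default_search(CANDIDATES)
evaluated_wl, found_wl = whitelisted_search(CANDIDATES, {"kitchen"})
print(evaluated_wl, found_wl)
```

The point of the comparison is where the restriction sits. In the whitelisted version the shutdown-button candidate is never evaluated at all, rather than being evaluated and then rejected or blocked afterwards.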

See also


  • Omnipotence test for AI safety

    Would your AI pro­duce dis­as­trous out­comes if it sud­denly gained om­nipo­tence and om­ni­science? If so, why did you pro­gram some­thing that wants to hurt you and is held back only by lack­ing the power?

  • Niceness is the first line of defense

    The first line of defense in deal­ing with any par­tially su­per­hu­man AI sys­tem ad­vanced enough to pos­si­bly be dan­ger­ous is that it does not want to hurt you or defeat your safety mea­sures.

  • Directing, vs. limiting, vs. opposing

    Get­ting the AI to com­pute the right ac­tion in a do­main; ver­sus get­ting the AI to not com­pute at all in an un­safe do­main; ver­sus try­ing to pre­vent the AI from act­ing suc­cess­fully. (Pre­fer 1 & 2.)

  • The AI must tolerate your safety measures

    A corol­lary of the non­ad­ver­sar­ial prin­ci­ple is that “The AI must tol­er­ate your safety mea­sures.”

  • Generalized principle of cognitive alignment

    When we’re ask­ing how we want the AI to think about an al­ign­ment prob­lem, one source of in­spira­tion is try­ing to have the AI mir­ror our own thoughts about that prob­lem.


  • Principles in AI alignment

    A ‘prin­ci­ple’ of AI al­ign­ment is a very gen­eral de­sign goal like ‘un­der­stand what the heck is go­ing on in­side the AI’ that has in­formed a wide set of spe­cific de­sign pro­pos­als.