Directing, vs. limiting, vs. opposing

‘Directing’ versus ‘limiting’ versus ‘opposing’ is a proposed conceptual distinction between three ways of getting good outcomes and avoiding bad outcomes, when running a sufficiently advanced Artificial Intelligence:

  • Direction means the AGI wants to do the right thing in a domain;

  • Limitation means the AGI does not think or act in domains where it is not aligned;

  • Opposition means we try to prevent the AGI from successfully doing the wrong thing, assuming that it would act wrongly given the power to do so.

For example:

  • A successfully directed AI, given full Internet access, will use that access to do beneficial rather than detrimental things, because it wants to do good and understands sufficiently well which actions have good or bad outcomes;

  • A limited AI, suddenly given an Internet feed, will not do anything with that Internet access, because its programmers haven’t whitelisted this new domain as okay to think about;

  • Opposition is airgapping the AI from the Internet and then putting the AI’s processors inside a Faraday cage, in the hope that even if the AI wants to get to the Internet, the AI won’t be able to produce GSM cellphone signals by modulating its memory accesses.

A fourth category not reducible to the other three might be stabilizing, e.g. numerical stability of floating-point algorithms, not having memory leaks in the code, etcetera. These are issues that a sufficiently advanced AI would fix in itself automatically, but an insufficiently advanced AI might not, which causes problems either if early errors introduce changes that are reflectively stable later, or if we are intending to run the AI in insufficiently-advanced mode.
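
As a concrete illustration of the ‘stabilizing’ category (a toy numerical example, nothing specific to AGI): naive floating-point summation silently loses small terms sitting next to a large one, while a compensated method such as Kahan summation, or Python’s built-in math.fsum, does not. The sketch below uses only the standard library:

```python
# Toy illustration of a floating-point stability issue; nothing here is AGI-specific.
import math

def naive_sum(xs):
    """Straightforward accumulation; small terms can be lost to rounding."""
    total = 0.0
    for x in xs:
        total += x
    return total

def kahan_sum(xs):
    """Kahan compensated summation: track the rounding error and feed it back in."""
    total = 0.0
    c = 0.0  # running compensation for lost low-order bits
    for x in xs:
        y = x - c
        t = total + y
        c = (t - total) - y
        total = t
    return total

# One large value plus a thousand small ones: naively, the small ones vanish.
xs = [1e16] + [1.0] * 1000
print(naive_sum(xs))   # prints 1e+16: the thousand 1.0s are rounded away
print(kahan_sum(xs))   # 1.0000000000001e+16
print(math.fsum(xs))   # 1.0000000000001e+16 (correctly rounded)
```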

Not running adversarial searches

Under the ‘agents as searches’ view, we can see an agent as embodying computations that search out actions within a domain. The non-adversarial principle then says that we should not be running a search which is looking for ways to hurt us.

We can achieve this desideratum in one of two ways:

  • The AI is sufficiently nice / value-aligned that it’s okay for the AI to think about this domain / run searches here; whatever the AI is searching for, we don’t mind if that search succeeds.

  • The AI is not aligned in this domain and therefore should not be running searches here—we didn’t finish the alignment here, and therefore, we shouldn’t be running this computation.

If both of these measures fail, then as a fallback second line of defense, we could try to make a bad search return a null result (the AI doesn’t find any way to do the harmful thing).
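
A minimal sketch of this gating, purely for illustration: the Domain fields and maybe_search are invented names rather than any real API, and the genuinely hard part (knowing whether the AI really is aligned in a domain) is assumed away as a boolean flag.

```python
# Illustrative sketch only: Domain, its flags, and maybe_search are invented
# names for this example, not a real API; whether alignment actually holds in
# a domain is assumed away as a boolean.

from dataclasses import dataclass
from typing import Any, Callable, Optional

@dataclass
class Domain:
    name: str
    whitelisted: bool          # limitation: programmers have okayed thinking here
    alignment_verified: bool   # direction: we trust searches here to be pointed right

def maybe_search(domain: Domain, search: Callable[[], Optional[Any]]) -> Optional[Any]:
    """Run a search only in domains that are both whitelisted and believed aligned."""
    if not domain.whitelisted:
        # Limitation: don't run computations at all in domains we haven't okayed.
        return None
    if not domain.alignment_verified:
        # We didn't finish alignment here, so we shouldn't be running this search.
        return None
    # Direction: whatever this search is looking for, we don't mind if it succeeds.
    # (Fallback, not shown: constrain the action space so that a misdirected
    # search comes back empty rather than harmful.)
    return search()
```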

For example: The non-adversarial principle implies that if you build a shutdown button for the AI that suspends the AI to disk when pressed, you must also ensure:

  • That the AI wants there to be a shutdown button;

  • That the AI wants the state of this shutdown switch to reflect the dynamic decisions of the human operators;

  • That the AI does not want to influence the operators to press the switch or not press it;

  • That the AI does not want to cause anything besides an orderly suspend-to-disk to happen after this switch is pressed.

Or:

  • That the AI doesn’t want to think about the shutdown button at all (e.g. because it’s not inside a whitelisted cognitive domain).

The first set of measures would be ‘direction’—if those measures work, it’s okay for the AI to think here because we think those thoughts are pointed in the right direction. The second approach would be limitation—we don’t think the AI’s computations are aligned inside this domain yet, so until then, we shouldn’t be running computations there.

As a final, strictly fallback line of defense, you could put a big red lever in the computing center that shuts off all the electricity even if the AI is failing to suspend to disk. (But you’re not relying on this working; it’s just there as a maybe-it-might-work measure that might buy you a little more probable safety even if the primary lines of defense failed.)
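
For concreteness, the purely mechanical half of such a shutdown button, the orderly suspend-to-disk itself, might look like the sketch below. This is the easy part; the bullet points earlier concern the AI’s preferences about the button, which no amount of signal-handling code addresses. The checkpoint path, AgentState, and the use of SIGTERM as the ‘switch’ are all invented for illustration:

```python
# Sketch of the mechanical side only: an orderly suspend-to-disk when the
# shutdown switch is pressed. The checkpoint path, AgentState, and the use
# of SIGTERM as the switch are invented for this illustration.

import pickle
import signal
import sys
import time

CHECKPOINT_PATH = "agent_checkpoint.pkl"   # hypothetical location

class AgentState:
    """Stand-in for whatever state the running agent would need in order to resume."""
    def __init__(self):
        self.step = 0

state = AgentState()

def orderly_suspend(signum, frame):
    """Write the agent's state to disk and stop; nothing else is triggered."""
    with open(CHECKPOINT_PATH, "wb") as f:
        pickle.dump(state, f)
    sys.exit(0)

# Treat SIGTERM as the operators' shutdown switch.
signal.signal(signal.SIGTERM, orderly_suspend)

if __name__ == "__main__":
    while True:               # placeholder for the agent's main loop
        state.step += 1
        time.sleep(0.1)
```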

Relation to other non-adversarial ideas

The direction/limitation/opposition distinction can help state other ideas from the AI safety mindset. For example:

The principle that ‘niceness is the first line of defense’ can be rephrased as follows: When designing an AGI, we should imagine that all ‘oppositional’ measures are absent or failed, and think only about ‘direction’ and ‘limitation’. Any oppositional measures are then added on top of that, just in case.

Similarly, the Omnipotence test for AI safety says that when thinking through our primary design for alignment, we should think as if the AGI will just get Internet access on some random Tuesday. This says that we should design an AGI that is limited by not wanting to act in newly opened domains without some programmer action, rather than relying on the AI to be unable to reach the Internet until we’ve finished aligning it.

Parents:

  • AI safety mindset

    Asking how AI designs could go wrong, instead of imagining them going right.

  • Non-adversarial principle

    At no point in constructing an Artificial General Intelligence should we construct a computation that tries to hurt us, and then try to stop it from hurting us.