Omnipotence test for AI safety

Suppose your AI suddenly became omniscient and omnipotent—suddenly knew all facts and could directly ordain any outcome as a policy option. Would executing the AI's code lead to bad outcomes in that case? If so, why did you write a program that in some sense 'wanted' to hurt you and was only held in check by lack of knowledge and capability? Isn't that a bad way for you to configure computing power? Why not write different code instead?

The Omni Test says that an advanced AI should be expected to remain aligned, or at least not lead to catastrophic outcomes, or fail safely, even if it suddenly knows all facts and can directly ordain any possible outcome as an immediate choice. The policy proposal is that, among agents meant to act in the rich real world, any predicted behavior where the agent might act destructively if given unlimited power (rather than, e.g., pausing for a safe user query) should be treated as a bug.

Safety mindset

The Omni Test highlights any reasoning step on which we've presumed, in a non-failsafe way, that the agent must not obtain definite knowledge of some fact or that it must not have access to some strategic option. There are epistemic obstacles to our becoming extremely confident of our ability to lower-bound the reaction times or upper-bound the power of an advanced agent.

The deeper idea behind the Omni Test is that any predictable failure in an Omni scenario, or lack of assured reliability, exposes some more general flaw. Suppose NASA found that an alignment of four planets would cause their code to crash and a rocket's engines to explode. They wouldn't say, "Oh, we're not expecting any alignment like that for the next hundred years, so we're still safe." They'd say, "Wow, that sure was a major bug in the program." Correctly designed programs just shouldn't explode the rocket, period. If any specific scenario exposes a behavior like that, it shows that some general case is not being handled correctly.

The omni-safe mindset says that, rather than trying to guess what facts an advanced agent can't figure out or what strategic options it can't have, we just shouldn't make these guesses of ours load-bearing premises of an agent's safety. Why design an agent that we expect will hurt us if it knows too much or can do too much?

For example, rather than design an AI that is meant to be monitored for unexpected power gains by programmers who can then press a pause button—which implicitly assumes that no capability gain can happen fast enough that a programmer wouldn't have time to react—an omni-safe proposal would design the AI to detect unvetted capability gains and pause until the vetting had occurred. Even if it seemed improbable that some amount of cognitive power could be gained faster than the programmers could react, especially when no such previous sharp power gain had occurred even in the course of a day, etcetera, the omni-safe mindset says to just not build an agent that is unsafe when such background variables have 'unreasonable' settings. The correct general behavior is to, e.g., always pause when new capability has been acquired and a programmer has not yet indicated approval of its use. It might not be possible for an AGI design to suddenly use unlimited power optimally, or even use it in any safe way at all, but that's still no excuse for building an omni-unsafe system, because it ought to be possible to detect that case, say "Something weird just happened!", and suspend to disk.
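To make the "detect unvetted gains and pause" behavior concrete, here is a minimal Python sketch. Everything in it (the Capability record, the approval step, the suspend_to_disk checkpoint) is a hypothetical illustration of the policy described above, not a claim about how any real system implements it.

```python
# Toy sketch of "pause on unvetted capability gain"; all names are hypothetical.
import pickle
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Capability:
    """A coarse description of something the agent can now do."""
    name: str


@dataclass
class OmniSafeAgent:
    # Capabilities the programmers have explicitly approved for use.
    vetted: set = field(default_factory=set)
    # Capabilities the agent has detected in itself but may not yet use.
    detected: set = field(default_factory=set)
    suspended: bool = False

    def on_capability_detected(self, cap: Capability) -> None:
        """Called whenever self-monitoring notices a new capability."""
        self.detected.add(cap)
        if cap not in self.vetted:
            # Something weird just happened: don't try to use the new power
            # "optimally"; the safe general behavior is to stop and wait.
            self.suspend_to_disk(reason=f"unvetted capability: {cap.name}")

    def approve(self, cap: Capability) -> None:
        """Programmers indicate approval; only then may the capability be used."""
        self.vetted.add(cap)

    def may_use(self, cap: Capability) -> bool:
        return not self.suspended and cap in self.vetted

    def suspend_to_disk(self, reason: str) -> None:
        """Checkpoint state and halt until programmers intervene."""
        self.suspended = True
        with open("agent_checkpoint.pkl", "wb") as f:
            pickle.dump({"reason": reason,
                         "vetted": self.vetted,
                         "detected": self.detected}, f)
```

The relevant property is that the check never consults an estimate of how quickly capabilities can appear: any unvetted gain, however improbable, lands in the same pause-and-wait branch.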

Similarly, consider the paradigm of conservative planning. Rather than thinking in terms of blacklisting features of bad plans, we think in terms of whitelisting allowed plans using conservative generalizations. So long as we're narrowly whitelisting rather than blacklisting, lots of new option space suddenly opening up shouldn't result in any of those strange new options being taken until the users can whitelist more things.
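A toy sketch of the whitelisting direction, in Python: the Plan and Pattern types and the choose_plan helper are invented here for illustration, and a real conservative planner would be far more involved.

```python
# Toy sketch of whitelisting allowed plans rather than blacklisting bad ones.
from typing import Callable, Iterable, List, Optional

Plan = str                          # placeholder type for illustration
Pattern = Callable[[Plan], bool]    # a conservative generalization users approved


def choose_plan(candidates: Iterable[Plan],
                whitelist: List[Pattern]) -> Optional[Plan]:
    """Return the first candidate covered by some whitelisted pattern.

    Options that appear when the option space suddenly expands are simply
    never selected until the users add a pattern covering them; there is no
    attempt to enumerate and blacklist every bad plan.
    """
    for plan in candidates:
        if any(pattern(plan) for pattern in whitelist):
            return plan
    # No approved option: fail safely by doing nothing and asking the users.
    return None
```

Note the asymmetry: when the candidate set suddenly grows, the default outcome is "no plan chosen", not "whatever slipped past the blacklist".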

Similarly, the omni-safe mindset says that any time the AI is running a cognitive search for a way to do something harmful or bypass one of our security measures, we've already seen a bug in the design, even if we strongly expect that search to come up empty. Why configure computing power in a way that will search for a way to hurt you? This is clearly the wrong thing to do with computing power; we, the programmers, should have our code do something else instead. This is part of the agenda motivating the study of Corrigibility.
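One toy way to operationalize "that search result is itself a bug", sketched in Python under the assumption of a hypothetical violates_security predicate: the planner raises a loud error the moment its own search produces a security-bypassing candidate, rather than quietly filtering it out and carrying on.

```python
# Toy sketch: a security-bypassing candidate is a bug report, not a filtered result.
class OmniSafetyBug(RuntimeError):
    """Raised when the agent's own search turns up a security bypass."""


def search_for_plan(candidates, goal_satisfied, violates_security):
    for plan in candidates:
        if violates_security(plan):
            # Even if we expected this search to come up empty, finding such a
            # plan means the wider design is misconfigured; surface it loudly
            # instead of silently discarding it and continuing.
            raise OmniSafetyBug(f"search generated a security-bypassing plan: {plan!r}")
        if goal_satisfied(plan):
            return plan
    return None
```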

Trying to guess an agent's actual capabilities is a reasonable project, especially if there are more 'adversarial' fallback precautions that revolve around our estimates of the agent's actual powers. E.g., having designed an agent that doesn't want to gain unvetted communication channels with the outside world, we might then put the computing hardware into a Faraday cage and make sure there are no cellphone towers around—an 'adversarial' precaution meant to block communication even if the first line of defense fails and the AI starts wanting to communicate in an unvetted way. But 'adversarial' precautions are secondary lines of defense; a reasonable expectation of omni-safety is the first line of defense. First we assume that all adversarial fallback measures will fail, and design the agent to remain nonharmful or fail safely no matter what new capability or knowledge is gained. Then we assume the first line of defense has failed, and try, if it's at all possible or realistic, to put up fallback measures that will prevent total catastrophe so long as the agent has realistic amounts of power and can't violate what we think are 'the laws of physics' and so on.

Parents:

  • Non-adversarial principle

    At no point in constructing an Artificial General Intelligence should we construct a computation that tries to hurt us, and then try to stop it from hurting us.