Zermelo-Fraenkel provability oracle

The Zer­melo-Fraenkel prov­abil­ity or­a­cle is the end­point of the fol­low­ing con­ver­sa­tion:

Alice: Let’s build an AI that only an­swers ques­tions. That sounds pretty safe to me. In fact, I don’t un­der­stand why all you lot keep talk­ing about build­ing AIs that don’t just an­swer ques­tions, when the ques­tion-an­swer­ing sys­tems sound so much safer to me.

Bob: Sup­pose you screw up the goal sys­tem and the AI starts want­ing to do some­thing be­sides an­swer ques­tions? This sys­tem is only as safe as your abil­ity to make an AI with a sta­ble goal sys­tem that ac­tu­ally im­ple­ments your in­tended goal of “not do­ing any­thing ex­cept an­swer­ing ques­tions”, which is difficult to for­mal­ize.

Alice: But surely it’s safer? Surely it’s eas­ier to de­velop a goal sys­tem that just safely an­swers ques­tions?

Bob: How much safer? How much eas­ier? If we hy­poth­e­size re­strict­ing the abil­ities of the AI in this way, we’re re­duc­ing the po­ten­tial pay­off if we ac­tu­ally suc­ceed. An AI that can act in the world and build its own nan­otech­nol­ogy can po­ten­tially pro­tect us from other AIs be­com­ing su­per­in­tel­li­gent, at least as a tem­po­rary fix to the im­me­di­ate dilemma and the im­me­di­ate dan­ger of death, if not as an eter­nal solu­tion. An AI that can act in the real world us­ing nan­otech­nol­ogy would be, in our par­lance, a pivotal achieve­ment. Get­ting from a ques­tion-an­swer­ing AI to a safer world seems a lot harder, un­less the ques­tion you ask the AI is “How to do I build an in­tel­li­gent, di­rected nan­otech­nolog­i­cal sys­tem?” in which case you might as not well bother pre­tend­ing it’s an Or­a­cle. So we’re los­ing some­thing by sup­pos­ing that the sys­tem only an­swers ques­tions—we’re los­ing some of the pivotal-ness of the hy­po­thet­i­cal suc­cess. If you’ll par­don the ex­treme ex­am­ple, if the Or­a­cle sys­tem is only 0.01% safer or 99.9% as difficult to build as a Ge­nie, then it definitely wouldn’t be worth the de­creased prob­a­bil­ity of be­ing able to win the larger dilemma us­ing a suc­cess­ful Or­a­cle in­stead of a suc­cess­ful Ge­nie. So the ques­tion is, how use­ful is the Or­a­cle com­pared to the Ge­nie, and given that you’re go­ing to use it that way, how much safer is it, or eas­ier to build, com­pared to the Ge­nie? An “Or­a­cle” that im­me­di­ately writes the code for a Ge­nie is equally use­ful, but it’s definitely not any eas­ier to build safely.

Alice: But there are sub­stan­tial gains in safety, or in ease of build­ing the sys­tem to ad­e­quate safety, from adopt­ing a ques­tion-an­swer­ing de­sign. Like, imag­ine that the com­puter is in a sealed cham­ber on the Moon, with­out any at­tached robots or robot arms, and just a sin­gle ra­dio down­link to send the an­swers to the ques­tions we ask. Then we’ll be safe even if the goal de­sign fails and it turns against us, be­cause, you know, it’ll be stuck on the Moon with­out any hands. To me this ob­vi­ously seems a lot safer than, say, a Ge­nie run­ning around in the world with its own nan­otech­nol­ogy. Less use­ful as­sum­ing it works, yes, but also a lot less dan­ger­ous as­sum­ing it fails.

Bob: Are you sure you’re prop­erly re­spect­ing the in­tel­li­gence of some­thing vastly smarter than you? We’re not smart enough to con­sider all the strate­gies it might imag­ine. What if there’s some cre­ative thing you can do even if you’re in a sealed cham­ber on the Moon? There are ra­dio waves leav­ing the cham­ber that de­pend on the flows of elec­trons in tran­sis­tors -

Alice: Okay, a sealed cham­ber and we’re run­ning it in three lay­ers of sand­boxed simu­la­tion and we haven’t told the AI about the hard­ware it’s run­ning on or about any of the em­u­la­tion lay­ers. I think it’d be pretty hard for the AI to ma­nipu­late the ra­dio waves in that con­di­tion. It just wouldn’t be able to figure out a re­la­tion be­tween any­thing it could con­trol, and the ra­dio waves go­ing out, and even if it could mod­u­late the ra­dio waves some tiny amount, do you re­ally think that could some­how end the world?

finish the conversation

Work in progress

bob: it can ma­nipu­late the human

al­ice: we won’t tell it about humans

bob: it may be able to de­duce an awful lot by look­ing at its own code.

al­ice: okay it gives a bare an­swer, like just a few bits.

bob: timing problem

al­ice: time it, ob­vi­ously.

bob: what the heck do you think you can do with just a few bits that’s in­ter­est­ing?

al­ice: we’ll ask it which of two poli­cies is the bet­ter policy for get­ting the world out of the AI hole.

bob: but how can you trust it?

al­ice: it can’t co­or­di­nate with fu­ture AIs, but hu­mans are trust­wor­thy be­cause we’re hon­or­able.

bob: cog­ni­tive un­con­tain­abil­ity zaps you! it can to­tally co­or­di­nate. we have proof thereof. so any ques­tion it an­swers could just be steer­ing the fu­ture to­ward a hos­tile AI. you can’t trust it if you screwed up the util­ity func­tion.

al­ice: okay fine, it out­puts proofs of pre­de­ter­mined the­o­rems in ZF that go into a ver­ifier which out­puts 0 or 1 and is then de­stroyed by ther­mite.

bob: I agree that this com­plete sys­tem might ac­tu­ally work.

al­ice: yay!

bob: but un­for­tu­nately it’s now use­less.

al­ice: use­less? we could ask it if the Rie­mann Hy­poth­e­sis is prov­able in ZF -

bob: and how is that go­ing to save the world, ex­actly?

al­ice: it could gen­er­ally ad­vance math progress.

bob: and now we’re back at the “con­tribut­ing one more thingy to a big pool” that peo­ple do when they’re try­ing to pro­mote their non-pivotal ex­cit­ing fun thing to pivotal­ness.

al­ice: that’s not fair. it could ad­vance math a lot. be­yond what hu­mans could do for two hun­dred years.

bob: but the kind of math we do at MIRI frankly doesn’t de­pend on whether RH is true. it’s very rare that we know ex­actly what the­o­rem we might want to prove, and we don’t re­ally know whether it’s a con­se­quence of ZF, and if we do know, we can move on. that’s ba­si­cally never hap­pened in our whole his­tory, ex­cept for maybe 7 days at a time. if we’d had your magic box the whole time, there are cer­tain fields of math that would be decades ahead, but our field would be maybe a week or a month ahead. most of our work is in think­ing up the the­o­rems, not prov­ing them.

al­ice: so what if it could sug­gest the­o­rems to prove? like show you the form of a tiling agent?

bob: if it’s show­ing us the out­line of a frame­work for AI and prov­ing some­thing about it, I’m now very wor­ried that it’s go­ing to man­age to smug­gle in some deadly trick of de­ci­sion the­ory that lets it take over some­how. we’ve found all sorts of gotchas that could po­ten­tially do that, like Pas­cal’s Mug­ging, prob­a­ble en­vi­ron­ment hack­ing, black­mail… by un­re­strict­ing it to be able to sug­gest which the­o­rems it can prove, you’ve given it a rich out­put chan­nel and a wide range of op­tions for in­fluenc­ing fu­ture AI de­sign.


  • sup­ports method­ol­ogy of fore­see­able difficul­ties and its rule of ex­plic­i­t­i­za­tion and cri­tique: when some­thing looks easy or safe to some­one, it of­ten isn’t.

  • in­tu­itive no­tions of what sort of re­stricted or limited AI should be “much eas­ier to build” are of­ten wrong, and this screws up the first in­tu­ition of how much safety we’re gain­ing in ex­change for pu­ta­tively re­lin­quish­ing some capabilities

  • by the time you cor­rect this bias, the amount of ca­pa­bil­ity you ac­tu­ally have to re­lin­quish in or or­der to get a bunch of safety and de­vel­op­ment ease, of­ten makes the pu­ta­tive AI not have any ob­vi­ous use

  • it is wise to be ex­plicit about how you in­tend to use the re­stricted AI to save the world, be­cause try­ing to think of con­crete de­tails here will of­ten re­veal that any­thing you think of ei­ther (1) isn’t ac­tu­ally very safe or (2) isn’t ac­tu­ally very use­ful.


  • Oracle

    Sys­tem de­signed to safely an­swer ques­tions.

  • Boxed AI

    Idea: what if we limit how AI can in­ter­act with the world. That’ll make it safe, right??