Boxed AI

AI-boxing is the theory dealing with machine intelligences that are allegedly safer because their manipulable channels of causal interaction with the outside universe are allegedly extremely restricted.

AI-boxing theory includes:

  • The straightforward problem of building elaborate sandboxes (computers and simulation environments designed not to have any manipulable channels of causal interaction with the outside universe).

  • Foreseeable difficulties whereby the remaining, limited channels of interaction may be exploited to manipulate the outside universe, especially the human operators.

  • The attempt to design preference frameworks that are not incentivized to go outside the Box, not incentivized to manipulate the outside universe or human operators, and incentivized to answer questions accurately or perform whatever other activity is intended to be performed inside the box.

The central difficulty of AI boxing is to describe a channel which cannot be used to manipulate the human operators, but which provides information relevant enough to be pivotal or game-changing relative to larger events. For example, it seems not unthinkable that we could safely extract, from a boxed AI setup, reliable information that prespecified theorems had been proved within Zermelo-Fraenkel set theory; but there is no known way to save the world given only the occasional knowledge that prespecified theorems have been reliably proven in Zermelo-Fraenkel set theory.


  • Zermelo-Fraenkel provability oracle

    We might be able to build a system that can safely inform us that a theorem has a proof in set theory, but we can’t see how to use that capability to save the world.


  • Task-directed AGI

    An advanced AI that’s meant to pursue a series of limited-scope goals given it by the user. In Bostrom’s terminology, a Genie.
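
The provability-oracle idea above can be made concrete with a minimal sketch. Here a toy Hilbert-style propositional system stands in for Zermelo-Fraenkel set theory, and the names `verify_proof` and `oracle_bit` are purely illustrative assumptions, not an actual proposal: the untrusted AI submits a proof certificate, and a small trusted verifier releases only a single prespecified bit (proved or not proved), so the output channel carries almost no information that could be used for manipulation.

```python
# Hypothetical sketch of a one-bit provability oracle. A toy propositional
# system (axioms plus modus ponens) stands in for ZF set theory; formulas are
# strings or ('->', a, b) tuples. All names here are illustrative assumptions.

def verify_proof(axioms, proof, theorem):
    """Return True iff `proof` derives `theorem` from `axioms`.

    Each proof step must either be an axiom or follow by modus ponens
    from two earlier steps: from p and ('->', p, q), conclude q.
    """
    derived = set()
    for step in proof:
        ok = step in axioms
        if not ok:
            # Try modus ponens against all earlier derived formulas.
            for p in derived:
                if ('->', p, step) in derived:
                    ok = True
                    break
        if not ok:
            return False  # one invalid step rejects the whole certificate
        derived.add(step)
    return theorem in derived

def oracle_bit(axioms, untrusted_proof, theorem):
    # The ONLY information leaving the box: one prespecified bit.
    return verify_proof(axioms, untrusted_proof, theorem)

# The human operators fix the theorem in advance; the boxed AI supplies
# the proof certificate; the trusted verifier emits the single bit.
axioms = {'A', ('->', 'A', 'B')}
proof = ['A', ('->', 'A', 'B'), 'B']
print(oracle_bit(axioms, proof, 'B'))  # prints: True
```

The design point is that the verifier, not the AI, controls the output channel: however clever the submitted certificate, the operators observe only whether a theorem they chose beforehand checks out.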