Safe plan identification and verification

Safe plan identification is the problem of giving a Task AGI training cases, answered queries, abstract instructions, etcetera, such that (a) the AGI can thereby identify outcomes in which the task was fulfilled, (b) the AGI can generate an okay plan for reaching some such outcome without bad side effects, and (c) the user can verify that the resulting plan is actually okay via some series of further queries. This is the superproblem that includes task identification, as much value identification as is needed to convey the general class of post-task worlds the user considers okay, and any further tweaks such as low-impact planning or flagging inductive ambiguities. This superproblem is distinguished from the entire problem of building a Task AGI, which includes further issues like corrigibility, behaviorism, building the AGI in the first place, etcetera. The safe plan identification superproblem is about communicating the task, plus user preferences about side effects and implementation, such that this information allows the AGI to identify a safe plan and allows the user to know that a safe plan has been identified.


  • Do-What-I-Mean hierarchy

    Successive levels of “Do What I Mean”, or AGIs that understand their users increasingly well


  • Task-directed AGI

    An advanced AI that’s meant to pursue a series of limited-scope goals given it by the user. In Bostrom’s terminology, a Genie.