Relevant limited AI

It is an open problem to propose a limited AI that would be relevant to the value achievement dilemma: an agent cognitively constrained along some dimensions that render it much safer, but still able to perform some task useful enough to prevent catastrophe.

Basic difficulty

Consider an Oracle AI that is so constrained as to be allowed only to output proofs in HOL of input theorems; these proofs are then verified by a simple and secure-seeming verifier in a sandbox whose exact code is unknown to the Oracle. This verifier outputs 1 if the proof is valid and 0 otherwise, then discards the proof-data. Suppose also that the Oracle is in a shielded box, etcetera.
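The one-bit protocol can be sketched with a toy checker. This is purely illustrative: a real verifier would check HOL proofs inside a sandbox, whereas here a trivial modus-ponens checker over propositional formulas stands in, and all names (`verifies`, the tuple encoding of formulas) are hypothetical.

```python
# Toy sketch of the one-bit Provability Oracle protocol (hypothetical).
# A formula is an atom (a string) or an implication ("imp", A, B).
# A "proof" is a list of formulas, each of which must be an axiom or
# must follow by modus ponens from two earlier lines of the proof.

def verifies(axioms, proof, theorem):
    """Return 1 if `proof` derives `theorem` from `axioms`, else 0.

    Only this single bit leaves the verifier; the proof data itself
    is discarded, as in the sandboxed setup described above.
    """
    derived = []
    for line in proof:
        ok = line in axioms or any(
            p == ("imp", q, line) for p in derived for q in derived
        )
        if not ok:
            return 0  # reject the whole proof on any unjustified step
        derived.append(line)
    return 1 if theorem in derived else 0

# The Oracle proposes a proof of "b" from axioms {a, a -> b}:
axioms = {"a", ("imp", "a", "b")}
print(verifies(axioms, ["a", ("imp", "a", "b"), "b"], "b"))  # 1
print(verifies(axioms, ["b"], "b"))  # 0 (unsupported step)
```

The point of the sketch is the narrowness of the channel: the Oracle's entire causal influence on the outside world is a single bit per query, which is what makes containability at least plausible here.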

It’s possible that this Provability Oracle has been so constrained that it is cognitively containable (it has no classes of options we don’t know about). If the verifier is unhackable, it gives us trustworthy knowledge that a theorem is provable. But this limited system is not obviously useful in a way that enables humanity to extricate itself from its larger dilemma. Nobody has yet stated a plan which could save the world if only we had a superhuman capacity to detect which theorems were provable in Zermelo-Fraenkel set theory.

Saying “The solution is for humanity to only build Provability Oracles!” does not resolve the value achievement dilemma because humanity does not have the coordination ability to ‘choose’ to develop only one kind of AI over the indefinite future, and the Provability Oracle has no obvious use that prevents non-Oracle AIs from ever being developed. Thus our larger value achievement dilemma would remain unsolved. It’s not obvious how the Provability Oracle would even constitute significant strategic progress.

Open problem

Describe a cognitive task or real-world task for an AI to carry out, that makes great progress upon the value achievement dilemma if executed correctly, and that can be done with a limited AI that:

  1. Has a real-world solution state that is exceptionally easy to pinpoint using a utility function, thereby avoiding edge instantiation, unforeseen maximums, context change, programmer maximization, and the other pitfalls of advanced safety, if there is otherwise a trustworthy solution for low-impact AI; or

  2. Seems exceptionally implementable using a known-algorithm non-self-improving agent, thereby averting problems of stable self-modification, if there is otherwise a trustworthy solution for a known-algorithm non-self-improving agent; or

  3. Constrains the agent’s option space so drastically as to make the strategy space not be rich (and the agent hence containable), while still containing a trustworthy, otherwise unfindable solution to some challenge that resolves the larger dilemma.

Additional difficulties

(Fill in this section later; all the things that go wrong when somebody eagerly says something along the lines of “We just need AI that does X!”)
