Strong cognitive uncontainability


Suppose somebody from the 10th century were asked how somebody from the 20th century might cool their house. While they would be able to understand the problem and offer some solutions, maybe even clever solutions (“Locate your house someplace with cooler weather”, “Divert water from the stream to flow through your living room”), the 20th century’s actual solution of ‘air conditioning’ is not available to them as a strategy. Not just because they don’t think fast enough or aren’t clever enough, but because an air conditioner takes advantage of physical laws they don’t know about. Even if they somehow randomly imagined an air conditioner’s exact blueprint, they wouldn’t expect that design to operate as an air conditioner until they were told about the relation of pressure to temperature, how electricity can power a compressor motor, and so on.

By definition, a strongly uncontainable agent can conceive of strategies that pass through causal domains you can’t currently model, and it has options that access those strategies. It may therefore execute high-value solutions such that, even after being told the exact strategy, you would not assign it high expected efficacy without being told further background facts.

At least in this sense, the 20th century is ‘strongly cognitively uncontainable’ relative to the 10th century: We can solve the problem of how to cool homes using a strategy that would not be recognizable in advance to a 10th-century observer.

Arguably, most real-world problems, if we addressed them today using the full power of modern science and technology (i.e., if we were willing to spend a lot of money on tech and maybe run a prediction market on the relevant facts), would have best solutions that couldn’t be verified in the 10th century.

We can imagine a cognitively powerful agent being strongly uncontainable in some domains but not others. Since every cognitive agent is containable on formal games of tic-tac-toe (at least so far as we can imagine, and so long as there isn’t a real-world opponent to manipulate), strong uncontainability cannot be a universal property of an agent across all formal and informal domains.
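The tic-tac-toe point can be made concrete: the game tree is small enough to enumerate exhaustively, so every strategy available to any agent, however smart, is already visible to us in advance. A minimal sketch (plain Python, memoized minimax; the function names are illustrative, not from any source) confirming the well-known result that optimal play from the empty board is a draw:

```python
from functools import lru_cache

# Winning triples of cell indices on a 3x3 board stored as a 9-char string.
WINS = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
        (0, 3, 6), (1, 4, 7), (2, 5, 8),
        (0, 4, 8), (2, 4, 6)]

def winner(board):
    """Return 'X' or 'O' if that player has three in a row, else None."""
    for a, b, c in WINS:
        if board[a] != '.' and board[a] == board[b] == board[c]:
            return board[a]
    return None

@lru_cache(maxsize=None)
def minimax(board, player):
    """Value of `board` with `player` to move: +1 X wins, 0 draw, -1 O wins."""
    w = winner(board)
    if w == 'X':
        return 1
    if w == 'O':
        return -1
    if '.' not in board:
        return 0  # Board full, no winner: draw.
    other = 'O' if player == 'X' else 'X'
    values = [minimax(board[:i] + player + board[i + 1:], other)
              for i, cell in enumerate(board) if cell == '.']
    # X maximizes the value, O minimizes it.
    return max(values) if player == 'X' else min(values)

print(minimax('.........', 'X'))  # 0: optimal play is a draw
```

Because the entire option set can be enumerated like this, no opponent can surprise us inside the formal game; any uncontainability would have to come from outside the board, e.g. manipulating the human operating it.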

General arguments

Arguments in favor of strong uncontainability tend to revolve around either:

  • The richness and partial unknownness of a particular domain. (E.g. human psychology seems very complicated; has a lot of unknown pathways; and previously discovered exploits often seemed very surprising; therefore we should expect strong uncontainability on the domain of human psychology.)

  • Outside-view induction on previous ability advantages derived from cognitive advantages. (The 10th century couldn’t contain the 20th century even though all parties involved were biological Homo sapiens; what makes us think we’re the first generation to have the real true laws of the universe in our minds?)

Arguments against strong uncontainability tend to revolve around:

  • The apparent knownness of a particular domain. (E.g., since we have observed the rules of chemistry with great precision and know their origin in the underlying molecular dynamics, we can believe that even an arbitrarily smart agent should not be able to turn lead into gold using non-radioactive chemical reagents.)

  • Backward reasoning from the Fermi Paradox, which gives us weak evidence bounding the capabilities of the most powerful agents possible in our universe. (E.g., even though there might be surprises remaining in the question of how to standardly model physics, any surprise yielding Faster-Than-Light travel to a previously un-traveled point makes the Fermi Paradox harder to explain.)

Key propositions

  • Can Oracles be contained inside a computational sandbox? That is, is there some restriction of input-output channels and of other environmental interactions such that:

      • The richness of the ‘human psychology’ domain is averted;

      • Remaining causal interactions with the outside universe have an option set too small and flat to contain interesting options.

  • How solid is our current knowledge of the physical universe?

      • To what extent should we expect an advanced agency (e.g. machine superintelligences a million years later) to be boundable using our present physical understanding?

      • Can we reasonably rule out unknown physical domains being accessed by a computationally sandboxed AI?

      • What is the highest reasonable probability that could, under optimal conditions, be assigned to having genuinely contained an AI inside a computational sandbox, if it is not allowed any rich output channels? Is it more like 20% or 80%?

  • Are there useful domains conceptually closed to humans’ internal understanding?

      • Will a machine superintelligence have ‘power we know not’ in the sense that it can’t be explained to us even after we’ve seen it (except in the trivial sense that we could simulate another mind understanding it using external storage and Turing-like rules), as with a chimpanzee encountering an air conditioner?

