Behaviorist genie

A behaviorist genie is an AI that has been averted from modeling minds in more detail than some whitelisted class of models.

This is possibly a good idea because many possible difficulties seem to be associated with the AI having a sufficiently advanced model of human minds or AI minds, including:

…and yet an AI that is extremely good at understanding material objects and technology (just not other minds) would still be capable of some important classes of pivotal achievement.

A behaviorist genie would still require most of genie theory and corrigibility to be solved. But it’s plausible that restricting the AI away from modeling humans, programmers, and some types of reflectivity would collectively make it significantly easier to make a safe form of this genie.

Thus, a behaviorist genie is one of fairly few open candidates for “AI that is restricted in a way that actually makes it safer to build, without it being so restricted as to be incapable of game-changing achievements”.

Nonetheless, limiting the degree to which the AI can understand cognitive science, other minds, its own programmers, and itself is a very severe restriction. It would rule out a number of obvious ways to make progress on the AGI subproblem and the value identification problem, even for commands given to Task AGIs (Genies). Furthermore, there could perhaps be easier types of genies to build, or there might be grave difficulties in restricting the model class to some space that is useful without being dangerous.

Requirements for implementation

Broadly speaking, two possible clusters of behaviorist-genie design are:

  • A cleanly designed, potentially self-modifying genie that can internally detect modeling problems that threaten to become mind-modeling problems, and route them into a special class of allowable mind-models.

  • A known-algorithm non-self-improving AI, whose complete set of capabilities has been carefully crafted and limited, and which was shaped to not have much capability when it comes to modeling humans (or distant superintelligences).

Breaking the first case down into more detail, the potential desiderata for a behavioristic design are:

  • (a) avoiding mindcrime when modeling humans

  • (b) not modeling distant superintelligences or alien civilizations

  • (c) avoiding programmer manipulation

  • (d) avoiding mindcrime in internal processes

  • (e) making self-improvement somewhat less accessible.

These are different goals, but with some overlap between them. Some of the things we might need:

  • A working Nonperson predicate that was general enough to screen the entire hypothesis space AND that was resilient against loopholes AND passed enough okay computations to compose a usable hypothesis space

  • A working Nonperson predicate that was general enough to screen the entire space of potential self-modifications and subprograms AND was resilient against loopholes AND passed enough okay computations to compose the entire AI

  • An allowed class of human models that was clearly safe in the sense of not being sapient, AND a reliable way to tell every time the AI was trying to model a human, including when it was modeling something else that was partially affected by humans. The programmers might be a special case that allowed a more sophisticated model of some programmer intentions, but still not one good enough to psychologically manipulate the programmers.

  • A way to tell whenever the AI was trying to model a distant civilization, which shut down the modeling attempt or avoided the incentive to model (this might not require healing a bunch of entanglements, since there are no visible aliens and therefore their exclusion shouldn’t mess up other parts of the AI’s model)

  • A reflectively stable way to support any of the above, which are technically epistemic exclusions
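
As a purely illustrative toy, the screening requirements in the list above can be pictured as a conservative filter over candidate hypotheses: the filter may reject harmless computations, but it should never pass one it cannot certify. Everything in the sketch below is a hypothetical stand-in (the `Hypothesis` type, the model-class labels, and the certified set); no real Nonperson predicate is known, and a string label is of course not how one would be implemented.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    ALLOWED = "allowed"    # certified as belonging to a whitelisted model class
    REJECTED = "rejected"  # not certified, so treated as potentially a person


@dataclass(frozen=True)
class Hypothesis:
    name: str
    model_class: str  # hypothetical label for the kind of model being used


# Hypothetical whitelist: model classes ASSUMED to be certified non-sapient.
CERTIFIED_NONPERSON_CLASSES = frozenset(
    {"rigid_body_physics", "chemistry", "coarse_behavioral"}
)


def nonperson_predicate(h: Hypothesis) -> Verdict:
    """Conservative check: anything not explicitly certified is rejected."""
    if h.model_class in CERTIFIED_NONPERSON_CLASSES:
        return Verdict.ALLOWED
    return Verdict.REJECTED


def screen(hypotheses):
    """Keep only certified hypotheses and report the fraction retained."""
    allowed = [h for h in hypotheses if nonperson_predicate(h) is Verdict.ALLOWED]
    coverage = len(allowed) / len(hypotheses) if hypotheses else 0.0
    return allowed, coverage
```

In this toy, the `coverage` value stands in for the "passes enough okay computations" requirement: a predicate that rejects nearly everything is safe but leaves too little to build with. Resilience against loopholes and reflective stability are not captured here at all.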

In the KANSI (known-algorithm non-self-improving) case, we’d presumably be ‘naturally’ working with limited model classes, on the assumption that everything the AI is using is being monitored, has a known algorithm, and has a known model class. The goal would then just be to prevent the KANSI agent from spilling over and creating other human models somewhere else, which might fit well into a general agenda against self-modification and subagent creation. Similarly, if every new subject is being identified and whitelisted by human monitors, then simply not whitelisting the topics of modeling distant superintelligences or devising strategies for programmer manipulation might get most of the job done to an acceptable level, provided the underlying whitelist is never evaded (even emergently). This would require a lot of successfully maintained vigilance and human monitoring, though, especially if the KANSI agent is trying to allocate a new human-modeling domain once per second and every instance has to be manually checked.
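
The human-monitored whitelisting described here can be sketched as a small gate, under heavy simplifying assumptions. The class name, the domain labels, and the idea that a modeling domain can be identified by a string are all hypothetical; the sketch only illustrates the bookkeeping, not how domains would actually be detected.

```python
from dataclasses import dataclass, field


@dataclass
class DomainWhitelist:
    # Domains a human monitor has explicitly approved.
    approved: set = field(default_factory=set)
    # Hypothetical labels for topics the monitors have decided never to approve.
    never_approve: frozenset = frozenset(
        {"distant_superintelligences", "programmer_manipulation"}
    )
    # Requests waiting for manual review.
    pending: list = field(default_factory=list)

    def request(self, domain: str) -> bool:
        """Return True only if already approved; otherwise queue for review."""
        if domain in self.approved:
            return True
        if domain not in self.never_approve and domain not in self.pending:
            self.pending.append(domain)
        return False

    def review(self, domain: str, approve: bool) -> None:
        """Record a human monitor's decision on a domain."""
        if domain in self.pending:
            self.pending.remove(domain)
        if approve and domain not in self.never_approve:
            self.approved.add(domain)
```

A short usage example:

```python
wl = DomainWhitelist()
wl.request("materials_science")          # False: queued for a human to check
wl.review("materials_science", approve=True)
wl.request("materials_science")          # True: now whitelisted
wl.request("programmer_manipulation")    # False: never queued, never approved
```

The length of `pending` is a crude proxy for the vigilance cost mentioned above: if new domains are requested far faster than monitors can review them, the queue grows without bound. The sketch also says nothing about emergent evasion, where a forbidden model is built inside an approved domain.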

