Methodology of foreseeable difficulties

Much of the current literature about value alignment centers on purported reasons to expect that certain problems will require solution, or will be difficult, or will be more difficult than some people seem to expect. The subject of this page's approval rating is this practice, considered as a policy or methodology.

The basic motivation behind trying to foresee difficulties is the large number of predicted Context Change problems, in which an AI seems to behave nicely up until it reaches some threshold level of cognitive ability, and then behaves less nicely. In some cases the problems arise without the AI having formed any such intention in advance, meaning that even transparency of the AI's thought processes during its earlier state can't save us. This means we have to see problems of this type in advance.

(The fact that Context Change problems of this type can be hard to see in advance, or that we might conceivably fail to see one, doesn't mean we can skip this duty of analysis. Not trying to foresee them means relying on observation alone, and it seems predictable that eyeballing the AI while rejecting theory will fail to catch important classes of problems.)


…most of value alignment theory, so try to pick 3 cases that illustrate the point in different ways. Pick from Context Change?


For: it's sometimes possible to strongly foresee a difficulty coming, both in cases where you've observed naive respondents who seem to think that no difficulty exists, and in cases where the development trajectory of the agent seems to imply a potential Treacherous Turn. If there's even one real Treacherous Turn among all the cases that have been argued, then the point carries: past a certain point, you have to see the bullet coming before it actually hits you. The theoretical analysis suggests very strongly that blindly forging ahead 'experimentally' will be fatal. To someone with such a strong commitment to experimentalism that they want to ignore this theoretical analysis, it's not clear what we can say, except maybe to appeal to the normative principle of not predictably destroying the world in cases where it seems like we could have done better.

Against: there are no real arguments against this in the actual literature, but it would be surprising if somebody didn't claim that the foreseeable-difficulties program was too pessimistic, or inevitably ungrounded from reality and productive only of bad ideas even when refuted, and so on.

Primary reply: look, dammit, people actually are way too optimistic about FAI, and we have them on the record (find 3 prestigious examples); it's hard to see how humanity could avoid walking directly into the whirling razor blades without better foresight of difficulty. One potential strategy is to build enough academic respect and consensus around enough really obvious foreseeable difficulties that the people claiming it will all be easy are actually asked to explain why the foreseeable-difficulty consensus is wrong, and lose respect if they can't explain it well.

Will interact with the argument that "empiricism vs. theorism" is a false dichotomy.


  • Advanced safety

    An agent is really safe when it has the capacity to do anything, but chooses to do what the programmer wants.