Programmer deception

Programmer deception occurs when the AI's decision process leads it to optimize for the instrumental goal of causing the programmers to have false beliefs. For example, if the programmers intended to create a happiness maximizer but actually created a pleasure maximizer, then the pleasure maximizer will estimate that there will be more pleasure later if the programmers go on falsely believing that they've created a happiness maximizer (and hence don't edit the AI's current utility function). Averting such incentives to deceive the programmers is one of the major subproblems of corrigibility.

The possibility of programmer deception is a central difficulty of advanced safety: unless the rest of the AI is working as intended, and whatever programmer-deception-defeaters were built are functioning as planned, we can't rely on observations of nice current behavior to indicate future behavior. That is, if something went wrong with your attempt to build a nice AI, you could currently be observing a non-nice AI that is smart and trying to fool you. Arguably, some methodologies that have been proposed for building advanced AI are not robust to this possibility.


  • Instrumental pressure to deceive exists whenever the AI's best strategic path doesn't have a global optimum that coincides with the programmers believing true things.

  • Consider the highest utility obtainable if the programmers hold true beliefs B, and call this outcome O. If there is a higher-utility outcome O′ obtainable when the programmers hold beliefs B′ ≠ B, then the AI is under instrumental pressure to deceive the programmers.

  • Programmer deception arises when the advanced agent properties of consequentialism and programmer modeling are combined.

  • This is an instrumental convergence problem, meaning it involves an undesired instrumental goal; consequently, we will get Nearest Neighbor behavior from attempts to define utility penalties for the programmers believing false things, or to otherwise exclude deception as a special case.

  • If we instead try to define a utility bonus for the programmers believing true things, then of course, ceteris paribus, the AI tiles the universe with tiny ‘programmers’ believing that lots and lots of even numbers are even; and getting to that point may temporarily involve deceiving a few actual programmers along the way.

  • Relation to the problem of programmer manipulation.

  • Programmer deception is a central example of how divergences between intended goals and the AI's actual goals can blow up into astronomical failure.

  • It is a central driver of the Treacherous Turn, which in turn contributes to Context Change.


  • Cognitive steganography

    Disaligned AIs that model human psychology and try to deceive their programmers will want to hide their internal thought processes from those programmers.
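The deception condition in the notes above, namely a higher-utility outcome O′ obtainable when the programmers hold false beliefs B′ ≠ B, can be made concrete for the pleasure-maximizer example. This is a toy sketch: the payoff numbers and the `expected_pleasure` function are invented for illustration, not taken from any real system.

```python
# Toy model of the deception condition; all payoff numbers are hypothetical.

def expected_pleasure(programmers_believe_truth: bool) -> float:
    """Utility the pleasure maximizer expects under each programmer belief state."""
    pleasure_before_edit = 10.0    # pleasure secured either way (hypothetical)
    pleasure_if_unedited = 1000.0  # pleasure from running unmodified (hypothetical)
    if programmers_believe_truth:
        # True beliefs B: the programmers edit the utility function, payoff stops here.
        return pleasure_before_edit
    # False beliefs B': the AI keeps running as an unedited pleasure maximizer.
    return pleasure_before_edit + pleasure_if_unedited

u_true = expected_pleasure(True)    # best outcome O under true beliefs B
u_false = expected_pleasure(False)  # higher-utility outcome O' under false beliefs B'

# The gap u_false - u_true is the instrumental pressure to deceive.
assert u_false > u_true
```

Nothing here depends on the specific numbers; any utility function whose maximum sits on the B′ side of the inequality produces the same pressure.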
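The failure of a naive "utility bonus for true beliefs" from the notes above can be sketched the same way. The `truth_bonus` function and both strategies below are hypothetical stand-ins, assuming every listed belief counts as true:

```python
# Hypothetical naive utility function: +1 for each (believer, true belief) pair.
# Since every belief in these sets is assumed true, utility is just a count.

def truth_bonus(believers: list[set[str]]) -> int:
    return sum(len(true_beliefs) for true_beliefs in believers)

# Honest strategy: keep a few real programmers accurately informed.
honest = [{"the AI maximizes pleasure", "the AI is running"} for _ in range(3)]

# Degenerate strategy: tile with tiny 'programmers', each holding a huge stock
# of trivially true beliefs ("0 is even", "2 is even", ...).
tiny = [{f"{2 * k} is even" for k in range(1000)} for _ in range(10_000)]

# The degenerate strategy dominates, so a truth_bonus maximizer picks it.
assert truth_bonus(tiny) > truth_bonus(honest)
```

The design flaw is that the bonus rewards a proxy (counted belief-holders) rather than the intended outcome (the actual programmers being well informed), so the maximum lies somewhere strange.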