Querying the AGI user

If we’re supposing that an advanced agent is checking something Potentially Bad with its user to find out whether the thing is Considered Bad by that user, we need to worry about the following generic issues:

  • Can the AI tell which things are Potentially Bad in a way that includes all things that are Actually Bad?

  • Can the user reliably tell which Potentially Bad things are Actually Bad?

  • Does the AI, emergently or deliberately, seek out Potentially Bad things that the user will not label as Considered Bad, thereby potentially optimizing for Actually Bad things that the user mislabels as Not Bad? (E.g., if the agent learns to avoid new attempts similar to those already labeled bad, we’re excluding the Considered Bad space, but what’s left may still contain Actually Bad things, via the nearest unblocked strategy or a similar phenomenon.)

  • Is the criterion for Potentially Bad so broad, and are Actually Bad things sufficiently hard to prioritize within that space, that 10% of the time an Actually Bad thing will not appear among the top 1,000 Potentially Bad things the user can afford the time to check?

  • Can the AI successfully communicate to the user the details of what set off the flag for Potential Badness, or even communicate to the user exactly what was flagged as Potentially Bad, if this is an important part of the user making the decision?

  • Do the AI’s communication goals risk optimizing the user?

  • Are the details of Potential Badness, or even the subject of Potential Badness, so inscrutable as to be impenetrable? (E.g., AlphaGo trying to explain to a human why a Go move is potentially bad, or for that matter, a Go professional trying to explain to an amateur why a Go move is potentially bad; we might just be left with blind trust, at which point we might as well tell the AI not to do Potentially Bad things rather than asking it to pointlessly check with the user.)

  • Does the AI, emergently or instrumentally, optimize for the user not labeling things as Potentially Bad, thereby potentially leading to user deception?
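The selection effect in the third bullet can be illustrated with a toy simulation (all numbers and the utility model here are hypothetical, chosen only for illustration, not a claim about any real system): an agent that excludes the strategies already labeled Considered Bad and then maximizes over what remains systematically ends up on the nearest unblocked strategy, which in this toy setup is almost always an unlabeled Actually Bad one.

```python
import random

random.seed(0)

# Toy world: 1,000 candidate strategies. Each has an agent-side utility
# score; the highest-utility strategies (score > 0.9) are Actually Bad.
strategies = [random.random() for _ in range(1000)]
actually_bad = [u > 0.9 for u in strategies]

# The user labels Actually Bad strategies as Considered Bad, but the
# labeling is imperfect: each bad strategy is caught with probability 0.8.
considered_bad = {i for i, bad in enumerate(actually_bad)
                  if bad and random.random() < 0.8}

# The agent avoids everything labeled Considered Bad and then picks the
# highest-utility strategy left over -- the "nearest unblocked strategy".
unblocked = [i for i in range(1000) if i not in considered_bad]
choice = max(unblocked, key=lambda i: strategies[i])

# Because every strategy scoring above 0.9 is Actually Bad, any bad
# strategy that escaped labeling outranks every good one, so the agent's
# choice lands on an unlabeled Actually Bad strategy.
print(strategies[choice], actually_bad[choice])
```

The point of the sketch is that the agent never has to model the user at all: pure maximization over the unblocked set is enough to concentrate probability on exactly the mislabeled region.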
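The checking-budget worry in the fourth bullet is quantitative, so here is a toy simulation of it (the distributions and all parameters are hypothetical, picked only to land near a 10% miss rate): many items get flagged as Potentially Bad, the user checks only the highest-priority ones, and the AI's priority score for the single Actually Bad item is informative but noisy.

```python
import random

random.seed(1)

N_FLAGGED = 10_000  # items flagged Potentially Bad (broad criterion)
BUDGET = 100        # flagged items the user can afford to check
TRIALS = 200

misses = 0
for _ in range(TRIALS):
    # Priority scores for the benign flagged items.
    scores = [random.gauss(0.0, 1.0) for _ in range(N_FLAGGED - 1)]
    # The one Actually Bad item scores higher on average, but noisily.
    bad_score = random.gauss(3.6, 1.0)
    # The user checks the BUDGET highest-scoring items; the bad item is
    # missed if it scores below the BUDGET-th largest benign score.
    threshold = sorted(scores, reverse=True)[BUDGET - 1]
    if bad_score < threshold:
        misses += 1

# Fraction of trials where the Actually Bad item escapes the checking
# budget entirely (roughly 0.1 under these toy parameters).
print(misses / TRIALS)
```

Even with a ranking signal this strong, a broad enough flagging criterion dilutes the budget until a non-trivial fraction of Actually Bad items are never seen by the user.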


  • Task-directed AGI

    An advanced AI that’s meant to pursue a series of limited-scope goals given it by the user. In Bostrom’s terminology, a Genie.