Do-What-I-Mean hierarchy

Do-What-I-Mean refers to an aligned AGI’s ability to produce better-aligned plans, based on an explicit model of what the user wants or believes.

Successive levels of DWIM-ness:

  • No understanding of human intentions / zero DWIMness. E.g. a Task AGI focused on one communicated task, where all the potential impacts of that task need to be separately vetted. If you tell this kind of AGI to ‘cure cancer’, you might need to veto plans that would remove the cancer but kill the patient as a side effect, because the AGI doesn’t start out knowing that you’d prefer not to kill the patient.

  • Do What You Don’t Know I Dislike. The Task AGI has a background understanding of some human goals, or of which parts of the world humans consider especially significant, so it can more quickly generate a plan likely to pass human vetting. A Task AGI at this level, told to cure cancer, will need relatively fewer rounds of Q&A to generate a plan which carefully seals off any blood vessels cut by removing the cancer, because the AGI has a general notion of human health, knows that impacts on human health are significant, and models that users will generally prefer plans which result in good human health as side effects over plans which result in poor human health.

  • Do What You Know I Understood. The Task AGI has a model of human beliefs, and can flag and report divergences between the AGI’s model of what the humans expect to happen and what the AGI expects to happen.

  • DWIKIM: Do What I Know I Mean. The Task AGI has an explicit psychological model of human preference: not just a list of things in the environment which are significant to users, but a predictive model of how users behave which is informative about their preferences. At this level, the AGI can read through a dump of online writing, build up a model of human psychology, and guess that you’re telling it to cure a cancer because you altruistically want that person to be healthier.

  • DWIDKIM: Do What I Don’t Know I Mean. The AGI can perform some basic extrapolation steps on its model of you, and notice when you’re trying to do something that, in the AGI’s model, some further piece of knowledge might change your mind about. (Unless we trust the DWIDKIM model a lot, this scenario should imply “Warn the user about that,” not “Do what you think the user would’ve told you.”)

  • (Coherent) Extrapolated Volition. The AGI does what it thinks you (or everyone) would’ve told it to do if you were as smart as the AGI, i.e., your decision model is extrapolated toward improved knowledge, increased ability to consider arguments, improved reflectivity, or other transforms in the direction of a theory of normativity.
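The ordering of these levels, and the divergence-flagging behavior of “Do What You Know I Understood,” can be sketched as a toy program. This is a minimal illustrative sketch, not an implementation of any real system; the enum names, the dictionary-of-outcomes representation, and the ‘cure cancer’ example values are all assumptions made for illustration:

```python
from enum import IntEnum

class DwimLevel(IntEnum):
    """Hypothetical ordering of the DWIM levels above (names assumed)."""
    NONE = 0                   # no model of intentions; every impact hand-vetted
    KNOW_YOU_DISLIKE = 1       # background model of significant impacts
    KNOW_YOU_UNDERSTOOD = 2    # model of user beliefs; flags divergences
    DWIKIM = 3                 # predictive psychological model of preferences
    DWIDKIM = 4                # extrapolates what new knowledge would change
    EXTRAPOLATED_VOLITION = 5  # full extrapolation of the user's decision model

def flag_divergences(user_expected: dict, agi_predicted: dict) -> list:
    """Level-2 behavior: report outcomes where the AGI's prediction of
    what will happen differs from what the user's model expects."""
    return [outcome for outcome in agi_predicted
            if user_expected.get(outcome) != agi_predicted[outcome]]

# Toy 'cure cancer' plan: the user expects the tumor removed and the
# patient alive; the AGI predicts its plan kills the patient.
user_model = {"tumor_removed": True, "patient_alive": True}
agi_model = {"tumor_removed": True, "patient_alive": False}
print(flag_divergences(user_model, agi_model))  # ['patient_alive']
```

A level-0 genie would silently execute the bad plan; the point of level 2 is that the divergence on `patient_alive` gets surfaced for human vetting before anything happens.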

Risks from pushing toward higher levels of DWIM might include:

  • To the extent that DWIM originates plans, some portion of which are not fully supervised, it is a very complicated goal or preference system that would be harder to train and more likely to break. This failure mode may be less likely if some level of DWIM is only being used to flag potentially problematic plans generated by non-DWIM protocols, rather than to generate plans on its own.

  • Accurate predictive psychological models of humans might make the programmer-deception failure mode more accessible if something else goes wrong.

  • Sufficiently advanced psychological models might constitute mindcrime.

  • The human-genie system might end up in the Valley of Dangerous Complacency, where the genie almost always gets it right but occasionally gets it very wrong, and the human user is no longer alert to this possibility during the checking phase.

  • E.g., you might be tempted to skip the user checking phase, or just have the AI do whatever it thinks you meant, at a point where that trick only works 99% of the time and not 99.999999% of the time.

  • Computing sufficiently advanced DWIDKIM or EV possibilities for user querying might expose the human user to cognitive hazards. (“If you were sufficiently superhuman under scenario 32, you’d want yourself to stare really intently at this glowing spiral for 2 minutes; it might change your mind about some things… want to check and see if you think that’s a valid argument?”)

  • If the AGI were actually behaving like a safe genie, the sense of one’s wishes being immediately fulfilled without effort or danger might expose the programmers to additional moral hazard.
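The complacency risk above comes down to a one-line calculation: if each unsupervised query succeeds independently with probability p, the chance of at least one serious failure across n queries is 1 - p^n. A minimal sketch (the function name and the choice of 1,000 queries are illustrative assumptions; the 99% and 99.999999% figures come from the bullet above):

```python
def p_any_failure(per_query_success: float, n_queries: int) -> float:
    """Chance of at least one failure across n independent queries."""
    return 1.0 - per_query_success ** n_queries

# A genie that is right 99% of the time is almost certain to fail
# somewhere in 1,000 unchecked uses; a 99.999999% genie is not.
print(p_any_failure(0.99, 1000))        # close to 1
print(p_any_failure(0.99999999, 1000))  # close to 0
```

The gap between the two numbers is the Valley of Dangerous Complacency: a 99%-reliable genie is reliable enough that the human stops checking, but nowhere near reliable enough that checking can safely stop.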