Do-What-I-Mean hierarchy

Do-What-I-Mean (DWIM) refers to an aligned AGI’s ability to produce better-aligned plans, based on an explicit model of what the user wants or believes.

Successive levels of DWIM-ness:

  • No understanding of human intentions / zero DWIM-ness. E.g., a Task AGI that focuses only on the single task communicated to it, where all the potential impacts of that task need to be separately vetted. If you tell this kind of AGI to ‘cure cancer’, you might need to veto plans which would remove the cancer but kill the patient as a side effect, because the AGI doesn’t start out knowing that you’d prefer not to kill the patient.

  • Do What You Don’t Know I Dislike. The Task AGI has a background understanding of some human goals, or of which parts of the world humans consider especially significant, so it can more quickly generate a plan likely to pass human vetting. A Task AGI at this level, told to cure cancer, will take relatively fewer rounds of Q&A to generate a plan which carefully seals off any blood vessels cut by removing the cancer, because the AGI has a general notion of human health, knows that impacts on human health are significant, and models users as generally preferring plans whose side effects leave the patient in good health rather than poor health.

  • Do What You Know I Understood. The Task AGI has a model of human beliefs, and can flag and report divergences between the AGI’s model of what the humans expect to happen, and what the AGI expects to happen.

  • DWIKIM: Do What I Know I Mean. The Task AGI has an explicit psychological model of human preference: not just a list of things in the environment which are significant to users, but a predictive model of how users behave which is informative about their preferences. At this level, the AGI can read through a dump of online writing, build up a model of human psychology, and guess that you’re telling it to cure a patient’s cancer because you altruistically want that patient to be healthier.

  • DWIDKIM: Do What I Don’t Know I Mean. The AGI can perform some basic extrapolation steps on its model of you and notice when you’re trying to do something that, in the AGI’s model, some further piece of knowledge might change your mind about. (Unless we trust the DWIDKIM model a lot, this scenario should imply “Warn the user about that,” not “Do what you think the user would’ve told you.”)

  • (Coherent) Extrapolated Volition. The AGI does what it thinks you (or everyone) would’ve told it to do if you were as smart as the AGI, i.e., your decision model is extrapolated toward improved knowledge, increased ability to consider arguments, improved reflectivity, or other transforms in the direction of a theory of normativity.
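
To make the progression concrete, the toy Python sketch below shows one way the levels might differ during plan vetting: higher levels let the system pre-screen more side effects itself, so fewer rounds of human Q&A are needed before a plan passes. This is a hypothetical illustration only; all names (`DwimLevel`, `Plan`, `UserModel`, `vet_plan`) and the stub user-model methods are assumptions for the sketch, not an API from the original proposal.

```python
from dataclasses import dataclass, field
from enum import IntEnum, auto


class DwimLevel(IntEnum):
    """The levels above, in increasing order of user-modeling."""
    ZERO = auto()                # no model of human intentions
    KNOWN_DISLIKES = auto()      # Do What You Don't Know I Dislike
    KNOWN_EXPECTATIONS = auto()  # Do What You Know I Understood
    DWIKIM = auto()              # Do What I Know I Mean
    DWIDKIM = auto()             # Do What I Don't Know I Mean
    CEV = auto()                 # (Coherent) Extrapolated Volition


@dataclass
class Plan:
    description: str
    predicted_effects: list = field(default_factory=list)


@dataclass
class UserModel:
    """Toy stand-in for whatever user model the AGI has at a given level."""
    significant_impacts: set = field(default_factory=set)

    def dislikes(self, effect):
        return effect in self.significant_impacts

    def expectations_diverge(self, plan):
        return False  # placeholder: compare user-expected vs. AGI-predicted outcomes

    def might_change_mind(self, plan):
        return False  # placeholder: would further knowledge change the user's request?


def vet_plan(plan, level, user_model):
    """Return 'approve', 'flag', or 'ask_user' for one candidate plan."""
    if level == DwimLevel.ZERO:
        # No model of intentions: every potential impact needs separate human vetting.
        return "ask_user"
    if level >= DwimLevel.KNOWN_DISLIKES and any(
        user_model.dislikes(e) for e in plan.predicted_effects
    ):
        # A background notion of significant impacts lets the AGI pre-flag bad side effects.
        return "flag"
    if level >= DwimLevel.KNOWN_EXPECTATIONS and user_model.expectations_diverge(plan):
        # Report divergences between what the user expects and what the AGI predicts.
        return "flag"
    if level >= DwimLevel.DWIDKIM and user_model.might_change_mind(plan):
        # Per the hierarchy above: warn the user rather than acting on the extrapolation.
        return "flag"
    # DWIKIM and CEV mainly change how the request is interpreted in the first
    # place; they add no extra check in this toy vetting step.
    return "approve"


# A zero-DWIM system punts every plan back to the user; the next level up
# pre-flags the plan with the lethal side effect.
patient_model = UserModel(significant_impacts={"patient dies"})
risky = Plan("remove tumor, sever artery", ["cancer removed", "patient dies"])
print(vet_plan(risky, DwimLevel.ZERO, patient_model))            # ask_user
print(vet_plan(risky, DwimLevel.KNOWN_DISLIKES, patient_model))  # flag
```

The `UserModel` stubs always return False here; in a real system each method stands in for a substantial modeling problem at its corresponding level.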

Risks from pushing toward higher levels of DWIM might include:

  • To the extent that DWIM can originate plans, some portion of which are not fully supervised, DWIM amounts to a very complicated goal or preference system that would be harder to train and more likely to break. This failure mode may be less likely if some level of DWIM is only used to flag potentially problematic plans generated by non-DWIM protocols, rather than to generate plans on its own (see the sketch after this list).

  • Accurate predictive psychological models of humans might make the programmer deception failure mode more accessible if something else goes wrong.

  • Sufficiently advanced psychological models might constitute mindcrime.

  • The human-genie system might end up in the Valley of Dangerous Complacency where the genie almost always gets it right but occasionally gets it very wrong, and the human user is no longer alert to this possibility during the checking phase.

      • E.g., you might be tempted to skip the user checking phase, or just have the AI do whatever it thinks you meant, at a point where that trick only works 99% of the time and not 99.999999% of the time.

  • Computing sufficiently advanced DWIDKIM or EV possibilities for user querying might expose the human user to cognitive hazards. (“If you were sufficiently superhuman under scenario 32, you’d want yourself to stare really intently at this glowing spiral for 2 minutes, it might change your mind about some things… want to check and see if you think that’s a valid argument?”)

  • If the AGI were actually behaving like a safe genie, the sense of one’s wishes being immediately fulfilled without effort or danger might expose the programmers to additional moral hazard.
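
As a rough sketch of the ‘flag-only’ configuration mentioned in the first risk above, the hypothetical function below keeps the DWIM model out of plan generation entirely; `flag_only_pipeline`, `planner`, and `vet` are illustrative names for this sketch, not part of the original proposal.

```python
def flag_only_pipeline(task, planner, vet):
    """Use a DWIM model only to screen plans, never to generate them.

    planner: any non-DWIM plan generator, mapping a task to a candidate plan.
    vet:     a DWIM-based checker, e.g. the vet_plan sketch above with a
             level and a user model already bound to it.
    """
    candidate = planner(task)   # plan generation involves no DWIM model
    verdict = vet(candidate)    # DWIM is consulted only as a flagger
    # Every plan, even an "approved" one, is still shown to the user, to avoid
    # the Valley of Dangerous Complacency described above.
    return verdict, candidate
```

In this arrangement a miscalibrated preference model can at worst fail to flag a plan that the human still inspects, rather than steering the plan search itself.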
