Distinguish which advanced-agent properties lead to the foreseeable difficulty

Any general project of producing a large edifice of good thinking should try to break the ideas down into modular pieces, distinguish premises from conclusions, and clearly label which reasoning steps are being used. Applied to AI alignment theory, one of the things this suggests is that if you propose any sort of potentially difficult or dangerous future behavior from an AI, you should distinguish which particular kinds of advancement or cognitive intelligence are supposed to produce that difficulty. In other words, supposed foreseeable difficulties should come with proposed advanced agent properties that match up to them.
  • Advanced safety

    An agent is really safe when it has the capacity to do anything, but chooses to do what the programmer wants.