Distinguish which advanced-agent properties lead to the foreseeable difficulty
Any general project of producing a large edifice of good thinking should try to break the ideas into modular pieces, distinguish premises from conclusions, and clearly label which reasoning steps are being used. Applied to AI alignment theory, this suggests that if you propose any potentially difficult or dangerous future behavior from an AI, you should distinguish which particular kinds of cognitive advancement are supposed to produce that difficulty. In other words, supposed foreseeable difficulties should come with proposed advanced agent properties that match up to them.
- Advanced safety: An agent is really safe when it has the capacity to do anything, but chooses to do what the programmer wants.