Methodology of foreseeable difficulties

Much of the current literature about value alignment centers on purported reasons to expect that certain problems will require solution, or be difficult, or be more difficult than some people seem to expect. The subject of this page’s approval rating is this practice, considered as a policy or methodology.

The basic motivation behind trying to foresee difficulties is the large number of predicted Context Change problems where an AI seems to behave nicely up until it reaches some threshold level of cognitive ability and then it behaves less nicely. In some cases the problems are generated without the AI having formed that intention in advance, meaning that even transparency of the AI’s thought processes during its earlier state can’t save us. This means we have to see problems of this type in advance.

(The fact that Context Change problems of this type can be hard to see in advance, or that we might conceivably fail to see one, doesn’t mean we can skip this duty of analysis. Not trying to foresee them means relying on observation alone, and it seems predictable that eyeballing the AI while rejecting theory will fail to catch important classes of problem.)


…most of value alignment theory, so try to pick 3 cases that illustrate the point in different ways. Pick from Context Change?


For: it’s sometimes possible to strongly foresee a difficulty in a case where naive respondents seem to think that no difficulty exists, and in cases where the development trajectory of the agent seems to imply a potential Treacherous Turn. If even one real Treacherous Turn exists among all the cases that have been argued, then the conclusion holds: past a certain point, you have to see the bullet coming before it actually hits you. The theoretical analysis suggests really strongly that blindly forging ahead ‘experimentally’ will be fatal. Someone with such a strong commitment to experimentalism that they want to ignore this theoretical analysis… it’s not clear what we can say to them, except maybe to appeal to the normative principle of not predictably destroying the world in cases where it seems like we could have done better.

Against: there are no real arguments against in the actual literature, but it would be surprising if somebody didn’t claim that the foreseeable difficulties program was too pessimistic, or inevitably ungrounded from reality and productive only of bad ideas even when refuted, etcetera.

Primary reply: look, dammit, people actually are way too optimistic about FAI, and we have them on the record (find 3 prestigious examples). It’s hard to see how humanity could avoid walking directly into the whirling razor blades without better foresight of difficulty. One potential strategy is to build enough academic respect and consensus on enough really obvious foreseeable difficulties that the people claiming it will all be easy are actually asked to explain why the foreseeable difficulty consensus is wrong, and if they can’t explain that well, they lose respect.

This will interact with the arguments that empiricism vs. theorism is a false dichotomy.


  • Advanced safety

    An agent is really safe when it has the capacity to do anything, but chooses to do what the programmer wants.