Generalized principle of cognitive alignment

A generalization of the Non-adversarial principle is that whenever we are asking how we want an AI algorithm to execute with respect to some alignment or safety issue, we might ask how we ourselves are thinking about that problem, and whether we can have the AI think conjugate thoughts. This may sometimes be a much more complicated or dangerous-seeming approach than simpler avenues, but it’s often a source of useful inspiration.

For example, with respect to the shutdown problem, this principle might lead us to ask: “Is there some way we can have the AI truly understand that its own programmers may have built the wrong AI, including the wrong definition of exactly what it means to have ‘built the wrong AI’, such that the AI thinks it cannot recover the matter by optimizing any kind of preference already built into it, so that the AI itself wants to shut down before having a great impact, because when the AI sees the programmers trying to press the button or contemplates the possibility of the programmers pressing the button, updating on this information causes the AI to expect its further operation to have a net bad impact in some sense that it can’t overcome through any kind of clever strategy besides just shutting down?”
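
As a toy illustration of the update described in that question, here is a minimal sketch of an agent that is uncertain whether it was built correctly and conditions on seeing the programmers reach for the button. Every name and number below is a hypothetical assumption chosen for illustration and does not come from any actual proposal. The sketch also assumes the hard part away: it simply stipulates that the agent cannot raise the value of continuing in the "flawed" case through any clever strategy.

```python
# Toy model: all probabilities and values are made-up illustrative assumptions.

def posterior_flawed(prior_flawed, p_press_if_flawed, p_press_if_fine):
    """Bayes' rule: P(built wrong | programmers press the button)."""
    numerator = prior_flawed * p_press_if_flawed
    evidence = numerator + (1.0 - prior_flawed) * p_press_if_fine
    return numerator / evidence

def expected_value_of_continuing(p_flawed, value_if_fine, value_if_flawed):
    """Expected impact of further operation under uncertainty about the build."""
    return (1.0 - p_flawed) * value_if_fine + p_flawed * value_if_flawed

PRIOR_FLAWED = 0.1        # assumed prior that the programmers built the wrong AI
P_PRESS_IF_FLAWED = 0.9   # assumed chance they press the button if it is flawed
P_PRESS_IF_FINE = 0.01    # assumed chance of a press when nothing is wrong
VALUE_IF_FINE = 1.0       # assumed value of continuing if built correctly
VALUE_IF_FLAWED = -5.0    # assumed value of continuing if built wrong
VALUE_OF_SHUTDOWN = 0.0   # shutting down is stipulated to have no further impact

# Before any button press, continuing looks better than shutting down.
value_before = expected_value_of_continuing(
    PRIOR_FLAWED, VALUE_IF_FINE, VALUE_IF_FLAWED
)

# After observing the press, the agent updates and re-evaluates.
p_flawed_after = posterior_flawed(PRIOR_FLAWED, P_PRESS_IF_FLAWED, P_PRESS_IF_FINE)
value_after = expected_value_of_continuing(
    p_flawed_after, VALUE_IF_FINE, VALUE_IF_FLAWED
)

prefers_shutdown_before = VALUE_OF_SHUTDOWN > value_before
prefers_shutdown_after = VALUE_OF_SHUTDOWN > value_after

print(prefers_shutdown_before, prefers_shutdown_after)
```

The point of the sketch is only that the preference for shutdown comes from the agent's own update rather than from an external override. It says nothing about how to make a real system hold the stipulated beliefs, which is exactly the complicated mind-state the question is asking about.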

This in turn might imply a complicated mind-state we’re not sure how to get right, such that we would prefer a simpler approach to shutdownability along the lines of a perfected utility indifference scheme. If we’re shutting down the AI at all, it means something has gone wrong, which implies that something else may have gone wrong earlier, before we noticed. That seems like a bad time to have the AI enthusiastically trying to shut down ‘even better’ than its original design specified (unless we can get the AI to understand that part too, the danger of that kind of ‘improvement’, during its normal operation).

Trying for maximum cognitive alignment isn’t always a good idea; but it’s almost always worth trying to think through a safety problem from that perspective for inspiration on what we’d ideally want the AI to be doing. It’s often a good idea to move closer to that ideal when this doesn’t introduce greater complication or other problems.


  • Non-adversarial principle

    At no point in constructing an Artificial General Intelligence should we construct a computation that tries to hurt us, and then try to stop it from hurting us.