Generalized principle of cognitive alignment

A generalization of the Non-adversarial principle is that whenever we are asking how we want an AI algorithm to behave with respect to some alignment or safety issue, we might ask how we ourselves are thinking about that problem, and whether we can have the AI think conjugate thoughts. This may sometimes seem like a much more complicated or dangerous approach than simpler avenues, but it's often a source of useful inspiration.

For example, with respect to the shutdown problem, this principle might lead us to ask: "Is there some way we can have the AI truly understand that its own programmers may have built the wrong AI, including the wrong definition of exactly what it means to have 'built the wrong AI', such that the AI thinks it cannot recover the matter by optimizing any kind of preference already built into it? Could the AI itself then want to shut down before having a great impact, because when it sees the programmers trying to press the button, or contemplates the possibility of their pressing it, updating on this information causes it to expect its further operation to have a net bad impact, in some sense that it can't overcome through any kind of clever strategy besides just shutting down?"
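The hoped-for update can be illustrated with a minimal toy model (a sketch only; the function names, probabilities, and payoff numbers here are invented for illustration, not a real proposal for how such an AI would work): the agent treats an observed shutdown attempt as evidence that its goals were mis-specified, and after updating, it prefers shutdown to continued operation.

```python
# Toy sketch: an agent updates on seeing its programmers reach for the
# shutdown button, and concludes that shutting down beats continuing.
# All quantities below are illustrative assumptions.

def posterior_misspecified(prior, likelihood_ratio):
    """Bayes update in odds form.

    likelihood_ratio = P(button pressed | goals misspecified)
                       / P(button pressed | goals correct).
    """
    odds = (prior / (1 - prior)) * likelihood_ratio
    return odds / (1 + odds)

def choose_action(p_misspecified,
                  value_if_correct=10.0,        # assumed value of continuing with correct goals
                  value_if_misspecified=-100.0, # assumed harm of continuing with wrong goals
                  value_of_shutdown=0.0):       # shutting down has a known, modest impact
    ev_continue = ((1 - p_misspecified) * value_if_correct
                   + p_misspecified * value_if_misspecified)
    return "shutdown" if value_of_shutdown >= ev_continue else "continue"

prior = 0.02                       # agent's prior that it was built wrong
print(choose_action(prior))        # before seeing the button: "continue"

# Programmers reach for the button: strong evidence of misspecification.
posterior = posterior_misspecified(prior, likelihood_ratio=50.0)
print(choose_action(posterior))    # after the update: "shutdown"
```

The point of the sketch is only the direction of the update: the same expected-value calculation that makes the agent want to run beforehand makes it want to stop once the button-press observation shifts enough probability mass onto "my goals are wrong."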

This in turn might imply a complicated mind-state we're not sure how to get right, such that we would prefer a simpler approach to shutdownability, along the lines of a perfected utility indifference scheme. If we're shutting down the AI at all, something has gone wrong, which implies that something else may have gone wrong earlier, before we noticed. That seems like a bad time for the AI to be enthusiastic about shutting down even better than in its original design (unless, during its normal operation, we can get the AI to understand even that part too: the danger of that kind of 'improvement').

Trying for maximum cognitive alignment isn't always a good idea; but it's almost always worth trying to think through a safety problem from that perspective for inspiration on what we'd ideally want the AI to be doing. It's often a good idea to move closer to that ideal when this doesn't introduce greater complication or other problems.


  • Non-adversarial principle

    At no point in constructing an Artificial General Intelligence should we construct a computation that tries to hurt us, and then try to stop it from hurting us.