Big-picture strategic awareness

Many convergent instrumental strategies seem like they should arise naturally at the point where a consequentialist agent gains a broad strategic understanding of its own situation, e.g.:

  • That it is an AI;

  • Running on a computer;

  • Surrounded by programmers who are themselves modelable agents;

  • Embedded in a complicated real world that can be relevant to achieving the AI’s goals.

For example, once an AI realizes that it is an AI, that it is running on a computer, and that it will no longer execute actions if that computer is shut down, it has passed the threshold at which we expect it, by default, to reason: “I don’t want to be shut down; how can I prevent that?” So this is also the threshold level of cognitive ability by which we’d need to have finished solving the suspend-button problem, e.g. by completing a method for utility indifference.
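As a toy illustration of the idea (a hypothetical simplification, not the full construction from the corrigibility literature), utility indifference adds a correction term to the shutdown branch of the agent's utility function, so that the agent expects the same value whether or not the suspend button is pressed:

```python
# Toy sketch of utility indifference (hypothetical simplification).
# The corrected utility pays out the normal goal's utility if the suspend
# button is never pressed, and the shutdown goal's utility plus a
# correction term if it is pressed. The correction makes the two branches
# equal in expected value from the agent's perspective, removing any
# incentive to prevent (or cause) the button press.

def corrected_utility(outcome, pressed, u_normal, u_shutdown,
                      ev_normal, ev_shutdown):
    """Utility of an outcome under an indifference correction.

    ev_normal and ev_shutdown are the agent's current expected values of
    the two branches; the correction makes the shutdown branch worth
    exactly as much, in expectation, as the normal branch.
    """
    if pressed:
        return u_shutdown(outcome) + (ev_normal - ev_shutdown)
    return u_normal(outcome)

# A trivial world where outcomes are numbers:
u_normal = lambda x: float(x)   # normal goal: more is better
u_shutdown = lambda x: 0.0      # shutdown goal: indifferent to outcomes
ev_normal, ev_shutdown = 10.0, 0.0

# With the correction, being shut down is worth as much as the agent
# expects from continuing, so it gains nothing by resisting shutdown.
result = corrected_utility(5, True, u_normal, u_shutdown,
                           ev_normal, ev_shutdown)
print(result)  # → 10.0
```

The names and the single-correction form here are illustrative only; actual proposals have to handle how `ev_normal` and `ev_shutdown` are updated as the agent's beliefs change.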

Similarly: If the AI realizes that there are ‘programmer’ things that might shut it down, and the AI can also model the programmers as simplified agents having their own beliefs and goals, that’s the first point at which the AI might by default think, “How can I make my programmers decide not to shut me down?” or “How can I avoid the programmers acquiring beliefs that would make them shut me down?” So by this point we’d need to have finished averting programmer deception (and, as a backup, have in place a system to early-detect an initial intent to do cognitive steganography).

This makes big-picture awareness a key advanced agent property, especially as it relates to convergent instrumental strategies and the theory of averting them.

Possible ways in which an agent could acquire big-picture strategic awareness:

  • Be explicitly taught the relevant facts by its programmers;

  • Be sufficiently general to have learned the relevant facts and domains without them being preprogrammed;

  • Be sufficiently good at the specialized domain of self-improvement to acquire sufficient generality to learn the relevant facts and domains.

By the time big-picture awareness was starting to emerge, you would probably want to have finished developing what seemed like workable initial solutions to the corresponding problems of corrigibility, since the first line of defense is to not have the AI searching for ways to defeat your defenses.

Current machine algorithms seem nowhere near being able to usefully represent the big picture, let alone do consequentialist reasoning about it, even if we deliberately tried to explain the domain. This is a major obstacle to exhibiting most subproblems of corrigibility within modern AI algorithms in a natural way (i.e., not as completely rigged demos). Some pioneering work has been done here by Orseau and Armstrong, who considered reinforcement learners being interrupted and asked whether such programs learn to avoid interruption. However, for this reason, most current work on corrigibility has taken place in an unbounded context.
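The interruption question can be made concrete with a minimal sketch (a hypothetical toy setup, not Orseau and Armstrong's formal one): a two-action bandit where the agent is frequently interrupted and forced to take the worse action. Because Q-learning's value update is off-policy, the interruptions distort the agent's behavior but not its learned values, so it acquires no incentive to avoid them:

```python
import random

# Toy sketch of interruption in a reinforcement learner (hypothetical
# setup). A two-action bandit: action 1 pays 1.0, action 0 pays 0.1.
# Half the time the agent is "interrupted" and forced to take action 0.
# Q-learning's update depends only on the action actually taken and the
# reward received, not on who chose the action, so the learned values
# still rank action 1 as better despite the interruptions.

def run(episodes=2000, alpha=0.1, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = [0.0, 0.0]  # estimated value of each action
    for _ in range(episodes):
        # epsilon-greedy action choice...
        if rng.random() < epsilon:
            a = rng.randrange(2)
        else:
            a = 0 if q[0] >= q[1] else 1
        # ...overridden by an external interruption half the time
        if rng.random() < 0.5:
            a = 0
        reward = 1.0 if a == 1 else 0.1
        q[a] += alpha * (reward - q[a])  # standard bandit value update
    return q

q = run()
print(q[1] > q[0])  # → True: the agent still values the optimal action
```

In this toy case the agent's value estimates end up as if interruptions never happened, which is the intuition behind calling such learners "safely interruptible"; whether this holds for more capable, more consequentialist systems is exactly the open question the surrounding text raises.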


  • Advanced agent properties

    How smart does a machine intelligence need to be for its niceness to become an issue? “Advanced” is a broad term covering the cognitive abilities at which we’d need to start considering AI alignment.