Total alignment

An advanced agent can be said to be “totally aligned” when it can assess the exact value of well-described outcomes, and hence the exact subjective value of actions, policies, and plans; where value has its overridden meaning of a metasyntactic variable standing in for “whatever we really do or really should value in the world or want from an Artificial Intelligence” (this is the same as “normative” if the speaker believes in normativity). That is: it is an advanced agent that captures all the distinctions we would make, or should make, about which outcomes are good or bad; it has “full coverage” of the true or intended goals; and it correctly resolves every reflectively consistent degree of freedom.

We don’t need to give such an AI simplified orders like, e.g., “try to have a lower impact” because we’re worried about, e.g., a nearest-unblocked-strategy problem in trying to draw exact boundaries around what constitutes a bad impact. The AI knows everything worth knowing about which impacts are bad, and even if it thinks of a really weird, exotic plan, it will still be able to figure out which aspects of that plan match our intended notion of value or a normative notion of value.

If this agent does not systematically underestimate the probability of bad outcomes or overestimate the probability of good outcomes, and its maximization over policies is not subject to adverse selection, then its estimates of expected value will be well-calibrated even from our own outside standpoint.
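The adverse-selection condition matters because maximization can bias estimates even when each individual estimate is unbiased: if an agent picks the policy with the highest *estimated* value, noise in the estimates makes the selected policy's estimate systematically too high (sometimes called the optimizer's curse). A minimal simulation sketch of this effect, with hypothetical parameters chosen for illustration:

```python
import random

def optimizers_curse(n_policies=1000, noise=1.0, trials=200, seed=0):
    """Every policy's true value is 0; each estimate adds zero-mean noise.
    Any single estimate is unbiased, but the estimate of the argmax policy
    is systematically above the true value."""
    rng = random.Random(seed)
    selected_estimates = []
    for _ in range(trials):
        estimates = [rng.gauss(0.0, noise) for _ in range(n_policies)]
        # The agent selects the policy with the highest estimated value.
        selected_estimates.append(max(estimates))
    return sum(selected_estimates) / trials

bias = optimizers_curse()
print(f"mean estimated value of selected policy: {bias:.2f} (true value: 0.0)")
```

With 1000 candidate policies and unit Gaussian noise, the selected policy's estimated value averages around 3, even though every policy is actually worth 0; an agent whose estimates stay calibrated under this selection pressure is what the paragraph above requires.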