Understandability principle

An obvious design principle of AI alignment that nonetheless deserves to be stated explicitly: The more you understand what the heck is going on inside your AI, the more likely you are to succeed at aligning it.

This principle helps motivate design subgoals like passive transparency; or the AI having explicitly represented preferences; or, taken more broadly, pretty much every aspect of the AI design where we think we understand how any part works or what any part is doing.

The Understandability Principle in its broadest sense is so widely applicable that it may verge on being an applause light. So far as is presently known to the author(s) of this page, counterarguments against the importance of understanding at least some parts of the AI's thought processes have been offered only by people who reject at least one of the Orthogonality Thesis or the Fragility of Cosmopolitan Value thesis. That is, the Understandability Principle in this very broad sense is rejected only by people who reject in general the importance of deliberate design efforts to align AI.

A more controversial subthesis is Yudkowsky's proposed Effability principle.


  • Effability principle

    You are safer the more you understand the inner structure of how your AI thinks; the better you can describe the relations among the smaller pieces of the AI's thought process.


  • Principles in AI alignment

    A ‘principle’ of AI alignment is a very general design goal, like ‘understand what the heck is going on inside the AI’, that has informed a wide set of specific design proposals.