Understandability principle

An obvious design principle of AI alignment that nonetheless deserves to be stated explicitly: The more you understand what the heck is going on inside your AI, the more likely you are to succeed at aligning it.

This principle participates in motivating design subgoals like passive transparency; or the AI having explicitly represented preferences; or, taken more broadly, pretty much every aspect of the AI design where we think we understand how any part works or what any part is doing.

The Understandability Principle in its broadest sense is so widely applicable that it may verge on being an applause light. So far as is presently known to the author(s) of this page, counterarguments against the importance of understanding at least some parts of the AI’s thought processes, have been offered only by people who reject at least one of the Orthogonality Thesis or the Fragility of Cosmopolitan Value thesis. That is, the Understandability Principle in this very broad sense is rejected only by people who reject in general the importance of deliberate design efforts to align AI.

A more controversial subthesis is Yudkowsky’s proposed Effability principle.