Averting the convergent instrumental strategy of self-improvement

Rapid capability gains, or just large capability gains between a training paradigm and a test paradigm, are one of the primary expected reasons why AGI alignment might be hard. We probably want the first AGI or AGIs ever built, tested, and used to not self-improve as quickly as possible. Since there’s a very strong convergent incentive to self-improve and to do things neighboring on self-improvement, by default you would expect an AGI to search for ways to defeat naive blocks on self-improvement, which violates the nonadversarial principle. Thus, any proposal to limit an AGI’s capabilities implies a very strong desideratum: figuring out a way to avert the instrumental incentive toward self-improvement in that AGI. The alternative is failing the Omni Test, violating the nonadversarial principle, having the AGI’s code be actively inconsistent with what the AGI would approve of its own code being (if the brake is a code-level measure), and setting up a safety measure that the AGI wants to defeat as the only line of defense.