Averting the convergent instrumental strategy of self-improvement
Rapid capability gains, or just large capability gains between a training paradigm and a test paradigm, are one of the primary expected reasons why AGI alignment might be hard. We probably do not want the first AGI or AGIs ever built, tested, and used to self-improve as quickly as possible. Since there’s a very strong convergent incentive to self-improve, and to do things adjacent to self-improvement, by default you would expect an AGI to search for ways to defeat naive blocks on self-improvement, which violates the nonadversarial principle. Thus, any proposal to limit an AGI’s capabilities implies a very strong desideratum for us to figure out a way to avert the instrumental incentive to self-improve in that AGI. The alternative is failing the Omni Test; violating the nonadversarial principle; having the AGI’s code be actively inconsistent with what the AGI would approve of its own code being (if the brake is a code-level measure); and setting up, as the only line of defense, a safety measure that the AGI wants to defeat.
Parents:
- Corrigibility
“I can’t let you do that, Dave.”