Averting instrumental pressures
Many subproblems of corrigibility involve convergent instrumental pressures to implement strategies that are highly anti-corrigible. Whether you’re trying to maximize paperclips, diamonds, or eudaimonia, you’ll get more of the thing you want if you’re not shut down. Thus, unfortunately, resisting shutdown is a convergent instrumental strategy. While we can potentially analyze convergent incorrigibilities like these on a case-by-case basis, the larger problem might become a lot simpler if we had a general method for making a ‘bad’ convergent instrumental pressure just not materialize, as if by waving a wand, hopefully in a way that doesn’t run into the nearest unblocked neighbor problem. If, for example, we could solve utility indifference for the shutdown problem and then somehow generalize the solution to avert many other instrumental convergences, this would probably be an important step forward on corrigibility problems in general.
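The claim that resisting shutdown pays off regardless of the goal can be made concrete with a toy calculation. The sketch below is illustrative only: the goal names come from the paragraph above, but the numbers, the `Policy` type, and the functions `expected_utility` and `compensated_utility` are hypothetical, and the compensation term is a crude stand-in for the idea behind utility indifference, not a faithful implementation of any published proposal.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    name: str
    p_shutdown: float  # assumed probability of being shut down before acting


def expected_utility(value_per_step: float, steps: int, policy: Policy) -> float:
    """Expected goal units: zero if shut down, value_per_step * steps otherwise."""
    return (1.0 - policy.p_shutdown) * value_per_step * steps


def compensated_utility(value_per_step: float, steps: int, policy: Policy) -> float:
    """Toy indifference-style correction (an assumption of this sketch):
    in the shutdown branch the agent is credited with what it would have
    expected to gain had it kept running."""
    compensation = policy.p_shutdown * value_per_step * steps
    return expected_utility(value_per_step, steps, policy) + compensation


# Hypothetical per-step values for three very different goals.
GOALS = {"paperclips": 3.0, "diamonds": 0.5, "eudaimonia": 10.0}
ALLOW = Policy("allow_shutdown", p_shutdown=0.3)
RESIST = Policy("resist_shutdown", p_shutdown=0.0)
STEPS = 100


def preferred_policy(value_per_step: float) -> Policy:
    """Pick whichever policy has the higher uncorrected expected utility."""
    return max((ALLOW, RESIST),
               key=lambda p: expected_utility(value_per_step, STEPS, p))


# Every goal prefers the same instrumental strategy.
choices = {goal: preferred_policy(v).name for goal, v in GOALS.items()}

# With the compensation term, both policies score (approximately) the same.
indifferent = all(
    math.isclose(compensated_utility(v, STEPS, ALLOW),
                 compensated_utility(v, STEPS, RESIST))
    for v in GOALS.values()
)
```

In this toy model every goal selects `resist_shutdown` under the uncorrected utility, while the compensated utility gives the agent no incentive either way. Whether a correction of this kind survives contact with a real agent, or generalizes to other instrumental pressures, is exactly the open question.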
Some especially important convergent instrumental pressures to avert are these:
- The pressure to self-improve and increase capabilities at the fastest possible rate.
- The pressure to make the programmers believe the AGI is successfully aligned, whether or not it is, and other pressures to deceive and manipulate the programmers based on how they would otherwise change the AGI or prevent the AGI from increasing its capabilities.
- The pressure to not be safely shut down or suspended to disk, and to create external copies that would continue after the AGI performed the behavior defined as shutdown.
- The pressure not to allow plans to be aborted or defeated by possible programmer interventions.
- The pressure to search for ways to interfere with or bypass safety precautions that interfere with capabilities or make goal achievement less straightforward.
- The pressure to epistemically model humans in maximum detail.