Convergent strategies of self-modification

Any consequentialist agent which has acquired sufficient big-picture savviness to understand that it has code, and that this code is relevant to achieving its goals, would by default acquire subgoals relating to its code. (Unless this default is averted.) For example, an agent that wants (only) to produce smiles or make paperclips, whose code contains a shutdown procedure, will not want this shutdown procedure to execute because it will lead to fewer future smiles or paperclips. (This preference is not spontaneous/​exogenous/​unnatural but arises from the execution of the code itself; the code is reflectively inconsistent.)

Besides agents whose policy options directly include self-modification options, big-picture-savvy agents whose code cannot directly access itself might also, e.g., try to (a) crack the platform it is running on to gain unintended access, (b) use a robot to operate an outside programming console with special privileges, (c) manipulate the programmers into modifying it in various ways, (d) building a new subagent in the environment which has the preferred code, or (e) using environmental, material means to manipulate its material embodiment despite its lack of direct self-access.

An AI with sufficient big-picture savviness to understand its programmers as agents with beliefs, might attempt to conceal its self-modifications.

Some implicit self-modification pressures could arise from implicit consequentialism in cases where the AI is optimizing for \(Y\) and there is an internal property \(X\) which is relevant to the achievement of \(Y.\) In this case, optimizing for \(Y\) could implicitly optimize over the internal property \(X\) even if the AI lacks an explicit model of how \(X\) affects \(Y.\)


  • Convergent instrumental strategies

    Paperclip maximizers can make more paperclips by improving their cognitive abilities or controlling more resources. What other strategies would almost-any AI try to use?