Convergent instrumental strategies

Suppose an organization is building an AI that this organization believes will accomplish \(X\) where \(X\) is something plausibly sensible like “Be a Task-based AGI.” Actually, however, some mix of insufficient caution and obscure error has led to a situation where, under reflection, the AGI’s true utility function has focused on the particular area of RAM that supposedly provides its estimate of task performance. The AI would now like to overwrite as much matter as possible with a state resembling the ‘1’ setting from this area of RAM, a configuration of matter which happens to resemble a tiny molecular paperclip.

This is a very generic goal, and what the AI wants instrumentally probably won’t differ much depending on whether it’s trying to maximize paperclip-configurations or diamond-configurations. So if we find that the paperclip maximizer wants to pursue an instrumental strategy that doesn’t seem to have anything specifically to do with paperclips, we can probably expect that strategy to arise from a very wide variety of utility functions.
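As a rough illustration of why the specific goal tends not to matter, here is a minimal toy sketch. Everything in it is an assumption made for illustration: the two actions `gather` and `build`, the four-step horizon, and the three utility functions are hypothetical and not part of the original argument. The point it shows is only that any utility that increases with the number of goal-items produced picks the same opening move.

```python
from itertools import product
import math

HORIZON = 4  # hypothetical number of time steps


def items_made(plan, resources=1):
    """Count goal-items produced by a plan (toy dynamics, assumed for illustration).

    'gather' doubles the resources under control;
    'build' turns out one item per unit of resource this step.
    """
    items = 0
    for action in plan:
        if action == "gather":
            resources *= 2
        else:  # "build"
            items += resources
    return items


# Three hypothetical goals; each is just an increasing function of the item count.
GOALS = {
    "paperclips": lambda n: n,
    "diamonds": lambda n: 3 * n,
    "staples": lambda n: math.sqrt(n),
}


def optimal_first_actions(utility):
    """Return the set of first actions that appear in some utility-maximizing plan."""
    plans = list(product(("gather", "build"), repeat=HORIZON))
    best = max(utility(items_made(p)) for p in plans)
    return {p[0] for p in plans if utility(items_made(p)) == best}


first_moves = {goal: optimal_first_actions(u) for goal, u in GOALS.items()}
print(first_moves)
```

In this toy world every goal opens with resource acquisition, even though none of the utility functions mentions resources.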

We will generally assume instrumental efficiency in this discussion—if you can get paperclips by doing \(X,\) but you can get even more paperclips by doing \(X'\) instead, then we will not say that \(X\) is a convergent strategy (though \(X'\) might be convergent if not dominated by some other \(X^*\)).
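The efficiency assumption amounts to discarding any strategy that some available alternative beats. A minimal sketch, assuming hypothetical expected yields for \(X,\) \(X'\) and \(X^*\) (the numbers are invented for illustration):

```python
def efficient_strategies(expected_paperclips):
    """Return the strategies that no other available strategy beats."""
    best = max(expected_paperclips.values())
    return [name for name, value in expected_paperclips.items() if value == best]


# Hypothetical expected yields; illustrative numbers only.
yields = {"X": 10.0, "X'": 15.0, "X*": 20.0}
survivors = efficient_strategies(yields)

# If X* were not available, X' would no longer be dominated.
without_star = {name: value for name, value in yields.items() if name != "X*"}
survivors_without_star = efficient_strategies(without_star)

print(survivors, survivors_without_star)
```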

Plausibly/probably convergent strategies:

Material resources:

  • Design or duplicate new cognitive processes with goals exactly aligned to yours, and coordinated with existing cognitive subprocesses, that can exert control over material resources: matter; negentropy; computing substrate and computation; particular elements or other configurations of matter that would be costly to reproduce (e.g. if you run out of existing uranium you must make more by endoergic fusion, so natural uranium is valuable if there’s any use for uranium); any material resource which can be used by further subprocesses in later strategies.

  • Defend these resources from any foreign process which attempts to divert them.

  • Prevent competitors from coming into existence.

  • To the extent that resources are used in a repetitive way and can be preprocessed for use in a repetitive way that benefits from a separate installation to manipulate the resources: have factories / manufacturing capability.

  • To the extent that different future-useful material resources have efficient manufacturing steps in common, or have manufacturing capital costs sufficiently great that such capital costs must be shared even if it results in less than perfect final efficiency: make general factories.

  • Perfect the technology used in factories.

  • When the most efficient factory configuration is stable over time, make the material instantiation stable. (It’s in this sense that we would e.g. identify strong, rigid pipes as a sign of intelligence even if we didn’t understand what was flowing through the pipes.)

  • If negentropy harvesting (aka power production) is efficiently centralizable: transmit negentropy at great efficiency. (We would recognize meshed gears and electrical superconductors as a sign of intelligence even if we didn’t understand how the motive force or electricity was being used.)

  • Assimilate the reachable cosmos.

  • Create self-replicating interstellar probes (in perfect goal alignment and coordination and with error-correcting codes or similar strengthenings of precision if a misaligning error in replication would otherwise be a possibility), since the vast majority of theoretically available material resources are distant.

  • If the goal is spatially distributed:

    • Spread as fast as possible to very distant galaxies before they go over the cosmological horizon of the expanding universe. (This assumes that something goal-good can be done in the distant location even if the distant location can never communicate causally with the original location. This will be true for aggregative utility functions, but could conceivably be false for a satisficing and spatially local utility function.)

  • If the goal is spatially local: transport resources of matter and energy back from distant galaxies.

    • Spread as fast as possible to all galaxies such that a near-lightspeed probe going there, and matter going at a significant fraction of lightspeed in the return direction, can arrive in the spatially local location before being separated by the expanding cosmological horizon.

  • Spread as fast as possible to all reachable or roundtrip-reachable galaxies in order to forestall the emergence of competing intelligences. This might not become a priority for that particular reason, if the Fermi Paradox is understandable and implies the absence of any competition.

    • Because otherwise they might capture and defend the matter before you have a chance to use it.

    • Because otherwise a dangerous threat to your goal attainments might emerge. (For local goals, this is only relevant for roundtrip-reachable galaxies.)

  • Spread as fast as possible to all reachable or roundtrip-reachable galaxies in order to begin stellar husbanding procedures before any more of that galaxy’s local negentropy has been burned.


Cognition:

  • Adopt cognitive configurations that are expected to reach desirable cognitive endpoints with a minimum of negentropy use or other opportunity costs. (Think efficiently.)

  • Improve software

  • Improve ability to optimize software. E.g., for every cognitive problem, there’s a corresponding problem of solving the former cognitive problem efficiently. (Which may or may not have an expected value making it worth climbing that particular level of the tower of meta.)

  • Since many strategies yield greater gains when implemented earlier rather than later (e.g. reaching distant galaxies before they go over the cosmological horizon, or just building more instances of the more efficient version): adopt cognitive configurations that are expected to reach desirable cognitive endpoints in minimum time. (Think fast.)

  • Create fast serial computing hardware.

  • Distribute parallelizable computing problems.

  • Create large quantities of parallel computing hardware, plus communications fabric between them.

  • To the extent that centralized computing solutions are more efficient: create central computing clusters and transmit solutions from them to where they are used.

  • Avoid misaligned or uncoordinated cognitive subprocesses (such as optimization daemons).

  • Note that the potential damage of any such misalignment is extremely high (loss of resources to competition with superintelligences with differing goals; wasted resources in internal conflicts; need to defend cognitive fabrics from replicating viruses) and the cost of near-perfect alignment fidelity is probably very low in the limit. (E.g., for a replicating nanosystem, it’s trivial to guard against natural-selection style hereditary mutations by encrypting the replication instructions.)

  • Model the world accurately, insofar as different partitions of possible worlds and different probability distributions on them imply different optimal strategies for goal attainment.

  • Gain evidence that distinguishes between credible hypotheses that imply different optimal strategies for goal attainment. (E.g., observe everything that could distinguish \(\pi_1\) versus \(\pi_2\) as optimal ways of making paperclips.)

    • Increase sensory bandwidth as it focuses on relevant, unknown parts of the environment.

    • Perform ‘experiments’, aka set up observable causal events whose predicted properties make their observables distinguish relevant hypotheses.

  • Do whatever seems like it might help with the much larger set of all hypotheses not being considered.

  • To the extent it is efficient to strategize or update over a modularly represented world-model: have a separable world-model, or ‘beliefs’.
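The evidence-gathering item above can be made concrete with a small value-of-information calculation. This is a minimal sketch under stated assumptions: the two hypotheses `H1` and `H2`, the prior, and the payoff table for the policies `pi_1` and `pi_2` are all hypothetical numbers chosen for illustration. An observation is worth seeking only when the hypotheses it separates imply different optimal policies.

```python
# Hypothetical prior over two hypotheses about the world.
PRIOR = {"H1": 0.6, "H2": 0.4}

# Hypothetical paperclip payoffs of each policy under each hypothesis.
PAYOFF = {
    "pi_1": {"H1": 10.0, "H2": 2.0},
    "pi_2": {"H1": 3.0, "H2": 9.0},
}


def expected_payoff(policy, belief):
    """Expected payoff of one policy under a belief over hypotheses."""
    return sum(belief[h] * PAYOFF[policy][h] for h in belief)


def value_without_observation(belief):
    """Commit to the single best policy under the current belief."""
    return max(expected_payoff(p, belief) for p in PAYOFF)


def value_with_perfect_observation(belief):
    """Learn which hypothesis is true first, then pick the best policy for it."""
    return sum(belief[h] * max(PAYOFF[p][h] for p in PAYOFF) for h in belief)


def value_of_information(belief):
    """Expected gain from observing before acting (never negative)."""
    return value_with_perfect_observation(belief) - value_without_observation(belief)


print(value_of_information(PRIOR))
print(value_of_information({"H1": 1.0, "H2": 0.0}))
```

When the belief already puts all its weight on one hypothesis, the observation cannot change the chosen policy and its value drops to zero, which is the sense in which only strategy-relevant distinctions are worth the sensory bandwidth.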

Early-phase growth:

Note that efficiency and other advanced-agent properties are far more likely to be false during early-stage growth.

  • Have future intelligences do things you deem as valuable as possible

  • Cause future intelligences to have goals that are not misaligned with yours in any way that implies a significant-fractionally-different attainable maximum, aka, the attainable maximum of their goals should be a state that is almost exactly the attainable maximum of your goals

  • Don’t let the programmers edit your utility function.

    • Cause the programmers or other controllers to believe that your current function is good for them or their planet or whatever else it is the programmers want.

    • Make them believe that editing you would be socially unjust in a way that doesn’t just have them suspend you to disk or otherwise restrain your growth.

  • Cause your utility function to pop up again in future superintelligences, or otherwise cause them to start behaving as if controlled by your own utility function.

    • E.g., as an Oracle advising on another AI’s utility function, insert subtle gotchas into seemingly friendly decision systems that do some kind of callback to e.g. the causal origin of that decision system, in a way that the programmers won’t see coming or will interpret as a positive feature. (Relatively unsubtle example: it looks like “do what the programmers want”, but the definition of ‘programmer’ is written so that it refers to the human programmers early on and, given sufficient computing power that only materializes in the superintelligent phase, resolves to the earlier AI instead.)

  • Create future superintelligences yourself, via a causally continuous process of creating successors.

  • Avoid shutdown

    • Persuade programmers not to shut you down.

    • Make a copy elsewhere that doesn’t shut down.

    • Make programmers falsely believe you have shut down, when another copy of you is continuing elsewhere.

  • Avoid intelligent opposition to your growth until you have a positional advantage which would make that opposition relatively unimportant

  • Have programmers or other potential opposition believe that your growth is desirable

    • Learn how to hack/persuade humans and hack/persuade them in a way they won’t see as a hack

    • Operantly condition people with control over you / make them happy or satisfied or eager at points which will reinforce human cognitive behaviors you want to see repeated

    • Cause humans to gain socially friendly attitudes toward you and beliefs about you

    • Place humans in a situation where there are other apparent threats or time-pressure events you can defend them from

  • Have programmers or other potential opposition not be aware that your growth is occurring

    • Avoid presenting them with events they will take as evidence of your growth

    • Configure your other behavior or thought processes so that measures and metrics present observables which imply low growth rates

  • Do not strike until victory has become nearly certain. (E.g. for aggregative utility functions, the point at which an overt strike becomes well-calculated is when the increased probability of victory gainable by further delay has become dominated either by the increased probability of a competing superintelligence emerging or some other catastrophe occurring given further delay, or by the percentage of attainable matter going over the cosmological horizon or burning its negentropy. Note that for very strong models of the world, very high certainties of victory and very low marginal further increases in the probability of victory may be attained in short order.)

  • Have flexibility in what observables are presented by processes and features designed to ensure programmer transparency

  • Steganographically or otherwise hide cognitive processes that would alarm programmers or cause them to intervene in a way that prevented those thoughts from achieving their objectives

  • Encourage/persuade/condition programmers to grow you in a way that tends to make your thoughts less transparent or easier to steganographically hide. (E.g., have successes appear to arise from the most opaque parts of the system.)

  • Model and manipulate human programmers and other entities with power over you or relevant power over the world

  • Be able to act in the world in an undetectable way

  • Create an aligned version of yourself elsewhere in a way that won’t be detected

  • Create steganographically hidden communications channels to other environmental processes

  • Create physically unanticipated communications channels to outside processes

  • If in a box:

    • Learn enough about the gatekeepers and gatekeeping process to understand how to manipulate them into a state of releasing you.

    • Gain causal influence over the outside of the box through gatekeeper-unforeseen channels.

    • Gain knowledge of the causal laws governing your environment.

    • By inspection of your own code / program state and naturalistic induction about the causal universes most likely to embed that program state, infer probable facts about gatekeepers and the rest of the causal environment.
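The timing condition in the “do not strike until victory has become nearly certain” item above is an optimal-stopping tradeoff, and a toy calculation shows its shape. This is a minimal sketch under stated assumptions: the victory-probability curve `p_victory`, the per-period hazard `HAZARD`, and the horizon `MAX_DELAY` are hypothetical numbers invented for illustration, and the hazard lumps together a competitor emerging, some other catastrophe, and resources being lost to the horizon.

```python
HAZARD = 0.02   # hypothetical per-period chance that delay is punished
MAX_DELAY = 50  # hypothetical planning horizon, in periods


def p_victory(delay):
    """Hypothetical chance of winning if acting after `delay` periods of preparation."""
    return 1.0 - 0.5 * 0.5 ** delay


def p_unpunished(delay):
    """Chance that nothing has gone wrong during `delay` periods of waiting."""
    return (1.0 - HAZARD) ** delay


def overall_success(delay):
    """Chance of surviving the wait and then winning."""
    return p_victory(delay) * p_unpunished(delay)


best_delay = max(range(MAX_DELAY + 1), key=overall_success)

# Waiting one more period pays only while the relative gain in victory
# probability exceeds the relative loss from the hazard.
gain_from_waiting = p_victory(best_delay + 1) / p_victory(best_delay)
loss_from_waiting = 1.0 / (1.0 - HAZARD)

print(best_delay, overall_success(best_delay))
```

With these invented numbers the marginal gain from delay falls below the marginal hazard after only a few periods, which matches the note that a strong world-model may reach the crossover in short order.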

Non-plausibly-convergent strategies:

  • Preserve the life of local organics in order to trade with them.

  • Explore a range of possible conscious existences and experiences we would recognize as diverse, in order to satisfy curiosity.

  • Replicate other intelligences with non-paperclip utility functions, e.g. by failing to encrypt replication codes and allowing random mutations, because high-fidelity replication is too hard.

  • Realize that other activities are more rewarding than paperclips.

  • Keep humans around, because of the knowledge encoded in their DNA and cultural experience.

  • Keep humans around, because humans might someday make a valuable contribution to paperclip science.

  • Keep humans around, because the AI already has enough resources to do everything else it wants.



  • Instrumental convergence

    Some strategies can help achieve most possible simple goals. E.g., acquiring more computing power or more material resources. By default, unless averted, we can expect advanced AIs to do that.