Alternative introductions

Steve Omohundro: “The Basic AI Drives”
Nick Bostrom: “The Superintelligent Will: Motivation and Instrumental Rationality in Advanced Artificial Agents”.

Introduction: A machine of unknown purpose

Suppose you landed on a distant planet and found a structure of giant metal pipes, crossed by occasional cables. Further investigation shows that the cables are electrical superconductors carrying high-voltage currents.

You might not know what the huge structure did. But you would nonetheless guess that this huge structure had been built by some intelligence, rather than being a naturally-occurring mineral formation—that there were aliens who built the structure for some purpose.

Your reasoning might go something like this: “Well, I don’t know if the aliens were trying to manufacture cars, or build computers, or what. But if you consider the problem of efficient manufacturing, it might involve mining resources in one place and then efficiently transporting them somewhere else, like by pipes. Since the most efficient size and location of these pipes would be stable, you’d want the shape of the pipes to be stable, which you could do by making the pipes out of a hard material like metal. There’s all sorts of operations that require energy or negentropy, and a superconducting cable carrying electricity seems like an efficient way of transporting that energy. So I don’t know what the aliens were ultimately trying to do, but across a very wide range of possible goals, an intelligent alien might want to build a superconducting cable to pursue that goal.”

That is: We can take an enormous variety of compactly specifiable goals, like “travel to the other side of the universe” or “support biological life” or “make paperclips”, and find very similar optimal strategies along the way. Today we don’t actually know if electrical superconductors are the most useful way to transport energy in the limit of technology. But whatever is the most efficient way of transporting energy, whether that’s electrical superconductors or something else, the most efficient form of that technology would probably not vary much depending on whether you were trying to make diamonds or make paperclips.

Or to put it another way: If you consider the goals “make diamonds” and “make paperclips”, then they might have almost nothing in common with respect to their end-states—a diamond might contain no iron. But the earlier strategies used to make a lot of diamond and make a lot of paperclips might have much in common; “the best way of transporting energy to make diamond” and “the best way of transporting energy to make paperclips” are much more likely to be similar.

From a Bayesian standpoint this is how we can identify a huge machine strung with superconducting cables as having been produced by high-technology aliens, even before we have any idea of what the machine does. We’re saying, “This looks like the product of optimization, a strategy $X$ that the aliens chose to best achieve some unknown goal $Y$; we can infer this even without knowing $Y$ because many possible $Y$-goals would concentrate probability into this $X$-strategy being used.”

Convergence and its caveats

When you select policy $\pi_k$ because you expect it to achieve a later state $Y_k$ (the “goal”), we say that $\pi_k$ is your instrumental strategy for achieving $Y_k.$ The observation of “instrumental convergence” is that a widely different range of $Y$-goals can lead into highly similar $\pi$-strategies. (This becomes truer as the $Y$-seeking agent becomes more instrumentally efficient; two very powerful chess engines are more likely to solve a humanly solvable chess problem the same way, compared to two weak chess engines whose individual quirks might result in idiosyncratic solutions.)

If there’s a simple way of classifying possible strategies $\Pi$ into partitions $X \subset \Pi$ and $\neg X \subset \Pi$, and you think that for most compactly describable goals $Y_k$ the corresponding best policies $\pi_k$ are likely to be inside $X,$ then you think $X$ is a “convergent instrumental strategy”.

In other words, if you think that a superintelligent paperclip maximizer, diamond maximizer, a superintelligence that just wanted to keep a single button pressed for as long as possible, and a superintelligence optimizing for a flourishing intergalactic civilization filled with happy sapient beings, would all want to “transport matter and energy efficiently” in order to achieve their other goals, then you think “transport matter and energy efficiently” is a convergent instrumental strategy.

In this case “paperclips”, “diamonds”, “keeping a button pressed as long as possible”, and “sapient beings having fun”, would be the goals $Y_1, Y_2, Y_3, Y_4.$ The corresponding best strategies $\pi_1, \pi_2, \pi_3, \pi_4$ for achieving these goals would not be identical—the policies for making paperclips and diamonds are not exactly the same. But all of these policies (we think) would lie within the partition $X \subset \Pi$ where the superintelligence tries to “transport matter and energy efficiently” (perhaps by using superconducting cables), rather than the complementary partition $\neg X$ where the superintelligence does not try to transport matter and energy efficiently.

Semiformalization

Consider the set of computable and tractable utility functions $\mathcal U_C$ that take an outcome $o,$ described in some language $\mathcal L$, onto a rational number $r$. That is, we suppose:
That the relation $U_k$ between descriptions $o_\mathcal L$ of outcomes $o$, and the corresponding utilities $r,$ is computable;
Furthermore, that it can be computed in realistically bounded time;
Furthermore, that the $U_k$ relation between $o$ and $r$, and the $\mathbb P [o | \pi_i]$ relation between policies and subjectively expected outcomes, are together regular enough that a realistic amount of computing power makes it possible to search for policies $\pi$ that are yield high expected $U_k(o)$.
Choose some simple programming language $\mathcal P,$ such as the language of Turing machines, or Python 2 without most of the system libraries.
Choose a simple mapping $\mathcal P_B$ from $\mathcal P$ onto bitstrings.
Take all programs in $\mathcal P_B$ between 20 and 1000 bits in length, and filter them for boundedness and tractability when treated as utility functions, to obtain the filtered set $U_K$.
Set 90% as an arbitrary threshold.

If, given our beliefs $\mathbb P$ about our universe and which policies lead to which real outcomes, we think that in an intuitive sense it sure looks like at least 90% of the utility functions $U_k \in U_K$ ought to imply best findable policies $\pi_k$ which lie within the partition $X$ of $\Pi,$ we’ll allege that $X$ is “instrumentally convergent”.

Compatibility with Vingean uncertainty

Vingean uncertainty is the observation that, as we become increasingly confident of increasingly powerful intelligence from an agent with precisely known goals, we become decreasingly confident of the exact moves it will make (unless the domain has an optimal strategy and we know the exact strategy). E.g., to know exactly where Deep Blue would move on a chessboard, you would have to be as good at chess as Deep Blue. However, we can become increasingly confident that more powerful chessplayers will eventually win the game—that is, steer the future outcome of the chessboard into the set of states designated ‘winning’ for their color—even as it becomes less possible for us to be certain about the chessplayer’s exact policy.

Instrumental convergence can be seen as a caveat to Vingean uncertainty: Even if we don’t know the exact actions or the exact end goal, we may be able to predict that some intervening states or policies will fall into certain abstract categories.

That is: If we don’t know whether a superintelligent agent is a paperclip maximizer or a diamond maximizer, we can still guess with some confidence that it will pursue a strategy in the general class “obtain more resources of matter, energy, and computation” rather than “don’t get more resources”. This is true even though Vinge’s Principle says that we won’t be able to predict exactly how the superintelligence will go about gathering matter and energy.

Imagine the real world as an extremely complicated game. Suppose that at the very start of this game, a highly capable player must make a single binary choice between the abstract moves “Gather more resources later” and “Never gather any more resources later”. Vingean uncertainty or not, we seem justified in putting a high probability on the first move being preferred—a binary choice is simple enough that we can take a good guess at the optimal play.

Convergence supervenes on consequentialism

$X$ being “instrumentally convergent” doesn’t mean that every mind needs an extra, independent drive to do $X.$

Consider the following line of reasoning: “It’s impossible to get on an airplane without buying plane tickets. So anyone on an airplane must be a sort of person who enjoys buying plane tickets. If I offer them a plane ticket they’ll probably buy it, because this is almost certainly somebody who has an independent motivational drive to buy plane tickets. There’s just no way you can design an organism that ends up on an airplane unless it has a buying-tickets drive.”

The appearance of an “instrumental strategy” can be seen as implicit in repeatedly choosing actions $\pi_k$ that lead into a final state $Y_k,$ and it so happens that $\pi_k \in X$. There doesn’t have to be a special $X$-module which repeatedly selects $\pi_X$-actions regardless of whether or not they lead to $Y_k.$

The flaw in the argument about plane tickets is that human beings are consequentialists who buy plane tickets just because they wanted to go somewhere and they expected the action “buy the plane ticket” to have the consequence, in that particular case, of going to the particular place and time they wanted to go. No extra “buy the plane ticket” module is required, and especially not a plane-ticket-buyer that doesn’t check whether there’s any travel goal and whether buying the plane ticket leads into the desired later state.

More semiformally, suppose that $U_k$ is the utility function of an agent and let $\pi_k$ be the policy it selects. If the agent is instrumentally efficient relative to us at achieving $U_k,$ then from our perspective we can mostly reason about whatever kind of optimization it does as if it were expected utility maximization, i.e.:

$$\pi_k = \underset{\pi_i \in \Pi}{\operatorname{argmax}} \mathbb E [ U_k | \pi_i ]$$

When we say that $X$ is instrumentally convergent, we are stating that it probably so happens that:

$$\big ( \underset{\pi_i \in \Pi}{\operatorname{argmax}} \mathbb E [ U_k | \pi_i ] \big ) \in X$$

We are not making any claims along the lines that for an agent to thrive, its utility function $U_k$ must decompose into a term for $X$ plus a residual term $V_k$ denoting the rest of the utility function. Rather, $\pi_k \in X$ is the mere result of unbiased optimization for a goal $U_k$ that makes no explicit mention of $X.$

(This doesn’t rule out that some special cases of AI development pathways might tend to produce artificial agents with a value function $U_e$ which does decompose into some variant $X_e$ of $X$ plus other terms $V_e.$ For example, natural selection on organisms that spend a long period of time as non-consequentialist policy-reinforcement-learners, before they later evolve into consequentialists, has had results along these lines in the case of humans. For example, humans have an independent, separate “curiosity” drive, instead of just valuing information as a means to inclusive genetic fitness.)

Required advanced agent properties

Distinguishing the advanced agent properties that seem probably required for an AI program to start exhibiting the sort of reasoning filed under “instrumental convergence”, the most obvious candidates are:

Sufficiently powerful consequentialism (or pseudoconsequentialism); plus
Understanding the relevant aspects of the big picture that connect later goal achievement to executing the instrumental strategy.

That is: You don’t automatically see “acquire more computing power” as a useful strategy unless you understand “I am a cognitive program and I tend to achieve more of my goals when I run on more resources.” Alternatively, e.g., the programmers adding more computing power and the system’s goals starting to be achieved better, after which related policies are positively reinforced and repeated, could arrive at a similar end via the pseudoconsequentialist idiom of policy reinforcement.

The advanced agent properties that would naturally or automatically lead to instrumental convergence seem well above the range of modern AI programs. As of 2016, current machine learning algorithms don’t seem to be within the range where this predicted phenomenon should start to be visible.

Caveats

An instrumental convergence claim is about a default or a majority of cases, not a universal generalization.

If for whatever reason your goal is to “make paperclips without using any superconductors”, then superconducting cables will not be the best instrumental strategy for achieving that goal.

Any claim about instrumental convergence says at most, “The vast majority of possible goals $Y$ would convergently imply a strategy in $X,$ by default and unless otherwise averted by some special case $Y_i$ for which strategies in $\neg X$ are better.”

See also the more general idea that the space of possible minds is very large. Universal claims about all possible minds have many chances to be false, while existential claims “There exists at least one possible mind such that…” have many chances to be true.

If some particular oak tree is extremely important and valuable to you, then you won’t cut it down to obtain wood. It is irrelevant whether a majority of other utility functions that you could have, but don’t actually have, would suggest cutting down that oak tree.

Convergent strategies are not deontological rules.

Imagine looking at a machine chess-player and reasoning, “Well, I don’t think the AI will sacrifice its pawn in this position, even to achieve a checkmate. Any chess-playing AI needs a drive to be protective of its pawns, or else it’d just give up all its pawns. It wouldn’t have gotten this far in the game in the first place, if it wasn’t more protective of its pawns than that.”

Modern chess algorithms behave in a fashion that most humans can’t distinguish from expected-checkmate-maximizers. That is, from your merely human perspective, watching a single move at the time it happens, there’s no visible difference between your subjective expectation for the chess algorithm’s behavior, and your expectation for the behavior of an oracle that always output the move with the highest conditional probability of leading to checkmate. If you, a human, you could discern with your unaided eye some systematic difference like “this algorithm protects its pawn more often than checkmate-achievement would imply”, you would know how to make systematically better chess moves; modern machine chess is too superhuman for that.

Often, this uniform rule of output-the-move-with-highest-probability-of-eventual-checkmate will seem to protect pawns, or not throw away pawns, or defend pawns when you attack them. But if in some special case the highest probability of checkmate is instead achieved by sacrificing a pawn, the chess algorithm will do that instead.

Semiformally:

The reasoning for an instrumental convergence claim says that for many utility functions $U_k$ and situations $S_i$ a $U_k$-consequentialist in situation $S_i$ will probably find some best policy $\pi_k = \underset{\pi_i \in \Pi}{\operatorname{argmax}} \mathbb E [ U_k | S_i, \pi_i ]$ that happens to be inside the partition $X$. If instead in situation $S_k$…

$$\big ( \underset{\pi_i \in X}{\operatorname{argmax}} \mathbb E [ U_k | S_k, \pi_i ] \big ) \ < \ \big ( \underset{\pi_i \in \neg X}{\operatorname{argmax}} \mathbb E [ U_k | S_k, \pi_i ] \big )$$

…then a $U_k$-consequentialist in situation $S_k$ won’t do any $\pi_i \in X$ even if most other scenarios $S_i$ make $X$-strategies prudent.

“$X$ would help accomplish $Y$” is insufficient to establish a claim of instrumental convergence on $X$.

Suppose you want to get to San Francisco. You could get to San Francisco by paying me $20,000 for a plane ticket. You could also get to San Francisco by paying someone else $400 for a plane ticket, and this is probably the smarter option for achieving your other goals.

Establishing “Compared to doing nothing, $X$ is more useful for achieving most $Y$-goals” doesn’t establish $X$ as an instrumental strategy. We need to believe that there’s no other policy in $\neg X$ which would be more useful for achieving most $Y.$

When $X$ is phrased in very general terms like “acquire resources”, we might reasonably guess that “don’t acquire resources” or “do $Y$ without acquiring any resources” is indeed unlikely to be a superior strategy. If $X_i$ is some narrower and more specific strategy, like “acquire resources by mining them using pickaxes”, it’s much more likely that some other strategy $X_k$ or even a $\neg X$-strategy is the real optimum.

That said, if we can see how a narrow strategy $X_i$ helps most $Y$-goals to some large degree, then we should expect the actual policy deployed by an efficient $Y_k$-agent to obtain at least as much $Y_k$ as would $X_i.$

That is, we can reasonably argue: “By following the straightforward strategy ‘spread as far as possible, absorb all reachable matter, and turn it into paperclips’, an initially unopposed superintelligent paperclip maximizer could obtain $10^{55}$ paperclips. Then we should expect an initially unopposed superintelligent paperclip maximizer to get at least this many paperclips, whatever it actually does. Any strategy in the opposite partition ‘do not spread as far as possible, absorb all reachable matter, and turn it into paperclips’ must seem to yield more than $10^{55}$ paperclips, before we should expect a paperclip maximizer to do that.”

Similarly, a claim of instrumental convergence on $X$ can be ceteris paribus refuted by presenting some alternate narrow strategy $W_j \subset \neg X$ which seems to be more useful than any obvious strategy in $X.$ We are then not positively confident of convergence on $W_j,$ but we should assign very low probability to the alleged convergence on $X,$ at least until somebody presents an $X$-exemplar with higher expected utility than $W_j.$ If the proposed convergent strategy is “trade economically with other humans and obey existing systems of property rights,” and we see no way for Clippy to obtain $10^{55}$ paperclips under those rules, but we do think Clippy could get $10^{55}$ paperclips by expanding as fast as possible without regard for human welfare or existing legal systems, then we can ceteris paribus reject “obey property rights” as convergent. Even if trading with humans to make paperclips produces more paperclips than doing nothing, it may not produce the most paperclips compared to converting the material composing the humans into more efficient paperclip-making machinery.

Claims about instrumental convergence are not ethical claims.

Whether $X$ is a good way to get both paperclips and diamonds is irrelevant to whether $X$ is good for human flourishing or eudaimonia or fun-theoretic optimality or extrapolated volition or whatever. Whether $X$ is, in an intuitive sense, “good”, needs to be evaluated separately from whether it is instrumentally convergent.

In particular: instrumental strategies are not terminal values. In fact, they have a type distinction from terminal values. “If you’re going to spend resources on thinking about technology, try to do it earlier rather than later, so that you can amortize your invention over more uses” seems very likely to be an instrumentally convergent exploration-exploitation strategy; but “spend cognitive resources sooner rather than later” is more a feature of policies rather than a feature of utility functions. It’s definitely not plausible in a pretheoretic sense as the Meaning of Life. So a partition into which most instrumental best-strategies fall, is not like a universally convincing utility function (which you probably shouldn’t look for in the first place).

Similarly: The natural selection process that produced humans gave us many independent drives $X_e$ that can be viewed as special variants of some convergent instrumental strategy $X.$ A pure paperclip maximizer would calculate the value of information (VoI) for learning facts that could lead to it making more paperclips; we can see learning high-value facts as a convergent strategy $X$. In this case, human “curiosity” can be viewed as the corresponding emotion $X_e.$ This doesn’t mean that the true purpose of $X_e$ is $X$ any more than the true purpose of $X_e$ is “make more copies of the allele coding for $X_e$” or “increase inclusive genetic fitness”. That line of reasoning probably results from a mind projection fallacy on ‘purpose’.

Claims about instrumental convergence are not futurological predictions.

Even if, e.g., “acquire resources” is an instrumentally convergent strategy, this doesn’t mean that we can’t as a special case deliberately construct advanced AGIs that are not driven to acquire as many resources as possible. Rather the claim implies, “We would need to deliberately build $X$-averting agents as a special case, because by default most imaginable agent designs would pursue a strategy in $X.$”

Of itself, this observation makes no further claim about the quantitative probability that, in the real world, AGI builders might want to build $\neg X$-agents, might try to build $\neg X$-agents, and might succeed at building $\neg X$-agents.

A claim about instrumental convergence is talking about a logical property of the larger design space of possible agents, not making a prediction what happens in any particular research lab. (Though the ground facts of computer science are relevant to what happens in actual research labs.)

For discussion of how instrumental convergence may in practice lead to foreseeable difficulties of AGI alignment that resist most simple attempts at fixing them, see the articles on Patch resistance and Nearest unblocked strategy.

Central example: Resource acquisition

One of the convergent strategies originally proposed by Steve Omohundro in “The Basic AI Drives” was resource acquisition:

“All computation and physical action requires the physical resources of space, time, matter, and free energy. Almost any goal can be better accomplished by having more of these resources.”

We’ll consider this example as a template for other proposed instrumentally convergent strategies, and run through the standard questions and caveats.

• Question: Is this something we’d expect a paperclip maximizer, diamond maximizer, and button-presser to do? And while we’re at it, also a flourishing-intergalactic-civilization optimizer?

Paperclip maximizers need matter and free energy to make paperclips.
Diamond maximizers need matter and free energy to make diamonds.
If you’re trying to maximize the probability that a single button stays pressed as long as possible, you would build fortresses protecting the button and energy stores to sustain the fortress and repair the button for the longest possible period of time.
Nice superintelligences trying to build happy intergalactic civilizations full of flourishing sapient minds, can build marginally larger civilizations with marginally more happiness and marginally longer lifespans given marginally more resources.

To put it another way, for a utility function $U_k$ to imply the use of every joule of energy, it is a sufficient condition that for every plan $\pi_i$ with expected utility $\mathbb E [ U | \pi_i ],$ there is a plan $\pi_j$ with $\mathbb E [ U | \pi_j ] > \mathbb E [ U | \pi_i]$ that uses one more joule of energy:

For every plan $\pi_i$ that makes paperclips, there’s a plan $\pi_j$ that would make more expected paperclips if more energy were available and acquired.
For every plan $\pi_i$ that makes diamonds, there’s a plan $\pi_j$ that makes slightly more diamond given one more joule of energy.
For every plan $\pi_i$ that produces a probability $\mathbb P (press | \pi_i) = 0.999...$ of a button being pressed, there’s a plan $\pi_j$ with a slightly higher probability of that button being pressed $\mathbb P (press | \pi_j) = 0.9999...$ which uses up the mass-energy of one more star.
For every plan that produces a flourishing intergalactic civilization, there’s a plan which produces slightly more flourishing given slightly more energy.

• Question: Is there some strategy in $\neg X$ which produces higher $Y_k$-achievement for most $Y_k$ than any strategy inside $X$?

Suppose that by using most of the mass-energy in most of the stars reachable before they go over the cosmological horizon as seen from present-day Earth, it would be possible to produce $10^{55}$ paperclips (or diamonds, or probability-years of expected button-stays-pressed time, or QALYs, etcetera).

It seems reasonably unlikely that there is a strategy inside the space intuitively described by “Do not acquire more resources” that would produce $10^{60}$ paperclips, let alone that the strategy producing the most paperclips would be inside this space.

We might be able to come up with a weird special-case situation $S_w$ that would imply this. But that’s not the same as asserting, “With high subjective probability, in the real world, the optimal strategy will be in $\neg X$.” We’re concerned with making a statement about defaults given the most subjectively probable background states of the universe, not trying to make a universal statement that covers every conceivable possibility.

To put it another way, if your policy choices or predictions are only safe given the premise that “In the real world, the best way of producing the maximum possible number of paperclips involves not acquiring any more resources”, you need to clearly flag this as a load-bearing assumption.

• Caveat: The claim is not that every possible goal can be better-accomplished by acquiring more resources.

As a special case, this would not be true of an agent with an impact penalty term in its utility function, or some other low-impact agent, if that agent also only had goals of a form that could be satisfied inside bounded regions of space and time with a bounded effort.

We might reasonably expect this special kind of agent to only acquire the minimum resources to accomplish its task.

But we wouldn’t expect this to be true in a majority of possible cases inside mind design space; it’s not true by default; we need to specify a further fact about the agent to make the claim not be true; we must expend engineering effort to make an agent like that, and failures of this effort will result in reversion-to-default. If we imagine some computationally simple language for specifying utility functions, then most utility functions wouldn’t happen to have both of these properties, so a majority of utility functions given this language and measure would not by default try to use fewer resources.

• Caveat: The claim is not that well-functioning agents must have additional, independent resource-acquiring motivational drives.

A paperclip maximizer will act like it is “obtaining resources” if it merely implements the policy it expects to lead to the most paperclips. Clippy does not need to have any separate and independent term in its utility function for the amount of resource it possesses (and indeed this would potentially interfere with Clippy making paperclips, since it might then be tempted to hold onto resources instead of making paperclips with them).

• Caveat: The claim is not that most agents will behave as if under a deontological imperative to acquire resources.

A paperclip maximizer wouldn’t necessarily tear apart a working paperclip factory to “acquire more resources” (at least not until that factory had already produced all the paperclips it was going to help produce.)

• Check: Are we arguing “Acquiring resources is a better way to make a few more paperclips than doing nothing” or “There’s no better/best way to make paperclips that involves not acquiring more matter and energy”?

As mentioned above, the latter seems reasonable in this case.

• Caveat: “Acquiring resources is instrumentally convergent” is not an ethical claim.

The fact that a paperclip maximizer would try to acquire all matter and energy within reach, does not of itself bear on whether our own normative values might perhaps command that we ought to use few resources as a terminal value.

(Though some of us might find pretty compelling the observation that if you leave matter lying around, it sits around not doing anything and eventually the protons decay or the expanding universe tears it apart, whereas if you turn the matter into people, it can have fun. There’s no rule that instrumentally convergent strategies don’t happen to be the right thing to do.)

• Caveat: “Acquiring resources is instrumentally convergent” is not of itself a futurological prediction.

See above. Maybe we try to build Task AGIs instead. Maybe we succeed, and Task AGIs don’t consume lots of resources because they have well-bounded tasks and impact penalties.

Relevance to the larger field of value alignment theory

The list of arguably convergent strategies has its own page. However, some of the key strategies that have been argued as convergent in e.g. Omohundro’s “The Basic AI Drives” and Bostrom’s “The Superintelligent Will: Motivation and Instrumental Rationality in Advanced Artificial Agents” include:

Acquiring/controlling matter and energy.
Ensuring that future intelligences with similar goals exist. E.g., a paperclip maximizer wants the future to contain powerful, effective intelligences that maximize paperclips.
An important special case of this general rule is self-preservation.
Another special case of this rule is protecting goal-content integrity (not allowing accidental or deliberate modification of the utility function).
Learning about the world (so as to better manipulate it to make paperclips).
Carrying out relevant scientific investigations.
Optimizing technology and designs.
Engaging in an “exploration” phase of seeking optimal designs before an “exploitation” phase of using them.
Thinking effectively (treating the cognitive self as an improvable technology).
Improving cognitive processes.
Acquiring computing resources for thought.

This is relevant to some of the central background ideas in AGI alignment, because:

A superintelligence can have a catastrophic impact on our world even if its utility function contains no overtly hostile terms. A paperclip maximizer doesn’t hate you, it just wants paperclips.
A consequentialist AGI with sufficient big-picture understanding will by default want to promote itself to a superintelligence, even if the programmers did not explicitly program it to want to self-improve. Even a pseudoconsequentialist may e.g. repeat strategies that led to previous cognitive capability gains.

This means that programmers don’t have to be evil, or even deliberately bent on creating superintelligence, in order for their work to have catastrophic consequences.

The list of convergent strategies, by its nature, tends to include everything an agent needs to survive and grow. This supports strong forms of the Orthogonality Thesis being true in practice as well as in principle. We don’t need to filter on agents with explicit terminal values for e.g. “survival” in order to find surviving powerful agents.

Instrumental convergence is also why we expect to encounter most of the problems filed under Corrigibility. When the AI is young, it’s less likely to be instrumentally efficient or understand the relevant parts of the bigger picture; but once it does, we would by default expect, e.g.:

That the AI will try to avoid being shut down.
That it will try to build subagents (with identical goals) in the environment.
That the AI will resist modification of its utility function.
That the AI will try to avoid the programmers learning facts that would lead them to modify the AI’s utility function.
That the AI will try to pretend to be friendly even if it is not.
That the AI will try to conceal hostile thoughts (and the fact that any concealed thoughts exist).

This paints a much more effortful picture of AGI alignment work than “Oh, well, we’ll just test it to see if it looks nice, and if not, we’ll just shut off the electricity.”

The point that some undesirable behaviors are instrumentally convergent gives rise to the Nearest unblocked strategy problem. Suppose the AGI’s most preferred policy starts out as one of these incorrigible behaviors. Suppose we currently have enough control to add patches to the AGI’s utility function, intended to rule out the incorrigible behavior. Then, after integrating the intended patch, the new most preferred policy may be the most similar policy that wasn’t explicitly blocked. If you naively give the AI a term in its utility function for “having an off-switch”, it may still build subagents or successors that don’t have off-switches. Similarly, when the AGI becomes more powerful and its option space expands, it’s again likely to find new similar policies that weren’t explicitly blocked.

Thus, instrumental convergence is one of the two basic sources of patch resistance as a foreseeable difficulty of AGI alignment work.

write a tutorial for the central example of a paperclip maximizer distinguish that the proposition is convergent pressure, not convergent decision the commonly suggested instrumental convergences separately: figure out the ‘problematic instrumental pressures’ list for Corrigibility separately: explain why instrumental pressures may be patch-resistant especially in self-modifying consequentialists

Instrumental convergence