Instrumental goals are almost-equally as tractable as terminal goals

One counterargument to the Orthogonality Thesis asserts that agents with terminal preferences for goals such as resource acquisition will always be much better at those goals than agents which merely try to acquire resources on the way to doing something else, like making paperclips. Therefore, by filtering on real-world competent agents, we filter out all agents which do not have terminal preferences for acquiring resources.

A reply is that “figuring out how to do \(W_4\) on the way to \(W_3\), on the way to \(W_2\), on the way to \(W_1\), without that particular way of doing \(W_4\) stomping on your ability to later achieve \(W_2\)” is such a ubiquitous idiom of cognition or supercognition that (a) any competent agent must already do that all the time, and (b) it doesn’t seem like adding one more straightforward target \(W_0\) to the end of the chain should usually result in greatly increased computational costs or greatly diminished ability to optimize \(W_4\).

E.g., contrast the necessary thoughts of a paperclip maximizer acquiring resources in order to turn them into paperclips with those of an agent that has a terminal goal of acquiring and hoarding resources.

The paperclip maximizer has a terminal utility function \(U_0\) which counts the number of paperclips in the universe (or rather, paperclip-seconds in the universe’s history). The paperclip maximizer then identifies a sequence of subgoals and sub-subgoals \(W_1, W_2, W_3, \ldots, W_N\) corresponding to increasingly fine-grained strategies for making paperclips, each of which is subject to the constraint that it doesn’t stomp on the previous elements of the goal hierarchy. (For simplicity of exposition we temporarily pretend that each goal has only one subgoal rather than a family of conjunctive and disjunctive subgoals.)

More concretely, we can imagine that \(W_1\) is “get matter under my control (in a way that doesn’t stop me from making paperclips with it)”. That is, if we consider the naive or unconditional description \(W_1'\), “get matter under my ‘control’ (whether or not I can make paperclips with it)”, we are here interested in a subset of states \(W_1 \subset W_1'\) such that \(\mathbb E[U_0|W_1]\) is high. Then \(W_2\) might be “explore the universe to find matter (in such a way that it doesn’t interfere with bringing that matter under control or turning it into paperclips)”, \(W_3\) might be “build interstellar probes (in such a way that …)”, and as we go further into the hierarchy we will find \(W_{10}\) “gather all the materials for an interstellar probe in one place (in such a way that …)”, \(W_{20}\) “lay the next 20 sets of rails for transporting the titanium cart”, and \(W_{25}\) “move the left controller upward”.
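The distinction between \(W_1'\) and \(W_1\) can be sketched as a toy filter. This is only an illustration under invented assumptions: the strategy names, the numbers, the conversion rate, the `expected_paperclips` stand-in for \(\mathbb E[U_0|W_1]\), and the threshold are all hypothetical and not part of the original argument.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    """A hypothetical way of getting matter under control (toy numbers)."""
    name: str
    matter_kg: float        # matter brought under 'control'
    usable_fraction: float  # share still convertible into paperclips


CLIPS_PER_KG = 1000.0  # assumed conversion rate, purely illustrative


def expected_paperclips(s: Strategy) -> float:
    """Stand-in for E[U_0 | strategy]: paperclips obtainable from usable matter."""
    return s.matter_kg * s.usable_fraction * CLIPS_PER_KG


def unconditional_w1(strategies):
    """W_1': every strategy that gets some matter under control, usable or not."""
    return [s for s in strategies if s.matter_kg > 0]


def conditional_w1(strategies, threshold):
    """W_1: the subset of W_1' whose expected U_0 is at least `threshold`."""
    return [s for s in unconditional_w1(strategies)
            if expected_paperclips(s) >= threshold]


candidates = [
    Strategy("mine asteroid, keep ore accessible", 100.0, 0.9),
    Strategy("seal matter in an unopenable vault", 500.0, 0.0),
    Strategy("do nothing", 0.0, 0.0),
]

w1_prime = unconditional_w1(candidates)
w1 = conditional_w1(candidates, threshold=50_000.0)
print([s.name for s in w1_prime])
print([s.name for s in w1])
```

In this toy, the vault strategy belongs to \(W_1'\) but not to \(W_1\): it acquires the most matter, yet none of it can become paperclips. Excluding it costs one extra comparison per candidate.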

Of course by the time we’re that deep in the hierarchy, any efficient planning algorithm is making some use of real independences where we can reason relatively myopically about how to lay train tracks without worrying very much about what the cart of titanium is being used for. (Provided that the strategies are constrained enough in domain to not include any strategies that stomp distant higher goals, e.g. the strategy “build an independent superintelligence that just wants to lay train tracks”; if the system were optimizing that broadly it would need to check distant consequences and condition on them.)

The reply would then be that, in general, any feat of superintelligence requires making a ton of big, medium-sized, and little strategies all converge on a single future state in virtue of all of those strategies having been selected sufficiently well to optimize the expectation \(\mathbb E[U|W_1,W_2,...]\) for some \(U.\) A ton of little and medium-sized strategies must have all managed not to collide with each other or with larger big-picture considerations. If you can’t do this much then you can’t win a game of Go or build a factory or even walk across the room without your limbs tangling up.

Then there doesn’t seem to be any good reason to expect an agent which is instead directly optimizing the utility function \(U_1\), “acquire and hoard resources”, to do a very much better job of optimizing \(W_{10}\) or \(W_{25}.\) When \(W_{25}\) already needs to be conditioned in such a way as to not stomp on all the higher goals \(W_2, W_3, ...\) it just doesn’t seem that much less constraining to target \(U_1\) versus \(U_0, W_1.\) Most of the cognitive labor in the sequence does not seem like it should be going into checking for \(U_0\) at the end instead of checking for \(U_1\) at the end. It should be going into, e.g., figuring out how to make any kind of interstellar probe and figuring out how to build factories.

It has not historically been the case that the most computationally efficient way to play chess is to have competing agents inside the chess algorithm trying to optimize different unconditional utility functions and bidding on the right to make moves in order to pursue their own local goal of “protect the queen, regardless of other long-term consequences” or “control the center, regardless of other long-term consequences”. What we are actually trying to get is the chess move such that, conditioning on that chess move and the sort of future chess moves we are likely to make, our chance of winning is the highest. The best modern chess algorithms do their best to factor in anything that affects long-range consequences whenever they know about those consequences. The best chess algorithms don’t try to factor things into lots of colliding unconditional urges, because sometimes that’s not how “the winning move” factors. You can extremely often do better by doing a deeper consequentialist search that conditions multiple elements of your strategy on longer-term consequences in a way that prevents your moves from stepping on each other. It’s not very much of an exaggeration to say that this is why humans with brains that can imagine long-term consequences are smarter than, say, armadillos.
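The contrast between an unconditional urge and a search that conditions on consequences can be sketched on a tiny hand-built game tree. This is a minimal illustration, not real chess and not how any actual engine works: the positions, the `queen_safe` flag, and the win probabilities are all invented, and plain minimax stands in for "conditioning on the sort of future moves we are likely to make".

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Position:
    """A node in a hand-built toy game tree (hypothetical; not real chess)."""
    name: str
    queen_safe: bool = True   # the only thing the unconditional urge looks at
    win_prob: float = 0.0     # assumed win probability, used only at leaves
    replies: List["Position"] = field(default_factory=list)


def search_value(pos: Position, our_turn: bool) -> float:
    """Minimax: win probability if we maximize and the opponent minimizes."""
    if not pos.replies:
        return pos.win_prob
    values = [search_value(r, not our_turn) for r in pos.replies]
    return max(values) if our_turn else min(values)


def urge_move(root: Position) -> Position:
    """'Protect the queen, regardless of other long-term consequences'."""
    return max(root.replies, key=lambda p: p.queen_safe)


def search_move(root: Position) -> Position:
    """Pick the move whose looked-ahead win probability is highest."""
    return max(root.replies, key=lambda p: search_value(p, our_turn=False))


root = Position("start", replies=[
    Position("keep_queen", queen_safe=True, replies=[
        Position("opponent attacks", win_prob=0.35),
        Position("opponent defends", win_prob=0.40),
    ]),
    Position("sacrifice_queen", queen_safe=False, replies=[
        Position("opponent takes", win_prob=0.80),
        Position("opponent declines", win_prob=0.90),
    ]),
])

print(urge_move(root).name)
print(search_move(root).name)
```

On this toy tree the unconditional urge chooses `keep_queen`, while the search chooses `sacrifice_queen`, because only the search looks at what each move leads to.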

Sometimes there are subtleties we don’t have the computing power to notice; we can’t literally condition on the future. But “to make paperclips, acquire resources and use them to make paperclips” versus “to make paperclips, acquire resources regardless of whether they can be used to make paperclips” is not subtle. We’d expect a superintelligence that was efficient relative to humans to understand and correct at least those divergences between \(W_1\) and \(W_1'\) that a human could see, using at most the trivial amount of computing power represented by a human brain. To the extent that particular choices are being selected-on over a domain that is likely to include choices with huge long-range consequences, one expends the computing power to check and condition on the long-range consequences; but a supermajority of choices shouldn’t require checks of this sort; and even choices about how to design train tracks that do require longer-range checks are not going to be very much more tractable depending on whether the distant top of the goal hierarchy is something like “make paperclips” or “hoard resources”.

Even supposing that there could be 5% more computational cost associated with checking instrumental strategies for stepping on “promote fun-theoretic eudaimonia”, which might ubiquitously involve considerations like “make sure none of the computational processes you use to do this are themselves sentient”, this doesn’t mean you can’t have competent agents that go ahead and spend 5% more computation. It’s simply the correct choice to build subagents that expend 5% more computation to maintain coordination on achieving eudaimonia, rather than building subagents that expend 5% less computation to hoard resources and never give them back. It doesn’t matter if the second kind of agents are less “costly” in some myopic sense; they are vastly less useful and indeed actively destructive. So nothing that is choosing so as to optimize its expectation of \(U_0\) will build a subagent that generally optimizes its own expectation of \(U_1.\)

