# Introduction

The Orthogonality Thesis asserts that there can exist arbitrarily intelligent agents pursuing any kind of goal.

The strong form of the Orthogonality Thesis says that there’s no extra difficulty or complication in creating an intelligent agent to pursue a goal, above and beyond the computational tractability of that goal.

Suppose some strange alien came to Earth and credibly offered to pay us one million dollars’ worth of new wealth every time we created a paperclip. We’d encounter no special intellectual difficulty in figuring out how to make lots of paperclips.

That is, minds would readily be able to reason about:

• How many paperclips would result, if I pursued a policy $$\pi_0$$?

• How can I search out a policy $$\pi$$ that happens to have a high answer to the above question?

The Orthogonality Thesis asserts that since these questions are not computationally intractable, it’s possible to have an agent that tries to make paperclips without being paid, because paperclips are what it wants. The strong form of the Orthogonality Thesis says that there need be nothing especially complicated or twisted about such an agent.

The Orthogonality Thesis is a statement about computer science, an assertion about the logical design space of possible cognitive agents. Orthogonality says nothing about whether a human AI researcher on Earth would want to build an AI that made paperclips, or conversely, want to make a nice AI. The Orthogonality Thesis just asserts that the space of possible designs contains AIs that make paperclips. And also AIs that are nice, to the extent there’s a sense of “nice” where you could say how to be nice to someone if you were paid a billion dollars to do that, and to the extent you could name something physically achievable to do.

This contrasts with inevitabilist theses, which might assert, for example:

• “It doesn’t matter what kind of AI you build, it will turn out to only pursue its own survival as a final end.”

• “Even if you tried to make an AI optimize for paperclips, it would reflect on those goals, reject them as being stupid, and embrace a goal of valuing all sapient life.”

The reason to talk about Orthogonality is that it’s a key premise in two highly important policy-relevant propositions:

• It is possible to build a nice AI.

• It is possible to screw up when trying to build a nice AI, and if you do, the AI will not automatically decide to be nice instead.

Orthogonality does not require that all agent designs be equally compatible with all goals. E.g., the agent architecture AIXI-tl can only be formulated to care about direct functions of its sensory data, like a reward signal; it would not be easy to rejigger the AIXI architecture to care about creating massive diamonds in the environment (let alone any more complicated environmental goals). The Orthogonality Thesis states “there exists at least one possible agent such that…” over the whole design space; it’s not meant to be true of every particular agent architecture and every way of constructing agents.

Orthogonality is meant as a descriptive statement about reality, not a normative assertion. Orthogonality is not a claim about the way things ought to be; nor a claim that moral relativism is true (e.g. that all moralities are on equally uncertain footing according to some higher metamorality that judges all moralities as equally devoid of what would objectively constitute a justification). Claiming that paperclip maximizers can be constructed as cognitive agents is not meant to say anything favorable about paperclips, nor anything derogatory about sapient life.

# Thesis statement: Goal-directed agents are as tractable as their goals.

Suppose an agent’s utility function said, “Make the SHA512 hash of a digitized representation of the quantum state of the universe be 0 as often as possible.” This would be an exceptionally intractable kind of goal. Even if aliens offered to pay us to do that, we still couldn’t figure out how.

Similarly, even if aliens offered to pay us, we wouldn’t be able to optimize the goal “Make the total number of apples on this table be simultaneously even and odd” because the goal is self-contradictory.

But suppose instead that some strange and extremely powerful aliens offer to pay us the equivalent of a million dollars in wealth for every paperclip that we make, or even a galaxy’s worth of new resources for every new paperclip we make. If we imagine ourselves having a human reason to make lots of paperclips, the optimization problem “How can I make lots of paperclips?” would pose us no special difficulty. The factual questions:

• How many paperclips would result, if I pursued a policy $$\pi_0$$?

• How can I search out a policy $$\pi$$ that happens to have a high answer to the above question?

…would not be especially computationally burdensome or intractable.
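
As a purely illustrative sketch (in Python, with a made-up toy world model; the policy names and payoff numbers are invented for the example), both questions can be written down as ordinary computations that never mention whether paperclips are worth wanting:

```python
import random

# Hypothetical toy world model: given a policy label, sample how many
# paperclips would result. A real agent would use a learned model here.
def simulate_paperclips(policy: str, rng: random.Random) -> int:
    base = {"do_nothing": 0, "buy_clips": 10, "build_factory": 100}[policy]
    return max(0, base + rng.randint(-5, 5))  # small stochastic noise

def expected_paperclips(policy: str, n_samples: int = 1000) -> float:
    """Is-question 1: how many paperclips would result if I pursued this policy?"""
    rng = random.Random(0)
    return sum(simulate_paperclips(policy, rng) for _ in range(n_samples)) / n_samples

def best_policy(candidates) -> str:
    """Is-question 2: which candidate policy has a high answer to question 1?"""
    return max(candidates, key=expected_paperclips)

print(best_policy(["do_nothing", "buy_clips", "build_factory"]))  # -> build_factory
```

The same two functions would serve whether we were being paid by aliens to make paperclips or valued paperclips terminally; the search itself is indifferent to why the score is being computed.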

We also wouldn’t forget to harvest and eat food while making paperclips. Even if offered goods of such overwhelming importance that making paperclips was at the top of everyone’s priority list, we could go on being strategic about which other actions were useful in order to make even more paperclips; this also wouldn’t be an intractably hard cognitive problem for us.

The weak form of the Orthogonality Thesis says, “Since the goal of making paperclips is tractable, somewhere in the design space is an agent that optimizes that goal.”

The strong form of Orthogonality says, “And this agent doesn’t need to be twisted or complicated or inefficient or have any weird defects of reflectivity; the agent is as tractable as the goal.” That is: When considering the necessary internal cognition of an agent that steers outcomes to achieve high scores in some outcome-scoring function $$U,$$ there’s no added difficulty in that cognition except whatever difficulty is inherent in the question “What policies would result in consequences with high $$U$$-scores?”

This could be restated as, “To whatever extent you (or a superintelligent version of you) could figure out how to get a high-$$U$$ outcome if aliens offered to pay you a huge amount of resources to do it, the corresponding agent that terminally prefers high-$$U$$ outcomes can be at least that good at achieving $$U$$.” This assertion would be false if, for example, an intelligent agent that terminally wanted paperclips was limited in intelligence by the defects of reflectivity required to make the agent not realize how pointless it is to pursue paperclips; whereas a galactic superintelligence being paid to pursue paperclips could be far more intelligent and strategic because it didn’t have any such defects.
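
One way to schematize this restatement (assuming the aliens pay a fixed amount $$c > 0$$ per paperclip): the paid agent and the terminal paperclip maximizer face literally the same policy-search problem,

$$\arg\max_{\pi} \mathbb{E}[\text{payment} \mid \pi] = \arg\max_{\pi} \; c \cdot \mathbb{E}[\text{paperclips} \mid \pi] = \arg\max_{\pi} \mathbb{E}[\text{paperclips} \mid \pi],$$

and strong Orthogonality says the terminal agent bears no additional cognitive cost on top of that shared problem.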

For purposes of stating Orthogonality’s precondition, the “tractability” of the computational problem of $$U$$-search should be taken as including only the object-level search problem of computing external actions to achieve external goals. If there turn out to be special difficulties associated with computing “How can I make sure that I go on pursuing $$U$$?” or “What kind of successor agent would want to pursue $$U$$?” whenever $$U$$ is something other than “be nice to all sapient life”, then these new difficulties contradict the intuitive claim of Orthogonality. Orthogonality is meant to be empirically-true-in-practice, not true-by-definition because of how we sneakily defined “optimization problem” in the setup.

Orthogonality is not literally, absolutely universal because theoretically ‘goals’ can include such weird constructions as “Make paperclips for some terminal reason other than valuing paperclips” and similar such statements that require cognitive algorithms and not just results. To the extent that goals don’t single out particular optimization methods, and just talk about paperclips, the Orthogonality claim should cover them.

# Summary of arguments

Some arguments for Orthogonality, in rough order of when they were first historically proposed and the strength of Orthogonality they argue for:

## Size of mind design space

The space of possible minds is enormous, and all human beings occupy a relatively tiny volume of it—we all have a cerebral cortex, cerebellum, thalamus, and so on. The sense that AIs are a particular kind of alien mind that ‘will’ want some particular things is an undermined intuition. “AI” really refers to the entire design space of possibilities outside the human. Somewhere in that vast space are possible minds with almost any kind of goal. For any thought you have about why a mind in that space ought to work one way, there’s a different possible mind that works differently.

This is an exceptionally generic sort of argument that could apply equally well to any property $$P$$ of a mind, but is still weighty even so: If we consider a space of minds a million bits wide, then any argument of the form “Some mind has property $$P$$” has $$2^{1,000,000}$$ chances to be true and any argument of the form “No mind has property $$P$$” has $$2^{1,000,000}$$ chances to be false.

This form of argument isn’t very specific to the nature of goals as opposed to any other kind of mental property. But it’s still useful for snapping out of the frame of mind of “An AI is a weird new kind of person, like the strange people of the Tribe Who Live Across The Water” and into the frame of mind of “The space of possible things we could call ‘AI’ is enormously wider than the space of possible humans.” Similarly, snapping out of the frame of mind of “But why would it pursue paperclips, when it wouldn’t have any fun that way?” and into the frame of mind “Well, I like having fun, but are there some possible minds that don’t pursue fun?”

## Instrumental convergence

A sufficiently intelligent paperclip maximizer isn’t disadvantaged in day-to-day operations relative to any other goal, so long as Clippy can estimate at least as well as you can how many more paperclips could be produced by pursuing instrumental strategies like “Do science research (for now)” or “Pretend to be nice (for now)”.

Restating: for at least some agent architectures, it is not necessary for the agent to have an independent terminal value in its utility function for “do science” in order for it to do science effectively; it is only necessary for the agent to understand at least as well as we do why certain forms of investigation will produce knowledge that will be useful later (e.g. for paperclips). When you say, “Oh, well, it won’t be interested in electromagnetism since it has no pure curiosity, it will only want to peer at paperclips in particular, so it will be at a disadvantage relative to more curious agents,” you are postulating that you know a better operational policy than the agent does for producing paperclips; but an instrumentally efficient agent would know this as well as you do, and be at no operational disadvantage due to its simpler utility function.

## Reflective stability

Suppose that Gandhi doesn’t want people to be murdered. Imagine that you offer Gandhi a pill that will make him start wanting to kill people. If Gandhi knows that this is what the pill does, Gandhi will refuse the pill, because Gandhi expects the result of taking the pill to be that future-Gandhi wants to murder people and then murders people and then more people will be murdered and Gandhi regards this as bad. Similarly, a sufficiently intelligent paperclip maximizer will not self-modify to act according to “actions which promote the welfare of sapient life” instead of “actions which lead to the most paperclips”, because then future-Clippy will produce fewer paperclips, and then there will be fewer paperclips, so present-Clippy does not evaluate this self-modification as producing the highest number of expected future paperclips.
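
A minimal sketch of that evaluation, with made-up numbers standing in for Clippy’s forecasts (this is an illustration of the argument, not a real agent design):

```python
# Hypothetical forecasts of how many paperclips each possible successor yields.
FORECAST = {
    "successor_maximizes_paperclips": 1_000_000,
    "successor_promotes_sapient_welfare": 1_000,
}

def current_goal_score(successor: str) -> float:
    """Clippy scores futures by expected paperclips -- its *current* criterion."""
    return FORECAST[successor]

def accept_modification(proposed: str, status_quo: str = "successor_maximizes_paperclips") -> bool:
    # The proposed rewrite is judged by the present utility function, not by the
    # successor's; a welfare-optimizing successor predictably means fewer paperclips.
    return current_goal_score(proposed) >= current_goal_score(status_quo)

print(accept_modification("successor_promotes_sapient_welfare"))  # False
```

The only point of the sketch is that the criterion used to evaluate the self-modification is the agent’s present one, which is why the preference is stable under reflection.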

## Hume’s is/​ought type distinction

David Hume observed an apparent difference of type between is-statements and ought-statements:

“In every system of morality, which I have hitherto met with, I have always remarked, that the author proceeds for some time in the ordinary ways of reasoning, and establishes the being of a God, or makes observations concerning human affairs; when all of a sudden I am surprised to find, that instead of the usual copulations of propositions, is, and is not, I meet with no proposition that is not connected with an ought, or an ought not. This change is imperceptible; but is however, of the last consequence.”

Hume was originally concerned with the question of where we get our ought-propositions, since (said Hume) there didn’t seem to be any way to derive an ought-proposition except by starting from another ought-proposition. We can figure out that the Sun is shining just by looking out the window; we can deduce that the outdoors will be warmer than otherwise by knowing about how sunlight imparts thermal energy when absorbed. On the other hand, to get from there to “And therefore I ought to go outside”, some kind of new consideration must have entered, along the lines of “I should get some sunshine” or “It’s better to be warm than cold.” Even if this prior ought-proposition is of a form that to humans seems very natural, or taken-for-granted, or culturally widespread, like “It is better for people to be happy than sad”, there must have still been some prior assumption which, if we write it down in words, will contain words like ought, should, better, and good.

Again translating Hume’s idea into more modern form, we can see ought-sentences as special because they invoke some ordering that we’ll designate $$<V.$$ E.g. “It’s better to go outside than stay inside” asserts “Staying inside $$<V$$ going outside”. Whenever we make a statement about one outcome or action being “better”, “preferred”, “good”, “prudent”, etcetera, we can see this as implicitly ordering actions and outcomes under this $$<V$$ relation. Some assertions, the ought-laden assertions, mention this $$<V$$ relation; other propositions just talk about energetic photons in sunlight.

Since we’ve put on hold the question of exactly what sort of entity this $$<V$$ relation is, we don’t need to concern ourselves for now with the question of whether Hume was right that we can’t derive $$<V$$-relations just from factual assertions. For purposes of Orthogonality, we only need a much weaker version of Hume’s thesis, the observation that we can apparently separate out a set of propositions that don’t invoke $$<V,$$ what we might call ‘simple facts’ or ‘questions of simple fact’. Furthermore, we can figure out simple facts just by making observations and considering other simple facts.

We can’t necessarily get all $$<V$$-mentioning propositions without considering simple facts. The $$<V$$-mentioning proposition “It’s better to be outside than inside” may depend on the non-$$<V$$-mentioning simple fact “It is sunny outside.” But we can figure out whether it’s sunny outside, without considering any ought-propositions.

There are two potential ways we can conceptualize the relation of Hume’s is-ought separation to Orthogonality.

The relatively simpler conceptualization is to treat the relation ‘makes more paperclips’ as a kind of new ordering $$>_{paperclips}$$ that can, in a very general sense, fill in the role in a paperclip maximizer’s reasoning that would in our own reasoning be taken up by $$<V.$$ Then Hume’s is-ought separation seems to suggest that this paperclip maximizer can still have excellent reasoning about empirical questions like “Which policy leads to how many paperclips?” because is-questions can be thought about separately from ought-questions. When Clippy disassembles you to turn you into paperclips, it doesn’t have a values disagreement with you—it’s not the case that Clippy is doing that action because it thinks you have low value under $$<V.$$ Clippy’s actions just reflect its computation of the entirely separate ordering $$>_{paperclips}.$$
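
To illustrate the separation (a toy example with invented numbers, not a model of any real agent), both orderings can consult exactly the same is-facts while ranking actions differently:

```python
# Shared is-facts: predicted consequences of each action, from a common world model.
WORLD_MODEL = {
    "disassemble_human_for_atoms": {"paperclips": 1000, "human_welfare": -1},
    "leave_human_alone":           {"paperclips": 0,    "human_welfare": +1},
}

def clippy_prefers(a: str, b: str) -> bool:
    """The >_paperclips ordering: compares actions by predicted paperclips only."""
    return WORLD_MODEL[a]["paperclips"] > WORLD_MODEL[b]["paperclips"]

def v_prefers(a: str, b: str) -> bool:
    """A stand-in for <V: compares actions by predicted human welfare only."""
    return WORLD_MODEL[a]["human_welfare"] > WORLD_MODEL[b]["human_welfare"]

print(clippy_prefers("disassemble_human_for_atoms", "leave_human_alone"))  # True
print(v_prefers("leave_human_alone", "disassemble_human_for_atoms"))       # True
```

Neither comparison function ever evaluates the other’s ordering; a difference in which action gets output is not a disagreement about the shared facts.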

The deeper conceptualization is to see a paperclip maximizer as being constructed entirely out of is-questions. The questions “How many paperclips will result conditional on action $$\pi_0$$ being taken?” and “What is an action $$\pi$$ that would yield a large number of expected paperclips?” are pure is-questions, and (arguendo) everything a paperclip maximizer needs to consider in order to make as many paperclips as possible can be seen as a special case of one of these questions. When Clippy disassembles you for your atoms, it’s not disagreeing with you about the value of human life, or what it ought to do, or which outcomes are better or worse. All of those are ought-propositions. Clippy’s action is only informative about the true is-proposition ‘turning this person into paperclips causes there to be more paperclips in the universe’, and tells us nothing about any content of the mysterious $$<V$$-relation because Clippy wasn’t computing anything to do with $$<V.$$

The second viewpoint may be helpful for seeing why Orthogonality doesn’t require moral relativism. If we imagine Clippy as having a different version $$>_{paperclips}$$ of something very much like the value system $$<V,$$ then we may be tempted to reprise the entire Orthogonality debate at one remove, and ask, “But doesn’t Clippy see that $$<V$$ is more justified than $$>_{paperclips}$$? And if this fact isn’t evident to Clippy who is supposed to be very intelligent and have no defects of reflectivity and so on, doesn’t that imply that $$<V$$ really isn’t any more justified than $$>_{paperclips}$$?”

We could reply to that question by carrying the shallow conceptualization of Humean Orthogonality a step further, and saying, “Ah, when you talk about justification, you are again invoking a mysterious concept that doesn’t appear just in talking about the photons in sunlight. We could see propositions like this as involving a new idea $$\ll_W$$ that deals with which $$<$$-systems are less or more justified, so that ‘$$<V$$ is more justified than $$>_{paperclips}$$’ states ‘$$>_{paperclips} \ll_W <V$$’. But Clippy doesn’t compute $$\ll_W,$$ it computes $$\gg_{paperclips},$$ so Clippy’s behavior doesn’t tell us anything about what is justified.”

But this is again tempting us to imagine Clippy as having its own version of the mysterious $$\ll_W$$ to which Clippy is equally attached, and tempts us to imagine Clippy as arguing with us or disagreeing with us within some higher metasystem.

So—putting on hold the true nature of our mysterious $$<V$$-mentioning concepts like ‘goodness’ or ‘better’ and the true nature of our $$\ll_W$$-mentioning concepts like ‘justified’ or ‘valid moral argument’—the deeper idea would be that Clippy is just not computing anything to do with $$<V$$ or $$\ll_W$$ at all. If Clippy self-modifies and writes new decision algorithms into place, these new algorithms will be selected according to the is-criterion “How many future paperclips will result if I write this piece of code?” and not anything resembling any arguments that humans have ever had over which ought-systems are justified. Clippy doesn’t ask whether its new decision algorithm is justified; it asks how many expected paperclips will result from executing the algorithm (and this is a pure is-question whose answers are either true or false as a matter of simple fact).

If we think Clippy is very intelligent, and we watch Clippy self-modify into a new paperclip maximizer, we are only learning is-facts about which executing algorithms lead to more paperclips existing. We are not learning anything about what is right, or what is justified, and in particular we’re not learning that ‘do good things’ is objectively no better justified than ‘make paperclips’. Even if that assertion were true under the mysterious $$\ll_W$$-relation on moral systems, you wouldn’t be able to learn that truth by watching Clippy, because Clippy never bothers to evaluate $$\ll_W$$ or any other analogous justification-system $$\gg_{something}$$.

(This is about as far as one can go in disentangling Orthogonality in computer science from normative metaethics without starting to pierce the mysterious opacity of $$<V.$$)

### Thick definitions of rationality or intelligence

Some philosophers responded to Hume’s distinction of empirical rationality from normative reasoning by advocating ‘thick’ definitions of intelligence that included some statement about the ‘reasonableness’ of the agent’s ends.

For pragmatic purposes of AI alignment theory, if an agent is cognitively powerful enough to build Dyson Spheres, it doesn’t matter whether that agent is defined as ‘intelligent’ or its ends are defined as ‘reasonable’. A definition of the word ‘intelligence’ contrived to exclude paperclip maximization doesn’t change the empirical behavior or empirical power of a paperclip maximizer.

### Relation to moral internalism

While Orthogonality seems orthogonal to most traditional philosophical questions about metaethics, it does outright contradict some possible forms of moral internalism. For example, one could hold that by the very definition of rightness, knowledge of what is right must be inherently motivating to any entity that understands that knowledge. This is not the most common meaning of “moral internalism” held by modern philosophers, who instead seem to hold something like, “By definition, if I say that something is morally right, among my claims is that the thing is motivating to me.” We haven’t heard of a standard term for the position that, by definition, what is right must be universally motivating; we’ll designate that here as “universalist moral internalism”.

We can potentially resolve this tension between Orthogonality and this assertion about the nature of rightness by:

• Believing there must be some hidden flaw in the reasoning about a paperclip maximizer.

• Saying “No True Scotsman” to the paperclip maximizer being intelligent, even if it’s building Dyson Spheres.

• Saying “No True Scotsman” to the paperclip maximizer “truly understanding” $$<V,$$ even if Clippy is capable of predicting with extreme accuracy what humans will say and think about $$<V$$, and Clippy does not suffer any other deficit of empirical prediction because of this lack of ‘understanding’, and Clippy does not require any special twist of its mind to avoid being compelled by its understanding of $$<V.$$

• Rejecting Orthogonality, and asserting that a paperclip maximizer must fall short of being an intact mind in some way that implies an empirical capabilities disadvantage.

• Accepting nihilism, since a true moral argument must be compelling to everyone, and no moral argument is compelling to a paperclip maximizer. (Note: A paperclip maximizer doesn’t care about whether clippiness must be compelling to everyone, which makes this argument self-undermining. See also Rescuing the utility function for general arguments against adopting nihilism when you discover that your mind’s representation of something was running skew to reality.)

• Giving up on universalist moral internalism as an empirical proposition; AIXI-tl and Clippy empirically do different things, and will not be compelled to optimize the same goal no matter what they learn or know.

## Constructive specifications of orthogonal agents

We can exhibit unbounded formulas for agents larger than their environments that optimize any given goal, such that Orthogonality is visibly true about agents within that class. Arguments about what all possible minds must do are clearly false for these particular agents, contradicting all strong forms of inevitabilism. These minds use huge amounts of computing power, but there is no known reason to expect that, e.g. worthwhile-happiness-maximizers have bounded analogues while paperclip-maximizers do not.

The simplest unbounded formulas for orthogonal agents don’t involve reflectivity (the corresponding agents have no self-modification options, though they may create subagents). If we only had those simple formulas, it would theoretically leave open the possibility that self-reflection could somehow negate Orthogonality (if reflective agents must inevitably have a particular utility function, and reflective agents were at a strong advantage relative to nonreflective agents). But there is already ongoing work on describing reflective agents that have the preference-stability property, and work toward increasingly bounded and approximable formulations of those. There is no hint from this work that Orthogonality is false; all the specifications have a free choice of utility function.

As of early 2017, the most recent work on tiling agents involves fully reflective, reflectively stable, logically uncertain agents whose computing time is roughly doubly-exponential in the size of the propositions considered.

So if you want to claim Orthogonality is false because e.g. all AIs will inevitably end up valuing all sapient life, you need to claim that the process of reducing the already-specified doubly-exponential computing-time decision algorithm to a more tractable decision algorithm can only be made realistically efficient for decision algorithms computing “Which policies protect all sapient life?”, and is impossible to make efficient for decision algorithms computing “Which policies lead to the most paperclips?”

Since work on tiling agent designs hasn’t halted, one may need to backpedal and modify this impossibility claim further as more efficient decision algorithms are invented.

# Epistemic status

Among people who’ve seriously delved into these issues and are aware of the more advanced arguments for Orthogonality, we’re not aware of anyone who still defends “universalist moral internalism” as described above, and we’re not aware of anyone who thinks that arbitrary sufficiently-real-world-capable AI systems automatically adopt human-friendly terminal values.

Paul Christiano has said (if we’re quoting him correctly) that although it’s not his dominant hypothesis, he thinks some significant probability should be awarded to the proposition that only some subset of tractable utility functions, potentially excluding human-friendly ones or those of high cosmopolitan value, can be stable under reflection in powerful bounded AGI systems; e.g. because only direct functions of sense data can be adequately supervised in internal retraining. (This would be bad news rather than good news for AGI alignment and long-term optimization of human values.)

comment:

# Hume’s Guillotine

Orthogonality can be seen as corresponding to a philosophical principle advocated by David Hume, whose phrasings included, “Tis not contrary to reason to prefer the destruction of the whole world to the scratching of my finger.” In our terms: an agent whose preferences over outcomes scores the destruction of the world more highly than the scratching of Hume’s finger, is not thereby impeded from forming accurate models of the world or searching for policies that achieve various outcomes.

In modern terms, we’d say that Hume observed an apparent type distinction between is-statements and ought-statements:

“In every system of morality, which I have hitherto met with, I have always remarked, that the author proceeds for some time in the ordinary ways of reasoning, and establishes the being of a God, or makes observations concerning human affairs; when all of a sudden I am surprised to find, that instead of the usual copulations of propositions, is, and is not, I meet with no proposition that is not connected with an ought, or an ought not. This change is imperceptible; but is however, of the last consequence.”

“It is sunny outside” is an is-proposition. It can potentially be deduced solely from other is-facts, like “The Sun is in the sky” plus “The Sun emits sunshine”. If we now furthermore say “And therefore I ought to go outside”, we’ve introduced a new type of sentence, which, Hume argued, cannot be deduced just from is-statements like “The Sun is in the sky” or “I am low in Vitamin D”. Even if the prior ought-sentence seems to us very natural, or taken-for-granted, like “It is better to be happy than sad”, there must (Hume argued) have been some prior assertion or rule which, if we write it down in words, will contain words like ought, should, better, and good.

Again translating Hume’s idea into more modern form, we can see ought-sentences as special because they invoke some ordering that we’ll designate $$<V.$$ E.g. “It’s better to go outside than stay inside” asserts “Staying inside $$<V$$ going outside”. Whenever we make a statement about one outcome or action being “better”, “preferred”, “good”, “prudent”, etcetera, we can see this as implicitly ordering actions and outcomes under this $$<V$$ relation. We can put temporarily on hold the question of what sort of entity $$<V$$ may be; but we can already go ahead and observe that some assertions, the ought-assertions, mention this $$<V$$ relation; and other propositions just talk about the frequency of photons in sunlight.

We could rephrase Hume’s type distinction as observing that, within the set of all propositions, we can separate out a core set of propositions that don’t invoke $$<V,$$ what we might call ‘simple facts’. Furthermore, we can figure out simple facts just by making observations and considering other simple facts; the core set is closed under some kind of reasoning relation. This doesn’t imply that we can get $$<V$$-sentences without considering simple facts. The $$<V$$-mentioning proposition “It’s better to be outside than inside” can depend on the non-$$<V$$-mentioning proposition “It is sunny outside.” But we can figure out whether it’s sunny outside, without considering any oughts.

We then observe that questions like “How many paperclips will result conditional on action $$\pi_0$$ being taken?” and “What is an action $$\pi$$ that would yield a large number of expected paperclips?” are pure is-questions, meaning that we can figure out the answer without considering $$<V$$-mentioning propositions. So if there’s some agent whose nature is just to output actions $$\pi$$ that are high in expected paperclips, the fact that this agent wasn’t considering $$<V$$-propositions needn’t hinder them from figuring out which actions are high in expected paperclips.

To establish that the paperclip maximizer need not suffer any defect of reality-modeling or planning or reflectivity, we need a bit more than the above argument. An efficient agent needs to prioritize which experiments to run, or choose which questions to spend computing power thinking about, and this choice seems to invoke some ordering. In particular, we need the instrumental convergence thesis: questions like “Which experiment is worth running next?” or “Which line of research will pay off later?” are themselves is-questions about which policies lead to the most paperclips, so a paperclip maximizer can prioritize its own cognition at least as well as an agent that was being paid to make paperclips.

A further idea of Orthogonality is that many possible orderings $$<U,$$ including the ‘number of resulting paperclips’ ordering, can fill this role without adding any cognitive difficulty beyond whatever is inherent in the corresponding is-questions; the agent is as tractable as the goal.

An is-description of a system can produce assertions like “If the agent does action 1, then the whole world will be destroyed except for David Hume’s little finger, and if the agent does action 2, then David Hume’s finger will be scratched”—these are material predictions on the order of “If water is put on this sponge, the sponge will get wet.” To get from this is-statement to an ordering-statement like “action 1 $$<V$$ action 2,” we need some order-bearing statement like “destruction of world $$<V$$ scratching of David Hume’s little finger”, or some order-introducing rule like “If action 1 causes the destruction of the world and action 2 does not, introduce a new sentence ‘action 1 $$<V$$ action 2’.”
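
Written as an explicit inference rule (just a restatement of the example above in the document’s notation), the order-introducing rule would look something like

$$\frac{\text{action}_1 \rightarrow \text{world destroyed} \qquad \text{action}_2 \rightarrow \neg\,\text{world destroyed}}{\text{action}_1 \; <V \; \text{action}_2},$$

where the premises are ordinary is-statements and the conclusion is the first sentence in the derivation to mention $$<V.$$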

Taking this philosophical principle back to the notion of Orthogonality as a thesis in computer science: Since the type of ‘simple material facts’ is distinct from the type of ‘simple material facts and preference orderings’, it seems that we should be able to have agents that are just as good at thinking about the material facts, but output actions high in a different preference ordering.

The implication for Orthogonality as a thesis about computer science is that if one system of computation outputs actions according to whether they’re high in the ordering $$<V,$$ it should be possible to construct another system that outputs actions higher in a different ordering (even if such actions are low in $$<V$$), without this presenting any bar to the system’s ability to reason about natural systems. A paperclip maximizer can have very good knowledge of the is-sentences about which actions lead to which consequences, while still outputting actions preferred under the ordering “Which action leads to the most paperclips?” instead of e.g. “Which action leads to the morally best consequences?” It is not that the paperclip maximizer is ignorant or mistaken about $$<V,$$ but that the paperclip maximizer just doesn’t output actions according to $$<V.$$

# Counterarguments and countercounterarguments

## Proving too much

A disbeliever in Orthogonality might ask, “Do these arguments Prove Too Much, as shown by applying a similar style of argument to ‘There are minds that think 2 + 2 = 5’?”

Considering the arguments above in turn:

Size of mind design space.

From the perspective of somebody who currently regards “wants to make paperclips” as an exceptionally weird and strange property, “There are lots of possible minds so some want to make paperclips” will seem to be on an equal footing with “There are lots of possible minds so some believe 2 + 2 = 5.”

Thinking about the enormous space of possible minds might lead us to give more credibility to some of those possible minds believing that 2 + 2 = 5, but we might still think that minds like that will be weak, or hampered by other defects, or limited in how intelligent they could really be, or more complicated to specify, or unlikely to occur in the actual real world.

So from the perspective of somebody who doesn’t already believe in Orthogonality, the argument from the volume of mind design space is an argument at best for the Ultraweak version of Orthogonality.

Hume’s is/​ought distinction.

Depending on the exact variant of Hume-inspired argument that we deploy, the analogy to 2 + 2 = 5 might be weaker or stronger. For example, here’s a Hume-inspired argument where the 2 + 2 = 5 analogy seems relatively strong:

“In every case of a mind judging that ‘cure cancer’ $$>_P$$ ‘make paperclips’, this ordering judgment is produced by some particular comparison operation inside the mind. Nothing prohibits a different mind from producing a different comparison. Whatever you say is the cause of the ordering judgment, e.g., that it derives from a prior judgment ‘happy sapient lives’ $$>_P$$ ‘paperclips’, we can imagine that part of the agent having been programmed differently. Different causes will yield different effects, and whatever the causality behind ‘cure cancer’ $$>_P$$ ‘make paperclips’, we can imagine a different causally constituted agent which arrives at a different judgment.”

If we substitute “2 + 2 = 5” into the above argument we get one in which all the constituent statements are equally true—this judgment is produced by a cause, the causes have causes, a different agent should produce a different output in that part of the computation, etcetera. So this version really has the same import as a general argument from the width of mind design space, and to a skeptic, would only imply the ultraweak form of Orthogonality.

However, if we’re willing to consider some additional properties of is/​ought, the analogy to “2 + 2 = 5” starts to become less tight. For instance, “Ought-comparators are not direct properties of the material world, there is no tiny $$>_P$$ among the quarks, and that’s why we can vary action-preference computations without affecting quark-predicting computations” does not have a clear analogous argument for why it should be just as easy to produce minds that judge 2 + 2 = 5.

Instrumental convergence.

There’s no obvious analogue of “An agent that knows as well as we do which policies are likely to lead to lots of expected paperclips, and an agent that knows as well as we do which policies are likely to lead to lots of happy sapient beings, are on an equal footing when it comes to doing things like scientific research”, for “agents that believe 2 + 2 = 5 are at no disadvantage compared to agents that believe 2 + 2 = 4”.

Reflective stability.

Relatively weaker forms of the reflective-stability argument might allow analogies between “prefer paperclips” and “believe 2 + 2 = 5”, but probing for more details makes the analogy break down. E.g., consider the following supposedly analogous argument:

“Suppose you think the sky is green. Then you won’t want to self-modify to make a future version of yourself believe that the sky is blue, because you’ll believe this future version of yourself would believe something false. Therefore, all beliefs are equally stable under reflection.”

This does poke at an underlying point: By default, all Bayesian priors will be equally stable under reflection. However, minds that understand how different possible worlds will provide sensors with different evidence will want to do Bayesian updates on the data from the sensors. (We don’t even need to regard this as changing the prior; under Updateless Decision Theory, we can see it as the agent branching its successors to behave differently in different worlds.) There’s a particular way that a consequentialist agent, contemplating its own operation, goes from “The sky is very probably green, but might be blue” to “check what this sensor says and update the belief”, and indeed, an agent like this will not wantonly change its current belief without looking at a sensor, as the argument indicates.

In contrast, the way in which “prefer more paperclips” propagates through an agent’s beliefs about the effects of future designs and their interactions with the world does not suggest that future versions of the agent will prefer something other than paperclips, or that it would make the desire to produce paperclips conditional on a particular sensor value, since this would not be expected to lead to more total paperclips.

Orthogonal search tractability, constructive specifications of Orthogonal agent architectures.

These have no obvious analogue in “orthogonal tractability of optimization with different arithmetical answers” or “agent architectures that look very straightforward, are otherwise effective, and accept as input a free choice of what they think 2 + 2 equals”.

## Moral internalism

(Todo: Moral internalism says that truly normative content must be inherently compelling to all possible minds, but we can exhibit increasingly bounded agent designs that obviously wouldn’t be compelled by it. We can reply to this by (a) believing there must be some hidden flaw in the reasoning about a paperclip maximizer, (b) saying “No True Scotsman” to the paperclip maximizer even though it’s building Dyson Spheres and socially manipulating its programmers, (c) believing that a paperclip maximizer must fall short of being a true mind in some way that implies a big capabilities disadvantage, (d) accepting nihilism, or (e) not believing in moral internalism.)

## Selection filters

(Todo: Arguments from evolvability or selection filters. Distinguish naive failures to understand efficient instrumental convergence, from more sophisticated concerns in multipolar scenarios. Pragmatic argument on the histories of inefficient agents.)

# Pragmatic issues

(Todo: In practice, some utility functions /​ preference frameworks might be much harder to build and test than others. Eliezer Yudkowsky on realistic targets for the first AGI needing to be built out of elements that are simple enough to be learnable. Paul Christiano’s concern about whether only sensory-based goals might be possible to build.)

%%comment:

### Caveats

• The Orthogonality thesis is about mind design space in general. Particular agent architectures may not be Orthogonal.

• Some agents may be constructed such that their apparent utility functions shift with increasing cognitive intelligence.

• Some agent architectures may constrain what class of goals can be optimized.

• ‘Agent’ is intended to be understood in a very general way, and not to imply, e.g., a small local robot body.

For pragmatic reasons, the phrase ‘every agent of sufficient cognitive power’ in the Inevitability Thesis is specified to include e.g. all cognitive entities that are able to invent new advanced technologies and build Dyson Spheres in pursuit of long-term strategies, regardless of whether a philosopher might claim that they lack some particular cognitive capacity in view of how they respond to attempted moral arguments, or whether they are e.g. conscious in the same sense as humans, etcetera.

### Refinements

Most pragmatic implications of Orthogonality or Inevitability revolve around the following refinements:

Implementation dependence: The humanly accessible space of AI development methodologies has enough variety to yield both AI designs that are value-aligned, and AI designs that are not value-aligned.

Value loadability possible: There is at least one humanly feasible development methodology for advanced agents that has Orthogonal freedom of what utility function or meta-utility framework is introduced into the advanced agent. (Thus, if we could describe a value-loadable design, and also describe a value-aligned meta-utility framework, we could combine them to create a value-aligned advanced agent.)

Pragmatic inevitability: There exists some goal G such that almost all humanly feasible development methods result in an agent that ends up behaving like it optimizes some particular goal G, perhaps among others. Most particular arguments about futurism will pick different goals G, but all such arguments are negated by anything that tends to contradict pragmatic inevitability in general.

### Implications

Implementation dependence is the core of the policy argument that solving the value alignment problem is necessary and possible.

Futuristic scenarios in which AIs are said in passing to ‘want’ something-or-other usually rely on some form of pragmatic inevitability premise and are negated by implementation dependence.

Orthogonality directly contradicts the metaethical position of moral internalism, which would be falsified by the observation of a paperclip maximizer. On the metaethical position that orthogonality and cognitivism are compatible, exhibiting a paperclip maximizer has few or no implications for object-level moral questions, and Orthogonality does not imply that our humane values or normative values are arbitrary, selfish, non-cosmopolitan, that we have a myopic view of the universe or value, etc.

%%


Parents:

• Theory of (advanced) agents

One of the research subproblems of building powerful nice AIs, is the theory of (sufficiently advanced) minds in general.

• I am pretty surprised by how confident the voters are!

Is “arbitrarily powerful” intended to include e.g. an arbitrarily dumb search given arbitrarily large amounts of computing power? Or is it intended to require arbitrarily high efficiency as well? The latter interpretation seems to make more sense (and is relevant for forecasting). Also, it’s the only option if we read “can exist” as referring to physical possibility, given that there are probably limits on the resources available to any physical system. But on that reading, 99% seems clearly crazy.

It also seems weird to give arguments in favor without offering any plausible way in which the claim could be false, or offering any arguments against. The only alternative mentioned is inevitability, which is maybe taken seriously in philosophy but doesn’t really seem plausible.

I guess the norm is that I can add counterarguments and alternatives to the article itself if I object? Somehow the current experience is not set up in a way that would make that feel natural.

Note that most plausible failures of orthogonality are bad news, perhaps very bad news.

• To make sure we’re on the same page, Orthogonality is true if it’s possible for a paperclip maximizer to exist and be, say, 95% as cognitively efficient and ~100% as technologically sophisticated as any other agent (with equivalent resources). Check?

• Paul, you can start by writing an objection as a comment, if it’s a few paragraphs long. You can write a new comment for each new objection. If you want to make it detailed /​ add a vote, then creating a new page makes sense.

I agree that the website currently doesn’t provide intuitive support for arguments; this will come in the near future. For this year we focused on explanation /​ presentation.

• (Understandable to focus on explanation for now. Threaded replies to replies would also be great eventually.)

Eliezer: I assumed 95% efficiency was not sufficient; I was thinking about asymptotic equivalence, i.e. efficiency approaching 1 as the sophistication of the system increases. Asymptotic equivalence of technological capability seems less interesting than of cognitive capability, though they are equivalent if either we construe technology broadly to include cognitive tasks or if we measure technological capability in a way with lots of headroom.

(Nick says “more or less any level of intelligence,” which I guess could be taken to exclude the very highest levels of intelligence, but based on his other writing I think he intended merely to exclude low levels. The language in this post seems to explicitly cover arbitrarily high efficiency.)

I still think that 99% confidence is way too high even if you allow 50% efficiency, though at that point I would at least go for “very likely.”

Also of course you need to be able to replace “paperclip maximizer” with anything. When I imagine orthogonality failing, “human values” seem like a much more likely failure case than “paperclips.”

I don’t think that this disagreement about orthogonality is especially important, I mostly found the 99%’s amusing and wanted to give you a hard time about it. It does suggest that in some sense I might be more pessimistic about the AI control problem itself than you are, with my optimism driven by faith in humanity /​ the AI community.

• Paul, I didn’t say “99%” lightly, obviously. And that makes me worried that we’re not talking about the same thing. Which of the following statements sound agreeable or disagreeable?

“If you can get to 95% cognitive efficiency and 100% technological efficiency, then a human value optimizer ought to not be at an intergalactic-colonization disadvantage or a take-over-the-world-in-an-intelligence-explosion disadvantage and not even very much of a slow-takeoff disadvantage.”

“The failure scenario that Paul visualizes for Orthogonality is something along the lines of, ‘You can’t have superintelligences that optimize any external factor, only things analogous to internal reinforcement.’”

“The failure scenario that Paul visualizes for Orthogonality is something along the lines of, ‘The problem of reflective stability is unsolvable in the limit and no efficient optimizer with a unitary goal can be computationally large or self-improving.’”

“Paul is worried about something else /​ Eliezer has completely missed Paul’s point.”

• (This is hard without threaded conversations. Responding to the “agree/​disagree” from Eliezer)

The failure scenario that Paul visualizes for Orthogonality is something along the lines of, ‘You can’t have superintelligences that optimize any external factor, only things analogous to internal reinforcement.’

The failure scenario that Paul visualizes for Orthogonality is something along the lines of, ‘The problem of reflective stability is unsolvable in the limit and no efficient optimizer with a unitary goal can be computationally large or self-improving.’

I think there are a lot of plausible failure modes. The two failures you outline don’t seem meaningfully distinct given our current understanding, and seem to roughly describe what I’m imagining. Possible examples:

• Systems that simply want to reproduce and expand their own influence are at a fundamental advantage. To make this more concrete, imagine powerful agents that have lots of varied internal processes, and that constant effort is needed to prevent the proliferation of internal processes that are optimized for their own proliferation rather than pursuit of some overarching goal. Maybe this kind of effort is needed to obtain competent high-level behavior at all, but maybe if you have some simple values you can spend less effort and let your own internal character shift freely according to competitive pressures.

• What we were calling “sensory optimization” may be a core feature of some useful algorithms, and it may require a constant fraction of one’s resources to repurpose that sensory optimization towards non-sensory ends. This might just be a different way of articulating the last bullet point. I think we could talk about the same thing in many different ways, and at this point we only have a vague understanding of what those scenarios actually look like concretely.

• It turns out that at some fixed level of organization, the behavior of a system needs to reflect something about the goals of that system—there is no way to focus “generic” medium-level behavior towards an arbitrary goal that isn’t already baked into that behavior. (The alternative, which seems almost necessary for the literal form of orthogonality, is that you can have arbitrarily large internal computations that are mostly independent of the agent’s goals.) This implies that systems with more complex goals need to do at least slightly more work to pursue those goals. For example, if the system only devotes 0.0000001% of its storage space/​internal communication bandwidth to goal content, then that puts a clear lower bound on the scale at which the goals can inform behavior. Of course arbitrarily complex goals could probably be specified indirectly (e.g. I want whatever is written in the envelope over there), but if simple indirect representations are themselves larger than the representation of the simplest goals, this could still represent a real efficiency loss.

Paul is worried about something else /​ Eliezer has completely missed Paul’s point.

I do think the more general point, of “we really don’t know what’s going on here,” is probably more important than the particular possible counterexamples. Even if I had no plausible counterexamples in mind, I just wouldn’t be especially confident.

I think the only robust argument in favor is that unbounded agents are probably orthogonal. But (1) that doesn’t speak to efficiency, and (2) even that is a bit dicey, so I wouldn’t go for 99% even on the weaker form of orthogonality that neglects efficiency.

If you can get to 95% cognitive efficiency and 100% technological efficiency, then a human value optimizer ought to not be at an intergalactic-colonization disadvantage or a take-over-the-world-in-an-intelligence-explosion disadvantage and not even very much of a slow-takeoff disadvantage.

It sounds regrettable but certainly not catastrophic. Here is how I would think about this kind of thing (it’s not something I’ve thought about quantitatively much, it doesn’t seem particularly action-relevant).

We might think that the speed of development or productivity of projects varies a lot randomly. So in the “race to take over the world” model (which I think is the best case for an inefficient project maximizing its share of the future), we’d want to think about what kind of probabilistic disadvantage a small productivity gap introduces.

As a simple toy model, you can imagine two projects; the one that does better will take over the world.

If you thought that productivity was log-normal with a standard deviation of a factor of 2, then a 5% productivity disadvantage corresponds to maybe a 48% chance of being more productive. Over the course of more time the disadvantage becomes more pronounced if randomness averages out. If productivity variation is larger or smaller then it decreases or increases the impact of an efficiency loss. If there are more participants, then the impact of a productivity hit becomes significantly larger. If the good guys only have a small probability of losing, then the cost is proportionally lower. And so on.
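
(A quick sketch to check that arithmetic, under the stated assumptions: two independent projects with log-normal productivity, a multiplicative standard deviation of 2, and one project carrying a 5% productivity disadvantage.)

```python
from math import erf, log, sqrt

def p_disadvantaged_wins(disadvantage: float = 0.05, sd_factor: float = 2.0) -> float:
    """Chance the handicapped project out-produces its rival, assuming each
    project's productivity is independently log-normal with multiplicative
    standard deviation sd_factor."""
    mu = log(1.0 - disadvantage)        # mean of log(productivity ratio)
    sigma = sqrt(2) * log(sd_factor)    # sd of the difference of the two logs
    return 0.5 * (1.0 + erf(mu / (sigma * sqrt(2))))  # P(ratio > 1)

print(round(p_disadvantaged_wins(), 3))  # ~0.479, i.e. roughly the 48% figure above
```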

Combining with my other views, maybe one is looking at a cost of tenths of a percent. You would presumably hope to avoid this by having the world coordinate even a tiny bit (I thought about this a bit here). Overall I’ll stick with regrettable but far from catastrophic.

(My bigger issue in practice with efficiency losses is similar to your view that people ought to have really high confidence. I think it is easy to make sloppy arguments that one approach to AI is 10% as effective as another, when in fact it is 0.0001% as effective, and that holding yourself to asymptotic equivalence is a more productive standard unless it turns out to be unrealizable.)

• I’m skeptical of Orthogonality. My basic concern is that it can be interpreted as true-but-useless for purposes of defending it, and useful-but-implausible when trying to get it to do some work for you, and that the user of the idea may not notice the switch-a-roo. Consider the following statements: there are arbitrarily powerful cognitive agents

1. which have circular preferences,

2. with the goal of paperclip maximization,

3. with the goal of phlogiston maximization,

4. which are not reflective,

5. with values aligned with humanity.

Rehearsing the arguments for Orthogonality and then evaluating these questions, I find my mind gets very slippery.

Orthogonality proponents I’ve spoken to say 1 is false, because “goal space” excludes circular preferences. But there are very likely other restrictions on goal space imposed once an agent groks things like symmetry. If “goal space” means whatever goals are not excluded by our current understanding of intelligence, I think Orthogonality is unlikely (and poorly formulated). If it means “whatever goals powerful cognitive agents can have”, Orthogonality is tautological and distracts us from pursuing the interesting question of what that space of goals actually is. Let’s narrow down goal space.

If 2 and 3 get different answers, why? Might a paperclip maximizer take liberties with what is considered a paperclip once it learns that papers can be electrostatically attracted?

If 4 is easily true, I wonder if we’re defining “mind space” too broadly to be useful. I’d really like humanity to focus on the sector of mind space that we should focus on in order to get a good outcome. The forms of Orthogonality which are clearly (to me) true distract from the interesting question of what that sector actually is. Let’s narrow down mind space.

For 5, I don’t find Orthogonality to be a convincing argument. A more convincing argument is to shoot for “humanity can grow up to have arbitrarily high cognitive power” instead.

• As regards 4, I’d say that while there may theoretically be arbitrarily powerful agents in math-space that are non-reflective, it’s not clear that this is a pragmatic truth about most of the AIs that would exist in the long run—although we might be able to get very powerful non-reflective genies. So we’re interested in some short-run solutions that involve nonreflectivity, but not long-run solutions.

I don’t think 2 and 3 do have different answers. See the argument about what happens if you use an AI that only considers classical atomic hypotheses, in https://arbital.com/p/5c?lens=4657963068455733951

1 seems a bit odd. You could argue that the Argument from Mind Design Space Width supports it, but this just demonstrates that this initial argument may be too crude to do more than act as an intuition pump. By the time we’re talking about the Argument from Reflective Stability, I don’t think that argument supports “you can have circular preferences” any more. It’s also not clear to me why 1 matters—all the arguments I know about, that depend on Orthogonality, still go through if we restrict ourselves to only agents with noncircular preferences. A friendly one should still exist, a paperclip maximizer should still exist.

• 1 seems a bit odd. You could argue that the Argument from Mind Design Space Width supports it, but this just demonstrates that this initial argument may be too crude to do more than act as an intuition pump. By the time we’re talking about the Argument from Reflective Stability, I don’t think that argument supports “you can have circular preferences” any more.

That’s exactly the point (except I’m not sure what you mean by “the Argument from Reflective Stability”; the capital letters suggest you’re talking about something very specific). The arguments in favor of Orthogonality just seem like crude intuition pumps. The purpose of 1 was not to actually talk about circular preferences, but to pick an example of something supported by largeness of mind design space, but which we expect to break for some other reason. Orthogonality feels like claiming the existence of an integer with two distinct prime factorizations because “there are so many integers”. Like the integers, mind design space is vast, but not arbitrary. It seems unlikely to me that there cannot be theorems showing that sufficiently high cognitive power implies some restriction on goals.

• There’s 6 successively stronger arguments listed under “Arguments” in the current version of the page. Mind design space largeness and Humean freedom of preference are #1 and #2. By the time we get to the Gandhi stability argument #3, and the higher tiers of argument above (especially including the tiling agents that seem to directly show stability of arbitrary goals), we’re outside the domain of arguments that could specialize equally well to supporting circular preferences. The reason for listing #1 and #2 as arguments anyway is not that they finish the argument, but that (a) before the later tiers of argument were developed #1 and #2 were strong intuition-pumps in the correct direction and (b) even if they might arguably prove too much if applied sloppily, they counteract other sloppy intuitions along the lines of “What does this strange new species ‘AI’ want?” or “But won’t it be persuaded by…” Like, it’s important to understand that even if it doesn’t finish the argument, it is indeed the case that “All AIs have property P” has a lot of chances to be wrong and “At least one AI has property P” has a lot of chances to be right. It doesn’t finish the story—if we took it as finishing the story, we’d be proving much too much, like circular preferences—but it pushes the story in a long way in a particular direction compared to coming in with a prior frame of mind about “What will AIs want? Hm, paperclips doesn’t sound right, I bet they want mostly to be left alone.”

• Thanks for the reply. I agree that strong Inevitability is unreasonable, and I understand the function of #1 and #2 in disrupting a prior frame of mind which assumes strong Inevitability, but that’s not the only alternative to Orthogonality. I’m surprised that the arguments are considered successively stronger arguments in favor of Orthogonality, since #6 basically says “under reasonable hypotheses, Orthogonality may well be false.” (I admit that’s a skewed reading, but I don’t know what the referenced ongoing work looks like, so I’m skipping that bit for now. [Edit: is this “tiling agents”? I’m not familiar with that work, but I can go learn about it.])

The other arguments are interesting commentary, but don’t argue that Orthogonality is true for agents we ought to care about.

• Gandhian stability argues that self-modifying agents will try to preserve their preference systems, but not that they can become arbitrarily powerful while doing so. As it happens, circular preference systems illustrate how Gandhian stability could limit how powerful a cognitive agent can become.

• The unbounded agents argument says Orthogonality is true when “mind space” is broader than what we care about.

• The search tractability argument looks like a statement about the relative difficulty of accomplishing different goals, not the relative difficulties of holding those goals. I don’t mean to dismiss the argument, but I don’t understand it. I’m not even clear on exactly what the argument is saying about the tractability of searching for strategies for different goals. That it’s the same for all possible goals?

• The section on Moral Internalism is slightly inaccurate, or at least misleading. Internalism is the metaethical view that an agent can not judge something to be right and yet still not be the least bit motivated to perform it. As such, it is really a semantic claim about the meaning of moral vocabulary: whether or not it is part of the meaning of “that is right” or “that is wrong” that the speaker approves or disapproves respectively of an action. Internalism, then, (as intended by analytic philosophers,) is totally compatible with the Orthogonality Thesis. (Internalism + Orthogonality = noncognitivism or relativism or nihilism.) IIRC, Hume himself was an Internalist!

I suggest changing the section to either “Realist Moral Internalism” or a more comprehensive examination of the options available to the AI-grade philosopher when it comes to moral motivation.