Orthogonality Thesis


The Orthogonality Thesis asserts that there can exist arbitrarily intelligent agents pursuing any kind of goal.

The strong form of the Orthogonality Thesis says that there’s no extra difficulty or complication in creating an intelligent agent to pursue a goal, above and beyond the computational tractability of that goal.

Suppose some strange alien came to Earth and credibly offered to pay us one million dollars’ worth of new wealth every time we created a paperclip. We’d encounter no special intellectual difficulty in figuring out how to make lots of paperclips.

That is, minds would readily be able to reason about:

  • How many paperclips would result, if I pursued a policy \(\pi_0\)?

  • How can I search out a policy \(\pi\) that happens to have a high answer to the above question?

The Orthogonality Thesis asserts that since these questions are not computationally intractable, it’s possible to have an agent that tries to make paperclips without being paid, because paperclips are what it wants. The strong form of the Orthogonality Thesis says that there need be nothing especially complicated or twisted about such an agent.
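The two questions above are ordinary computations. As a minimal sketch (the toy world model and the names `expected_paperclips` and `best_policy` are our own invented illustration, not any real agent architecture):

```python
# Toy world: a "policy" is just a number of factories to build;
# each factory converts one unit of resource into two paperclips.
RESOURCES = 10

def expected_paperclips(policy: int) -> int:
    """Question 1: how many paperclips result if I pursue this policy?"""
    factories = min(policy, RESOURCES)   # can't build more than resources allow
    remaining = RESOURCES - factories
    return 2 * factories + remaining     # factories double their input

def best_policy(candidates) -> int:
    """Question 2: search for a policy with a high answer to question 1."""
    return max(candidates, key=expected_paperclips)

print(best_policy(range(0, 21)))  # → 10: spend everything on factories
```

Nothing in either function mentions why paperclips are wanted; the evaluation and the search are well-posed computations either way.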

The Orthogonality Thesis is a statement about computer science, an assertion about the logical design space of possible cognitive agents. Orthogonality says nothing about whether a human AI researcher on Earth would want to build an AI that made paperclips, or conversely, want to make a nice AI. The Orthogonality Thesis just asserts that the space of possible designs contains AIs that make paperclips. And also AIs that are nice, to the extent there’s a sense of “nice” where you could say how to be nice to someone if you were paid a billion dollars to do that, and to the extent you could name something physically achievable to do.

This contrasts with inevitabilist theses which might assert, for example:

  • “It doesn’t matter what kind of AI you build, it will turn out to only pursue its own survival as a final end.”

  • “Even if you tried to make an AI optimize for paperclips, it would reflect on those goals, reject them as being stupid, and embrace a goal of valuing all sapient life.”

The reason to talk about Orthogonality is that it’s a key premise in two highly important policy-relevant propositions:

  • It is possible to build a nice AI.

  • It is possible to screw up when trying to build a nice AI, and if you do, the AI will not automatically decide to be nice instead.

Orthogonality does not require that all agent designs be equally compatible with all goals. E.g., the agent architecture AIXI-tl can only be formulated to care about direct functions of its sensory data, like a reward signal; it would not be easy to rejigger the AIXI architecture to care about creating massive diamonds in the environment (let alone any more complicated environmental goals). The Orthogonality Thesis states “there exists at least one possible agent such that…” over the whole design space; it’s not meant to be true of every particular agent architecture and every way of constructing agents.

Orthogonality is meant as a descriptive statement about reality, not a normative assertion. Orthogonality is not a claim about the way things ought to be; nor a claim that moral relativism is true (e.g. that all moralities are on equally uncertain footing according to some higher metamorality that judges all moralities as equally devoid of what would objectively constitute a justification). Claiming that paperclip maximizers can be constructed as cognitive agents is not meant to say anything favorable about paperclips, nor anything derogatory about sapient life.

Thesis statement: Goal-directed agents are as tractable as their goals.

Suppose an agent’s utility function said, “Make the SHA512 hash of a digitized representation of the quantum state of the universe be 0 as often as possible.” This would be an exceptionally intractable kind of goal. Even if aliens offered to pay us to do that, we still couldn’t figure out how.
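The intractability here is an asymmetry between scoring and searching: checking whether one candidate state hashes to zero is a single cheap hash evaluation, while finding such a state is a preimage search with no known shortcut. A sketch using Python’s standard `hashlib` (the toy functions are ours):

```python
import hashlib

def goal_score(state: bytes) -> bool:
    """Cheap direction: evaluating the goal on ONE candidate is one hash."""
    return hashlib.sha512(state).digest() == bytes(64)  # 64 zero bytes

print(goal_score(b"a digitized universe"))  # almost certainly False

# Intractable direction: the best known way to *optimize* this goal is
# brute-force search over candidate states -- about 2**n hash evaluations
# for n-bit states, which dwarfs physically available compute long before
# n reaches the size of any real universe-description.
def brute_force_evaluations(n_bits: int) -> int:
    return 2 ** n_bits

print(brute_force_evaluations(256))  # already a 78-digit number
```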

Similarly, even if aliens offered to pay us, we wouldn’t be able to optimize the goal “Make the total number of apples on this table be simultaneously even and odd” because the goal is self-contradictory.

But suppose instead that some strange and extremely powerful aliens offer to pay us the equivalent of a million dollars in wealth for every paperclip that we make, or even a galaxy’s worth of new resources for every new paperclip we make. If we imagine ourselves having a human reason to make lots of paperclips, the optimization problem “How can I make lots of paperclips?” would pose us no special difficulty. The factual questions:

  • How many paperclips would result, if I pursued a policy \(\pi_0\)?

  • How can I search out a policy \(\pi\) that happens to have a high answer to the above question?

…would not be especially computationally burdensome or intractable.

We also wouldn’t forget to harvest and eat food while making paperclips. Even if offered goods of such overwhelming importance that making paperclips was at the top of everyone’s priority list, we could go on being strategic about which other actions were useful in order to make even more paperclips; this also wouldn’t be an intractably hard cognitive problem for us.

The weak form of the Orthogonality Thesis says, “Since the goal of making paperclips is tractable, somewhere in the design space is an agent that optimizes that goal.”

The strong form of Orthogonality says, “And this agent doesn’t need to be twisted or complicated or inefficient or have any weird defects of reflectivity; the agent is as tractable as the goal.” That is: When considering the necessary internal cognition of an agent that steers outcomes to achieve high scores in some outcome-scoring function \(U,\) there’s no added difficulty in that cognition except whatever difficulty is inherent in the question “What policies would result in consequences with high \(U\)-scores?”
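One way to read this claim in code: the search machinery can be written once, taking the outcome-scoring function \(U\) as a parameter, so swapping goals adds no cognitive machinery beyond whatever it costs to compute \(U\) itself. A minimal sketch (the two-action world and both scoring functions are invented for illustration):

```python
# A generic one-step planner: all the cognition is in predicting outcomes
# and searching; the goal U is just a plugged-in scoring function.
def plan(actions, predict_outcome, U):
    return max(actions, key=lambda a: U(predict_outcome(a)))

# Toy world: an action yields an outcome described by a dict.
def predict_outcome(action):
    return {"paperclips": {"wire": 5, "idle": 0}.get(action, 0),
            "happiness":  {"sing": 7, "idle": 1}.get(action, 0)}

paperclip_U = lambda outcome: outcome["paperclips"]
happiness_U = lambda outcome: outcome["happiness"]

actions = ["wire", "sing", "idle"]
print(plan(actions, predict_outcome, paperclip_U))  # → 'wire'
print(plan(actions, predict_outcome, happiness_U))  # → 'sing'
```

The planner is identical in both calls; only the plugged-in \(U\) differs, which is the sense in which the agent is “as tractable as the goal.”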

This could be restated as, “To whatever extent you (or a superintelligent version of you) could figure out how to get a high-\(U\) outcome if aliens offered to pay you huge amounts of resources to do it, the corresponding agent that terminally prefers high-\(U\) outcomes can be at least that good at achieving \(U\).” This assertion would be false if, for example, an intelligent agent that terminally wanted paperclips was limited in intelligence by the defects of reflectivity required to make the agent not realize how pointless it is to pursue paperclips; whereas a galactic superintelligence being paid to pursue paperclips could be far more intelligent and strategic because it didn’t have any such defects.

For purposes of stating Orthogonality’s precondition, the “tractability” of the computational problem of \(U\)-search should be taken as including only the object-level search problem of computing external actions to achieve external goals. If there turn out to be special difficulties associated with computing “How can I make sure that I go on pursuing \(U\)?” or “What kind of successor agent would want to pursue \(U\)?” whenever \(U\) is something other than “be nice to all sapient life”, then these new difficulties contradict the intuitive claim of Orthogonality. Orthogonality is meant to be empirically-true-in-practice, not true-by-definition because of how we sneakily defined “optimization problem” in the setup.

Orthogonality is not literally, absolutely universal, because theoretically ‘goals’ can include such weird constructions as “Make paperclips for some terminal reason other than valuing paperclips” and similar such statements that require cognitive algorithms and not just results. To the extent that goals don’t single out particular optimization methods, and just talk about paperclips, the Orthogonality claim should cover them.

Summary of arguments

Some arguments for Orthogonality, in rough order of when they were first historically proposed and the strength of Orthogonality they argue for:

Size of mind design space

The space of possible minds is enormous, and all human beings occupy a relatively tiny volume of it—we all have a cerebral cortex, cerebellum, thalamus, and so on. The sense that AIs are a particular kind of alien mind that ‘will’ want some particular things is an undermined intuition. “AI” really refers to the entire design space of possibilities outside the human. Somewhere in that vast space are possible minds with almost any kind of goal. For any thought you have about why a mind in that space ought to work one way, there’s a different possible mind that works differently.

This is an exceptionally generic sort of argument that could apply equally well to any property \(P\) of a mind, but is still weighty even so: If we consider a space of minds a million bits wide, then any argument of the form “Some mind has property \(P\)” has \(2^{1,000,000}\) chances to be true and any argument of the form “No mind has property \(P\)” has \(2^{1,000,000}\) chances to be false.
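The arithmetic behind this point can be made explicit. If, very crudely, each of the \(2^n\) designs independently had some small chance \(p\) of having property \(P\), the chance that no design has it is \((1-p)^{2^n},\) which collapses toward zero long before \(n\) reaches a million bits. A toy calculation (the independence assumption is ours, purely for illustration):

```python
import math

def log_prob_no_mind_has_P(n_bits: int, p: float) -> float:
    """log of (1 - p)**(2**n_bits): chance that NO design has property P,
    under the (crude) assumption that each design has P independently."""
    return (2 ** n_bits) * math.log1p(-p)

# Even with p = one in a billion, a mere 64-bit design space makes
# "no mind has P" astronomically unlikely:
print(log_prob_no_mind_has_P(64, 1e-9))  # a huge negative number
```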

This form of argument isn’t very specific to the nature of goals as opposed to any other kind of mental property. But it’s still useful for snapping out of the frame of mind of “An AI is a weird new kind of person, like the strange people of the Tribe Who Live Across The Water” and into the frame of mind of “The space of possible things we could call ‘AI’ is enormously wider than the space of possible humans.” Similarly, snapping out of the frame of mind of “But why would it pursue paperclips, when it wouldn’t have any fun that way?” and into the frame of mind “Well, I like having fun, but are there some possible minds that don’t pursue fun?”

Instrumental convergence

A sufficiently intelligent paperclip maximizer isn’t disadvantaged in day-to-day operations relative to any other goal, so long as Clippy can estimate at least as well as you can how many more paperclips could be produced by pursuing instrumental strategies like “Do science research (for now)” or “Pretend to be nice (for now)”.

Restating: for at least some agent architectures, it is not necessary for the agent to have an independent terminal value in its utility function for “do science” in order for it to do science effectively; it is only necessary for the agent to understand at least as well as we do why certain forms of investigation will produce knowledge that will be useful later (e.g. for paperclips). When you say, “Oh, well, it won’t be interested in electromagnetism since it has no pure curiosity, it will only want to peer at paperclips in particular, so it will be at a disadvantage relative to more curious agents”, you are postulating that you know a better operational policy than the agent does for producing paperclips; an instrumentally efficient agent would know this as well as you do, and be at no operational disadvantage due to its simpler utility function.
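The point can be made concrete in a two-step toy world: an agent whose utility counts only paperclips still picks “do research first” when its own world model says research multiplies later output. A sketch (the numbers and action names are invented):

```python
# Two-step toy world: first choose to research or build; research yields no
# paperclips now but triples the productivity of the second step.
def total_paperclips(first_action: str) -> int:
    productivity = 3 if first_action == "research" else 1
    clips_now = 0 if first_action == "research" else 10
    clips_later = 10 * productivity      # second step: always build
    return clips_now + clips_later

# Clippy's utility mentions ONLY paperclips -- no term for curiosity --
# yet lookahead makes "research" the chosen first action.
best_first = max(["research", "build"], key=total_paperclips)
print(best_first, total_paperclips(best_first))  # → research 30
```

No “curiosity” term appears anywhere; the instrumental value of research is derived entirely from the paperclip count.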

Reflective stability

Suppose that Gandhi doesn’t want people to be murdered. Imagine that you offer Gandhi a pill that will make him start wanting to kill people. If Gandhi knows that this is what the pill does, Gandhi will refuse the pill, because Gandhi expects the result of taking the pill to be that future-Gandhi wants to murder people and then murders people and then more people will be murdered and Gandhi regards this as bad. Similarly, a sufficiently intelligent paperclip maximizer will not self-modify to act according to “actions which promote the welfare of sapient life” instead of “actions which lead to the most paperclips”, because then future-Clippy will produce fewer paperclips, and then there will be fewer paperclips, so present-Clippy does not evaluate this self-modification as producing the highest number of expected future paperclips.
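The Gandhi argument is a claim about how a self-modification gets evaluated: by the agent’s current utility function, applied to the predicted consequences of running the modified agent. A minimal sketch (all names here, like `accept_modification`, are our own toy illustration):

```python
# An agent's utility function; self-modification would swap in a new one.
def paperclip_utility(outcome):           # current utility: count paperclips
    return outcome["paperclips"]

def welfare_utility(outcome):             # a candidate successor utility
    return outcome["welfare"]

def predict_outcome_of_running(utility):
    """World model: an agent optimizing `utility` produces these outcomes."""
    if utility is paperclip_utility:
        return {"paperclips": 100, "welfare": 0}
    return {"paperclips": 3, "welfare": 100}   # a welfare-optimizing successor

def accept_modification(current_utility, successor_utility) -> bool:
    """The successor is judged by the CURRENT utility, not its own."""
    now = current_utility(predict_outcome_of_running(current_utility))
    then = current_utility(predict_outcome_of_running(successor_utility))
    return then > now

print(accept_modification(paperclip_utility, welfare_utility))  # → False
```

Note the symmetry: a welfare-valuing agent run through the same check would likewise refuse to become a paperclip maximizer; stability under reflection does not single out any particular goal.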

Hume’s is/ought type distinction

David Hume observed an apparent difference of type between is-statements and ought-statements:

“In every system of morality, which I have hitherto met with, I have always remarked, that the author proceeds for some time in the ordinary ways of reasoning, and establishes the being of a God, or makes observations concerning human affairs; when all of a sudden I am surprised to find, that instead of the usual copulations of propositions, is, and is not, I meet with no proposition that is not connected with an ought, or an ought not. This change is imperceptible; but is however, of the last consequence.”

Hume was originally concerned with the question of where we get our ought-propositions, since (said Hume) there didn’t seem to be any way to derive an ought-proposition except by starting from another ought-proposition. We can figure out that the Sun is shining just by looking out the window; we can deduce that the outdoors will be warmer than otherwise by knowing about how sunlight imparts thermal energy when absorbed. On the other hand, to get from there to “And therefore I ought to go outside”, some kind of new consideration must have entered, along the lines of “I should get some sunshine” or “It’s better to be warm than cold.” Even if this prior ought-proposition is of a form that to humans seems very natural, or taken-for-granted, or culturally widespread, like “It is better for people to be happy than sad”, there must have still been some prior assumption which, if we write it down in words, will contain words like ought, should, better, and good.

Again translating Hume’s idea into more modern form, we can see ought-sentences as special because they invoke some ordering that we’ll designate \(<V.\) E.g. “It’s better to go outside than stay inside” asserts “Staying inside \(<V\) going outside”. Whenever we make a statement about one outcome or action being “better”, “preferred”, “good”, “prudent”, etcetera, we can see this as implicitly ordering actions and outcomes under this \(<V\) relation. Some assertions, the ought-laden assertions, mention this \(<V\) relation; other propositions just talk about energetic photons in sunlight.

Since we’ve put on hold the question of exactly what sort of entity this \(<V\) relation is, we don’t need to concern ourselves for now with the question of whether Hume was right that we can’t derive \(<V\)-relations just from factual assertions. For purposes of Orthogonality, we only need a much weaker version of Hume’s thesis, the observation that we can apparently separate out a set of propositions that don’t invoke \(<V,\) what we might call ‘simple facts’ or ‘questions of simple fact’. Furthermore, we can figure out simple facts just by making observations and considering other simple facts.

We can’t necessarily get all \(<V\)-mentioning propositions without considering simple facts. The \(<V\)-mentioning proposition “It’s better to be outside than inside” may depend on the non-\(<V\)-mentioning simple fact “It is sunny outside.” But we can figure out whether it’s sunny outside, without considering any ought-propositions.

There are two potential ways we can conceptualize the relation of Hume’s is-ought separation to Orthogonality.

The relatively simpler conceptualization is to treat the relation ‘makes more paperclips’ as a kind of new ordering \(>_{paperclips}\) that can, in a very general sense, fill in the role in a paperclip maximizer’s reasoning that would in our own reasoning be taken up by \(<V.\) Then Hume’s is-ought separation seems to suggest that this paperclip maximizer can still have excellent reasoning about empirical questions like “Which policy leads to how many paperclips?” because is-questions can be thought about separately from ought-questions. When Clippy disassembles you to turn you into paperclips, it doesn’t have a values disagreement with you—it’s not the case that Clippy is doing that action because it thinks you have low value under \(<V.\) Clippy’s actions just reflect its computation of the entirely separate ordering \(>_{paperclips}.\)

The deeper conceptualization is to see a paperclip maximizer as being constructed entirely out of is-questions. The questions “How many paperclips will result conditional on action \(\pi_0\) being taken?” and “What is an action \(\pi\) that would yield a large number of expected paperclips?” are pure is-questions, and (arguendo) everything a paperclip maximizer needs to consider in order to make as many paperclips as possible can be seen as a special case of one of these questions. When Clippy disassembles you for your atoms, it’s not disagreeing with you about the value of human life, or what it ought to do, or which outcomes are better or worse. All of those are ought-propositions. Clippy’s action is only informative about the true is-proposition ‘turning this person into paperclips causes there to be more paperclips in the universe’, and tells us nothing about any content of the mysterious \(<V\)-relation, because Clippy wasn’t computing anything to do with \(<V.\)
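Both conceptualizations can be pictured as code in which a single shared is-layer answers factual questions, and what differs between agents is only the ordering applied on top of it. A toy sketch (the “world facts” are invented):

```python
# Shared is-layer: pure factual predictions, no ordering mentioned anywhere.
def predict(action):
    return {"leave_human": {"paperclips": 0, "humans_alive": 1},
            "disassemble": {"paperclips": 9, "humans_alive": 0}}[action]

# Each agent = the same is-layer plus a different ordering over outcomes.
def choose(ordering_key):
    return max(["leave_human", "disassemble"],
               key=lambda a: ordering_key(predict(a)))

clippy_choice = choose(lambda o: o["paperclips"])      # >_paperclips
value_choice  = choose(lambda o: o["humans_alive"])    # a stand-in for <V

print(clippy_choice, value_choice)  # → disassemble leave_human
```

The two agents never disagree about any output of `predict`; their choices diverge only in the key function, which is the code-level image of Clippy computing \(>_{paperclips}\) rather than \(<V.\)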

The second viewpoint may be helpful for seeing why Orthogonality doesn’t require moral relativism. If we imagine Clippy as having a different version \(>_{paperclips}\) of something very much like the value system \(<V,\) then we may be tempted to reprise the entire Orthogonality debate at one remove, and ask, “But doesn’t Clippy see that \(<V\) is more justified than \(>_{paperclips}\)? And if this fact isn’t evident to Clippy, who is supposed to be very intelligent and have no defects of reflectivity and so on, doesn’t that imply that \(<V\) really isn’t any more justified than \(>_{paperclips}\)?”

We could reply to that question by carrying the shallow conceptualization of Humean Orthogonality a step further, and saying, “Ah, when you talk about justification, you are again invoking a mysterious concept that doesn’t appear just in talking about the photons in sunlight. We could see propositions like this as involving a new idea \(\ll_W\) that deals with which \(<\)-systems are less or more justified, so that ‘\(<V\) is more justified than \(>_{paperclips}\)’ states \(>_{paperclips} \ll_W <V.\) But Clippy doesn’t compute \(\ll_W,\) it computes \(\gg_{paperclips},\) so Clippy’s behavior doesn’t tell us anything about what is justified.”

But this again tempts us to imagine Clippy as having its own version of the mysterious \(\ll_W\) to which Clippy is equally attached, and to imagine Clippy as arguing with us or disagreeing with us within some higher metasystem.

So—putting on hold the true nature of our mysterious \(<V\)-mentioning concepts like ‘goodness’ or ‘better’ and the true nature of our \(\ll_W\)-mentioning concepts like ‘justified’ or ‘valid moral argument’—the deeper idea would be that Clippy is just not computing anything to do with \(<V\) or \(\ll_W\) at all. If Clippy self-modifies and writes new decision algorithms into place, these new algorithms will be selected according to the is-criterion “How many future paperclips will result if I write this piece of code?” and not anything resembling any arguments that humans have ever had over which ought-systems are justified. Clippy doesn’t ask whether its new decision algorithm is justified; it asks how many expected paperclips will result from executing the algorithm (and this is a pure is-question whose answers are either true or false as a matter of simple fact).

If we think Clippy is very intelligent, and we watch Clippy self-modify into a new paperclip maximizer, we are only learning is-facts about which executing algorithms lead to more paperclips existing. We are not learning anything about what is right, or what is justified, and in particular we’re not learning that ‘do good things’ is objectively no better justified than ‘make paperclips’. Even if that assertion were true under the mysterious \(\ll_W\)-relation on moral systems, you wouldn’t be able to learn that truth by watching Clippy, because Clippy never bothers to evaluate \(\ll_W\) or any other analogous justification-system \(\gg_{something}\).

(This is about as far as one can go in disentangling Orthogonality in computer science from normative metaethics without starting to pierce the mysterious opacity of \(<V.\))

Thick definitions of rationality or intelligence

Some philosophers responded to Hume’s distinction of empirical rationality from normative reasoning by advocating ‘thick’ definitions of intelligence that included some statement about the ‘reasonableness’ of the agent’s ends.

For pragmatic purposes of AI alignment theory, if an agent is cognitively powerful enough to build Dyson Spheres, it doesn’t matter whether that agent is defined as ‘intelligent’ or its ends are defined as ‘reasonable’. A definition of the word ‘intelligence’ contrived to exclude paperclip maximization doesn’t change the empirical behavior or empirical power of a paperclip maximizer.

Relation to moral internalism

While Orthogonality seems orthogonal to most traditional philosophical questions about metaethics, it does outright contradict some possible forms of moral internalism. For example, one could hold that by the very definition of rightness, knowledge of what is right must be inherently motivating to any entity that understands that knowledge. This is not the most common meaning of “moral internalism” held by modern philosophers, who instead seem to hold something like, “By definition, if I say that something is morally right, among my claims is that the thing is motivating to me.” We haven’t heard of a standard term for the position that, by definition, what is right must be universally motivating; we’ll designate that here as “universalist moral internalism”.

We can potentially resolve this tension between Orthogonality and this assertion about the nature of rightness by:

  • Believing there must be some hidden flaw in the reasoning about a paperclip maximizer.

  • Saying “No True Scotsman” to the paperclip maximizer being intelligent, even if it’s building Dyson Spheres.

  • Saying “No True Scotsman” to the paperclip maximizer “truly understanding” \(<V,\) even if Clippy is capable of predicting with extreme accuracy what humans will say and think about \(<V\), and Clippy does not suffer any other deficit of empirical prediction because of this lack of ‘understanding’, and Clippy does not require any special twist of its mind to avoid being compelled by its understanding of \(<V.\)

  • Rejecting Orthogonality, and asserting that a paperclip maximizer must fall short of being an intact mind in some way that implies an empirical capabilities disadvantage.

  • Accepting nihilism, since a true moral argument must be compelling to everyone, and no moral argument is compelling to a paperclip maximizer. (Note: A paperclip maximizer doesn’t care about whether clippiness must be compelling to everyone, which makes this argument self-undermining. See also Rescuing the utility function for general arguments against adopting nihilism when you discover that your mind’s representation of something was running skew to reality.)

  • Giving up on universalist moral internalism as an empirical proposition; AIXI-tl and Clippy empirically do different things, and will not be compelled to optimize the same goal no matter what they learn or know.

Constructive specifications of orthogonal agents

We can exhibit unbounded formulas for agents larger than their environments that optimize any given goal, such that Orthogonality is visibly true about agents within that class. Arguments about what all possible minds must do are clearly false for these particular agents, contradicting all strong forms of inevitabilism. These minds use huge amounts of computing power, but there is no known reason to expect that, e.g., worthwhile-happiness-maximizers have bounded analogues while paperclip-maximizers do not.

The simplest unbounded formulas for orthogonal agents don’t involve reflectivity (the corresponding agents have no self-modification options, though they may create subagents). If we only had those simple formulas, it would theoretically leave open the possibility that self-reflection could somehow negate Orthogonality (if reflective agents must inevitably have a particular utility function, and reflective agents are at a strong advantage relative to nonreflective agents). But there is already ongoing work on describing reflective agents that have the preference-stability property, and work toward increasingly bounded and approximable formulations of those. There is no hint from this work that Orthogonality is false; all the specifications have a free choice of utility function.

As of early 2017, the most recent work on tiling agents involves fully reflective, reflectively stable, logically uncertain agents whose computing time is roughly doubly-exponential in the size of the propositions considered.

So if you want to claim Orthogonality is false because e.g. all AIs will inevitably end up valuing all sapient life, you need to claim that the process of reducing the already-specified doubly-exponential computing-time decision algorithm to a more tractable decision algorithm can only be made realistically efficient for decision algorithms computing “Which policies protect all sapient life?” and is impossible to make efficient for decision algorithms computing “Which policies lead to the most paperclips?”

Since work on tiling agent designs hasn’t halted, one may need to backpedal and modify this impossibility claim further as more efficient decision algorithms are invented.

Epistemic status

Among people who’ve seriously delved into these issues and are aware of the more advanced arguments for Orthogonality, we’re not aware of anyone who still defends “universalist moral internalism” as described above, and we’re not aware of anyone who thinks that arbitrary sufficiently-real-world-capable AI systems automatically adopt human-friendly terminal values.

Paul Christiano has said (if we’re quoting him correctly) that although it’s not his dominant hypothesis, he thinks some significant probability should be awarded to the proposition that only some subset of tractable utility functions, potentially excluding human-friendly ones or those of high cosmopolitan value, can be stable under reflection in powerful bounded AGI systems; e.g. because only direct functions of sense data can be adequately supervised in internal retraining. (This would be bad news rather than good news for AGI alignment and long-term optimization of human values.)


Hume’s Guillotine

Orthogonality can be seen as corresponding to a philosophical principle advocated by David Hume, whose phrasings included, “Tis not contrary to reason to prefer the destruction of the whole world to the scratching of my finger.” In our terms: an agent whose preference ordering over outcomes scores the destruction of the world more highly than the scratching of Hume’s finger is not thereby impeded from forming accurate models of the world or searching for policies that achieve various outcomes.

In modern terms, we’d say that Hume observed an apparent type distinction between is-statements and ought-statements, as in the passage quoted above.

“It is sunny outside” is an is-proposition. It can potentially be deduced solely from other is-facts, like “The Sun is in the sky” plus “The Sun emits sunshine”. If we now furthermore say “And therefore I ought to go outside”, we’ve introduced a new type of sentence, which, Hume argued, cannot be deduced just from is-statements like “The Sun is in the sky” or “I am low in Vitamin D”. Even if the prior ought-sentence seems to us very natural, or taken-for-granted, like “It is better to be happy than sad”, there must (Hume argued) have been some prior assertion or rule which, if we write it down in words, will contain words like ought, should, better, and good.

Again translating Hume’s idea into more modern form, we can see ought-sentences as special because they invoke some ordering that we’ll designate \(<V.\) E.g. “It’s better to go outside than stay inside” asserts “Staying inside \(<V\) going outside”. Whenever we make a statement about one outcome or action being “better”, “preferred”, “good”, “prudent”, etcetera, we can see this as implicitly ordering actions and outcomes under this \(<V\) relation. We can put temporarily on hold the question of what sort of entity \(<V\) may be; but we can already go ahead and observe that some assertions, the ought-assertions, mention this \(<V\) relation; and other propositions just talk about the frequency of photons in sunlight.

We could rephrase Hume’s type distinction as observing that, within the set of all propositions, we can separate out a core set of propositions that don’t invoke \(<V,\) what we might call ‘simple facts’. Furthermore, we can figure out simple facts just by making observations and considering other simple facts; the core set is closed under some kind of reasoning relation. This doesn’t imply that we can get \(<V\)-sentences without considering simple facts. The \(<V\)-mentioning proposition “It’s better to be outside than inside” can depend on the non-\(<V\)-mentioning proposition “It is sunny outside.” But we can figure out whether it’s sunny outside, without considering any oughts.

We then observe that questions like “How many paperclips will result conditional on action \(\pi_0\) being taken?” and “What is an action \(\pi\) that would yield a large number of expected paperclips?” are pure is-questions, meaning that we can figure out the answer without considering \(<V\)-mentioning propositions. So if there’s some agent whose nature is just to output actions \(\pi\) that are high in expected paperclips, the fact that this agent wasn’t considering \(<V\)-propositions needn’t hinder it from figuring out which actions are high in expected paperclips.

To establish that the paperclip maximizer need not suffer any defect of reality-modeling or planning or reflectivity, we need a bit more than the above argument. An efficient agent needs to prioritize which experiments to run, or choose which questions to spend computing power thinking about, and this choice seems to invoke some ordering. In particular, we need the instrumental convergence thesis: that activities like gathering relevant information, prioritizing experiments, and improving one's own reasoning are instrumentally useful for a very wide range of terminal orderings, so an agent pursuing paperclips can prioritize its questions and experiments by their expected contribution to paperclips, just as an agent with any other goal would prioritize by expected contribution to that goal.

A further idea of Orthogonality is that many possible orderings \(<_U,\) including the 'number of resulting paperclips' ordering, are computationally tractable to search under: finding a policy that ranks high in \(<_U\) is no harder, above and beyond the tractability of \(<_U\) itself, than finding a policy that ranks high in any other ordering.

An is-description of a system can produce assertions like "If the agent does action 1, then the whole world will be destroyed except for David Hume's little finger, and if the agent does action 2, then David Hume's finger will be scratched"; these are material predictions on the order of "If water is put on this sponge, the sponge will get wet." To get from this is-statement to an ordering-statement like "action 1 \(<_V\) action 2," we need some order-bearing statement like "destruction of world \(<_V\) scratching of David Hume's little finger", or some order-introducing rule like "If action 1 causes the destruction of the world and action 2 does not, introduce a new sentence 'action 1 \(<_V\) action 2'."

Taking this philosophical principle back to the notion of Orthogonality as a thesis in computer science: Since the type of 'simple material facts' is distinct from the type of 'simple material facts and preference orderings', it seems that we should be able to have agents that are just as good at thinking about the material facts, but that output actions high in a different preference ordering.

The implication for Orthogonality as a thesis about computer science is that if one system of computation outputs actions according to whether they're high in the ordering \(<_P,\) it should be possible to construct another system that outputs actions high in a different ordering (even if such actions are low in \(<_P\)) without this presenting any bar to the system's ability to reason about natural systems. A paperclip maximizer can have very good knowledge of the is-sentences about which actions lead to which consequences, while still outputting actions preferred under the ordering "Which action leads to the most paperclips?" instead of e.g. "Which action leads to the morally best consequences?" It is not that the paperclip maximizer is ignorant or mistaken about \(<_P,\) but that the paperclip maximizer just doesn't output actions according to \(<_P.\)
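A toy illustration of this implication (the actions and outcome numbers are hypothetical): two agents can share one and the same predictive model of which actions lead to which consequences, and differ only in the ordering they use to select an action.

```python
# Sketch: two agents share one predictive model of consequences and
# differ only in the ordering used to pick an action. The outcome
# numbers (paperclips made, lives helped) are hypothetical toy values.
outcomes = {
    "run_clip_factory": (100, 0),   # (paperclips made, lives helped)
    "run_hospital": (0, 50),
}

def choose(ordering_key):
    """Pick the action ranked highest under the given ordering."""
    return max(outcomes, key=lambda action: ordering_key(outcomes[action]))

clip_agent_action = choose(lambda o: o[0])  # orders outcomes by paperclips
nice_agent_action = choose(lambda o: o[1])  # orders outcomes by lives helped
print(clip_agent_action, nice_agent_action)  # run_clip_factory run_hospital
```

The shared `outcomes` model plays the role of the is-sentences; only the key function passed to `choose` differs between the two agents.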

Arguments pro

Counterarguments and counter-counterarguments

Proving too much

A disbeliever in Orthogonality might ask, "Do these arguments Prove Too Much, as shown by applying a similar style of argument to 'There are minds that think 2 + 2 = 5'?"

Considering the arguments above in turn:

Size of mind design space.

From the perspective of somebody who currently regards "wants to make paperclips" as an exceptionally weird and strange property, "There are lots of possible minds, so some want to make paperclips" will seem to be on an equal footing with "There are lots of possible minds, so some believe 2 + 2 = 5."

Thinking about the enormous space of possible minds might lead us to give more credibility to some of those possible minds believing that 2 + 2 = 5, but we might still think that minds like that will be weak, or hampered by other defects, or limited in how intelligent they could really be, or more complicated to specify, or unlikely to occur in the actual real world.

So from the perspective of somebody who doesn't already believe in Orthogonality, the argument from the volume of mind design space is an argument at best for the Ultraweak version of Orthogonality.

Hume's is/ought distinction.

Depending on the exact variant of Hume-inspired argument that we deploy, the analogy to 2 + 2 = 5 might be weaker or stronger. For example, here's a Hume-inspired argument where the 2 + 2 = 5 analogy seems relatively strong:

"In every case of a mind judging that 'cure cancer' \(>_P\) 'make paperclips', this ordering judgment is produced by some particular comparison operation inside the mind. Nothing prohibits a different mind from producing a different comparison. Whatever you say is the cause of the ordering judgment, e.g., that it derives from a prior judgment 'happy sapient lives' \(>_P\) 'paperclips', we can imagine that part of the agent could also have been programmed differently. Different causes will yield different effects, and whatever the causality behind 'cure cancer' \(>_P\) 'make paperclips', we can imagine a differently causally constituted agent which arrives at a different judgment."

If we substitute "2 + 2 = 5" into the above argument, we get one in which all the constituent statements are equally true: this judgment is produced by a cause, the causes have causes, a different agent should produce a different output in that part of the computation, etcetera. So this version really has the same import as a general argument from the width of mind design space, and to a skeptic, would only imply the ultraweak form of Orthogonality.

However, if we're willing to consider some additional properties of is/ought, the analogy to "2 + 2 = 5" starts to become less tight. For instance, "Ought-comparators are not direct properties of the material world, there is no tiny \(>_P\) among the quarks, and that's why we can vary action-preference computations without affecting quark-predicting computations" does not have a clear analogous argument for why it should be just as easy to produce minds that judge 2 + 2 = 5.

Instrumental convergence.

There's no obvious analogue of "An agent that knows as well as we do which policies are likely to lead to lots of expected paperclips, and an agent that knows as well as we do which policies are likely to lead to lots of happy sapient beings, are on an equal footing when it comes to doing things like scientific research", for "agents that believe 2 + 2 = 5 are at no disadvantage compared to agents that believe 2 + 2 = 4".

Reflective stability.

Relatively weaker forms of the reflective-stability argument might allow analogies between "prefer paperclips" and "believe 2 + 2 = 5", but probing for more details makes the analogy break down. E.g., consider the following supposedly analogous argument:

"Suppose you think the sky is green. Then you won't want to self-modify to make a future version of yourself believe that the sky is blue, because you'll believe this future version of yourself would believe something false. Therefore, all beliefs are equally stable under reflection."

This does poke at an underlying point: By default, all Bayesian priors will be equally stable under reflection. However, minds that understand how different possible worlds will provide sensors with different evidence, will want to do Bayesian updates on the data from the sensors. (We don't even need to regard this as changing the prior; under Updateless Decision Theory, we can see it as the agent branching its successors to behave differently in different worlds.) There's a particular way that a consequentialist agent, contemplating its own operation, goes from "The sky is very probably green, but might be blue" to "check what this sensor says and update the belief", and indeed, an agent like this will not wantonly change its current belief without looking at a sensor, as the argument indicates.
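The sensor-updating behavior described here can be sketched as a two-hypothesis Bayesian update (the prior and likelihoods are hypothetical toy probabilities): even an agent whose prior strongly favors a green sky plans to update on sensor evidence, because different possible worlds make the sensor read differently.

```python
# Sketch: a two-hypothesis Bayesian update about the sky's color.
# The prior and likelihoods are hypothetical toy probabilities.
prior = {"green": 0.9, "blue": 0.1}   # agent's current belief
likelihood_blue_reading = {           # P(sensor reads "blue" | actual sky)
    "green": 0.05,
    "blue": 0.95,
}

def posterior(prior, likelihood):
    """Standard Bayes update: normalize prior * likelihood."""
    unnorm = {h: prior[h] * likelihood[h] for h in prior}
    z = sum(unnorm.values())
    return {h: p / z for h, p in unnorm.items()}

post = posterior(prior, likelihood_blue_reading)
# Despite the strong prior for green, a "blue" reading shifts belief a lot.
print(round(post["blue"], 3))
```

The agent doesn't wantonly change its prior; the change is conditional on the sensor reading, exactly as the reflective-stability argument allows.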

In contrast, the way in which "prefer more paperclips" propagates through an agent's beliefs about the effects of future designs and their interactions with the world does not suggest that future versions of the agent will prefer something other than paperclips, or that it would make the desire to produce paperclips conditional on a particular sensor value, since this would not be expected to lead to more total paperclips.
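As a toy contrast (all design names and numbers are hypothetical): a consequentialist agent scoring successor designs by its current criterion, expected paperclips, predicts that a goal-switched or goal-conditional successor yields fewer paperclips, and so keeps its goal.

```python
# Sketch: a paperclip-preferring agent evaluates successor designs by the
# expected paperclips each design leads to, judged under its *current*
# goal. All design names and expected-paperclip numbers are hypothetical.
successor_designs = {
    "keep_paperclip_goal": 1000,        # successor keeps making paperclips
    "switch_to_staples_goal": 3,        # successor would optimize staples
    "goal_conditional_on_sensor": 500,  # sometimes stops making paperclips
}

# The choice criterion is the agent's current preference: more paperclips.
chosen = max(successor_designs, key=successor_designs.get)
print(chosen)  # keep_paperclip_goal
```

Unlike the belief case, there is no sensor reading that would make switching goals score higher under the current criterion, which is why the preference is reflectively stable where the belief is update-prone.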

Orthogonal search tractability, constructive specifications of Orthogonal agent architectures.

These have no obvious analogue in "orthogonal tractability of optimization with different arithmetical answers" or "agent architectures that look very straightforward, are otherwise effective, and accept as input a free choice of what they think 2 + 2 equals".

Moral internalism

(Todo: Moral internalism says that truly normative content must be inherently compelling to all possible minds, but we can exhibit increasingly bounded agent designs that obviously wouldn't be compelled by it. We can reply to this by (a) believing there must be some hidden flaw in the reasoning about a paperclip maximizer, (b) saying "No True Scotsman" to the paperclip maximizer even though it's building Dyson Spheres and socially manipulating its programmers, (c) believing that a paperclip maximizer must fall short of being a true mind in some way that implies a big capabilities disadvantage, (d) accepting nihilism, or (e) not believing in moral internalism.)

Selection filters

(Todo: Arguments from evolvability or selection filters. Distinguish naive failures to understand efficient instrumental convergence, from more sophisticated concerns in multipolar scenarios. Pragmatic argument on the histories of inefficient agents.)

Pragmatic issues

(Todo: In practice, some utility functions / preference frameworks might be much harder to build and test than others. Eliezer Yudkowsky on realistic targets for the first AGI needing to be built out of elements that are simple enough to be learnable. Paul Christiano's concern about whether only sensory-based goals might be possible to build.)



  • The Orthogonality thesis is about mind design space in general. Particular agent architectures may not be Orthogonal.

  • Some agents may be constructed such that their apparent utility functions shift with increasing cognitive intelligence.

  • Some agent architectures may constrain what class of goals can be optimized.

  • 'Agent' is intended to be understood in a very general way, and not to imply, e.g., a small local robot body.

For pragmatic reasons, the phrase 'every agent of sufficient cognitive power' in the Inevitability Thesis is specified to include e.g. all cognitive entities that are able to invent new advanced technologies and build Dyson Spheres in pursuit of long-term strategies, regardless of whether a philosopher might claim that they lack some particular cognitive capacity in view of how they respond to attempted moral arguments, or whether they are e.g. conscious in the same sense as humans, etcetera.


Most pragmatic implications of Orthogonality or Inevitability revolve around the following refinements:

Implementation dependence: The humanly accessible space of AI development methodologies has enough variety to yield both AI designs that are value-aligned, and AI designs that are not value-aligned.

Value loadability possible: There is at least one humanly feasible development methodology for advanced agents that has Orthogonal freedom of what utility function or meta-utility framework is introduced into the advanced agent. (Thus, if we could describe a value-loadable design, and also describe a value-aligned meta-utility framework, we could combine them to create a value-aligned advanced agent.)

Pragmatic inevitability: There exists some goal G such that almost all humanly feasible development methods result in an agent that ends up behaving like it optimizes some particular goal G, perhaps among others. Most particular arguments about futurism will pick different goals G, but all such arguments are negated by anything that tends to contradict pragmatic inevitability in general.


Implementation dependence is the core of the policy argument that solving the value alignment problem is necessary and possible.

Futuristic scenarios in which AIs are said in passing to 'want' something-or-other usually rely on some form of pragmatic inevitability premise, and are negated by implementation dependence.

Orthogonality directly contradicts the metaethical position of moral internalism, which would be falsified by the observation of a paperclip maximizer. On the metaethical position that orthogonality and cognitivism are compatible, exhibiting a paperclip maximizer has few or no implications for object-level moral questions, and Orthogonality does not imply that our humane values or normative values are arbitrary, selfish, non-cosmopolitan, or that we have a myopic view of the universe or value, etc.




  • Paperclip maximizer

    This agent will not stop until the entire universe is filled with paperclips.

  • Mind design space is wide

    Imagine all human beings as one tiny dot inside a much vaster sphere of possibilities for "The space of minds in general." It is wiser to make claims about some minds than all minds.

  • Instrumental goals are almost-equally as tractable as terminal goals

    Getting the milk from the refrigerator because you want to drink it, is not vastly harder than getting the milk from the refrigerator because you inherently desire it.


  • Theory of (advanced) agents

    One of the research subproblems of building powerful nice AIs, is the theory of (sufficiently advanced) minds in general.