Coherent extrapolated volition (alignment target)

Introduction

“Coherent extrapolated volition” (CEV) is Eliezer Yudkowsky’s proposed thing-to-do with an extremely advanced AGI, if you’re extremely confident of your ability to align it on complicated targets.

Roughly, a CEV-based superintelligence would do what currently existing humans would want* the AI to do, if counterfactually:

  1. We knew everything the AI knew;

  2. We could think as fast as the AI and consider all the arguments;

  3. We knew ourselves perfectly and had better self-control or self-modification ability;

to whatever extent most existing humans, thus extrapolated, would predictably want* the same things. (For example, in the limit of extrapolation, nearly all humans might want* not to be turned into paperclips, but might not agree* on the best pizza toppings. See below.)
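
The “to whatever extent most existing humans would predictably want* the same things” clause can be pictured as a filter on what the AI ends up optimizing. The following toy sketch is purely illustrative and not a construal of actual CEV: it assumes hand-entered, hypothetical extrapolated stances per person per issue, and keeps only the issues on which the extrapolated population strongly agrees, leaving everything else (like pizza toppings) alone.

```python
# Toy illustration only: real CEV would extrapolate whole decision processes,
# not aggregate hand-entered numbers over a named list of "issues".

# Hypothetical extrapolated stances: +1 = wants*, -1 = opposes*, 0 = indifferent.
extrapolated_stances = {
    "not_being_turned_into_paperclips": [+1.0, +1.0, +0.9, +1.0],
    "pineapple_as_the_best_pizza_topping": [+0.7, -0.8, +0.1, -0.2],
}

AGREEMENT_THRESHOLD = 0.9  # assumed: how much predictable agreement* is demanded

def coherent_goals(stances, threshold):
    """Keep only the goals on which extrapolated volitions predictably agree."""
    goals = {}
    for issue, votes in stances.items():
        mean = sum(votes) / len(votes)
        if abs(mean) >= threshold:  # strong agreement in the same direction
            goals[issue] = "promote" if mean > 0 else "prevent"
        # otherwise: no coherent mandate, so the AI leaves this issue alone
    return goals

print(coherent_goals(extrapolated_stances, AGREEMENT_THRESHOLD))
# {'not_being_turned_into_paperclips': 'promote'}  (pizza toppings are left to individuals)
```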

CEV is meant to be the literally optimal or ideal or normative thing to do with an autonomous superintelligence, if you trust your ability to perfectly align a superintelligence on a very complicated target. (See below.)

CEV is rather complicated and meta and hence not intended as something you’d do with the first AI you ever tried to build. CEV might be something that everyone inside a project agreed was an acceptable mutual target for their second AI. (The first AI should probably be a Task AGI.)

For the corresponding metaethical theory see Extrapolated volition (normative moral theory).

Concept

Extrapolated volition is the metaethical theory that when we ask “What is right?”, then insofar as we’re asking something meaningful, we’re asking “What would a counterfactual idealized version of myself want* if it knew all the facts, had considered all the arguments, and had perfect self-knowledge and self-control?” (As a metaethical theory, this would make “What is right?” a mixed logical and empirical question, a function over possible states of the world.)

A very simple example of extrapolated volition might be to consider somebody who asks you to bring them orange juice from the refrigerator. You open the refrigerator and see no orange juice, but there’s lemonade. You imagine that your friend would want you to bring them lemonade if they knew everything you knew about the refrigerator, so you bring them lemonade instead. On an abstract level, we can say that you “extrapolated” your friend’s “volition”; in other words, you took your model of their mind and decision process, or your model of their “volition”, and you imagined a counterfactual version of their mind that had better information about the contents of your refrigerator, thereby “extrapolating” this volition.

Having better information isn’t the only way that a decision process can be extrapolated; we can also, for example, imagine that a mind has more time in which to consider moral arguments, or better knowledge of itself. Maybe you currently want revenge on the Capulet family, but if somebody had a chance to sit down with you and have a long talk about how revenge affects civilizations in the long run, you could be talked out of that. Maybe you’re currently convinced that you advocate for green shoes to be outlawed out of the goodness of your heart, but if you could actually see a printout of all of your own emotions at work, you’d see there was a lot of bitterness directed at people who wear green shoes, and this would change your mind about your decision.

In Yudkowsky’s version of extrapolated volition, considered on an individual level, the three core directions of extrapolation are:

  • Increased knowledge—having more veridical knowledge of declarative facts and expected outcomes.

  • Increased consideration of arguments—being able to consider more possible arguments and assess their validity.

  • Increased reflectivity—greater knowledge about the self, and to some degree, greater self-control (though this raises further questions about which parts of the self normatively get to control which other parts).
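
One way to hold these three directions together is as parameters of a single, entirely hypothetical extrapolation operator: a function from a model of a person, plus some degree of added knowledge, argument-consideration, and reflectivity, to what that person would then want*. The sketch below is only a type signature for the concept; none of the names correspond to anything anyone knows how to compute.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Any

# Schematic stand-ins; not a real construal of extrapolation.
PersonModel = Dict[str, Any]   # a model of someone's mind and decision process
Preferences = Dict[str, Any]   # what that (idealized) person would want*

@dataclass
class ExtrapolationKnobs:
    """The three core directions of extrapolation."""
    added_knowledge: float          # more veridical facts and expected outcomes
    argument_consideration: float   # more arguments considered and assessed
    reflectivity: float             # more self-knowledge and self-control

# Extrapolated volition, viewed as a (hypothetical) function:
ExtrapolateVolition = Callable[[PersonModel, ExtrapolationKnobs], Preferences]
```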

Motivation

Different people initially react differently to the question “Where should we point a superintelligence?” or “What should an aligned superintelligence do?”—not just different beliefs about what’s good, but different frames of mind about how to ask the question.

Some common reactions:

  1. “Different people want different things! There’s no way you can give everyone what they want. Even if you pick some way of combining things that people want, you’ll be the one saying how to combine it. Someone else might think they should just get the whole world for themselves. Therefore, in the end you’re deciding what the AI will do, and any claim to some sort of higher justice or normativity is nothing but sophistry.”

  2. “What we should do with an AI is obvious—it should optimize liberal democratic values. That already takes into account everyone’s interests in a fair way. The real threat is if bad people get their hands on an AGI and build an AGI that doesn’t optimize liberal democratic values.”

  3. “Imagine the ancient Greeks telling a superintelligence what to do. They’d have told it to optimize for glorious deaths in battle. Programming any other set of inflexible goals into a superintelligence seems equally stupid; it has to be able to change and grow.”

  4. “What if we tell the superintelligence what to do and it’s the wrong thing? What if we’re basically confused about what’s right? Shouldn’t we let the superintelligence figure that out on its own, with its assumed superior intelligence?”

An initial response to each of these frames might be:

  1. “Okay, but suppose you’re building a superintelligence and you’re trying not to be a jerk about it. If you say, ‘Whatever I do originates in myself, and therefore is equally selfish, so I might as well declare myself God-Emperor of the Universe’ then you’re being a jerk. Is there anything you could do instead which would be less like being a jerk? What’s the least jerky thing you could do?”

  2. “What if you would, after some further discussion, want to tweak your definition of ‘liberal democratic values’ just a little? What if it’s predictable that you would do that? Would you really want to be stuck with your off-the-cuff definition a million years later?”

  3. “Okay, so what should the ancient Greeks have done if they did have to program an AI? How could they not have doomed future generations? Suppose the ancient Greeks are clever enough to have noticed that sometimes people change their minds about things and to realize that they might not be right about everything. How can they use the cleverness of the AGI in a constructively specified, computable fashion that gets them out of this hole? You can’t just tell the AGI to compute what’s ‘right’; you need to put an actual computable question in there, not a word.”

  4. “You asked, what if we’re basically confused about what’s right—well, in that case, what does the word ‘right’ even mean? If you don’t know what’s right, and you don’t know how to compute what’s right, then what are we even talking about? Do you have any ground on which to say that an AGI which only asks ‘Which outcome leads to the greatest number of paperclips?’ isn’t computing rightness? If you don’t think a paperclip maximizer is computing rightness, then you must know something about the rightness-question which excludes that possibility—so let’s talk about how to program that rightness-question into an AGI.”

Arguendo by CEV’s advocates, all of these lines of discussion eventually end up converging on the idea of coherent extrapolated volition. For example:

  1. Asking what everyone would want* if they knew what the AI knew, and doing what they’d all predictably agree on, is just about the least jerky thing you can do. If you tell the AI to give everyone a volcano lair because you think volcano lairs are neat, you’re not being selfish, but you’re being a jerk to everyone who doesn’t want a volcano lair. If you have the AI just do what people actually say, they’ll end up hurting themselves with dumb wishes and you’d be a jerk. If you only extrapolate your friends and have the AI do what only you’d want, you’re being jerks to everyone else.

  2. Yes, liberal democratic values are good; so is apple pie. Apple pie is a good thing but it’s not the only good thing. William Frankena’s list of ends-in-themselves included “Life, consciousness, and activity; health and strength; pleasures and satisfactions of all or certain kinds; happiness, beatitude, contentment” and then 25 more items, and the list certainly isn’t complete. The only way you’re going to get a complete list is by analyzing human minds; and even then, if our descendants would predictably want something else a million years later, we ought to take that into account too.

  3. Every improvement is a change, but not every change is an improvement. Just letting a superintelligence change at random doesn’t encapsulate moral progress. Saying that change toward more liberal democratic values is progress presumes that we already know the destination or answer. We can’t even just ask the AGI to predict what civilizations would think a thousand years later, since (a) the AI itself impacts this and (b) if the AI did nothing, maybe in a thousand years everyone would have accidentally blissed themselves out while trying to modify their own brains. If we want to do better than the hypothetical ancient Greeks, we need to define a sufficiently abstract and meta criterion that describes valid directions of progress—such as changes in moral beliefs associated with learning new facts; or moral change that would predictably occur if we considered a larger set of arguments; or moral change that would predictably occur if we understood ourselves better.

  4. This one is a long story: Metaethics deals with the question of what sort of entity ‘rightness’ is exactly—it tries to reconcile this strange ineffable ‘rightness’ business with a universe made out of particle fields. Even though it seems like human beings wanting to murder people wouldn’t make murder right, there’s also nowhere in the stars or mountains where we can actually find it written that murder is wrong. At the end of a rather long discussion, we decide that for any given person speaking at a given point in time, ‘rightness’ is a logical constant which, although not counterfactually dependent on the state of the person’s brain, must be analytically identified with the extrapolated volition of that brain; and we show that (only) this stance gives consistent answers to all the standard questions in metaethics. (This discussion takes a while, on the order of explaining how deterministic laws of physics don’t show that you have unfree will.)

(To do: Write dialogues from each of these four entrance points.)

Situating CEV in contemporary metaethics

See the corresponding section in “Extrapolated volition (normative moral theory)”.

Scary design challenges

There are several reasons why CEV is way too challenging to be a good target for any project’s first try at building machine intelligence:

  1. A CEV agent would be intended to carry out an autonomous open-ended mission. This implies all the usual reasons we expect an autonomous AI to be harder to make safe than a Task AGI.

  2. CEV is a weird goal. It involves recursion.

  3. Even the terms in CEV, like “know more” or “extrapolate a human”, seem complicated and value-laden. You might have to build a high-level Do What I Know I Mean agent, and then tell it to do CEV. Do What I Know I Mean is complicated enough that you’d need to build an AI that can learn DWIKIM, so that DWIKIM can be taught rather than formally specified. So we’re looking at something like CEV, running on top of DWIKIM, running on top of a goal-learning system, at least until the first time the CEV agent rewrites itself.

Doing this correctly the very first time we build a smarter-than-human intelligence seems improbable. The only way this would make a good first target is if the CEV concept is formally simpler than it currently seems, and timelines to AGI are unusually long and permit a great deal of advance work on safety.

If AGI is 20 years out (or less), it seems wiser to think in terms of a Task AGI performing some relatively simple pivotal act. The role of CEV is then to answer the question, “What can you all agree in advance that you’ll try to do next, after you’ve executed your Task AGI and gotten out from under the shadow of immediate doom?”

What if CEV fails to cohere?

A frequently asked question is “What if extrapolating human volitions produces incoherent answers?”

Arguendo according to the original motivation for CEV, if this happens in some places, a Friendly AI ought to ignore those places. If it happens everywhere, you probably picked a silly way to construe an extrapolated volition and you ought to rethink it. (In practice, though, you would not want an AI project to take a dozen tries at defining CEV. This would indicate something extremely wrong about the method being used to generate suggested answers. Whatever final attempt passed would probably be the first answer all of whose remaining flaws were hidden, rather than an answer with all flaws eliminated.)

That is:

  • If your CEV algorithm finds that “People coherently want to not be eaten by paperclip maximizers, but end up with a broad spectrum of individual and collective possibilities for which pizza toppings they prefer”, we would normatively want a Friendly AI to prevent people from being eaten by paperclip maximizers but not mess around with which pizza toppings people end up eating in the Future.

  • If your CEV algorithm claims that there’s no coherent sense in which “A lot of people would want to not be eaten by Clippy and would still want* this even if they knew more stuff” then this is a suspicious and unexpected result. Perhaps you have picked a silly way to construe somebody’s volition.

The original motivation for CEV can also be viewed from the perspective of “What is it to help someone?” and “How can one help a large group of people?”, where the intent behind the question is to build an AI that renders ‘help’ as we really intend that. The elements of CEV can be seen as caveats to the naive notion of “Help is giving people whatever they ask you for!” in which somebody asks you to bring them orange juice but the orange juice in the refrigerator is poisonous (and they’re not trying to poison themselves).

What about helping a group of people? If two people ask for juice and you can only bring one kind of juice, you should bring a non-poisonous kind of juice they’d both like, to the extent any such juice exists. If no such juice exists, find a kind of juice that one of them is meh about and that the other one likes, and flip a coin or something to decide who wins. You are then being around as helpful as it is possible to be.

Can there be no way to help a large group of people? This seems implausible. You could at least give the starving ones pizza with a kind of pizza topping they currently like. To the extent your philosophy claims “Oh noes even that is not helping because it’s not perfectly coherent,” you have picked the wrong construal of ‘helping’.

It could be that, if we find that every reasonable-sounding construal of extrapolated volition fails to cohere, we must arrive at some entirely other notion of ‘helping’. But then this new form of helping also shouldn’t involve bringing people poisonous orange juice that they don’t know is poisoned, because that still intuitively seems unhelpful.

Helping people with incoherent preferences

What if somebody believes themselves to prefer onions to pineapple on their pizza, prefer pineapple to mushrooms, and prefer mushrooms to onions? In the sense that, offered any two slices from this set, they would pick according to the given ordering?

(This isn’t an unrealistic example. Numerous experiments in behavioral economics demonstrate exactly this sort of circular preference. For instance, you can arrange 3 items such that each pair of them brings a different salient quality into focus for comparison.)
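
To make the incoherence concrete, here is a minimal sketch (with the hypothetical toppings from the example) that treats each “would pick X over Y” judgment as a directed edge and checks whether the pairwise choices contain a cycle, in which case no utility function can rationalize them:

```python
# Hypothetical pairwise choices: the chooser picks the first item of each pair.
pairwise_picks = [("onions", "pineapple"),
                  ("pineapple", "mushrooms"),
                  ("mushrooms", "onions")]

def has_preference_cycle(picks):
    """True if the 'picked X over Y' relation contains a cycle, i.e. no
    utility function can score every picked item above the item it beat."""
    preferred_over = {}
    for winner, loser in picks:
        preferred_over.setdefault(winner, set()).add(loser)

    def beats_transitively(start, target, seen=()):
        if start == target:
            return True
        return any(beats_transitively(nxt, target, seen + (start,))
                   for nxt in preferred_over.get(start, ()) if nxt not in seen)

    # A cycle exists if some loser also beats its winner through a chain of picks.
    return any(beats_transitively(loser, winner) for winner, loser in picks)

print(has_preference_cycle(pairwise_picks))  # True: onions > pineapple > mushrooms > onions
```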

One may worry that we couldn’t ‘coherently extrapolate the volition’ of somebody with these pizza preferences, since these local choices obviously aren’t consistent with any coherent utility function. But how could we help somebody with a pizza preference like this?

Well, appealing to the intuitive notion of helping:

  • We could give them whatever kind of pizza they’d pick if they had to pick among all three simultaneously.

  • We could figure out how happy they’d be eating each type of pizza, in terms of emotional intensity as measured in neurotransmitters; and offer them the slice of pizza that they’ll most enjoy.

  • We could let them pick their own damn pizza toppings and concern ourselves mainly with making sure the pizza isn’t poisonous, since the person definitely prefers non-poisoned pizza.

  • We could, given sufficient brainpower on our end, figure out what this person would ask us to do for them in this case after that person had learned about the concept of a preference reversal and been told about their own circular preferences. If this varies wildly depending on exactly how we explain the concept of a preference reversal, we could refer back to one of the previous three answers instead.

Conversely, these alternatives seem less helpful:

  • Refuse to have anything to do with that person since their current preferences don’t form a coherent utility function.

  • Emit “ERROR ERROR” sounds like a Hollywood AI that’s just found out about the Epimenides Paradox.

  • Give them pizza with your own favorite topping, green peppers, even though they’d prefer any of the 3 other toppings to those.

  • Give them pizza with the topping that would taste best to them, pepperoni, despite their being vegetarians.

Arguendo by advocates of CEV: If you blank the complexities of ‘extrapolated volition’ out of your mind; and ask how you could reasonably help people as best as possible if you were trying not to be a jerk; and then try to figure out how to semiformalize whatever mental procedure you just followed to arrive at your answer for how to help people; then you will eventually end up at CEV again.

Role of meta-ideals in promoting early agreement

A primary purpose of CEV is to represent a relatively simple meta-level ideal that people can agree upon, even where they might disagree on the object level. By a hopefully analogous example, two honest scientists might disagree on the correct mass of an electron, but agree that the experimental method is a good way to resolve the answer.

Imagine Millikan believes an electron’s mass is 9.1e-28 grams, and Nannikan believes the correct electron mass is 9.1e-34 grams. Millikan might be very worried about Nannikan’s proposal to program an AI to believe the electron mass is 9.1e-34 grams; Nannikan doesn’t like Millikan’s proposal to program in 9.1e-28; and both of them would be unhappy with a compromise mass of 9.1e-31 grams. They might still agree on programming an AI with some analogue of probability theory and a simplicity prior, and letting a superintelligence come to the conclusions implied by Bayes and Occam, because the two can agree on an effectively computable question even though they think the question has different answers. Of course, this is easier to agree on when the AI hasn’t yet produced an answer, or if the AI doesn’t tell you the answer.

It’s not guaranteed that every human embodies the same implicit moral questions (indeed this seems unlikely), which means that Alice and Bob might still expect their extrapolated volitions to disagree about things. Even so, while the outputs are still abstract and not-yet-computed, Alice doesn’t have much of a place to stand on which to appeal to Carol, Dennis, and Evelyn by saying, “But as a matter of morality and justice, you should have the AI implement my extrapolated volition, not Bob’s!” To appeal to Carol, Dennis, and Evelyn about this, you’d need them to believe that Alice’s EV was more likely to agree with their EVs than Bob’s was—and at that point, why not come together on the obvious Schelling point of extrapolating everyone’s EVs?

Thus, one of the primary purposes of CEV (selling points, design goals) is that it’s something that Alice, Bob, and Carol can agree now that Dennis and Evelyn should do with an AI that will be developed later; we can try to set up commitment mechanisms now, or check-and-balance mechanisms now, to ensure that Dennis and Evelyn are still working on CEV later.

Role of ‘coherence’ in reducing expected unresolvable disagreements

A CEV is not necessarily a majority vote. A lot of people with an extrapolated weak preference* might be counterbalanced by a few people with a strong extrapolated preference* in the opposite direction. Nick Bostrom’s “parliamentary model” for resolving uncertainty between incommensurable ethical theories permits a subtheory very concerned about a decision to spend a large amount of its limited influence on influencing that particular decision.

This means that, e.g., a vegan or animal-rights activist should not need to expect that they must seize control of a CEV algorithm in order for the result of CEV to protect animals. It doesn’t seem like most of humanity would be deriving huge amounts of utility from hurting animals in a post-superintelligence scenario, so even a small part of the population that strongly opposes* this scenario should be decisive in preventing it.
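
As a toy illustration of the difference between a head-count vote and intensity-weighted influence, loosely in the spirit of the parliamentary model (the numbers, and the reduction of negotiation between subtheories to a single weighted sum, are invented for illustration):

```python
# 100 hypothetical extrapolated voters deciding whether hurting animals stays permitted:
# 90 mildly in favor of permitting it, 10 intensely opposed.
voters = ([{"stance": +1, "care": 0.02} for _ in range(90)] +
          [{"stance": -1, "care": 1.00} for _ in range(10)])

def majority_vote(voters):
    return sum(v["stance"] for v in voters)            # head-count only

def influence_weighted(voters):
    # Each voter spends limited influence in proportion to how much they care
    # about this particular decision.
    return sum(v["stance"] * v["care"] for v in voters)

print(majority_vote(voters))       # +80: a bare majority would permit it
print(influence_weighted(voters))  # about -8.2: the strongly-opposed minority is decisive
```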

Moral hazard vs. debugging

One of the points of the CEV proposal is to have minimal moral hazard (aka, not tempting the programmers to take over the world or the future); but this may be compromised if CEV’s results don’t go literally unchecked.

Part of the purpose of CEV is to stand as an answer to the question, “If the ancient Greeks had been the ones to invent superintelligence, what could they have done that would not, from our later perspective, irretrievably warp the future? If the ancient Greeks had programmed in their own values directly, they would have programmed in a glorious death in combat. Now let us consider that perhaps we too are not so wise.” We can imagine the ancient Greeks writing a CEV mechanism, peeking at the result of this CEV mechanism before implementing it, and being horrified by the lack of glorious-deaths-in-combat in the future and value system thus revealed.

We can also imagine that the Greeks, trying to cut down on moral hazard, virtuously refuse to peek at the output; but it turns out that their version of CEV has some unforeseen behavior when actually run by a superintelligence, and so their world is turned into paperclips.

This is a safety-vs.-moral-hazard tradeoff between (a) the benefit of being able to look at CEV outputs in order to better train the system or just verify that nothing went horribly wrong; and (b) the moral hazard that comes from the temptation to override the output, thus defeating the point of having a CEV mechanism in the first place.

There’s also a potential safety hazard just with looking at the internals of a CEV algorithm; the simulated future could contain all sorts of directly mind-hacking cognitive hazards.

Rather than giving up entirely and embracing maximum moral hazard, one possible approach to this issue might be to have some single human who is supposed to peek at the output and provide a 1 or 0 (proceed or stop) judgment to the mechanism, without any other information flow being allowed to the programmers if the human outputs 0. (For example, the volunteer might be in a room with explosives that go off if 0 is output.)

“Selfish bastards” problem

Suppose that Fred is funding Grace to work on a CEV-based superintelligence; and Evelyn has decided not to oppose this project. The resulting CEV is meant to extrapolate the volitions of Alice, Bob, Carol, Dennis, Evelyn, Fred, and Grace with equal weight. (If you’re reading this, you’re more than usually likely to be one of Evelyn, Fred, or Grace.)

Evelyn and Fred and Grace might worry: “What if a supermajority of humanity consists of ‘selfish* bastards’, such that their extrapolated volitions would cheerfully vote* for a world in which it was legal to own artificial sapient beings as slaves so long as they personally happened to be in the slaveowning class; and we, Evelyn and Fred and Grace, just happen to be in the minority that emphatically doesn’t want, nor want*, the future to be like that?”

That is: What if humanity’s extrapolated volitions diverge in such a way that, from the standpoint of our volitions (since, if you’re reading this, you’re unusually likely to be one of Evelyn or Fred or Grace), 90% of extrapolated humanity would choose* something such that we would not approve of it, and our volitions would not approve* of it, even after taking into account that we don’t want to be jerks about it and that we don’t think we were born with any unusual or exceptional right to determine the fate of humanity?

That is, let the scenario be as follows:

90% of the people (but not we who are collectively sponsoring the AI) are selfish bastards at the core, such that any reasonable extrapolation process (it’s not just that we picked a broken one) would lead to them endorsing a world in which they themselves had rights, but it was okay to create artificial people and hurt them. Furthermore, they would derive enough utility from being personal God-Emperors that this would override our minority objection even in a parliamentary model.

We can see this hypothetical outcome as potentially undermining every sort of reason that we, who happen to be in a position of control to prevent that outcome, should voluntarily relinquish that control to the remaining 90% of humanity:

  • We can’t be prioritizing being fair to everyone including the other 90% of humanity, because what about being fair to the artificial people who are being hurt?

  • We can’t be worrying that the other 90% of humanity would withdraw their support from the project, or worrying about betraying the project’s supporters, because by hypothesis they weren’t supporting it or even permitting it.

  • We can’t be agreeing to defer to a righter and more intelligent process to resolve our dispute, because by hypothesis the CEV made up of 90% selfish* bastards is not, from our own perspective, ideally righter.

  • We can’t rely on a parliamentary model of coherence to prevent what a minority sees as disaster, because by hypothesis the other 90% is deriving enough utility from collectively declaring themselves God-Emperors to trump even a strong minority countervote.

Rather than giving up entirely and taking over the world, or exposing ourselves to moral hazard by peeking at the results, one possible approach to this issue might be to run a three-stage process.

This process involves some internal references, so the detailed explanation needs to follow a shorter summary explanation.

In summary:

  • Extrapolate everyone’s CEV.

  • Extrapolate the CEV of the contributors only, and let it give (only) an up-down vote on Everyone’s CEV.

  • If the result is thumbs-up, run Everyone’s CEV.

  • Otherwise, extrapolate everyone’s CEV again, but kick out all the parts that would act unilaterally and without any concern for others if they were in positions of unchecked power (the Fallback CEV).

  • Have the Contributor CEV give an up/down answer on the Fallback CEV.

  • If the result is thumbs-up, run the Fallback CEV.

  • Otherwise fail.

In detail:

  • First, extrapolate the everyone-on-Earth CEV as though it were not being checked.

  • If any hypothetical extrapolated person worries about being checked, delete that concern and extrapolate them as though they didn’t have it. This is necessary to prevent the check itself from having a UDT influence on the extrapolation and the actual future.

  • Next, extrapolate the CEV of everyone who contributed to the project, weighted by their contribution (possibly based on some mix of “how much was actually done” versus “how much was rationally expected to be accomplished” versus “the fraction of what could’ve been done versus what was actually done”). Allow this other extrapolation an up-or-down vote—not any kind of detailed correction—on whether to let the everyone-on-Earth CEV go through unmodified.

  • Remove from the extrapolation of the Contributor-CEV any strategic considerations having to do with the Fallback-CEV or post-Fail redevelopment being a better alternative; we want to extract a judgment about “satisficing” in some sense, whether the Everyone-CEV is in some non-relative sense too horrible to be allowed.

  • If the Everyone-CEV passes the Contributor-CEV check, run it.

  • Otherwise, re-extrapolate a Fallback-CEV that starts with all existing humans as a base, but discards from the extrapolation all extrapolated decision processes that, if they were in a superior strategic position or a position of unilateral power, would not bother to extrapolate others’ volitions or care about their welfare.

  • Again, remove all extrapolated strategic considerations about passing the coming check.

  • Check the Fallback-CEV against the Contributor-CEV for an up-down vote. If it passes, run it.

  • Otherwise Fail (the AI shuts down safely, and we rethink what to do next or implement an agreed-on fallback course past this point).
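
A minimal control-flow sketch of the checked process above. Every callable named here (extrapolate_cev, approves, and so on) is a placeholder for something nobody currently knows how to compute; the sketch only pins down the order of checks and fallbacks:

```python
def run_checked_cev(everyone, contributors,
                    extrapolate_cev, would_act_unilaterally, approves,
                    execute, fail):
    """Order-of-operations sketch of the summary above; all callables are hypothetical."""
    # 1. Everyone-on-Earth CEV, extrapolated as though it were not being checked
    #    (strategic awareness of the check is assumed to be removed inside extrapolate_cev).
    everyone_cev = extrapolate_cev(everyone)

    # 2. The Contributor CEV gets a bare up/down vote on Everyone's CEV.
    contributor_cev = extrapolate_cev(contributors)
    if approves(contributor_cev, everyone_cev):
        return execute(everyone_cev)

    # 3. Fallback CEV: everyone, minus extrapolated decision processes that would
    #    act unilaterally and without concern for others given unchecked power.
    fallback_base = [person for person in everyone
                     if not would_act_unilaterally(person)]
    fallback_cev = extrapolate_cev(fallback_base)
    if approves(contributor_cev, fallback_cev):
        return execute(fallback_cev)

    # 4. Otherwise Fail: shut down safely and rethink.
    return fail()
```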

The particular fallback of “kick out from the extrapolation any weighted portions of extrapolated decision processes that would act unilaterally and without caring for others, given unchecked power” is meant to have a property of poetic justice, or of rendering objections to it self-defeating: If it’s okay to act unilaterally, then why can’t we unilaterally kick out the unilateral parts? This is meant to be the ‘simplest’ or most ‘elegant’ way of kicking out a part of the CEV whose internal reasoning directly opposes the whole reason we ran CEV in the first place, while imposing the minimum possible filter beyond that.

Thus if Alice (who by hypothesis is not in any way a contributor) says, “But I demand you altruistically include the extrapolation of me that would unilaterally act against you if it had power!” then we reply, “We’ll try that, but if it turns out to be a sufficiently bad idea, there’s no coherent interpersonal grounds on which you can rebuke us for taking the fallback option instead.”

Similarly in regards to the Fail option at the end, to anyone who says, “Fairness demands that you run the Fallback CEV even if you wouldn’t like* it!” we can reply, “Our own power may not be used against us; if we’d regret ever having built the thing, fairness doesn’t oblige us to run it.”

Why base CEV on “existing humans” and not some other class of extrapolees?

One frequently asked question about the implementation details of CEV is either:

  • Why formulate CEV such that it is run on “all existing humans” and not “all existing and past humans” or “all mammals” or “all sapient life as it probably exists everywhere in the measure-weighted infinite multiverse”?

  • Why not restrict the extrapolation base to “only people who contributed to the AI project”?

In particular, it’s been asked why restrictive answers to the first question don’t also imply the more restrictive answer to the second.

Why not include mammals?

We’ll start by considering some replies to the question, “Why not include all mammals in CEV’s extrapolation base?”

  • Because you could be wrong about mammals being objects of significant ethical value, such that we should on an object level respect their welfare. The extrapolation process will catch the error if you’d predictably change your mind about that. Including mammals in the extrapolation base for CEV potentially sets in stone what could well be an error, the sort of thing we’d predictably change our minds about later. If you’re normatively right that we should all care about mammals and even try to extrapolate their volitions into a judgment of Earth’s destiny, if that’s what almost all of us would predictably decide after thinking about it for a while, then that’s what our EVs will decide* to do on our behalf; and if they don’t decide* to do that, it wasn’t right, which undermines your argument for doing it unconditionally.

  • Because even if we ought to care about mammals’ welfare qua welfare, extrapolated animals might have really damn weird preferences that you’d regret including in the CEV. (E.g., after human volitions are outvoted by the volitions of other animals, the current base of existing animals’ extrapolated volitions choose* a world in which they are uplifted to God-Emperors and rule over suffering other animals.)

  • Because maybe not everyone on Earth cares* about animals even if your EV would in fact care* about them, and to avoid a slap-fight over who gets to rule the world, we’re going to settle this by e.g. a parliamentary-style model in which you get to expend your share of Earth’s destiny-determination on protecting animals.

To expand on this last consideration, we can reply: “Even if you would regard it as more just to have the right animal-protecting outcome baked into the future immediately, so that your EV didn’t need to expend some of its voting strength on assuring it, not everyone else might regard that as just. From our perspective as programmers we have no particular reason to listen to you rather than Alice. We’re not arguing about whether animals will be protected if a minority vegan-type subpopulation strongly want* that and the rest of humanity doesn’t care*. We’re arguing about whether, if you want* that but a majority doesn’t, your EV should justly need to expend some negotiating strength in order to make sure animals are protected. This seems pretty reasonable to us as programmers from our standpoint of wanting to be fair, not be jerks, and not start any slap-fights over world domination.”

This third reply is particularly important because, taken in isolation, the first two replies of “You could be wrong about that being a good idea” and “Even if you care about their welfare, maybe you wouldn’t like their EVs” could equally apply to argue that contributors to the CEV project ought to extrapolate only their own volitions and not the rest of humanity:

  • We could be wrong about it being a good idea, by our own lights, to extrapolate the volitions of everyone else; including this in the CEV project bakes this consideration into stone; if we were right about running an Everyone CEV, if we would predictably arrive at that conclusion after thinking about it for a while, our EVs could do that for us.

  • Not extrapolating other people’s volitions isn’t the same as saying we shouldn’t care. We could be right to care about the welfare of others, but there could be some spectacular horror built into their EVs.

The proposed way of addressing this was to run a composite CEV with a Contributor-CEV check and a Fallback-CEV fallback. But then why not run an Animal-CEV with a Contributor-CEV check before trying the Everyone-CEV?

One answer would go back to the third reply above: Nonhuman mammals aren’t sponsoring the CEV project, allowing it to pass, or potentially getting angry at people who want to take over the world with no seeming concern for fairness. So they aren’t part of the Schelling point for “everyone gets an extrapolated vote”.

Why not extrapolate all sapients?

Similarly if we ask: “Why not include all sapient beings that the SI suspects to exist everywhere in the measure-weighted multiverse?”

  • Because large numbers of them might have EVs as alien as the EV of an Ichneumonidae wasp.

  • Because our EVs can always do that if it’s actually a good idea.

  • Because they aren’t here to protest and withdraw political support if we don’t bake them into the extrapolation base immediately.

Why not extrapolate deceased humans?

“Why not include all deceased human beings as well as all currently living humans?”

In this case, we can’t then reply that they didn’t contribute to the human project (e.g. I. J. Good). Their EVs are also less likely to be alien than in any other case considered above.

But again, we fall back on the third reply: “The people who are still alive” is a simple Schelling circle to draw that includes everyone in the current political process. To the extent it would be nice or fair to extrapolate Leo Szilard and include him, we can do that if a supermajority of EVs decide* that this would be nice or just. To the extent we don’t bake this decision into the model, Leo Szilard won’t rise from the grave and rebuke us. This seems like reason enough to regard “The people who are still alive” as a simple and obvious extrapolation base.

Why include people who are powerless?

“Why include very young children, uncontacted tribes who’ve never heard about AI, and retrievable cryonics patients (if any)? They can’t, in their current state, vote for or against anything.”

  • A lot of the intuitive motivation for CEV is to not be a jerk, and ignoring the wishes of powerless living people seems intuitively a lot more jerkish than ignoring the wishes of powerless dead people.

  • They’ll actually be present in the future, so it seems like less of a jerk thing to do to extrapolate them and take their wishes into account in shaping that future, than to not extrapolate them.

  • Their relatives might take offense otherwise.

  • It keeps the Schelling boundary simple.
