Aligning an AGI adds significant development time


The votable proposition is true if, comparing reasonably attainable development paths for…

  • Project Path 1: An aligned advanced AI created by a responsible project that is hurrying where it can, but still being careful enough to maintain a success probability greater than 25%

  • Project Path 2: An unaligned unlimited superintelligence produced by a project cutting all possible corners

…where otherwise both projects have access to the same ideas or discoveries in the field of AGI capabilities and similar computational resources; then, as the default / ordinary / modal case after conditioning on all of the said assumptions:

Project Path 1 will require at least 50% longer serial time to complete than Project Path 2, or two years longer, whichever is less.
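The proposition's threshold is simple arithmetic; as a purely illustrative sketch (the function name and the sample duration are mine, the 50% and two-year figures are the proposition's stated thresholds):

```python
def alignment_time_threshold(corner_cutting_years: float) -> float:
    """Lower bound the proposition predicts for the careful project's
    serial time: 50% longer than the corner-cutting project, or two
    years longer, whichever is less."""
    return min(1.5 * corner_cutting_years, corner_cutting_years + 2.0)

# If the corner-cutting project would take 6 years, the proposition
# predicts the careful project needs at least min(9, 8) = 8 years.
print(alignment_time_threshold(6.0))  # → 8.0
```

Note that the two-year cap only binds once the corner-cutting timeline exceeds four years; below that, the 50% figure is the smaller bound.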


This page was written to address multiple questioners who seem to have accepted the Orthogonality thesis, but still mostly disbelieve it would take significantly longer to develop aligned AGI than unaligned AGI, if I've understood correctly.

At present this page is an overview of possible places of disagreement, and may later be selectively rather than fully expanded.


Propositions feeding into this one include:

If the questioner believes the negation of either of these, it would imply easy specifiability of a decision function suitable for an unlimited superintelligence. That could greatly reduce the need for, e.g.:

It's worth checking whether any of these time-costly development principles seem to the questioner not to follow as important from the basic idea of value alignment being necessary and not trivially solvable.

Outside view

To the best of my knowledge, it is normal / usual / unsurprising for at least 50% increased development time to be required by strong versus minimal demands on any one of:

  • (3a) safety of any kind

  • (3b) robust behavior in new one-shot contexts that can't be tested in advance

  • (3c) robust behavior when experiencing strong forces

  • (3d) reliable avoidance of a single catastrophic failure

  • (3e) resilience in the face of strong optimization pressures that can potentially lead the system to traverse unusual execution paths

  • (3f) conformance to complicated details of a user's desired system behavior

comment: It would indeed be unusual—some project managers might call it extraordinary good fortune—if a system demanding two or more of these properties did not require at least 50% more development time compared to a system that didn't.

Obvious-seeming-to-me analogies include:

  • Launching a space probe that cannot be corrected once launched, a deed which usually calls for extraordinary additional advance checking and testing

  • Launching the simplest working rocket that will experience uncommonly great accelerations and forces, compared to building the simplest working airplane

  • It would be far less expensive to design rockets if “the rocket explodes” were not a problem; most of the cost of a rocket is having the rocket not explode

  • NASA managing to write almost entirely bug-free code for some projects at 100x the cost per line of code, using means that involved multiple reviews and careful lines of organizational approval for every aspect and element of the system

  • The OpenBSD project to produce a secure operating system, which needed to constrain its code to be more minimal than larger Linux projects, and probably added a lot more than 50% time per function point to approve each element of the code

  • The difference in effort put forth by an amateur writing an encryption system they think is secure, versus the cryptographic ecosystem trying to ensure a channel is secure

  • The real premium on safety for hospital equipment, as opposed to the bureaucratic premium on it, is probably still over 50%, because it does involve legitimate additional testing to try not to kill the patient

  • Surgeons probably legitimately require at least 50% longer to operate on humans than they would require to perform operations of analogous complexity on large plants it was okay to kill 10% of the time

  • Even in the total absence of regulatory overhead, it seems legitimately harder to build a nuclear power plant that usually does not melt down, compared to a coal power plant (confirmable by the Soviet experience?)

Some of the standard ways in which systems with strong versus minimal demands on (3*)-properties *usually* require additional development time:

  • (4a) Additional work for:

    • Whole extra modules

    • Universally enforced properties

    • Lots of little local function points

  • (4b) Needing a more extended process of interactive shaping in order to conform to a complicated target

  • (4c) Legitimately requiring longer organizational paths to approve ideas, changes, and commits

  • (4d) Longer and deeper test phases: on whole systems, on local components, and on function points

  • (4e) Not being able to deploy a fast or easy solution (that you could use at some particular choice point if you didn't need to worry about the rocket exploding)

Outside view on AI problems

Another reference class that feels relevant to me is that things having to do with AI are often more difficult than expected. E.g., the story of computer vision being assigned to 2 undergrads over the summer. This seems like a relevant case in point of “uncorrected intuition has a directional bias in underestimating the amount of work required to implement things having to do with AI, and you should correct that directional bias by revising your estimate upward”.

Given a sufficiently advanced Artificial General Intelligence, we might perhaps get narrow problems on the order of computer vision for free. But the whole point of Orthogonality is that you do not get AI alignment for free with general intelligence. Likewise, identifying value-laden concepts or executing value-laden behaviors doesn't come free with identifying natural empirical concepts. We have separate basic AI work to do for alignment. So the analogy to underestimating a narrow AI problem, in the early days before anyone had confronted that problem, still seems relevant.

comment: I can't see how, after imagining oneself in the shoes of the early researchers tackling computer vision and ‘commonsense reasoning’ and ‘natural-language processing’, after the entirety of the history of AI, anyone could reasonably stagger back in shocked and horrified surprise upon encountering the completely unexpected fact of a weird new AI problem being… kinda hard.

Inside view

While it is possible to build new systems that aren't 100% understood, and have them work, the successful designs were usually greatly overengineered. Some Roman bridges are still standing two millennia later, which probably wasn't in the design requirements, so in that sense they turned out to be hugely overengineered, but we can't blame them. “What takes good engineering is building bridges that just barely stay up.”

If we’re try­ing for an al­igned Task AGI with­out a re­ally deep un­der­stand­ing of how to build ex­actly the right AGI with no ex­tra parts or ex­tra prob­lems—which must cer­tainly be lack­ing on any sce­nario in­volv­ing rel­a­tively short timescales—then we have to do lots of safety things in or­der to have any chance of sur­viv­ing, be­cause we don’t know in ad­vance which part of the sys­tem will nearly fail. We don’t know in ad­vance that the O-Rings are the part of the Space Shut­tle that’s go­ing to sud­denly be­have un­ex­pect­edly, and we can’t put in ex­tra effort to ar­mor only that part of the pro­cess. We have to ov­ereng­ineer ev­ery­thing to catch the small num­ber of as­pects that turn out not to be so “ov­ereng­ineered” af­ter all.

This suggests that even if one doesn't believe my particular laundry list below, whoever walks through this problem, conditional on their eventual survival, will have shown up with some laundry list of precautions, including costly precautions; and they will (correctly) not imagine themselves able to survive based on “minimum necessary” precautions.

Some specific extra time costs that I imagine might be required:

  • The AGI can only deploy internal optimization on pieces of itself that are small enough to be relatively safe and not vital to fully understand

    • In other words, the cautious programmers must in general do extra work to obtain functionality that a corner-cutting project could get in virtue of the AGI having further self-improved

  • Everything to do with real value alignment (as opposed to the AI having a reward button or being reinforcement-trained to ‘obey orders’ on some channel) is an additional set of function points

  • You have to build new pieces of the system for transparency and monitoring.

    • Including e.g. costly but important notions like “There's actually a separatish AI over here that we built to inspect the first AI and check for problems, including having this separate AI trained on different data for safety-related concepts”

  • There's a lot of trusted function points where you can't just toss in an enormous deepnet, because that wouldn't meet the transparency or effability requirements at that function point

  • When somebody proposes a new optimization thingy, it has to be rejiggered to ensure e.g. that it meets the top-to-bottom taskishness requirement, and everyone has to stare at it to make sure it doesn't blow up the world somehow

  • You can't run jobs on AWS, because you don't trust Amazon with the code and you don't want to put your AI in close causal contact with the Internet

  • Some of your system designs rely on all ‘major’ events being monitored and all unseen events being ‘minor’, and the major monitored events go through a human in the loop. The humans in the loop are then a rate-limiting factor, and you can't just ‘push the lever all the way up’ on that process.

    • E.g., maybe only ‘major’ goals can recruit subgoals across all known domains, and ‘minor’ goals always operate within a single domain using limited cognitive resources.

  • Deployment involves a long conversation with the AI about ‘what do you expect to happen after you do X?’, and during that conversation other programmers are slowing down the AI to look at passively transparent interpretations of the AI's internal thoughts

  • The project has a much lower threshold for saying “wait, what the hell just happened, we need to stop, melt, and catch fire, not just try different patches until it seems to run again”

  • The good project perhaps does a tad more testing
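The ‘major’/‘minor’ monitoring split in the list above can be caricatured in a few lines of code (every name here is my hypothetical illustration, not a proposed design); the point is only that each major event blocks on a slow human review step, which the project cannot parallelize away:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Goal:
    description: str
    major: bool  # 'major' goals may recruit subgoals across all known domains

def dispatch(goal: Goal, human_approves: Callable[[Goal], bool]) -> str:
    """Route a goal: major goals block on human approval (the rate-limiting
    step); minor goals run unseen, within a single domain."""
    if goal.major:
        if not human_approves(goal):
            return "rejected"
        return "run with cross-domain subgoals"
    return "run within a single domain, limited resources"
```

However fast the rest of the system runs, the throughput of `dispatch` on major goals is bounded by the human reviewer, which is exactly why this design costs serial development and operation time.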

Independently of the particular list above, this doesn't feel to me like a case where the conclusion is highly dependent on Eliezer-details. Anyone with a concrete plan for aligning an AI will walk in with a list of plans and methods for safety, some of which require close inspection of parts, and constrain allowable designs, and just plain take more work. One of the important ideas is going to turn out to take 500% more work than expected, or to require solving a deep AI problem, and this isn't going to shock them either.

Meta view

I genuinely have some trouble imagining what objection is standing in the way of accepting “ceteris paribus, alignment takes at least 50% more time”, having granted Orthogonality and alignment not being completely trivial. I did not expect the argument to bog down at this particular step. I wonder if I'm missing some basic premise or misunderstanding the questioner's entire thesis.

If I’m not mi­s­un­der­stand­ing, or if I con­sider the the­sis as-my-ears-heard-it at face value, then I can only imag­ine the judg­ment “al­ign­ment prob­a­bly doesn’t take that much longer” be­ing pro­duced by ig­nor­ing what I con­sider to be ba­sic prin­ci­ples of cog­ni­tive re­al­ism. De­spite the dan­gers of psy­chol­o­giz­ing, for pur­poses of over­shar­ing, I’m go­ing to say what feels to me like it would need to be miss­ing:

  • (5a) Even if one feels intuitively optimistic about a project, one ought to expect in advance to run into difficulties not immediately obvious. You should not be in a state of mind where tomorrow's surprises are a lot more likely to be unpleasant than pleasant; this is predictable updating. The person telling you your hopeful software project is going to take longer than 2 weeks should not need to argue you into acknowledging in advance that some particular delay will occur. It feels like the ordinary skill of “standard correction for optimistic bias” is not being applied.

  • (5b) It feels like this is maybe being put into a mental bucket of “futuristic scenarios” rather than “software development projects”, and is being processed as pessimistic future versus normal future, or something. Instead of: “If I ask a project manager for a mission-critical deep feature that impacts every aspect of the software project and needs to be implemented to a high standard of reliability, can that get done in just 10% more time than a project that's eliminating that feature and cutting all the corners?”

  • (5c) I similarly recall the old experiment in which students named their “best case” scenarios where “everything goes as well as it reasonably could”, or named their “average case” scenarios; and the two elicitations produced indistinguishable results; and reality was usually slightly worse than the “worst case” scenario. I wonder if the “normal case” for AI alignment work required is being evaluated along much the same lines as “the best case / the case if every individual event goes as well as I imagine by default”.

AI alignment could be easy in theory and still take 50% more development time in practice. That is a very ordinary thing to have happen when somebody asks the project manager to make sure a piece of highly novel software actually implements an “easy” property the first time the software is run under new conditions that can't be fully tested in advance.

“At least 50% more development time for the aligned AI project, versus the corner-cutting project, assuming both projects otherwise have access to the same stock of ideas and methods and computational resources” seems to me like an extremely probable and normal working premise to adopt. What am I missing?

comment: I have a sense of “Why am I not up fifty points in the polls?” and “What experienced software manager on the face of the Earth (assuming they didn't go mentally haywire on hearing the words ‘Artificial Intelligence’, and considered this question as if it were engineering), even if they knew almost nothing else about AI alignment theory, would not be giving a rather skeptical look to the notion that carefully crafting a partially superhuman intelligence to be safe and robust would only take 1.5 times as long compared to cutting all the corners?”


  • Value achievement dilemma

    How can Earth-originating intelligent life achieve most of its potential value, whether by AI or otherwise?