An intuitive human category, or other humanly intuitive quantity or fact, is value-laden when it passes through human goals and desires, such that an agent couldn’t reliably determine this intuitive category or quantity without knowing lots of complicated information about human goals and desires (and how to apply them to arrive at the intended concept).

In terms of Hume’s is-ought type distinction, value-laden categories are those that humans compute using information from the ought side of the boundary, whether or not they notice they are doing so.


Impact vs. important impact

Suppose we want an AI to cure cancer, without this causing any important side effects. What is or isn’t an “important side effect” depends on what you consider “important”. If the cancer cure causes the level of thyroid-stimulating hormone to increase by 5%, this probably isn’t very important. If the cure increases the user’s serotonin level by 5% and this significantly changes the user’s emotional state, we’d probably consider that quite important. But unless the AI already understands complicated human values, it doesn’t necessarily have any way of knowing that one change in blood chemical levels is “not important” and the other is “important”.

If you imagine the cancer cure as disturbing a set of variables \(X_1, X_2, X_3...\) such that their values go from \(x_1, x_2, x_3\) to \(x_1^\prime, x_2^\prime, x_3^\prime,\) then the question of which \(X_i\) are important variables is value-laden. If we temporarily mechanomorphize humans and suppose that we have a utility function, then we could say that variables are “important” when they’re evaluated by our utility function, or when changes to those variables change our expected utility.
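The dependence can be made concrete in a toy sketch. All variable names and numbers below are hypothetical, invented for illustration: the point is that the predictive model reports which variables changed, but only a utility function can say which changes are “important”.

```python
# Toy sketch, not from the source: all names and numbers are hypothetical.
# The predictive model says *which* variables changed; only a utility
# function can say which changes count as "important".

def world_after_cure(state):
    """Predictive ("is") model: the cure perturbs physiological variables."""
    new = dict(state)
    new["tsh_level"] *= 1.05        # thyroid-stimulating hormone up 5%
    new["serotonin_level"] *= 1.05  # serotonin up 5%
    return new

def utility(state):
    """Stand-in human utility: only some variables enter the evaluation."""
    # Serotonin shifts change emotional state, which humans care about;
    # in this toy, a small TSH shift does not enter the evaluation at all.
    return -abs(state["serotonin_level"] - 1.0)

def important_changes(before, after, eps=1e-9):
    """A change is "important" iff undoing it moves utility --
    a value-laden test, unanswerable from the predictive model alone."""
    flagged = {}
    for k in before:
        if before[k] != after[k]:
            probe = dict(after)
            probe[k] = before[k]  # restore just this one variable
            if abs(utility(after) - utility(probe)) > eps:
                flagged[k] = (before[k], after[k])
    return flagged

before = {"tsh_level": 1.0, "serotonin_level": 1.0}
after = world_after_cure(before)
# Only the serotonin change is flagged; the TSH change, though equally
# real in the "is" model, is invisible to this utility function.
```

Note that both changes are identical at the level of physics; the asymmetry between them exists only inside `utility`.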

But by orthogonality and Humean freedom of the utility function, there’s an unlimited number of increasingly complicated utility functions that take into account different variables and functions of variables, so to know what we intuitively mean by “important”, the AI would need information of high algorithmic complexity that the AI had no way to deduce a priori. Which variables are “important” isn’t a question of simple fact—it’s on the “ought” side of the Humean is-ought type distinction—so we can’t assume that an AI which becomes increasingly good at answering “is”-type questions also knows which variables are “important”.

Another way of looking at it is that if an AI merely builds a very good predictive model of the world, the set of “important variables” or “bad side effects” would be a squiggly category with a complicated boundary. Even after the AI has already formed a rich natural is-language to describe concepts like “thyroid” and “serotonin” that are useful for modeling and predicting human biology, it might still require a long message in this language to exactly describe the wiggly boundary of “important impact” or the even more wiggly boundary of “bad impact”.

This suggests that it might be simpler to tell the AI to cure cancer with a minimum of side effects, and to check any remaining side effects with the human operator. If we have a set of “impacts” \(X_k\) to be either minimized or checked which is broad enough to include, in passing, everything inside the squiggly boundary of the \(X_h\) that humans care about, then this broader boundary of “any impact” might be smoother and less wiggly—that is, a short message in the AI’s is-language, making it easier to learn. For the same reason that a library containing every possible book has less information than a library which contains only one book, a category boundary “impact” which includes everything a human cares about, plus some other stuff, can potentially be much simpler than an exact boundary drawn around “impacts we care about”, which is value-laden because it involves caring.
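A minimal sketch of the contrast, with hypothetical variable names throughout: the broad “any impact” detector is a short, value-free rule, while the exact “important impact” detector has to smuggle in a value-laden whitelist.

```python
# Minimal sketch with hypothetical names: "any impact" is a short rule in
# the "is"-language; "important impact" needs a value-laden whitelist.

def any_impact(before, after, tol=0.0):
    """Short, value-free rule: flag every variable that changed at all."""
    return {k for k in before if abs(after[k] - before[k]) > tol}

# The value-laden version cannot be stated without importing human cares,
# here caricatured as an explicit (squiggly) whitelist.
IMPORTANT = {"serotonin_level", "cortisol_level"}  # ...and much, much more

def important_impact(before, after):
    return {k for k in any_impact(before, after) if k in IMPORTANT}

before = {"tsh_level": 1.0, "serotonin_level": 1.0, "shoe_count": 2.0}
after = {"tsh_level": 1.05, "serotonin_level": 1.05, "shoe_count": 2.0}

# The broad category is a superset of the important one by construction,
# which is what makes it safe to minimize or check everything in it.
assert important_impact(before, after) <= any_impact(before, after)
```

The description length of `any_impact` stays constant no matter how squiggly human values are; all the complexity has been pushed into the optional whitelist.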

From a human perspective, the complexity of our value system is already built into us and now appears as a deceptively simple-looking function call—relative to the complexity already built into us, “bad impact” sounds very obvious and very easy to describe. This may lead people to underestimate the difficulty of training AIs to perceive the same boundary. (Just list out all the impacts that potentially lower expected value, darn it! Just the important stuff!)

Faithful simulation vs. adequate simulation

Suppose we want to run an “adequate” or “good-enough” simulation of an uploaded human brain. We can’t say that an adequate simulation is one with identical input-output behavior to a biological brain, because the brain will almost certainly be a chaotic system, meaning that it’s impossible for any simulation to get exactly the same result as the biological system would yield. We nonetheless don’t want the brain to have epilepsy, or to go psychopathic, etcetera.

The concept of an “adequate” simulation, in this case, is really standing in for “a simulation such that the expected value of using the simulated brain’s information is within epsilon of using a biological brain”. In other words, our intuitive notion of what counts as a good-enough simulation is really a value-laden threshold, because it involves an estimate of what’s good enough.

So if we want an AI to have a notion of what kind of simulation is a faithful one, we might find it simpler to try to describe some superset of brain properties, such that if the simulated brain doesn’t perturb the expectations of those properties, it doesn’t perturb expected value either from our own intuitive standpoint (meaning the result of running the uploaded brain is equally valuable in our own expectation). This set of faithfulness properties would need to automatically pick up on changes like psychosis, but could potentially pick up on a much wider range of other changes that we’d regard as unimportant, so long as all the important ones are in there.
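One way to picture the epsilon threshold, with toy numbers and invented outcome names (this is not a proposal for real brain metrics): a simulation counts as adequate when the expected value of its outcome distribution is within epsilon of the biological brain’s, even though chaos guarantees that no individual trajectory matches.

```python
# Toy numbers and outcome names, purely illustrative: "adequate" means
# the expected value of using the simulation is within epsilon of using
# the biological brain, even though no single trajectory matches.

def expected_value(outcome_probs, values):
    return sum(p * values[o] for o, p in outcome_probs.items())

values = {"sane_answer": 1.0, "psychotic_answer": -10.0}

biological = {"sane_answer": 0.999, "psychotic_answer": 0.001}
# Chaos guarantees the simulated trajectory diverges, but its
# distribution over value-relevant outcomes barely moves:
simulated = {"sane_answer": 0.998, "psychotic_answer": 0.002}

eps = 0.1
gap = abs(expected_value(biological, values)
          - expected_value(simulated, values))
adequate = gap < eps  # True: the perturbation is value-negligible
```

The value-ladenness hides in two places: the `values` assigned to outcomes, and the choice of `eps`; neither is recoverable from the physics of either brain.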



Suppose that non-vegetarian programmers train an AGI on their intuitive category “person”, such that:

  • Rocks are not “people” and can be harmed if necessary.

  • Shoes are not “people” and can be harmed if necessary.

  • Cats are sometimes valuable to people, but are not themselves people.

  • Alice, Bob, and Carol are “people” and should not be killed.

  • Chimpanzees, dolphins, and the AGI itself: not sure, check with the users if the issue arises.

Now further suppose that the programmers haven’t thought to cover, in the training data, any case of a cryonically suspended brain. Is this a person? Should it not be harmed? On many ‘natural’ metrics, a cryonically suspended brain is more similar to a rock than to Alice.

From an intuitive perspective of avoiding harm to sapient life, a cryonically suspended brain has to be presumed a person until proven otherwise. But the natural, or inductively simple, category that covers the training cases is likely to label the brain a non-person, maybe with very high probability. The fact that we want the AI to be careful not to hurt the cryonically suspended brain is the sort of thing you could only deduce by knowing which sorts of things humans care about and why. It’s not a simple physical feature of the brain itself.
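A minimal sketch of the failure mode, with feature vectors and numbers invented for illustration: a nearest-neighbor extension of the training cases, using only naive physical features, puts the suspended brain with the rock.

```python
# Invented feature vectors: on inductively 'natural' physical features,
# a cryonically suspended brain lands nearer to a rock than to Alice.

TRAIN = {
    # entity: ((temperature_C, moves, metabolizes, talks), is_person)
    "rock":  ((15.0, 0.0, 0.0, 0.0), False),
    "shoe":  ((20.0, 0.0, 0.0, 0.0), False),
    "cat":   ((38.0, 1.0, 1.0, 0.0), False),
    "alice": ((37.0, 1.0, 1.0, 1.0), True),
}

def dist(a, b):
    """Euclidean distance between feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def nearest_person_label(features):
    """Extend the training category by nearest neighbor -- the
    inductively simple rule, blind to *why* humans drew the boundary."""
    _, label = min(TRAIN.values(), key=lambda fl: dist(features, fl[0]))
    return label

# A cryonically suspended brain: cold, motionless, metabolically inert.
frozen_brain = (-196.0, 0.0, 0.0, 0.0)
# Its nearest training neighbor is the rock, so the learned category
# returns False -- the opposite of the intended presumption of personhood.
```

No amount of extra training on rocks, shoes, and cats fixes this; the missing information is about human caring, not about the brain’s physical features.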

Since the category “person” is a value-laden one, when we extend it to a new region beyond the previous training cases, it’s possible for an entirely new set of philosophical considerations to swoop in, activate, and control how we classify that case, via considerations that didn’t play a role in the previous training cases.


Our intuitive evaluation of value-laden categories goes through our Humean degrees of freedom. This means that a value-laden category which a human sees as intuitively simple can still have high algorithmic complexity, even relative to sophisticated models of the “is” side of the world. This in turn means that even an AI that understands the “is” side of the world very well might not correctly and exactly learn a value-laden category from a small or incomplete set of training cases.

From the perspective of training an agent that hasn’t yet been aligned along all the Humean degrees of freedom, value-laden categories are very wiggly and complicated relative to the agent’s empirical language. Value-laden categories are liable to contain exceptional regions that your training cases turned out not to cover, where from your perspective the obvious intuitive answer is a function of new value-considerations that the agent wouldn’t be able to deduce from previous training data.

This is why much of the art in Friendly AI consists of trying to rephrase an alignment schema into terms that are simple relative to “is”-only concepts: for instance, we want an AI with an impact-in-general metric, rather than an AI which avoids only bad impacts. “Impact” might have a simple, central core relative to a moderately sophisticated language for describing the universe-as-is. “Bad impact” or “important impact” don’t have a simple, central core and hence might be much harder to identify via training cases or communication. Again, this difficulty is hard to appreciate because humans have all their subtle value-laden categories like ‘important’ built in as opaque function calls. Hence people approaching value alignment for the first time often expect that various concepts are easy to identify, and tend to see the intuitive or intended values of all their concepts as “common sense”, regardless of which side of the is-ought divide that common sense is on.

It’s true, for example, that a modern chess-playing algorithm has “common sense” about when not to try to seize control of the gameboard’s center; and similarly a sufficiently advanced agent would develop “common sense” about which substances would in empirical fact have which consequences on human biology, since this part is a strictly “is”-question that can be answered just by looking hard at the universe. But not wanting to administer poisonous substances to a human requires a prior dispreference over the consequences of administering that poison, even if the consequences are correctly forecasted. Similarly, the category “poison” could be said to really mean something like “a substance which, if administered to a human, produces low utility”; some people might classify vodka as poisonous, while others could disagree. An AI doesn’t necessarily have common sense about the intended evaluation of the “poisonous” category, even if it has fully developed common sense about which substances have which empirical biological consequences when ingested. One of those forms of common sense can be developed by staring very intelligently at biological data, and one of them cannot. But from a human intuitive standpoint, both of these can feel equally like the same notion of “common sense”, which might lead to a dangerous expectation that an AI gaining in one type of common sense is bound to gain in the other.
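The vodka case can be caricatured in a few lines (substances and harm numbers invented for illustration): one shared empirical model of biological consequences supports different “poisonous” categories, because the low-utility threshold is a value judgment.

```python
# Toy sketch (substances and harm numbers invented): a single empirical
# "is"-model of biological consequences supports different "poisonous"
# categories, because the low-utility threshold is a value judgment.

EFFECTS = {  # shared empirical model: dose-normalized harm estimates
    "cyanide": 0.99,
    "vodka": 0.30,
    "water": 0.01,
}

def is_poison(substance, harm_threshold):
    """'Poison' = a substance producing sufficiently low utility; the
    threshold is the value-laden part, not derivable from EFFECTS."""
    return EFFECTS[substance] >= harm_threshold

# Two evaluators share the biological common sense but disagree on vodka:
strict = {s: is_poison(s, 0.2) for s in EFFECTS}      # vodka is poison
permissive = {s: is_poison(s, 0.5) for s in EFFECTS}  # vodka is fine
```

More biological data refines `EFFECTS` but can never settle the disagreement, which lives entirely in the choice of `harm_threshold`.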

Further reading


  • Reflectively consistent degree of freedom

    When an instrumentally efficient, self-modifying AI can be like X or like X’ in such a way that X wants to be X and X’ wants to be X’, that’s a reflectively consistent degree of freedom.