An intuitive human category, or other humanly intuitive quantity or fact, is value-laden when it passes through human goals and desires, such that an agent couldn’t reliably determine this intuitive category or quantity without knowing lots of complicated information about human goals and desires (and how to apply them to arrive at the intended concept).

In terms of Hume’s is-ought type distinction, value-laden categories are those that humans compute using information from the ought side of the boundary, whether or not they notice they are doing so.


Impact vs. important impact

Suppose we want an AI to cure cancer, without this causing any important side effects. What is or isn’t an “important side effect” depends on what you consider “important”. If the cancer cure causes the level of thyroid-stimulating hormone to increase by 5%, this probably isn’t very important. If the cure increases the user’s serotonin level by 5% and this significantly changes the user’s emotional state, we’d probably consider that quite important. But unless the AI already understands complicated human values, it doesn’t necessarily have any way of knowing that one change in blood chemical levels is “not important” and the other is “important”.

If you imagine the cancer cure as disturbing a set of variables \(X_1, X_2, X_3, \ldots\) such that their values go from \(x_1, x_2, x_3\) to \(x_1^\prime, x_2^\prime, x_3^\prime\), then the question of which \(X_i\) are important variables is value-laden. If we temporarily mechanomorphize humans and suppose that we have a utility function, then we could say that variables are “important” when they’re evaluated by our utility function, or when changes to those variables change our expected utility.
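The dependence of “importance” on the utility function can be made concrete with a toy sketch. Everything here (the variables, the utility function, the threshold) is invented for illustration, not a proposal:

```python
# Toy model: "importance" of a change is defined only relative to a
# utility function.  This invented utility cares about serotonin and
# ignores thyroid-stimulating hormone (TSH) entirely.

def expected_utility(state):
    # Mechanomorphized human utility: penalize serotonin deviations,
    # ignore TSH.
    return -10.0 * abs(state["serotonin"] - 1.0)

def important_change(before, after, eps=0.01):
    """A change is 'important' iff it moves expected utility by more than eps."""
    return abs(expected_utility(after) - expected_utility(before)) > eps

baseline = {"tsh": 1.0, "serotonin": 1.0}
cure_a = {"tsh": 1.05, "serotonin": 1.0}   # +5% TSH: utility unchanged
cure_b = {"tsh": 1.0, "serotonin": 1.05}   # +5% serotonin: utility shifts

# important_change(baseline, cure_a) -> False
# important_change(baseline, cure_b) -> True
```

The same 5% perturbation is “important” or “unimportant” depending entirely on which variables the utility function happens to evaluate, which is the point: nothing in the physics of the two perturbations distinguishes them.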

But by orthogonality and Humean freedom of the utility function, there’s an unlimited number of increasingly complicated utility functions that take into account different variables and functions of variables, so to know what we intuitively mean by “important”, the AI would need information of high algorithmic complexity that the AI had no way to deduce a priori. Which variables are “important” isn’t a question of simple fact—it’s on the “ought” side of the Humean is-ought type distinction—so we can’t assume that an AI which becomes increasingly good at answering “is”-type questions also knows which variables are “important”.

Another way of looking at it is that if an AI merely builds a very good predictive model of the world, the set of “important variables” or “bad side effects” would be a squiggly category with a complicated boundary. Even after the AI has already formed a rich natural is-language to describe concepts like “thyroid” and “serotonin” that are useful for modeling and predicting human biology, it might still require a long message in this language to exactly describe the wiggly boundary of “important impact” or the even more wiggly boundary of “bad impact”.

This suggests that it might be simpler to tell the AI to cure cancer with a minimum of any side effects, and to check any remaining side effects with the human operator. If we have a set of “impacts” \(X_k\) to be either minimized or checked which is broad enough to include, in passing, everything inside the squiggly boundary of the \(X_h\) that humans care about, then this broader boundary of “any impact” might be smoother and less wiggly—that is, a short message in the AI’s is-language, making it easier to learn. For the same reason that a library containing every possible book has less information than a library which only contains one book, a category boundary “impact” which includes everything a human cares about, plus some other stuff, can potentially be much simpler than an exact boundary drawn around “impacts we care about”, which is value-laden because it involves caring.
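One way to make the “broad but simple” boundary concrete is a crude, value-free impact measure that flags every variable which moved beyond some small tolerance. The variable names and the 1% tolerance below are invented for illustration:

```python
# Value-free impact measure: flag every variable that moved more than
# 1% in relative terms, over-approximating the value-laden set of
# "important" changes.  Variable names and numbers are invented.

def flagged_impacts(before, after, tolerance=0.01):
    """Return the set of variables whose relative change exceeds tolerance."""
    return {name for name in before
            if abs(after[name] - before[name]) / abs(before[name]) > tolerance}

before = {"tsh": 1.0, "serotonin": 1.0, "tumor_mass": 1.0}
after  = {"tsh": 1.05, "serotonin": 1.05, "tumor_mass": 0.0}

# flagged_impacts(before, after) -> {"tsh", "serotonin", "tumor_mass"}
```

The rule flags the intended change (tumor mass) and both side effects alike; sorting “important” from “unimportant” is deliberately left to the human operator, because that sorting is the value-laden part.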

From a human perspective, the complexity of our value system is already built into us and now appears as a deceptively simple-looking function call—relative to the complexity already built into us, “bad impact” sounds very obvious and very easy to describe. This may lead people to underestimate the difficulty of training AIs to perceive the same boundary. (Just list out all the impacts that potentially lower expected value, darn it! Just the important stuff!)

Faithful simulation vs. adequate simulation

Suppose we want to run an “adequate” or “good-enough” simulation of an uploaded human brain. We can’t say that an adequate simulation is one with identical input-output behavior to a biological brain, because the brain will almost certainly be a chaotic system, meaning that it’s impossible for any simulation to get exactly the same result as the biological system would yield. We nonetheless don’t want the brain to have epilepsy, or to go psychopathic, etcetera.

The concept of an “adequate” simulation, in this case, is really standing in for “a simulation such that the expected value of using the simulated brain’s information is within epsilon of using a biological brain”. In other words, our intuitive notion of what counts as a good-enough simulation is really a value-laden threshold because it involves an estimate of what’s good enough.

So if we want an AI to have a notion of what kind of simulation is a faithful one, we might find it simpler to try to describe some superset of brain properties, such that if the simulated brain doesn’t perturb the expectations of those properties, it doesn’t perturb expected value either from our own intuitive standpoint (meaning the result of running the uploaded brain is equally valuable in our own expectation). This set of faithfulness properties would need to automatically pick up on changes like psychosis, but could potentially pick up on a much wider range of other changes that we’d regard as unimportant, so long as all the important ones are in there.
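As a hedged sketch of what a faithfulness-property check might look like, one could track a superset of measurable brain statistics and require each to stay near its biological baseline. The property names, baselines, and tolerances below are invented, not real neuroscience:

```python
# Hedged sketch: accept a brain simulation only if a broad set of
# measurable properties stays near the biological baseline.  The
# property names, baselines, and tolerances are all invented.

BASELINE  = {"firing_rate_hz": 10.0, "seizure_index": 0.02, "reaction_ms": 250.0}
TOLERANCE = {"firing_rate_hz": 0.2,  "seizure_index": 0.5,  "reaction_ms": 0.1}  # relative

def faithful(sim_stats):
    """True iff every tracked property is within its relative tolerance."""
    return all(
        abs(sim_stats[k] - BASELINE[k]) / BASELINE[k] <= TOLERANCE[k]
        for k in BASELINE
    )

ok_sim        = {"firing_rate_hz": 10.5, "seizure_index": 0.021, "reaction_ms": 260.0}
epileptic_sim = {"firing_rate_hz": 10.5, "seizure_index": 0.20,  "reaction_ms": 260.0}

# faithful(ok_sim)        -> True   (all properties near baseline)
# faithful(epileptic_sim) -> False  (seizure_index is 10x baseline)
```

The check never mentions value; it only asks whether a broad set of “is”-properties was perturbed, trusting that the properties we actually care about are somewhere inside that superset.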


Person vs. nonperson

Suppose that non-vegetarian programmers train an AGI on their intuitive category “person”, such that:

  • Rocks are not “people” and can be harmed if necessary.

  • Shoes are not “people” and can be harmed if necessary.

  • Cats are sometimes valuable to people, but are not themselves people.

  • Alice, Bob, and Carol are “people” and should not be killed.

  • Chimpanzees, dolphins, and the AGI itself: not sure, check with the users if the issue arises.

Now further suppose that the programmers haven’t thought to cover, in the training data, any case of a cryonically suspended brain. Is this a person? Should it not be harmed? On many ‘natural’ metrics, a cryonically suspended brain is more similar to a rock than to Alice.

From an intuitive perspective of avoiding harm to sapient life, a cryonically suspended brain must be presumed a person until proven otherwise. But the natural, or inductively simple, category that covers the training cases is likely to label the brain a non-person, maybe with very high probability. The fact that we want the AI to be careful not to hurt the cryonically suspended brain is the sort of thing you could only deduce by knowing which sorts of things humans care about and why. It’s not a simple physical feature of the brain itself.
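A toy sketch of this extrapolation failure: a 1-nearest-neighbour classifier over naive physical features (temperature, movement, metabolic rate), trained on cases like those above, labels a cryonically suspended brain rock-like. All the features and numbers are invented for illustration:

```python
# Toy 1-nearest-neighbour "person" classifier over naive physical
# features: (temperature in C, movement, metabolic rate).  The
# features, labels, and numbers are invented for illustration.

TRAINING = {
    ("rock",  (15.0, 0.0, 0.0)): False,
    ("shoe",  (20.0, 0.0, 0.0)): False,
    ("cat",   (38.5, 1.0, 0.8)): False,
    ("alice", (37.0, 1.0, 1.0)): True,
    ("bob",   (37.0, 1.0, 1.0)): True,
}

def classify_person(features):
    """Label a new case by its nearest training example."""
    def sq_dist(f):
        return sum((a - b) ** 2 for a, b in zip(f, features))
    nearest_case = min(TRAINING, key=lambda case: sq_dist(case[1]))
    return TRAINING[nearest_case]

# A cryonically suspended brain: inert and at liquid-nitrogen
# temperature, hence closest to "rock" in this feature space.
frozen_brain = (-196.0, 0.0, 0.0)
# classify_person(frozen_brain) -> False
```

On these inductively natural features the frozen brain really is nearer to a rock than to Alice; the features that would make us call it a person route through what humans care about, not through anything in the training geometry.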

Since the category “person” is a value-laden one, when we extend it to a new region beyond the previous training cases, it’s possible for an entirely new set of philosophical considerations to swoop in, activate, and control how we classify that case via considerations that didn’t play a role in the previous training cases.


Our intuitive evaluation of value-laden categories goes through our Humean degrees of freedom. This means that a value-laden category which a human sees as intuitively simple can still have high algorithmic complexity, even relative to sophisticated models of the “is” side of the world. This in turn means that even an AI that understands the “is” side of the world very well might not correctly and exactly learn a value-laden category from a small or incomplete set of training cases.

From the perspective of training an agent that hasn’t yet been aligned along all the Humean degrees of freedom, value-laden categories are very wiggly and complicated relative to the agent’s empirical language. Value-laden categories are liable to contain exceptional regions that your training cases turned out not to cover, where from your perspective the obvious intuitive answer is a function of new value-considerations that the agent wouldn’t be able to deduce from previous training data.

This is why much of the art in Friendly AI consists of trying to rephrase an alignment schema into terms that are simple relative to “is”-only concepts: preferring an AI with an impact-in-general metric, rather than an AI which avoids only bad impacts. “Impact” might have a simple, central core relative to a moderately sophisticated language for describing the universe-as-is. “Bad impact” or “important impact” don’t have a simple, central core and hence might be much harder to identify via training cases or communication. Again, this difficulty is easy to overlook because humans have all their subtle value-laden categories, like “important”, built in as opaque function calls. Hence people approaching value alignment for the first time often expect that various concepts are easy to identify, and tend to see the intuitive or intended values of all their concepts as “common sense”, regardless of which side of the is-ought divide that common sense is on.

It’s true, for example, that a modern chess-playing algorithm has “common sense” about when not to try to seize control of the gameboard’s center; and similarly, a sufficiently advanced agent would develop “common sense” about which substances would in empirical fact have which consequences on human biology, since this part is strictly an “is”-question that can be answered just by looking hard at the universe. But not wanting to administer poisonous substances to a human requires a prior dispreference over the consequences of administering that poison, even if the consequences are correctly forecasted. Similarly, the category “poison” could be said to really mean something like “a substance which, if administered to a human, produces low utility”; some people might classify vodka as poisonous, while others would disagree. An AI doesn’t necessarily have common sense about the intended evaluation of the “poisonous” category, even if it has fully developed common sense about which substances have which empirical biological consequences when ingested. One of those forms of common sense can be developed by staring very intelligently at biological data, and one of them cannot. But from a human intuitive standpoint, both of these can feel equally like the same notion of “common sense”, which might lead to a dangerous expectation that an AI gaining in one type of common sense is bound to gain in the other.
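The two kinds of “common sense” about poison can be separated in a toy sketch: an empirical dose-to-damage model (an “is”-question) versus a value-laden threshold for how much damage counts as “poisonous”. The substances, potencies, and thresholds are all invented:

```python
# "Is" vs. "ought" in the poison example.  The dose-response model is
# purely empirical; the "poison" label adds a value-laden threshold.
# Substances, potencies, and thresholds are all invented.

def predicted_damage(substance, dose_g):
    """Empirical 'is'-model: predicted harm on a 0-1 scale."""
    potency = {"water": 0.0, "vodka": 0.004, "cyanide": 0.9}
    return min(1.0, potency[substance] * dose_g)

def is_poison(substance, dose_g, damage_threshold):
    """Value-laden label: the same empirical prediction, judged
    against an evaluator-dependent harm threshold."""
    return predicted_damage(substance, dose_g) > damage_threshold

strict, permissive = 0.01, 0.2  # two evaluators' thresholds

# Same empirical model, different verdicts on vodka:
# is_poison("vodka", 30, strict)     -> True
# is_poison("vodka", 30, permissive) -> False
```

Improving `predicted_damage` is something an agent can do by staring at biological data; nothing it learns there settles which `damage_threshold` was intended.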

Further reading


  • Reflectively consistent degree of freedom

    When an instrumentally efficient, self-modifying AI can be like X or like X’ in such a way that X wants to be X and X’ wants to be X’, that’s a reflectively consistent degree of freedom.