Mindcrime
“Mindcrime” is Nick Bostrom’s suggested term for scenarios in which an AI’s cognitive processes are intrinsically doing moral harm, for example because the AI contains trillions of suffering conscious beings inside it.
Ways in which this might happen:
Problem of sapient models (of humans): Occurs naturally if the best predictive model for humans in the environment involves models that are detailed enough to be people themselves.
Problem of sapient models (of civilizations): Occurs naturally if the agent tries to simulate, e.g., alien civilizations that might be simulating it, in enough detail to include conscious simulations of the aliens.
Problem of sapient subsystems: Occurs naturally if the most efficient design for some cognitive subsystems involves creating subagents that are self-reflective, or have some other property leading to consciousness or personhood.
Problem of sapient self-models: If the AI is conscious or possible future versions of the AI are conscious, it might run and terminate a large number of conscious-self models in the course of considering possible self-modifications.
Problem of sapient models (of humans):
An instrumental pressure to produce high-fidelity predictions of human beings (or to predict decision counterfactuals about them, or to search for events that lead to particular consequences, etcetera) may lead the AI to run computations that are unusually likely to possess personhood.
An unrealistic example of this would be Solomonoff induction, where predictions are made by means that include running many possible simulations of the environment and seeing which ones best correspond to reality. Among current machine learning algorithms, particle filters and Monte Carlo algorithms similarly involve running many possible simulated versions of a system.
It’s possible that an AI advanced enough to have successfully arrived at detailed models of human intelligence would usually also be advanced enough that it never tried to use a predictive/searchable model that engaged in brute-force simulation of those humans. (Consider, e.g., that there will usually be many possible settings of a variable inside a model, and an efficient model might manipulate data representing a probability distribution over those settings, rather than ever considering one exact, specific human in toto.)
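As a toy illustration of the contrast (everything here is made up, with a one-variable stand-in for a “model of a human”): a Monte Carlo predictor instantiates many concrete settings and simulates each one, while an efficient predictor manipulates only a probability distribution over the settings and never considers any single concrete instance.

```python
import random

# Hypothetical "model of a human" reduced to one hidden variable.
SETTINGS = ["cautious", "neutral", "bold"]
PRIOR = {"cautious": 0.3, "neutral": 0.4, "bold": 0.3}
# P(person accepts some offer | setting) -- illustrative numbers only.
P_ACCEPT = {"cautious": 0.2, "neutral": 0.5, "bold": 0.9}

def predict_by_simulation(n_samples=10_000, rng=random.Random(0)):
    """Monte Carlo style: instantiate many concrete settings, simulate each."""
    hits = 0
    for _ in range(n_samples):
        setting = rng.choices(SETTINGS, weights=[PRIOR[s] for s in SETTINGS])[0]
        hits += rng.random() < P_ACCEPT[setting]   # simulate one concrete instance
    return hits / n_samples

def predict_by_marginalization():
    """Distribution style: sum over the prior; no single setting is ever run."""
    return sum(PRIOR[s] * P_ACCEPT[s] for s in SETTINGS)

print(predict_by_marginalization())   # exact: 0.3*0.2 + 0.4*0.5 + 0.3*0.9 = 0.53
```

Both predictors converge on the same answer, but only the first ever instantiates particular concrete versions of the modeled system.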
This, however, doesn’t make it certain that no mindcrime will occur. It may not take an exact, faithful simulation of a specific human to create a conscious model. An efficient model of (a spread of possibilities for) a human may still contain computations that resemble a person closely enough to create consciousness, or whatever other properties may be deserving of personhood. Consider, in particular, an agent trying to use
Just as it almost certainly isn’t necessary to go all the way down to the neural level to create a sapient being, it may be that even with some parts of a mind considered abstractly, the remainder would be computed in enough detail to imply consciousness, sapience, personhood, etcetera.
The problem of sapient models is not to be confused with Simulation Hypothesis issues. An efficient model of a human need not have subjective experience indistinguishable from that of the human (although it will be a model of a person who doesn’t believe themselves to be a model). The problem occurs if the model is a person, not if the model is the same person as its subject, and the latter possibility plays no role in the implication of moral harm.
Besides problems that are directly or obviously about modeling people, many other practical problems and questions can benefit from modeling other minds—e.g., reading the directions on a toaster oven in order to discern the intent of the mind that was trying to communicate how to use a toaster. Thus, mindcrime might result from a sufficiently powerful AI trying to solve very mundane problems.
Problem of sapient models (of civilizations)
A separate route to mindcrime comes from an advanced agent considering, in sufficient detail, the possible origins and futures of intelligent life on other worlds. (Imagine that you were suddenly told that this version of you was actually embedded in a superintelligence that was imagining how life might evolve on a place like Earth, and that your subprocess was not producing sufficiently valuable information and was about to be shut down. You would probably be annoyed! We should try not to annoy other people in this way.)
Three possible origins of a convergent instrumental pressure to consider intelligent civilizations in great detail:
Assigning sufficient probability to the existence of non-obvious extraterrestrial intelligences in Earth’s vicinity, perhaps due to considering the Fermi Paradox.
Naturalistic induction, combined with the AI considering the hypothesis that it is in a simulated environment.
Logical decision theories and utility functions that care about the consequences of the AI’s decisions via instances of the AI’s reference class that could be embedded inside alien simulations.
With respect to the latter two possibilities, note that the AI does not need to be considering possibilities in which the whole Earth as we know it is a simulation. The AI only needs to consider that, among the possible explanations of the AI’s current sense data and internal data, there are scenarios in which the AI is embedded in some world other than the most ‘obvious’ one implied by the sense data. See also Distant superintelligences can coerce the most probable environment of your AI for a related hazard of the AI considering possibilities in which it is being simulated.
(Eliezer Yudkowsky has advocated that we shouldn’t let any AI short of extreme levels of safety and robustness assurance consider distant civilizations in lots of detail in any case, since this means our AI might embed (a model of) a hostile superintelligence.)
Problem of sapient subsystems:
It’s possible that the most efficient system for, say, allocating memory on a local cluster, constitutes a complete reflective agent with a self-model. Or that some of the most efficient designs for subprocesses of an AI, in general, happen to have whatever properties lead up to consciousness or whatever other properties are important to personhood.
This might possibly constitute a relatively less severe moral catastrophe, if the subsystems are sentient but lack a reinforcement-based pleasure/pain architecture (since the latter is not obviously a property of the most efficient subagents). In this case, there might be large numbers of conscious beings embedded inside the AI and occasionally dying as they are replaced, but they would not be suffering. It is nonetheless the sort of scenario that many of us would prefer to avoid.
Problem of sapient self-models:
The AI’s models of itself, or of other AIs it could possibly build, might happen to be conscious or have other properties deserving of personhood. This is worth considering as a separate possibility from building a conscious or personhood-deserving AI ourselves, when we didn’t mean to do so, because of these two additional properties:
Even if the AI’s current design is not conscious or personhood-deserving, the current AI might consider possible future versions or subagent designs that would be conscious, and those considerations might themselves be conscious.
This means that even if the AI’s current version doesn’t seem like it has key personhood properties on its own—that we’ve successfully created the AI itself as a nonperson—we still need to worry about other conscious AIs being embedded into it.
The AI might create, run, and terminate very large numbers of potential self-models.
Even if we consider tolerable the potential moral harm of creating one conscious AI (e.g. the AI lacks all of the conditions that a responsible parent would want to ensure when creating a new intelligent species, but it’s just one sapient being so it’s okay to do that in order to save the world), we might not want to take on the moral harm of creating trillions of evanescent, swiftly erased conscious beings.
Difficulties
Trying to consider these issues is complicated by:
Philosophical uncertainty about what properties are constitutive of consciousness and which computer programs have them;
Moral uncertainty about what (idealized versions of) (any particular person’s) morality would consider to be the key properties of personhood;
Our present-day uncertainty about what efficient models in advanced agents would look like.
It’d help if we knew the answers to these questions, but the fact that we don’t know doesn’t mean we can thereby conclude that any particular model is not a person. (That inference would be some mix of argumentum ad ignorantiam, and availability bias making us think a scenario is unlikely because it is hard to visualize.) In the limit of infinite computing power, the epistemically best models of humans would almost certainly involve simulating many possible versions of them; superintelligences would have very large amounts of computing power, and we don’t know at what point an agent comes close enough to this limit to cross the threshold.
Scope of potential disaster
The prospect of mindcrime is an especially alarming possibility because sufficiently advanced agents, especially if they are using computationally efficient models, might consider very large numbers of hypothetical possibilities that would count as people. There’s no limit that says that if there are seven billion people, an agent will run at most seven billion models; the agent might be considering many possibilities per individual human. This would not be an astronomical disaster since it would not (by hypothesis) wipe out our posterity and our intergalactic future, but it could be a disaster orders of magnitude larger than the Holocaust, the Mongol Conquest, the Middle Ages, or all human tragedy to date.
Development-order issue
If we ask an AI to predict what we would say if we had a thousand years to think about the problem of defining personhood or think about which causal processes are ‘conscious’, this seems unusually likely to cause the AI to commit mindcrime in the course of answering the question. Even asking the AI to think abstractly about the problem of consciousness, or predict by abstract reasoning what humans might say about it, seems unusually likely to result in mindcrime. There thus exists a development order issue preventing us from asking a Friendly AI to solve the problem for us, since to file this request safely and without committing mindcrime, we would need the request to already have been completed.
The prospect of enormous-scale disaster militates against ‘temporarily’ tolerating mindcrime inside a system while, e.g., an extrapolated-volition or approval-based agent tries to compute the code or design of a non-mindcriminal agent. Depending on the agent’s efficiency, and secondarily on its computational limits, a tremendous amount of moral harm might be done during the ‘temporary’ process of computing an answer.
Weirdness
Literally nobody outside of MIRI or FHI ever talks about this problem.
Nonperson predicates
A nonperson predicate is an effective test that we, or an AI, can use to determine that some computer program is definitely not a person. In principle, a nonperson predicate needs only two possible outputs, “Don’t know” and “Definitely not a person”. It’s acceptable for many actually-nonperson programs to be labeled “don’t know”, so long as no people are labeled “definitely not a person”.
If the above was the only requirement, one simple nonperson predicate would be to label everything “don’t know”. The implicit difficulty is that the nonperson predicate must also pass some programs of high complexity that do things like “acceptably model humans” or “acceptably model future versions of the AI”.
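The two-output contract can be sketched in a few lines. The whitelist entries below are placeholders; deciding what may safely go on such a whitelist is exactly the hard part the article goes on to discuss.

```python
from enum import Enum

class Verdict(Enum):
    DEFINITELY_NOT_A_PERSON = "definitely not a person"
    DONT_KNOW = "don't know"

# Hypothetical classes of programs assumed (for this sketch only) to be
# too simple to instantiate a person.
WHITELIST = {"arithmetic", "sorting", "linear-regression"}

def nonperson_predicate(program_class: str) -> Verdict:
    """Conservative by construction: errs toward DONT_KNOW.

    The one inviolable requirement is that no actual person is ever
    labeled DEFINITELY_NOT_A_PERSON; false DONT_KNOWs are acceptable.
    """
    if program_class in WHITELIST:
        return Verdict.DEFINITELY_NOT_A_PERSON
    return Verdict.DONT_KNOW
```

Note that the trivial predicate that always returns `DONT_KNOW` satisfies the safety requirement but is useless; the value of the predicate lies entirely in how much it can safely whitelist.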
Besides addressing mindcrime scenarios, Yudkowsky’s original proposal was also aimed at knowing that the AI design itself was not conscious, or not a person.
It seems likely to be very hard to find a good nonperson predicate:
Not all philosophical confusions and computational difficulties are averted by asking for a partial list of unconscious programs instead of a total list of conscious programs. Even if we don’t know which properties are sufficient, we’d need to know something solid about properties that are necessary for consciousness or sufficient for nonpersonhood.
We can’t pass once-and-for-all any class of programs that’s Turing-complete. We can’t say once and for all that it’s safe to model gravitational interactions in a solar system, if enormous gravitational systems could encode computers that encode people.
The Nearest unblocked strategy problem seems particularly worrisome here. If we block off some options for modeling humans directly, the next best option is unusually likely to be conscious. Even if we rely on a whitelist rather than a blacklist, this may lead to a whitelisted “gravitational model” that secretly encodes a human, and so on.
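The nearest-unblocked worry can be made concrete with a toy model (all sets and scores here are hypothetical): if the blocking predicate misses even one actually-sapient program, and person-like models happen to score best on usefulness, an optimizer steered away from the blocked options lands precisely on the missed case.

```python
# Toy model: "programs" are integers.
truly_sapient = {3, 7, 11, 15}      # programs that are in fact people
flagged = {3, 7, 11}                # our imperfect predicate misses 15

def usefulness(program):
    # Suppose the most useful human-models are exactly the person-like ones.
    return 10 if program in truly_sapient else 1

# An optimizer that merely avoids flagged programs picks the best unblocked one:
best = max((p for p in range(20) if p not in flagged), key=usefulness)
print(best)   # 15 -- exactly the sapient program the predicate failed to catch
```

The optimization pressure, not bad luck, is what concentrates probability on the predicate’s blind spots.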
Research avenues
Behaviorism: Try to create a limited AI that does not model other minds or possibly even itself, except using some narrow class of agent models that we are pretty sure will not be sentient. This avenue is potentially motivated for other reasons as well, such as avoiding probable environment hacking and averting programmer manipulation.
Try to define a nonperson predicate that whitelists enough programs to carry out some pivotal achievement.
Try for an AI that can bootstrap our understanding of consciousness and tell us about what we would define as a person, while committing a relatively small amount of mindcrime, with all computed possible-people being stored rather than discarded, and the modeled agents being entirely happy, mostly happy, or non-suffering. E.g., put a happy person at the center of the approval-directed agent, and try to oversee the AI’s algorithms and ask it not to use Monte Carlo simulations if possible.
Ignore the problem in all pre-interstellar stages because it’s still relatively small compared to astronomical stakes and therefore not worth significant losses in success probability. (This may backfire under some versions of the Simulation Hypothesis.)
Try to finish the philosophical problem of understanding which causal processes experience sapience (or are otherwise objects of ethical value), in the next couple of decades, to sufficient detail that it can be crisply stated to an AI, with sufficiently complete coverage that it’s not subject to the Nearest unblocked strategy problem.
Children:
- Mindcrime: Introduction
- Nonperson predicate
If we knew which computations were definitely not people, we could tell AIs which programs they were definitely allowed to compute.
Parents:
- AI alignment
The great civilizational problem of creating artificially intelligent computer systems such that running them is a good idea.
“Weirdness: Literally nobody outside of MIRI or FHI ever talks about this problem”
…but it does seem to be a popular topic of contemporary SciFi (WestWorld, Black Mirror, etc.)
I have an intuition that says that if you run any sufficiently large computation (even if it’s as simple as multiplication, e.g. (3^^^3)*(3^^^3)), you’ll likely accidentally create sentient life within it. Checking for that seems prohibitively expensive, or maybe even impossible, since checking itself might run into the same problem.
Eliezer, I find your position confusing.
Consider the first AI system that can reasonably predict your answers to questions of the form “Might X constitute mindcrime?” where X is a natural language description of some computational process. (Well enough that, say, most of a useful computation can be flagged as “definitely not mindcrime,” and all mindcrime can be flagged as “maybe mindcrime.”)
Do you believe that this system will have significant moral disvalue? If that system doesn’t have moral disvalue, where is the chicken and egg problem?
So it seems like you must believe that this system will have significant moral disvalue. That sounds implausible on its face to me. What are you imagining this system will look like? Do you think that this kind of question is radically harder than other superficially comparable question-answering tasks? Do you think that any AI researchers will find your position plausible? If not, what do you think they are getting wrong?
ETA: maybe the most useful thing to clarify would be the kind of computation, and how it relates to the rest of what the AI is doing, that you would find really hard to classify, but which might plausibly be unavoidable for effective computation.
This whole disagreement may be related to broader disagreements about how aligned AI systems will look. But you seem to think that mindcrime is also a problem for act-based agents, so that can’t explain all of it. We might want to restrict attention to the act-based case in order to isolate disagreement specific to mindcrime, and it’s possible that discussion should wait until we get on the same page about act-based agents.
Yes! It sounds close to FAI-complete in the capacities required. It sounds like trying to brute-force an answer to it via generalized supervised learning might easily involve simulating trillions of Eliezer-models. In general you and I seem to have very different intuitions about how hard it is to get a good answer to “deep, philosophical questions” via generalized supervised learning.
The obvious patch is for a sufficiently sophisticated system to have preferences over its own behavior, which motivate it to avoid reasoning in ways that we would dislike.
For example, suppose that my utility function U is “how good idealized Eliezer thinks things are, after thinking for a thousand years.” It doesn’t take long to realize that idealized Eliezer would be unhappy with a literal simulation of idealized Eliezer. Moreover, a primitive understanding of Eliezer’s views suffices to avoid the worst offenses (or at least to realize that they are the kinds of things which Eliezer would prefer that a human be asked about first).
An AI that is able to crush humans in the real world without being able to do this kind of reasoning seems catastrophic on other grounds. An AI that is able but unmotivated to carry out or act on this kind of reasoning seems even more catastrophic for other reasons. (For example, I don’t see any realistic approach to corrigibility that wouldn’t solve this problem as well, and conversely I see many ways to resolve both.)
Edit: Intended as a response to the original post, but no way to delete and repost as far as I can tell.
My worry here would be that we’ll run into a Nearest Unblocked Neighbor problem on our attempts to define sapience as a property of computer simulations.
Let’s say that sapience1 is a definition that covers most of the ‘actual definition of sapience’ (e.g. what we’d come up with given unlimited time to think, etc.), which I’ll call sapience0, relative to some measure on probable computer programs. But there are still exceptions; there are sapient0 things not detected by sapience1. The best hypothesis for predicting an actually sapient mind that is not in sapience1 seems unusually likely to be one of the special cases that is still in sapience0. It might even just be an obfuscated ordinary sapient program, rather than one with an exotic kind of sapience, if sapience1 doesn’t incorporate some advanced-safe way of preventing obfuscation.
We can’t throw a superhumanly sophisticated definition at the problem (e.g. the true sapience0 plus an advanced-safe block against obfuscation) without already asking the AI to simulate us or to predict the results of simulating us in order to obtain this hypothetical sapience2.
This just isn’t obvious to me. It seems likely to me that an extremely advanced understanding of Eliezer’s idealized views is required to answer questions about what Eliezer would say about consciousness, with extreme accuracy, without
My views about Eliezer’s preferences may depend on the reason that I am running X, rather than merely the content of X. E.g. if I am running X because I want to predict what a person will do, that’s a tipoff. This sort of thing working relies on a matching between the capabilities being used to guide my thinking and the capabilities being used to assess that thinking to see whether it constitutes mind crime.
But so does the whole project. You’ve said this well: “you just build the conscience, and that is the AI.” The AI doesn’t propose a way of figuring out X and then reject or not reject it because it constitutes mind crime, any more than it proposes an action to satisfy its values and then rejects or fails to reject it because the user would consider it immoral. The AI thinks the thought that it ought to think, as best it can figure out, just like it does the thing that it ought to do, as best it can figure out.
Note that you are allowed to just ask about or avoid marginal cases, as long as the total cost of asking or inconvenience of avoiding is not large compared to the other costs of the project. And whatever insight you would have put into your philosophical definition of sapience, you can try to communicate it as well as possible as a guide to predicting “what Eliezer would say about X,” which can circumvent the labor of actually asking.
Ok, Eliezer, you’ve addressed my point directly with sapience0 / sapience1 example. That makes sense. I guess one pitfall for AI might be to keep improving its sapience model without end, because “Oh, gosh, I really don’t want to create life by accident!” I guess this just falls into the general category of problems where “AI does thing X for a long time before getting around to satisfying human values”, where thing X is actually plausibly necessary. Not sure if you have a name for a pitfall like that. I can try my hand at creating a page for it, if you don’t have it already.
Paul, I don’t disagree that we want the AI to think whatever thought it ought to think. I’m proposing a chicken-and-egg problem where the AI can’t figure out which thoughts constitute mindcrime, without already committing mindcrime. I think you could record a lot of English pontification from me and still have a non-person-simulating AI feeling pretty confused about what the heck I meant or how to apply it to computer programs. Can you give a less abstract view of how you think this problem should be solved? What human-understanding and mindcrime-detection abilities do you think the AI can develop, in what order, without committing lots of mindcrime along the way? Sure, given infinite human understanding, the AI can detect mindcrime very efficiently, but the essence of the problem is that it seems hard to get infinite human understanding without lots of mindcrime being committed along the way. So what is it you think can be done instead, that postulates only a level of human understanding that you think can be done knowably without simulating people?
This is discussed under some name or other, by at least the utilitarians and by Paul Christiano.
There’s another difficulty: the nonperson predicate must not itself commit mindcrime while evaluating the programs. This sounds obvious enough in retrospect that it doesn’t feel worth mentioning, but it took me a while to notice it.
Obviously, if you’re running the program to determine if it’s a person by analyzing its behavior (e.g. by asking it if it feels like it’s conscious), you already commited mindcrime by the time you return “Don’t know”.
But if the tested program and the predicate are complex enough, lots of analysis other than straight running the program could accidentally instantiate persons as sub-processes, potentially ones distinct from those that might be instantiated by the tested program itself.
In other words: Assume Π is the set of all programs that potentially contain a person, i.e. for any program π, π in Π iff running π could instantiate a person.
We want a computable safety predicate S such that, for any program π, S(π) implies π ∉ Π, i.e. S(π) means π is safe. (Though ¬S(π) does not necessarily imply π ∈ Π.)
The problem is that the computation of S(π) is itself a program, and we need to make sure that this program ∉ Π before running it. We can’t use S(S(π)) to check, because we’d first need to check that computing S(S(π)) is itself outside Π…
(Note that a program that implements a sufficiently complex safety predicate S, when executed with another program π as input, might instantiate a person even if just running π directly would not!)
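The regress described in this comment can be made concrete (a toy sketch, not a real safety predicate): if “apply S to π” is itself a program that must be vetted before it may be run, the vetting never bottoms out unless some checker is trusted axiomatically.

```python
import sys

def vet_then_run(program, vet):
    """To run `program` safely, first vet it -- but running the vetter on
    `program` is itself running a program, which must be vetted first."""
    checker = lambda: vet(program)      # "apply S to `program`" is a program too
    vet_then_run(checker, vet)          # regress: vet the checker before using it
    return program()                    # never reached

old_limit = sys.getrecursionlimit()
sys.setrecursionlimit(100)              # keep the inevitable blowup small
try:
    vet_then_run(lambda: "result", vet=lambda p: True)
except RecursionError:
    print("the vetting regress never bottoms out")
finally:
    sys.setrecursionlimit(old_limit)
```

Cutting the regress requires a base case that is trusted without being checked, which is exactly the move the comment is pointing out as problematic.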
Trying to use what?
Eliezer goes back and forth between “sapient” and “sentient”, which are not synonyms. Neither is obviously a justification for claiming moral status as an agent.
It is important either to state clearly what one presumes gives an agent moral status (and hence what constitutes mindcrime), or to change each occurrence of “sapient”, “sentient”, or “personhood” to all use the same word. I recommend stating the general case using personhood(X), a function to be supplied by the user and not defined here. Addressing the problem depends critically on what that function is, but the statement of the general case shouldn’t be bound up with the choice of personhood predicate.
Choosing either “sapient” or “sentient” is problematic: “sentient” because it includes at least all mammals, and “sapient” because it really just means “intelligent”, and the AI is going to be equally intelligent (defined as problem-solving or optimizing ability) whether it simulates humans or not. If intelligence grants moral standing (as it seems to here), and mindcrime means trapping an agent with moral standing in the AI’s world, then the construction of any AI is inherently mindcrime.