# Conservative concept boundary

The problem of conservatism is to draw a boundary around positive instances of a concept which is not only simple but also classifies as few instances as possible as positive.

# Introduction / basic idea / motivation

Suppose I have a numerical concept in mind, and you query me on the following numbers to determine whether they’re instances of the concept, and I reply as follows:

• 3: Yes

• 4: No

• 5: Yes

• 13: Yes

• 14: No

• 19: Yes

• 28: No

A simple category which covers this training set is “All odd numbers.”

A simple and conservative category which covers this training set is “All odd numbers between 3 and 19.”

A slightly more complicated, and even more conservative category, is “All prime numbers between 3 and 19.”

A conservative but not simple category is “Only 3, 5, 13, and 19 are positive instances of this category.”
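As a rough illustration, the four candidate categories above can be written as predicates and checked against the seven labeled queries. This is only a sketch: the names `TRAINING`, `CONCEPTS`, `fits_training`, and `positives`, and the test range of −50 to 50, are assumptions made for illustration and are not part of the original example.

```python
# Labeled queries from the example above: number -> is it an instance?
TRAINING = {3: True, 4: False, 5: True, 13: True, 14: False, 19: True, 28: False}


def is_prime(n):
    """Trial-division primality test, adequate for small integers."""
    if n < 2:
        return False
    return all(n % d for d in range(2, int(n ** 0.5) + 1))


# The four candidate categories, ordered from broadest to narrowest.
CONCEPTS = {
    "all odd numbers": lambda n: n % 2 == 1,
    "odd numbers between 3 and 19": lambda n: 3 <= n <= 19 and n % 2 == 1,
    "prime numbers between 3 and 19": lambda n: 3 <= n <= 19 and is_prime(n),
    "only 3, 5, 13, and 19": lambda n: n in {3, 5, 13, 19},
}


def fits_training(concept):
    """True if the concept agrees with every labeled query."""
    return all(concept(n) == label for n, label in TRAINING.items())


def positives(concept, lo=-50, hi=50):
    """All integers in [lo, hi] (an arbitrary test range) classified positive."""
    return [n for n in range(lo, hi + 1) if concept(n)]


if __name__ == "__main__":
    for name, concept in CONCEPTS.items():
        print(name, fits_training(concept), len(positives(concept)))
```

All four predicates agree with the training set, so the labels alone cannot choose between them. On the test range they classify 50, 9, 7, and 4 integers as positive respectively, which is the sense in which each is more conservative than the one before it.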

One of the (very) early proposals for value alignment was to train an AI on smiling faces as examples of the sort of outcome the AI ought to achieve. Slightly steelmanning the proposal so that it doesn’t just produce images of smiling faces as the AI’s sensory data, we can imagine that the AI is trying to learn a boundary over the causes of its sensory data that distinguishes smiling faces within the environment.

The classic example of what might go wrong with this alignment protocol is that all matter within reach might end up turned into tiny molecular smiley faces, since heavy optimization pressure would pick out an extreme edge of the simple category that could be fulfilled as maximally as possible, and it’s possible to make many more tiny molecular smileyfaces than complete smiling faces.

That is: The AI would by default learn the simplest concept that distinguished smiling faces from non-smileyfaces within its training cases. Given a wider set of options than existed in the training regime, this simple concept might also classify as a ‘smiling face’ something that had the properties singled out by the concept, but was unlike the training cases with respect to other properties. This is the metaphorical equivalent of learning the concept “All odd numbers”, and then positively classifying cases like −1 or 9^999 that are unlike 3 and 19 in other regards, since they’re still odd.

On the other hand, suppose the AI had been told to learn a simple and conservative concept over its training data. Then the corresponding goal might demand, e.g., only smiles that came attached to actual human heads experiencing pleasure. If the AI were moreover a conservative planner, it might try to produce smiles only through causal chains that resembled existing causal generators of smiles, such as only administering existing drugs like heroin and not inventing any new drugs, and only breeding humans through pregnancy rather than synthesizing living heads using nanotechnology.

You couldn’t call this a solution to the value alignment problem, but it would—arguendo—get significantly closer to the intended goal than tiny molecular smileyfaces. Thus, conservatism might serve as one component among others for aligning a Task AGI.

Intuitively speaking: A genie is hardly rendered safe if it tries to fulfill your wish using ‘normal’ instances of the stated goal that were generated in relatively more ‘normal’ ways, but it’s at least closer to being safe. Conservative concepts and conservative planning might be one attribute among others of a safe genie.

# Burrito problem

The burrito problem is to have a Task AGI make a burrito that is actually a burrito, not just something that looks like a burrito, and that is not poisonous and is actually safe for humans to eat.

Conservatism is one possible approach to the burrito problem: Show the AGI five burritos and five non-burritos. Then, don’t have the AGI learn the simplest concept that distinguishes burritos from non-burritos and then create something that is maximally a burrito under this concept. Instead, we’d like the AGI to learn a simple and narrow concept that classifies these five things as burritos according to some simple-ish rule which labels as few objects as possible as burritos. But not the rule, “Only these five exact molecular configurations count as burritos”, because that rule would not be simple.

The concept must still be broad enough to permit the construction of a sixth burrito that is not molecularly identical to any of the first five. But it must not be so broad that the burrito can include botulinum toxin (because, hey, anything made out of mostly carbon-hydrogen-oxygen-nitrogen ought to be fine, and the five negative examples didn’t include anything with botulinum toxin).

The hope is that via conservatism we can avoid needing to think of every possible way that our training data might not properly stabilize the ‘simplest explanation’ along every dimension of potentially fatal variance. If we’re trying to only draw simple boundaries that separate the positive and negative cases, there’s no reason for the AI to add on a “cannot be poisonous” codicil to the rule unless the AI has seen poisoned burritos labeled as negative cases, so that the slightly more complicated rule “but not poisonous” needs to be added to the boundary in order to separate out cases that would otherwise be classified positive.

But then maybe even if we show the AGI one burrito poisoned with botulinum, it doesn’t learn to avoid burritos poisoned with ricin, and even if we show it botulinum and ricin, it doesn’t learn to avoid burritos poisoned with the radioactive iodine-131 isotope. Rather than our needing to think of what the concept boundary needs to look like and including enough negative cases to force the simplest boundary to exclude all the unsafe burritos, the hope is that via conservatism we can shift some of the workload to showing the AI positive examples which happen not to be poisonous or have any other problems.
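To make the contrast concrete, here is a toy sketch in Python. Everything in it is an assumption made for illustration: the three features (mass, tortilla fraction, toxin level), the example values, and the choice of an axis-aligned box around the positive examples as a stand-in for a "simple and narrow" concept. It is not a proposal for how an AGI would actually represent burritos.

```python
# Hypothetical features: (mass in grams, tortilla fraction, toxin in ppm).
POSITIVES = [
    (300, 0.20, 0.0), (320, 0.22, 0.0), (280, 0.18, 0.0),
    (310, 0.25, 0.0), (290, 0.21, 0.0),
]
# None of the negative examples happens to be poisoned.
NEGATIVES = [
    (300, 0.00, 0.0), (5, 0.00, 0.0), (1000, 0.00, 0.0),
    (150, 0.02, 0.0), (320, 0.05, 0.0),
]


def simple_rule(x):
    """Simplest boundary here: one threshold on one feature separates the data."""
    return x[1] >= 0.1


def conservative_rule(x, margin=0.1):
    """Accept only points inside a slightly widened box around the positives.

    Each feature must lie within the range seen in POSITIVES, padded by
    `margin` times that range, so novel instances are allowed but only
    if they resemble the positive examples along every dimension.
    """
    for i, value in enumerate(x):
        lo = min(p[i] for p in POSITIVES)
        hi = max(p[i] for p in POSITIVES)
        pad = margin * (hi - lo)
        if not (lo - pad <= value <= hi + pad):
            return False
    return True


sixth_burrito = (305, 0.23, 0.0)      # new, but similar to the positives
poisoned_burrito = (305, 0.23, 50.0)  # identical except for the toxin

if __name__ == "__main__":
    for rule in (simple_rule, conservative_rule):
        print(rule.__name__, rule(sixth_burrito), rule(poisoned_burrito))
```

Both rules fit the ten training cases and accept the sixth burrito. Only the box rule rejects the poisoned one, and it does so without having seen any poisoned negative example, because every positive example had a toxin level of zero.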

# Conservatism over the causes of sensed training cases

Conservatism in AGI cases seems like it would need to be interpreted over the causes of sensory data, rather than the sensory data itself. We’re not looking for a conservative concept about which images of a burrito would be classified as positive; we want a concept over which environmental burritos would be classified as positive. Two burrito candidates can cause identical images while differing in their poisonousness, so we want to draw our conservative concept boundary around (our model of) the causes of past sensory events in our training cases, not draw a boundary around the sensory events themselves.

# Conservative planning

A conservative strategy or conservative plan would ceteris paribus prefer to construct burritos by buying ingredients from the store and cooking them, rather than building nanomachinery that constructs a burrito, because this would be more characteristic of how burritos are usually constructed, or more similar to the elements of previously approved plans. Again, this seems like it might be less likely to generate a poisonous burrito.

Another paradigmatic example of conservatism might be, inside some game engine, to show the AI some human players running around, and then give the AI an agent that has the goal of, e.g., moving a box to the end of the room. If the AI is given the ability to fly, but generates a plan in which the box-moving agent only moves around on the ground because that’s what the training examples did, then this is a conservative plan.

The point of this isn’t to cripple the AI’s abilities. The point is that if, e.g., your low impact measure has a loophole and the AI generates a plan to turn all matter within reach into pink-painted cars, some steps of this plan, like “disassemble stars to make more cars and paint”, are likely to be non-conservative and hence not happen automatically.

## Flagging non-conservative plan steps

If a non-conservative plan seems better along other important dimensions—for example, there is no other plan that has an equally low impact and equally few side effects compared to just synthesizing the burrito using a nanomachine—then we can also imagine that the critical step might be flagged as non-conservative and presented to the user for checking.

That is, on ‘conservative’ planning, we’re interested in both the problem “generate a plan and then flag and report non-conservative steps” as well as the problem “try to generate a plan that has few or no non-conservative steps”.
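The two problems can be sketched side by side in Python. This is a deliberately crude illustration: plans are modeled as lists of action names, and a step counts as "conservative" only if its action appears in a set of previously approved actions. The action names and the functions `flag_non_conservative` and `most_conservative` are hypothetical, and a real notion of similarity to approved plans would need to be far richer than set membership.

```python
# Hypothetical whitelist of actions seen in previously approved plans.
APPROVED_ACTIONS = {"buy_ingredients", "cook", "assemble", "serve"}


def flag_non_conservative(plan, approved=APPROVED_ACTIONS):
    """Problem 1: return (index, step) pairs to report to the user for checking."""
    return [(i, step) for i, step in enumerate(plan) if step not in approved]


def most_conservative(plans, approved=APPROVED_ACTIONS):
    """Problem 2: prefer the candidate plan with the fewest flagged steps."""
    return min(plans, key=lambda plan: len(flag_non_conservative(plan, approved)))


kitchen_plan = ["buy_ingredients", "cook", "assemble", "serve"]
nano_plan = ["build_nanomachinery", "synthesize_burrito", "serve"]

if __name__ == "__main__":
    print(flag_non_conservative(nano_plan))
    print(most_conservative([nano_plan, kitchen_plan]))
```

Here the nanomachinery plan has its first two steps flagged, while the kitchen plan has none and is the one selected when both are candidates.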

# Role in other safety problems

Conservatism and conservative planning seem like they might directly tackle some standard concerns head-on, in a sufficiently basic way to avoid loopholes, and might also be subject to those concerns. E.g.:

• Edge instantiation—if in full generality we don’t go to the edge of the graph but try to stay in the center of what’s already been positively classified, maybe we can avoid this.

• Unforeseen maximum—if we stick to things very similar to already-positively-classified instances, we won’t automatically go into the unimagined parts of the graph.

• Context disaster—a sufficiently conservative optimizer might go on using options similar to previously whitelisted ones even if large new sections of planning space opened up.

However, to the extent we rely on conservatism to prevent any of these things from happening, it’s a mission-critical component that itself has to be advanced-safe with no loopholes. If a ‘conservatism’ constraint is being applied to very powerful optimization pressures, we need to worry about this seeking out any loophole in what is ‘conservative’. It might be that the central notion of ‘conservatism’ is simple enough to have no loopholes. But it’s also possible that even a simplish and conservative concept would still include some dangerous instances, if there’s enough optimization pressure seeking out a maximal-under-some-criterion instance within everything that’s been classified conservatively.

Two possible meta-approaches to making conservatism even safer:

• Use conservatism to flag non-conservative steps in plans, or expected non-conservative instances of goal achievements, and refer these for user checking before taking action. (Rather than automatically generating a plan containing only ‘conservative’ steps.) This would have the standard problems with user checking.

• Have a definition of conservatism, relative to the AI’s current world-model and conceptual language, which would automatically catch as ‘exceptional’ (hence not conservative) anything which had the weird property of being the only first-order-conservative instance of a concept that had some other special property being sought out by the optimization pressure. This might involve weird reflective problems, such as any planned event being special in virtue of the AI having planned it.


• Would it be fair to summarize the idea of a conservative concept boundary as a classifier that avoids false positives while remaining simple?

• Well, the purpose is to avoid the AGI classifying potential goal fulfillments in a way that, from the user’s perspective, is a “false positive”. The reason why we have to spend a lot of time thinking about really, really good ways to have the AGI not guess positive labels on things that we wouldn’t label as positive, is that the training data we present to the AI may be ambiguous in some way we don’t know about, or many ways we don’t know about. Meaning that the AI does not actually have the information to figure out what we meant by looking for the simplest ways to classify the training cases, and instead has to do something that’s very very similar to the positively labeled training instances to minimize the probability of screwing up.

I’m pushing back a little on this “classifier that avoids false positives” description because that’s what every classifier is in some sense intended to do; you have to be specific about how, or what approach you’re taking, in order to say something that means more than just “classifier that is a good classifier”.

• I’m pushing back a little on this “classifier that avoids false positives” description because that’s what every classifier is in some sense intended to do

Well, presumably there’s a trade-off between avoiding false positives and avoiding false negatives. And you want a classifier that tries really hard to avoid false positives, as I understand it.

• Suppose there are existing generic techniques for developing classifiers that prioritize avoiding false positives over avoiding false negatives—would you not expect them to find a “conservative concept boundary” by default?

• It seems that classifiers trained on adversarial examples may be finding (more) conservative concept boundaries:

We also found that the weights of the learned model changed significantly, with the weights of the adversarially trained model being significantly more localized and interpretable