# Conservative concept boundary

The problem of conservatism is to draw a boundary around the positive instances of a concept such that the boundary is not only simple, but also classifies as few instances as possible as positive.

# Introduction / basic idea / motivation

Sup­pose I have a nu­mer­i­cal con­cept in mind, and you query me on the fol­low­ing num­bers to de­ter­mine whether they’re in­stances of the con­cept, and I re­ply as fol­lows:

• 3: Yes

• 4: No

• 5: Yes

• 13: Yes

• 14: No

• 19: Yes

• 28: No

A sim­ple cat­e­gory which cov­ers this train­ing set is “All odd num­bers.”

A sim­ple and con­ser­va­tive cat­e­gory which cov­ers this train­ing set is “All odd num­bers be­tween 3 and 19.”

A slightly more com­pli­cated, and even more con­ser­va­tive cat­e­gory, is “All prime num­bers be­tween 3 and 19.”

A con­ser­va­tive but not sim­ple cat­e­gory is “Only 3, 5, 13, and 19 are pos­i­tive in­stances of this cat­e­gory.”
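To make the comparison concrete, here is a minimal sketch that checks each of the four candidate categories against the training set and counts how many numbers each one labels positive. The finite test range `UNIVERSE` is an assumption made purely for illustration, and the sketch does not attempt to measure simplicity (description length) at all.

```python
# Labeled queries from the example above: number -> is it an instance?
TRAINING = {3: True, 4: False, 5: True, 13: True, 14: False, 19: True, 28: False}


def is_prime(n):
    """Trial-division primality test; fine for small illustrative inputs."""
    if n < 2:
        return False
    return all(n % d for d in range(2, int(n ** 0.5) + 1))


# Candidate concepts as (description, predicate) pairs.
CANDIDATES = [
    ("all odd numbers", lambda n: n % 2 == 1),
    ("odd numbers between 3 and 19", lambda n: n % 2 == 1 and 3 <= n <= 19),
    ("prime numbers between 3 and 19", lambda n: is_prime(n) and 3 <= n <= 19),
    ("only 3, 5, 13, and 19", lambda n: n in {3, 5, 13, 19}),
]

# Assumption: a finite range standing in for "all the numbers we might be asked about".
UNIVERSE = range(-1000, 1001)


def covers_training_set(predicate):
    """True if the predicate agrees with every labeled query."""
    return all(predicate(n) == label for n, label in TRAINING.items())


def count_positives(predicate, universe=UNIVERSE):
    """How many members of the universe the predicate labels positive."""
    return sum(1 for n in universe if predicate(n))


def summarize():
    """Return (description, covers training set?, positives in UNIVERSE) rows."""
    return [
        (description, covers_training_set(predicate), count_positives(predicate))
        for description, predicate in CANDIDATES
    ]


if __name__ == "__main__":
    for row in summarize():
        print(row)
```

All four candidates fit the training data equally well; they differ only in how much of the untested range they claim. Over this range the counts are 1000, 9, 7, and 4 respectively, which is the sense in which each successive category is more conservative.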

One of the (very) early pro­pos­als for value al­ign­ment was to train an AI on smil­ing faces as ex­am­ples of the sort of out­come the AI ought to achieve. Slightly steel­man­ning the pro­posal so that it doesn’t just pro­duce images of smil­ing faces as the AI’s sen­sory data, we can imag­ine that the AI is try­ing to learn a bound­ary over the causes of its sen­sory data that dis­t­in­guishes smil­ing faces within the en­vi­ron­ment.

The clas­sic ex­am­ple of what might go wrong with this al­ign­ment pro­to­col is that all mat­ter within reach might end up turned into tiny molec­u­lar smiley faces, since heavy op­ti­miza­tion pres­sure would pick out an ex­treme edge of the sim­ple cat­e­gory that could be fulfilled as max­i­mally as pos­si­ble, and it’s pos­si­ble to make many more tiny molec­u­lar smiley­faces than com­plete smil­ing faces.

That is: The AI would by de­fault learn the sim­plest con­cept that dis­t­in­guished smil­ing faces from non-smiley­faces within its train­ing cases. Given a wider set of op­tions than ex­isted in the train­ing regime, this sim­ple con­cept might also clas­sify as a ‘smil­ing face’ some­thing that had the prop­er­ties sin­gled out by the con­cept, but was un­like the train­ing cases with re­spect to other prop­er­ties. This is the metaphor­i­cal equiv­a­lent of learn­ing the con­cept “All odd num­bers”, and then pos­i­tively clas­sify­ing cases like −1 or 9^999 that are un­like 3 and 19 in other re­gards, since they’re still odd.

On the other hand, sup­pose the AI had been told to learn a sim­ple and con­ser­va­tive con­cept over its train­ing data. Then the cor­re­spond­ing goal might de­mand, e.g., only smiles that came at­tached to ac­tual hu­man heads ex­pe­rienc­ing plea­sure. If the AI were more­over a con­ser­va­tive plan­ner, it might try to pro­duce smiles only through causal chains that re­sem­bled ex­ist­ing causal gen­er­a­tors of smiles, such as only ad­minis­ter­ing ex­ist­ing drugs like heroin and not in­vent­ing any new drugs, and only breed­ing hu­mans through preg­nancy rather than syn­the­siz­ing liv­ing heads us­ing nan­otech­nol­ogy.

You couldn’t call this a solu­tion to the value al­ign­ment prob­lem, but it would—ar­guendo—get sig­nifi­cantly closer to the in­tended goal than tiny molec­u­lar smiley­faces. Thus, con­ser­vatism might serve as one com­po­nent among oth­ers for al­ign­ing a Task AGI.

In­tu­itively speak­ing: A ge­nie is hardly ren­dered safe if it tries to fulfill your wish us­ing ‘nor­mal’ in­stances of the stated goal that were gen­er­ated in rel­a­tively more ‘nor­mal’ ways, but it’s at least closer to be­ing safe. Con­ser­va­tive con­cepts and con­ser­va­tive plan­ning might be one at­tribute among oth­ers of a safe ge­nie.

# Bur­rito problem

The burrito problem is to have a Task AGI make a burrito that is actually a burrito, not just something that looks like a burrito, and that is not poisonous and is actually safe for humans to eat.

Con­ser­vatism is one pos­si­ble ap­proach to the bur­rito prob­lem: Show the AGI five bur­ri­tos and five non-bur­ri­tos. Then, don’t have the AGI learn the sim­plest con­cept that dis­t­in­guishes bur­ri­tos from non-bur­ri­tos and then cre­ate some­thing that is max­i­mally a bur­rito un­der this con­cept. In­stead, we’d like the AGI to learn a sim­ple and nar­row con­cept that clas­sifies these five things as bur­ri­tos ac­cord­ing to some sim­ple-ish rule which la­bels as few ob­jects as pos­si­ble as bur­ri­tos. But not the rule, “Only these five ex­act molec­u­lar con­figu­ra­tions count as bur­ri­tos”, be­cause that rule would not be sim­ple.

The concept must still be broad enough to permit the construction of a sixth burrito that is not molecularly identical to any of the first five. But it must not be so broad that the burrito could include botulinum toxin (because, hey, anything made out of mostly carbon-hydrogen-oxygen-nitrogen ought to be fine, and the five negative examples didn’t include anything with botulinum toxin).

The hope is that via conservatism we can avoid needing to think of every possible way that our training data might fail to stabilize the ‘simplest explanation’ along every dimension of potentially fatal variance. If we’re only trying to draw simple boundaries that separate the positive and negative cases, there’s no reason for the AI to add a “cannot be poisonous” codicil to the rule unless the AI has seen poisoned burritos labeled as negative cases, so that the slightly more complicated rule “but not poisonous” needs to be added to the boundary in order to separate out cases that would otherwise be classified positive. But then maybe even if we show the AGI one burrito poisoned with botulinum, it doesn’t learn to avoid burritos poisoned with ricin, and even if we show it botulinum and ricin, it doesn’t learn to avoid burritos poisoned with the radioactive isotope iodine-131. Rather than our needing to think of what the concept boundary needs to look like and including enough negative cases to force the simplest boundary to exclude all the unsafe burritos, the hope is that via conservatism we can shift some of the workload to showing the AI positive examples which happen not to be poisonous or have any other problems.
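One very crude way to picture this shift of workload is a classifier that accepts a candidate only if every feature falls within the range spanned by the positive examples. The sketch below is an illustration under strong assumptions, not a proposal: the three features (`mass_g`, `chon_fraction`, `toxin_ug`) and all the numbers are hypothetical, and an axis-aligned box is just one simple-ish narrow rule among many.

```python
# Hypothetical feature vectors for five approved burritos:
# (mass_g, chon_fraction, toxin_ug). All values are made up for illustration.
POSITIVE_EXAMPLES = [
    (310.0, 0.96, 0.0),
    (280.0, 0.95, 0.0),
    (295.0, 0.97, 0.0),
    (330.0, 0.94, 0.0),
    (300.0, 0.96, 0.0),
]


def fit_box(examples, margin=0.05):
    """Return per-feature (low, high) bounds around the positive examples.

    Each bound is widened by `margin` times the observed range, so that a
    sixth burrito need not be identical to any of the first five.
    """
    bounds = []
    for values in zip(*examples):
        low, high = min(values), max(values)
        pad = margin * (high - low)
        bounds.append((low - pad, high + pad))
    return bounds


def is_conservative_positive(candidate, bounds):
    """Accept only candidates inside the box on every feature."""
    return all(low <= x <= high for x, (low, high) in zip(candidate, bounds))


if __name__ == "__main__":
    box = fit_box(POSITIVE_EXAMPLES)
    print(is_conservative_positive((305.0, 0.955, 0.0), box))  # a new, ordinary burrito
    print(is_conservative_positive((305.0, 0.955, 0.5), box))  # same burrito plus toxin
```

No poisoned burrito was ever shown as a negative case, yet the toxin-laced candidate is rejected, because none of the positive examples varied along that dimension. The obvious limitation is that this only helps for dimensions the AI's model actually represents.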

# Conservatism over the causes of sensed training cases

Conservatism in AGI cases seems like it would need to be interpreted over the causes of sensory data, rather than the sensory data itself. We’re not looking for a conservative concept about which images of a burrito would be classified as positive; we want a concept over which environmental burritos would be classified as positive. Two burrito candidates can cause identical images while differing in their poisonousness, so we want to draw our conservative concept boundary around (our model of) the causes of past sensory events in our training cases, not around the sensory events themselves.

# Con­ser­va­tive planning

A con­ser­va­tive strat­egy or con­ser­va­tive plan would ce­teris paribus pre­fer to con­struct bur­ri­tos by buy­ing in­gre­di­ents from the store and cook­ing them, rather than build­ing nanoma­chin­ery that con­structs a bur­rito, be­cause this would be more char­ac­ter­is­tic of how bur­ri­tos are usu­ally con­structed, or more similar to the el­e­ments of pre­vi­ously ap­proved plans. Again, this seems like it might be less likely to gen­er­ate a poi­sonous bur­rito.

Another paradigmatic example of conservatism might be to show the AI, inside some game engine, some human players running around, and then give the AI an agent whose goal is, e.g., to move a box to the end of the room. If the AI is given the ability to fly, but generates a plan in which the box-moving agent only moves around on the ground because that’s what the training examples did, then this is a conservative plan.

The point of this isn’t to cripple the AI’s abilities. The point is that if, e.g., your low impact measure has a loophole and the AI generates a plan to turn all matter within reach into pink-painted cars, some steps of this plan like “disassemble stars to make more cars and paint” are likely to be non-conservative, and hence won’t happen automatically.

## Flag­ging non-con­ser­va­tive plan steps

If a non-con­ser­va­tive plan seems bet­ter along other im­por­tant di­men­sions—for ex­am­ple, there is no other plan that has an equally low im­pact and equally few side effects com­pared to just syn­the­siz­ing the bur­rito us­ing a nanoma­chine—then we can also imag­ine that the crit­i­cal step might be flagged as non-con­ser­va­tive and pre­sented to the user for check­ing.

That is, on ‘con­ser­va­tive’ plan­ning, we’re in­ter­ested in both the prob­lem “gen­er­ate a plan and then flag and re­port non-con­ser­va­tive steps” as well as the prob­lem “try to gen­er­ate a plan that has few or no non-con­ser­va­tive steps”.
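The two problems can be sketched side by side. In this toy version a plan is a list of step names, and a step counts as conservative only if it exactly matches a previously approved step; exact matching is an assumed stand-in for whatever similarity measure a real system would need, and the step names and both plans are hypothetical.

```python
# Hypothetical whitelist of steps seen in previously approved plans.
APPROVED_STEPS = {
    "buy ingredients",
    "cook filling",
    "warm tortilla",
    "wrap burrito",
}

KITCHEN_PLAN = ["buy ingredients", "cook filling", "warm tortilla", "wrap burrito"]
NANOTECH_PLAN = ["build nanomachinery", "assemble burrito atom by atom", "wrap burrito"]


def flag_non_conservative(plan, approved=APPROVED_STEPS):
    """Problem 1: given a plan, report the steps that are not conservative."""
    return [step for step in plan if step not in approved]


def most_conservative(plans, approved=APPROVED_STEPS):
    """Problem 2: among candidate plans, prefer the one with the fewest flagged steps."""
    return min(plans, key=lambda plan: len(flag_non_conservative(plan, approved)))


if __name__ == "__main__":
    # Steps that would be referred to the user for checking.
    print(flag_non_conservative(NANOTECH_PLAN))
    # The plan a conservative planner would pick, all else being equal.
    print(most_conservative([NANOTECH_PLAN, KITCHEN_PLAN]))
```

The first function leaves the choice of plan to other criteria and only surfaces the unusual steps to the user; the second bakes the preference into plan selection itself.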

# Role in other safety problems

Conservatism and conservative planning seem like they might tackle some standard concerns head-on, in a sufficiently basic way to avoid loopholes, and might also themselves be subject to those concerns. E.g.:

• Edge instantiation: if in full generality we don’t go to the edge of the graph but try to stay in the center of what’s already been positively classified, maybe we can avoid this.

• Unforeseen maximum: if we stick to things very similar to already-positively-classified instances, we won’t automatically go into the unimagined parts of the graph.

• Context disaster: a sufficiently conservative optimizer might go on using options similar to previously whitelisted ones even if large new sections of planning space opened up.

However, to the extent we rely on conservatism to prevent any of these things from happening, it’s a mission-critical component that itself has to be advanced-safe with no loopholes. If a ‘conservatism’ constraint is being applied to very powerful optimization pressures, we need to worry about those pressures seeking out any loophole in what counts as ‘conservative’. It might be that the central notion of ‘conservatism’ is simple enough to have no loopholes. But it’s also possible that even a simplish and conservative concept would still include some dangerous instances, if there’s enough optimization pressure seeking out a maximal-under-some-criterion instance within everything that’s been classified conservatively.

Two pos­si­ble meta-ap­proaches to mak­ing con­ser­vatism even safer:

• Use con­ser­vatism to flag non-con­ser­va­tive steps in plans, or ex­pected non-con­ser­va­tive in­stances of goal achieve­ments, and re­fer these for user check­ing be­fore tak­ing ac­tion. (Rather than au­to­mat­i­cally gen­er­at­ing a plan con­tain­ing only ‘con­ser­va­tive’ steps.) This would have the stan­dard prob­lems with user check­ing.

• Have a defi­ni­tion of con­ser­vatism, rel­a­tive to the AI’s cur­rent world-model and con­cep­tual lan­guage, which would au­to­mat­i­cally catch as ‘ex­cep­tional’ (hence not con­ser­va­tive) any­thing which had the weird prop­erty of be­ing the only first-or­der-con­ser­va­tive in­stance of a con­cept that had some other spe­cial prop­erty be­ing sought out by the op­ti­miza­tion pres­sure. This might in­volve weird re­flec­tive prob­lems, such as any planned event be­ing spe­cial in virtue of the AI hav­ing planned it.


• Would it be fair to sum­ma­rize the idea of a con­ser­va­tive con­cept bound­ary as a clas­sifier that avoids false pos­i­tives while re­main­ing sim­ple?

• Well, the pur­pose is to avoid the AGI clas­sify­ing po­ten­tial goal fulfill­ments in a way that, from the user’s per­spec­tive, is a “false pos­i­tive”. The rea­son why we have to spend a lot of time think­ing about re­ally, re­ally good ways to have the AGI not guess pos­i­tive la­bels on things that we wouldn’t la­bel as pos­i­tive, is that the train­ing data we pre­sent to the AI may be am­bigu­ous in some way we don’t know about, or many ways we don’t know about. Mean­ing that the AI does not ac­tu­ally have the in­for­ma­tion to figure out what we meant by look­ing for the sim­plest ways to clas­sify the train­ing cases, and in­stead has to do some­thing that’s very very similar to the pos­i­tively la­beled train­ing in­stances to min­i­mize the prob­a­bil­ity of screw­ing up.

I’m push­ing back a lit­tle on this “clas­sifier that avoids false pos­i­tives” de­scrip­tion be­cause that’s what ev­ery clas­sifier is in some sense in­tended to do; you have to be spe­cific about how, or what ap­proach you’re tak­ing, in or­der to say some­thing that means more than just “clas­sifier that is a good clas­sifier”.

• I’m push­ing back a lit­tle on this “clas­sifier that avoids false pos­i­tives” de­scrip­tion be­cause that’s what ev­ery clas­sifier is in some sense in­tended to do

Well, presumably there’s a trade-off between avoiding false positives and avoiding false negatives. And, as I understand it, you want a classifier that tries really hard to avoid false positives.

• Sup­pose there are ex­ist­ing generic tech­niques for de­vel­op­ing clas­sifiers that pri­ori­tize avoid­ing false pos­i­tives over avoid­ing false nega­tives—would you not ex­pect them to find a “con­ser­va­tive con­cept bound­ary” by de­fault?

• It seems that clas­sifiers trained on ad­ver­sar­ial ex­am­ples may be find­ing (more) con­ser­va­tive con­cept bound­aries:

“We also found that the weights of the learned model changed significantly, with the weights of the adversarially trained model being significantly more localized and interpretable” (from “Explaining and Harnessing Adversarial Examples”)

• As Eric and EY jointly point out, this article seems to be roughly pointing at a simple classifier that places a big penalty on false positives, e.g.:

$$\text{loss} = 100\,(1-\lambda)\cdot\text{false\_positive\_rate} + (1-\lambda)\cdot\text{false\_negative\_rate} + \lambda\cdot\text{regularization}$$

After all, the pur­pose of reg­u­lariza­tion is to en­sure sim­plic­ity.

To the extent that conservative concepts are at all different, the difference should run through the notion of ambiguity detection and KWIK learning. At least that’s what machine learning people will round the proposal off to until they have some other concrete proposals. Though maybe I’m missing something.
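For concreteness, the loss suggested in this comment can be written out as a small function. This is only a sketch of that one formula: the function name, the default `lam` of 0.1, and passing the regularization term in as a precomputed number are all assumptions made for illustration, while the factor of 100 comes from the comment itself.

```python
def weighted_loss(labels, predictions, regularization, lam=0.1, fp_weight=100.0):
    """Loss that penalizes false positives fp_weight times more than false negatives.

    labels, predictions: equal-length sequences of booleans.
    regularization: a precomputed complexity penalty for the classifier.
    lam: trade-off between fitting the data and staying simple (lambda above).
    """
    positives = sum(1 for y in labels if y)
    negatives = sum(1 for y in labels if not y)
    false_positives = sum(1 for y, p in zip(labels, predictions) if p and not y)
    false_negatives = sum(1 for y, p in zip(labels, predictions) if y and not p)
    # Rates are taken over the true negatives and true positives respectively.
    fp_rate = false_positives / negatives if negatives else 0.0
    fn_rate = false_negatives / positives if positives else 0.0
    return (
        fp_weight * (1 - lam) * fp_rate
        + (1 - lam) * fn_rate
        + lam * regularization
    )


if __name__ == "__main__":
    labels = [True, True, False, False]
    one_false_positive = [True, True, True, False]
    one_false_negative = [True, False, False, False]
    print(weighted_loss(labels, one_false_positive, regularization=0.0))
    print(weighted_loss(labels, one_false_negative, regularization=0.0))
```

With the regularization term held at zero, a single false positive costs a hundred times as much as a single false negative on this toy data, which is the whole content of the proposal as stated here.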