Goal-concept identification

The problem of communicating to an AGI an intended goal concept on the order of "give me a strawberry, and not a fake plastic strawberry either".

At this level of the problem, we're not concerned with e.g. larger problems of safe plan identification such as not mugging people for strawberries, or minimizing side effects. We're not (at this level of the problem) concerned with identifying each and every one of the components of human value, as they might be impacted by side effects more distant in the causal graph. We're not concerned with philosophical uncertainty about what we should mean by "strawberry". We suppose that in an intuitive sense, we do have a pretty good idea of what we intend by "strawberry", such that there are things that are definitely strawberries, and we're pretty happy with our sense of that so long as nobody is deliberately trying to fool it.

We just want to communicate a local goal concept that distinguishes edible strawberries from plastic strawberries, or nontoxic strawberries from poisonous strawberries. That is: we want to say "strawberry" in an understandable way that's suitable for fulfilling a task of "just give Sally a strawberry", possibly in conjunction with other features like conservatism or low impact or mild optimization.

For some open subproblems of the obvious approach that goes through showing actual strawberries to the AI's webcam, see "Identifying causal goal concepts from sensory data" and "Look where I'm pointing, not at my finger".
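As a toy sketch of why the webcam approach is an open problem (all names and features here are hypothetical, chosen only for illustration): a concept learned purely from sensory features cannot separate a real strawberry from a plastic one that produces identical observations, whereas the intended concept also depends on latent causal structure like edibility.

```python
from dataclasses import dataclass

@dataclass
class Thing:
    # Sensory features: what a webcam-level classifier can see.
    color: str
    shape: str
    # Latent causal feature: not visible in the sensory data.
    sugar_content: float

def sensory_strawberry(obj: Thing) -> bool:
    """A concept trained only on webcam-level features."""
    return obj.color == "red" and obj.shape == "strawberry-like"

def intended_strawberry(obj: Thing) -> bool:
    """The intended concept also depends on the latent cause of those features."""
    return sensory_strawberry(obj) and obj.sugar_content > 0.0

real = Thing(color="red", shape="strawberry-like", sugar_content=4.9)
fake = Thing(color="red", shape="strawberry-like", sugar_content=0.0)

# The sensory concept conflates the two objects; the intended concept separates them.
print(sensory_strawberry(real), sensory_strawberry(fake))   # True True
print(intended_strawberry(real), intended_strawberry(fake)) # True False
```

The gap between the two classifiers is exactly the gap the cited subproblems address: training data of genuine strawberries shown to a webcam only pins down `sensory_strawberry`, not `intended_strawberry`.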