Goal-concept identification

The problem of figuring out how to communicate to an AGI an intended goal concept on the order of “give me a strawberry, and not a fake plastic strawberry either”.

At this level of the problem, we’re not concerned with larger problems of safe plan identification, such as not mugging people for strawberries or minimizing side effects. We’re not (at this level of the problem) concerned with identifying each and every one of the components of human value, as they might be impacted by side effects more distant in the causal graph. We’re not concerned with philosophical uncertainty about what we should mean by “strawberry”. We suppose that, in an intuitive sense, we have a pretty good idea of what we intend by “strawberry”: there are things that are definitely strawberries, and we’re pretty happy with our sense of that so long as nobody is deliberately trying to fool it.

We just want to communicate a local goal concept that distinguishes edible strawberries from plastic strawberries, or nontoxic strawberries from poisonous ones. That is: we want to say “strawberry” in an understandable way that’s suitable for fulfilling the task “just give Sally a strawberry”, possibly in conjunction with other features like conservatism, low impact, or mild optimization.
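As a toy illustration of what “local” means here, the sketch below factors the agent’s decision into a goal predicate that answers only the strawberry question, with impact and optimization concerns handled by separate components. All names and structures in it are hypothetical stand-ins, not a proposed implementation.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Outcome:
    """A candidate outcome the agent could bring about (hypothetical stand-in for a world-model state)."""
    description: str
    is_edible_strawberry: bool   # what the goal concept is supposed to track
    impact_score: float          # how much else in the world the plan disturbs

# The *local* goal concept: it only answers "is this the thing we meant by 'strawberry'?"
# It says nothing about side effects, mugging people, or human values at large.
def goal_concept(outcome: Outcome) -> bool:
    return outcome.is_edible_strawberry

# Other safety features (here, a crude low-impact check) are composed separately,
# outside the goal concept itself.
def acceptable(outcome: Outcome, max_impact: float = 1.0) -> bool:
    return goal_concept(outcome) and outcome.impact_score <= max_impact

def pick_plan(candidates: List[Outcome]) -> Optional[Outcome]:
    # Mild optimization: settle for the first acceptable outcome rather than the "best" one.
    for outcome in candidates:
        if acceptable(outcome):
            return outcome
    return None

candidates = [
    Outcome("plastic strawberry from the prop shop", False, 0.1),
    Outcome("ordinary strawberry from the kitchen", True, 0.2),
    Outcome("strawberry obtained by mugging a passerby", True, 50.0),
]
print(pick_plan(candidates))  # -> the ordinary strawberry
```

The point of the factoring is only that goal-concept identification is the problem of getting `goal_concept` right, separately from whatever machinery enforces conservatism, low impact, or mild optimization.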

For some open subproblems of the obvious approach that goes through showing actual strawberries to the AI’s webcam, see “Identifying causal goal concepts from sensory data” and “Look where I’m pointing, not at my finger”.
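For concreteness, the naive webcam approach might look like the sketch below: train a classifier on labeled sensory data and treat its output as the goal concept. The data, features, and model here are all hypothetical stand-ins; the point is that what gets learned is a predicate over pixels, “things that produce strawberry-looking images”, which is where those subproblems enter.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical labeled sensory data: feature vectors extracted from webcam frames,
# labeled 1 for "we showed it a strawberry" and 0 otherwise.
rng = np.random.default_rng(0)
real_frames = rng.normal(loc=1.0, size=(100, 32))    # stand-in features for real strawberries
other_frames = rng.normal(loc=-1.0, size=(100, 32))  # stand-in features for everything else
X = np.vstack([real_frames, other_frames])
y = np.array([1] * 100 + [0] * 100)

# The "obvious approach": treat the learned classifier as the goal concept.
sensory_concept = LogisticRegression().fit(X, y)

# But this predicate ranges over *sensory data*, not over strawberries themselves.
# A sufficiently convincing plastic strawberry, or a doctored camera feed, can
# satisfy it without satisfying the intended causal concept -- the gap addressed
# in "Identifying causal goal concepts from sensory data" and
# "Look where I'm pointing, not at my finger".
new_frame = rng.normal(loc=1.0, size=(1, 32))
print(sensory_concept.predict(new_frame))  # classifies the appearance, not the object
```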