Selective similarity metrics for imitation
A human-imitator (trained using Safe training procedures for human-imitators) will try to imitate all aspects of human behavior. Sometimes we care more about how good the imitation is along some axes than others, and it would be inefficient to imitate the human along all axes. Therefore, we might want to design scoring rules for human-imitators that emphasize matching performance along some axes more than others.
Compare with Mimicry and meeting halfway, another proposed way of making human-imitation more efficient.
Here are some ideas for constructing scoring rules:
Suppose that, given a question, the human will write down a number. We ask some predictor to output the parameters of some Gaussian distribution. We train the predictor to output Gaussian distributions that assign high probability to the training data. Then, we sample from this Gaussian distribution to imitate the human. Clearly, this is a way of imitating some aspects of human behavior (mean and variance) but not others.
The general form of this approach is to estimate moments (expectations of some features) of the predictor’s distribution over human behavior, and then sample from some distribution with these moments (such as an exponential family distribution).
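The Gaussian idea can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: there is a single fixed question, so the "predictor" reduces to a maximum-likelihood fit, and the data (`human_answers`) and all function names are hypothetical.

```python
import math
import random

def fit_gaussian(answers):
    """Gaussian parameters that maximize the likelihood of the answers."""
    n = len(answers)
    mean = sum(answers) / n
    var = sum((a - mean) ** 2 for a in answers) / n
    return mean, math.sqrt(var)

def log_score(mean, std, answer):
    """Log density the Gaussian assigns to one observed answer."""
    z = (answer - mean) / std
    return -0.5 * z * z - math.log(std * math.sqrt(2 * math.pi))

def average_log_score(mean, std, answers):
    return sum(log_score(mean, std, a) for a in answers) / len(answers)

def imitate(mean, std, rng):
    """Imitate the human by sampling from the fitted Gaussian."""
    return rng.gauss(mean, std)

# Hypothetical data: the human only ever writes down 0 or 10.
human_answers = [0.0, 10.0] * 50
mean, std = fit_gaussian(human_answers)

rng = random.Random(0)
samples = [imitate(mean, std, rng) for _ in range(10000)]
sample_mean = sum(samples) / len(samples)
```

The fitted Gaussian reproduces the mean and variance of the human's answers, but the samples are spread around the middle of the range, where this hypothetical human never answers. That is the sense in which only some aspects of behavior are imitated.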
A less trivial example is a variant of inverse reinforcement learning. In this variant, to predict a sequence of actions the human takes, the predictor outputs some representation of a reward function on states (such as the parameters of some affine function of features of the state). The human is modeled as a noisy reinforcement learner with this reward function, and the predictor is encouraged to have this model assign high probability to the human’s actual trajectory. To imitate the human, run a noisy reinforcement learner with the predicted reward function. The predictor can be seen as estimating moments of the human’s trajectory (specifically, moments related to frequencies of state transitions between states with different features), and the system samples from a distribution with these same moments in order to imitate the human.
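A minimal sketch of the IRL variant follows. It assumes one particular noise model that the text does not specify: a Boltzmann (softmax) agent computed by finite-horizon soft value iteration. The toy environment (five states on a line), the features, and every name in the code are hypothetical. A full system would train a predictor to output `weights` and `bias`; here two candidate reward functions are simply compared by the log-likelihood they assign to the human's trajectory.

```python
import math
import random

N_STATES, ACTIONS, HORIZON = 5, (-1, +1), 6

def step(state, action):
    """Deterministic move on a line of states, clamped at both ends."""
    return min(max(state + action, 0), N_STATES - 1)

def features(state):
    """Hypothetical features: normalized position, and a left-end flag."""
    return (state / (N_STATES - 1), 1.0 if state == 0 else 0.0)

def reward(weights, bias, state):
    """Affine function of the state features."""
    return bias + sum(w * f for w, f in zip(weights, features(state)))

def noisy_policy(weights, bias):
    """Soft value iteration; returns policy[t][state] = {action: prob}."""
    value = [0.0] * N_STATES
    policy = []
    for _ in range(HORIZON):  # backward in time, last step first
        new_value, layer = [], []
        for s in range(N_STATES):
            q = {a: reward(weights, bias, step(s, a)) + value[step(s, a)]
                 for a in ACTIONS}
            m = max(q.values())
            v = m + math.log(sum(math.exp(x - m) for x in q.values()))
            new_value.append(v)
            layer.append({a: math.exp(q[a] - v) for a in ACTIONS})
        value = new_value
        policy.append(layer)
    policy.reverse()  # policy[0] now belongs to the first time step
    return policy

def log_likelihood(policy, start, actions):
    """Log probability the noisy agent assigns to an action sequence."""
    state, total = start, 0.0
    for t, action in enumerate(actions):
        total += math.log(policy[t][state][action])
        state = step(state, action)
    return total

def sample_trajectory(policy, start, rng):
    """Imitate by running the noisy agent with the given reward."""
    state, actions = start, []
    for t in range(HORIZON):
        probs = policy[t][state]
        # Key -1 is the "move left" action.
        action = -1 if rng.random() < probs[-1] else +1
        actions.append(action)
        state = step(state, action)
    return actions

# Hypothetical human trajectory: always move right, starting from state 0.
human_actions = [+1] * HORIZON
ll_right = log_likelihood(noisy_policy((2.0, 0.0), 0.0), 0, human_actions)
ll_left = log_likelihood(noisy_policy((-2.0, 0.0), 0.0), 0, human_actions)
imitation = sample_trajectory(noisy_policy((2.0, 0.0), 0.0), 0, random.Random(0))
```

The reward that increases to the right assigns the higher log-likelihood (`ll_right > ll_left`), so training would push the predictor toward it. In this model the bias term shifts every action value equally and therefore has no effect on the policy.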
Combining proper scoring rules
The sum of two proper scoring rules is itself a proper scoring rule: expectation is linear, so if reporting the true distribution maximizes the expected score under each rule, it also maximizes the expected value of their sum (or of any nonnegative weighted combination). Therefore, it is possible to combine proper scoring rules to train a human-imitator to do well according to both. For example, we may score a distribution both on how much probability it assigns to human actions and on how well its moments match the moments of human actions, according to some weighting.
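As a small numerical check of this combination, the sketch below adds a log score to one possible moment-matching score: the negative squared error between the reported distribution's mean and the observed outcome, which is proper for the mean. The choice of moment score, the three-outcome setting, the weight, and all names are assumptions made for illustration.

```python
import math

OUTCOMES = (0, 1, 2)

def log_score(q, x):
    """Log probability that the reported distribution q assigns to x."""
    return math.log(q[x])

def mean_score(q, x):
    """Negative squared error between q's mean and the observed outcome."""
    mean_q = sum(o * q[o] for o in OUTCOMES)
    return -(mean_q - x) ** 2

def combined_score(q, x, weight=0.5):
    """Weighted sum of the two proper scoring rules."""
    return log_score(q, x) + weight * mean_score(q, x)

def expected_score(score, q, p):
    """Expected score of report q when outcomes are drawn from p."""
    return sum(p[x] * score(q, x) for x in OUTCOMES)

# Hypothetical true distribution over the three outcomes.
p = (0.2, 0.5, 0.3)

# All reports on a grid of tenths with strictly positive probabilities.
grid = [(i / 10, j / 10, (10 - i - j) / 10)
        for i in range(1, 9) for j in range(1, 10 - i)]

best = max(grid, key=lambda q: expected_score(combined_score, q, p))
```

On this grid the report that maximizes the expected combined score is the true distribution `p`, as properness requires.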
Note that proper scoring rules can be characterized by convex functions.
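One standard way to state this characterization, for distributions over finitely many outcomes, is the following (the symbols $S$, $G$, $p$, $q$, and $\delta_x$ are notation introduced here, not taken from the text above). Under mild regularity conditions, a scoring rule $S$ is proper exactly when it can be written in terms of a convex function $G$ and a subgradient $G'$ of it:

```latex
% q: reported distribution, x: observed outcome,
% \delta_x: point mass on x, G: convex function on the probability simplex.
S(q, x) = G(q) + \langle G'(q),\, \delta_x - q \rangle

% Expected score under the true distribution p; the inequality is convexity of G.
\mathbb{E}_{x \sim p}\, S(q, x) = G(q) + \langle G'(q),\, p - q \rangle \le G(p)

% Example: negative entropy recovers the log score.
G(q) = \sum_i q_i \log q_i \quad\Longrightarrow\quad S(q, x) = \log q_x
```

The inequality holds with equality at $q = p$, so reporting the truth is optimal. Since a sum of convex functions is convex, this also gives another view of why sums of proper scoring rules are proper.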
It is unclear how safe it is to train a human-imitator using a selective similarity metric. To the extent that the AI is not doing some task the way a human would, it is possible that it is acting dangerously. One hopes that, to the extent that the human-imitator is using a bad model to imitate the human (such as a noisy reinforcement learning model), it is not bad in a way that causes problems such as Edge instantiation. It would be good to see if something like IRL-based imitation could behave dangerously in some realistic case.