Selective similarity metrics for imitation

A human-imitator (trained using Safe training procedures for human-imitators) will try to imitate all aspects of human behavior. Sometimes we care more about how good the imitation is along some axes than others, and it would be inefficient to imitate the human along all axes. Therefore, we might want to design scoring rules for human-imitators that emphasize matching performance along some axes more than others.

Compare with Mimicry and meeting halfway, another proposed way of making human-imitation more efficient.

Here are some ideas for constructing scoring rules:

Moment matching

Suppose that, given a question, the human will write down a number. We ask some predictor to output the parameters of some Gaussian distribution. We train the predictor to output Gaussian distributions that assign high probability to the training data. Then, we sample from this Gaussian distribution to imitate the human. Clearly, this is a way of imitating some aspects of human behavior (mean and variance) but not others.
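A minimal sketch of this Gaussian case (the data here is synthetic and purely illustrative): the log score is a proper scoring rule, and maximizing it over Gaussian reports amounts to matching the first two moments of the human's answers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for training data: numbers the human wrote down
# for one question (illustrative values only).
human_answers = rng.normal(loc=40.0, scale=5.0, size=1000)

# The Gaussian that assigns maximum log-probability to the data is the
# one matching the data's mean and standard deviation.
mu_hat = human_answers.mean()
sigma_hat = human_answers.std()

# To "imitate" the human, sample from the fitted Gaussian.  Mean and
# variance are reproduced; any other structure in the human's answers
# (skew, multimodality, context-dependence) is deliberately discarded.
imitated = rng.normal(mu_hat, sigma_hat, size=1000)
```

Note that nothing here forces the imitation to be accurate along the discarded axes; that selectivity is the point.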

The general form of this approach is to estimate moments (expectations of some features) of the predictor's distribution over human behavior, and then sample from some distribution with these same moments (such as an exponential family distribution).
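The general recipe can be sketched as follows (a toy example with made-up numbers, assuming discrete outcomes and features phi(x) = (x, x²)): fit an exponential-family distribution whose feature expectations match the human's, exploiting the fact that the gradient of the log-likelihood is exactly the moment mismatch.

```python
import numpy as np

# Toy outcome space 0..4 with features phi(x) = (x, x^2), so matching
# these moments matches the human's mean and second moment.
outcomes = np.arange(5)
phi = np.stack([outcomes, outcomes**2], axis=1).astype(float)

# "Human" behavior: an arbitrary illustrative distribution over outcomes.
p_human = np.array([0.1, 0.3, 0.3, 0.2, 0.1])
target_moments = p_human @ phi  # feature expectations under the human

# Fit an exponential-family distribution p(x) ∝ exp(theta · phi(x)) whose
# feature expectations match the target, by gradient ascent on the
# log-likelihood (its gradient is target moments minus model moments).
theta = np.zeros(2)
for _ in range(20000):
    logits = phi @ theta
    p = np.exp(logits - logits.max())
    p /= p.sum()
    theta += 0.02 * (target_moments - p @ phi)

logits = phi @ theta
p_model = np.exp(logits - logits.max())
p_model /= p_model.sum()
model_moments = p_model @ phi   # ≈ target_moments
```

The fitted distribution generally differs from the human's distribution everywhere except on the chosen moments; sampling from it imitates only those aspects.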

A less trivial example is a variant of inverse reinforcement learning. In this variant, to predict a sequence of actions the human takes, the predictor outputs some representation of a reward function on states (such as the parameters of some affine function of features of the state). The human is modeled as a noisy reinforcement learner with this reward function, and the predictor is encouraged to have this model assign high probability to the human's actual trajectory. To imitate the human, run a noisy reinforcement learner with the predicted reward function. The predictor can be seen as estimating moments of the human's trajectory (specifically, moments related to frequencies of transitions between states with different features), and the system samples from a distribution with these same moments in order to imitate the human.
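A deliberately simplified one-step version of this scheme (all states, features, and numbers below are illustrative, and full trajectories are collapsed to single choices): the human is modeled as a Boltzmann-rational agent with reward linear in state features, the reward weights are fit by maximizing the likelihood of the human's observed choices, and imitation means running the noisy agent with the predicted reward.

```python
import numpy as np

rng = np.random.default_rng(2)

# Each of 4 actions leads deterministically to a state; each state has a
# feature vector (indicator features here, with state 3 as a baseline).
state_features = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0],
])
true_theta = np.array([1.0, 0.5, -0.5])  # the human's (unobserved) reward weights

def boltzmann_policy(theta):
    # Noisy reinforcement learner: P(action) ∝ exp(reward of resulting state).
    logits = state_features @ theta
    p = np.exp(logits - logits.max())
    return p / p.sum()

# Observed human choices (a one-step stand-in for full trajectories).
human_actions = rng.choice(4, size=2000, p=boltzmann_policy(true_theta))
counts = np.bincount(human_actions, minlength=4) / len(human_actions)
empirical_moments = counts @ state_features

# Fit reward weights by gradient ascent on the log-likelihood of the human's
# choices.  The gradient is exactly a moment mismatch: empirical feature
# expectations minus the model's.
theta = np.zeros(3)
for _ in range(5000):
    theta += 0.5 * (empirical_moments - boltzmann_policy(theta) @ state_features)

# Imitate the human: run the noisy agent with the *predicted* reward.
imitated_actions = rng.choice(4, size=2000, p=boltzmann_policy(theta))
```

Even in this toy case, the imitation matches the human only in the feature expectations the reward model can express, which is the selective-similarity behavior described above.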

Combining proper scoring rules

It is easy to see that the sum of two proper scoring rules is a proper scoring rule. Therefore, it is possible to combine proper scoring rules to train a human-imitator to do well according to both. For example, we may score a distribution both on how much probability it assigns to human actions and on how well its moments match the moments of human actions, according to some weighting.
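As a small numerical check of this (the Gaussian reporting format and the weighting are illustrative), a weighted sum of the log score and a quadratic score on the mean is itself proper: the honest report maximizes the expected combined score.

```python
import numpy as np

# Score a reported Gaussian (mu, sigma) against the human's answer x with a
# weighted sum of two proper scoring rules: the log score, and a quadratic
# score on the mean (proper for the first moment).
def combined_score(mu, sigma, x, w=0.5):
    log_score = -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)
    moment_score = -((x - mu) ** 2)
    return w * log_score + (1 - w) * moment_score

# Expected combined score when the human's answers are N(0, 1), in closed
# form, using E[(x - mu)^2] = 1 + mu^2.
def expected_combined(mu, sigma, w=0.5):
    gap = 1.0 + mu**2
    e_log = -0.5 * np.log(2 * np.pi * sigma**2) - gap / (2 * sigma**2)
    return w * e_log + (1 - w) * (-gap)

# The honest report (mu=0, sigma=1) maximizes the expected combined score.
grid = [(m, s) for m in np.linspace(-1, 1, 9) for s in np.linspace(0.5, 1.5, 9)]
best = max(grid, key=lambda ms: expected_combined(*ms))
```

Adjusting the weight w shifts how strongly the imitator is pushed to match the full distribution versus just its mean.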

Note that proper scoring rules can be characterized by convex functions: each proper scoring rule corresponds to a convex function of the reported distribution, and conversely.
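This characterization can be sketched concretely (a standard construction; the particular distributions below are illustrative): a convex function G of the reported distribution induces the proper scoring rule S(p, x) = G(p) + ∇G(p) · (e_x − p), and taking G to be negative entropy recovers the log score.

```python
import numpy as np

def savage_score(G, gradG, p, i):
    # Proper scoring rule induced by convex G: S(p, x) = G(p) + ∇G(p)·(e_x − p).
    e = np.zeros_like(p)
    e[i] = 1.0
    return G(p) + gradG(p) @ (e - p)

# Negative entropy as the convex function recovers the log score.
G_ent = lambda p: np.sum(p * np.log(p))
gradG_ent = lambda p: np.log(p) + 1.0

p = np.array([0.2, 0.5, 0.3])
score = savage_score(G_ent, gradG_ent, p, 1)   # equals log(0.5)

# Propriety check: under a true distribution q, honest reporting gets at
# least as high an expected score as any other report (here, p).
q = np.array([0.6, 0.3, 0.1])
def expected(pred):
    return sum(q[i] * savage_score(G_ent, gradG_ent, pred, i) for i in range(3))
```

Other convex choices of G yield other proper scoring rules (e.g., the squared norm yields the quadratic/Brier score), which is what makes the weighted combinations above easy to construct.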


It is unclear how safe it is to train a human-imitator using a selective similarity metric. To the extent that the AI is not doing some task the way a human would, it is possible that it is acting dangerously. One hopes that, to the extent that the human-imitator is using a bad model to imitate the human (such as a noisy reinforcement learning model), it is not bad in a way that causes problems such as Edge instantiation. It would be good to see if something like IRL-based imitation could behave dangerously in some realistic case.


  • AI alignment

    The great civilizational problem of creating artificially intelligent computer systems such that running them is a good idea.